Content
This post sorts out the optimization methods used in deep learning, from the most basic idea of gradient descent to the most popular optimizer, Adam. Along the way, SGD, SGDM, Adagrad, RMSProp, and some recent advances such as AdamW are introduced. Hopefully this post gives some hints on how to choose among different optimizers when training a neural network.
Reference: Optimization slides, Tutorial.
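As a quick preview of how these choices look in practice, here is a minimal sketch assuming PyTorch as the framework (the post itself does not fix one); each optimizer discussed later corresponds to a one-line change:

```python
import torch
import torch.nn as nn

# A tiny model just to have parameters to optimize (illustrative only).
model = nn.Linear(10, 1)

# Each method covered in this post maps to a one-line choice in torch.optim.
sgd     = torch.optim.SGD(model.parameters(), lr=0.01)                        # vanilla SGD
sgdm    = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)          # SGD with momentum (SGDM)
adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01)                    # Adagrad
rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.001)                   # RMSProp
adam    = torch.optim.Adam(model.parameters(), lr=0.001)                      # Adam
adamw   = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)  # AdamW (decoupled weight decay)
```

The learning rates above are only placeholder defaults; the sections below discuss what actually distinguishes these methods and when each is a reasonable choice.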





