What makes a deep learning model such a powerful learning machine? If we think of it as a fancy car, there are two critical components: one is the structure of the neural network (like the car body), which provides extraordinary complexity and representation power; the other is the optimization algorithm (like the car engine), which enables the model to learn from the complicated reality represented in the dataset.
In this article, we will first disassemble a popular engine for deep learning models, the ADAptive Moment estimation (ADAM) optimizer, and show what parts it contains and how they work together. Then we will walk through the implementation of the ADAM optimizer in PyTorch and find the key interfaces where we can test and check whether it is working.
Let's start with the Stochastic Gradient Descent (SGD) algorithm to update the weights in a neural network. Briefly speaking, the algorithm follows the logic below:
For each time step t:
    for each mini-batch p:
        W = W - f(dW, alpha)
where $dW = \partial L / \partial W$ is the derivative of the loss function $L$ w.r.t. the weights $W$, and $\alpha$ is a learning rate. For the ordinary SGD, $$f(dW, \alpha) = \alpha \cdot dW$$ For ADAM, $f(dW, \alpha)$ takes a slightly more complex form and we will walk through it next.
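To make the plain-SGD update concrete, here is a minimal sketch of a single update on one weight tensor (toy tensors and a toy loss, not library code):

import torch

# Minimal sketch of one plain-SGD update for a single weight tensor W.
# The toy loss below is only there to produce a gradient dW = dL/dW.
alpha = 0.01                               # learning rate
W = torch.randn(3, 3, requires_grad=True)
loss = (W ** 2).sum()
loss.backward()                            # fills W.grad with dW

with torch.no_grad():
    W -= alpha * W.grad                    # f(dW, alpha) = alpha * dW
    W.grad.zero_()                         # clear dW before the next mini-batch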
The Exponentially-Weighted Average (EWA) of a time series $\{\theta_t\}_{t=1,2,...}$ is calculated as $$V_t = \beta V_{t-1} + (1-\beta) \theta_t$$ where $\beta$ is the weight, and $1 / (1-\beta)$ gives a rough estimate of how many terms $V_t$ is averaged over. For instance, if $\beta=0.9$, $V_t$ is roughly the average of the last 10 terms, as $1 / (1-0.9) = 10$.
It is worth mentioning that at the first few steps, $V_t$ under-estimates the average of the series. Let's assume $\beta = 0.9$ and see what happens: $$ V_0 = 0,\\ V_1 = 0.9V_0 + 0.1\theta_1 = 0.1\theta_1, \\ V_2 = 0.9V_1 + 0.1\theta_2 = 0.09\theta_1 + 0.1\theta_2, \\ ... $$ So sometimes people introduce a bias correction for this under-estimation: $$V_t = \frac{\beta V_{t-1} + (1-\beta) \theta_t}{1 - \beta^t} $$ Since $\beta < 1$, $\beta^t$ is close to one when $t$ is small but decays to zero as $t$ becomes large, so the denominator $1 - \beta^t$ boosts $V_t$ in the early steps and approaches one (leaving $V_t$ essentially unchanged) later on.
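A quick numerical check (toy values, plain Python) shows how much the correction matters in the first few steps:

# Toy demonstration of the EWA with and without bias correction.
beta = 0.9
theta = [1.0] * 20          # a constant series whose true average is 1.0

V = 0.0
for t, x in enumerate(theta, start=1):
    V = beta * V + (1 - beta) * x
    V_corrected = V / (1 - beta ** t)
    print(t, round(V, 3), round(V_corrected, 3))
# t = 1:  V = 0.100 (badly under-estimated), V_corrected = 1.0
# t = 20: V = 0.878,                         V_corrected = 1.0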
We can replace the derivative term in SGD with an exponentially-weighted average as follows: $$f(dW, \alpha)_t = \alpha \cdot V_t = \alpha (\beta V_{t-1} + (1-\beta) dW_t)$$ If $dW$ changes too rapidly with time, this momentum expression damps out the noise by taking a temporal average.
The SGD with Root-Mean-Squared Propagation (RMSProp) also involves an EWA: $$S_t = \beta S_{t-1} + (1-\beta) dW^2_t$$ And the weights are updated in the following way: $$f(dW, \alpha)_t = \alpha \frac{dW_t}{\sqrt{S_t}}$$
If we put momentum and the RMSprop together, we can get the very popular ADAM optimizer: $$f(dW, \alpha)_t = \alpha \frac{V_t}{\sqrt{S_t} + \epsilon}$$ where $$V_t = \frac{\beta_1 V_{t-1} + (1-\beta_1) dW_t}{1 - \beta_1^t} $$ and $$S_t = \frac{\beta_2 S_{t-1} + (1-\beta_2) dW^2_t}{1 - \beta_2^t}$$ $\epsilon$ is added to avoid a zero denominator.
Here we have at least three hyper-parameters: the learning rate $\alpha$, the weight for the momentum term $\beta_1$, and the weight for the RMSProp term $\beta_2$. In practice, people usually leave $\beta_1=0.9$ and $\beta_2=0.999$ and tune $\alpha$ over a range of values.
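To make these formulas concrete, here is a minimal self-contained sketch of a single ADAM update in NumPy (not the PyTorch implementation). The bias correction is applied to separate hatted copies of $V_t$ and $S_t$ rather than folded into the stored averages, which gives the same update:

import numpy as np

def adam_update(W, dW, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM step for a weight array W with gradient dW.
    V and S are the running momentum / RMSProp averages; t is the 1-based step count."""
    V = beta1 * V + (1 - beta1) * dW                  # momentum term
    S = beta2 * S + (1 - beta2) * dW ** 2             # RMSProp term
    V_hat = V / (1 - beta1 ** t)                      # bias corrections
    S_hat = S / (1 - beta2 ** t)
    W = W - alpha * V_hat / (np.sqrt(S_hat) + eps)    # W = W - f(dW, alpha)
    return W, V, S

# Toy usage: in a real model, W comes from the network and dW from back-propagation.
W = np.zeros((2, 2))
V, S = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 4):
    dW = np.ones_like(W)                              # pretend gradient
    W, V, S = adam_update(W, dW, V, S, t)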
Many common optimizers, including ADAM, have been implemented in the package torch.optim. Using an optimizer takes two steps: first construct an optimizer object with the parameters to be optimized, then call it inside the training loop.
Let's take a look at a piece of sample code (from [1]):
optimizer = optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
Here we construct an SGD optimizer object with a default learning rate of 0.01 and a momentum coefficient of 0.9. The parameters of model.base use these defaults, while the parameters of model.classifier override the default with a smaller learning rate of 0.001.
In the training process, the optimizer updates the weights at every time step, for every mini-batch of the dataset.
for input, target in dataset:
    optimizer.zero_grad()             # Clear the gradient
    output = model(input)             # Evaluate the model output y
    loss = loss_fn(output, target)    # Evaluate the loss function
    loss.backward()                   # Back-propagate the error and calculate dW
    optimizer.step()                  # Update the weights using dW
As we might expect, an optimization algorithm requires some necessary inputs, including the model architecture, the loss function, and the training / testing data. However, if we take a closer look at the constructor of an optimizer class in PyTorch (taking ADAM as an example), the only hook into the model is through the model parameters that are to be optimized.
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999),
                 eps=1e-08, weight_decay=0, amsgrad=False)
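For instance, a minimal (hypothetical) construction passes model.parameters() and nothing else about the model:

# Hypothetical minimal setup: `model` is any torch.nn.Module.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)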
This makes sense because the forward calculation is handled by the model class and the loss function, and the $dW$ from back-propagation is already computed and stored on each model parameter. The following code is from the step() function in torch.optim.Adam. The variables p and p.grad hold the parameters (the weights) and their gradients. Therefore, the optimizer does not work with the model class directly; instead, it takes the model parameters and their gradients as pure tensor values and performs the update in every iteration.
for group in self.param_groups:
    for p in group['params']:
        if p.grad is None:
            continue
        grad = p.grad
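The excerpt stops right after extracting the gradient. As a rough, simplified sketch (not the verbatim library source, which also handles weight decay and the amsgrad option), the remainder of the inner loop applies exactly the update rule derived above, keeping the running averages $V_t$ and $S_t$ in the per-parameter state dictionary:

        # Simplified sketch of the rest of the inner loop (not the actual source):
        state = self.state[p]                        # per-parameter state kept by the base Optimizer
        if len(state) == 0:                          # lazy initialization on the first step
            state['step'] = 0
            state['exp_avg'] = torch.zeros_like(p)       # V_t, the momentum average
            state['exp_avg_sq'] = torch.zeros_like(p)    # S_t, the RMSProp average

        beta1, beta2 = group['betas']
        state['step'] += 1
        state['exp_avg'].mul_(beta1).add_(grad, alpha=1 - beta1)                 # V_t
        state['exp_avg_sq'].mul_(beta2).addcmul_(grad, grad, value=1 - beta2)    # S_t

        step_size = group['lr'] / (1 - beta1 ** state['step'])                   # bias correction for V_t
        denom = (state['exp_avg_sq'] / (1 - beta2 ** state['step'])).sqrt().add_(group['eps'])
        p.data.addcdiv_(state['exp_avg'], denom, value=-step_size)               # W = W - f(dW, alpha)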
A tip for debugging: if we want to check whether an optimizer is working properly, we can verify that (1) the gradients of the model parameters are calculated correctly, and (2) the model parameters are actually updated after each step.
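For example, a rough sanity check along those lines (hypothetical helper code, meant to be dropped into the training loop) could look like:

# (1) After loss.backward(): every trainable parameter should have a finite gradient.
for name, p in model.named_parameters():
    if p.requires_grad:
        assert p.grad is not None, f"no gradient for {name}"
        assert torch.isfinite(p.grad).all(), f"non-finite gradient in {name}"

# (2) After optimizer.step(): the parameters should actually change
#     (unless a gradient happens to be exactly zero).
before = {name: p.detach().clone() for name, p in model.named_parameters()}
optimizer.step()
for name, p in model.named_parameters():
    if p.requires_grad:
        changed = not torch.equal(before[name], p.detach())
        print(name, "updated" if changed else "NOT updated")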
In this post, we dissected the ADAM optimizer for deep learning. We learned how ADAM differs from plain stochastic gradient descent and gained a better understanding of why it is widely used as an effective algorithm for deep learning. In addition, we looked at the implementation details of the ADAM optimizer in PyTorch and learned how to use and test it by checking the key variables in the optimizer class.
[REFERENCES]:
[1] PyTorch documentation on torch.optim: https://pytorch.org/docs/stable/optim.html