I just want to scribble down some of the things I have learned from my own experience in training deep neural networks. Hope this helps others too.
1. Optimizer: use SGD with momentum. If the momentum is too high, you may see a validation error greater than the training error even when the model is not overfitting. This is because within each epoch the momentum starts at zero and builds up as more batches are processed, so toward the end of the epoch you may get a gradient explosion, which leads to a large validation error. A typical momentum value is 0.9.
2. Gradient clipping: always use gradient clipping to prevent gradient explosion. It saves a lot of time because you don't need to constantly tune the learning rate by hand during training. A typical clipping value is 10 or lower.
3. Learning rate: in theory, as large as possible, provided it is small enough to avoid gradient explosion. In practice, constantly adjusting the learning rate during training is too much work, so simply set it reasonably high and rely on gradient clipping to prevent gradient explosion. A typical value is 0.001 or lower.
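For reference, here is a minimal PyTorch training-step sketch combining tips 1-3 (SGD with momentum 0.9, learning rate 1e-3, gradient-norm clipping at 10); the model, criterion, and inputs are just placeholders:

    import torch
    import torch.nn as nn

    model = nn.Linear(128, 10)                # placeholder model
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

    def train_step(x, y):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        # clip the gradient norm before the update to prevent gradient explosion
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
        optimizer.step()
        return loss.item()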
4. Input normalization: to facilitate training, normalize the input data to have zero-mean and unit standard deviation.
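As a small sketch of what I mean, assuming the training images are already in a tensor of shape (N, C, H, W); the statistics should be computed from the training set only and reused for validation/test data:

    import torch

    x = torch.randn(1000, 3, 32, 32)          # placeholder training images
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    std = x.std(dim=(0, 2, 3), keepdim=True)
    x_norm = (x - mean) / (std + 1e-8)        # per-channel zero mean, unit std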
5. Batch normalization: employ batch normalization layers. They are especially helpful for deep networks.
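For example, a common Conv-BN-ReLU block in PyTorch (the channel sizes are arbitrary):

    import torch.nn as nn

    block = nn.Sequential(
        nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),  # bias is redundant before BN
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
    )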
6. Dropout: although dropout is often unnecessary when batch norm is employed, you can still apply a small dropout (~0.1) across multiple layers. I think this works better than one or two layers with a large dropout (~0.5). If the dataset is small relative to the network and you need an extra measure against overfitting, dropout layers are useful.
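As a sketch, small dropout spread over several layers (the layer sizes here are made up):

    import torch.nn as nn

    mlp = nn.Sequential(
        nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p=0.1),
        nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p=0.1),
        nn.Linear(256, 10),
    )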
7. L2 weight decay: not necessary, but still useful as an option. A typical value of 1e-5 should be fine.
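In PyTorch this is just the weight_decay argument of the optimizer; a minimal sketch with a placeholder model:

    import torch
    import torch.nn as nn

    model = nn.Linear(128, 10)                # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                momentum=0.9, weight_decay=1e-5)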
8. Shortcuts: shortcut connections are extremely useful for deep neural networks. The most popular implementation is probably the residual block, and the full pre-activation variant (He et al., "Identity Mappings in Deep Residual Networks") may be the best choice.
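A minimal sketch of a full pre-activation residual block (BN -> ReLU -> Conv, twice, plus an identity shortcut), assuming equal input/output channels and stride 1:

    import torch.nn as nn

    class PreActBlock(nn.Module):
        # full pre-activation residual block: x + Conv(ReLU(BN(Conv(ReLU(BN(x))))))
        def __init__(self, channels):
            super().__init__()
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = self.conv1(self.relu(self.bn1(x)))
            out = self.conv2(self.relu(self.bn2(out)))
            return x + out                    # identity shortcut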
9. The output size y of a convolution, given input size x, kernel size, padding, and stride, is
y = floor((x - kernel + 2*padding) / stride) + 1
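For example, a 32-wide input with a 3x3 kernel, padding 1, and stride 2 gives floor((32 - 3 + 2)/2) + 1 = 16; a quick sanity check in PyTorch (the shapes are arbitrary):

    import torch
    import torch.nn as nn

    def conv_out_size(x, kernel, padding, stride):
        return (x - kernel + 2 * padding) // stride + 1

    conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, stride=2)
    y = conv(torch.randn(1, 3, 32, 32))
    print(y.shape[-1], conv_out_size(32, kernel=3, padding=1, stride=2))   # 16 16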
10. For 2D convolutions, stacking several small-kernel convolutions is more beneficial than using one large-kernel convolution. For example, 3 convolutions with 3x3 kernels and depth d use 3 * (3*3) * d = 27d parameters, whereas 1 convolution with a 7x7 kernel and the same depth d uses 7*7 * d = 49d parameters. In both cases the receptive field is 7x7, but the former allows 3 activation layers while the latter gives only 1, so the stacked version is usually believed to learn more effectively. However, for 1D convolutions the former actually requires more parameters than the latter (3 * 3 * d = 9d vs 7d).
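A quick way to check the parameter counts (with d input and output channels both counts pick up a factor of d^2, so the 27:49 ratio is unchanged):

    import torch.nn as nn

    d = 64
    stacked = nn.Sequential(*[nn.Conv2d(d, d, 3, padding=1, bias=False) for _ in range(3)])
    single = nn.Conv2d(d, d, 7, padding=3, bias=False)

    count = lambda m: sum(p.numel() for p in m.parameters())
    print(count(stacked), count(single))      # 27*d*d = 110592 vs 49*d*d = 200704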
11. Activation layers: typically ReLU layers are used, but ELU may be a good alternative. It is recommended to clip such unbounded activations, i.e., use y = clamp(x, min=0, max=5) in place of ReLU, to prevent excessively large values.
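One way to get this clipped ReLU in PyTorch, assuming a clip value of 5 (nn.Hardtanh with a minimum of 0 is exactly this clamp; the built-in ReLU6 is the same idea with a cap of 6):

    import torch
    import torch.nn as nn

    clipped_relu = nn.Hardtanh(min_val=0.0, max_val=5.0)   # same as clamp(x, min=0, max=5)

    x = torch.tensor([-2.0, 1.0, 7.0])
    print(clipped_relu(x))                    # tensor([0., 1., 5.])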
12. Training history: it is very important to save the loss and accuracy history during training for both the training and validation data, because the history tells you a lot about how training is going. Usually, at the beginning of training, the validation error is lower than the training error: the training error is a running average over the epoch, accumulated while the model is still improving, whereas the validation error is measured with the model at the end of the epoch. As training goes on, the training error drops below the validation error because the model becomes slightly overfit. That is a good time to either stop training to prevent overfitting or take additional measures. Also, when the improvement flattens out, it is a good indicator to lower the learning rate.
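A minimal bookkeeping sketch; train_one_epoch and evaluate are hypothetical helpers that return the average loss, and model, optimizer, and the data loaders are assumed to exist (see the training-step sketch under tip 3). The scheduler lowers the learning rate when the validation loss flattens:

    import torch

    history = {"train_loss": [], "val_loss": []}
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=5)

    for epoch in range(100):
        history["train_loss"].append(train_one_epoch(model, train_loader))  # hypothetical helper
        history["val_loss"].append(evaluate(model, val_loader))             # hypothetical helper
        scheduler.step(history["val_loss"][-1])   # reduce LR when val loss stops improving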
I will continue to add more as I gain more experience.