Gradient Accumulation
A technique for training large models, or training with large batch sizes, on a GPU with limited memory.
Let’s say you want to use a batch size of 64, but a batch that large doesn’t fit on your GPU.
- First, determine the largest batch size that fits on your GPU. Let’s say it’s 16. It may be better to use batch sizes that are powers of 2.
- Calculate the gradients for \(X\) batches without updating the parameters.
- \(X\) is your desired batch size divided by the batch size you are using.
- Desired batch size is 64; batch size we are using is 16.
- \(64 \div 16 = 4\)
- \(X\) is 4, because four batches of 16 add up to the desired 64 examples.
- Next, sum the gradients from all \(X\) batches; this accumulation of gradients is what gives the technique its name.
- Now update your parameters using these accumulated gradients. This has the same effect as using a batch size of 64 (see the sketch after this list). If your loss is a mean over each batch, divide it by \(X\) before backpropagating so the accumulated gradient matches the mean over the full 64 examples.
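Here is a minimal PyTorch sketch of the loop above. The tiny linear model, dummy data, and SGD optimizer are placeholders rather than anything from the original, and the division of the loss by the number of accumulation steps assumes a mean-reduced loss.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

desired_batch_size = 64
micro_batch_size = 16                                 # largest size that fits on the GPU
accum_steps = desired_batch_size // micro_batch_size  # X = 64 / 16 = 4

model = nn.Linear(10, 2)                              # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Dummy dataset: 256 examples, 10 features, 2 classes.
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=micro_batch_size, shuffle=True)

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader, start=1):
    loss = loss_fn(model(inputs), targets)
    # Scale the mean loss by 1/X so the accumulated gradient equals the mean
    # gradient over the full 64-example batch (skip this if your loss is a sum).
    (loss / accum_steps).backward()                   # gradients add up in .grad
    if step % accum_steps == 0:
        optimizer.step()                              # one update per 64 examples
        optimizer.zero_grad()
```

Calling `backward()` several times without `zero_grad()` in between is what sums the gradients into each parameter’s `.grad` buffer; the parameters are only updated once every \(X\) batches.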
Note
Using a smaller batch size on its own, just to fit a larger model onto your GPU, isn’t optimal. A smaller batch size means you would have to retune your hyperparameters, such as the learning rate. The loss would also become noisier, since each estimate is computed from fewer examples.
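As a concrete illustration of the learning-rate point, one common heuristic (an assumption here, not something the note prescribes) is to scale the learning rate linearly with the batch size:

```python
# Hypothetical illustration: linear scaling of the learning rate when the batch
# shrinks (a common rule of thumb, assumed here, not part of the note above).
base_lr = 0.1            # learning rate tuned for a batch size of 64
desired_batch_size = 64
smaller_batch_size = 16  # what actually fits on the GPU

scaled_lr = base_lr * smaller_batch_size / desired_batch_size
print(scaled_lr)         # 0.025: smaller batches usually call for smaller steps
```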