Gradient Accumulation
A technique for training large models, or training with large batch sizes, on a GPU with limited memory.
Let’s say you want to use a batch size of 64, but a batch that large doesn’t fit on your GPU.
- First, determine the largest batch size that fits on your GPU. Let’s say it’s 16. It may be better to use batch sizes that are powers of 2.
- Calculate the gradients for \(X\) batches without updating the parameters.
- \(X\) is your desired batch size divided by the batch size you are using.
- Desired batch size is 64; batch size we are using is 16.
- \(64 \div 16 = 4\)
- \(X\) is 4, because four batches of 16 add up to the desired 64 examples.
- Next, sum the gradients from all \(X\) batches; this accumulation of gradients is what gives the technique its name.
- Now update your parameters using these accumulated gradients. This has the same effect as using a batch size of 64 (see the sketch after this list). If your loss is a mean over each batch, divide it by \(X\) before backpropagating so the accumulated gradient matches the mean over the full 64 examples.
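Here is a minimal PyTorch sketch of the loop above. The tiny linear model, dummy data, and SGD optimizer are placeholders rather than anything from the original, and the division of the loss by the number of accumulation steps assumes a mean-reduced loss.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

desired_batch_size = 64
micro_batch_size = 16                                 # largest size that fits on the GPU
accum_steps = desired_batch_size // micro_batch_size  # X = 64 / 16 = 4

model = nn.Linear(10, 2)                              # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Dummy dataset: 256 examples, 10 features, 2 classes.
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=micro_batch_size, shuffle=True)

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader, start=1):
    loss = loss_fn(model(inputs), targets)
    # Scale the mean loss by 1/X so the accumulated gradient equals the mean
    # gradient over the full 64-example batch (skip this if your loss is a sum).
    (loss / accum_steps).backward()                   # gradients add up in .grad
    if step % accum_steps == 0:
        optimizer.step()                              # one update per 64 examples
        optimizer.zero_grad()
```

Calling `backward()` several times without `zero_grad()` in between is what sums the gradients into each parameter’s `.grad` buffer; the parameters are only updated once every \(X\) batches.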
Note
Using a smaller batch size on its own, just to fit a larger model onto your GPU, isn’t optimal. A smaller batch size means you would have to retune your hyperparameters, such as the learning rate. The loss would also become noisier, since each estimate is computed from fewer examples.
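As a concrete illustration of the learning-rate point, one common heuristic (an assumption here, not something the note prescribes) is to scale the learning rate linearly with the batch size:

```python
# Hypothetical illustration: linear scaling of the learning rate when the batch
# shrinks (a common rule of thumb, assumed here, not part of the note above).
base_lr = 0.1            # learning rate tuned for a batch size of 64
desired_batch_size = 64
smaller_batch_size = 16  # what actually fits on the GPU

scaled_lr = base_lr * smaller_batch_size / desired_batch_size
print(scaled_lr)         # 0.025: smaller batches usually call for smaller steps
```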