This notebook follows the fastai style guide.
In this notebook, I will implement a neural network from scratch, and iteratively reimplement with PyTorch. That is, I will implement each element of the training and inference process from scratch, before then using the corresponding element in PyTorch. This notebook assumes a prior understanding of the flow and pieces of a neural network.
To recap, the complete training loop of a neural network looks like this.
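In rough outline (a minimal, self-contained sketch with a stand-in model and made-up data, not the code we will build below):
import torch
from torch import nn
import torch.nn.functional as F
# Stand-in model, optimizer, loss function, and random data, just to make the loop concrete.
model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
xb, yb = torch.randn(64, 10), torch.randint(0, 2, (64,))
for epoch in range(3):
    preds = model(xb)               # forward pass
    loss = F.cross_entropy(preds, yb)  # compute the loss
    loss.backward()                 # backward pass: populate .grad on each parameter
    opt.step()                      # update the weights
    opt.zero_grad()                 # reset the gradients for the next iteration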
This notebook also serves to show the modular nature of PyTorch.
Let’s get started with some data.
The goal of our model will be to classify digits from the MNIST dataset.
from pathlib import Path
MNIST_URL = 'https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/data/mnist.pkl.gz?raw=true'
d_path = Path('data')
d_path.mkdir(exist_ok=True)
d_path = d_path/'mnist.pkl.gz'
from urllib.request import urlretrieve
if not d_path.exists(): urlretrieve(MNIST_URL, d_path)
! ls -l data
total 33312
-rw-r--r-- 1 salmannaqvi staff 17051982 May 12 12:37 mnist.pkl.gz
import gzip, pickle
from torch import tensor
with gzip.open(d_path, 'rb') as f: ((trn_x, trn_y), (vld_x, vld_y), _) = pickle.load(f, encoding='latin-1')
trn_x, trn_y, vld_x, vld_y = map(tensor, [trn_x[:1000], trn_y[:1000], vld_x[:1000], vld_y[:1000]])
A neuron comprises a set of weights, a linear function, and an activation function.
Our dataset contains one thousand 28x28 pixel samples. Therefore, each sample has 28x28=784 inputs. Since we will be classifying digits, there will be 10 outputs: a probability for each digit.
n, m = trn_x.shape
c = trn_y.max() + 1
n, m, c
(1000, 784, tensor(10))
Let’s have 50 neurons comprise the hidden layer.
nh = 50
From these dimensions, we can create our appropriate weights…
import torch; torch.set_printoptions(precision=2, linewidth=140, sci_mode=False)
w1, b1 = torch.randn(m, nh), torch.zeros(nh)
w2, b2 = torch.randn(nh, 1), torch.zeros(1)
w1.shape, b1.shape, w2.shape, b2.shape
(torch.Size([784, 50]), torch.Size([50]), torch.Size([50, 1]), torch.Size([1]))
…and create our linear model!
def lin(x, w, b): return x @ w + b
t = lin(vld_x, w1, b1); t.shape
torch.Size([1000, 50])
vld_x.shape, w1.shape
(torch.Size([1000, 784]), torch.Size([784, 50]))
from fastcore.all import *
import torch.nn.functional as F
test_eq(lin(vld_x, w1, b1), F.linear(vld_x, w1.T, b1))
Our implementation produces the same outputs as PyTorch’s implementation.
We now need to implement the activation function, which will be the ReLU (rectified linear unit). Any value less than 0 gets clipped to 0. There are multiple ways we can approach doing this, such as using torch.max.
?torch.max
Docstring:
max(input) -> Tensor
Returns the maximum value of all elements in the ``input`` tensor.
.. warning::
This function produces deterministic (sub)gradients unlike ``max(dim=0)``
Args:
input (Tensor): the input tensor.
Example::
>>> a = torch.randn(1, 3)
>>> a
tensor([[ 0.6763, 0.7445, -2.2369]])
>>> torch.max(a)
tensor(0.7445)
.. function:: max(input, dim, keepdim=False, *, out=None) -> (Tensor, LongTensor)
:noindex:
Returns a namedtuple ``(values, indices)`` where ``values`` is the maximum
value of each row of the :attr:`input` tensor in the given dimension
:attr:`dim`. And ``indices`` is the index location of each maximum value found
(argmax).
If ``keepdim`` is ``True``, the output tensors are of the same size
as ``input`` except in the dimension ``dim`` where they are of size 1.
Otherwise, ``dim`` is squeezed (see :func:`torch.squeeze`), resulting
in the output tensors having 1 fewer dimension than ``input``.
.. note:: If there are multiple maximal values in a reduced row then
the indices of the first maximal value are returned.
Args:
input (Tensor): the input tensor.
dim (int): the dimension to reduce.
keepdim (bool): whether the output tensor has :attr:`dim` retained or not. Default: ``False``.
Keyword args:
out (tuple, optional): the result tuple of two output tensors (max, max_indices)
Example::
>>> a = torch.randn(4, 4)
>>> a
tensor([[-1.2360, -0.2942, -0.1222, 0.8475],
[ 1.1949, -1.1127, -2.2379, -0.6702],
[ 1.5717, -0.9207, 0.1297, -1.8768],
[-0.6172, 1.0036, -0.6060, -0.2432]])
>>> torch.max(a, 1)
torch.return_types.max(values=tensor([0.8475, 1.1949, 1.5717, 1.0036]), indices=tensor([3, 0, 0, 1]))
.. function:: max(input, other, *, out=None) -> Tensor
:noindex:
See :func:`torch.maximum`.
Type: builtin_function_or_method
torch.max(tensor([-5, 2, 3, -4]), tensor([0]))
tensor([0, 2, 3, 0])
def relu(x): return torch.max(x, tensor([0]))
Another way is to use torch.clamp_min, which is more idiomatic for this case.
def relu(x): return x.clamp_min(0.)
t = lin(vld_x, w1, b1)
test_eq(relu(t), F.relu(t))
A single neuron can now be constructed.
def model(xb):
l1 = relu(lin(xb, w1, b1))
return lin(l1, w2, b2)
res = model(vld_x); res.shape
torch.Size([1000, 1])
With the forward pass being implemented, it is time to determine the loss. Even though we have a multi-class classification problem at hand, I will use mean squared error for simplicity. Later in this post, I will switch to cross entropy loss.
The Mean Squared Error (MSE) between two vectors can be represented as
$$\mathrm{MSE}(\hat{y}, y) = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$
where $\hat{y}$ and $y$ are vectors of length $n$, and $\hat{y}_i$ and $y_i$ represent the $i$-th elements of the vectors.
MSE in its most basic form, for a single prediction and target, looks like this: $(\hat{y} - y)^2$.
If we have multiple data points, then it looks like this: $\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$.
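For example, with predictions $\hat{y} = (3, 1)$ and targets $y = (1, 2)$, the squared errors are $(3-1)^2 = 4$ and $(1-2)^2 = 1$, giving an MSE of $(4 + 1)/2 = 2.5$.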
The tensor holding the predictions and the tensor holding the targets have different shapes. Therefore, there are different ways in which both can be subtracted from each other.
res.shape, vld_y.shape
(torch.Size([1000, 1]), torch.Size([1000]))
(vld_y - res).shape
torch.Size([1000, 1000])
(vld_y[:, None] - res).shape
torch.Size([1000, 1])
res[:, 0].shape, res.squeeze().shape
(torch.Size([1000]), torch.Size([1000]))
(vld_y - res[:, 0]).shape
torch.Size([1000])
However, it will be better to add a column to vld_y rather than remove a column from res, so as to keep the shape of all tensors consistent (i.e., all tensors having both rows and columns, as opposed to some having rows and columns and others having only a single dimension).
((vld_y[:, None] - res)**2).sum() / res.shape[0]
tensor(717.17)
def mse(preds, targs): return (targs[:, None] - preds).pow(2).mean()
preds = model(trn_x); mse(preds, trn_y)
tensor(648.87)
test_eq(mse(preds, trn_y), F.mse_loss(preds, trn_y[:, None]))
Now comes the backward pass; the pass responsible for computing the gradients of our model’s weights.
For brevity, I will not explain why I compute the gradients the way I do; the computations come from working out the derivatives of the forward pass by hand. If you would like to explore how I did so, you can refer to my other blog post, Backpropagation Explained using English Words.
In short, the derivatives compute to be the following.
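(My reconstruction of those derivatives; the notation matches the variables used in the code below, with $z_1 = x w_1 + b_1$, $l_1 = \operatorname{relu}(z_1)$, and $l_2 = l_1 w_2 + b_2$ the model output.)
$$L = \frac{1}{n}\sum_i (y_i - l_{2,i})^2, \qquad \frac{\partial L}{\partial l_2} = -\frac{2}{n}\,(y - l_2)$$
$$\frac{\partial L}{\partial w_2} = l_1^\top \frac{\partial L}{\partial l_2}, \qquad \frac{\partial L}{\partial b_2} = \sum_i \frac{\partial L}{\partial l_{2,i}}, \qquad \frac{\partial L}{\partial l_1} = \frac{\partial L}{\partial l_2}\, w_2^\top$$
$$\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial l_1} \odot \mathbb{1}[z_1 > 0], \qquad \frac{\partial L}{\partial w_1} = x^\top \frac{\partial L}{\partial z_1}, \qquad \frac{\partial L}{\partial b_1} = \sum_i \frac{\partial L}{\partial z_{1,i}}, \qquad \frac{\partial L}{\partial x} = \frac{\partial L}{\partial z_1}\, w_1^\top$$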
When implementing backpropagation, it is better to implement the entire equation in pieces, by storing the result of each intermediate gradient. These intermediate gradients can then be reused to calculate the gradients of another variable.
Let’s prepare the pieces we’ll need and get started.
l1 = relu(lin(trn_x, w1, b1))
l2 = lin(l1, w2, b2)
loss = mse(l2, trn_y); loss
tensor(648.87)
w1 Gradients
This is the maths to compute the gradients for w1, as also shown above. Here, you can see the individual pieces I will compute to implement this equation.
diff = trn_y[:, None] - l2; diff.shape
torch.Size([1000, 1])
loss.g = (2/n) * diff; loss, loss.g.shape
(tensor(648.87), torch.Size([1000, 1]))
diff.g = -1 * loss.g; diff[:5], diff.shape
(tensor([[-15.34],
[-33.46],
[-35.26],
[ -6.92],
[-21.55]]),
torch.Size([1000, 1]))
(w2.shape, diff.g.shape), (w2.T.shape, diff.g[:, None].shape)
((torch.Size([50, 1]), torch.Size([1000, 1])),
(torch.Size([1, 50]), torch.Size([1000, 1, 1])))
(diff.g @ w2.T).shape
torch.Size([1000, 50])
l2.g = diff.g @ w2.T; l2.g.shape
torch.Size([1000, 50])
(l1 > 0).float()
tensor([[0., 1., 1., ..., 0., 0., 0.],
[0., 0., 1., ..., 1., 0., 0.],
[1., 1., 1., ..., 0., 0., 1.],
...,
[0., 0., 0., ..., 0., 1., 0.],
[1., 1., 0., ..., 0., 0., 1.],
[0., 0., 1., ..., 0., 0., 0.]])
l1.g = l2.g * (l1 > 0).float(); l1.g.shape
torch.Size([1000, 50])
(l1.g.shape, trn_x.shape), (l1.g[:, None, :].shape, trn_x[..., None].shape)
((torch.Size([1000, 50]), torch.Size([1000, 784])),
(torch.Size([1000, 1, 50]), torch.Size([1000, 784, 1])))
w1.g = tensor([1, 2])
w1.g = (l1.g[:, None, :] * trn_x[..., None]).sum(0); w1.g.shape
torch.Size([784, 50])
(w1.shape, w1.g.shape), (w1.g.min(), w1.g.max())
((torch.Size([784, 50]), torch.Size([784, 50])),
(tensor(-17.50), tensor(25.09)))
Let’s verify our derivation is correct by comparing it to the gradients computed by PyTorch.
w1_ = w1.clone().requires_grad_();
l1 = relu(lin(trn_x, w1_, b1))
l2 = lin(l1, w2, b2)
loss = mse(l2, trn_y)
loss.backward()
w1_.grad
tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
(w1.g.min(), w1.g.max()), (w1_.grad.min(), w1_.grad.max())
((tensor(-17.50), tensor(25.09)), (tensor(-17.50), tensor(25.09)))
test_close(w1.g, w1_.grad, eps=0.01)
It is!
b1 Gradients
As previously mentioned, I can reuse the computed gradients to calculate the gradients for b1. For now though, I will show the entire implementation for easy reference; later, when we encapsulate the backward pass, I will reuse the already computed gradients.
diff = trn_y[:, None] - l2
loss.g = (2/n) * diff
diff.g = loss.g * -1
l2.g = diff.g @ w2.T
l1.g = l2.g * (l1 > 0).float()
l1.g.shape, b1.shape
(torch.Size([1000, 50]), torch.Size([50]))
b1.g = (l1.g * 1).sum(0); b1.g.shape
torch.Size([50])
b1.min(), b1.max()
(tensor(0.), tensor(0.))
trn_x Gradients
diff = trn_y[:, None] - l2
loss.g = (2/n) * diff
diff.g = loss.g * -1
l2.g = diff.g @ w2.T
l1.g = l2.g * (l1 > 0).float()
l1.g.shape, w1.shape
(torch.Size([1000, 50]), torch.Size([784, 50]))
trn_x.g = l1.g @ w1.T
trn_x.g.min(), trn_x.g.max()
(tensor(-2.85, grad_fn=<MinBackward1>), tensor(2.85, grad_fn=<MaxBackward1>))
w2 Gradients
diff = trn_y[:, None] - l2
loss.g = (2/n) * diff
diff.g = loss.g * -1
diff.g.shape, l1.shape
(torch.Size([1000, 1]), torch.Size([1000, 50]))
(diff.g * l1).sum(0, keepdim=True).T.shape
torch.Size([50, 1])
(diff.g[:, None, :] * l1[..., None]).sum(0).shape
torch.Size([50, 1])
w2.g = (diff.g[:, None, :] * l1[..., None]).sum(0); w2.g.shape
torch.Size([50, 1])
w2.g.min(), w2.g.max()
(tensor(8.37, grad_fn=<MinBackward1>), tensor(388.44, grad_fn=<MaxBackward1>))
b2 Gradients
diff = trn_y[:, None] - l2
loss.g = (2/n) * diff
diff.g = loss.g * -1
b2.g = (diff.g * 1).sum(0)
b2.g.shape, b2.shape
(torch.Size([1]), torch.Size([1]))
Let’s verify our remaining gradients.
w1_, b1_, w2_, b2_, trn_x_ = [lambda w: w.clone.requires_grad_() for w in [w1, b1, w2, b2, trn_x]]
The expression above does not work to create copies: rather than returning cloned tensors that require gradients, it returns the lambda objects themselves.
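To see the problem in isolation (a throwaway snippet, not part of the notebook's pipeline):
# Each element of the list is the lambda itself; nothing has been cloned or called.
fs = [lambda w: w.clone().requires_grad_() for w in [w1, b1]]
print(type(fs[0]))              # <class 'function'>
print(fs[0](w1).requires_grad)  # calling it explicitly does work: True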
w1_, b1_, w2_, b2_, trn_x_ = map(lambda w: w.clone().requires_grad_(), [w1, b1, w2, b2, trn_x])
w1_
tensor([[-2.81, -1.72, -0.97, ..., -0.29, -1.62, -0.45],
[-1.77, -0.17, 1.32, ..., -0.92, 0.76, 2.77],
[ 0.58, 2.13, -0.98, ..., 0.41, 1.50, 0.86],
...,
[-0.50, -1.90, -0.10, ..., -1.61, 0.78, -0.09],
[ 0.89, 0.50, 1.21, ..., 0.93, -0.37, -0.85],
[ 0.57, -0.50, -1.47, ..., 0.72, 1.64, -0.85]], requires_grad=True)
l1 = relu(lin(trn_x_, w1_, b1_))
l2 = lin(l1, w2_, b2_)
loss = mse(l2, trn_y)
loss.backward()
for a, b in zip((w1, b1, w2, b2, trn_x), (w1_, b1_, w2_, b2_, trn_x_)): test_close(a.g, b.grad, eps=1e-2)
All comparisons passed!
Now that we have the forward and backward passes sorted, let us cohesively bring them together.
def forward(inps, targs):
l1 = relu(lin(inps, w1, b1))
l2 = lin(l1, w2, b2)
loss = mse(l2, targs)
return l1, l2, loss
def backward(inps, targs, l1, l2, loss):
diff = targs[:, None] - l2
loss.g = (2 / n) * diff
diff.g = loss.g * -1
w2.g = (diff.g[:, None, :] * l1[..., None]).sum(0)
b2.g = (diff.g * 1).sum(0)
l2.g = diff.g @ w2.T
l1.g = l2.g * (l1 > 0).float()
w1.g = (l1.g[:, None, :] * trn_x[..., None]).sum(0)
b1.g = (l1.g * 1).sum(0)
inps.g = l1.g @ w1.T
l1, l2, loss = forward(trn_x, trn_y)
backward(trn_x, trn_y, l1, l2, loss)
def comp_grads(*ws):
for a, b in zip(ws, (w1_, b1_, w2_, b2_, trn_x_)): test_close(a.g, b.grad, eps=1e-2)
comp_grads(w1, b1, w2, b2, trn_x)
The backward function can be further refactored by factoring out the gradient computations that the linear layers have in common.
def backward(inps, targs, l1, l2, loss):
diff = targs[:, None] - l2
loss.g = (2/n) * diff
diff.g = loss.g * -1
lin_grad(l1, diff, w2, b2)
l2.g = diff.g @ w2.T
l1.g = l2.g * (l1 > 0).float()
lin_grad(inps, l1, w1, b1)
def lin_grad(inp, out, w, b):
inp.g = out.g @ w.T
w.g = (out.g[:, None, :] * inp[..., None]).sum(0)
b.g = (out.g * 1).sum(0)
Previous implementation.
def backward(inps, targs, l1, l2, loss):
diff = targs[:, None] - l2
loss.g = (2 / n) * diff
diff.g = loss.g * -1
w2.g = (diff.g[:, None, :] * l1[..., None]).sum(0)
b2.g = (diff.g * 1).sum(0)
l2.g = diff.g @ w2.T
l1.g = l2.g * (l1 > 0).float()
w1.g = (l1.g[:, None, :] * trn_x[..., None]).sum(0)
b1.g = (l1.g * 1).sum(0)
inps.g = l1.g @ w1.T
backward(trn_x, trn_y, *forward(trn_x, trn_y))
comp_grads(w1, b1, w2, b2, trn_x)
Currently, we have functions that each separately handle a part of the network. For instance, mse only computes its respective portion of the forward pass: the mean squared error. backward is a separate function that handles the backward pass for all pieces of the network.
Let us change how this works, so that each piece of the network also handles its respective backward pass. This means mse will be able to compute both its forward pass and its backward pass.
class MSE:
def __call__(self, inp, targs):
self.inp,self.targs = inp,targs
self.out = (inp[:, 0] - targs).pow(2).mean()
return self.out
def backward(self): self.inp.g = (2 / self.inp.shape[0]) * (self.inp[:, 0] - self.targs)[..., None]
test_eq(MSE()(preds, trn_y), mse(preds, trn_y))
class Lin:
def __init__(self, w, b): self.w,self.b = w,b
def __call__(self, inp):
self.inp = inp
self.out = self.inp @ self.w + self.b
return self.out
def backward(self):
self.inp.g = self.out.g @ self.w.T
self.w.g = (self.out.g[:, None, :] * self.inp[..., None]).sum(0)
self.b.g = self.out.g.sum(0)
test_eq(Lin(w1, b1)(trn_x), lin(trn_x, w1, b1))
class ReLU:
def __call__(self, inp):
self.inp = inp
self.out = self.inp.clamp_min(0.)
return self.out
def backward(self): self.inp.g = self.out.g * (self.inp > 0).float()
test_eq(ReLU()(l1), relu(l1))
class Model:
def __init__(self, w1, b1, w2, b2):
self.layers = [Lin(w1, b1), ReLU(), Lin(w2, b2)]
self.loss = MSE()
def __call__(self, inp, targs):
for l in self.layers: inp = l(inp)
return self.loss(inp, targs)
def backward(self):
self.loss.backward()
for l in self.layers[::-1]: l.backward()
model = Model(w1, b1, w2, b2)
l = model(trn_x, trn_y)
model.backward()
comp_grads(w1, b1, w2, b2, trn_x)
The classes we have created have common functionality, meaning there is still room for further refactoring. In particular, all the classes store the forward pass arguments as attributes if needed, have a __call__ dunder method that executes the forward pass, and a backward method for the backward pass.
class Module():
def __call__(self, *args):
self.args = args
self.out = self.forward(*args)
return self.out
def forward(self): raise Exception('Forward pass not implemented')
def backward(self): self.bwd(self.out, *self.args)
def bwd(self): raise Exception('Backward pass not implemented.')
class MSE(Module):
def forward(self, inp, targs): return (inp[:, 0] - targs).pow(2).mean()
def bwd(self, out, inp, targs): inp.g = (2 / inp.shape[0]) * (inp[:, 0] - targs)[..., None]
test_eq(MSE()(preds, trn_y), mse(preds, trn_y))
class Lin(Module):
def __init__(self, w, b): self.w,self.b = w,b
def forward(self, inp): return inp @ self.w + self.b
def bwd(self, out, inp):
inp.g = out.g @ self.w.T
self.w.g = (out.g[:, None, :] * inp[..., None]).sum(0)
self.b.g = out.g.sum(0)
test_eq(Lin(w1, b1)(trn_x), lin(trn_x, w1, b1))
class ReLU(Module):
def forward(self, inp): return inp.clamp_min(0.)
def bwd(self, out, inp): inp.g = out.g * (inp > 0).float()
test_eq(ReLU()(l1), relu(l1))
model = Model(w1, b1, w2, b2)
loss = model(trn_x, trn_y)
model.backward()
comp_grads(w1, b1, w2, b2)
And with that, this is the basic underlying paradigm in which PyTorch implements its components.
So let us now directly use PyTorch’s nn.Module to handle our components. There is the added benefit that PyTorch keeps track of the gradients for us through autograd, so we do not need to implement the backward pass ourselves.
nn.Module
w1.shape, n, m, c, b1.shape
(torch.Size([784, 50]), 1000, 784, tensor(10), torch.Size([50]))
from torch import nn
class Linear(nn.Module):
def __init__(self, n_inps, n_outs):
super().__init__()
self.w = torch.randn(n_inps, n_outs).requires_grad_()
self.b = torch.randn(n_outs).requires_grad_()
def forward(self, inp): return inp @ self.w + self.b
F = nn.functional
class Model(nn.Module):
def __init__(self, n_inp, nh, n_out):
super().__init__()
self.layers = [Linear(n_inp, nh), nn.ReLU(), Linear(nh, n_out)]
def __call__(self, inp, targ):
for l in self.layers: inp = l(inp)
return F.mse_loss(inp, targ[:, None])
model = Model(m, nh, 1)
loss = model(trn_x, trn_y.float())
loss.backward()
model.layers
[Linear(), ReLU(), Linear()]
l0 = model.layers[0]; l0.b.grad
tensor([ 42.11, -25.91, 0.15, 15.73, -16.16, 41.61, 13.73, 81.32, -8.91, 55.30, -14.12, -82.24, 12.02, -27.58, -9.48, -90.85,
-25.55, 34.89, -0.68, -14.24, 4.73, 49.70, -27.02, 19.55, 10.14, 38.86, 30.55, 74.17, 2.15, -2.62, -37.11, 14.04,
-12.12, 0.89, -0.99, -6.29, -1.15, 12.26, -9.73, -4.13, -1.53, 1.67, 1.34, -9.78, 20.50, 7.30, 62.45, 5.94,
-3.28, -18.14])
Let’s now implement a much more appropriate loss function for our multi-class problem: cross entropy loss.
Here is the Model again, but without the loss function.
class Model(nn.Module):
def __init__(self, n_inps, nh, n_outs):
super().__init__()
self.layers = [nn.Linear(n_inps, nh), nn.ReLU(), nn.Linear(nh, n_outs)]
def __call__(self, x):
for l in self.layers: x = l(x)
return x
model = Model(m, nh, c)
preds = model(trn_x); preds.shape
torch.Size([1000, 10])
As I have defined here, cross entropy loss simply involves taking the logarithm of the softmax function, multiplying the results with the one-hot encoded targets, summing, and negating the result.
Softmax, a multi-class generalization of the sigmoid function, involves taking the exponent of each prediction, and dividing each resulting value by the sum of the exponentiated predictions.
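In symbols (my rendering of the two definitions just described, with $y$ the one-hot encoded target vector):
$$\operatorname{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}, \qquad \operatorname{CE}(x, y) = -\sum_i y_i \log\big(\operatorname{softmax}(x)_i\big)$$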
Let’s begin by first taking the logarithm of the softmax function.
def log_softmax(x): return ((x.exp() / x.exp().sum(-1, keepdim=True))).log()
log_softmax(preds)
tensor([[-2.40, -2.33, -2.25, ..., -2.33, -2.40, -2.34],
[-2.37, -2.44, -2.21, ..., -2.30, -2.34, -2.28],
[-2.37, -2.45, -2.16, ..., -2.24, -2.40, -2.40],
...,
[-2.36, -2.45, -2.20, ..., -2.24, -2.39, -2.37],
[-2.34, -2.41, -2.28, ..., -2.20, -2.53, -2.25],
[-2.43, -2.37, -2.21, ..., -2.26, -2.40, -2.37]], grad_fn=<LogBackward0>)
F.log_softmax(preds, dim=-1)
tensor([[-2.40, -2.33, -2.25, ..., -2.33, -2.40, -2.34],
[-2.37, -2.44, -2.21, ..., -2.30, -2.34, -2.28],
[-2.37, -2.45, -2.16, ..., -2.24, -2.40, -2.40],
...,
[-2.36, -2.45, -2.20, ..., -2.24, -2.39, -2.37],
[-2.34, -2.41, -2.28, ..., -2.20, -2.53, -2.25],
[-2.43, -2.37, -2.21, ..., -2.26, -2.40, -2.37]], grad_fn=<LogSoftmaxBackward0>)
test_close(log_softmax(preds).detach(), F.log_softmax(preds, dim=-1).detach())
Our implementation involves division. According to the rule $\log\left(\frac{a}{b}\right) = \log(a) - \log(b)$, we can simplify our computation by subtracting the logarithm of the denominator from the logarithm of the numerator instead.
def log_softmax(x): return x.exp().log() - x.exp().sum(-1, keepdim=True).log()
log_softmax(preds)
tensor([[-2.40, -2.33, -2.25, ..., -2.33, -2.40, -2.34],
[-2.37, -2.44, -2.21, ..., -2.30, -2.34, -2.28],
[-2.37, -2.45, -2.16, ..., -2.24, -2.40, -2.40],
...,
[-2.36, -2.45, -2.20, ..., -2.24, -2.39, -2.37],
[-2.34, -2.41, -2.28, ..., -2.20, -2.53, -2.25],
[-2.43, -2.37, -2.21, ..., -2.26, -2.40, -2.37]], grad_fn=<SubBackward0>)
Our implementation has an issue though: it is numerically unstable. Anything involving exponents is inherently unstable; give it a large enough value and the result overflows to infinity relatively quickly.
for x in range(0, 101, 10): print(f'e^{x}={torch.exp(tensor(x))}')
e^0=1.0
e^10=22026.46484375
e^20=485165184.0
e^30=10686474223616.0
e^40=2.353852703404196e+17
e^50=5.184705457665547e+21
e^60=1.1420073962419164e+26
e^70=2.515438700355918e+30
e^80=5.540622484676759e+34
e^90=inf
e^100=inf
Fortunately, there is a trick to overcome this, known as the LogSumExp simplification:
$$\log \sum_i e^{x_i} = m + \log \sum_i e^{x_i - m}$$
where $m$ is the largest element in $x$. It works because $e^{x_i} = e^{m}\,e^{x_i - m}$, so the constant $e^{m}$ can be factored out of the sum and pulled outside the log, leaving only exponents of values that are at most 0.
To begin, we need to get the largest value in each sample.
max = preds.max(-1)[0]; max.shape, preds.shape
(torch.Size([1000]), torch.Size([1000, 10]))
Then we can simply implement the rest of the algorithm.
(preds - max[..., None]).shape
torch.Size([1000, 10])
# Output hidden to prevent endless scrolling.
max[..., None] + (preds - max[..., None]).exp().sum(-1, keepdim=True).log()
test_close(torch.exp(preds).sum(-1, keepdim=True).log(), max[..., None] + (preds - max[..., None]).exp().sum(-1, keepdim=True).log())
def logsumexp(x):
    max = x.max(-1)[0]
    return max[..., None] + (x - max[..., None]).exp().sum(-1, keepdim=True).log()
logsumexp(preds).shape
torch.Size([1000, 1])
test_close(logsumexp(preds), preds.logsumexp(-1)[..., None])
Let’s compare how much quicker our new implementation is compared to the previous one.
%timeit log_softmax(preds)
337 µs ± 75.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
def log_softmax(x): return x - logsumexp(x)
log_softmax(preds).shape
torch.Size([1000, 10])
%timeit log_softmax(preds)
190 µs ± 56 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Much faster!
All that is left now is to multiply our softmax predictions with the one hot encoded targets, and sum the resulting vector. However, due to the nature of our targets, we can employ a nifty trick that removes the need to create a tensor of one hot encoded targets: integer array indexing.
t = tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]); t
tensor([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
A fancy name for a simple concept, integer array indexing allows one to access elements in a tensor by simply specifying lists of indices.
t[[0, 1, 2], [0, 1, 2]]
tensor([1, 5, 9])
It is best to think of the tensor as a grid of coordinates, with the first coordinate representing the row, and the second coordinate representing the column. Elements 1, 5, and 9 are at (0, 0), (1, 1), and (2, 2).
1, 6, and 8 are at (0, 0), (1, 2), and (2, 1).
t[[0, 1, 2], [0, 2, 1]]
tensor([1, 6, 8])
3 and 8 are at (0, 2) and (2, 1).
t[[0, 2], [2, 1]]
tensor([3, 8])
Our targets consist of the integers from 0 to 9. Each row, or sample, in our predictions tensor represents a set of probabilities for each target.
This means we can directly access the prediction for the correct target through integer array indexing.
trn_y[:3]
tensor([5, 0, 4])
The targets for the first three samples are 5, 0, and 4. Instead of manually specifying the targets when obtaining the predictions for the first three samples…
sm_preds = log_softmax(preds); sm_preds.shape
torch.Size([1000, 10])
sm_preds[0, 5], sm_preds[1, 0], sm_preds[2, 4]
(tensor(-2.27, grad_fn=<SelectBackward0>),
tensor(-2.37, grad_fn=<SelectBackward0>),
tensor(-2.26, grad_fn=<SelectBackward0>))
…we can use the targets themselves to directly obtain our predictions.
sm_preds[[0, 1, 2], trn_y[:3]]
tensor([-2.27, -2.37, -2.26], grad_fn=<IndexBackward0>)
And now, our implementation can be completed.
def nll(preds, targs): return -preds[range(targs.shape[0]), targs].mean()
loss = nll(sm_preds, trn_y); loss
tensor(2.30, grad_fn=<NegBackward0>)
test_close(F.nll_loss(F.log_softmax(preds, -1), trn_y), loss, 1e-3)
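As a quick sanity check (my addition, not in the original notebook), the one-hot formulation described earlier gives the same value as the indexing trick:
one_hot = torch.zeros(trn_y.shape[0], 10)
one_hot[range(trn_y.shape[0]), trn_y] = 1.    # a 1 in the column of each correct digit
test_close((-(sm_preds * one_hot).sum(-1).mean()).detach(), loss.detach(), 1e-5)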
The difference between F.cross_entropy and F.nll_loss is that the former expects the input to be the raw model outputs, whereas the latter expects the input to already be logarithmic probabilities. It can be said that F.nll_loss computes cross entropy loss by starting at an intermediary step.
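A one-line check of that claim (my addition):
test_close(F.cross_entropy(preds, trn_y).detach(), F.nll_loss(F.log_softmax(preds, -1), trn_y).detach(), 1e-5)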
Okay, now we have all the components of the machine that is the neural network: the data, the model, the loss function, and a way to compute gradients.
It is time to get the machine up and running as a whole. It’s time to get the training loop looping.
loss_func = F.cross_entropy
bs = 50
xb = trn_x[0:bs]
preds = model(xb); preds[0], preds.shape
(tensor([-0.08, -0.01, 0.08, 0.11, -0.02, 0.06, 0.13, -0.00, -0.08, -0.01], grad_fn=<SelectBackward0>),
torch.Size([50, 10]))
yb = trn_y[:bs]; yb
tensor([5, 0, 4, 1, 9, 2, 1, 3, 1, 4, 3, 5, 3, 6, 1, 7, 2, 8, 6, 9, 4, 0, 9, 1, 1, 2, 4, 3, 2, 7, 3, 8, 6, 9, 0, 5, 6, 0, 7, 6, 1, 8, 7, 9,
3, 9, 8, 5, 9, 3])
loss_func(preds, yb)
tensor(2.30, grad_fn=<NllLossBackward0>)
We’ll use accuracy as our metric.
preds.argmax(-1)
tensor([6, 2, 2, 2, 5, 2, 5, 2, 5, 2, 2, 2, 3, 2, 5, 5, 2, 2, 2, 5, 6, 3, 5, 2, 5, 2, 2, 3, 3, 2, 2, 2, 5, 2, 2, 2, 2, 2, 2, 2, 5, 2, 5, 5,
2, 2, 2, 2, 5, 5])
(preds.argmax(-1) == yb).sum()
tensor(5)
def accuracy(preds, yb): return ((preds.argmax(-1) == yb).sum()) / yb.shape[0]
accuracy(preds, yb)
tensor(0.10)
test_close(accuracy(preds, yb), (preds.argmax(-1) == yb).float().mean())
def report(loss, preds, yb): print(f'Loss: {loss:.2f}; Accuracy: {accuracy(preds, yb):.2f}')
lr, epochs = .5, 3
xb, yb = trn_x[:bs], trn_y[:bs]
preds = model(xb)
report(loss_func(preds, yb), preds, yb)
Loss: 2.30; Accuracy: 0.10
The training loop can now be assembled.
for epoch in range(epochs):
for i in range(0, n, bs):
s = slice(i, min(n, bs+i))
xb, yb = trn_x[s], trn_y[s]
preds = model(xb)
loss = loss_func(preds, yb)
loss.backward()
with torch.no_grad():
for l in model.layers:
if hasattr(l, 'weight'):
l.weight -= l.weight.grad * lr
l.bias -= l.bias.grad * lr
l.weight.grad.zero_()
l.bias .grad.zero_()
report(loss, preds, yb)
Loss: 1.01; Accuracy: 0.66
Loss: 0.45; Accuracy: 0.88
Loss: 0.37; Accuracy: 0.82
Let’s take a closer look at how we slice: s = slice(i, min(n, bs+i)). We have to use min to prevent the slices from going out of bounds.
?slice
Init signature: slice(self, /, *args, **kwargs)
Docstring:
slice(stop)
slice(start, stop[, step])
Create a slice object. This is used for extended slicing (e.g. a[0:10:2]).
Type: type
Subclasses:
for i in range(0, n, bs): print(slice(i, min(n, bs+i)))
slice(0, 50, None)
slice(50, 100, None)
slice(100, 150, None)
slice(150, 200, None)
slice(200, 250, None)
slice(250, 300, None)
slice(300, 350, None)
slice(350, 400, None)
slice(400, 450, None)
slice(450, 500, None)
slice(500, 550, None)
slice(550, 600, None)
slice(600, 650, None)
slice(650, 700, None)
slice(700, 750, None)
slice(750, 800, None)
slice(800, 850, None)
slice(850, 900, None)
slice(900, 950, None)
slice(950, 1000, None)
Simply adding bs to n in the end parameter for range will not work.
for i in range(0, n+bs, bs): print(slice(i, bs+i))
slice(0, 50, None)
slice(50, 100, None)
slice(100, 150, None)
slice(150, 200, None)
slice(200, 250, None)
slice(250, 300, None)
slice(300, 350, None)
slice(350, 400, None)
slice(400, 450, None)
slice(450, 500, None)
slice(500, 550, None)
slice(550, 600, None)
slice(600, 650, None)
slice(650, 700, None)
slice(700, 750, None)
slice(750, 800, None)
slice(800, 850, None)
slice(850, 900, None)
slice(900, 950, None)
slice(950, 1000, None)
slice(1000, 1050, None)
Currently, we update our weights by checking whether a layer in our network has a weight attribute.
for epoch in range(epochs):
for i in range(0, n, bs):
s = slice(i, min(n, bs+i))
xb, yb = trn_x[s], trn_y[s]
preds = model(xb)
loss = loss_func(preds, yb)
loss.backward()
with torch.no_grad():
for l in model.layers:
if hasattr(l, 'weight'):
l.weight -= l.weight.grad * lr
l.bias -= l.bias.grad * lr
l.weight.grad.zero_()
l.bias .grad.zero_()
report(loss, preds, yb)
PyTorch actually keeps track of which layers have weights. Let us explore.
Here, PyTorch knows that our model has a linear layer with 3 inputs and 4 outputs.
m1 = nn.Module()
m1.foo = nn.Linear(3, 4); m1
Module(
(foo): Linear(in_features=3, out_features=4, bias=True)
)
list(m1.named_children())
[('foo', Linear(in_features=3, out_features=4, bias=True))]
In a similar manner, we can access the layer’s parameters.
list(m1.foo.parameters())
[Parameter containing:
tensor([[-0.37, 0.20, -0.39],
[-0.47, 0.00, 0.18],
[ 0.51, -0.35, 0.36],
[ 0.12, 0.10, -0.03]], requires_grad=True),
Parameter containing:
tensor([ 0.31, -0.42, 0.35, 0.16], requires_grad=True)]
However, this approach will require us to loop through all layers to access all parameters. PyTorch instead provides a way to directly return the parameters of all layers.
list(m1.parameters())
[Parameter containing:
tensor([[-0.37, 0.20, -0.39],
[-0.47, 0.00, 0.18],
[ 0.51, -0.35, 0.36],
[ 0.12, 0.10, -0.03]], requires_grad=True),
Parameter containing:
tensor([ 0.31, -0.42, 0.35, 0.16], requires_grad=True)]
class MLP(nn.Module):
def __init__(self, n_inps, nh, n_outs):
super().__init__()
self.l1 = nn.Linear(n_inps, nh)
self.l2 = nn.Linear(nh, n_outs)
self.relu = nn.ReLU()
def forward(self, x): return self.l2(self.relu(self.l1(x)))
n, m, nh, c
(1000, 784, 50, tensor(10))
model = MLP(m, nh, c); model.l1
Linear(in_features=784, out_features=50, bias=True)
model
MLP(
(l1): Linear(in_features=784, out_features=50, bias=True)
(l2): Linear(in_features=50, out_features=10, bias=True)
(relu): ReLU()
)
for name, l in model.named_children(): print(f'{name}: {l}')
l1: Linear(in_features=784, out_features=50, bias=True)
l2: Linear(in_features=50, out_features=10, bias=True)
relu: ReLU()
for p in model.parameters(): print(p.shape)
torch.Size([50, 784])
torch.Size([50])
torch.Size([10, 50])
torch.Size([10])
Since we can directly access the parameters, we do not need to check whether a certain parameter exists.
def fit():
for epoch in range(epochs):
for i in range(0, n, bs):
s = slice(i, min(n, bs+i))
xb, yb = trn_x[s], trn_y[s]
preds = model(xb)
loss = loss_func(preds, yb)
loss.backward()
with torch.no_grad():
for p in model.parameters(): p -= p.grad * lr
model.zero_grad()
report(loss, preds, yb)
zero_grad() can also be called directly on the model itself.
fit()
Loss: 0.84; Accuracy: 0.74
Loss: 0.45; Accuracy: 0.88
Loss: 0.37; Accuracy: 0.84
Let us implement this functionality ourselves, where the model itself knows what its layers and parameters are.
To do so, we will need to define the __setattr__ dunder method, so that any submodules defined are registered with the model.
class MyModule:
def __init__(self, n_inps, nh, n_outs):
self._modules = {}
self.l1 = nn.Linear(n_inps, nh)
self.l2 = nn.Linear(nh, n_outs)
def __setattr__(self, k, v):
if not k.startswith('_'): self._modules[k] = v
super().__setattr__(k, v)
def __repr__(self): return f'{self._modules}'
def parameters(self):
for l in self._modules.values(): yield from l.parameters()
Note that class MyModule is actually class MyModule(object).
mdl = MyModule(m, nh, c); mdl, model
({'l1': Linear(in_features=784, out_features=50, bias=True), 'l2': Linear(in_features=50, out_features=10, bias=True)},
MLP(
(l1): Linear(in_features=784, out_features=50, bias=True)
(l2): Linear(in_features=50, out_features=10, bias=True)
(relu): ReLU()
))
for p in mdl.parameters(): print(p.shape)
torch.Size([50, 784])
torch.Size([50])
torch.Size([10, 50])
torch.Size([10])
To use our original approach, where a list of layers is specified, we can use the add_module method provided by PyTorch.
?nn.Module.add_module
Signature:
nn.Module.add_module(
self,
name: str,
module: Optional[ForwardRef('Module')],
) -> None
Docstring:
Adds a child module to the current module.
The module can be accessed as an attribute using the given name.
Args:
name (str): name of the child module. The child module can be
accessed from this module using the given name
module (Module): child module to be added to the module.
File: ~/mambaforge/envs/default/lib/python3.10/site-packages/torch/nn/modules/module.py
Type: function
layers = [nn.Linear(m, nh), nn.ReLU(), nn.Linear(nh, c)]
from functools import reduce
class Model(nn.Module):
def __init__(self, layers):
super().__init__()
self.layers = layers
for i, l in enumerate(self.layers): self.add_module(f'layer_{i}', l)
def forward(self, x): return reduce(lambda val, layer: layer(val), self.layers, x)
reduce uses the output of the function as input to the same function in the next iteration.
?reduce
Docstring:
reduce(function, iterable[, initial]) -> value
Apply a function of two arguments cumulatively to the items of a sequence
or iterable, from left to right, so as to reduce the iterable to a single
value. For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates
((((1+2)+3)+4)+5). If initial is present, it is placed before the items
of the iterable in the calculation, and serves as a default when the
iterable is empty.
Type: builtin_function_or_method
reduce(lambda x,y: x+y, [1, 2, 3, 4, 5])
15
model = Model(layers); model
Model(
(layer_0): Linear(in_features=784, out_features=50, bias=True)
(layer_1): ReLU()
(layer_2): Linear(in_features=50, out_features=10, bias=True)
)
model(xb).shape
torch.Size([50, 10])
Alternatively, nn.ModuleList can do the registration for us.
?nn.ModuleList
Init signature:
nn.ModuleList(
modules: Optional[Iterable[torch.nn.modules.module.Module]] = None,
) -> None
Docstring:
Holds submodules in a list.
:class:`~torch.nn.ModuleList` can be indexed like a regular Python list, but
modules it contains are properly registered, and will be visible by all
:class:`~torch.nn.Module` methods.
Args:
modules (iterable, optional): an iterable of modules to add
Example::
class MyModule(nn.Module):
def __init__(self):
super().__init__()
self.linears = nn.ModuleList([nn.Linear(10, 10) for i in range(10)])
def forward(self, x):
# ModuleList can act as an iterable, or be indexed using ints
for i, l in enumerate(self.linears):
x = self.linears[i // 2](x) + l(x)
return x
Init docstring: Initializes internal Module state, shared by both nn.Module and ScriptModule.
File: ~/mambaforge/envs/default/lib/python3.10/site-packages/torch/nn/modules/container.py
Type: type
Subclasses: ParametrizationList
class SequentialModel(nn.Module):
def __init__(self, layers):
super().__init__()
self.layers = nn.ModuleList(layers)
def forward(self, x): return reduce(lambda x, layer: layer(x), self.layers, x)
model = SequentialModel(layers); model
SequentialModel(
(layers): ModuleList(
(0): Linear(in_features=784, out_features=50, bias=True)
(1): ReLU()
(2): Linear(in_features=50, out_features=10, bias=True)
)
)
fit()
Loss: 0.93; Accuracy: 0.78
Loss: 0.52; Accuracy: 0.86
Loss: 0.38; Accuracy: 0.86
model = nn.Sequential(nn.Linear(m, nh), nn.ReLU(), nn.Linear(nh, c)); model
Sequential(
(0): Linear(in_features=784, out_features=50, bias=True)
(1): ReLU()
(2): Linear(in_features=50, out_features=10, bias=True)
)
fit()
Loss: 0.88; Accuracy: 0.74
Loss: 0.48; Accuracy: 0.86
Loss: 0.39; Accuracy: 0.88
The optimizer is simply the name given to the algorithm that updates the weights.
class Optimizer:
def __init__(self, params, lr=0.5): self.params,self.lr = list(params), lr
def step(self):
with torch.no_grad():
for p in self.params: p -= p.grad * self.lr
def zero_grad(self):
for p in self.params: p.grad.data.zero_()
model = nn.Sequential(nn.Linear(m, nh), nn.ReLU(), nn.Linear(nh, c))
opt = Optimizer(model.parameters())
The weight update step can now be cleaned up by using opt.step() and opt.zero_grad() instead.
def fit():
for epoch in range(epochs):
for i in range(0, n, bs):
s = slice(i, min(n, i+bs))
xb, yb = trn_x[s], trn_y[s]
preds = model(xb)
loss = loss_func(preds, yb)
loss.backward()
with torch.no_grad():
for p in model.parameters(): p -= p.grad * lr
model.zero_grad()
report(loss, preds, yb)
def fit():
for epoch in range(epochs):
for i in range(0, n, bs):
s = slice(i, min(n, i+bs))
xb, yb = trn_x[s], trn_y[s]
preds = model(xb)
loss = loss_func(preds, yb)
loss.backward()
opt.step()
opt.zero_grad()
report(loss, preds, yb)
fit()
Loss: 0.89; Accuracy: 0.74
Loss: 0.51; Accuracy: 0.88
Loss: 0.41; Accuracy: 0.86
from torch import optim
def get_model():
model = nn.Sequential(nn.Linear(m, nh), nn.ReLU(), nn.Linear(nh, c))
return model, optim.SGD(model.parameters(), lr=lr)
model, opt = get_model()
loss_func(model(xb), yb)
tensor(2.32, grad_fn=<NllLossBackward0>)
fit()
Loss: 0.82; Accuracy: 0.78
Loss: 0.42; Accuracy: 0.90
Loss: 0.35; Accuracy: 0.86
I sometimes get confuzzled between the terms dataset and dataloader, with regard to what each component actually does. The best way to think about these terms is that a dataset simply stores data in a massive warehouse, while a dataloader takes data from the dataset and tosses them into crates known as batches.
As it currently is, we iterate through our dataset by obtaining a slice object, and then slicing out some data to form a batch.
for i in range(0, n, bs):
s = slice(i, min(n, bs+i))
xb, yb = trn_x[s], trn_y[s]
We will now simplify how we approach this logic.
The first point of simplification is to create a single dataset that will return both a sample and its associated target, from a single index. This will prevent us from having to index into two separate tensors.
class Dataset():
def __init__(self, x, y): self.x, self.y = x, y
def __len__(self): return len(self.x)
def __getitem__(self, i): return self.x[i], self.y[i]
trn_ds, vld_ds = Dataset(trn_x, trn_y), Dataset(vld_x, vld_y)
assert len(trn_ds) == len(trn_x)
assert len(vld_ds) == len(vld_x)
xb, yb = trn_ds[0:5]
assert xb.shape == (5, 28*28)
assert yb.shape == (5,)
xb, yb
(tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]),
tensor([5, 0, 4, 1, 9]))
model, opt = get_model()
for epoch in range(epochs):
for i in range(0, n, bs):
xb, yb = trn_ds[i:min(n, bs+i)]
preds = model(xb)
loss = loss_func(preds, yb)
loss.backward()
opt.step()
opt.zero_grad()
report(loss, preds, yb)
Loss: 1.19; Accuracy: 0.70
Loss: 0.50; Accuracy: 0.88
Loss: 0.34; Accuracy: 0.88
Let us now abstract away how the data from our datasets is loaded, by putting the logic that fetches data from the dataset…
for i in range(0, n, bs):
xb, yb = trn_ds[i:min(n,i+bs)]
...
…into a class that we can call a dataloader.
for xb, yb in train_dl:
...
class DataLoader():
def __init__(self, ds, bs): self.ds,self.bs = ds,bs
def __iter__(self):
for i in range(0, len(self.ds), self.bs): yield self.ds[i:min(len(self.ds), i+self.bs)]
trn_dl, vld_dl = DataLoader(trn_ds, bs), DataLoader(vld_ds, bs)
next(iter(vld_dl))
(tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]),
tensor([3, 8, 6, 9, 6, 4, 5, 3, 8, 4, 5, 2, 3, 8, 4, 8, 1, 5, 0, 5, 9, 7, 4, 1, 0, 3, 0, 6, 2, 9, 9, 4, 1, 3, 6, 8, 0, 7, 7, 6, 8, 9, 0, 3,
8, 3, 7, 7, 8, 4]))
xb, yb = next(iter(vld_dl)); xb.shape
torch.Size([50, 784])
import matplotlib.pyplot as plt
plt.imshow(xb[0].view(28, 28)); yb[0]
tensor(3)
def fit():
for epoch in range(epochs):
for xb, yb in trn_dl:
preds = model(xb)
loss = loss_func(preds, yb)
loss.backward()
opt.step()
opt.zero_grad()
report(loss, preds, yb)
model, opt = get_model()
fit()
Loss: 0.79; Accuracy: 0.82
Loss: 0.49; Accuracy: 0.84
Loss: 0.30; Accuracy: 0.88
And just like that, we have abstracted our loading logic from three lines…
for i in range(0, n, bs):
s = slice(i, min(n, bs+i))
xb, yb = trn_x[s], trn_y[s]
...
…to a much more readable single line.
for xb, yb in trn_dl:
...
Sampling is the method by which the dataloader selects indices from the dataset to load. Sampling from the training set should be random (due to the nature of our data), but not for the validation set.
Therefore, we will need to create an additional class for our dataloader: a component that tells the dataloader which indices to load from the dataset.
import random
?random.shuffle
Signature: random.shuffle(x, random=None)
Docstring:
Shuffle list x in place, and return None.
Optional argument random is a 0-argument function returning a
random float in [0.0, 1.0); if it is the default None, the
standard random.random will be used.
File: ~/mambaforge/envs/default/lib/python3.10/random.py
Type: method
class Sampler():
def __init__(self, ds, shuffle=False): self.n,self.shuffle = len(ds),shuffle
def __iter__(self):
res = list(range(self.n))
if self.shuffle: random.shuffle(res)
return iter(res)
ss = Sampler(trn_ds); ss
<__main__.Sampler at 0x150dddd80>
try: print(next(ss))
except: pass
This does not work because __iter__ is not being called. __iter__ only gets called when we wrap the class with iter().
try: print(next(iter(ss)))
except: pass
0
it = iter(ss); it
<list_iterator at 0x150996fe0>
for o in range(5): print(next(it))
0
1
2
3
4
The Sampler currently returns a single index in each iteration. We need to change that so that a number of indices (equal to our batch size) is returned in each iteration. We can do this through a fancy slicing function known as islice.
from itertools import islice
?islice
Init signature: islice(self, /, *args, **kwargs)
Docstring:
islice(iterable, stop) --> islice object
islice(iterable, start, stop[, step]) --> islice object
Return an iterator whose next() method returns selected values from an
iterable. If start is specified, will skip all preceding elements;
otherwise, start defaults to zero. Step defaults to one. If
specified as another value, step determines how many values are
skipped between successive calls. Works like a slice() on a list
but returns an iterator.
Type: type
Subclasses:
iter returns a single element from an iterable at a time. islice is an iterator that returns a slice of elements from an iterable at a time. It is an, erm, iterative slice.
list(islice(ss, 5))
[0, 1, 2, 3, 4]
Let’s define an additional class that takes a sampler, and assembles its output into batches.
class BatchSampler:
def __init__(self, sampler, bs, drop_last=False): store_attr()
def __iter__(self): yield from chunked(iter(self.sampler), self.bs, drop_last=self.drop_last)
fastcore’s chunked function has much the same functionality as islice, but with some extra quality-of-life features. These include being able to specify how many chunks, or slices, we want back (rather than the number of elements in a chunk), as well as being able to specify whether we would like to drop, or keep, chunks that are smaller than our specified chunk size. This latter option is what we will use; it will abstract away the min check we use in our DataLoader (self.ds[i:min(len(self.ds), i+self.bs)]).
??chunked
Signature: chunked(it, chunk_sz=None, drop_last=False, n_chunks=None)
Source:
def chunked(it, chunk_sz=None, drop_last=False, n_chunks=None):
"Return batches from iterator `it` of size `chunk_sz` (or return `n_chunks` total)"
assert bool(chunk_sz) ^ bool(n_chunks)
if n_chunks: chunk_sz = max(math.ceil(len(it)/n_chunks), 1)
if not isinstance(it, Iterator): it = iter(it)
while True:
res = list(itertools.islice(it, chunk_sz))
if res and (len(res)==chunk_sz or not drop_last): yield res
if len(res)<chunk_sz: return
File: ~/mambaforge/envs/default/lib/python3.10/site-packages/fastcore/basics.py
Type: function
list(islice(ss, 5))
[0, 1, 2, 3, 4]
list(chunked(ss, 5))[:5]
[[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]]
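For instance (a throwaway example), the drop_last flag controls what happens to a final chunk that is smaller than the chunk size:
list(chunked(range(10), 4))                   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]] -- short chunk kept
list(chunked(range(10), 4, drop_last=True))   # [[0, 1, 2, 3], [4, 5, 6, 7]] -- short chunk dropped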
batches = BatchSampler(ss, 4)
list(islice(batches, 5))
[[0, 1, 2, 3],
[4, 5, 6, 7],
[8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19]]
There is one last piece of the puzzle left. Each sample in our Dataset also stores its associated target. We need to split these apart when dataloading. In other words, we need to split the data and target in each sample into their own batches; into an x batch and a y batch.
def collate(b):
xs, ys = zip(*b)
return torch.stack(xs), torch.stack(ys)
class DataLoader():
def __init__(self, ds, batches, collate_fn=collate): store_attr()
def __iter__(self): yield from (self.collate_fn(self.ds[i] for i in b) for b in self.batches)
Let’s break down the latter line and explore what it does, piece by piece.
trn_samp = BatchSampler(Sampler(trn_ds, shuffle=True), bs)
vld_samp = BatchSampler(Sampler(vld_ds, shuffle=False), bs)
for b in self.batches: we loop through each batch.
b = next(iter(vld_samp)); b[:10]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
self.ds[i] for i in b: using the indices in each batch, we access the respective samples in the dataset.
p = [vld_ds[i] for i in b]; len(p)
50
As can be seen below, p also stores the target.
p[0]
(tensor([0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.18, 0.62, 0.76, 0.80, 0.28, 0.34, 0.05, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.93, 0.99, 0.99, 0.99,
0.99, 0.99, 0.89, 0.33, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.05, 0.77, 0.69, 0.50, 0.69, 0.81, 0.92, 0.96, 0.87, 0.09, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.08, 0.54, 0.99, 0.37, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.30, 0.99,
0.56, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.07, 0.78, 0.99, 0.66, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.18, 0.85, 0.99, 0.84, 0.11, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.37, 0.88, 0.99, 0.96, 0.25, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.05, 0.50, 0.98, 0.99, 0.92,
0.16, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.67, 0.99, 0.99, 0.66, 0.23, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.81, 0.99, 0.99, 0.25, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.54, 0.99, 0.99, 0.98, 0.57, 0.10, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.04, 0.68, 0.88,
0.99, 0.99, 0.90, 0.28, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.03, 0.05, 0.99, 0.99, 0.99, 0.96, 0.41, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.18, 0.74, 0.99, 0.99, 0.88, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.04, 0.00, 0.00, 0.00, 0.00, 0.00, 0.07, 0.68, 0.99,
0.99, 0.10, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.14, 0.90, 0.61, 0.44,
0.34, 0.73, 0.75, 0.85, 0.99, 0.99, 0.86, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.47, 1.00, 0.99, 0.99, 0.99, 0.99, 1.00, 0.99, 0.99, 0.95, 0.26, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.54, 1.00, 0.99, 0.99, 0.99, 0.99, 1.00, 0.67, 0.18, 0.09, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.02, 0.28, 0.64, 0.74, 0.68, 0.68, 0.26, 0.02,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,
0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00]),
tensor(3))
Then we simply run the collate function.
xs, ys = zip(*p); ys
(tensor(3),
tensor(8),
tensor(6),
tensor(9),
tensor(6),
tensor(4),
tensor(5),
tensor(3),
tensor(8),
tensor(4),
tensor(5),
tensor(2),
tensor(3),
tensor(8),
tensor(4),
tensor(8),
tensor(1),
tensor(5),
tensor(0),
tensor(5),
tensor(9),
tensor(7),
tensor(4),
tensor(1),
tensor(0),
tensor(3),
tensor(0),
tensor(6),
tensor(2),
tensor(9),
tensor(9),
tensor(4),
tensor(1),
tensor(3),
tensor(6),
tensor(8),
tensor(0),
tensor(7),
tensor(7),
tensor(6),
tensor(8),
tensor(9),
tensor(0),
tensor(3),
tensor(8),
tensor(3),
tensor(7),
tensor(7),
tensor(8),
tensor(4))
And there we have our collated x
and y
batches!
torch.stack(ys)
tensor([3, 8, 6, 9, 6, 4, 5, 3, 8, 4, 5, 2, 3, 8, 4, 8, 1, 5, 0, 5, 9, 7, 4, 1, 0, 3, 0, 6, 2, 9, 9, 4, 1, 3, 6, 8, 0, 7, 7, 6, 8, 9, 0, 3,
8, 3, 7, 7, 8, 4])
trn_samp = BatchSampler(Sampler(trn_ds, shuffle=True), bs)
vld_samp = BatchSampler(Sampler(vld_ds, shuffle=False), bs)
trn_dl = DataLoader(trn_ds, batches=trn_samp)
vld_dl = DataLoader(vld_ds, batches=vld_samp)
xb, yb = next(iter(vld_dl))
plt.imshow(xb[0].view(28, 28))
yb[0]
tensor(3)
xb.shape, yb.shape
(torch.Size([50, 784]), torch.Size([50]))
model, opt = get_model()
fit()
Loss: 1.03; Accuracy: 0.74
Loss: 0.46; Accuracy: 0.82
Loss: 0.30; Accuracy: 0.90
We do not need to update the fit() function, as its logic remains the same despite our changes to the dataloader.
??fit
Signature: fit()
Docstring: <no docstring>
Source:
def fit():
for epoch in range(epochs):
for xb, yb in trn_dl:
preds = model(xb)
loss = loss_func(preds, yb)
loss.backward()
opt.step()
opt.zero_grad()
report(loss, preds, yb)
File: /var/folders/fy/vg316qk1001227svr6d4d8l40000gn/T/ipykernel_52843/769712355.py
Type: function
We can speed up how quickly data is loaded by using multiple CPU cores.
%%timeit
it = iter(trn_dl)
227 ns ± 1.45 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
import torch.multiprocessing as mp
class DataLoader():
def __init__(self, ds, batches, n_workers=1, collate_fun=collate): store_attr()
def __iter__(self):
with mp.Pool(self.n_workers) as ex: yield from ex.map(self.ds.__getitem__, iter(self.batches))
trn_dl = DataLoader(trn_ds, batches=trn_samp, n_workers=4)
%%timeit
it = iter(trn_dl)
197 ns ± 0.557 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Let’s break down how exactly our __iter__ method works.
We slice batches by specifying a list of indices.
trn_ds[[3, 6, 8, 1]]
(tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]),
tensor([1, 1, 1, 0]))
Behind the scenes, the square bracket notation calls the __getitem__ dunder method.
??trn_ds.__getitem__
Signature: trn_ds.__getitem__(i)
Docstring: <no docstring>
Source: def __getitem__(self, i): return self.x[i], self.y[i]
File: /var/folders/fy/vg316qk1001227svr6d4d8l40000gn/T/ipykernel_52843/694427655.py
Type: method
In fact, we can index directly using __getitem__.
trn_ds.__getitem__([3, 6, 8, 1])
(tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]),
tensor([1, 1, 1, 0]))
Therefore, we can divide our batches into smaller sets and take advantage of the __getitem__ dunder method to let each CPU core handle a separate set of items.
len(list(map(trn_ds.__getitem__, ([3, 6], [8, 1]))))
2
for o in map(trn_ds.__getitem__, ([3, 6], [8, 1])): print(o)
(tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]), tensor([1, 1]))
(tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]), tensor([1, 0]))
from torch.utils.data import DataLoader, SequentialSampler, RandomSampler, BatchSampler
?BatchSampler
Init signature:
BatchSampler(
sampler: Union[torch.utils.data.sampler.Sampler[int], Iterable[int]],
batch_size: int,
drop_last: bool,
) -> None
Docstring:
Wraps another sampler to yield a mini-batch of indices.
Args:
sampler (Sampler or Iterable): Base sampler. Can be any iterable object
batch_size (int): Size of mini-batch.
drop_last (bool): If ``True``, the sampler will drop the last batch if
its size would be less than ``batch_size``
Example:
>>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
>>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]
File: ~/mambaforge/envs/default/lib/python3.10/site-packages/torch/utils/data/sampler.py
Type: type
Subclasses:
PyTorch provides a wrapper which assembles the indices, sampled by our desired sampler, into batches.
?RandomSampler
Init signature:
RandomSampler(
data_source: Sized,
replacement: bool = False,
num_samples: Optional[int] = None,
generator=None,
) -> None
Docstring:
Samples elements randomly. If without replacement, then sample from a shuffled dataset.
If with replacement, then user can specify :attr:`num_samples` to draw.
Args:
data_source (Dataset): dataset to sample from
replacement (bool): samples are drawn on-demand with replacement if ``True``, default=``False``
num_samples (int): number of samples to draw, default=`len(dataset)`.
generator (Generator): Generator used in sampling.
File: ~/mambaforge/envs/default/lib/python3.10/site-packages/torch/utils/data/sampler.py
Type: type
Subclasses:
trn_samp = BatchSampler( RandomSampler(trn_ds), bs, drop_last=False)
vld_samp = BatchSampler(SequentialSampler(vld_ds), bs, drop_last=False)
To construct a dataloader with PyTorch, we have to provide the dataset and a sampler.
?DataLoader
Init signature:
DataLoader(
dataset: torch.utils.data.dataset.Dataset[+T_co],
batch_size: Optional[int] = 1,
shuffle: Optional[bool] = None,
sampler: Union[torch.utils.data.sampler.Sampler, Iterable, NoneType] = None,
batch_sampler: Union[torch.utils.data.sampler.Sampler[Sequence], Iterable[Sequence], NoneType] = None,
num_workers: int = 0,
collate_fn: Optional[Callable[[List[~T]], Any]] = None,
pin_memory: bool = False,
drop_last: bool = False,
timeout: float = 0,
worker_init_fn: Optional[Callable[[int], NoneType]] = None,
multiprocessing_context=None,
generator=None,
*,
prefetch_factor: Optional[int] = None,
persistent_workers: bool = False,
pin_memory_device: str = '',
)
Docstring:
Data loader. Combines a dataset and a sampler, and provides an iterable over
the given dataset.
The :class:`~torch.utils.data.DataLoader` supports both map-style and
iterable-style datasets with single- or multi-process loading, customizing
loading order and optional automatic batching (collation) and memory pinning.
See :py:mod:`torch.utils.data` documentation page for more details.
Args:
dataset (Dataset): dataset from which to load the data.
batch_size (int, optional): how many samples per batch to load
(default: ``1``).
shuffle (bool, optional): set to ``True`` to have the data reshuffled
at every epoch (default: ``False``).
sampler (Sampler or Iterable, optional): defines the strategy to draw
samples from the dataset. Can be any ``Iterable`` with ``__len__``
implemented. If specified, :attr:`shuffle` must not be specified.
batch_sampler (Sampler or Iterable, optional): like :attr:`sampler`, but
returns a batch of indices at a time. Mutually exclusive with
:attr:`batch_size`, :attr:`shuffle`, :attr:`sampler`,
and :attr:`drop_last`.
num_workers (int, optional): how many subprocesses to use for data
loading. ``0`` means that the data will be loaded in the main process.
(default: ``0``)
collate_fn (Callable, optional): merges a list of samples to form a
mini-batch of Tensor(s). Used when using batched loading from a
map-style dataset.
pin_memory (bool, optional): If ``True``, the data loader will copy Tensors
into device/CUDA pinned memory before returning them. If your data elements
are a custom type, or your :attr:`collate_fn` returns a batch that is a custom type,
see the example below.
drop_last (bool, optional): set to ``True`` to drop the last incomplete batch,
if the dataset size is not divisible by the batch size. If ``False`` and
the size of dataset is not divisible by the batch size, then the last batch
will be smaller. (default: ``False``)
timeout (numeric, optional): if positive, the timeout value for collecting a batch
from workers. Should always be non-negative. (default: ``0``)
worker_init_fn (Callable, optional): If not ``None``, this will be called on each
worker subprocess with the worker id (an int in ``[0, num_workers - 1]``) as
input, after seeding and before data loading. (default: ``None``)
generator (torch.Generator, optional): If not ``None``, this RNG will be used
by RandomSampler to generate random indexes and multiprocessing to generate
`base_seed` for workers. (default: ``None``)
prefetch_factor (int, optional, keyword-only arg): Number of batches loaded
in advance by each worker. ``2`` means there will be a total of
2 * num_workers batches prefetched across all workers. (default value depends
on the set value for num_workers. If value of num_workers=0 default is ``None``.
Otherwise if value of num_workers>0 default is ``2``).
persistent_workers (bool, optional): If ``True``, the data loader will not shutdown
the worker processes after a dataset has been consumed once. This allows to
maintain the workers `Dataset` instances alive. (default: ``False``)
pin_memory_device (str, optional): the data loader will copy Tensors
into device pinned memory before returning them if pin_memory is set to true.
.. warning:: If the ``spawn`` start method is used, :attr:`worker_init_fn`
cannot be an unpicklable object, e.g., a lambda function. See
:ref:`multiprocessing-best-practices` on more details related
to multiprocessing in PyTorch.
.. warning:: ``len(dataloader)`` heuristic is based on the length of the sampler used.
When :attr:`dataset` is an :class:`~torch.utils.data.IterableDataset`,
it instead returns an estimate based on ``len(dataset) / batch_size``, with proper
rounding depending on :attr:`drop_last`, regardless of multi-process loading
configurations. This represents the best guess PyTorch can make because PyTorch
trusts user :attr:`dataset` code in correctly handling multi-process
loading to avoid duplicate data.
However, if sharding results in multiple workers having incomplete last batches,
this estimate can still be inaccurate, because (1) an otherwise complete batch can
be broken into multiple ones and (2) more than one batch worth of samples can be
dropped when :attr:`drop_last` is set. Unfortunately, PyTorch can not detect such
cases in general.
See `Dataset Types`_ for more details on these two types of datasets and how
:class:`~torch.utils.data.IterableDataset` interacts with
`Multi-process data loading`_.
.. warning:: See :ref:`reproducibility`, and :ref:`dataloader-workers-random-seed`, and
:ref:`data-loading-randomness` notes for random seed related questions.
File: ~/mambaforge/envs/default/lib/python3.10/site-packages/torch/utils/data/dataloader.py
Type: type
Subclasses:
trn_dl = DataLoader(trn_ds, batch_sampler=trn_samp, collate_fn=collate)
vld_dl = DataLoader(vld_ds, batch_sampler=vld_samp, collate_fn=collate)
model, opt = get_model()
fit()
loss_func(model(xb), yb), accuracy(model(xb), yb)
Loss: 1.05; Accuracy: 0.64
Loss: 0.69; Accuracy: 0.72
Loss: 0.55; Accuracy: 0.84
(tensor(1.02, grad_fn=<NllLossBackward0>), tensor(0.66))
Instead of separately wrapping the RandomSampler
and SequentialSampler
classes, we can let the DataLoader
class do this for us.
trn_dl = DataLoader(trn_ds, bs, sampler= RandomSampler(trn_ds), collate_fn=collate)
vld_dl = DataLoader(vld_ds, bs, sampler=SequentialSampler(vld_ds), collate_fn=collate)
In fact, we don’t even need to specify the sampler. All we have to do is toggle shuffle and set a few other parameters.
trn_dl = DataLoader(trn_ds, bs, shuffle=True, drop_last=True, num_workers=2)
vld_dl = DataLoader(vld_ds, bs, shuffle=False, num_workers=2)
model, opt = get_model(); fit()
Loss: 0.80; Accuracy: 0.80
Loss: 0.27; Accuracy: 0.94
Loss: 0.40; Accuracy: 0.88
loss_func(model(xb), yb), accuracy(model(xb), yb)
(tensor(0.84, grad_fn=<NllLossBackward0>), tensor(0.68))
As our dataset already knows how to sample a batch of indices all at once, we can actually skip the batch_sampler
and collate_fn
entirely. 🙃
class Dataset():
def __init__(self, x, y): self.x, self.y = x, y
def __len__(self): return len(self.x)
def __getitem__(self, i): return self.x[i], self.y[i]
trn_ds[[4, 6, 7]]
(tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]),
tensor([9, 1, 3]))
trn_dl = DataLoader(trn_ds, sampler=trn_samp)
vld_dl = DataLoader(vld_ds, sampler=vld_samp)
xb, yb = next(iter(trn_dl)); xb.shape, yb.shape
(torch.Size([1, 50, 784]), torch.Size([1, 50]))
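Note the extra leading dimension of 1: the default batch_size=1 wraps each already-batched sample in another batch. A small sketch of one way around this, if desired, is to disable automatic batching with batch_size=None, so that each list of indices from the sampler is passed straight to __getitem__.
trn_dl = DataLoader(trn_ds, batch_size=None, sampler=trn_samp)
vld_dl = DataLoader(vld_ds, batch_size=None, sampler=vld_samp)
xb, yb = next(iter(trn_dl)); xb.shape, yb.shape  # expect torch.Size([50, 784]), torch.Size([50])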
When training and evaluating a model, model.train()
and model.eval()
need to be called respectively. These methods are used by layers such as nn.BatchNorm2d
and nn.Dropout
to ensure appropriate behaviour during different phases of the process.
?model.train
Signature: model.train(mode: bool = True) -> ~T
Docstring:
Sets the module in training mode.
This has any effect only on certain modules. See documentations of
particular modules for details of their behaviors in training/evaluation
mode, if they are affected, e.g. :class:`Dropout`, :class:`BatchNorm`,
etc.
Args:
mode (bool): whether to set training mode (``True``) or evaluation
mode (``False``). Default: ``True``.
Returns:
Module: self
File: ~/mambaforge/envs/default/lib/python3.10/site-packages/torch/nn/modules/module.py
Type: method
?model.eval
Signature: model.eval() -> ~T
Docstring:
Sets the module in evaluation mode.
This has any effect only on certain modules. See documentations of
particular modules for details of their behaviors in training/evaluation
mode, if they are affected, e.g. :class:`Dropout`, :class:`BatchNorm`,
etc.
This is equivalent with :meth:`self.train(False) <torch.nn.Module.train>`.
See :ref:`locally-disable-grad-doc` for a comparison between
`.eval()` and several similar mechanisms that may be confused with it.
Returns:
Module: self
File: ~/mambaforge/envs/default/lib/python3.10/site-packages/torch/nn/modules/module.py
Type: method
def fit(epochs, model, loss_func, opt, train_dl, valid_dl):
for epoch in range(epochs):
model.train()
for xb, yb in train_dl:
preds = model(xb)
loss = loss_func(preds, yb)
loss.backward()
opt.step()
opt.zero_grad()
model.eval()
with torch.no_grad():
tot_loss, tot_acc, count = (0.,) * 3
for xb, yb in valid_dl:
preds = model(xb)
n = len(xb)
count += n
tot_loss += loss_func(preds, yb).item() * n
tot_acc += accuracy (preds, yb).item() * n
print(epoch, tot_loss/count, tot_acc/count)
return tot_loss/count, tot_acc/count
def get_dls(train_ds, valid_ds, bs, **kwargs):
    return (DataLoader(train_ds, batch_size=bs, shuffle=True, **kwargs),
            DataLoader(valid_ds, batch_size=bs*2, **kwargs))
trn_dl, vld_dl = get_dls(trn_ds, vld_ds, bs)
model, opt = get_model()
%time loss, acc = fit(5, model, loss_func, opt, trn_dl, vld_dl)
0 1.3015430688858032 0.6180000007152557
1 0.7089294970035553 0.7680000007152558
2 0.6260120451450348 0.7990000009536743
3 0.501511612534523 0.8490000128746032
4 0.5909725487232208 0.8119999945163727
CPU times: user 1.55 s, sys: 41.8 ms, total: 1.59 s
Wall time: 358 ms
If you have any comments, questions, suggestions, feedback, criticisms, or corrections, please do post them down in the comment section below!
This notebook follows the fastai style guide.
Tokenization is the process whereby text is given a numerical representation. Sentences are split into components known as tokens. These tokens represent numerical values that language models can work with.
There are various approaches to tokenization. Examples include word-based, character-based, and subword-based tokenization.
A language model requires the tokenization technique it was trained with to work properly. Let’s have a look at three approaches.
The word-based approach, well, splits sentences into individual words. In some cases, it also splits on punctuation.
In the example below, the sentence is tokenized into its words using whitespace.
"I'm really excited doing this, you know?".split()
["I'm", 'really', 'excited', 'doing', 'this,', 'you', 'know?']
Let’s see it split based on its punctuation.
import re
seq = "I'm really excited doing this, you know?"
toks = re.findall(r'\w+|[^\w\s]+', seq); toks
['I', "'", 'm', 'really', 'excited', 'doing', 'this', ',', 'you', 'know', '?']
After tokenizing, an ID is assigned to each word, or token, so the model can identify them.
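As a toy illustration (a hypothetical mapping, not how any real tokenizer builds its vocabulary), we could assign an ID to each unique token produced by the split above.
# toy vocabulary: map every unique token from `toks` to an integer ID
vocab = {tok: i for i, tok in enumerate(sorted(set(toks)))}
[vocab[tok] for tok in toks]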
The issue with the word-based approach is that we end up with huge vocabularies^{1}, especially when splitting on punctuation. For instance, the English language has over 500,000 words, so we would also need more than 500,000 tokens.
^{1} A vocabulary is a collection of tokens.
^{2} Examples of such tokens include [UNK] or <unk>.
To remedy this, we could use only the most frequently used words. However, the issue that arises here is when the tokenizer encounters a word not present in its vocabulary. In this situation, a token representing the concept of “unknown” would be assigned^{2}. When there are many such tokens, the model has no way of “knowing” that these tokens in fact represent different words.
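Reusing the toy vocabulary idea, here is a hypothetical sketch of that failure mode: keep only a few frequent words plus an unknown token, and every out-of-vocabulary word collapses into the same ID.
# hypothetical tiny vocabulary with an unknown token
vocab = {'really': 0, 'excited': 1, '[UNK]': 2}
# 'thrilled' and 'ecstatic' both map to the [UNK] ID, so the model cannot tell them apart
[vocab.get(w, vocab['[UNK]']) for w in ['really', 'thrilled', 'ecstatic']]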
Another issue with this approach is that the tokenizer will assign words such as “car” and “cars” different tokens. The model will not know that these two words are actually similar and represent almost the same concept.
This approach splits text into characters, resulting in a much, much smaller vocabulary – the English alphabet only has 26 letters, as opposed to hundreds of thousands of words. It also results in fewer unknown tokens, since any word can be built from the characters in the vocabulary.
list("Who doesn't love tokenization!")
['W',
'h',
'o',
' ',
'd',
'o',
'e',
's',
'n',
"'",
't',
' ',
'l',
'o',
'v',
'e',
' ',
't',
'o',
'k',
'e',
'n',
'i',
'z',
'a',
't',
'i',
'o',
'n',
'!']
However, this approach also has its drawbacks. Individual characters hold less meaning than a whole word. For example, ‘t’ holds less meaning than ‘tokenization’.
That said, this issue is not as prevalent in some other languages. In Chinese, each character is also a word, so a Chinese character holds more meaning than a character in a Latin alphabet.
While there will be an overall smaller vocabulary, there will still be much more processing to do – we end up with a large number of individual tokens to process. ‘Hello!’ would need only a single token, whereas ‘H’, ‘e’, ‘l’, ‘l’, ‘o’, and ‘!’ would require six tokens.
This approach is a combination of the two approaches above, and is also the approach most state-of-the-art tokenizers use today.
With subword-based tokenizers, words fall into two categories: frequent words and rare words. Frequent words are not to be split, but rare words are to be split into meaningful subwords.
For example, ‘tokenization’ would be categorized as a rare word and would be tokenized into the tokens ‘token’ and ‘ization’. Though one word is now represented by two tokens, as opposed to a single token with the word-based approach, it is split into two components that appear much more frequently. We also don’t need eleven tokens, as we would with the character-based approach. On top of that, the model would learn the grammatical function of ‘ization’.
This is all while giving the model the ability to learn the meaning of a word like ‘realization’, as the two tokens that comprise it appear next to each other.
This approach gives us relatively good coverage of a language while keeping the vocabulary relatively small. It also results in minimal unknown tokens.
If we draw parallels between a tokenizer and a model, the algorithm of a tokenizer is akin to the architecture of a model. On a similar note, the vocabulary of a tokenizer is akin to the weights of a model.
Let’s load in the tokenizer used for the BERT base model (cased).
! pip install -Uqq transformers
1import logging; logging.disable(logging.WARNING)
from transformers import AutoTokenizer
tokz = AutoTokenizer.from_pretrained('bert-base-cased')
We can use the loaded tokenizer to directly tokenize our desired sequence.
seq = "The process of tokenization has lead me to the appalling conclusion: life isn't what it is."
tokz(seq)
{'input_ids': [101, 1109, 1965, 1104, 22559, 2734, 1144, 1730, 1143, 1106, 1103, 12647, 5727, 1158, 6593, 131, 1297, 2762, 112, 189, 1184, 1122, 1110, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
However, let’s look behind the scenes to see what’s happening. We’ll only focus on how input_ids
came to be.
Encoding is the name given to the process whereby text is mapped to numbers. Text is first tokenized, after which, the tokens are mapped to their respective IDs.
toks = tokz.tokenize(seq); toks
['The',
'process',
'of',
'token',
'##ization',
'has',
'lead',
'me',
'to',
'the',
'app',
'##all',
'##ing',
'conclusion',
':',
'life',
'isn',
"'",
't',
'what',
'it',
'is',
'.']
As we can see, the tokenizer used by the BERT base model (cased) is a subword-based tokenizer. This can be seen by ‘tokenization’ being split into ‘token’ and ‘##ization’, as well as ‘appalling’ being split into ‘app’, ‘##all’, and ‘##ing’.
ids = tokz.convert_tokens_to_ids(toks); ids
[1109,
1965,
1104,
22559,
2734,
1144,
1730,
1143,
1106,
1103,
12647,
5727,
1158,
6593,
131,
1297,
2762,
112,
189,
1184,
1122,
1110,
119]
The numbers that have been assigned are based on the vocabulary of the tokenizer. These IDs can now be used as input to a model.
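These are the same IDs that appeared in the middle of input_ids earlier; calling the tokenizer directly also wraps the sequence in BERT’s special tokens, [CLS] (ID 101) at the start and [SEP] (ID 102) at the end. We can verify this using the cls_token_id and sep_token_id attributes the tokenizer exposes.
# the direct call only adds the special tokens around the ids we built manually
tokz(seq)['input_ids'] == [tokz.cls_token_id] + ids + [tokz.sep_token_id]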
Decoding is simply the opposite process: convert a sequence of IDs into their respective tokens, including putting together tokens that were part of the same word.
dec_seq = tokz.decode(ids); dec_seq
"The process of tokenization has lead me to the appalling conclusion : life isn't what it is."
The decoding algorithm of our tokenizer has introduced a space before the colon. 🤔
Decoding is used for models that generate text: the model outputs a sequence of IDs which are then decoded to their respective tokens.
Tokenization is all about splitting text up and giving the split up text a numerical representation that computers can work with.
If you have any comments, questions, suggestions, feedback, criticisms, or corrections, please do post them down in the comment section below!
This guide assumes a basic understanding of derivatives and matrices.
Backpropagation sounds and looks daunting. It doesn’t need to be. In fact, backpropagation is really just a fancy word for the chain rule. Implementing a backpropagation algorithm is simply implementing one big fat chain rule equation.
Let’s remind ourselves of the chain rule. The chain rule lets us figure out how much a given variable indirectly changes with respect to another variable. Take the example below.

$$y = f(u) \qquad u = g(x)$$

We want to figure out how much $y$ changes with each increment in $x$. The problem is that $x$ doesn’t directly change $y$. Rather, $x$ changes $u$, which in turn changes $y$.
The chain rule allows us to solve this problem. In this case, the chain rule tells us that we can figure out how much $y$ indirectly changes with $x$ by multiplying the derivative of $y$ with respect to $u$ and the derivative of $u$ with respect to $x$.

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
Aaand I’ve just described backpropagation in a nutshell. That’s all there really is to it. The only difference is that in a neural network there are many more intermediate variables and functions, and that we want to find out how the weights indirectly change the loss.
Let’s see this tangibly in action.
We have the following neural network comprising two layers: the first layer contains the affine function^{1} together with the ReLU, while the second layer contains only the affine function. The loss, which is MSE (Mean Squared Error), will then be calculated from the output of the second layer.
^{1} Affine function is a fancy name for the linear function.
Mathematically speaking, the first layer with a single sample $x$ looks like this.

$$l_1 = \mathrm{ReLU}(x w_1 + b_1)$$

The second layer looks like this.

$$l_2 = l_1 w_2 + b_2$$

And the loss function looks like this.

$$\mathrm{loss} = \mathrm{MSE}(l_2, y)$$

MSE in its most basic form looks like this.

$$(y - \hat{y})^2$$

If we have multiple data points, then it looks like this.

$$(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_n - \hat{y}_n)^2$$

However, when working with multiple samples, the mean squared error comes out looking like this, where $n$ represents the total number of samples.

$$\frac{(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_n - \hat{y}_n)^2}{n}$$

Or more simply…^{2}

$$\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

^{2} $\sum$ is known as the summation or sigma operator. If we have the equation $\sum_{i=1}^{4} x_i$, it means sum the equation for all values of $i$ from 1 to 4. Find out more here.
…or even more simply.

$$\frac{1}{n}\sum(y - \hat{y})^2$$

Our goal for the rest of this guide is to derive the gradients of $w_1$, that is, how the MSE changes as $w_1$ changes. Written out in full, with $\hat{y} = l_2$, the loss is:

$$\mathrm{MSE} = \frac{1}{n}\sum\bigl(y - (\mathrm{ReLU}(x w_1 + b_1)\, w_2 + b_2)\bigr)^2$$
The equation above looks quite the mouthful though. One might even say scary. How would you even apply the chain rule here? How would you use the chain rule to derive the gradients of the weights and biases?
Let’s simplify things by introducing a bunch of intermediate variables. We’ll begin by substituting the innermost pieces of the equation, and then gradually make our way out.

$$u = x w_1 + b_1$$
$$l_1 = \mathrm{ReLU}(u)$$
$$l_2 = l_1 w_2 + b_2$$
$$d = y - l_2$$

The menacing equation above now gradually simplifies into the cute equation below.

$$\mathrm{MSE} = \frac{1}{n}\sum d^2$$

Very cute, hey?
In this cuter version of the equation, it is visible that incrementing $w_1$ does not directly change the MSE. Rather, incrementing $w_1$ changes $u$, which changes $l_1$, which changes $l_2$, which changes $d$, which in turn changes the MSE.
See? Just a big, fat, and simple chain rule problem.
$\partial$ is a curly “d” and can be read as “curly d”, or simply as “d”. $\partial$ notation will be used below, due to a concept known as partial derivatives. We will not go into this concept here, however, this is a great brief rundown on partial derivatives.
Now we can tackle finding the gradients for $w_1$. To do so, let’s find the gradients of each intermediate variable.^{3} ^{4}
^{3} If needed, get a refresher of the derivative rules here.
^{4} The curly brace denotes a piecewise function. The simplest piecewise function returns one calculation if a condition is met, and another calculation if the condition is not met. It can be thought of as an if-else statement in programming. Find out more here.

$$\frac{\partial \mathrm{MSE}}{\partial d} = \frac{2}{n}d$$
$$\frac{\partial d}{\partial l_2} = -1$$
$$\frac{\partial l_2}{\partial l_1} = w_2$$
$$\frac{\partial l_1}{\partial u} = \begin{cases} 1 & u > 0 \\ 0 & u \le 0 \end{cases}$$
$$\frac{\partial u}{\partial w_1} = x$$

Now we multiply everything together.

$$\frac{\partial \mathrm{MSE}}{\partial w_1} = \frac{\partial \mathrm{MSE}}{\partial d} \cdot \frac{\partial d}{\partial l_2} \cdot \frac{\partial l_2}{\partial l_1} \cdot \frac{\partial l_1}{\partial u} \cdot \frac{\partial u}{\partial w_1}$$

And it all eventually expands out to the following.

$$\frac{\partial \mathrm{MSE}}{\partial w_1} = \frac{2}{n}(y - l_2) \cdot (-1) \cdot w_2 \cdot \begin{cases} 1 & u > 0 \\ 0 & u \le 0 \end{cases} \cdot x$$

We can further simplify by taking the $\frac{2}{n}$ and the $-1$ to the front.

$$\frac{\partial \mathrm{MSE}}{\partial w_1} = -\frac{2}{n}(y - l_2) \cdot w_2 \cdot \begin{cases} 1 & u > 0 \\ 0 & u \le 0 \end{cases} \cdot x$$

We can simplify even further, by letting $e = y - l_2$. The $e$ stands for “error”.

$$\frac{\partial \mathrm{MSE}}{\partial w_1} = -\frac{2}{n}\, e \cdot w_2 \cdot \begin{cases} 1 & u > 0 \\ 0 & u \le 0 \end{cases} \cdot x$$

And there you go! We’ve derived the formula that will allow us to calculate the gradients of $w_1$.
When implementing backpropagation in a program, it is often better to implement the entire equation in pieces, as opposed to a single line of code, by storing the result of each intermediate gradient. These intermediate gradients can be reused to calculate the gradients of another variable, such as the bias $b_1$.
Instead of implementing the following in a single line of code…

$$\frac{\partial \mathrm{MSE}}{\partial w_1} = \frac{\partial \mathrm{MSE}}{\partial d} \cdot \frac{\partial d}{\partial l_2} \cdot \frac{\partial l_2}{\partial l_1} \cdot \frac{\partial l_1}{\partial u} \cdot \frac{\partial u}{\partial w_1}$$

…we can instead first calculate the gradients of $d$.

$$\frac{\partial \mathrm{MSE}}{\partial d}$$

Then calculate the gradients of $l_2$ and multiply them with the gradients of $d$.

$$\frac{\partial \mathrm{MSE}}{\partial d} \cdot \frac{\partial d}{\partial l_2}$$

Then multiply the product above with the gradients of $l_1$.

$$\frac{\partial \mathrm{MSE}}{\partial d} \cdot \frac{\partial d}{\partial l_2} \cdot \frac{\partial l_2}{\partial l_1}$$

Then multiply the product above with the gradients of $u$.

$$\frac{\partial \mathrm{MSE}}{\partial d} \cdot \frac{\partial d}{\partial l_2} \cdot \frac{\partial l_2}{\partial l_1} \cdot \frac{\partial l_1}{\partial u}$$

And finally multiply the product above with the gradients of $w_1$.

$$\frac{\partial \mathrm{MSE}}{\partial d} \cdot \frac{\partial d}{\partial l_2} \cdot \frac{\partial l_2}{\partial l_1} \cdot \frac{\partial l_1}{\partial u} \cdot \frac{\partial u}{\partial w_1} = \frac{\partial \mathrm{MSE}}{\partial w_1}$$
Let’s see this using Python instead.
The following is our neural network.
l1
is the first layer, $l_1 = \mathrm{ReLU}(x w_1 + b_1)$.
l2
is the second layer, $l_2 = l_1 w_2 + b_2$.
loss
is the MSE, $\mathrm{loss} = \frac{1}{n}\sum(y - l_2)^2$.
First we need to calculate the gradients of $d$, i.e. $\frac{\partial \mathrm{MSE}}{\partial d}$.
diff = (trn_y - l2)
1loss.g = (2/trn_x.shape[0]) * diff
trn_x.shape[0]
, in this case, returns the total number of samples.
Next are the gradients of $l_2$.
diff.g = loss.g * -1
Then the gradients of $l_1$.
l2.g = diff.g @ w2.T
Then the gradients of $u$.
l1.g = l2.g * (l1 > 0).float()
And finally the gradients of $w_1$.
w1.g = (l1.g * trn_x).sum()
The equation for the gradient of $b_1$ is almost the same as the equation for the gradients of $w_1$, save for the last line, where we do not have to multiply with $x$ (since $\frac{\partial u}{\partial b_1} = 1$). Therefore, we can reuse all previous gradient calculations to find the gradient of $b_1$.
b1.g = (l1.g * 1).sum()
When multiplying various tensors together, make sure their shapes are compatible. Shape manipulations have been omitted above for simplicity.
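To tie everything together, here is a minimal sketch of the full forward and backward pass with the shapes handled explicitly. The function name and the l1_lin variable for the first affine output are my own; the gradient steps mirror the ones above.
def forward_and_backward(x, y, w1, b1, w2, b2):
    # forward pass
    l1_lin = x @ w1 + b1                    # first affine, shape (n, nh)
    l1 = l1_lin.clamp_min(0.)               # ReLU
    l2 = l1 @ w2 + b2                       # second affine, shape (n, 1)
    diff = y[:, None] - l2                  # d = y - l2, shape (n, 1)
    loss = diff.pow(2).mean()               # MSE
    # backward pass, storing each intermediate gradient
    loss.g = (2. / x.shape[0]) * diff       # dMSE/dd
    diff.g = loss.g * -1                    # dMSE/dl2
    l2.g = diff.g @ w2.t()                  # dMSE/dl1
    l1.g = l2.g * (l1_lin > 0).float()      # dMSE/du (through the ReLU)
    w1.g = x.t() @ l1.g                     # dMSE/dw1
    b1.g = l1.g.sum(0)                      # dMSE/db1
    return loss
Calling forward_and_backward(trn_x, trn_y, w1, b1, w2, b2) leaves the computed gradients in w1.g and b1.g.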
And that’s all there really is to backpropagation; think of it as one big chain rule problem.
To make sure you’ve got it hammered down, get out a pen and paper and derive the equations that would compute the gradients of $w_1$, $b_1$, $w_2$, and $b_2$, respectively, with respect to the MSE.
And if you really want to hammer down your understanding on what’s happening, then I highly recommend reading The Matrix Calculus You Need For Deep Learning. I’ve also compiled backpropagation practice questions from this paper!
If you have any comments, questions, suggestions, feedback, criticisms, or corrections, please do post them down in the comment section below!
This notebook follows the fastai style guide.
Meanshift clustering is a technique for unsupervised learning. Give this algorithm a bunch of data and it will figure out what groups the data can be sorted into. It does this by iteratively moving all data points until the points in each cluster converge to a single point.
The steps of the algorithm can be summarized as follows:
1. For each data point $x$, calculate the distance between $x$ and every other point in the dataset.
2. Pass the distances through a Gaussian kernel to obtain a weight for every point.
3. Move $x$ to the weighted average of all points.
4. Repeat for a number of iterations, until the points converge.
This is the data we will work with to illustrate meanshift clustering. The data points are put into clearly separated clusters for the sake of clarity.
In the end, all clusters will converge at their respective center (marked by X).
Let’s start off simple and apply the algorithm to a single point.
For each data point $x$ in the dataset, calculate the distance between $x$ and every other data point in the dataset.
data
tensor([[ 0.611, -20.199],
[ 4.455, -24.188],
[ 2.071, -20.446],
...,
[ 25.927, 6.597],
[ 18.549, 3.411],
[ 24.617, 8.485]])
X = data.clone(); X.shape
torch.Size([1500, 2])
Each point has an $x$ coordinate and a $y$ coordinate.
x = X[0, :]; x - X
tensor([[ 0.000, 0.000],
[ -3.844, 3.989],
[ -1.460, 0.247],
...,
[-25.316, -26.796],
[-17.938, -23.610],
[-24.006, -28.684]])
The distance metric we’ll use is Euclidean distance — also better known as Pythagoras’ theorem.
dists = (x - X).square().sum(dim=1).sqrt(); dists
tensor([ 0.000, 5.540, 1.481, ..., 36.864, 29.651, 37.404])
Calculate weights for each point in the dataset by passing the calculated distances through the normal distribution.
The normal distribution is also known as the Gaussian distribution. A distribution is simply a way to describe how data is spread out — we aren’t using it for that purpose here. What we are using is the shape of its curve, which turns each distance into a weight.
def gauss_kernel(x, mean, std):
return torch.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (std * torch.sqrt(2 * tensor(torch.pi)))
This is what it looks like.
From the shape of this graph, we can see that larger values of $x$ give smaller values of $y$, which is what we want — longer distances should have smaller weights, meaning they have a smaller effect on the new position of the point.
We can control the rate at which the weights go to zero by varying what’s known as the bandwidth, or the standard deviation. The graph above is generated with a bandwidth of 2.5.
The graph below is generated with a bandwidth of 1.
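To see the effect numerically, using the gauss_kernel defined above, the same distance receives a much smaller weight with the narrower bandwidth.
# the same distance of 5 is weighted far less with a bandwidth of 1 than with 2.5
gauss_kernel(tensor(5.), mean=0, std=2.5), gauss_kernel(tensor(5.), mean=0, std=1)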
Let’s get our weights now.
gauss_kernel(dists, mean=0, std=2.5)
tensor([ 0.160, 0.014, 0.134, ..., 0.000, 0.000, 0.000])
bw = 2.5
ws = gauss_kernel(x=dists, mean=0, std=bw)
Calculate the weighted average of all points in the dataset. This weighted average is the new location for $x$.
ws.shape, X.shape
(torch.Size([1500]), torch.Size([1500, 2]))
ws[:, None].shape, X.shape
(torch.Size([1500, 1]), torch.Size([1500, 2]))
Below is the formula for the weighted average.

$$x_{\text{new}} = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}$$

In words, multiply each data point in the set with its corresponding weight and sum all products. Divide that by the sum of all weights.
ws[:, None] * X, ws[0] * X[0, :]
(tensor([[ 0.097, -3.223],
[ 0.061, -0.331],
[ 0.277, -2.738],
...,
[ 0.000, 0.000],
[ 0.000, 0.000],
[ 0.000, 0.000]]),
tensor([ 0.097, -3.223]))
Let’s calculate the weighted average and assign it as the new location for our point $x$.
x = (ws[:, None] * X).sum(dim=0) / ws.sum(); x
tensor([ 1.695, -20.786])
And there you have it! We just moved a single data point.
Let’s do this for all data points and for a single iteration.
for i, x in enumerate(X):
dist = (x - X).square().sum(dim=1).sqrt()
ws = gauss_kernel(x=dist, mean=0, std=bw)
X[i] = (ws[:, None] * X).sum(dim=0) / ws.sum()
plot_data(centroids+2, X, n_samples)
Let’s encapsulate the algorithm so we can run it for multiple iterations.
def update(X):
for i, x in enumerate(X):
dist = (x - X).square().sum(dim=1).sqrt()
ws = gauss_kernel(x=dist, mean=0, std=bw)
X[i] = (ws[:, None] * X).sum(dim=0) / ws.sum()
def meanshift(data):
X = data.clone()
for _ in range(5): update(X)
return X
plot_data(centroids+2, meanshift(data), n_samples)
All points have converged.
%timeit -n 10 meanshift(data)
1.7 s ± 282 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The algorithm took roughly 1.7 seconds to run 5 iterations. We’ll optimize the algorithm further in Optimized Implementation.
As we can see below, simply moving the algorithm to the GPU won’t help — it’s no faster, because the per-point Python loop remains the bottleneck.
def update(X):
for i, x in enumerate(X):
dist = (x - X).square().sum(dim=1).sqrt()
ws = gauss_kernel(x=dist, mean=0, std=bw)
X[i] = (ws[:, None] * X).sum(dim=0) / ws.sum()
def meanshift(data):
X = data.clone().to('cuda')
for _ in range(5): update(X)
return X.detach().cpu()
%timeit -n 10 meanshift(data)
1.67 s ± 49.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
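The per-point Python loop launches thousands of tiny operations per iteration, which is what holds the GPU back. As a rough sketch of the idea behind the optimized version to come (not the final implementation), the whole update can be expressed as a few large tensor operations by computing all pairwise distances at once.
def update_batched(X):
    dists = torch.cdist(X, X)                      # all pairwise Euclidean distances, shape (n, n)
    ws = gauss_kernel(dists, mean=0, std=bw)       # one row of weights per point
    return (ws @ X) / ws.sum(dim=1, keepdim=True)  # weighted average for every point at once
Unlike the loop above, this moves every point based on the same snapshot of X rather than on partially updated positions.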
Let’s see meanshift clustering happen in real time.
X = data.clone()
fig = plot_data(centroids+2, X, n_samples, display=False)
fig.update_layout(xaxis_range=[-40, 40], yaxis_range=[-40, 40], updatemenus=[dict(type='buttons', buttons=[
dict(label='Play', method='animate', args=[None]),
dict(label='Pause', method='animate', args=[[None], dict(frame_duration=0, frame_redraw='False', mode='immediate', transition_duration=0)])
])])
frames = [go.Frame(data=fig.data)]
for _ in range(5):
update(X)
frames.append(go.Frame(data=plot_data(centroids+2, X, n_samples, display=False).data))
fig.frames = frames
fig.show()