### CNN Weights - Learnable Parameters in Neural Networks

Welcome back to this series on neural network programming with PyTorch. It’s time now to learn about the weight tensors inside our CNN. We’ll find that these weight tensors live inside our layers and are learnable parameters of our network. Without further ado, let’s get started.

### Our Neural Network

In the last couple of posts in this series, we’ve started building our CNN, and we put in some work to understand the layers we defined inside our network’s constructor.

```python
import torch.nn as nn

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)

        self.fc1 = nn.Linear(in_features=12*4*4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=60)
        self.out = nn.Linear(in_features=60, out_features=10)

    def forward(self, t):
        # implement the forward pass
        return t
```

Ultimately, our next step in the overall process is to use these layers inside our network’s forward method, but right now, let’s take a look at the learnable parameters inside our network.

We already know about hyperparameters. We saw that hyperparameters are parameters whose values we pick manually, rather than values the network learns.

The hyperparameters we’ve used up to this point were the parameters that we used to construct our network’s architecture through the layers we constructed and assigned as attributes.

These hyperparameters aren’t the only hyperparameters though, and we will see more hyperparameters when we start the training process. What we are concerned with now is the learnable parameters of our network.

### Learnable Parameters

*Learnable parameters* are parameters whose values are learned during the training process.

With learnable parameters, we typically start out with a set of arbitrary values, and these values then get updated in an iterative fashion as the network learns.

In fact, when we say that a network is learning, we specifically mean that the network is learning the appropriate values for the learnable parameters. Appropriate values are values that minimize the loss function.
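To make the idea of iterative updates concrete, here is a minimal sketch that uses a single learnable value instead of a full network. The names `w`, `target`, and `lr`, the squared-error loss, and the plain gradient descent update are all assumptions chosen for illustration; they are not part of the network we are building in this series.

```python
import torch

# Hypothetical one-parameter "network": a single learnable value w.
w = torch.tensor(0.0, requires_grad=True)  # arbitrary starting value
target = torch.tensor(3.0)                 # value that minimizes the loss
lr = 0.1                                   # learning rate (a hyperparameter)

for step in range(100):
    loss = (w - target) ** 2   # loss is smallest when w equals target
    loss.backward()            # compute d(loss)/d(w)
    with torch.no_grad():
        w -= lr * w.grad       # move w a small step against the gradient
    w.grad.zero_()             # reset the gradient for the next iteration

print(w.item())  # very close to 3.0
```

The starting value of `w` is arbitrary, and each pass through the loop nudges it toward the value that minimizes the loss. This is the same pattern we will see at scale when we train the full network.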

When it comes to our network, we might be thinking, where are these learnable parameters?

Well, the learnable parameters are the weights inside our network, and they live inside each layer.

### Getting an Instance of the Network

In PyTorch, we can inspect the weights directly. Let’s grab an instance of our network class and see this.

```python
network = Network()
```

Remember, to get an object instance of our `Network` class, we type the class name followed by parentheses. When this code executes, the code inside the `__init__` class constructor will run, assigning our layers as attributes before the object instance is returned.

The name `__init__` is short for initialize. In an object’s case, the attributes are initialized with values, and these values can indeed be other objects. In this way, objects can be nested inside other objects.

This is the case with our network class, whose attributes are initialized with instances of PyTorch layer classes. After the object is initialized, we can then access it using the `network` variable.

Before we start to work with our newly created network object, have a look at what happens when we pass our network to Python’s `print()` function.

```python
> print(network)
Network(
    (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
    (conv2): Conv2d(6, 12, kernel_size=(5, 5), stride=(1, 1))
    (fc1): Linear(in_features=192, out_features=120, bias=True)
    (fc2): Linear(in_features=120, out_features=60, bias=True)
    (out): Linear(in_features=60, out_features=10, bias=True)
)
```

The `print()` function prints to the console a string representation of our network. With a sharp eye, we can notice that the printed output details our network’s architecture, listing out our network’s layers and showing the values that were passed to the layer constructors.

#### Network String Representation

One question, though: how is this happening?

Well, our network class is inheriting this functionality from the PyTorch `Module` base class. Watch what happens if we stop extending the neural network module class.

```python
> print(network)
<__main__.Network object at 0x0000017802302FD0>
```

Now, we don’t get that nice descriptive output like before. Instead we get this technical gibberish which is the default Python string representation that we get if we don’t provide one.

For this reason, in object oriented programming, we usually want to provide a string representation of our object inside our classes so that we get useful information when the object is printed. The default string representation comes from Python’s default base class, called `object`.

#### How Overriding Works

All Python classes automatically extend the `object` class. If we want to provide a custom string representation for our object, we can do it, but we need to introduce another object oriented concept called *overriding*.

When we extend a class, we get all of its functionality, and to complement this, we can add additional functionality. However, we can also override existing functionality by changing it to behave differently.

We can override Python’s default string representation using the `__repr__` method. This name is short for *representation*.

```python
def __repr__(self):
    return "lizardnet"
```

This time, when we pass the network to the `print()` function, the string that we specified in our class definition is printed in place of Python’s default string.

```python
> print(network)
lizardnet
```
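The difference between the default representation and an overridden one can be seen side by side in a small, self-contained sketch. The classes `PlainNet` and `LizardNet` are hypothetical stand-ins that use plain Python only, so no PyTorch is needed to run this.

```python
class PlainNet:
    """Hypothetical class that does not define __repr__."""
    pass


class LizardNet:
    """Hypothetical class that overrides the default representation."""

    def __repr__(self):
        return "lizardnet"


plain = PlainNet()
lizard = LizardNet()

# Falls back to object.__repr__, which shows the class name and a memory address.
print(plain)

# print() calls str(), and str() falls back to __repr__ when __str__ is not defined.
print(lizard)
```

Only the second class prints something we chose ourselves, because it replaces the behavior it inherited from `object`.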

When we talked about OOP before, we learned about the `__init__` method and how it is a special Python method for constructing objects.

Well, there are other special methods we’ll encounter, and `__repr__` is one of them. All the special OOP Python methods typically have a double underscore prefix and suffix.

This is how the PyTorch `Module` base class works as well. The `Module` base class overrides the `__repr__` method.

### What’s in the string representation?

For the most part, the string representation that PyTorch gives us matches what we would expect based on how we configured our network’s layers.

However, there is a bit of additional information that we should highlight.

#### Convolutional Layers

For the convolutional layers, the `kernel_size` argument is a Python tuple `(5, 5)` even though we only passed the number `5` in the constructor.

```python
Network(
    (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
    (conv2): Conv2d(6, 12, kernel_size=(5, 5), stride=(1, 1))
    (fc1): Linear(in_features=192, out_features=120, bias=True)
    (fc2): Linear(in_features=120, out_features=60, bias=True)
    (out): Linear(in_features=60, out_features=10, bias=True)
)
```

This is because our filters actually have a height and width, and when we pass a single number, the code inside the layer’s constructor assumes that we want a square filter.

```python
self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)
```

The stride is an additional parameter that we could have set, but we left it out. When the stride is not specified in the layer constructor, the layer automatically sets it to its default, which is the `(1, 1)` we see in the output.

The stride tells the conv layer how far the filter should slide after each operation in the overall convolution. This tuple says to slide by one unit when moving to the right and also by one unit when moving down.
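We can check how a single number is expanded into a tuple by inspecting the layer’s attributes directly. The sketch below builds standalone layers named `square` and `rect` (hypothetical names, used only for illustration), and the non-square kernel and stride values are arbitrary choices.

```python
import torch.nn as nn

# Passing a single number gives a square filter and the default stride.
square = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
print(square.kernel_size)  # (5, 5)
print(square.stride)       # (1, 1)

# Passing tuples lets us choose height and width separately.
rect = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=(3, 5), stride=(2, 1))
print(rect.kernel_size)    # (3, 5)
print(rect.stride)         # (2, 1)
print(rect.weight.shape)   # torch.Size([6, 1, 3, 5])
```

In each tuple, the first value applies to the height axis and the second to the width axis.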

#### Linear Layers

For the linear layers, we have an additional parameter called `bias`, which has a default value of `True`. It is possible to turn this off by setting it to `False`.

```python
self.fc1 = nn.Linear(in_features=12*4*4, out_features=120)
self.fc2 = nn.Linear(in_features=120, out_features=60)
self.out = nn.Linear(in_features=60, out_features=10)
```
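As a quick illustration of the `bias` option, the sketch below builds one linear layer with the default setting and one with the bias turned off. The variable names `with_bias` and `no_bias` are hypothetical and not part of our network.

```python
import torch.nn as nn

with_bias = nn.Linear(in_features=120, out_features=60)             # bias=True by default
no_bias = nn.Linear(in_features=120, out_features=60, bias=False)   # bias turned off

print(with_bias.bias.shape)  # torch.Size([60])
print(no_bias.bias)          # None
print(no_bias)               # the string representation now reports bias=False
```

When the bias is turned off, the layer simply has no bias tensor to learn.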

One thing to note about the information displayed for our objects when we print them is that it’s completely arbitrary information.

As developers, we can decide to put any information there. However, the Python documentation tells us that the info should be complete enough that it can be used to reconstruct the object if needed.

### Accessing the Network's Layers

Well, now that we’ve got an instance of our network and we’ve reviewed our layers, let’s see how we can access them in code.

In Python and many other programming languages, we access attributes and methods of objects using dot notation.

```python
> network.conv1
Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))

> network.conv2
Conv2d(6, 12, kernel_size=(5, 5), stride=(1, 1))

> network.fc1
Linear(in_features=192, out_features=120, bias=True)

> network.fc2
Linear(in_features=120, out_features=60, bias=True)

> network.out
Linear(in_features=60, out_features=10, bias=True)
```

This is dot notation in action. With dot notation, we use a dot to indicate that we want to sort of open up the object and access something that’s inside. We’ve already been using this quite a bit, so the mention here just gives us a label for the concept.

Something to notice here, which ties back to what we were just saying about the string representation of the network, is that each of these pieces of code is also giving us a string representation of each layer.

In the network’s case, the network class is really just compiling all this data together to give us a single output.

```python
Network(
    (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
    (conv2): Conv2d(6, 12, kernel_size=(5, 5), stride=(1, 1))
    (fc1): Linear(in_features=192, out_features=120, bias=True)
    (fc2): Linear(in_features=120, out_features=60, bias=True)
    (out): Linear(in_features=60, out_features=10, bias=True)
)
```
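One way to convince ourselves that the network’s representation is assembled from the layer representations is to check that each layer’s string appears inside the network’s string. The `TinyNet` class below is a hypothetical, cut-down network used only so the snippet runs on its own.

```python
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical two-layer network used only for this check."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.out = nn.Linear(in_features=60, out_features=10)

tiny = TinyNet()
network_repr = repr(tiny)

# named_children() yields each attribute name together with its layer object.
for name, layer in tiny.named_children():
    print(name, "->", repr(layer), "| found in network repr:", repr(layer) in network_repr)
```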

One last thing to mention about the string representation of these objects is that, in this case, we aren’t actually using the `print()` function.

The reason we are still getting back the string representation is that we are using a Jupyter notebook, and behind the scenes the notebook is accessing the string representation so it can have something to show us. This is a good example of the main use case for string representations.

### Accessing the Layer Weights

Now that we have access to each of our layers, we can access the weights inside each layer. Let’s see this for our first convolutional layer.

```python
> network.conv1.weight
Parameter containing:
tensor([[[[ 0.0692,  0.1029, -0.1793,  0.0495,  0.0619],
          [ 0.1860,  0.0503, -0.1270, -0.1240, -0.0872],
          [-0.1924, -0.0684, -0.0028,  0.1031, -0.1053],
          [-0.0607,  0.1332,  0.0191,  0.1069, -0.0977],
          [ 0.0095, -0.1570,  0.1730,  0.0674, -0.1589]]],


        [[[-0.1392,  0.1141, -0.0658,  0.1015,  0.0060],
          [-0.0519,  0.0341,  0.1161,  0.1492, -0.0370],
          [ 0.1077,  0.1146,  0.0707,  0.0927,  0.0192],
          [-0.0656,  0.0929, -0.1735,  0.1019, -0.0546],
          [ 0.0647, -0.0521, -0.0687,  0.1053, -0.0613]]],


        [[[-0.1066, -0.0885,  0.1483, -0.0563,  0.0517],
          [ 0.0266,  0.0752, -0.1901, -0.0931, -0.0657],
          [ 0.0502, -0.0652,  0.0523, -0.0789, -0.0471],
          [-0.0800,  0.1297, -0.0205,  0.0450, -0.1029],
          [-0.1542,  0.1634, -0.0448,  0.0998, -0.1385]]],


        [[[-0.0943,  0.0256,  0.1632, -0.0361, -0.0557],
          [ 0.1083, -0.1647,  0.0846, -0.0163,  0.0068],
          [-0.1241,  0.1761,  0.1914,  0.1492,  0.1270],
          [ 0.1583,  0.0905,  0.1406,  0.1439,  0.1804],
          [-0.1651,  0.1374,  0.0018,  0.0846, -0.1203]]],


        [[[ 0.1786, -0.0800, -0.0995,  0.1690, -0.0529],
          [ 0.0685,  0.1399,  0.0270,  0.1684,  0.1544],
          [ 0.1581, -0.0099, -0.0796,  0.0823, -0.1598],
          [ 0.1534, -0.1373, -0.0740, -0.0897,  0.1325],
          [ 0.1487, -0.0583, -0.0900,  0.1606,  0.0140]]],


        [[[ 0.0919,  0.0575,  0.0830, -0.1042, -0.1347],
          [-0.1615,  0.0451,  0.1563, -0.0577, -0.1096],
          [-0.0667, -0.1979,  0.0458,  0.1971, -0.1380],
          [-0.1279,  0.1753, -0.1063,  0.1230, -0.0475],
          [-0.0608, -0.0046, -0.0043, -0.1543,  0.1919]]]], requires_grad=True)
```

The output is a tensor, but before we look at the tensor, let’s talk OOP for a moment. This is a good example that showcases how objects are nested. We first access the conv layer object that lives inside the network object.

```python
network.conv1.weight
```

Then, we access the weight tensor object that lives inside the conv layer object, so all of these objects are chained or linked together.

One thing to notice about the weight tensor output is that it says `Parameter containing:` at the top. This is because this particular tensor is a special tensor whose values, or scalar components, are learnable parameters of our network.

This means that the values inside this tensor, the ones we see above, are actually learned as the network is trained. As we train, these weight values are updated in such a way that the loss function is minimized.

#### PyTorch Parameter Class

To keep track of all the weight tensors inside the network, PyTorch has a special class called `Parameter`. The `Parameter` class extends the tensor class, and so the weight tensor inside every layer is an instance of this `Parameter` class. This is why we see the `Parameter containing:` text at the top of the string representation output.

We can see in the PyTorch source code that the `Parameter` class is overriding the `__repr__` method by prepending the text `Parameter containing:` to the regular tensor class representation output.

```python
def __repr__(self):
    return 'Parameter containing:\n' + super(Parameter, self).__repr__()
```

PyTorch’s `nn.Module` class is basically looking for any attributes whose values are instances of the `Parameter` class, and when it finds an instance of the `Parameter` class, it keeps track of it.
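To see this tracking behavior in isolation, here is a small sketch. The `Scaler` module and its `scale` and `offset` attributes are hypothetical; the point is only that an attribute wrapped in `nn.Parameter` is registered by the module, while a plain tensor attribute is not.

```python
import torch
import torch.nn as nn

class Scaler(nn.Module):
    """Hypothetical module with one Parameter and one plain tensor attribute."""

    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(3))  # tracked as a learnable parameter
        self.offset = torch.zeros(3)              # plain tensor, not tracked

    def forward(self, t):
        return t * self.scale + self.offset

scaler = Scaler()
names = [name for name, _ in scaler.named_parameters()]

print(names)                                    # ['scale']
print(isinstance(scaler.scale, torch.Tensor))   # True, Parameter extends the tensor class
print(scaler.scale.requires_grad)               # True
```

Only `scale` shows up in the module’s parameters, which is exactly the set of tensors that will be updated during training.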

All of these are fairly technical PyTorch details that go on behind the scenes, and we’ll see them come into play in a bit.

For our understanding now though, the important part is the interpretation of the shape of the weight tensors. This is where we’ll start to use the knowledge we learned about tensors early on in the series.

Let’s look at the shapes now, and then interpret them.

### Weight Tensor Shape

In the last post, we said that the parameter values we pass to our layers directly impact our network’s weights. This is where we will see this impact.

For the convolutional layers, the weight values live inside the filters, and in code, the filters are actually the weight tensors themselves.

The convolution operation inside a layer is an operation between the input channels to the layer and the filter inside the layer. This means that what we really have is an operation between two tensors.

With that being said, let’s interpret these weight tensors which will allow us to better understand the convolution operations inside our network.

Remember, the shape of a tensor really encodes all the information we need to know about the tensor.

For the first conv layer, we have `1` color channel that should be convolved by `6` filters of size `5x5` to produce `6` output channels. This is how we interpret the values inside our layer constructor.

```python
> network.conv1
Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
```

Inside our layer though, we don’t explicitly have `6` separate weight tensors, one for each of the `6` filters. We actually represent all `6` filters using a single weight tensor whose shape reflects or accounts for the `6` filters.

The shape of the weight tensor for the first convolutional layer shows us that we have a rank-4 weight tensor. The first axis has a length of `6`, and this accounts for the `6` filters.

```python
> network.conv1.weight.shape
torch.Size([6, 1, 5, 5])
```

The second axis has a length of `1`, which accounts for the single input channel, and the last two axes account for the height and width of the filter.

The way to think about this is as if we are packaging all of our filters into a single tensor.

Now, the second conv layer has `12` filters, and instead of convolving a single input channel, there are `6` input channels coming from the previous layer.

```python
> network.conv2.weight.shape
torch.Size([12, 6, 5, 5])
```

Think of this value of `6` here as giving each of the filters some depth. Instead of having a filter that convolves all of the channels iteratively, our filter has a depth that matches the number of channels.

The two main takeaways about these convolutional layers are that our filters are represented using a single tensor and that each filter inside the tensor also has a depth that accounts for the input channels that are being convolved.

- All filters are represented using a single tensor.
- Filters have depth that accounts for the input channels.

Our tensors are rank-4 tensors. The first axis represents the number of filters. The second axis represents the depth of each filter which corresponds to the number of input channels being convolved.

The last two axes represent the height and width of each filter. We can pull out any single filter by indexing into the weight tensor’s first axis.

This gives us a single filter that has a height and width of `5` and a depth of `6`.
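The indexing step described above can be sketched as follows. The layer is rebuilt here as a standalone `conv2` variable (an assumption, so that the snippet runs on its own) rather than accessed through `network`, and `first_filter` is a hypothetical name.

```python
import torch.nn as nn

# Same configuration as the second conv layer in our network.
conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)

print(conv2.weight.shape)   # torch.Size([12, 6, 5, 5])

# Indexing the first axis selects one of the 12 filters.
first_filter = conv2.weight[0]
print(first_filter.shape)   # torch.Size([6, 5, 5])
```

The remaining axes are the filter’s depth, height, and width, in that order.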

### Weight Matrix

With linear layers or fully connected layers, we have flattened rank-1 tensors as input and as output. The way we transform the in_features to the out_features in a linear layer is by using a rank-2 tensor that is commonly called a weight matrix.

This is due to the fact that the weight tensor is of rank-2 with height and width axes.

```python
> network.fc1.weight.shape
torch.Size([120, 192])

> network.fc2.weight.shape
torch.Size([60, 120])

> network.out.weight.shape
torch.Size([10, 60])
```

Here we can see that each of our linear layers has a rank-2 weight tensor. The pattern is that the height of the weight tensor equals the number of output features and the width equals the number of input features.
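To connect this shape pattern to what a linear layer computes, the sketch below compares the layer’s output with a manual computation that multiplies the weight matrix by the input and then adds the bias. The small layer sizes, the random input, and the names `fc`, `x`, and `manual` are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable

fc = nn.Linear(in_features=4, out_features=3)
x = torch.rand(4)     # a flattened rank-1 input with 4 features

print(fc.weight.shape)  # torch.Size([3, 4]), i.e. (out_features, in_features)

# Weight matrix times input vector, plus the bias vector.
manual = fc.weight.matmul(x) + fc.bias

print(torch.allclose(fc(x), manual, atol=1e-6))  # True
```

Because the weight matrix has `out_features` rows and `in_features` columns, multiplying it by an input of length `in_features` yields an output of length `out_features`.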

#### Matrix Multiplication

This fact is due to how matrix multiplication is performed. Let’s see this in action with a smaller example.

Suppose we have two rank-2 tensors. The first has a shape of `3x4` and the second has a shape of `4x1`. Now, since we are demonstrating something called matrix multiplication, we’ll note that both of these rank-2 tensors are indeed *matrices*.

For each row-column combination in the output, the value is obtained by taking the dot product of the corresponding row of the first matrix with the corresponding column of the second matrix.

Since the second matrix in our example only has `1` column, we use it all three times, but this idea generalizes.

The rule for this operation to work is that the number of columns in the first matrix must match the number of rows in the second matrix. If this rule holds, matrix multiplication operations like this can be performed.

The dot product means that we sum the products of corresponding components. In case you are wondering, both the dot product and matrix multiplication are linear algebra concepts.
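The row-by-column rule can be written out in a few lines of plain Python. The helper `matmul` and the matrices `first` and `second` are hypothetical; the numbers are illustrative values for a `3x4` and a `4x1` matrix, chosen so that the result agrees with the PyTorch output shown later in this post.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if len(a[0]) != inner:
        raise ValueError("columns of the first matrix must match rows of the second")
    # Each output entry is the dot product of a row of a with a column of b.
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]

first = [
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [3, 4, 5, 6],
]                                # shape 3x4
second = [[1], [2], [3], [4]]    # shape 4x1

product = matmul(first, second)
print(product)  # [[30], [40], [50]], shape 3x1
```

For example, the first output entry is `1*1 + 2*2 + 3*3 + 4*4 = 30`, which is the dot product of the first row with the only column.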

#### Linear Function Represented Using a Matrix

The important thing about matrix multiplications like this is that they represent linear functions that we can use to build up our neural network.

Specifically, the weight matrix is a linear function, also called a linear map, that maps a vector space of `4` dimensions to a vector space of `3` dimensions.

When we change the weight values inside the matrix, we are actually changing this function, and this is exactly what we want to do as we search for the function that our network is ultimately approximating.

Let’s see how to perform this same computation using PyTorch.

#### Using PyTorch for Matrix Multiplication

Here, we have the `in_features` and the `weight_matrix` as tensors, and we’re using the tensor method called `matmul()` to perform the operation. The name `matmul()`, as we now know, is short for matrix multiplication.

```python
> weight_matrix.matmul(in_features)
tensor([30., 40., 50.])
```
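The definitions of `in_features` and `weight_matrix` are not shown in this post. One set of values that reproduces the output above is sketched below; the specific numbers are an assumption, since many other matrices would give the same result.

```python
import torch

# Assumed values: a rank-1 input with 4 features and a 3x4 weight matrix.
in_features = torch.tensor([1, 2, 3, 4], dtype=torch.float32)

weight_matrix = torch.tensor([
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [3, 4, 5, 6],
], dtype=torch.float32)

result = weight_matrix.matmul(in_features)
print(result)        # tensor([30., 40., 50.])
print(result.shape)  # torch.Size([3])
```

The `4` input features are mapped to `3` output features, matching the `3x4` shape of the weight matrix.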

A looming question is, how can we access all of the parameters at once? There is an easy way. Let me just show you.

### Accessing the Network's Parameters

The first example is the most common way, and we’ll use this to iterate over our weights when we update them during the training process.

```python
for param in network.parameters():
    print(param.shape)

torch.Size([6, 1, 5, 5])
torch.Size([6])
torch.Size([12, 6, 5, 5])
torch.Size([12])
torch.Size([120, 192])
torch.Size([120])
torch.Size([60, 120])
torch.Size([60])
torch.Size([10, 60])
torch.Size([10])
```

The second way is just to show how we can see the name as well. This reveals something that we won’t cover in detail: the bias is also a learnable parameter. Each layer has a bias by default, so for each layer we have a weight tensor and a bias tensor.

```python
for name, param in network.named_parameters():
    print(name, '\t\t', param.shape)

conv1.weight     torch.Size([6, 1, 5, 5])
conv1.bias       torch.Size([6])
conv2.weight     torch.Size([12, 6, 5, 5])
conv2.bias       torch.Size([12])
fc1.weight       torch.Size([120, 192])
fc1.bias         torch.Size([120])
fc2.weight       torch.Size([60, 120])
fc2.bias         torch.Size([60])
out.weight       torch.Size([10, 60])
out.bias         torch.Size([10])
```
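Since `parameters()` gives us every weight and bias tensor, we can also use it to count how many learnable values the network has in total. This is a small sketch rather than something covered in the post. It redefines the `Network` class, without a forward method, so that it runs on its own, and the variable name `total` is hypothetical.

```python
import torch.nn as nn

class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=12, kernel_size=5)

        self.fc1 = nn.Linear(in_features=12*4*4, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=60)
        self.out = nn.Linear(in_features=60, out_features=10)

network = Network()

# numel() returns the number of scalar components in a tensor.
total = sum(param.numel() for param in network.parameters())
print(total)  # 32998
```

Most of these values live in `fc1.weight`, whose `120 x 192` shape alone accounts for `23040` of them.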

### Wrapping up

We should now have a good understanding of learnable parameters, where they live inside our network, and how to access the weight tensors using PyTorch.

In the next post, we'll see how to work with our layers by passing tensors to them. I'll see you there.