Machine Learning & Deep Learning Fundamentals

with deeplizard.

How a Neural Network Learns explained

November 22, 2017


Learning in artificial neural networks

In this post, we’ll investigate what it means for an artificial neural network to learn.

In a previous post, we learned about the training process and saw that each data point used for training is passed through the network. This pass through the network from input to output is called a forward pass, and the resulting output depends on the weights at each connection inside the network.

Once all of the data points in our dataset have been passed through the network, we say that an epoch is complete.

An epoch refers to a single pass of the entire dataset to the network during training.

Note that many epochs occur throughout the training process as the model learns.
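To make the epoch idea concrete, here is a minimal sketch in plain Python; the dataset, the one-weight forward function, and the epoch count are all illustrative stand-ins, not the network from this series:

```python
# Illustrative stand-ins: a tiny dataset and a one-weight "network".
dataset = [(0.0, 0), (0.5, 1), (1.0, 1)]  # (input, label) pairs
weight = 0.1                               # arbitrary initial value

def forward(x, w):
    # Forward pass: the output depends on the weight at the connection.
    return x * w

num_epochs = 3
outputs_per_epoch = []
for epoch in range(num_epochs):
    # One epoch: every data point is passed through the network once.
    outputs = [forward(x, weight) for x, _ in dataset]
    outputs_per_epoch.append(outputs)
```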

What does it mean to learn?

So what exactly does it mean for the model to learn?

Well, remember, when the model is initialized, the network weights are set to arbitrary values. We have also seen that, at the end of the network, the model will provide the output for a given input.

Once the output is obtained, the loss (or the error) can be computed for that specific output by looking at what the model predicted versus the true label. The loss computation depends on the chosen loss function, which we'll cover in more detail in a later post.

After the loss is calculated, the gradient of this loss function is computed with respect to each of the weights within the network. Note, gradient is just a word for the derivative of a function of several variables.
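As a quick sanity check on that idea, here is a sketch, with purely illustrative numbers rather than this series' network, that compares the analytic derivative of a toy squared-error loss with respect to a single weight against a numerical finite-difference estimate:

```python
x, y = 2.0, 1.0   # one training sample: input and true label (illustrative)
w = 0.3           # current value of our single chosen weight

def loss(w):
    prediction = w * x            # forward pass of a one-weight model
    return (prediction - y) ** 2  # squared error for this output

# d(loss)/dw by the chain rule: 2 * (w*x - y) * x
analytic_grad = 2 * (w * x - y) * x

# Numerical estimate of the same derivative (central difference)
eps = 1e-6
numeric_grad = (loss(w + eps) - loss(w - eps)) / (2 * eps)
```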

Continuing with this explanation, let’s focus on just one of the weights in the model.

At this point, we’ve calculated the loss of a single output, and we calculate the gradient of that loss with respect to our single chosen weight. This calculation is done using a technique called backpropagation, which is covered in full detail starting here.

Once we have the value for the gradient of the loss function, we can use this value to update the model’s weight. The gradient points in the direction of steepest increase of the loss, so our task is to step in the opposite direction, lowering the loss and moving closer to its minimum value.

Learning rate

We then multiply the gradient value by something called a learning rate. A learning rate is a small number usually ranging between 0.01 and 0.0001, but the actual value can vary.

The learning rate tells us how large of a step we should take in the direction of the minimum.

Just keep this in mind for now, and we’ll look more closely at learning rates in a future post.

Updating the weights

Alright, so we multiply the gradient by the learning rate, and we subtract this product from the weight, which gives us the new updated value for this weight.

new weight = old weight - (learning rate * gradient)
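Plugging illustrative numbers into that update rule:

```python
old_weight = 0.5      # current value of the weight (illustrative)
gradient = 0.2        # gradient of the loss w.r.t. this weight (illustrative)
learning_rate = 0.01  # the small step-size factor

# new weight = old weight - (learning rate * gradient)
new_weight = old_weight - learning_rate * gradient  # 0.5 - 0.002 = 0.498
```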

In this discussion, we just focused on one single weight to explain the concept, but this same process is going to happen with each of the weights in the model each time data passes through it.

The only difference is that when the gradient of the loss function is computed, the value for the gradient is going to be different for each weight because the gradient is being calculated with respect to each weight.

So now imagine all these weights being iteratively updated with each epoch. The weights incrementally move closer and closer to their optimized values while stochastic gradient descent (SGD) works to minimize the loss function.
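Here is a minimal sketch of that iterative process on a toy loss, (w - 3)**2, whose minimum sits at w = 3 and whose gradient is 2 * (w - 3); repeatedly applying the update rule drives the weight toward that minimum:

```python
w = 0.0              # arbitrary initial weight
learning_rate = 0.1

for step in range(100):
    gradient = 2 * (w - 3)            # derivative of the toy loss (w - 3)**2
    w = w - learning_rate * gradient  # the same update rule as above
# w is now very close to 3, the value that minimizes this loss
```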

The model is learning

This updating of the weights is essentially what we mean when we say that the model is learning. It’s learning what values to assign to each weight based on how those incremental changes are affecting the loss function. As the weights change, the network is getting smarter in terms of accurately mapping inputs to the correct output.

Alright, with this explanation along with our last post on training, we should now have a general idea about what it means to train a model and how the model learns through this training process.

Let’s look now at how this training is done with code in Keras.

Training in code with Keras

In order to train the model, the first thing required of us is to build the model.

Let's begin by importing the required classes:

import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

Next, we define our model:

model = Sequential([
    Dense(16, input_shape=(1,), activation='relu'),
    Dense(32, activation='relu'),
    Dense(2, activation='softmax')
])

Before we can train our model, we must compile it like so:

model.compile(
    optimizer=Adam(lr=0.0001),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

To the compile() function, we are passing the optimizer, the loss function, and the metrics that we would like to see. Notice that the optimizer we have specified is called Adam. Adam is just a variant of SGD. Inside the Adam constructor is where we specify the learning rate, and in this case Adam(lr=.0001), we have chosen 0.0001.

Finally, we fit our model to the data. Fitting the model to the data means to train the model on the data. We do this with the following code:

model.fit(
    scaled_train_samples,
    train_labels,
    batch_size=10,
    epochs=20,
    shuffle=True,
    verbose=2
)

scaled_train_samples is a numpy array consisting of the training samples.

train_labels is a numpy array consisting of the corresponding labels for the training samples.

batch_size=10 specifies how many training samples should be sent to the model at once.

epochs=20 means that the complete training set (all of the samples) will be passed to the model a total of 20 times.

shuffle=True indicates that the data should first be shuffled before being passed to the model.

verbose=2 indicates how much logging we will see as the model trains.
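One consequence of these arguments worth noting: batch_size and epochs together determine how many weight updates occur during training, since one update happens per batch. Assuming, purely for illustration, a training set of 2100 samples (the real count depends on your data):

```python
import math

num_samples = 2100  # assumed size of scaled_train_samples (illustrative)
batch_size = 10
epochs = 20

# One weight update happens per batch of samples, so:
batches_per_epoch = math.ceil(num_samples / batch_size)  # 210
total_weight_updates = batches_per_epoch * epochs        # 4200
```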

Running this code gives us the following output:

Epoch 1/20 0s - loss: 0.6400 - acc: 0.5576
Epoch 2/20 0s - loss: 0.6061 - acc: 0.6310
Epoch 3/20 0s - loss: 0.5748 - acc: 0.7010
Epoch 4/20 0s - loss: 0.5401 - acc: 0.7633
Epoch 5/20 0s - loss: 0.5050 - acc: 0.7990
Epoch 6/20 0s - loss: 0.4702 - acc: 0.8300
Epoch 7/20 0s - loss: 0.4366 - acc: 0.8495
Epoch 8/20 0s - loss: 0.4066 - acc: 0.8767
Epoch 9/20 0s - loss: 0.3808 - acc: 0.8814
Epoch 10/20 0s - loss: 0.3596 - acc: 0.8962
Epoch 11/20 0s - loss: 0.3420 - acc: 0.9043
Epoch 12/20 0s - loss: 0.3282 - acc: 0.9090
Epoch 13/20 0s - loss: 0.3170 - acc: 0.9129
Epoch 14/20 0s - loss: 0.3081 - acc: 0.9210
Epoch 15/20 0s - loss: 0.3014 - acc: 0.9190
Epoch 16/20 0s - loss: 0.2959 - acc: 0.9205
Epoch 17/20 0s - loss: 0.2916 - acc: 0.9238
Epoch 18/20 0s - loss: 0.2879 - acc: 0.9267
Epoch 19/20 0s - loss: 0.2848 - acc: 0.9252
Epoch 20/20 0s - loss: 0.2824 - acc: 0.9286

The output gives us the following values for each epoch:

1. Epoch number
2. Duration in seconds
3. Loss
4. Accuracy

What you will notice is that the loss is going down and the accuracy is going up as the epochs progress.

This is the general method for training models in Keras. I hope you now have a general understanding of the training process, how our models learn, and how this can be done in code with Keras. I'll see you in the next one!
