Train, Test, & Validation Sets explained

Datasets for deep learning

In this post, we'll discuss the different datasets used when training and testing a neural network.

To train and test our model, we should break our data down into three distinct datasets:

Training set
Validation set
Test set
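
As a rough illustration of breaking data into these three sets, here is a minimal sketch in plain Python. The function name `split_dataset`, the 70/15/15 proportions, and the fixed seed are all assumptions made for this example, not something prescribed by this post.

```python
import random

def split_dataset(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle samples, then split them into training, validation, and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy so the input is untouched

    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)

    test_set = shuffled[:n_test]
    val_set = shuffled[n_test:n_test + n_val]
    train_set = shuffled[n_test + n_val:]  # everything left over is used for training
    return train_set, val_set, test_set

# Hypothetical data: 100 sample IDs standing in for real samples.
samples = list(range(100))
train_set, val_set, test_set = split_dataset(samples)
print(len(train_set), len(val_set), len(test_set))
```

Because each sample lands in exactly one slice of the shuffled list, the three sets never overlap.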

Let's start by discussing the training set.

Training set

The training set is what it sounds like. It's the set of data used to train the model. With each epoch, our model is trained again on this same data in our training set, and it continues to learn about the features of this data.

The hope with this is that later we can deploy our model and have it accurately predict on new data that it's never seen before. It will be making these predictions based on what it's learned about the training data. Ok, now let's discuss the validation set.

Validation set

The validation set is a set of data, separate from the training set, that is used to validate our model during training. This validation process gives us information that can help us adjust our hyperparameters.

Recall that with each epoch during training, the model is trained on the data in the training set. During that same epoch, it is also validated on the data in the validation set.

We know from our previous posts on training that, during the training process, the model classifies each input in the training set. After this classification occurs, the loss is calculated, and the weights in the model are adjusted. Then, during the next epoch, the model classifies the same inputs again.

During training, the model will also be classifying each input from the validation set. It will be doing this classification based only on what it's learned about the data it's being trained on in the training set. The weights will not be updated in the model based on the loss calculated from our validation data.

Remember, the data in the validation set is separate from the data in the training set. So when the model is validating on this data, it is not seeing samples that it is already familiar with from training.
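
To make the difference between the two passes concrete, here is a minimal sketch using a tiny one-feature logistic model written in plain Python. The model, the toy data, the learning rate, and every function name here are assumptions for illustration only. The point is simply that the training pass adjusts the weights while the validation pass only measures the loss.

```python
import math
import random

def predict(weights, x):
    """Probability of class 1 for one input under a tiny logistic model."""
    z = weights[0] * x + weights[1]
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(p, y):
    """Binary cross-entropy for one sample, clipped for numerical safety."""
    p = min(max(p, 1e-12), 1 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def train_epoch(weights, train_set, lr=0.1):
    """One pass over the training set: the loss drives weight updates."""
    total = 0.0
    for x, y in train_set:
        p = predict(weights, x)
        total += log_loss(p, y)
        error = p - y  # gradient of the loss with respect to z
        weights[0] -= lr * error * x
        weights[1] -= lr * error
    return total / len(train_set)

def validate(weights, val_set):
    """One pass over the validation set: loss is measured, weights are untouched."""
    return sum(log_loss(predict(weights, x), y) for x, y in val_set) / len(val_set)

# Toy data (hypothetical): the label is 1 when x is positive.
rng = random.Random(0)
data = [(x, 1 if x > 0 else 0) for x in (rng.uniform(-1, 1) for _ in range(200))]
train_set, val_set = data[:160], data[160:]  # disjoint slices of the same data

weights = [0.0, 0.0]
history = []
for epoch in range(5):
    train_loss = train_epoch(weights, train_set)
    before = list(weights)
    val_loss = validate(weights, val_set)
    assert weights == before  # validation left the weights unchanged
    history.append((train_loss, val_loss))
    print(f"epoch {epoch + 1}: train loss {train_loss:.3f}, val loss {val_loss:.3f}")
```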

One of the major reasons we need a validation set is to ensure that our model is not overfitting to the data in the training set. We'll discuss overfitting and underfitting in detail at a later time. But the idea of overfitting is that our model becomes really good at being able to classify data in the training set, but it's unable to generalize and make accurate classifications on data that it wasn't trained on.

During training, if we're also validating the model on the validation set and see that the results it's giving for the validation data are just as good as the results it's giving for the training data, then we can be more confident that our model is not overfitting.

The validation set allows us to see how well the model is generalizing during training.

On the other hand, if the results on the training data are really good, but the results on the validation data are lagging behind, then our model is likely overfitting.
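
The comparison described above can be sketched as a simple per-epoch check. The accuracy values and the `max_gap` threshold of 0.10 below are made up for illustration. In practice, how large a gap counts as a warning sign depends on the problem.

```python
def looks_overfit(train_acc, val_acc, max_gap=0.10):
    """Flag an epoch whose training accuracy exceeds validation accuracy by more than max_gap."""
    return (train_acc - val_acc) > max_gap

# Hypothetical (training accuracy, validation accuracy) pairs, one per epoch.
accuracy_history = [(0.70, 0.68), (0.85, 0.80), (0.95, 0.81), (0.99, 0.79)]

flags = [looks_overfit(train_acc, val_acc) for train_acc, val_acc in accuracy_history]
for epoch, ((train_acc, val_acc), flag) in enumerate(zip(accuracy_history, flags), start=1):
    status = "possible overfitting" if flag else "ok"
    print(f"epoch {epoch}: train {train_acc:.2f}, val {val_acc:.2f} -> {status}")
```

In this made-up history, the first two epochs track each other closely, while the last two show training accuracy pulling away from validation accuracy.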

Test set

The test set is a set of data that is used to test the model after the model has already been trained. The test set is separate from both the training set and validation set.

After our model has been trained and validated using our training and validation sets, we will then use our model to predict the output of the unlabeled data in the test set.

One major difference between the test set and the two other sets is that the model is never given labels for the test set. The training set and validation set have to be labeled so that we can see the metrics given during training, like the loss and the accuracy from each epoch. If labels for the test data do exist, they are held back and used only afterward to score the model's predictions.

When the model is predicting on unlabeled data in our test set, this would be the same type of process that would be used if we were to deploy our model out into the field.
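
As a small sketch of what predicting on unlabeled data looks like, the snippet below passes inputs with no labels to an already-trained classifier. The classifier, its weight and bias values, and the test inputs are all hypothetical stand-ins for a real trained model and real data.

```python
def classify(weight, bias, x):
    """Hypothetical already-trained one-feature classifier: returns class 0 or 1."""
    return 1 if weight * x + bias > 0 else 0

# Assumed values standing in for parameters learned during training.
trained_weight, trained_bias = 2.0, -1.0

# Test inputs only: no labels are attached to these samples.
test_inputs = [0.1, 0.4, 0.6, 0.9]

predictions = [classify(trained_weight, trained_bias, x) for x in test_inputs]
print(predictions)
```

Nothing here updates the model. The same call pattern would apply to new data arriving after deployment.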

The test set provides a final check that the model is generalizing well before deploying the model to production.

For example, if we're using a model to classify data without knowing what the labels of the data are beforehand, or without the model ever having been shown the exact data it's going to be classifying, then of course we wouldn't be giving our model labeled data to do this.

The entire goal of having a model be able to classify is to do it without knowing what the data is beforehand.

The ultimate goal of machine learning and deep learning is to build models that are able to generalize well.

Deep learning datasets in summary

The table below summarizes deep learning datasets:

Deep Learning Datasets
Dataset	Updates Weights	Description
Training set	Yes	Used to train the model. The goal of training is to fit the model to the training set while still generalizing to unseen data.
Validation set	No	Used during training to check how well the model is generalizing.
Test set	No	Used to test the model's final ability to generalize before deploying to production.

Now hopefully we have an idea about how our data should be organized in terms of datasets and how each of these datasets is used.

The main reason for having three separate datasets is to ensure that the model is able to generalize by predicting accurately on unseen data. When the model is failing to generalize, we are usually in a situation of overfitting or underfitting. We'll look at these in the next post. I'll see ya there!

Deep Learning Fundamentals - Classic Edition