Keras with TensorFlow - Data Processing for Neural Network Training
Data Processing for Neural Network Training
In this episode, we'll demonstrate how to process numerical data that we'll later use to train our very first artificial neural network.
Samples and labels
To train any neural network in a supervised learning task, we first need a data set of samples and the corresponding labels for those samples.
When we refer to samples, we're referring to the underlying data set, where each individual item or data point within that set is called a sample. A label is the target value that corresponds to a given sample.
If we were to train a model to do sentiment analysis on headlines from a media source, for example, the corresponding label for each sample headline from the media source could be “positive” or “negative.”
If we were training a model on images of cats and dogs, then the label for each of the images would either be “cat” or “dog.”
Note that in deep learning, samples are also commonly referred to as input data or inputs, and labels are also commonly referred to as target data or targets.
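For instance, a tiny sentiment analysis data set might pair each sample headline with its label like this (the headlines and labels below are purely illustrative):
samples = [
    'Local team wins championship',    # made-up headline (sample)
    'Storm causes widespread damage',  # made-up headline (sample)
]
labels = ['positive', 'negative']      # corresponding label for each sample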
Expected data format
When preparing data, we first need to understand the format that the data need to be in for the end goal we have in mind. In our case, we want our data to be in a format that we can pass to a neural network model.
The first model we'll build in an upcoming episode will be a Sequential model from the Keras API integrated within TensorFlow. We'll discuss the details of this type of model in that future episode, but for now, we just need to understand the type of data that is expected by a Sequential model.
The Sequential model receives data during training, which occurs when we call the fit() function on the model. Therefore, we need to check the type of data this function expects.
Per the documentation of the fit() function, the input data x needs to be one of the following data types.
- A Numpy array (or array-like), or a list of arrays (in case the model has multiple inputs).
- A TensorFlow tensor, or a list of tensors (in case the model has multiple inputs).
- A dict mapping input names to the corresponding array/tensors, if the model has named inputs.
- A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
- A generator or keras.utils.Sequence returning (inputs, targets) or (inputs, targets, sample_weights).
So, when we aggregate our data, we need to ensure that it is contained in one of the above types of data structures. The corresponding labels y for the data are expected to be formatted similarly.
Like the input data x, the corresponding label data y can also be either a Numpy array(s) or TensorFlow tensor(s). Note, y should be consistent with x. We cannot have Numpy samples and tensor labels, or vice-versa.
Note that if x is a dataset, generator, or keras.utils.Sequence instance, y should not be specified (since labels will be obtained from x). We'll see examples of this later in the course.
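As a quick illustration, the snippet below sketches two of these accepted formats: plain NumPy arrays passed as x and y, and a tf.data dataset that yields (inputs, targets) tuples, in which case y is not passed separately. The array values here are made up purely for demonstration, and the fit() calls are shown commented out since no model has been built yet.
import numpy as np
import tensorflow as tf

# Made-up example data: five samples with one feature each, and five labels
x = np.array([[0.1], [0.4], [0.5], [0.8], [0.9]])
y = np.array([0, 0, 1, 1, 1])

# Option 1: pass NumPy arrays directly
# model.fit(x, y, batch_size=2, epochs=1)

# Option 2: wrap the same data in a tf.data dataset of (inputs, targets) tuples
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(2)
# model.fit(dataset, epochs=1)  # no separate y argument when x is a dataset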
Aside from formatting the data to meet the format required by the model, another reason to process the data is to transform it in a way that may make it easier, faster, or more efficient for the network to learn from. We can do this through data normalization or standardization techniques.
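For example, one common standardization technique (distinct from the min-max normalization we'll use below) is to rescale the data to have zero mean and unit variance. Here is a minimal sketch of that idea using made-up ages:
import numpy as np

ages = np.array([13, 25, 47, 65, 82, 100], dtype=float)  # made-up example ages
standardized = (ages - ages.mean()) / ages.std()          # zero mean, unit variance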
Process data in code
Data processing for deep learning will vary greatly depending on the type of data we're working with and the type of task we'll be using the network for.
We'll start out with a very simple classification task using a simple numerical data set. Later in the course, we'll work with other types of data and other tasks.
We first need to import the libraries we'll be working with.
import numpy as np
from random import randint
from sklearn.utils import shuffle
from sklearn.preprocessing import MinMaxScaler
Next, we create two empty lists. One will hold the input data, the other will hold the target data.
train_labels = []
train_samples = []
Data creation
For this simple task, we'll be creating our own example data set.
As motivation for this data, let's suppose that an experimental drug was tested on individuals ranging from age 13 to 100 in a clinical trial. The trial had 2100 participants. Half of the participants were under 65 years old, and the other half were 65 years of age or older.
The trial showed that around 95% of patients 65 or older experienced side effects from the drug, and around 95% of patients under 65 experienced no side effects, generally showing that elderly individuals were more likely to experience side effects.
Ultimately, we want to build a model that tells us whether or not a patient will experience side effects based solely on the patient's age. The model's judgment will be based on what it learns from the training data.
Note that given the simplicity of this data and the conclusions drawn from it, a neural network may be overkill, but this is just a first introduction to working with data for deep learning. Later, we'll make use of more advanced data sets.
The block of code below shows how to generate this dummy data.
for i in range(50):
    # The ~5% of younger individuals who did experience side effects
    random_younger = randint(13,64)
    train_samples.append(random_younger)
    train_labels.append(1)

    # The ~5% of older individuals who did not experience side effects
    random_older = randint(65,100)
    train_samples.append(random_older)
    train_labels.append(0)

for i in range(1000):
    # The ~95% of younger individuals who did not experience side effects
    random_younger = randint(13,64)
    train_samples.append(random_younger)
    train_labels.append(0)

    # The ~95% of older individuals who did experience side effects
    random_older = randint(65,100)
    train_samples.append(random_older)
    train_labels.append(1)
This code creates 2100 samples, storing the age of each individual in the train_samples list and whether or not that individual experienced side effects in the train_labels list.
This is what the train_samples data looks like.
for i in train_samples:
print(i)
49
94
31
83
13
...
This is just ages ranging anywhere from 13 to 100 years old.
This is what the train_labels look like.
for i in train_labels:
print(i)
0
1
0
1
0
...
A 0 indicates that an individual did not experience a side effect, and a 1 indicates that an individual did experience a side effect.
Data processing
We now convert both lists into numpy arrays, since this is one of the formats the fit() function expects, and we then shuffle the arrays to remove any order that was imposed on the data during the creation process.
train_labels = np.array(train_labels)
train_samples = np.array(train_samples)
train_labels, train_samples = shuffle(train_labels, train_samples)
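Note that sklearn.utils.shuffle shuffles both arrays in unison, so each age remains paired with its original label. As an optional sanity check, we could print the first few shuffled pairs:
# Optional check: ages and labels are still paired after shuffling
for age, label in zip(train_samples[:5], train_labels[:5]):
    print(age, label)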
In this form, we now have the ability to pass the data to the model because it is now in the required format. Before doing that, however, we'll first scale the data down to a range from 0 to 1.
We'll use scikit-learn's MinMaxScaler class to scale the data down from its original range of 13 to 100 to a range from 0 to 1.
scaler = MinMaxScaler(feature_range=(0,1))
scaled_train_samples = scaler.fit_transform(train_samples.reshape(-1,1))
We reshape the data as a technical requirement, since the fit_transform() function doesn't accept 1D data by default.
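To make the reshape step concrete, here's a small sketch showing how reshape(-1,1) turns the 1D array of ages into a 2D array with one column, which is the shape scikit-learn expects (samples as rows, features as columns). The printed shapes assume the 2100-sample data set created above.
print(train_samples.shape)                # (2100,)  - 1D array of ages
print(train_samples.reshape(-1,1).shape)  # (2100, 1) - 2100 rows, 1 feature column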
To further understand why we would want to do this step of scaling down the data in this way, check out the first half of the episode on batch normalization from the Deep Learning Fundamentals course where we discuss standardization and normalization techniques.
Now that the data has been scaled, let's iterate over the scaled data to see what it looks like now.
for i in scaled_train_samples:
print(i)
[0.47126437]
[0.60919540]
[0.06896552]
[0.90804598]
[0.35632184]
...
As expected, all of the data has been transformed to numbers between 0 and 1.
At this point, we've generated some sample raw data, put it into the numpy format that our model will require, and rescaled it to a scale ranging from 0 to 1.
In an upcoming episode, we'll use this data to train a neural network and see what kind of results we can get.
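As a preview, and purely as a sketch (the actual model architecture will be covered in the upcoming episode), training on this processed data will look roughly like the following. The layer sizes and training parameters shown here are placeholder assumptions, not the final model.
from tensorflow import keras
from tensorflow.keras.layers import Dense

# Placeholder architecture for illustration only; the real model is built in a later episode
model = keras.Sequential([
    Dense(units=16, input_shape=(1,), activation='relu'),
    Dense(units=2, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x=scaled_train_samples, y=train_labels, batch_size=10, epochs=5, verbose=2)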