Keras with TensorFlow - Data Processing for Neural Network Training
Data Processing for Neural Network Training
In this episode, we'll demonstrate how to process numerical data that we'll later use to train our very first artificial neural network.
Samples and labels
To train any neural network in a supervised learning task, we first need a data set of samples and the corresponding labels for those samples.
When we refer to samples, we're referring to the underlying data set, where each individual item or data point within that set is called a sample. A label is the target value that corresponds to a given sample.
If we were to train a model to do sentiment analysis on headlines from a media source, for example, the corresponding label for each sample headline from the media source could be “positive” or “negative.”
If we were training a model on images of cats and dogs, then the label for each of the images would either be “cat” or “dog.”
Note that in deep learning, samples are also commonly referred to as input data or inputs, and labels are also commonly referred to as target data or targets.
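For instance, a tiny sentiment analysis data set might pair each sample headline with its label like this (the headlines and labels below are purely illustrative):
samples = [
    'Local team wins championship',    # made-up headline (sample)
    'Storm causes widespread damage',  # made-up headline (sample)
]
labels = ['positive', 'negative']      # corresponding label for each sample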
Expected data format
When preparing data, we first need to understand the format that the data need to be in for the end goal we have in mind. In our case, we want our data to be in a format that we can pass to a neural network model.
The first model we'll build in an upcoming episode will be a Sequential model from the Keras API integrated within TensorFlow. We'll discuss the details of this type of model in that future episode, but for now, we just need to understand the type of data that is expected by a Sequential model.
The Sequential model receives data during training, which occurs when we call the fit() function on the model. Therefore, we need to check the type of data this function expects.
Per the documentation of the fit() function, the input data x needs to be one of the following data types.
- A Numpy array (or array-like), or a list of arrays (in case the model has multiple inputs).
- A TensorFlow tensor, or a list of tensors (in case the model has multiple inputs).
- A dict mapping input names to the corresponding array/tensors, if the model has named inputs.
- A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
- A generator or keras.utils.Sequence returning (inputs, targets) or (inputs, targets, sample_weights).
So, when we aggregate our data, we need to ensure that it is contained in one of the above types of data structures. The corresponding labels y for the data are expected to be formatted similarly.
Like the input data x, the corresponding label data y can also be either a Numpy array(s) or TensorFlow tensor(s). Note, y should be consistent with x. We cannot have Numpy samples and tensor labels, or vice-versa.
Note that if x is a dataset, generator, or keras.utils.Sequence instance, y should not be specified (since labels will be obtained from x). We'll see examples of this later in the course.
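As a quick illustration, the snippet below sketches two of these accepted formats: plain NumPy arrays passed as x and y, and a tf.data dataset that yields (inputs, targets) tuples, in which case y is not passed separately. The array values here are made up purely for demonstration, and the fit() calls are shown commented out since no model has been built yet.
import numpy as np
import tensorflow as tf

# Made-up example data: five samples with one feature each, and five labels
x = np.array([[0.1], [0.4], [0.5], [0.8], [0.9]])
y = np.array([0, 0, 1, 1, 1])

# Option 1: pass NumPy arrays directly
# model.fit(x, y, batch_size=2, epochs=1)

# Option 2: wrap the same data in a tf.data dataset of (inputs, targets) tuples
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(2)
# model.fit(dataset, epochs=1)  # no separate y argument when x is a dataset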
Aside from formatting the data to meet the format required by the model, another reason to process the data is to transform it in a way that may make it easier, faster, or more efficient for the network to learn from. We can do this through data normalization or standardization techniques.
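For example, one common standardization technique (distinct from the min-max normalization we'll use below) is to rescale the data to have zero mean and unit variance. Here is a minimal sketch of that idea using made-up ages:
import numpy as np

ages = np.array([13, 25, 47, 65, 82, 100], dtype=float)  # made-up example ages
standardized = (ages - ages.mean()) / ages.std()          # zero mean, unit variance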
Process data in code
Data processing for deep learning will vary greatly depending on the type of data we're working with and the type of task we'll be using the network for.
We'll start out with a very simple classification task using a simple numerical data set. Later in the course, we'll work with other types of data and other tasks.
We first need to import the libraries we'll be working with.
import numpy as np
from random import randint
from sklearn.utils import shuffle
from sklearn.preprocessing import MinMaxScaler
Next, we create two empty lists. One will hold the input data, the other will hold the target data.
train_labels = []
train_samples = []
Data creation
For this simple task, we'll be creating our own example data set.
As motivation for this data, let's suppose that an experimental drug was tested on individuals ranging from age 13 to 100 in a clinical trial. The trial had 2100 participants. Half of the participants were under 65 years old, and the other half were 65 years of age or older.
The trial showed that around 95% of patients 65 or older experienced side effects from the drug, and around 95% of patients under 65 experienced no side effects, generally showing that elderly individuals were more likely to experience side effects.
Ultimately, we want to build a model that tells us whether or not a patient will experience side effects based solely on the patient's age. The model's judgment will be based on what it learns from the training data.
Note that given the simplicity of this data and the conclusions drawn from it, a neural network may be overkill, but this is just a first introduction to working with data for deep learning. Later, we'll make use of more advanced data sets.
The block of code below shows how to generate this dummy data.
for i in range(50):
    # The ~5% of younger individuals who did experience side effects
    random_younger = randint(13,64)
    train_samples.append(random_younger)
    train_labels.append(1)

    # The ~5% of older individuals who did not experience side effects
    random_older = randint(65,100)
    train_samples.append(random_older)
    train_labels.append(0)

for i in range(1000):
    # The ~95% of younger individuals who did not experience side effects
    random_younger = randint(13,64)
    train_samples.append(random_younger)
    train_labels.append(0)

    # The ~95% of older individuals who did experience side effects
    random_older = randint(65,100)
    train_samples.append(random_older)
    train_labels.append(1)
This code creates 2100 samples, storing the age of each individual in the train_samples list and whether or not that individual experienced side effects in the train_labels list.
This is what the train_samples data looks like.
for i in train_samples:
print(i)
49
94
31
83
13
...
This is just ages ranging anywhere from 13 to 100 years old.
This is what the train_labels look like.
for i in train_labels:
print(i)
0
1
0
1
0
...
A 0 indicates that an individual did not experience a side effect, and a 1 indicates that an individual did experience a side effect.
Data processing
We now convert both lists into numpy arrays, since this is one of the formats the fit() function expects, and we then shuffle the arrays to remove any order that was imposed on the data during the creation process.
train_labels = np.array(train_labels)
train_samples = np.array(train_samples)
train_labels, train_samples = shuffle(train_labels, train_samples)
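Note that sklearn.utils.shuffle shuffles both arrays in unison, so each age remains paired with its original label. As an optional sanity check, we could print the first few shuffled pairs:
# Optional check: ages and labels are still paired after shuffling
for age, label in zip(train_samples[:5], train_labels[:5]):
    print(age, label)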
In this form, we now have the ability to pass the data to the model because it is now in the required format. Before doing that, however, we'll first scale the data down to a range from 0 to 1.
We'll use scikit-learn's MinMaxScaler class to scale the data down from its original range of 13 to 100 to a range from 0 to 1.
scaler = MinMaxScaler(feature_range=(0,1))
scaled_train_samples = scaler.fit_transform(train_samples.reshape(-1,1))
We reshape the data as a technical requirement, since the fit_transform() function doesn't accept 1D data by default.
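To make the reshape step concrete, here's a small sketch showing how reshape(-1,1) turns the 1D array of ages into a 2D array with one column, which is the shape scikit-learn expects (samples as rows, features as columns). The printed shapes assume the 2100-sample data set created above.
print(train_samples.shape)                # (2100,)  - 1D array of ages
print(train_samples.reshape(-1,1).shape)  # (2100, 1) - 2100 rows, 1 feature column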
To further understand why we would want to do this step of scaling down the data in this way, check out the first half of the episode on batch normalization from the Deep Learning Fundamentals course where we discuss standardization and normalization techniques.
Now that the data has been scaled, let's iterate over the scaled data to see what it looks like now.
for i in scaled_train_samples:
print(i)
[0.47126437]
[0.60919540]
[0.06896552]
[0.90804598]
[0.35632184]
...
As expected, all of the data has been transformed to numbers between 0 and 1.
At this point, we've generated some sample raw data, put it into the numpy format that our model will require, and rescaled it to a scale ranging from 0 to 1.
In an upcoming episode, we'll use this data to train a neural network and see what kind of results we can get.
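As a preview, and purely as a sketch (the actual model architecture will be covered in the upcoming episode), training on this processed data will look roughly like the following. The layer sizes and training parameters shown here are placeholder assumptions, not the final model.
from tensorflow import keras
from tensorflow.keras.layers import Dense

# Placeholder architecture for illustration only; the real model is built in a later episode
model = keras.Sequential([
    Dense(units=16, input_shape=(1,), activation='relu'),
    Dense(units=2, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x=scaled_train_samples, y=train_labels, batch_size=10, epochs=5, verbose=2)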