CNN Image Preparation Code Project - Learn to Extract, Transform, Load (ETL)

Extract, Transform, and Load (ETL) with PyTorch

Welcome back to this series on neural network programming with PyTorch. In this post, we will write our first code of part two of the series.

We'll demonstrate a very simple extract, transform and load pipeline using torchvision, PyTorch's computer vision package for machine learning. Without further ado, let's get started.

The project (Bird's-eye view)

There are four general steps that we'll be following as we move through this project:

Prepare the data
Build the model
Train the model
Analyze the model's results

The ETL process

In this post, we'll kick things off by preparing the data. To prepare our data, we'll be following what is loosely known as an ETL process.

Extract data from a data source.
Transform data into a desirable format.
Load data into a suitable structure.

The ETL process can be thought of as a fractal process because it can be applied on various scales. The process can be applied on a small scale, like a single program, or on a large scale, all the way up to the enterprise level where there are huge systems handling each of the individual parts.
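To make the small-scale end of that spectrum concrete, here is a minimal sketch of an ETL pipeline inside a single program. It uses only the Python standard library, and everything in it (the RAW_CSV string, the column names, and the extract, transform, and load function names) is a hypothetical stand-in chosen for illustration, not part of the project code.

```python
import csv
import io

# Hypothetical raw source: a small CSV string standing in for a file or URL.
RAW_CSV = "label,pixel_sum\n0,12\n1,34\n0,56\n"


def extract(raw):
    """Extract: read the rows out of the raw source."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows):
    """Transform: convert string fields into numeric (feature, label) pairs."""
    return [(float(row["pixel_sum"]), int(row["label"])) for row in rows]


def load(samples, batch_size):
    """Load: group the samples into batches that are easy to iterate over."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


batches = load(transform(extract(RAW_CSV)), batch_size=2)
print(batches)
```

Each step only depends on the output of the previous one, which is what lets the same pattern scale from three small functions up to separate systems.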

If you want to know more about the general data science pipeline, check out the data science post, where we cover this in greater detail.

Once we have completed the ETL process, we are ready to begin building and training our deep learning model. PyTorch has some built-in packages and classes that make the ETL process pretty easy.

PyTorch imports

We begin by importing all of the necessary PyTorch libraries.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import torchvision
import torchvision.transforms as transforms

This table describes each of these packages:

Package	Description
torch	The top-level PyTorch package and tensor library.
torch.nn	A subpackage that contains modules and extensible classes for building neural networks.
torch.optim	A subpackage that contains standard optimization operations like SGD and Adam.
torch.nn.functional	A functional interface that contains typical operations used for building neural networks like loss functions and convolutions.
torchvision	A package that provides access to popular datasets, model architectures, and image transformations for computer vision.
torchvision.transforms	An interface that contains common transforms for image processing.

Other imports

The next imports are standard packages used for data science in Python:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import confusion_matrix
#from plotcm import plot_confusion_matrix

import pdb

torch.set_printoptions(linewidth=120)

Note that pdb is the Python debugger, and the commented-out import is a local file for plotting the confusion matrix that we'll introduce in future posts. The last line sets the print options for PyTorch print statements.

We are ready now to prepare our data.

Preparing our data using PyTorch

Our ultimate goal when preparing our data is to do the following (ETL):

Extract – Get the Fashion-MNIST image data from the source.
Transform – Put our data into tensor form.
Load – Put our data into an object to make it easily accessible.

For these purposes, PyTorch provides us with two classes:

Class	Description
torch.utils.data.Dataset	An abstract class for representing a dataset.
torch.utils.data.DataLoader	Wraps a dataset and provides access to the underlying data.

An abstract class is a Python class that has methods we must implement, so we can create a custom dataset by creating a subclass that extends the functionality of the Dataset class.

To create a custom dataset using PyTorch, we extend the Dataset class by creating a subclass that implements its required methods. Our new subclass can then be passed to a PyTorch DataLoader object.

We will be using the Fashion-MNIST dataset that comes built in with the torchvision package, so we won't have to do this for our project. Just know that the Fashion-MNIST built-in dataset class is doing this behind the scenes.

Specifically, there are two methods that all subclasses of the Dataset class are required to implement: the __len__ method, which returns the size of the dataset, and the __getitem__ method, which gets the element at a specific integer index, where valid indices run from 0 (inclusive) to len(self) (exclusive).
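As a rough illustration of those two methods, here is a minimal sketch of a custom dataset. The SquaresDataset class and its contents are hypothetical and exist only for this example. The only PyTorch pieces used are torch.tensor and the torch.utils.data.Dataset base class.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i squared)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Number of samples in the dataset.
        return self.size

    def __getitem__(self, index):
        # Valid indices run from 0 to len(self) - 1.
        if not 0 <= index < self.size:
            raise IndexError(index)
        x = torch.tensor(float(index))
        return x, x ** 2


squares = SquaresDataset(5)
n = len(squares)   # calls __len__
x, y = squares[3]  # calls __getitem__
print(n, x, y)
```

Because the class implements both methods, an instance of it could be handed to a DataLoader in the same way as a built-in dataset.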

PyTorch torchvision package

The torchvision package gives us access to the following resources:

Datasets (like MNIST and Fashion-MNIST)
Models (like VGG16)
Transforms
Utils

Computer vision

All of these resources are related to deep learning computer vision tasks.

When we learned about the Fashion-MNIST dataset in our previous post, we saw that the arXiv paper that introduced the dataset indicated that the authors wanted it to be a drop-in replacement for the original MNIST dataset.

The idea was to make it so that frameworks like PyTorch could add Fashion-MNIST by just changing the URL for retrieving the data.

This is the case for PyTorch. The PyTorch FashionMNIST dataset simply extends the MNIST dataset and overrides the urls.

Here is the class definition from PyTorch's torchvision source code:

class FashionMNIST(MNIST):
    """`Fashion-MNIST <https://github.com/zalandoresearch/fashion-mnist>`_ Dataset.

    Args:
        root (string): Root directory of dataset where ``processed/training.pt``
            and  ``processed/test.pt`` exist.
        train (bool, optional): If True, creates dataset from ``training.pt``,
            otherwise from ``test.pt``.
        download (bool, optional): If true, downloads the dataset from the internet and
            puts it in root directory. If dataset is already downloaded, it is not
            downloaded again.
        transform (callable, optional): A function/transform that  takes in an PIL image
            and returns a transformed version. E.g, ``transforms.RandomCrop``
        target_transform (callable, optional): A function/transform that takes in the
            target and transforms it.
    """
    urls = [
        'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz',
        'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz',
        'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz',
        'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz',
    ]
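The pattern at work here is ordinary Python inheritance: the subclass reuses all of the parent's logic and overrides a single class attribute. The sketch below shows that idea in isolation. The BaseDataset and DropInDataset classes, the sources method, and the example.com URLs are all hypothetical and are not taken from torchvision.

```python
class BaseDataset:
    """Hypothetical stand-in for a base dataset class such as MNIST."""

    urls = [
        "https://example.com/base/train.gz",
        "https://example.com/base/test.gz",
    ]

    def sources(self):
        # `self.urls` is looked up on the instance's class first,
        # so a subclass that redefines `urls` changes what is returned here.
        return list(self.urls)


class DropInDataset(BaseDataset):
    """Reuses all of the base class logic and only swaps the data location."""

    urls = [
        "https://example.com/drop-in/train.gz",
        "https://example.com/drop-in/test.gz",
    ]


base_sources = BaseDataset().sources()
drop_in_sources = DropInDataset().sources()
print(base_sources)
print(drop_in_sources)
```

No method had to be rewritten in the subclass, which is the sense in which a dataset with the same file format can act as a drop-in replacement.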

Let's see now how we can take advantage of torchvision.

PyTorch Dataset class

To get an instance of the FashionMNIST dataset using torchvision, we just create one like so:

train_set = torchvision.datasets.FashionMNIST(
    root='./data'
    ,train=True
    ,download=True
    ,transform=transforms.Compose([
        transforms.ToTensor()
    ])
)

Note that the root argument used to be './data/FashionMNIST'. It has since changed to './data' due to torchvision updates.

We specify the following arguments:

Parameter	Description
root	The directory on disk where the data is stored.
train	Whether to create the dataset from the training set (True) or the test set (False).
download	Whether to download the data if it is not already present in the root directory.
transform	A composition of transformations to perform on the dataset elements.

Since we want our images to be transformed into tensors, we use the built-in transforms.ToTensor() transformation, and since this dataset is going to be used for training, we'll name the instance train_set.

When we run this code for the first time, the Fashion-MNIST dataset will be downloaded locally. Subsequent calls check for the data before downloading it. Thus, we don't have to worry about double downloads or repeated network calls.

PyTorch DataLoader class

To create a DataLoader wrapper for our training set, we do it like this:

train_loader = torch.utils.data.DataLoader(train_set
    ,batch_size=1000
    ,shuffle=True
)

We just pass train_set as an argument. Now, we can leverage the loader for tasks that would otherwise be pretty complicated to implement by hand, like batching, shuffling, and loading data in parallel. These are controlled by the following parameters:

batch_size (1000 in our case)
shuffle (True in our case)
num_workers (Default is 0, which means the data will be loaded in the main process)
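Here is a minimal sketch of what batch_size and shuffle do, using a tiny synthetic dataset so that it runs without any download. The toy_set and toy_loader names and the choice of 10 all-zero images are assumptions made for this example. TensorDataset is a built-in Dataset subclass from torch.utils.data that wraps tensors.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for an image dataset: 10 "images" of shape 1x28x28, labeled 0..9.
images = torch.zeros(10, 1, 28, 28)
labels = torch.arange(10)
toy_set = TensorDataset(images, labels)

# Same idea as train_loader above, with a smaller batch size.
toy_loader = DataLoader(toy_set, batch_size=4, shuffle=True)

batch_shapes = []
seen_labels = []
for batch_images, batch_labels in toy_loader:
    batch_shapes.append(tuple(batch_images.shape))
    seen_labels.extend(batch_labels.tolist())

print(len(toy_loader))  # number of batches per pass over the data
print(batch_shapes)     # the last batch is smaller when 10 is not divisible by 4
print(seen_labels)      # every label appears once, in shuffled order
```

With 10 samples and a batch size of 4, the loader yields three batches, the last one holding the 2 leftover samples. Shuffling changes the order in which samples are drawn on each pass, but every sample is still visited exactly once.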

ETL summary

From an ETL perspective, we achieved the extract and the transform steps using torchvision when we created the dataset, and the load step when we wrapped the dataset in a data loader:

Extract – The raw data was extracted from the web.
Transform – The raw image data was transformed into a tensor.
Load – The train_set was wrapped by (loaded into) the data loader, giving us access to the underlying data.

Now, we should have a good understanding of the torchvision module that is provided by PyTorch, and how we can use Datasets and DataLoaders in the PyTorch torch.utils.data package to streamline ETL tasks.

In the next post, we'll see how we can work with datasets and data loaders to access and view individual samples as well as batches of samples.

I'll see you in the next one!
