This lesson is being piloted (Beta version)

Deep Learning

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • What is Deep Learning?

  • What is a Neural Network?

  • Why is Deep Learning often computed with accelerators (GPUs)?

Objectives
  • Learn the basics of Deep Learning and Neural Networks

  • Learn how to use GPUs to train neural networks on an HPC cluster

Deep Learning is one of the most successful techniques in Machine Learning, with applications across a wide variety of problems, from classification to regression. Its ability to account for complexity is remarkable.

Deep Learning is one of the most active areas in Machine Learning today. However, neural networks have been around for decades, so the question is: why now? As we will learn in this workshop, the hype created around Deep Learning in the last decade is due to the convergence of a set of elements that propelled Deep Learning from an obscure topic in Machine Learning to the driving force of many of our activities in daily life. We can summarize those elements as:

Every time you use your phone, smart TV, or computer, neural networks are working for you. They have become almost as ubiquitous and invisible as electricity became at the beginning of the 20th century. In a way, Machine Learning and Deep Learning are the fuel of a new industrial revolution, or at least have the potential to become one.

Deep Learning, Machine Learning, and Artificial Intelligence

Deep Learning is part of Machine Learning, a set of techniques that build models from examples rather than from a hardcoded algorithm. Machine Learning is itself an area of Artificial Intelligence, the branch of computer science dedicated to studying how computers can perform tasks usually considered intellectual.

Learning from data, a scientific perspective

The idea of learning from data is not foreign to scientists. The objective of science is to create models that explain phenomena and provide predictions that can be confirmed or rejected by experiments or observations.

Scientists create models that give us not only insight into nature but also equations that allow us to make predictions. In most cases, clean equations are simply not possible and we have to use numerical approximations, but we try to keep the understanding. Machine Learning is used in cases where mathematical models are not known, numerical approximations are not feasible, and we are satisfied with the answers even if we lose the ability to understand why the parameters of Machine Learning models work the way they do.

In summary, we need 3 conditions for using Machine Learning on a problem:

Kepler, an example of learning from data

The two shores of Deep Learning

There are two ways to approach Deep Learning: the biological side and the mathematical side.

An Artificial Neural Network (ANN) is a computational model that is loosely inspired by the way biological neural networks in the animal brain process information.

From the other side, ANNs can also be considered a generalization of the Perceptron idea: a mathematical model for producing functions capable of getting close to a target function via an iterative process.

Both shores serve as good analogies if we are careful not to extrapolate the original ideas beyond their regime of validity. Deep Learning does not pretend to be a model of the brain, and the complexity of Deep Neural Networks is far beyond what perceptrons were capable of.

Let’s explore these two approaches for a moment.

Biological Neural Networks

On one side is the idea of simulating synapses in biological neural networks, using the knowledge about activation barriers and multiple connectivities as inspiration to create an Artificial Neural Network. The basic computational unit of the brain is a neuron. Approximately 86 billion neurons can be found in the human nervous system, and they are connected by approximately 10¹⁴ to 10¹⁵ synapses.

Biological to Artificial Neural Networks

The idea with ANN is that synaptic strengths (the weights w in our mathematical model) are learnable and control the strength of influence and its direction: excitatory (positive weight) or inhibitory (negative weight) of one neuron on another. If the final sum of the different connections is above a certain threshold, the neuron can fire, sending a spike along its axon, which is the output of the network under the provided input.

The Perceptron

The other origin is the idea of the Perceptron in Machine Learning.

The Perceptron is a binary linear classifier used in supervised learning. As a binary classifier, it decides whether or not an input, represented by a vector of numbers, belongs to some specific class.
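To make the idea concrete, here is a minimal sketch of a perceptron trained with the classic perceptron update rule. The toy dataset (the logical AND of two inputs), the function names, and the learning rate are assumptions chosen for illustration; they are not part of the lesson.

```python
import numpy as np

def perceptron_predict(w, b, x):
    """Return 1 if the weighted sum plus bias is positive, else 0."""
    return int(np.dot(w, x) + b > 0)

def perceptron_train(X, y, epochs=20, lr=1.0):
    """Perceptron rule: weights change only when a sample is misclassified."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, ti in zip(X, y):
            error = ti - perceptron_predict(w, b, xi)  # -1, 0, or +1
            w += lr * error * xi
            b += lr * error
    return w, b

# Hypothetical, linearly separable toy data: logical AND.
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])

w_and, b_and = perceptron_train(X_and, y_and)
and_predictions = [perceptron_predict(w_and, b_and, xi) for xi in X_and]
print(and_predictions)
```

Because the AND data can be separated by a straight line, the update rule settles on weights that classify all four points correctly. The same procedure would not converge on data that is not linearly separable.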

Complexity of Neural Networks

Complexity

When the neural network has no hidden layer, it is just a linear classifier or Perceptron. When hidden layers are added, the network can account for non-linearity. A network with multiple hidden layers is what is called a Deep Neural Network.
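A small illustration of why a hidden layer matters: XOR cannot be separated by a single line, but one hidden layer with two units is enough. The sketch below uses hand-picked weights (an assumption for illustration, not trained values and not taken from the lesson).

```python
import numpy as np

def step(z):
    """Threshold activation: 1.0 where z > 0, else 0.0."""
    return (z > 0).astype(float)

# Hand-picked weights, for illustration only.
W_hidden = np.array([[1.0, 1.0],    # hidden unit 1 behaves like OR
                     [1.0, 1.0]])   # hidden unit 2 behaves like AND
b_hidden = np.array([-0.5, -1.5])
w_out = np.array([1.0, -1.0])       # "OR and not AND" is XOR
b_out = -0.5

def xor_net(x):
    """One hidden layer followed by a thresholded output unit."""
    hidden = step(W_hidden @ x + b_hidden)
    return int(w_out @ hidden + b_out > 0)

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_outputs = [xor_net(np.array(p, dtype=float)) for p in points]
print(xor_outputs)
```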

In practice how complex it can be

A Deep Learning model can include:

For example, the input can be an image with thousands of pixels and 3 colors for each pixel. The network can have hundreds of hidden layers, and the output could also be an image. That complexity is responsible for the computational cost of running such networks.

Non-linear transformations with Deep Networks

Consider this illustrative example from this blog: we have two curves and, as they are, there is no way of drawing a straight line that separates them. There is, however, a curve capable of dividing the space into two regions such that each curve lives in its own region. Neural networks approach the problem by transforming the space in a non-linear way, so that the two curves become easily separable with a simple line.


Dynamic visualization of the transformations

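In its basic form, a neural network consists of the application of affine transformations (scalings, skewings, rotations, and translations) followed by the pointwise application of a non-linear function:

\[\vec{y} = f(W \vec{x} + \vec{b})\]

where \(W\) is the matrix of weights, \(\vec{b}\) is the translation (bias), and \(f\) is the non-linear function.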


Basic Neural Network Architecture

Basic Architecture

Neural Network Zoo

Since neural networks are one of the most active research fields in machine learning, a large number of modifications have been proposed. The figure below summarizes the different node structures and, from those, shows how the different networks and their acronyms relate to each other. Here is a quick overview of a few of them.

1) Feed forward neural networks (FF or FFNN) and perceptrons (P). They feed information from the front to the back (input and output, respectively). Neural networks are often described as having layers, where each layer consists of either input, hidden, or output cells in parallel. Cells within a single layer are never connected to each other, and in general two adjacent layers are fully connected (every neuron in one layer is connected to every neuron in the next layer). One usually trains FFNNs through back-propagation, giving the network paired datasets of “what goes in” and “what we want to have coming out”. Given that the network has enough hidden neurons, it can theoretically always model the relationship between the input and output. In practice their use is a lot more limited, but they are popularly combined with other networks to form new networks.

2) Radial basis functions. This network is simpler than the normal FFNN, as the activation function is a radial function. Each RBFN neuron stores a “prototype”, which is just one of the examples from the training set. When we want to classify a new input, each neuron computes the Euclidean distance between the input and its prototype. Roughly speaking, if the input more closely resembles the class A prototypes than the class B prototypes, it is classified as class A.

3) Recurrent Neural Networks (RNN). These networks are designed to take a series of inputs with no predetermined limit on size. They are networks with loops in them, allowing information to persist.

4) Long/short term memory (LSTM) networks. This is a special kind of RNN where each neuron has a memory cell and three gates: input, output, and forget. The idea of each gate is to allow or stop the flow of information through it.

5) Gated recurrent units (GRU) are a slight variation on LSTMs. They have one less gate and are wired slightly differently: instead of an input, output, and forget gate, they have an update gate and a reset gate.

6) Convolutional Neural Networks (CNN or ConvNet, also called shift invariant or space invariant networks) are very similar to FFNNs: they are made up of neurons that have learnable weights and biases. In a convolutional neural network the unit connectivity pattern is inspired by the organization of the visual cortex. Units respond to stimuli in a restricted region of space known as the receptive field. Receptive fields partially overlap, covering the entire visual field. The unit response can be approximated mathematically by a convolution operation. They are variations of multilayer perceptrons that use minimal preprocessing. They are widely applied in image and video recognition, recommender systems, and natural language processing. CNNs require large amounts of data to train on.

From http://www.asimovinstitute.org/neural-network-zoo

Neural Network Zoo

Activation Function

Each internal neuron receives input from several other neurons, computes the aggregate, and propagates the result based on the activation function.

Neural Network

Activation Function

Neurons apply activation functions to these summed inputs.

Activation functions are typically non-linear.
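As an illustration, three commonly used non-linear activation functions can be written in a few lines. This is a sketch with NumPy; the choice of these three functions (sigmoid, hyperbolic tangent, and ReLU) and the sample inputs are assumptions, since the lesson does not list specific functions here.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear unit: pass positive values, zero out the rest."""
    return np.maximum(0.0, z)

# Hypothetical summed inputs arriving at three neurons.
z = np.array([-2.0, 0.0, 2.0])

print("sigmoid:", sigmoid(z))
print("tanh:   ", np.tanh(z))
print("relu:   ", relu(z))
```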

Inference or Forward Propagation

Forward propagation proceeds in three stages: receiving the input, computing the hidden layer, and computing the output.

Receiving Input

Hidden Layer

Output Layer

In terms of Linear Algebra:

\[\begin{bmatrix} 1 & -2 & 2\\ 2 & 1 & -4\\ 1 & -1 & 0\\ \end{bmatrix} \cdot \begin{bmatrix} 0.5\\ 0.9\\ -0.3\\ \end{bmatrix} = \begin{bmatrix} -1.9\\ 3.1\\ -0.4\\ \end{bmatrix}\]

The fact that we can describe the problem in terms of Linear Algebra is one of the reasons why Neural Networks are so efficient on GPUs. The same operation as a single execution line looks like this:
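With NumPy, for instance, the product above could be written as a single expression. This is a minimal sketch: NumPy and the variable names are assumptions, since the lesson does not name a library at this point.

```python
import numpy as np

# Weight matrix and input vector from the worked example above.
weights = np.array([[1, -2,  2],
                    [2,  1, -4],
                    [1, -1,  0]], dtype=float)
inputs = np.array([0.5, 0.9, -0.3])

# The whole layer is evaluated as one matrix-vector product.
hidden = weights @ inputs
print(hidden)
```

On a GPU the same expression maps onto highly parallel matrix kernels, which is why frameworks express networks in this form.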

Biases

It is also very useful to be able to offset our inputs by some constant. You can think of this as centering the activation function or translating the solution. We will call this constant the bias, and there will often be one value per layer.

Accounting for Non-Linearity

Neural networks are so effective in classification and regression due to their ability to combine linear and non-linear operations on each step of the evaluation.
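Putting the pieces together, one layer can be sketched as a linear step (weights and bias) followed by a non-linear step (activation). The bias value and the choice of `tanh` below are hypothetical, picked only for illustration; the weights and inputs reuse the worked example above.

```python
import numpy as np

def dense_layer(W, x, bias, activation=np.tanh):
    """Linear operation (W @ x + bias) followed by a non-linear activation."""
    return activation(W @ x + bias)

W = np.array([[1, -2,  2],
              [2,  1, -4],
              [1, -1,  0]], dtype=float)
x = np.array([0.5, 0.9, -0.3])
bias = 0.1  # hypothetical value: a single bias shared by the layer

layer_out = dense_layer(W, x, bias)
print(layer_out)
```

Stacking several such layers, each feeding its output to the next, is what gives the network its ability to represent non-linear relationships.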

Training Neural Networks: The backpropagation

During training, once we have forward propagated the network, we will find that the final output differs from the known output. The weights need to be modified to produce better results in the next attempt.

How do we know which new weights to use?

We want to minimize the error on our training data. Given labeled inputs, select weights that generate the smallest average error on the outputs. We know that the error is a function of the weights, the inputs, and the targets:

\[E(w_1,w_2,w_3,\ldots,i_1,\ldots,t_1,\ldots)\]

Just remember that the response of a single neuron can be written as

\[f (b + \sum_{i=1}^N a_i w_i),\]

where the \(a_i\) are the outputs of the previous layer (or the inputs if it is the second layer) and the \(w_i\) are the weights. So, to figure out which way we need to change any particular weight, say \(w_3\), we want to calculate

\[\frac{\partial E(w,i,t)}{\partial w_3}\]

If we use the chain rule repeatedly across layers we can work our way backward from the output error through the weights, adjusting them as we go. Note that this is where the requirement that activation functions must have nicely behaved derivatives comes from.
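For a single neuron the chain rule can be written out by hand. The sketch below assumes a sigmoid activation and a squared error \(E = \frac{1}{2}(f - t)^2\); both are assumptions made for illustration, as are all the numerical values. It computes \(\partial E / \partial w_3\) analytically and compares it against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_error(w, b, a, t):
    """Squared error of one sigmoid neuron: 0.5 * (f(b + sum a_i w_i) - t)^2."""
    out = sigmoid(b + np.dot(a, w))
    return 0.5 * (out - t) ** 2

def neuron_gradient(w, b, a, t):
    """Chain rule: dE/dw_i = dE/dout * dout/dz * dz/dw_i."""
    z = b + np.dot(a, w)
    out = sigmoid(z)
    dE_dout = out - t              # derivative of the squared error
    dout_dz = out * (1.0 - out)    # derivative of the sigmoid
    dz_dw = a                      # derivative of the weighted sum
    return dE_dout * dout_dz * dz_dw

# Hypothetical values for the previous layer's outputs, weights, bias, target.
a = np.array([0.5, 0.9, -0.3])
w = np.array([0.2, -0.4, 0.1])
b, t = 0.05, 1.0

grad = neuron_gradient(w, b, a, t)

# Central finite difference on w_3 (index 2) as an independent check.
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[2] += eps
w_minus[2] -= eps
numeric_dE_dw3 = (neuron_error(w_plus, b, a, t)
                  - neuron_error(w_minus, b, a, t)) / (2 * eps)

print(grad[2], numeric_dE_dw3)
```

The `dout_dz` factor is where the derivative of the activation function enters, which is why a well-behaved derivative is required.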

This technique makes the weight inter-dependencies much more tractable. An elegant perspective on this can be found on Chris Olah's blog.

With basic calculus, you can readily work through the details.

You can find an excellent explanation from the renowned 3Blue1Brown

Solving the back propagation efficiently

The explicit solution for backpropagation leaves us with potentially many millions of simultaneous equations to solve (real nets have a lot of weights).

They are non-linear to boot. Fortunately, this isn’t a new problem created by deep learning, so we have options from the world of numerical methods.

The standard has been Gradient Descent local minimization algorithms.

To improve the convergence of Gradient Descent, refined methods use an adaptive step size and incorporate momentum to help get over a local minimum. Momentum and step size are the two main hyperparameters.

The optimization problem that Gradient Descent solves is a local minimization. We don’t expect to ever find the actual global minimum. Several techniques have been created to avoid a solution being trapped in a local minimum.

We could/should find the error for all the training data before updating the weights (an epoch). However, it is usually much more efficient to use a stochastic approach: sampling a random subset of the data, updating the weights, and then repeating with another subset. This is mini-batch Gradient Descent.
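A minimal sketch of mini-batch Gradient Descent with momentum follows, applied to a toy linear model rather than a full network. The synthetic data (points on the line \(y = 2x + 1\)), the step size, the momentum value, and the batch size are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noise-free data drawn from y = 2x + 1.
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + 1.0

w, b = 0.0, 0.0            # parameters to learn
vel_w, vel_b = 0.0, 0.0    # momentum "velocities"
lr, momentum, batch_size = 0.1, 0.9, 20

for epoch in range(100):
    order = rng.permutation(len(X))              # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]    # one random mini-batch
        residual = w * X[idx] + b - y[idx]
        # Gradients of the mean of 0.5 * residual^2 over the mini-batch.
        grad_w = np.mean(residual * X[idx])
        grad_b = np.mean(residual)
        # Momentum update: keep part of the previous step's direction.
        vel_w = momentum * vel_w - lr * grad_w
        vel_b = momentum * vel_b - lr * grad_b
        w += vel_w
        b += vel_b

print(round(w, 4), round(b, 4))
```

Each weight update here uses only 20 of the 200 samples, so ten updates happen per pass over the data instead of one.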

Modern Neural Network Architectures

Convolutional Neural Networks

A convolutional neural network (CNN) is a Deep Learning algorithm that can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image, and be able to differentiate one from the other. The pre-processing required in a ConvNet is much lower as compared to other classification algorithms. While in primitive methods filters are hand-engineered, with enough training, ConvNets can learn these filters/characteristics.

CNN

As seen from this figure, a CNN consists of several convolutional and subsampling layers, optionally followed by fully connected layers.

Let us say that the input to the convolutional layer is an image of \(m \times m \times r\) pixels, where \(m\) is the height and width of the image and \(r\) is the number of channels, e.g. an RGB image has \(r=3\). The convolutional layer will have \(k\) filters (or kernels) of size \(n \times n \times q\), where \(n\) is smaller than the dimension of the image and \(q\) can either be the same as the number of channels \(r\) or smaller, and may vary for each kernel. The size of the filters gives rise to the locally connected structure: each filter is convolved with the image to produce \(k\) feature maps of size \((m-n+1) \times (m-n+1)\).

A simple demonstration is shown in the figure below, where we assume a binary picture and a single filter given by a 3x3 matrix. The primary purpose of convolution is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. The orange square slides over the figure and, for each 3x3 overlap, every element of the 3x3 submatrix of the figure is multiplied by the corresponding element of the filter, and then all the products are added.

ConvNet

It is clear that different values of the filter matrix will produce different Feature Maps for the same input image.

Some typical filter matrices are the following.

For edge detection: \(\begin{bmatrix} 1&0&-1\\ 0&0&0\\ -1&0&1\\ \end{bmatrix} \;\; \begin{bmatrix} 0&1&0\\ 1&-4&1\\ 0&1&0\\ \end{bmatrix} \;\; \begin{bmatrix} -1&-1&-1\\ -1&8&-1\\ -1&-1&-1\\ \end{bmatrix}\)

For sharpening: \(\begin{bmatrix} 0&-1&0\\ -1&5&-1\\ 0&-1&0\\ \end{bmatrix}\)
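The sliding-window operation described above can be sketched in plain NumPy. The 5x5 binary image is a hypothetical example, the function name is illustrative, and the filter is the second edge-detection matrix listed above. As is common in CNN implementations, the filter is applied without flipping it; for the symmetric filters shown here the result is the same either way.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide an n x n kernel over an m x m image; output is (m-n+1) x (m-n+1)."""
    m, n = image.shape[0], kernel.shape[0]
    out_size = m - n + 1
    feature_map = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i:i + n, j:j + n]
            # Multiply element by element, then add everything up.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Hypothetical 5x5 binary image.
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]], dtype=float)

# Edge-detection filter from the list above.
edge_kernel = np.array([[0,  1, 0],
                        [1, -4, 1],
                        [0,  1, 0]], dtype=float)

feature_map = convolve2d_valid(image, edge_kernel)
flat_response = convolve2d_valid(np.ones((5, 5)), edge_kernel)
print(feature_map)
```

With \(m = 5\) and \(n = 3\) the feature map is 3x3, in agreement with the \(m-n+1\) rule. On a completely uniform image the edge filter returns zeros everywhere, since there are no edges to detect.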

In practice, a CNN learns the values of these filters on its own during the training process (although we still need to specify parameters such as the number of filters, filter size, architecture of the network, etc. before the training process). The more filters we have, the more image features get extracted, and the better our network becomes at recognizing patterns in unseen images.

The other step described in this section is pooling. Spatial Pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information. Spatial Pooling can be of different types: Max, Average, Sum, etc.
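A short sketch of spatial pooling over non-overlapping 2x2 windows follows. The 4x4 feature map, the function name `pool`, and the window size are hypothetical, chosen only to illustrate the Max and Average variants.

```python
import numpy as np

def pool(feature_map, size=2, reduce=np.max):
    """Reduce each non-overlapping size x size block to a single number."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Drop any rows/columns that do not fill a complete block.
    trimmed = feature_map[:h_out * size, :w_out * size]
    blocks = trimmed.reshape(h_out, size, w_out, size)
    return reduce(blocks, axis=(1, 3))

# Hypothetical 4x4 feature map.
fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 0],
                 [7, 2, 9, 8],
                 [1, 0, 3, 4]], dtype=float)

max_pooled = pool(fmap, reduce=np.max)
avg_pooled = pool(fmap, reduce=np.mean)
print(max_pooled)
print(avg_pooled)
```

Both variants turn the 4x4 map into a 2x2 map; Max pooling keeps the strongest response in each block, while Average pooling smooths over it.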

Graph Neural Network

Before I end and move on to neural networks and the different packages, I would like to discuss one of the most recent proposals in the literature: the so-called Graph Neural Network.

As much of the available information in fields like social networks, knowledge graphs, recommender systems, and even life science comes in the form of graphs, very recently people have developed specific neural networks for these types of applications. Most of the discussion here has been taken from Zhou’s paper.

A Graph Neural Network is a type of Neural Network which directly operates on the Graph structure, which we define by a set of nodes and edges \(G = (V, E)\). A typical application of GNN is node classification. Essentially, every node in the graph is associated with a label, and we want to predict the label of the nodes without ground truth. Here we describe briefly this application and let the reader search for more information.

In the node classification problem setup, each node \(v\) is characterized by its feature \(\vec{x}_v\) and associated with a ground-truth label \(t_v\). Given a partially labeled graph \(G\), the goal is to leverage the labeled nodes to predict the labels of the unlabeled ones. The network learns to represent each node with a state vector \(\vec{h}_v\) that contains the information of its neighborhood. The state embedding \(\vec{h}_v\) is an \(s\)-dimensional vector of node \(v\) and can be used to produce an output \(\vec{o}_v\) such as the node label. Let \(f\) be a parametric function, called the local transition function, that is shared among all nodes and updates the node state according to the input neighborhood, and let \(g\) be the local output function that describes how the output is produced. Then, \(\vec{h}_v\) and \(\vec{o}_v\) are defined as follows:

\[\vec{h}_v = f ( \vec{x}_v, \vec{x}_{co[v]}, \vec{h}_{ne[v]}, \vec{x}_{ne[v]})\]

\[\vec{o}_v = g(\vec{h}_v, \vec{x}_v)\]

where \(\vec{x}_v\), \(\vec{x}_{co[v]}\), \(\vec{h}_{ne[v]}\), \(\vec{x}_{ne[v]}\) are the features of \(v\), the features of its edges, the states, and the features of the nodes in the neighborhood of \(v\), respectively.

Let \(\vec{H}\), \(\vec{O}\), \(\vec{X}\), and \(\vec{X}_N\) be the vectors constructed by stacking all the states, all the outputs, all the features, and all the node features, respectively. Then we have a compact form:

\[\vec{H} = F (\vec{H}, \vec{X})\]

\[\vec{O} = G(\vec{H},\vec{X}_N)\]

where \(F\), the global transition function, and \(G\), the global output function, are stacked versions of \(f\) and \(g\) for all nodes in the graph, respectively. The value of \(\vec{H}\) is the fixed point of the first of these two equations and is uniquely defined under the assumption that \(F\) is a contraction map. Since we are seeking a unique solution for \(\vec{h}_v\), we can apply the Banach fixed-point theorem and rewrite that equation as an iterative update process. Such an operation is often referred to as message passing or neighborhood aggregation:

\[\vec{H}^{t+1} = F (\vec{H}^t, \vec{X})\]

where \(\vec{H}\) and \(\vec{X}\) denote the concatenation of all the \(\vec{h}\) and \(\vec{x}\), respectively. The output of the GNN is computed by passing the state \(\vec{h}_v\) as well as the feature \(\vec{x}_v\) to the output function \(g\):

\[\vec{o}_v = g(\vec{h}_v,\vec{x}_v)\]
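The iterative update can be sketched with a deliberately simplified transition function. Everything in the block below is an assumption made for illustration: the 4-node path graph, the random features, and the specific form \(F(\vec{H}, \vec{X}) = \tanh(\hat{A}\,\vec{H}\,W_s + \vec{X}\,W_x)\), which uses only neighbor states and each node's own features and ignores edge features. The state weights are rescaled so that \(F\) is a contraction, which is the condition required by the fixed-point argument.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical undirected path graph 0 - 1 - 2 - 3.
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
degree = adjacency.sum(axis=1)
# Symmetric normalization D^-1/2 A D^-1/2 keeps the spectral norm <= 1.
norm_adj = adjacency / np.sqrt(np.outer(degree, degree))

num_nodes, feat_dim, state_dim = 4, 3, 2
X = rng.normal(size=(num_nodes, feat_dim))        # node features
W_feat = rng.normal(size=(feat_dim, state_dim))   # feature weights
W_state = rng.normal(size=(state_dim, state_dim)) # state weights
W_state *= 0.5 / np.linalg.norm(W_state, 2)       # spectral norm 0.5

def global_transition(H, X):
    """Simplified F: aggregate neighbor states, add own features, squash."""
    return np.tanh(norm_adj @ H @ W_state + X @ W_feat)

# Iterate H^{t+1} = F(H^t, X) until the states stop changing.
H = np.zeros((num_nodes, state_dim))
for step in range(100):
    H_next = global_transition(H, X)
    change = np.linalg.norm(H_next - H)
    H = H_next
    if change < 1e-10:
        break

print(step, change)
```

Because `tanh` never stretches distances and the state weights shrink them by at least a factor of two, each iteration at least halves the distance to the fixed point, so the loop stops well before the iteration limit.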

More details on this methodology can be found in the paper above.

Generative Adversarial Networks

Generative adversarial networks (GANs) are deep neural net architectures comprised of two nets, pitting one against the other (thus the “adversarial”). They were introduced in 2014 by Ian Goodfellow and collaborators.

GANs’ potential is huge because they can learn to mimic any distribution of data. That is, GANs can be taught to create worlds eerily similar to our own in any domain: images, music, speech, prose.

To understand GANs, you should know how generative algorithms work. Up to now, most of the algorithms we have discussed are the so-called discriminative algorithms, where we try to predict the output from a given set of features.

In Bayesian language, we are trying to predict the probability of a label given the features,

\(P(c_j \mid x_1,x_2,\cdots,x_n)\).

In GANs, we are concerned with a different idea: we try to predict the features given a certain label.

Therefore, here we would like to build

\(P(x_1,x_2,\cdots,x_n \mid c_j)\).

The idea in GANs then is to have two neural networks. One is called the generator, which generates features, and the other, the discriminator, evaluates their authenticity, i.e. the discriminator decides whether each instance of data that it reviews belongs to the actual training dataset or not. For example, suppose we analyze a book by a great author such as Garcia Marquez. We could analyze the language used in his texts, but for this example the generator should be able to create words and the discriminator should be able to recognize whether these are authentic. The idea of the generator then is to create words that were not written by Garcia Marquez but that the discriminator is unable to distinguish from his.
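The competition between the two networks is usually expressed through two loss functions. The sketch below does not train anything: it only evaluates the standard binary cross-entropy discriminator loss and a commonly used (non-saturating) generator loss on made-up discriminator scores. The score arrays and function names are hypothetical.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Low when real samples score near 1 and generated samples near 0."""
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Low when the discriminator scores generated samples near 1."""
    return -np.mean(np.log(d_fake))

# Hypothetical discriminator outputs: probability that a sample is authentic.
d_real = np.array([0.90, 0.80, 0.95])          # scores on real data
d_fake_weak = np.array([0.10, 0.20, 0.05])     # easily spotted fakes
d_fake_strong = np.array([0.50, 0.60, 0.45])   # more convincing fakes

print("generator, weak fakes:  ", generator_loss(d_fake_weak))
print("generator, strong fakes:", generator_loss(d_fake_strong))
print("discriminator, weak:    ", discriminator_loss(d_real, d_fake_weak))
print("discriminator, strong:  ", discriminator_loss(d_real, d_fake_strong))
```

As the fakes become more convincing, the generator loss goes down and the discriminator loss goes up, which is the adversarial tension described above.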

Milestones of Neural Networks and Deep Learning Frameworks

The history

Artificial Intelligence is born

The term Artificial Intelligence (AI) was first coined by John McCarthy in 1956. In the early days, the focus was on hard-coding rules that computers can follow via inference. That was called the age of symbolic AI and led to the expert systems of the 80s and early 90s.

Deep Learning old names

Deep Learning has received different names over time. It was called Cybernetics in the 70s, Connectionism in the 80s, and Neural Networks in the 90s; today we call it Deep Learning.

First DL Frameworks from 2000’s to 2012

AlexNet in 2012

Big Tech companies jump in

ResNet in 2016

Consolidating Deep Learning Frameworks

Doing Deep Learning at large

Key Points

  • Deep Learning is a set of techniques that explore multiple levels of data transformations

  • Deep Learning is usually understood as the use of neural network models

  • Neural network models can be trained efficiently using GPUs

  • Three popular Deep Learning frameworks are TensorFlow, PyTorch and MXNet