\( \def\W{\mathbf{W}} \def\a{\mathbf{a}} \def\b{\mathbf{b}} \def\x{\mathbf{x}} \def\y{\mathbf{y}} \def\d{\mathbf{d}} \def\z{\mathbf{z}} \def\div{\rm div} \)
A neural network is simply a function that takes in an input $X$ and computes an output $Y$. In this regard, it is no different from any other function. What sets it apart from a function such as, say, $y = \sin(X)$, is the manner in which it is constructed. It is composed as a network of many very simple elements that together somehow manage to perform amazing tasks. The term “neural” itself derives from the fact that the simple units that are networked to compose the model are conceptually similar to neurons in the brain, and were in fact originally designed as models for these neurons.
The basic unit of a neural network is an element, often called a neuron or a perceptron, that has the following structure.
The parameters of the perceptron are a set of weights $w_1, w_2, \cdots, w_N$, and a bias $b$.
The actual computation performed by the perceptron is as follows: it first computes an affine combination of its inputs, \[ z = \sum_{j=1}^N w_j x_j + b \] and then passes the result through a non-linear activation function to obtain its output \[ y = a(z) \]
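To make this concrete, here is a minimal sketch of a single perceptron in NumPy. The function name and the choice of a sigmoid activation are illustrative assumptions, not something prescribed by these notes:

```python
import numpy as np

def perceptron(x, w, b):
    """Compute the output of a single perceptron.

    x : input vector with N components
    w : weight vector with N components
    b : scalar bias
    """
    z = np.dot(w, x) + b                 # affine combination of the inputs
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation (one possible choice)
```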
We will generally not represent neurons with the detailed structure shown above, but will simply represent them using the simplified structure below. Note that the weights, bias, and activation are not explicitly shown in the figure, but are always implied.
The activation function $a()$ is a deceptively simple non-linear transformation that is the source of the magical capabilities of the neural network. The most commonly used activation functions are:
- the sigmoid: $a(z) = \frac{1}{1+e^{-z}}$
- the tanh: $a(z) = \tanh(z)$
- the ReLU (rectified linear unit): $a(z) = \max(0, z)$
- the ELU (exponential linear unit): $a(z) = z$ for $z \geq 0$, and $a(z) = \alpha(e^z - 1)$ for $z < 0$
The figure above graphs the activation functions we have just described (for the ELU we have used $\alpha = 1$). A number of other activation functions have also been proposed in the literature.
For simple networks (like those needed in HW1), the sigmoid, tanh and ReLU are the ones you may want to try first.
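As a quick reference, here is one way the four activations above might be written in NumPy. This is a sketch; the ELU's $\alpha$ is exposed as a parameter:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    # z for z >= 0, alpha * (e^z - 1) otherwise
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1.0))
```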
There is also another, special activation function, not mentioned above, called the softmax activation. Unlike the activations above, which take a single input $z$ and output a single value $y$, the softmax activation takes in multiple inputs simultaneously and produces multiple outputs simultaneously. We will explain softmax activations in the next section.
A multi-layer perceptron is a network of units of the kind defined above. The figure below shows a typical MLP structure.
Each circle in the figure represents a neuron. This is a multi-layer perceptron because the neurons are arranged in many layers, such that neurons from any layer only connect to the neurons from immediately adjacent layers. (We will also see networks without such layered structure in the course, but such layered architectures are the most common).
The “flow” of information in this figure is from left to right. Each layer of neurons computes its outputs using the outputs of the layer of neurons immediately preceding it (to its left).
The blue bullets in the figure represent the input. There are no neurons here, just the inputs; nevertheless this layer of elements is called the input layer. To reiterate: there are no neurons in the input layer. It is just the location in the network from where inputs are provided to the network.
The layers between the input and the final layer are called hidden layers. That is because the outputs of these layers are not directly observed -- the actual output of the network is the output of the final layer. In this example there are three hidden layers.
The final layer (at the extreme right) is the output layer. The outputs of this layer of neurons are returned to the user. The output layer is different from the other layers, so it is worth spending some time on it.
In the figure the output layer is shown enclosed in a blue rectangular box. This is to indicate that it is not in fact a set of 4 independent units. In most problems, the network is used to perform a classification task, where we must select one of $K$ classes for a given input. Rather than directly choosing the class, the network actually outputs the a posteriori probability of each of the classes (i.e. given an input $X$, it outputs $P(class=i|X)$). In this figure there are four outputs, indicating that the network is classifying between four classes. Each of the four outputs is the a posteriori probability of one of the classes.
Since the four outputs are the probabilities for four classes, they must all sum to 1.0. In other words, the individual outputs are not independent of one another. Modifying one output (e.g. increasing it) will affect the other outputs, so that they all sum to 1.0. In order to achieve this behavior, it is not sufficient to use regular activation functions. For the output layer we must use the softmax activation mentioned earlier.
The softmax activation takes multiple inputs and produces multiple outputs that sum to 1.0. Given $N$ inputs $x_1, \cdots, x_N$, it computes $M$ outputs $y_1, \cdots, y_M$ using the two-step process shown below. First, it computes $M$ independent affine combinations, $z_1, \cdots, z_M$, as \[ z_i = \sum_{j=1}^N w_{ij} x_j + b_i \]
Subsequently, it computes the $M$ outputs $y_1,\cdots, y_M$ as \[ y_i = \frac{e^{z_i}}{\sum_{j=1}^M e^{z_j}} \]
Note that the outputs sum to 1.0.
The softmax activation is typically used in the final layer of a neural network, to assign probabilities to various outcomes.
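In code, the exponentials can overflow for large $z_i$; a standard remedy is to subtract the maximum before exponentiating, which leaves the outputs unchanged since the shift cancels in the ratio. A minimal sketch:

```python
import numpy as np

def softmax(z):
    """Map a vector of affine values z to probabilities that sum to 1.0."""
    z = z - np.max(z)       # shift for numerical stability; the ratio is unchanged
    e = np.exp(z)
    return e / np.sum(e)
```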
The MLP shown above can now be used to process inputs.
The first step in using the MLP is to determine how the input to it is represented. MLPs are mathematical machines -- they take in numbers and output numbers. So any input provided must first be converted to a vector (or matrix) of numbers.
The task of converting your input (which could be as complex as the state of a game) to a suitable numeric vector is itself challenging. Fortunately, for simple tasks like classifying MNIST digits (as is required for HW1), you can simply arrange pixel values (which are, after all, numbers) into a vector, and voila, you have your numeric input vector.
We will deal with numeric representations of more complex inputs later in the course.
The output of the network too is expected to be numeric. In a classification task, such as classifying input images of digits into one of 10 digit classes, we will require an appropriate way of representing classes numerically.
We will do so using one-hot vectors. A one-hot vector is a vector that contains a single component that has value 1, and all other components are 0. So, for instance $[0\,0\,1\,0\,0]$ is a one-hot vector.
We will represent classes as one-hot vectors. The dimensionality (number of components) of the vector will be the total number of classes. To represent the $i^{\rm th}$ class, we will use a one-hot vector whose $i^{\rm th}$ component is 1, and all other components are 0.
Thus, if the task is digit classification (recognize images of digits), then each digit would be represented by a 10-dimensional one-hot vector, where the digit 0 would be represented by $[1\,0\,0\,0\,0\,0\,0\,0\,0\,0]$, the digit 1 would be represented by $[0\,1\,0\,0\,0\,0\,0\,0\,0\,0]$, the digit 2 by $[0\,0\,1\,0\,0\,0\,0\,0\,0\,0]$, and so on. Ideally, when the network is presented with an image of the digit 2, the 10-dimensional output must have the one-hot value $[0\,0\,1\,0\,0\,0\,0\,0\,0\,0]$. Observe that this also has the interpretation that in the ideal case the network must assign a probability of 1 to the digit 2, and 0 to everything else.
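Constructing a one-hot representation is a few lines in NumPy (a sketch; the function name is a hypothetical choice):

```python
import numpy as np

def one_hot(label, num_classes=10):
    """Return a one-hot vector with a 1 at position `label`."""
    d = np.zeros(num_classes)
    d[label] = 1.0
    return d

one_hot(2)   # array([0., 0., 1., 0., 0., 0., 0., 0., 0., 0.])
```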
The MLP must take in an input $\x$ (comprising components $x_1, \cdots, x_N$) and compute the output $\y$ (comprising components $y_1, \cdots, y_M$). The computation of the MLP is performed sequentially from the neurons closest to the input, progressing to the neurons at the output, in such a manner that when each neuron is evaluated, all of the values it requires as input have already been evaluated. So, first the neurons that directly receive the input $X$ are evaluated. Then the neurons that use the values computed by these first-level neurons as inputs are evaluated, and so on.
Consider a network with a layered architecture, like the one in the figure above, with several hidden layers and a final output layer composed of a softmax unit. We use the following notation: $L$ is the number of layers; $N_l$ is the number of neurons in the $l^{\rm th}$ layer (with $N_0 = N$, the dimensionality of the input); $w_{ij}^l$ and $b_i^l$ are the weight from the $j^{\rm th}$ input and the bias of the $i^{\rm th}$ neuron of the $l^{\rm th}$ layer; and $z_i^l$ and $y_i^l$ are, respectively, the affine value and the output of that neuron.
The computations are performed as follows.
# The first hidden layer works off the input
for $i$ = 1:$N_1$
$z_i^1 = \sum_{j=1}^N w_{ij}^1 x_j + b_i^1 \\ y_i^1 = a(z_i^1)$
end
# Subsequent hidden layers work from the output of previous layers
for $l$ = 2:$L-1$
for $i$ = 1:$N_l$
$z_i^l = \sum_{j=1}^{N_{l-1}} w_{ij}^l y_j^{l-1} + b_i^l \\ y_i^l = a(z_i^l)$
end
end
# The softmax of the output layer. First compute the $z$s, and then the $y$s.
for $i$ = 1:$N_L$
$z_i^L = \sum_{j=1}^{N_{L-1}} w_{ij}^L y_j^{L-1} + b_i^L$
end
for $i$ = 1:$N_L$
$y_i^L = \frac{\exp(z_i^L)}{\sum_{j=1}^{N_L} \exp(z_j^L)}$
end
Note that $y_i = y_i^L$ is the output of the network. $y_i$ is the probability assigned to the $i^{\rm th}$ class by the network. To attribute a unique class to the input, you only need to pick the most probable class, e.g. as \[ class(\x) = \arg\max_i y_i \]
In practice, you wouldn't implement the code as it is given above. It would be too inefficient. Instead, you would use matrix and vector operations, since these can be very efficiently computed using appropriate libraries.
In order to do this in vector form, we define the following: $\W^l$ is the $N_l \times N_{l-1}$ matrix whose $(i,j)^{\rm th}$ entry is the weight $w_{ij}^l$; $\b^l$ is the vector of biases $b_i^l$ of the $l^{\rm th}$ layer; and $\z^l$ and $\y^l$ are the vectors whose components are the $z_i^l$ and $y_i^l$ respectively.
We can now write out the operations required to compute the MLP much more simply as follows.
# The first hidden layer works off the input
$\z^1 = \W^1\x + \b^1 \\ \y^1 = \a(\z^1)$
# Subsequent hidden layers work from the output of previous layers
for $l$ = 2:$L-1$
$\z^l = \W^l \y^{l-1} + \b^l \\ \y^l = \a(\z^l)$
end
# The softmax of the output layer.
$\z^L = \W^L \y^{L-1} + \b^L \\ \y^L = {\rm softmax}(\z^{L})$
As before $\y = \y^L$ is the output of the network, representing a vector of probabilities for the classes.
Note that in the modified notation, the activation functions $\a()$ take in a vector of inputs, apply the appropriate activation function individually to each component of the vector, and produce a vector output. Thus, if you choose the ReLU as your activation, $\y = \a(\z)$ would apply the ReLU to each component of $\z$ to produce the corresponding component of $\y$. This effectively represents a vectorized version of the activation function. The softmax activation, however, operates exactly as before.
Also note that in the notes above we are assuming all vectors are column vectors. If you use row vectors instead in your code, you will have to transpose all equations, and change the order of multiplication (i.e. replace all $\W\y + \b$ by $\y\W^\top + \b$).
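Putting the pieces together, here is a sketch of the vectorized forward pass in NumPy, under the assumption that the weights and biases are stored in Python lists `W` and `b` indexed by layer, and that all hidden layers use the ReLU; the helper names `relu`, `softmax`, and `mlp_forward` are illustrative, not prescribed by these notes:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z))   # shifted for numerical stability
    return e / np.sum(e)

def mlp_forward(x, W, b):
    """Forward pass through an L-layer MLP.

    W[l] is the N_l x N_{l-1} weight matrix and b[l] the bias
    vector of layer l, for l = 0, ..., L-1 (0-indexed here).
    """
    y = x
    for l in range(len(W) - 1):      # hidden layers
        z = W[l] @ y + b[l]
        y = relu(z)                  # any activation a() could be used here
    z = W[-1] @ y + b[-1]            # output layer
    return softmax(z)                # vector of class probabilities

# Picking the most probable class:
# predicted_class = np.argmax(mlp_forward(x, W, b))
```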
We now know how to process an input $\x$ to get a classification output $\y$. But how do we ensure that the output is correct? To do so, we must train the network.
The network has a number of parameters: the weights $w^l_{ij}$ and the biases $\b^l$. The behavior of the network changes according to their values. “Training” the network is the business of learning these values such that the network performs its task correctly. E.g., for a hand-written digit classification network, training is the job of finding the weights and biases of the network such that, when the network is presented with an image of a digit, the output $\y$ correctly assigns the maximum probability to the correct digit.
The actual training process is an iterative process, in which we begin with initial estimates for all the parameters (which may be randomly set). These estimates are then iteratively refined such that the network outputs (mostly) correct answers to a set of “training” instances for which the answers are known. We explain this in a little more detail below.
Before we do so, let us introduce the notation we will use below.
We will train the network using a collection of training data. The training data consists of a large collection of $(\x,\d)$ pairs, where $\x$ is an input data instance, and $\d$ is the actual class label for that instance (expressed as a one-hot vector). The $\x$ represents the numeric input data (sometimes called the “features”) of the instance that would be presented to the network. $\d$ is the desired output of the network -- what you want the network to ideally output when presented with this training instance. For instance, a training instance may comprise an image of the digit 3, along with a (one-hot representation of its) class. Ideally, if the network is presented with the $\x$ from the training instance, the output $\y$ must exactly be equal to the $\d$ for that instance. Training tries to make this happen.
These training data are assumed to be similar to the test data that will be encountered when the network is being used operationally.
The actual training procedure is iterative. It starts with an initial guess $w_{ij}^{l,(0)}$, $b_i^{l,(0)}$ (where the second superscript $(0)$ indicates that this is the initial estimate). When the $\x$ from any training instance $(\x,\d)$ is processed by the network, the network produces an output $\y$. Ideally this $\y$ would be equal to $\d$. In practice, the two will not be the same, particularly in the initial stages of training. There will be a discrepancy between the desired output $\d$ and the actual output $\y$. Training attempts to minimize this discrepancy for all training instances.
In order to do so, we will need to quantify the discrepancy. This is generally done through a divergence function $\div(\y,\d)$ which has the following properties:
- $\div(\y,\d) \geq 0$ for all $\y$ and $\d$
- it attains its minimum (ideally 0) when $\y = \d$
- it is differentiable with respect to $\y$, so that the derivatives needed for training can be computed
A number of divergence functions have been defined in the literature. The two that find the most use in deep learning are the following:
- the $L_2$ divergence: $\div(\y,\d) = \frac{1}{2}\sum_i (y_i - d_i)^2$, typically used when the desired output is an arbitrary numeric vector
- the Kullback-Leibler (KL) divergence: $\div(\y,\d) = \sum_i d_i \log \frac{d_i}{y_i}$, typically used when the outputs represent a probability distribution, as in classification
A variant of the $KL$ divergence is the cross-entropy loss, which is defined simply as $\div(\y, \d) = -\sum_i d_i \log y_i$, which is the same as the KL divergence except for the term $\sum_i d_i \log d_i$, which does not depend on $\y$. When $\d$ is a one-hot vector, the KL divergence is identical to the cross-entropy loss, since $\sum_i d_i \log d_i = 0$.
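For one-hot $\d$, the cross-entropy reduces to $-\log y_c$, where $c$ is the correct class. A sketch in NumPy; the small constant added inside the log is a common guard against $\log 0$, an implementation detail rather than part of the definition:

```python
import numpy as np

def cross_entropy(y, d, eps=1e-12):
    """Cross-entropy divergence between output probabilities y
    and the (one-hot) desired output d."""
    return -np.sum(d * np.log(y + eps))
```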
When training, we will have a collection of training instances. We will try to learn the neural network model parameters to minimize the divergence between the network output and desired outputs for all of them.
Let $\mathcal{Tr}$ represent the set of training instances, i.e. $\mathcal{Tr} = \{(\x_1, \d_1),\,(\x_2, \d_2),\cdots,(\x_T, \d_T)\}$, where $T$ is the total number of training instances.
We define a Loss that quantifies the average divergence over all training instances as \[ Loss = \frac{1}{T}\sum_{i=1}^T \div(\y_i, \d_i) \] where, as clarified earlier, $\y_i$ is the network response to input $\x_i$.
The network parameters are trained to minimize this loss. The assumption is that if the network can be tuned to correctly predict the desired output for the instances in the training data, it will also do so for other instances outside it.
The training philosophy we will use is as follows. Given our current estimate for the parameters, we will compute the discrepancy between the network output and the desired output, as quantified by the loss. Then, for each parameter (weight and bias) we will test how it influences the loss -- whether increasing that parameter increases the loss or decreases it. If increasing the parameter decreases the loss, we will increase the parameter. If increasing the parameter increases the loss, we will decrease it.
The influence of a parameter $w$ (or $b$) on the loss is given by the derivative $\frac{d Loss}{d w}$. The derivative literally computes how much the loss increases ($d Loss$), in response to a small increment of $w$ ($dw$).
This leads to the following update rule for any parameter $w$: \[ w \longleftarrow w - \eta \frac{d Loss}{d w} \]
The parameter is adjusted in the direction of decreasing loss. $\eta$ is a step size parameter, sometimes also called a “learning rate”. The update rule above is itself an instance of the gradient descent update rule.
We will use the gradient descent update rule to update every parameter -- every weight $w_{ij}^l$ and every bias $b_i^l$ in the network. In the $k^{\rm th}$ iteration of the update, the update operations performed are:
for $l$ = 1:$L$
for $i$ = 1:$N_l$
for $j$ = 1:$N_{l-1}$
$w_{ij}^{l,(k)} = w_{ij}^{l,(k-1)} - \eta \frac{d Loss}{d w_{ij}^l}$
end
$b_{i}^{l,(k)} = b_{i}^{l,(k-1)} - \eta \frac{d Loss}{d b_{i}^l}$
end
end
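In vectorized form, assuming the derivatives have already been assembled into matrices `dW[l]` and vectors `db[l]` matching the shapes of the parameters (hypothetical names; how they are computed is the subject of the backward pass below), the triple loop above collapses to a short sketch:

```python
def gradient_descent_step(W, b, dW, db, eta):
    """Apply one gradient descent update to all weights and biases."""
    for l in range(len(W)):
        W[l] -= eta * dW[l]   # w <- w - eta * dLoss/dw, for every weight
        b[l] -= eta * db[l]   # b <- b - eta * dLoss/db, for every bias
    return W, b
```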
The key component of the above procedure is the derivatives $\frac{d Loss}{d w}$ (or $\frac{d Loss}{db}$), which quantify how the discrepancy between the current network output and desired output changes with parameter value.
In order to compute the derivatives, two steps are required: first, each training input must be passed through the network to compute its output and the divergence of that output from the desired output; second, the derivatives of the divergence with respect to every weight and bias must be computed, working backward through the network from the output.
The two steps above are called the “forward pass” and “backward pass” respectively.
In the forward pass, each training instance $\x_i$ is passed through the network to obtain the output $\y_i$.
The divergence $\div(\y_i, \d_i)$ is computed from $\d_i$ and the obtained $\y_i$.
In the backward pass, for each training instance, we proceed backward from the output layer of the network to the input layer, computing $\frac{d\,\div(\y, \d)}{d w_{ij}^l}$ and $\frac{d\,\div(\y, \d)}{d b_{i}^l}$ as we go.
We will skip the details of the backward pass for now, other than to note that the computation results in the above-mentioned derivatives. The overall derivatives are computed as \[ \frac{d Loss}{d w_{ij}^l} = \frac{1}{T} \sum_{t=1}^T \frac{d\,\div(\y_t, \d_t)}{d w_{ij}^l} \\ \frac{d Loss}{d b_{i}^l} = \frac{1}{T} \sum_{t=1}^T \frac{d\,\div(\y_t, \d_t)}{d b_{i}^l} \]
The computed derivatives are then inserted into the algorithm of Section 5.3 to obtain updates.
The net derivative computed in Section 5.4.2 is the average of the derivatives for all the training instances, implying that the forward and backward pass must be computed over the entire training data set before each update. In practice, this is inefficient and wasteful of computation.
Instead, we will partition the training data into mini batches of a small number of instances (typically between 16 and 1024).
We will compute the derivatives using the formulae from Section 5.4, but over mini batches, instead. The parameter values are updated after each mini batch.
A single pass over the entire training data will result in many updates, one from each minibatch. If each minibatch has size $T_b$, then we will obtain $\frac{T}{T_b}$ updates per pass over the data. A single pass over the entire training data is referred to as an epoch in the jargon.
In order to ensure stable training, the minibatches are randomized, such that the minibatches in consecutive epochs are not identical. In addition, the learning rate $\eta$ is reduced, or decayed, with updates. A variety of decay schedules have been proposed in the literature.
Common examples include step decay, where $\eta$ is scaled down by a fixed factor every few epochs, and inverse decay, e.g. $\eta_k = \frac{\eta_0}{1 + \gamma k}$ for the $k^{\rm th}$ update. Other such schedules have been proposed.
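The overall training loop might then look as follows in NumPy. This is a sketch under several assumptions: `forward_backward` is a hypothetical routine returning the averaged gradients for one minibatch, and the inverse decay schedule is just one possible choice:

```python
import numpy as np

def train(W, b, X, D, eta0=0.1, batch_size=64, epochs=10, gamma=0.01):
    """Minibatch gradient descent over the training set (X, D)."""
    T = len(X)
    update = 0
    for epoch in range(epochs):
        order = np.random.permutation(T)          # randomize minibatches each epoch
        for start in range(0, T, batch_size):
            batch = order[start:start + batch_size]
            dW, db = forward_backward(W, b, X[batch], D[batch])  # hypothetical routine
            eta = eta0 / (1.0 + gamma * update)   # decay the learning rate
            for l in range(len(W)):
                W[l] -= eta * dW[l]
                b[l] -= eta * db[l]
            update += 1
    return W, b
```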
In many problems, the simple gradient descent update rule of Section 5.3 will be too slow or will get stuck in poor solutions. To address this, a variety of more advanced methods have been proposed, such as momentum-based methods (including Nesterov's method) and adaptive-learning-rate methods such as RMSProp, Adam, and AdaGrad. We refer students to this amazing blog by Sebastian Ruder for an excellent description of these methods.
In fact, we highly encourage you to read Ruder's blog, as you will almost certainly be using some of these methods in your assignments. HW1, for instance, gets some of its best results using Adam.
Your first task for homework 1 is to program an activation and cost function. When doing so, it is important that you consider the following:
The same recommendations also apply after you take the derivative of each function to identify what is used in backward propagation.
Once you have completed Recommendation 1 and Recommendation 2 for the forward, backward, and derivative computations, you are ready to run the AutoGrader. In general, these procedures should work when creating any function that has a known mathematical representation.
Your second task for homework 1 is to create a multilayer perceptron for digit recognition. As tempting as it may be to create your own custom specification, more often than not you are better off implementing an existing model architecture that is known to solve your problem or some variant of it. We recommend that, once you are confident your model is working as expected, you then experiment with customizations (e.g. adding some fancy loss or normalizer) or modifications (e.g. changing hyperparameters or creating ensembles). More often than not, existing specifications that are known to work for others will work for you, provided you implement them correctly.
With that said, we can get into the specifics of neural network implementation. There are three phases.
Your batch training method should execute the following in order (see the sketch after the next paragraph):
- a forward pass through the network to compute the outputs for the batch
- computation of the divergence between the computed outputs and the desired outputs
- a backward pass to compute the derivatives of the divergence with respect to every weight and bias
- a gradient descent update of the parameters
You should execute the batch training for each batch of your data loader for each epoch.
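A sketch of one such batch update follows; the `mlp` object and its `forward`, `divergence`, `backward`, and `step` methods are hypothetical names standing in for whatever your implementation provides:

```python
def train_batch(mlp, x_batch, d_batch, eta):
    """One gradient descent update computed from a single minibatch."""
    y = mlp.forward(x_batch)             # 1. forward pass
    loss = mlp.divergence(y, d_batch)    # 2. divergence between outputs and targets
    mlp.backward(y, d_batch)             # 3. backward pass: compute the gradients
    mlp.step(eta)                        # 4. gradient descent parameter update
    return loss
```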
This part helps give some intuition about how your model is learning, and produces figures that are nice for papers. We recommend, for each class, creating a scatter plot of the 2nd neuron's activations versus the 1st neuron's activations. You can then merge the respective scatter plots into one plot. This will be a nice snippet of code: once it is done, you can copy and paste it for use in other projects.
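A sketch of such a plot with matplotlib, assuming `acts` is an array of shape `(num_examples, num_neurons)` holding the activations of some layer and `labels` holds the class of each example (both names are hypothetical):

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_activations(acts, labels, num_classes=10):
    """Scatter of 2nd-neuron vs. 1st-neuron activations, one color per class."""
    for c in range(num_classes):
        mask = labels == c
        plt.scatter(acts[mask, 0], acts[mask, 1], s=5, label=str(c))
    plt.xlabel("activation of 1st neuron")
    plt.ylabel("activation of 2nd neuron")
    plt.legend(title="class")
    plt.show()
```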