Introduction to Neural Networks

We now shift from our discussion of simple linear models to neural networks. Neural networks are a class of models that go beyond linear classifiers. Recall that the three main components of a machine learning algorithm are the hypothesis function, the loss function, and the optimization method. In the context of neural networks, the loss function remains the cross-entropy loss, and the optimization method continues to be stochastic gradient descent (SGD).

The key difference in neural networks is the hypothesis function, which is no longer a simple linear model but a more complex structure capable of capturing nonlinear relationships in the data. The lecture emphasizes that neural networks and deep learning are effectively synonymous terms in the current context.

The Neural Network Hypothesis Function

Neural networks represent a shift from linear classification methods to more complex forms. The core methodology remains consistent between linear classification and neural networks, with the primary distinction being the hypothesis function utilized. In linear classification, the hypothesis function is of a simpler form, typically represented as \(h(x) = \theta^T x\). However, neural networks employ a more intricate hypothesis function, which can capture a broader range of relationships within the data.

From Linear to More Elaborate Hypothesis Functions

The transition from a linear hypothesis function to a more complex one necessitates a more elaborate computation for the gradient, especially when employing stochastic gradient descent (SGD). SGD requires computing the gradient of the loss function with respect to the parameters, and as the hypothesis becomes a more intricate function of those parameters, so does the computation of this gradient.

Importance of Input Representation

The representation of input data significantly impacts the expressiveness of the model, particularly for linear models. For instance, a simple linear hypothesis function applied to a one-hot vector concatenation of two words from a language model cannot represent all possible probabilities of the next word given the previous two words. This limitation follows from a simple counting argument: the linear model has far too few parameters to assign an arbitrary distribution over the next word to every possible pair of previous words.

To enhance the expressiveness of the model, a feature function \(\phi\) can be introduced to transform the raw input into a more complex representation. This transformation maps an \(n\)-dimensional input to a \(d\)-dimensional space, allowing for a more expressive hypothesis function of the form \(h(x) = \theta^T \phi(x)\). The parameters \(\theta\) would then be a \(d \times k\) matrix, rather than an \(n \times k\) matrix.

Example: Language Model

In the context of a language model, if the input \(x\) is the concatenation of two one-hot vectors representing the past two words, the feature function \(\phi\) could be defined as the Kronecker product of these vectors. This product results in a vector that enumerates all possible values of the previous two words, allowing the model to represent all possible probabilities of the next word given the previous two words.
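To make this concrete, here is a small illustrative sketch (the five-word vocabulary and the word indices are arbitrary choices for the example) showing that the Kronecker product of two one-hot vectors is itself a one-hot vector that indexes the pair of words:

import torch

V = 5                              # hypothetical vocabulary size
x1 = torch.zeros(V); x1[2] = 1.0   # one-hot encoding of the first word (index 2)
x2 = torch.zeros(V); x2[4] = 1.0   # one-hot encoding of the second word (index 4)

phi = torch.kron(x1, x2)           # length V^2 vector, one-hot over all word pairs
print(phi.nonzero().item())        # 14, i.e., index 2*V + 4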

Historical Context of Feature Engineering

Historically, machine learning involved extensive hand-crafting of input features to enable the use of simple models, such as linear models, for complex tasks. For example, in image classification, rather than using raw pixel values, researchers would design elaborate features like histograms of gradients to capture the essential characteristics of the images. This feature engineering was a significant part of machine learning before the advent of deep learning, which shifted the focus towards learning representations directly from data.

Neural Networks: Feature Learning Motivation

Prior to the rise of neural networks, complex linear models with sparsity and specialized features were the dominant methods in machine learning competitions, as late as 2012. These methods remained popular until the early 2010s, even though deep learning had been explored throughout that period.

The traditional approach in machine learning involved manually crafting an expressive set of features and writing feature functions in code, which was considered state-of-the-art for a long time. However, a shift in perspective led to the idea of making the features themselves a learnable function. The goal was to extend the learning process not only to the functional mapping that creates the output but also to the representation of the input itself. This would involve applying optimization to both the learnable feature function and the parameters on top of that function.

Linear Functions and Their Limitations

Initially, the simplest approach to creating a feature mapping \(\phi\) from \(\mathbb{R}^n\) to \(\mathbb{R}^d\) would be to use a linear function, defined as \(\phi(x) = W^{\top}x\), where \(W\) is an \(n \times d\) matrix. However, this approach is flawed because it does not increase representational power. The reason is that the composition of linear functions is still a linear function, which can be represented as \(\theta^{\top}W^{\top}x = (W\theta)^{\top}x = \tilde{\theta}^{\top}x\), where \(\tilde{\theta} = W\theta\) is an \(n \times k\) matrix. This shows that simply stacking linear functions does not create a more expressive hypothesis class.
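This collapse is easy to verify numerically. The following quick sketch (with arbitrary dimensions) checks that the stacked linear model agrees with the single matrix \(\tilde{\theta} = W\theta\):

import torch

torch.manual_seed(0)
n, d, k = 4, 8, 3
x = torch.randn(n)
W = torch.randn(n, d)        # linear "feature" matrix
theta = torch.randn(d, k)    # linear classifier on top of the features
theta_tilde = W @ theta      # the equivalent single n x k matrix

print(torch.allclose(theta.T @ (W.T @ x), theta_tilde.T @ x))  # True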

Introduction of Nonlinearities

To overcome the limitations of linear functions, the proposal was to apply a nonlinear function \(\sigma\) to the output of the linear function, creating a feature mapping \(\phi(x) = \sigma(W^{\top}x)\). Here \(\sigma\) can be any nonlinearity, such as the sigmoid function, the hyperbolic tangent (tanh), or the rectified linear unit (ReLU), defined as \(\sigma(x) = \max(0, x)\). The introduction of nonlinearity prevents the simplification that occurred with linear functions: the hypothesis function \(h_{\theta}(x) = \theta^{\top}\sigma(W^{\top}x)\) cannot be reduced to a single linear transformation, yielding a more powerful representation that cannot be expressed by a linear model alone.

The term “neural network” is derived from the analogy to neuron activations in the brain, particularly when using sigmoid functions that output values between 0 and 1. This analogy, while useful at a high level, is not at all an accurate model of the brain.

Multilayer Perceptrons (MLPs)

The concept of a two-layer neural network is extended to what is known as a multilayer perceptron (MLP), also referred to as a feed-forward network or a fully connected network. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. The nodes, also known as neurons, in these networks apply a nonlinear activation function to the inputs received.

Deep neural networks are essentially a series of transformations applied to the input data. The idea is to learn features of features, creating a hierarchy of feature representations. The hypothesis function for an MLP is defined as a series of matrix multiplications and nonlinear transformations applied to the input vector.

The hypothesis function \(h_{\theta}(x)\) for an MLP with \(L\) layers is given by: \[ h_{\theta}(x) = W_L^T \sigma(W_{L-1}^T \sigma(\dots \sigma(W_2^T \sigma(W_1^T x))\dots)) \]

Here, \(\sigma\) represents the nonlinear activation function applied element-wise, and \(W_i\) represents the weight matrix for the \(i\)-th layer. The parameter set \(\theta\) is now considered to be a collection of these weight matrices, \(\theta = \{W_1, W_2, \dots, W_L\}\).
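Written as code, this hypothesis function is just a loop over the layers. The sketch below assumes ReLU for \(\sigma\) and, as in the formula above, applies no nonlinearity after the final layer:

import torch

def mlp_forward(x, Ws):
    # h(x) = W_L^T sigma(W_{L-1}^T sigma(... sigma(W_1^T x) ...))
    for W in Ws[:-1]:
        x = torch.relu(W.T @ x)
    return Ws[-1].T @ x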

The choice of activation function \(\sigma\) is crucial in neural networks. Common activation functions include the rectified linear unit (ReLU), hyperbolic tangent (tanh), and the sigmoid function. Each of these functions introduces nonlinearity into the model, allowing the network to capture complex patterns in the data.

Bias terms are often added to the neurons in a network to shift the activation function to the left or right, which can help with learning. However, networks can perform well even without explicit bias terms, as the network can learn to use the weights to compensate for the lack of biases.

Gradient Descent for Two-Layer Neural Networks

For training deep neural networks, we need to compute the gradient of the loss function with respect to the parameters \(\theta\). This involves differentiating the hypothesis function with respect to each weight matrix in the network. The process of computing these gradients manually can be complex, but modern tools like PyTorch offer automatic differentiation, which simplifies this process. However, we can compute the gradients manually for a two-layer neural network (NN).

The hypothesis class for this network is defined by the function \(h_{\theta}(x)\), which represents the output of the network given an input vector \(x\). The two-layer NN is mathematically represented as:

\[ h_{\theta}(x) = W_2^T \sigma(W_1^T x) \]

where \(\sigma\) denotes the activation function applied element-wise to the vector resulting from the matrix-vector product \(W_1^T x\). The activation function used in this context is the Rectified Linear Unit (ReLU), defined as \(\sigma(x) = \max(x, 0)\). This can also be written in batch form (where \(X \in \mathbb{R}^{m \times n}\) contains a mini-batch of \(m\) inputs as its rows) as

\[ h_{\theta}(X) = \sigma(X W_1) W_2 \]

The matrices \(W_1\) and \(W_2\) are the weight matrices for the first and second layers, respectively. The dimensions of these matrices are as follows:

- \(W_1 \in \mathbb{R}^{n \times d}\), and
- \(W_2 \in \mathbb{R}^{d \times k}\).

This setup allows the neural network to transform an input vector of size \(n\) into an output vector of size \(k\) through a hidden layer of size \(d\). The ReLU activation function ensures non-linearity in the transformation, enabling the network to represent nonlinear functions.

Computing Gradients

The goal is to minimize the cross-entropy loss \(\ell_{ce}(h_{\theta}(X), Y)\) over \(\theta\), which involves computing the gradients of the loss function with respect to the parameters \(W_1\) and \(W_2\). The optimization method used is stochastic gradient descent (SGD), which updates the weights by subtracting a small multiple of the gradients from the current weights, iterated over many mini-batches.

To update the weights, the gradients of the loss function with respect to \(W_1\) and \(W_2\) must be computed: \[ \nabla_{W_1} \ell_{ce}(h_{\theta}(X), Y), \nabla_{W_2} \ell_{ce}(h_{\theta}(X), Y) \]

Gradient of the Loss with Respect to \(W_2\)

Let’s start with the gradient with respect to \(W_2\). To make the computation easier, we use the same trick as before, of deriving the gradient by first assuming that everything is a scalar, and then adjusting sizes to fit. Specifically, taking gradients we have \[ \begin{split} \frac{\partial \ell_{ce}(\sigma(X W_1)W_2, Y)}{\partial W_2} & = \frac{\partial \ell_{ce}(\sigma(X W_1)W_2, Y)}{\partial \sigma(X W_1)W_2} \cdot \frac{\partial \sigma(X W_1)W_2}{\partial W_2} \\ & = (\text{softmax}(\sigma(XW_1)W_2) - I_Y) \cdot \sigma(X W_1) \end{split} \]

where we use the gradient of the cross-entropy loss derived in the previous notes for the first term. Noting that:

- the gradient must have the same size as \(W_2\), i.e., \(d \times k\),
- \(\text{softmax}(\sigma(XW_1)W_2) - I_Y \in \mathbb{R}^{m \times k}\), and
- \(\sigma(XW_1) \in \mathbb{R}^{m \times d}\),

the gradient has to be given by

\[ \nabla_{W_2} \ell_{ce}(h_{\theta}(X), Y) = (\sigma(XW_1))^T(\text{softmax}(\sigma(XW_1)W_2) - I_Y) \]

Gradient of the Loss with Respect to \(W_1\)

Computing the gradient with respect to \(W_1\) is slightly more complex. Again treating quantities as scalars and applying the chain rule, we have

\[ \begin{split} \frac{\partial \ell_{ce}(\sigma(X W_1)W_2, Y)}{\partial W_1} & = \frac{\partial \ell_{ce}(\sigma(X W_1 )W_2, Y)}{\partial \sigma(X W_1 )W_2} \cdot \frac{\partial \sigma(X W_1 )W_2}{\partial \sigma(X W_1 )} \cdot \frac{\partial \sigma(X W_1)}{\partial X W_1} \cdot \frac{\partial X W_1}{\partial W_1} \\ & = (\text{softmax}(\sigma(XW_1)W_2) - I_Y) \cdot W_2 \cdot \sigma'(X W_1) \cdot X \end{split} \]

Here the derivative of the ReLU is given by the indicator function of its argument being greater than 0, \[ \sigma'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \] applied elementwise to its argument. The sizes of the terms in this derivative are:

- \(\text{softmax}(\sigma(XW_1)W_2) - I_Y \in \mathbb{R}^{m \times k}\),
- \(W_2 \in \mathbb{R}^{d \times k}\),
- \(\sigma'(XW_1) \in \mathbb{R}^{m \times d}\), and
- \(X \in \mathbb{R}^{m \times n}\),

while the gradient itself must have the same size as \(W_1\), i.e., \(n \times d\). Making sizes match, the gradient of the loss with respect to \(W_1\) is given by \[ \nabla_{W_1} \ell_{ce}(\sigma(XW_1) W_2, Y) = X^T \left(\sigma'(XW_1) \circ ( \text{softmax}(\sigma(XW_1)W_2) - I_Y ) W_2^T \right) \] where \(\circ\) represents the element-wise (Hadamard) product.
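As a sanity check, both formulas can be compared against PyTorch’s automatic differentiation on a small random problem. This is just a quick sketch; it uses F.cross_entropy with reduction="sum", which matches the summed loss implicit in the derivations above:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
m, n, d, k = 5, 4, 6, 3
X, Y = torch.randn(m, n), torch.randint(k, (m,))
W1 = torch.randn(n, d, requires_grad=True)
W2 = torch.randn(d, k, requires_grad=True)

# Forward pass and autograd gradients
H = torch.relu(X @ W1) @ W2
F.cross_entropy(H, Y, reduction="sum").backward()

# Manual gradients from the formulas above
G = torch.softmax(H, -1) - F.one_hot(Y, k).float()
grad_W2 = torch.relu(X @ W1).T @ G
grad_W1 = X.T @ ((X @ W1 > 0).float() * (G @ W2.T))

print(torch.allclose(W2.grad, grad_W2, atol=1e-5))  # True
print(torch.allclose(W1.grad, grad_W1, atol=1e-5))  # True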

Implementation of Gradient Descent for a Two-Layer Network

We can use these derivatives to train a simple two-layer network. The function epoch_two_layer_nn below computes one epoch of training (or evaluation). The parameters passed to this function include the input data X_full, the labels Y_full, the weights W1 and W2, the batch size, and the step size alpha. The ReLU activation function is assumed to be used in the network.

def epoch_two_layer_nn(X_full, Y_full, W1, W2, batch_size=100, alpha=0.0):
    # loss_cross_entropy, loss_01, and one_hot are the helpers from the previous notes
    m,n = X_full.shape
    d,k = W2.shape
    losses = []
    errs = []

    for X,Y in zip(X_full.split(batch_size), Y_full.split(batch_size)):
        # Forward pass: Z1 = X W1, Z2 = ReLU(Z1), H = Z2 W2
        Z1 = X@W1
        Z2 = torch.maximum(Z1, torch.tensor(0.))
        H = Z2 @ W2
        losses.append(loss_cross_entropy(H,Y))
        errs.append(loss_01(H, Y))

        # Gradient computation and SGD update (skipped when alpha == 0)
        if alpha > 0:
            G2_ = (torch.softmax(H, -1) - one_hot(Y, k))   # softmax(H) - I_Y
            G1_ = (G2_ @ W2.T) * (Z1 > 0)                  # backprop through the ReLU

            W1 -= (alpha / X.shape[0]) * X.T @ G1_
            W2 -= (alpha / X.shape[0]) * Z2.T @ G2_
    return torch.tensor(losses), torch.tensor(errs)

The function takes the following parameters:

- X_full, Y_full: The input data and corresponding labels.
- W1, W2: The weight matrices for the two layers of the neural network.
- batch_size: The size of the mini-batch for training.
- alpha: The step size for the gradient update.

The function is designed to update the weights W1 and W2 in place, following a non-functional programming style typical in Python. The batch size defaults to 100, and the step size alpha defaults to 0.0, in which case no updates are performed and the function only evaluates the network.

Intermediate Computation and Loss Calculation

Intermediate terms are computed so that they can be reused, improving efficiency. The term Z1 is the pre-activation output of the first layer, and Z2 is the result of applying the ReLU activation to Z1. The final output H is the result of applying the second layer’s weights to Z2. The loss is computed using the cross-entropy loss function applied between the output H and the true labels Y, and the per-mini-batch losses are accumulated in a list.

Initialization of Weights

The lecture concludes with a discussion on the initialization of weights. Initializing the weight matrices \(W_1\) and \(W_2\) to all zeros would result in zero gradients: the all-zeros point is a saddle point of the optimization landscape, and gradient descent would never move away from it. This is a significant departure from the linear classification case, where the initialization does not affect convergence due to the absence of local optima. In contrast, neural networks have local optima, and the initialization of weights becomes an important consideration.

To avoid poor initial values, such as a saddle point where the gradients are zero but the loss is suboptimal, the weights are initialized randomly. This random initialization moves the starting point away from such degenerate points and allows gradient descent to update the weights effectively. The weights are initialized as random matrices drawn from a Gaussian distribution, with dimensions determined by the input data and the number of classes: if \(n\) is the number of features, \(d\) is the dimension of the hidden layer, and \(k\) is the number of classes, then the weight matrix \(W_1\) has dimensions \(n \times d\), and \(W_2\) has dimensions \(d \times k\).

It is recommended to normalize the weight matrices by the square root of the dimension of the previous layer, which ensures that the activations at later layers have a variance on the order of one (this is discussed more in a subsequent lecture). Concretely, this normalization can be expressed as multiplying the random matrix by \(\sqrt{\frac{2}{n}}\) for \(W_1\) and \(\sqrt{\frac{2}{d}}\) for \(W_2\). This normalization is particularly important for larger neural networks and will be discussed in more detail later.

Performance of the neural network

Our final training of the network can be accomplished with the following code:

n = 784       # input dimension (28x28 pixel MNIST images)
d = 500       # hidden layer dimension
k = 10        # number of classes
epochs = 10

# Random Gaussian initialization, scaled by the square root of 2 / (previous layer size)
W1 = torch.randn(n,d) * np.sqrt(2/n)
W2 = torch.randn(d,k) * np.sqrt(2/d)

print("Epoch | Train Loss | Test Loss | Train Err | Test Err |")
for t in range(epochs):
    # One epoch of SGD on the training set, then an evaluation-only pass (alpha=0.0) on the test set
    train_losses, train_errs = epoch_two_layer_nn(X_train, Y_train, W1, W2, batch_size=100, alpha=0.5)
    test_losses, test_errs = epoch_two_layer_nn(X_test, Y_test, W1, W2, batch_size=100, alpha=0.0)
    print(f"{t}     | {train_losses.mean():.5f}    | {test_losses.mean():.5f}   | {train_errs.mean():.5f}   | {test_errs.mean():.5f}  |")

In practice, it is common to print out the loss and error metrics periodically to track the training progress. Tweaking the learning rate and the initialization scale can lead to better training conditions and improved model performance. Running the above code for 10 epochs, the two-layer neural network is trained on the MNIST dataset in 2.2 seconds, achieving an error rate of 2.6%. This performance is highlighted as being significantly better than any linear classifier, which typically achieves around 7.5% error.

Backpropagation for general multilayer MLPs

An L-layer neural network extends the previously discussed two-layer neural network. The notation used for an L-layer network includes the use of \(Z_i\) to represent the activations of intermediate layers. The relationship between layers is defined recursively, with \(Z_{i+1}\) being the result of applying a linear transformation followed by a non-linear activation function to \(Z_i\). This can be expressed as:

\[ Z_{i+1} = \sigma(Z_i W_i) \]

where \(\sigma\) represents the non-linear activation function, and \(W_i\) is the weight matrix associated with the \(i\)-th layer. The process starts with \(Z_1\) being equal to the input \(X\), and the output of the network is denoted as \(Z_{L+1}\), which is also the function \(h_\theta(X)\). In all cases we assume that \(Z_i \in \mathbb{R}^{m \times n_i}\) and thus \(W_i \in \mathbb{R}^{n_i \times n_{i+1}}\).

Derivation of Backpropagation Equations

Backpropagation is the name of the method used to derive gradients for weights in such a network: the process is essentially a combination of the chain rule and proper caching of intermediate terms. Again, the discussion here will proceed by first assuming all terms are scalars, then eventually compute the size of all needed terms by matching sizes. The gradient of the loss with respect to \(W_i\) is computed using the chain rule, which can be written as:

\[ \frac{\partial \ell(Z_{L+1}, y)}{\partial W_i} = \frac{\partial \ell}{\partial Z_{L+1}} \frac{\partial Z_{L+1}}{\partial Z_L} \cdots \frac{\partial Z_{i+1}}{\partial W_i} \]

The lecture emphasizes the cumbersome nature of writing out every single derivative in the chain rule application. To simplify, the notation is condensed, and the focus is placed on understanding the process rather than the detailed computation for each layer. The recursive nature of the backpropagation algorithm is highlighted, where the derivative of each layer’s output with respect to its input is computed in a backward pass through the network.

If we were to write out a similar expression for each \(W_i\), we would find that many terms are repeated. To avoid repetitive computation, the backpropagation method caches intermediate terms and reuses them to compute all the needed gradients. Specifically, the gradient term \(G_i\) is defined as \[ G_i = \frac{\partial \ell(Z_{L+1}, y)}{\partial Z_i} \]

Using the chain rule as above, we can also write this equation recursively as \[ G_i = G_{i+1} \frac{\partial Z_{i+1}}{\partial Z_i} = G_{i+1} \frac{\partial \sigma(Z_i W_i)}{\partial Z_i W_i} \frac{\partial Z_i W_i}{\partial Z_i} = G_{i+1} \cdot \sigma'(Z_i W_i) \cdot W_i \]

Noting that \(G_i\) must be an \(m \times n_i\) matrix, and noting that

- \(G_{i+1} \in \mathbb{R}^{m \times n_{i+1}}\),
- \(\sigma'(Z_i W_i) \in \mathbb{R}^{m \times n_{i+1}}\), and
- \(W_i \in \mathbb{R}^{n_i \times n_{i+1}}\),

we have that the correct backward iteration is given by \[ G_i = (G_{i+1} \circ \sigma'(Z_i W_i)) W_i^T \] with the initial condition \[ G_{L+1} = \nabla_{Z_{L+1}} \ell(Z_{L+1},Y) \] which, for the cross-entropy loss, is given by \(G_{L+1} = \text{softmax}(Z_{L+1}) - I_Y\). This iteration ensures that each gradient \(G_i\) has the correct dimensions of \(m \times n_i\).

Gradient with Respect to Weight Matrix

To update the weights during training, the gradient of the loss with respect to the weight matrix \(W_i\) is needed. This derivative is computed as:

\[ \frac{\partial \ell}{\partial W_i} = \frac{\partial \ell}{\partial Z_{i+1}} \cdot \frac{\partial Z_{i+1}}{\partial W_i} = G_{i+1} \frac{\partial Z_{i+1}}{\partial W_i} \]

This can be expanded using the chain rule as above to \[ \frac{\partial \ell}{\partial W_i} = G_{i+1} \cdot \sigma'(Z_i W_i) \cdot Z_i \]

Matching sizes as before, the correctly-sized gradient of the loss with respect to \(W_i\) is given by \[ \nabla_{W_i} \ell(Z_{L+1},Y) = Z_i^T (G_{i+1} \circ \sigma'(Z_i W_i)) \]

The resulting gradient has the dimensions that match the weight matrix \(W_i\).

Neural Network Forward and Backward Passes

The complete steps for computing all necessary gradients in the network can thus be expressed in terms of a forward pass that computes all the activations, and a backward pass that computes all the gradients. Specifically, we can write these as

Forward Pass

  1. Initialize the first layer with the input data: \(Z_1 = X\).
  2. For each layer \(i\) from 1 to the number of layers \(L\), compute the next layer as \(Z_{i+1} = \sigma(Z_i W_i)\), where \(\sigma\) is the activation function, \(Z_i\) is the output of the previous layer, and \(W_i\) is the weight matrix for layer \(i\).

Backward Pass

  1. Initialize the gradient of the last layer with respect to the loss: \(G_{L+1} = \nabla_{Z_{L+1}} \ell(Z_{L+1},Y)\).
  2. For each layer \(i\) from \(L\) down to 1, compute the gradient for that layer’s weights as \(\nabla_{W_i} \ell = Z_i^T (G_{i+1} \circ \sigma'(Z_i W_i))\), and then propagate the gradient backward one layer via \(G_i = (G_{i+1} \circ \sigma'(Z_i W_i)) W_i^T\), as shown in the sketch below.
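Putting the two passes together, the following sketch computes all the weight gradients for an \(L\)-layer ReLU network. It assumes the cross-entropy initial condition above, leaves the final layer linear (as in the two-layer example, rather than applying \(\sigma\) at the last layer), and reuses the one_hot helper from the earlier code:

import torch

def mlp_gradients(X, Y, Ws):
    L = len(Ws)
    k = Ws[-1].shape[1]

    # Forward pass: cache Z_1 = X and every hidden activation
    Z = [X]
    for W in Ws[:-1]:
        Z.append(torch.relu(Z[-1] @ W))
    H = Z[-1] @ Ws[-1]                   # network output Z_{L+1} (linear last layer)

    # Backward pass, starting from G_{L+1} = softmax(Z_{L+1}) - I_Y
    G = torch.softmax(H, -1) - one_hot(Y, k)
    grads = [None] * L
    grads[-1] = Z[-1].T @ G              # linear last layer: no sigma' term
    G = G @ Ws[-1].T
    for i in range(L - 2, -1, -1):
        S = G * (Z[i + 1] > 0).float()   # G_{i+1} * sigma'(Z_i W_i), elementwise
        grads[i] = Z[i].T @ S            # gradient for W_i
        G = S @ Ws[i].T                  # G_i, passed one layer down
    return grads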

Automatic Differentiation

In practice, you will not want to compute the manual derivation of the backpropagation equations, especially not for complex network architectures. Instead, automatic differentiation is a method that enables the automatic calculation of derivatives in a way that is both efficient and exact. It is based on the construction of a computational graph, which is a directed acyclic graph (DAG) that represents the sequence of operations performed during a computation. In this graph, squares represent variables, which are intermediate terms computed during the process, and circles represent functions, which are operations applied to these variables. The edges in the graph indicate the direction of computation.

Example of a Computational Graph

To represent a computation in a graph, consider the operation \(z_{i+1} = \sigma(W_i^T z_i + b_i)\), where \(\sigma\) is an activation function, \(W_i\) is a weight matrix, \(b_i\) is a bias term, and \(z_i\) is the input vector. This operation can be decomposed into a series of steps:

  1. Multiply \(W_i^T\) and \(z_i\).
  2. Add the result to the bias term \(b_i\).
  3. Apply the activation function \(\sigma\) to obtain \(z_{i+1}\).

Each step corresponds to a node in the graph, and the edges represent the flow of data through these operations.
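In code, the same decomposition might look like the following, where each line corresponds to one function node in the graph (an illustrative sketch with arbitrary dimensions):

import torch

W, b = torch.randn(4, 3), torch.randn(3)
z = torch.randn(4)

a = W.T @ z                 # node 1: matrix multiplication
s = a + b                   # node 2: addition of the bias term
z_next = torch.sigmoid(s)   # node 3: the activation function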

Gradients in Computation Graphs

The computation graph’s final output must be a scalar, often represented as a loss function in machine learning. To facilitate backpropagation, each node in the graph maintains the derivative of the loss with respect to that node, which is equivalent to the gradient of the loss with respect to the node’s value.

For example, if the final node represents the loss \(\ell\), the derivative of the loss with respect to itself is always one, \(\frac{\partial \ell}{\partial \ell} = 1\). To compute the derivatives for other nodes, the chain rule is applied. For a function \(f(x, y)\) that contributes to the loss, the derivative of the loss with respect to \(y\) is given by:

\[ \frac{\partial \ell}{\partial y} = \frac{\partial \ell}{\partial z} \cdot \frac{\partial z}{\partial y} \]

where \(z = f(x, y)\). The term \(\frac{\partial \ell}{\partial z}\) is obtained from the downstream nodes of the graph, assuming that the derivatives for all downstream nodes have already been computed.

To implement a function within an automatic differentiation framework, two key components must be established: the forward call (i.e., just implementing the function \(f\)) and the ability to compute this product of derivatives, known as a vector-Jacobian product.

  1. Forward Call Implementation: compute the function’s output from its inputs, caching any intermediate values needed for the backward pass.
  2. Vector-Jacobian Product (VJP) Implementation: given the derivative of the loss with respect to the function’s output, \(\frac{\partial \ell}{\partial z}\), return the derivative of the loss with respect to each of the function’s inputs.

The process of backpropagation in automatic differentiation frameworks relies on local information. This means that to compute the derivative of a function at a particular point in the computation graph, only the function itself and the terms already computed in the graph are needed. There is no requirement for knowledge of the graph’s structure before or after the function in question.

Example: Matrix Multiplication Function

Consider a matrix multiplication function \(z = x y\) with matrices \(x \in \mathbb{R}^{m \times n}\) and \(y \in \mathbb{R}^{n \times p}\).

To compute the gradient of the loss \(L\) with respect to \(y\), the chain rule is applied: \[ \frac{\partial L}{\partial y} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial y} \] Given that \(z = x y\), the derivative \(\frac{\partial z}{\partial y}\), treated as a scalar quantity, is simply \(x\). Therefore, the (scalar-form) gradient with respect to \(y\) is: \[ \frac{\partial L}{\partial y} = \frac{\partial L}{\partial z} x \]

The dimensions of the matrices are important for the computation. If \(x\) is an \(m \times n\) matrix and \(y\) is an \(n \times p\) matrix, then \(z\) will be an \(m \times p\) matrix. The gradient \(\frac{\partial L}{\partial y}\) should also be an \(n \times p\) matrix. This is achieved by computing the transpose of \(x\) multiplied by the gradient of \(L\) with respect to \(z\): \[ \frac{\partial L}{\partial y} = x^T \frac{\partial L}{\partial z} \]

Similarly, the gradient of the loss \(L\) with respect to \(x\) is computed as: \[ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} y^T \]

Implementation

In practice, the implementation of the VJP and the forward call can be done by defining Python functions that perform the necessary operations. For example, a simple Python function for the forward pass of matrix multiplication might look like this:

def matmul_forward(x, y):
    return x @ y

The backward pass would then look something like:

def matmul_backward(grad_l, x, y):
    # grad_l is dL/dz; x and y would be cached by the framework from the forward pass
    return grad_l @ y.T, x.T @ grad_l
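These two functions can be checked against PyTorch’s own autograd; in this quick sketch, the all-ones tensor stands in for the incoming gradient \(\frac{\partial L}{\partial z}\):

import torch

x = torch.randn(3, 4, requires_grad=True)
y = torch.randn(4, 2, requires_grad=True)

z = matmul_forward(x, y)
z.sum().backward()           # makes the upstream gradient dL/dz all ones

gx, gy = matmul_backward(torch.ones_like(z), x, y)
print(torch.allclose(gx, x.grad), torch.allclose(gy, y.grad))  # True True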

At this point, the course will transition from the use of manual gradients to automatic differentiation libraries like PyTorch to actually compute gradients. Although we won’t implement our own automatic differentiation library for this course, you will see how these operations work using the PyTorch library.