Lecture 3: Multi-layer Perceptron (2024)

The Perceptron

In the last lecture, we discussed supervised learning with a linear hypothesis class of the form

$$y = \bb{w}^\Tr \bb{x} + b,$$

parametrized by $n$ weights $\bb{w} = (w_1,\dots,w_n)$ and a bias $b$. In the machine learning literature, this family of functions (or “architecture”, as we shall call it in the sequel) is known as a (linear) perceptron.

We have seen that in the case of logistic regression (which, despite the name, is a binary classification problem) the scalar output $y$ of the hypothesis was further fed into the logistic (a.k.a. sigmoid) function

$$\sigma(y) = \frac{1}{1+e^{-y}}.$$

This can be viewed as a two-dimensional output of the form

$$\left( \sigma(y),\, 1-\sigma(y) \right),$$

which can be interpreted as the vector of probabilities of the instance $\bb{x}$ belonging to each of the two classes.

Using this perspective, the linear perceptron model can be generalized to the $k$-class case according to

$$\bb{y} = \mathrm{softmax}(\bb{W}\bb{x} + \bb{b}) = \frac{ e^{\bb{W}\bb{x} + \bb{b}} }{ \bb{1}^\Tr e^{\bb{W}\bb{x} + \bb{b}} },$$

where $\bb{W}$ is a $k \times n$ weight matrix whose rows are denoted as $\bb{w}_i$, $\bb{b}$ is a $k$-dimensional bias vector, and $\bb{1}$ is an appropriately-sized vector of ones. This generalization of the logistic function, used to normalize the output into the form of a vector of probabilities, is known as softmax. Softmax is a function of the form

$$\mathrm{softmax}(\bb{z}) = \frac{e^{\bb{z}}}{\bb{1}^\Tr e^{\bb{z}}}$$

(with the exponential applied element-wise) that highlights the maximal value in the vector $\bb{z}$ and suppresses other elements that are significantly lower than the maximum.
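As a quick numerical illustration (a minimal NumPy sketch, not part of the original notes), the following computes a numerically stable softmax and shows how it highlights the maximal entry of $\bb{z}$:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 5.0])
print(softmax(z))  # ~[0.017, 0.047, 0.936] -- the maximal entry dominates
```

Subtracting the maximum before exponentiating does not change the result but avoids numerical overflow.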

Adding layers

The linear perceptron model is rather limited due to its linearity. For example, it cannot produce the XOR function. A much more powerful family of functions is obtained by applying a non-linearity to the output of a linear perceptron and concatenating several such models. We define the $i$-th layer as

$$\bb{y}_i = \varphi_i( \bb{W}_i \bb{y}_{i-1} + \bb{b}_i )$$

for $i=1,\dots,L$, where $\bb{y}_{i-1}$ is an $n_{i-1}$-dimensional input, $\bb{y}_i$ is an $n_{i}$-dimensional output, $\bb{W}_i$ is an $n_i \times n_{i-1}$ matrix of weights (whose rows are denoted as $\bb{w}^{i}_1,\dots, \bb{w}^{i}_{n_{i}}$), $\bb{b}_i$ is an $n_i$-dimensional bias vector, and $\varphi_i : \RR \rightarrow \RR$ is a non-linear function applied element-wise. Setting $\bb{y} = \bb{y}_L$ and $\bb{y}_0 = \bb{x}$, a multi-layer perceptron (MLP) with $L$ layers is obtained. The MLP can be described by the following input-to-output map

$$\bb{y} = h(\bb{x}) = \varphi_L\big( \bb{W}_L \, \varphi_{L-1}\big( \cdots \varphi_1( \bb{W}_1 \bb{x} + \bb{b}_1 ) \cdots \big) + \bb{b}_L \big),$$

parametrized by the weight matrices $\{ \bb{W}_1,\dots,\bb{W}_L \}$ and bias vectors $\{ \bb{b}_1,\dots,\bb{b}_L \}$, which we will collectively denote as a pseudo-vector $\bb{\Theta}$.
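To make the added expressive power concrete, here is a small NumPy sketch of a two-layer MLP that realizes the XOR function mentioned above; the ReLU non-linearity and the hand-picked (not learned) weights are illustrative choices, not part of the lecture:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Layer 1: two hidden neurons, W1 is 2x2, b1 is 2-dimensional.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])

# Layer 2: one output neuron combining the hidden activations.
W2 = np.array([[1.0, -2.0]])
b2 = np.array([0.0])

def mlp(x):
    y1 = relu(W1 @ x + b1)   # first layer:  y1 = phi(W1 x + b1)
    y2 = W2 @ y1 + b2        # second layer: linear output
    return y2

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, mlp(np.array(x, dtype=float)))  # prints 0, 1, 1, 0 -- XOR
```

No single linear perceptron can reproduce this truth table, since the four input points are not linearly separable.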

Graphically, the $i$-th layer can be thought of as a weighted directed graph connecting each of the $n_{i-1}$ inputs to $n_i$ sum nodes, with the weights given by the elements of $\bb{W}_i$. The output of each sum node undergoes a non-linearity, and together the $n_i$ outputs form the input of the following layer. Because of its (deliberate) resemblance to biological neural networks, the MLP is called an (artificial) neural network. In the jargon of artificial neural networks, each sub-graph of the form $y^i_j = \varphi_i ( \bb{y}_{i-1}^\Tr \bb{w}^{i}_j + b^i_j )$ is called a neuron (the $j$-th neuron in the $i$-th layer), its non-linearity $\varphi_i$ is called an activation function, and its output $y^i_j$ an activation. The MLP is a feedforward neural network, since the graph is acyclic – the data flow forward from the input to the output without feedback loops.

Unlike their single-layered linear counterparts, MLPs constitute a potent hypothesis class. In fact, even with just two layers, MLPs were shown to be universal approximators – their weights can be selected to approximate any function under mild technical conditions, provided they have enough degrees of freedom (a sufficiently large number of weights).

Non-linearity

Various functions can be used as the element-wise non-linearities (activation functions) of the MLP. Older neural networks used the logistic function (a.k.a. sigmoid)

$$\varphi(t) = \frac{1}{1+e^{-t}},$$

saturating the input in $\RR$ between $0$ and $1$, or its shifted and scaled version, the hyperbolic tangent

$$\varphi(t) = \tanh(t) = \frac{e^{t}-e^{-t}}{e^{t}+e^{-t}},$$

saturating between $-1$ and $1$.
The arctangent function also has a sigmoid-like behavior.

However, due to numerical issues that will be discussed in the sequel, these functions have nowadays been almost universally replaced by the rectifier function (a.k.a. rectified linear unit or ReLU)

$$\varphi(t) = \max(t, 0).$$

Note that this function has a derivative of exactly $0$ on $(-\infty,0)$ and exactly $1$ on $(0,\infty)$, and is non-smooth at $0$. These facts justifying its choice will be discussed in the sequel.
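The following NumPy sketch (illustrative, not from the notes) tabulates the logistic and rectifier activations together with their derivatives, making the saturation of the sigmoid and the $\{0,1\}$-valued derivative of ReLU visible:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def d_sigmoid(t):
    s = sigmoid(t)
    return s * (1.0 - s)          # derivative of the logistic function

def relu(t):
    return np.maximum(t, 0.0)

def d_relu(t):
    return (t > 0).astype(float)  # 0 for negative inputs, 1 for positive

t = np.array([-5.0, -1.0, 0.5, 5.0])
print(sigmoid(t), d_sigmoid(t))   # derivative vanishes for large |t| (saturation)
print(relu(t), d_relu(t))         # derivative is exactly 0 or 1
```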

In addition to element-wise non-linearities, modern neural networks sometimes use “horizontal” non-linearities acting on the entire activation vector. One typical choice of such a non-linearity, adopted in classification networks, is a softmax function applied to the activation of the last (output) layer. Other non-linearities of this kind are pooling operations that will be discussed in the sequel.

Supervised training

Now equipped with a new, richer hypothesis class, let us zoom out to see the whole picture. In the supervised learning problem, we are given a finite sample of labeled training instances $\{ (\bb{x}_i, y_i) \}_{i=1}^N$. We then select a hypothesis that minimizes the empirical (in-sample) loss function,

$$L = \frac{1}{N} \sum_{i=1}^N \ell( h(\bb{x}_i), y_i ).$$

In our terms, this minimization problem can be written as

$$\min_{\bb{\Theta}} \frac{1}{N} \sum_{i=1}^N \ell_i( h_{\bb{\Theta}}(\bb{x}_i) ),$$

where $h_{ \bb{\Theta}}$ is the MLP parametrized by the pseudo-vector $\bb{\Theta}$. Note that to simplify notation we dropped the dependence of the $i$-th pointwise loss term on $y_i$, denoting it by $\ell_i$. We will henceforth denote the loss function as

$$L(\bb{\Theta}) = \frac{1}{N} \sum_{i=1}^N \ell_i( h_{\bb{\Theta}}(\bb{x}_i) ),$$

emphasizing that we are interested in its dependence on the model parameters $\bb{\Theta}$. Let us now discuss how to minimize it with respect to $\bb{\Theta}$.
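As an illustration, here is a NumPy sketch of the empirical loss for a classifier with softmax output; the cross-entropy pointwise loss $\ell_i$ is an assumption made here for concreteness, since the lecture leaves $\ell$ generic:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def empirical_loss(logits, labels):
    """Average cross-entropy over the training sample (assumed pointwise loss)."""
    probs = softmax(logits)                    # N x k class probabilities
    n = np.arange(len(labels))
    return -np.mean(np.log(probs[n, labels]))  # (1/N) sum_i ell_i

# Toy example: N = 3 instances, k = 2 classes.
logits = np.array([[2.0, -1.0], [0.5, 0.3], [-2.0, 1.0]])
labels = np.array([0, 1, 1])
print(empirical_loss(logits, labels))
```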

Global and local minima

Let us assume that $L$ is a function of an $m$-dimensional argument $\bb{\theta}$ defined on all of $\RR^m$ (we can always parse all the degrees of freedom of our neural network into an $m$-dimensional vector). A point $\bb{\theta}^\ast$ is called a global minimizer of $L$ if for any $\bb{\theta}$, $L(\bb{\theta}) \ge L(\bb{\theta}^\ast)$. The corresponding value of the function, $L(\bb{\theta}^\ast)$, is called a global minimum. The latter term is often (strictly speaking, erroneously) used to denote the minimizer as well. A point $\bb{\theta}^\ast$ is called a local minimizer of $L$ if there exists $\epsilon > 0$ such that $\bb{\theta}^\ast$ is a global minimizer of $L$ on the ball $B_\epsilon(\bb{\theta}^\ast)$.

Unless $L$ satisfies special properties (such as convexity), finding its global minimizer is, in general, an intractable problem. On the other hand, finding a local minimizer is a much easier task, since local minimizers can be characterized using local information (i.e., derivatives). Assuming $L$ is $\mathcal{C}^1$, from elementary multivariate calculus we should recollect the first-order necessary condition for $\bb{\theta}^\ast$ being a local minimizer:

$$\nabla L(\bb{\theta}^\ast) = \bb{0}.$$
Obviously, this is not a sufficient condition – in fact, a local maximum and a saddle point also satisfy it. However, the latter two types of extremal points (characterized by negative curvature) are unstable, which will allow methods such as stochastic gradient descent not to remain stuck at such points.

As a reminder, the gradient of a multivariate function is an operator $\nabla L : \RR^m \rightarrow \RR^m$. At a given point $\bb{\theta}$, it produces a vector $\bb{g} = \nabla L(\bb{\theta})$ satisfying

$$dL = \bb{g}^\Tr \bb{d\theta};$$

in other words, the inner product of the argument change $\bb{d\theta}$ with the gradient yields the differential $dL$.

Gradient descent

We can therefore suggest a very simple iterative strategy for finding a local minimum, which can be summarized as the following “algorithm”:

Starting with some initial guess $\bb{\theta}_0$, repeat for $k=1,2,\dots$

  1. Select a descent direction $\bb{d}_k$
  2. Select a step size $\eta_k$
  3. Update $\bb{\theta}_k = \bb{\theta}_{k-1} + \eta_k \bb{d}_k$
  4. Check the optimality condition at $\bb{\theta}_k$ and stop if a minimum is reached

(In practice, rather than checking the optimality condition, we will run the algorithm for a fixed number of iterations and stop it prematurely based on the value of the cross-validation loss – these details will be discussed further in the course.)

The main ingredient of the above “algorithm” is the choice of the descent direction, i.e., a direction a (small) step in which decreases the value of the function. Let $\bb{\theta}$ be our current iterate (we drop the iteration subscript) and let $\bb{d}$ be a direction. Once a direction is chosen, we can consider a one-dimensional “section” of the $m$-dimensional function $L$,

$$f(t) = L(\bb{\theta} + t \bb{d}).$$

The quantity

$$\frac{\mathrm{d}f}{\mathrm{d}t}(0) = \lim_{t \rightarrow 0} \frac{ L(\bb{\theta} + t \bb{d}) - L(\bb{\theta}) }{t} = \nabla L(\bb{\theta})^\Tr \bb{d}$$

is known as the directional derivative of $L$ at the point $\bb{\theta}$ in the direction $\bb{d}$. A negative directional derivative indicates that a small step in the direction $\bb{d}$ decreases the value of the function. Geometrically, this means that a descent direction forms an obtuse angle with the gradient (or an acute angle with the negative gradient).

Let us now approximate our function linearly around $\bb{\theta}$,

$$L(\bb{\theta} + \bb{d}) \approx L(\bb{\theta}) + \nabla L(\bb{\theta})^\Tr \bb{d},$$

and ask ourselves what direction minimizes the difference $L(\bb{\theta}+\bb{d}) - L(\bb{\theta}) \approx \nabla L(\bb{\theta})^\Tr \bb{d}$ – we could call such a direction the steepest descent direction. Obviously, this linear approximation is unbounded, so we need to normalize the length of $\bb{d}$. Different choices of the norm lead to different answers (so there are many steepest descent directions); in the $\ell_2$ sense we obtain

$$\bb{d} = -\frac{ \nabla L(\bb{\theta}) }{ \| \nabla L(\bb{\theta}) \|_2 }.$$

This choice of the descent direction leads to a family of algorithms known as gradient descent.
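These definitions are easy to verify numerically. The sketch below (NumPy, using an arbitrary quadratic loss as a stand-in for $L$) compares the finite-difference directional derivative with $\nabla L(\bb{\theta})^\Tr \bb{d}$ and checks that the normalized negative gradient attains the most negative value, $-\|\nabla L(\bb{\theta})\|_2$, among unit-norm directions:

```python
import numpy as np

# A toy loss L(theta) = 0.5 * ||A theta - b||^2 with gradient A^T (A theta - b).
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)
L = lambda th: 0.5 * np.sum((A @ th - b) ** 2)
grad = lambda th: A.T @ (A @ th - b)

theta = rng.normal(size=3)
g = grad(theta)

# Directional derivative along a random unit direction: finite difference vs. g.d
d = rng.normal(size=3); d /= np.linalg.norm(d)
t = 1e-6
print((L(theta + t * d) - L(theta)) / t, g @ d)   # approximately equal

# Steepest descent direction in the l2 sense: most negative directional derivative
d_star = -g / np.linalg.norm(g)
print(g @ d_star, -np.linalg.norm(g))             # equal: -||g||
```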

Our next goal is to select the step size $\eta$. Ideally, once we have the direction $\bb{d}$, we would like to solve

$$\eta = \arg\min_{\eta > 0} L(\bb{\theta} + \eta \bb{d}).$$

While there exist various methods, known as line search, to solve such a one-dimensional minimization problem, they usually come at the expense of unaffordable extra complexity. In deep learning, a much more common choice is to use a vanishing sequence of step sizes that starts with some initial $\eta_0$, which is kept for a certain number of iterations and then gradually reduced, e.g., as $1/k$. Using a statistical mechanics metaphor, such a reduction in the step size resembles a decrease in temperature and is therefore referred to as annealing.

Gradient descent can thus be summarized as follows:

Starting with some initial guess $\bb{\theta}_0$, repeat for $k=1,2,\dots$

  1. Select a step size $\eta_k$
  2. Update $\bb{\theta}_k = \bb{\theta}_{k-1} - \eta_k \nabla L(\bb{\theta}_{k-1})$
  3. Check the optimality condition at $\bb{\theta}_k$ and stop if a minimum is reached

We will discuss variants of the gradient descent algorithm that are used in practice in the sequel.
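Putting the pieces together, here is a minimal NumPy sketch of gradient descent with annealing on the same kind of toy quadratic loss as above; the initial step size and the warm-up length are illustrative choices:

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * ||A theta - b||^2 with gradient A^T(A theta - b).
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)
L = lambda th: 0.5 * np.sum((A @ th - b) ** 2)
grad = lambda th: A.T @ (A @ th - b)

theta = np.zeros(3)                          # initial guess theta_0
eta0 = 1.0 / np.linalg.norm(A.T @ A, 2)      # safe initial step size (1 / largest eigenvalue)
warmup = 100                                 # keep eta0 for a number of iterations

print("initial loss:", L(theta))
for k in range(1, 1001):
    eta = eta0 if k <= warmup else eta0 * warmup / k   # then anneal as 1/k
    theta = theta - eta * grad(theta)                  # step along the negative gradient
print("final loss:  ", L(theta))             # the loss has decreased toward its minimum
```

For this quadratic, any step size not exceeding the reciprocal of the largest eigenvalue of $\bb{A}^\Tr\bb{A}$ guarantees a monotone decrease of the loss.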

Error backpropagation

The main computational ingredient in the gradient descent algorithm is the gradient of the loss function w.r.t. the network parameters $\bb{\theta}$. Obviously, since an MLP is just a composition of multi-variate functions, the gradient can be simply computed by invoking the chain rule. However, recall that the output of the network is usually a $k$-dimensional vector, whereas the parameters are a collection of $n_i \times n_{i-1}$ weight matrices and $n_i$-dimensional bias vectors. The gradient of a vector with respect to a matrix (formally termed the Jacobian) is a third-order tensor, which is not exactly nice to work with.

A much more elegant approach to applying the chain rule takes advantage of the layered structure of the network. As an illustration, we start with a two-layer MLP of the form

$$\bb{y} = \varphi\big( \bb{A} \, \phi( \bb{B} \bb{x} ) \big),$$

where $\varphi$ and $\phi$ are the two non-linearities, and $\bb{A}$ and $\bb{B}$ are the two weight matrices. We are ignoring the bias terms for the sake of clarity of exposition. To analyze the influence of the last (second) layer, we denote its input as $\bb{y}' = \phi(\bb{B} \bb{x} )$, and the input to the second-layer activation function as $\bb{z} = \bb{A}\bb{y}'$. In this notation, we have $\bb{y} = \varphi(\bb{A} \bb{y}') = \varphi(\bb{z})$. According to the chain rule,

$$\frac{\partial L}{\partial \bb{A}} = \sum_j \frac{\partial L}{\partial y_j} \, \frac{\partial y_j}{\partial \bb{A}}.$$
For convenience, let us adopt the standard deep learning notation, according to which the derivative of the loss w.r.t. a parameter $\bb{\ast}$ is denoted as $\delta \bb{\ast}$. In our case,

$$\delta \bb{y} = \frac{\partial L}{\partial \bb{y}}$$

is the gradient of the loss w.r.t. its input, and $\delta \bb{A}$ is a matrix whose elements are $\frac{\partial L}{\partial a_{ij} }$, etc. In this notation, we can rewrite

$$\delta \bb{A} = \sum_j \delta y_j \, \frac{\partial y_j}{\partial \bb{A}}.$$

We can write $\frac{\partial y_j }{\partial \bb{A}}$ as a matrix of the size of $\bb{A}$, filled with zeros except for the $j$-th row, which is given by $\varphi'(z_j) \bb{y}^{\prime \Tr}$. Substituting this result into the former sum yields

$$\delta \bb{A} = \mathrm{diag}\{ \delta \bb{y} \} \, \mathrm{diag}\{ \varphi'(\bb{z}) \} \, \bb{1} \, \bb{y}^{\prime \Tr}.$$
To analyze the influence of the first layer, we denote $\bb{z}' = \bb{B}\bb{x}$. To derive the gradient of the loss w.r.t. the first-layer parameters $\bb{B}$, we again invoke the chain rule,

$$\delta \bb{B} = \sum_j \delta y'_j \, \frac{\partial y'_j}{\partial \bb{B}}.$$

As before, $\frac{\partial y'_j }{\partial \bb{B}}$ is a matrix of the size of $\bb{B}$, filled with zeros except for the $j$-th row, which is given by $\phi'(z'_j) \bb{x}^\Tr$, so

$$\delta \bb{B} = \mathrm{diag}\{ \delta \bb{y}' \} \, \mathrm{diag}\{ \phi'(\bb{z}') \} \, \bb{1} \, \bb{x}^\Tr.$$
It remains to derive

$$\delta \bb{y}' = \frac{\partial L}{\partial \bb{y}'}.$$

From $\bb{y} = \varphi(\bb{A} \bb{y}')$, we have

$$\frac{\partial y_j}{\partial y'_l} = \varphi'(z_j) \, a_{jl},$$

from where

$$\delta \bb{y}' = \bb{A}^\Tr \, \mathrm{diag}\{ \varphi'(\bb{z}) \} \, \delta \bb{y}.$$
We can therefore summarize the chain rule in our two-layer MLP as follows: First, we propagate the data forward through the network, computing

$$\bb{z}' = \bb{B}\bb{x}, \quad \bb{y}' = \phi(\bb{z}'), \quad \bb{z} = \bb{A}\bb{y}', \quad \bb{y} = \varphi(\bb{z}).$$

Then, we propagate the derivatives backward through the network:

$$\delta \bb{y} = \nabla L(\bb{y}), \quad \delta \bb{A} = \mathrm{diag}\{ \delta \bb{y} \} \, \mathrm{diag}\{ \varphi'(\bb{z}) \} \, \bb{1} \, \bb{y}^{\prime \Tr}, \quad \delta \bb{y}' = \bb{A}^\Tr \, \mathrm{diag}\{ \varphi'(\bb{z}) \} \, \delta \bb{y}, \quad \delta \bb{B} = \mathrm{diag}\{ \delta \bb{y}' \} \, \mathrm{diag}\{ \phi'(\bb{z}') \} \, \bb{1} \, \bb{x}^\Tr.$$

The entire procedure, known as error backward propagation, or backpropagation for short, can be applied recursively to any number of layers.

Forward pass:

Starting with $\bb{y}_0 = \bb{x}$, compute for $k=1,\dots, L$

  • $\bb{z}_k = \bb{W}_k \bb{y}_{k-1} + \bb{b}_k$

  • $\bb{y}_k = \varphi_k(\bb{z}_k)$

and output $\bb{y} = \bb{y}_L$.

Backward pass:

Starting with $\delta \bb{y}_L = \nabla L( \bb{y} )$, compute for $k=L,L-1,\dots, 1$

  • $\delta \bb{W}_k = \mathrm{diag}\{ \delta \bb{y}_k \} \, \mathrm{diag}\{ \varphi'_k (\bb{z}_k) \} \, \bb{1} \, \bb{y}_{k-1}^\Tr$

  • $\delta \bb{b}_k = \mathrm{diag}\{ \delta \bb{y}_k \} \, \varphi'_k (\bb{z}_k)$

  • $\delta \bb{y}_{k-1} = \bb{W}_k^\Tr \, \mathrm{diag}\{ \varphi'_k(\bb{z}_k) \} \, \delta \bb{y}_k$

We recall that $\delta \bb{W}_k$ and $\delta \bb{b}_k$ are blocks of coordinates of the gradient of the loss $L$ with respect to the network parameters.
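The recursion above translates almost line by line into code. The following NumPy sketch assumes ReLU activations and a squared-error loss, both chosen here only for concreteness:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_prime(z):
    return (z > 0).astype(float)

def forward(x, Ws, bs):
    """Forward pass: store z_k and y_k for use in the backward pass."""
    ys, zs = [x], []
    for W, b in zip(Ws, bs):
        z = W @ ys[-1] + b
        zs.append(z)
        ys.append(relu(z))
    return ys, zs

def backward(ys, zs, Ws, delta_yL):
    """Backward pass: returns the gradient blocks delta_W_k and delta_b_k."""
    dWs, dbs = [], []
    delta_y = delta_yL
    for k in reversed(range(len(Ws))):
        delta_b = delta_y * relu_prime(zs[k])   # delta_b_k
        delta_W = np.outer(delta_b, ys[k])      # delta_W_k = (dy .* phi'(z_k)) y_{k-1}^T
        delta_y = Ws[k].T @ delta_b             # delta_y_{k-1} = W_k^T diag(phi'(z_k)) delta_y_k
        dWs.insert(0, delta_W)
        dbs.insert(0, delta_b)
    return dWs, dbs

# Tiny example: a 3 -> 4 -> 2 MLP with a squared-error loss L(y) = 0.5*||y - t||^2.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [np.zeros(4), np.zeros(2)]
x, t = rng.normal(size=3), rng.normal(size=2)

ys, zs = forward(x, Ws, bs)
delta_yL = ys[-1] - t                           # gradient of the loss w.r.t. the output
dWs, dbs = backward(ys, zs, Ws, delta_yL)
print([dW.shape for dW in dWs])                 # [(4, 3), (2, 4)]
```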

Exploding and vanishing gradients

Backpropagation allows a recursive calculation of the loss gradient w.r.t. the parameters of the network without the need to ever construct the Jacobian matrices of each layer's output w.r.t. its input. Note, however, that in order to compute the gradient w.r.t. the first layer, $\delta \bb{W}_1$, one needs to compute the product of $\varphi'_L(\bb{z}_L),\dots, \varphi'_1 (\bb{z}_1)$. This may lead to numerical instabilities. For example, in a network with $L=20$ layers, a slope of $\varphi' = 2$ in each activation function would be amplified by a factor of $2^{20} \approx 10^6$. Similarly, a slope of $\varphi' = 0.5$ would be diminished to $0.5^{20} \approx 10^{-6}$ – practically zero. This problem is known as vanishing and exploding gradients, and it prevented end-to-end supervised training of deep neural networks from random initialization.
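A quick numerical check of these figures (NumPy, using the per-layer slopes from the example above):

```python
import numpy as np

L = 20
print(np.prod(np.full(L, 2.0)))   # 1048576.0  -- exploding gradient factor (~1e6)
print(np.prod(np.full(L, 0.5)))   # ~9.5e-07   -- vanishing gradient factor (~1e-6)
```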

The introduction of ReLU activations mitigated this problem. In ReLU, the derivative is $1$ for positive arguments and $0$ for negative ones. This implies that, depending on the path through the network from the output back to the inputs, the product of the activation derivatives will always be either $0$ or $1$. The $0$ derivative for negative arguments could still lead to vanishing gradients, but practice shows that, on the contrary, it helps optimization and promotes sparse solutions.

ReLU was probably one of the few significant algorithmic changes to classical neural networks that enabled deep learning.

Convolutional neural networks

The layers of the MLP described so far are termed fully connected in the deep learning literature, due to the fact that every layer input is connected (through some weight) to every output. For large input and output dimensions, such an architecture results in a vast number of degrees of freedom, which increases the network complexity and requires more data to train.

Weight sharing and shift invariance

Weight sharing is a strategy aiming at reducing the layer complexity by reusing the same weights at different parts of the input. For the sake of the following discussion, we assume the input to be discrete and infinitely supported (i.e., a sequence $\bb{x} = \{ x_i \}_{i \in \mathbb{Z}}$). The output is also assumed to be a sequence, $\bb{y} = \{ y_i \}_{i \in \mathbb{Z}}$. Let us consider the output of the $i$-th neuron,

$$y_i = \varphi\Big( \sum_{j \in \mathbb{Z}} w_{ij} x_j + b_i \Big).$$

In many cases, such as audio signals, images, etc., it is reasonable to assume that the same operation is valid at different parts of the signal. Mathematically, this can be expressed by asserting that the action of the neuron commutes with the action of the translation group. Denoting by $\mathcal{T}_m$ the translation by $m$ positions, $(\mathcal{T}_m \bb{x})_j = x_{j-m}$, this leads to demanding

$$y_i(\mathcal{T}_m \bb{x}) = y_{i-m}(\bb{x})$$

for every input $\bb{x}$ and every shift $m$. Since the non-linearity is applied element-wise, the equivalent condition holds on its arguments as well,

$$\sum_{j \in \mathbb{Z}} w_{i,j+m} x_j + b_i = \sum_{j \in \mathbb{Z}} w_{i-m,j} x_j + b_{i-m}.$$

This implies $b_i = \mathrm{const}$ and $w_{i-m,j} = w_{i,j+m}$; in other words, if we consider $w_{ij}$ to be the elements of an infinite weight matrix, it has equal elements on each of its diagonals. Another way to express this is by saying that $w_{ij}$ is a function of $i-j$.

Toeplitz operators and convolution

A linear operator exhibiting the above structure is called Toeplitz. The output of a shift-invariant (Toeplitz) neuron can be written as

$$y_i = \varphi\Big( \sum_{j \in \mathbb{Z}} w_{i-j} x_j + b \Big).$$

Note that the weights $\bb{w}$ can now be considered as a window that is applied to the input at a certain location to produce the output at that location, and is then slid to a different input location to produce the corresponding output. This operation (the application of the Toeplitz operator) is called convolution, denoted as

$$(\bb{w} \ast \bb{x})_i = \sum_{j \in \mathbb{Z}} w_{i-j} x_j.$$

In this notation, the action of our layer can be written as

$$\bb{y} = \varphi( \bb{w} \ast \bb{x} + b ).$$

In signal processing jargon, we say that the input signal $\bb{x}$ is filtered by a filter with the impulse response $\bb{w}$.
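The equivalence between the Toeplitz matrix-vector product and the sliding-window convolution can be checked numerically; the sketch below (NumPy) uses a finite, zero-padded signal in place of the infinitely supported one assumed above:

```python
import numpy as np

# Finite zero-padded version of the infinitely supported signal assumed in the text.
x = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
w = np.array([0.25, 0.5, 0.25])          # filter (impulse response), support of size 3

# Convolution via the sliding-window definition y_i = sum_j w_{i-j} x_j.
y_conv = np.convolve(x, w, mode="same")

# The same operation as a Toeplitz matrix: W[i, j] = w[i - j] (zero outside the support).
n = len(x)
W = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        k = i - j + 1                    # offset so that the window is centered (mode="same")
        if 0 <= k < len(w):
            W[i, j] = w[k]

print(np.allclose(W @ x, y_conv))        # True: the Toeplitz product equals the convolution
```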

Convolutional layer

Neural networks making use of shift-invariant linear operations are called convolutional neural networks (CNNs). A convolutional layer accepts an $m$-dimensional vector-valued infinitely supported signal $\bb{x} = (\bb{x}^1,\dots, \bb{x}^m) = \{ (x_k^1,\dots, x_k^m) \}_{k \in \mathbb{Z}}$; each input dimension is called a channel or a feature map. The layer produces an $n$-dimensional infinitely supported signal $\bb{y} = (\bb{y}^1,\dots, \bb{y}^n) = \{ (y_k^1,\dots, y_k^n) \}_{k \in \mathbb{Z}}$ by applying a bank of filters,

$$\bb{y}^{i} = \varphi\Big( \sum_{j=1}^m \bb{w}^{ij} \ast \bb{x}^{j} + b^{i} \Big), \quad i=1,\dots,n,$$

or, explicitly,

$$y^{i}_k = \varphi\Big( \sum_{j=1}^m \sum_{k' \in \mathbb{Z}} w^{ij}_{k-k'} \, x^{j}_{k'} + b^{i} \Big).$$

In practice, each filter $w^{ij}$ is supported on some small fixed domain.
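A direct (and deliberately naive) NumPy sketch of such a convolutional layer, with ReLU as an assumed activation and finite zero-padded signals, might look as follows:

```python
import numpy as np

def conv_layer(x, w, b):
    """x: (m, T) input with m channels; w: (n, m, s) filter bank; b: (n,) biases.
    Returns y: (n, T), where y^i = relu(sum_j w^{ij} * x^j + b^i)."""
    n, m, s = w.shape
    T = x.shape[1]
    y = np.zeros((n, T))
    for i in range(n):                       # output channels
        acc = np.zeros(T)
        for j in range(m):                   # sum of per-channel convolutions
            acc += np.convolve(x[j], w[i, j], mode="same")
        y[i] = np.maximum(acc + b[i], 0.0)   # element-wise non-linearity (ReLU)
    return y

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 16))                 # m = 3 input feature maps of length 16
w = rng.normal(size=(5, 3, 3))               # n = 5 output channels, 3-tap filters
b = np.zeros(5)
print(conv_layer(x, w, b).shape)             # (5, 16)
```

Practical frameworks implement the same bank-of-filters computation with far more efficient primitives, but the underlying operation is exactly this sum of per-channel convolutions.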
