Description
Vectorization
This handout follows a notation in which matrices and vectors are shown in bold, with matrices in capital letters and vectors in lowercase (for example $\mathbf{W}$ and $\mathbf{x}$). A single element of a vector or matrix is written as a lowercase non-bold letter. Indices into a matrix or vector are written as subscripts, and the $n$th item of a collection is written with a superscript. As an example, $x_i^n$ is the $i$th element of the $n$th vector $\mathbf{x}$ (we have more than one vector $\mathbf{x}$, and each has some number of elements). All superscripts have this meaning unless otherwise stated (for example, when they denote a power).
Linear Model as a Perceptron
In class we discussed linear regression as a very simple machine learning model which fits a linear (first-degree polynomial) function to the data by estimating the coefficients of the equation:
$$\hat{y} = \sum_{i=1}^{n} \theta_i x_i + b$$
The coefficients $\theta_i$ are sometimes called the parameters or weights of the model. The term $b$ is called the bias and acts as the intercept or offset of the learned linear model.
We also discussed a more visual and mechanical interpretation of a linear model: the perceptron. This graphical model implements exactly the same computation as the linear regression equation above. The example in the figure shows logistic regression, because the sigmoid function $\sigma$ is applied to the weighted sum. The vector $\mathbf{w}$ replaces the regression coefficients as weights.
[Figure: perceptron with inputs $x_1, x_2, x_3$, weights $w_1, w_2, w_3$ and a sigmoid output]
$$z = \sum_{i=1}^{n} w_i x_i + b$$
$$\hat{y} = \sigma(z)$$
Some textbooks use an even more concise representation for $z$, for both notational and computational reasons.
$$z = \sum_{i=0}^{n} \bar{w}_i \bar{x}_i$$
The new $\bar{\mathbf{w}}$ and $\bar{\mathbf{x}}$ vectors each have one extra element (the 0th element): $\bar{\mathbf{w}}$ has been augmented with $b$, and $\bar{\mathbf{x}}$ now has a 1 in the same position. This is mathematically identical to the equation above, but it folds the extra addition into the single sum of products. Shortcuts like this are commonly used by machine learning libraries to simplify and speed up computation. We will, however, not be using this formulation.
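The equivalence of the two forms can be checked numerically. The snippet below is a minimal sketch using NumPy; the variable names (`w_bar`, `x_bar`) and the random values are illustrative assumptions, not part of the handout.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)   # weights w_1..w_3 (arbitrary example values)
x = rng.normal(size=3)   # inputs  x_1..x_3 (arbitrary example values)
b = 0.5                  # bias

# Original form: weighted sum plus a separate bias term.
z_plain = np.dot(w, x) + b

# Augmented form: the bias becomes the 0th weight, and the input gets a constant 1.
w_bar = np.concatenate(([b], w))
x_bar = np.concatenate(([1.0], x))
z_augmented = np.dot(w_bar, x_bar)

print(z_plain, z_augmented)  # the two values agree up to floating point error
```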
Vectorized form of Perceptron
One advantage of using a perceptron as a linear model is that it is very simple to convert this structure into a multi-layered version and so create a non-linear model. The original name of Neural Networks was Multi-Layered Perceptrons for this exact reason. But first, we need a more compact way to represent the mathematical equation. For that we will use linear algebra.
$$z = \mathbf{w}\mathbf{x} + b$$
This equation for $z$ looks eerily similar to the previous one, so let’s take a closer look. The vector $\mathbf{w}$ has shape 1 × 3 (for the example in the diagram) and $\mathbf{x}$ has shape 3 × 1; $b$ is just a single number, but it can be thought of as a 1 × 1 vector/matrix.
We can see, using the matrix multiplication rules, that the output of this equation will be a single number, identical to the previous versions. We can also use this matrix notation to describe the learning procedure for the perceptron with least squares error. The weight update step discussed in class is:
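$$w_i = w_i - \alpha \frac{\partial E}{\partial w_i}$$
$$E = \sum_{\forall n} (\hat{y} - y)^2$$
(In the expression for $E$, the superscript 2 represents a power.)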
Here $\hat{y}$ is the model output, $\sigma(z)$, which depends on $\mathbf{w}$, and $\alpha$ is the learning rate (step size). We want to minimize $E$, so we use its derivative with respect to $\mathbf{w}$. We can vectorize this step very easily as well: since a perceptron has a single output, $E$ will be a single number. We can obtain a vector containing all the partial derivatives:
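$$\Delta\mathbf{w} = \left[ \frac{\partial E}{\partial w_1} \;\; \frac{\partial E}{\partial w_2} \;\; \dots \;\; \frac{\partial E}{\partial w_n} \right]$$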
Now the weight update step simply becomes:
$$\mathbf{w} = \mathbf{w} - \alpha \, \Delta\mathbf{w}$$
Not only is the vectorized form more concise, it is also slightly easier to understand because you do not need to keep track of the indices of matrices.
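As a rough illustration of the vectorized update, the sketch below trains a single sigmoid perceptron with the squared error summed over all samples. The toy data, the learning rate, the number of steps and all function names are assumptions made for this example; for convenience the samples are stored as rows of `X`, so the weighted sum is written `X @ w`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error(w, b, X, y):
    # E = sum over all samples of (y_hat - y)^2
    y_hat = sigmoid(X @ w + b)
    return np.sum((y_hat - y) ** 2)

def gradient(w, b, X, y):
    # Chain rule: dE/dz = 2 (y_hat - y) * y_hat (1 - y_hat), one value per sample.
    y_hat = sigmoid(X @ w + b)
    dE_dz = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)
    delta_w = X.T @ dE_dz      # vector of all partial derivatives dE/dw_i
    delta_b = np.sum(dE_dz)    # partial derivative dE/db
    return delta_w, delta_b

# Toy data (assumed): 20 samples with 3 features, label depends on the first feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = (X[:, 0] > 0).astype(float)

w = np.zeros(3)
b = 0.0
alpha = 0.05  # learning rate, chosen arbitrarily for this sketch

error_before = error(w, b, X, y)
for _ in range(200):
    delta_w, delta_b = gradient(w, b, X, y)
    w = w - alpha * delta_w    # the vectorized weight update
    b = b - alpha * delta_b
error_after = error(w, b, X, y)

print(error_before, error_after)
```

Note that the whole weight vector is updated in one line, with no loop over the individual weights.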
Vectorized Multi-Layer Perceptron
Now we will join a few perceptrons together to form a Neural Network. We can easily accomplish this by passing the output of each layer/set of perceptrons as the input to the next layer.
First, the boring stuff… nomenclature. The model in the figure is sometimes called a 2-layer model because it has 2 sets of trainable perceptrons/weight matrices. In some other texts (and in the naming we will follow) it is called a 3-layer model because visually it has 3 sets of ‘nodes’ (the first set is the input layer, which is technically constant). Be careful about this difference when implementing, because with our naming scheme the ‘first’ layer is not trainable: it has no weights.
[Figure: diagram of the 3-layer neural network]
The step by step derivation and explanation of these equations can be found in the lecture slides along with a worked example. This handout will just list the equations and describe the notation a bit. The equations correspond to the same 3-layer network shown above with sigmoid activation on both layers and MSE loss.
Now let’s see how this structure is built. The first perceptron in the second layer (marked with a blue circle) has 2 inputs and therefore 2 weights. Previously we described this with a 2 × 1 vector, but now we have 3 such neurons/perceptrons. Let’s stack these 3 weight vectors next to each other as the columns of $\mathbf{W}^1$, which is a 2 × 3 matrix. Each perceptron outputs a single number, so a layer of 3 perceptrons will output a vector ($\mathbf{a}^1$) of shape 1 × 3. Each neuron also has a separate bias (not shown on the diagram), so $\mathbf{b}^1$ is also a vector, here of shape 3 × 1. Mathematically, you can simply transpose $\mathbf{b}^1$ to 1 × 3 so that the addition works out. In a Python (NumPy) implementation, storing the bias as a 1 × 3 row or as a 1-D array of length 3 lets broadcasting handle the addition; note that adding a 3 × 1 column to a 1 × 3 row would instead broadcast to a 3 × 3 result.
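The shapes involved in the bias addition are easy to get wrong, so here is a small NumPy check. It is only a sketch: the arrays are placeholders filled with zeros and ones, and the variable names are made up for the example.

```python
import numpy as np

xW = np.zeros((1, 3))      # stand-in for the product x W^1, shape 1 x 3
b_col = np.ones((3, 1))    # bias stored as a 3 x 1 column
b_row = b_col.T            # the same bias transposed to 1 x 3
b_flat = b_col.ravel()     # the same bias as a 1-D array of shape (3,)

shape_col = (xW + b_col).shape    # both operands are stretched by broadcasting
shape_row = (xW + b_row).shape    # shapes already match
shape_flat = (xW + b_flat).shape  # the 1-D array is treated as a row

print(shape_col, shape_row, shape_flat)
```

Only the row and 1-D versions keep the 1 × 3 shape expected by the next layer.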
Forward Pass
$$\mathbf{z}^1 = \mathbf{x}\mathbf{W}^1 + \mathbf{b}^1$$
$$\mathbf{a}^1 = \sigma(\mathbf{z}^1)$$
This becomes the input to the next layer of neurons/perceptrons with their own weight matrix constructed in a similar fashion.
$$\mathbf{z}^2 = \mathbf{a}^1\mathbf{W}^2 + \mathbf{b}^2$$
$$\mathbf{a}^2 = \sigma(\mathbf{z}^2)$$
This is the entire forward pass of a neural network in [basically] 2 matrix products.
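A minimal NumPy sketch of this forward pass is given below. The layer sizes follow the description above (2 inputs, 3 hidden perceptrons); the single output neuron, the random initial weights and the function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    z1 = x @ W1 + b1   # hidden layer weighted sums, shape 1 x 3
    a1 = sigmoid(z1)   # hidden layer outputs
    z2 = a1 @ W2 + b2  # output layer weighted sums
    a2 = sigmoid(z2)   # network output
    return z1, a1, z2, a2

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 2))    # one sample with 2 inputs, as a row vector
W1 = rng.normal(size=(2, 3))   # one column of weights per hidden perceptron
b1 = np.zeros((1, 3))          # biases stored as a row so the addition is 1 x 3
W2 = rng.normal(size=(3, 1))   # assumed: a single output neuron
b2 = np.zeros((1, 1))

z1, a1, z2, a2 = forward(x, W1, b1, W2, b2)
print(a1.shape, a2.shape)
```

Passing a matrix with one sample per row instead of a single row computes the forward pass for the whole batch with the same two matrix products.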
Backward Pass
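$$\boldsymbol{\delta}^2 = \mathbf{a}^2 - \mathbf{y}$$
$$\sigma'(\mathbf{z}^2) = \mathbf{a}^2 \odot (1 - \mathbf{a}^2)$$
$$\boldsymbol{\Delta}^2 = \boldsymbol{\delta}^2 \odot \sigma'(\mathbf{z}^2)$$
$$\boldsymbol{\delta}^1 = \left(\mathbf{W}^2 (\boldsymbol{\Delta}^2)^T\right)^T$$
$$\sigma'(\mathbf{z}^1) = \mathbf{a}^1 \odot (1 - \mathbf{a}^1)$$
$$\boldsymbol{\Delta}^1 = \boldsymbol{\delta}^1 \odot \sigma'(\mathbf{z}^1)$$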
The term $\boldsymbol{\delta}$ is the derivative of the error, which can be obtained directly for the output layer because we have the ground truth data $\mathbf{y}$. This is not possible for the intermediate/hidden layers, which is where the chain rule and the backpropagation algorithm come in. We can calculate the $\boldsymbol{\delta}$ of any hidden layer using the $\boldsymbol{\Delta}$ of the next layer, almost as if the error is propagating backwards. The $\sigma'(\mathbf{z})$ terms are the derivatives of the activation function evaluated at the particular value of $\mathbf{z}$. This term changes if you change the activation function of a layer. The ⨀ operator is called the Hadamard product and is a simple elementwise multiplication.
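Technically, the $\boldsymbol{\Delta}$s should be used as is in the weight update equation (similar to the perceptron update equation shown previously). This handout, however, splits the chain rule into two parts:
$$\frac{\partial E}{\partial \mathbf{W}} = \frac{\partial E}{\partial \mathbf{a}} \cdot \frac{\partial \mathbf{a}}{\partial \mathbf{z}} \cdot \frac{\partial \mathbf{z}}{\partial \mathbf{W}}$$
The first two factors (the red portion) are represented by $\boldsymbol{\Delta}$ in the backward pass step, and the last factor (the purple portion) is calculated in the weight update equations.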
There is no particular reason to set up the structure this way (it is, again, an arbitrary choice). However, the automatic tests provided to you with Assignment 1 use this formulation.
Weight Update
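$$\mathbf{W}^2 = \mathbf{W}^2 - \alpha \left((\boldsymbol{\Delta}^2)^T \mathbf{a}^1\right)^T$$
$$\mathbf{b}^2 = \mathbf{b}^2 - \alpha \sum_{\forall n} \boldsymbol{\Delta}^2$$
$$\mathbf{W}^1 = \mathbf{W}^1 - \alpha \left((\boldsymbol{\Delta}^1)^T \mathbf{x}\right)^T$$
$$\mathbf{b}^1 = \mathbf{b}^1 - \alpha \sum_{\forall n} \boldsymbol{\Delta}^1$$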
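The backward pass and weight update can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the reference implementation used by the assignment tests: the loss is taken to be half the summed squared error (so that the output error derivative is exactly the output minus the target), the network has a single output neuron, and all names, data and the learning rate are made up. A finite-difference check on one weight is included as a sanity test.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    a1 = sigmoid(x @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    return a1, a2

def loss(x, y, W1, b1, W2, b2):
    # Assumed loss: 0.5 * sum of squared errors over all samples.
    _, a2 = forward(x, W1, b1, W2, b2)
    return 0.5 * np.sum((a2 - y) ** 2)

def backward(x, y, W1, b1, W2, b2):
    a1, a2 = forward(x, W1, b1, W2, b2)
    delta2 = a2 - y                          # error derivative at the output
    Delta2 = delta2 * (a2 * (1.0 - a2))      # Hadamard product with sigmoid derivative
    delta1 = (W2 @ Delta2.T).T               # error propagated back to the hidden layer
    Delta1 = delta1 * (a1 * (1.0 - a1))
    grad_W2 = (Delta2.T @ a1).T              # same shape as W2
    grad_b2 = Delta2.sum(axis=0, keepdims=True)  # sum over all samples
    grad_W1 = (Delta1.T @ x).T               # same shape as W1
    grad_b1 = Delta1.sum(axis=0, keepdims=True)
    return grad_W1, grad_b1, grad_W2, grad_b2

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))                      # 4 samples, 2 inputs each
y = rng.integers(0, 2, size=(4, 1)).astype(float)
W1 = rng.normal(size=(2, 3)); b1 = np.zeros((1, 3))
W2 = rng.normal(size=(3, 1)); b2 = np.zeros((1, 1))

grad_W1, grad_b1, grad_W2, grad_b2 = backward(x, y, W1, b1, W2, b2)

# Sanity check: compare one analytic derivative with a central finite difference.
eps = 1e-6
W1_plus = W1.copy();  W1_plus[0, 0] += eps
W1_minus = W1.copy(); W1_minus[0, 0] -= eps
numeric = (loss(x, y, W1_plus, b1, W2, b2) - loss(x, y, W1_minus, b1, W2, b2)) / (2 * eps)

# One weight update step with an arbitrarily chosen learning rate.
alpha = 0.05
loss_before = loss(x, y, W1, b1, W2, b2)
W2 = W2 - alpha * grad_W2
b2 = b2 - alpha * grad_b2
W1 = W1 - alpha * grad_W1
b1 = b1 - alpha * grad_b1
loss_after = loss(x, y, W1, b1, W2, b2)
print(numeric, grad_W1[0, 0], loss_before, loss_after)
```

All gradients are computed before any weight is changed, so the hidden layer update uses the same output layer weights that produced the forward pass.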
Importance of Activation Functions
We have seen that the usage of an activation function is integral to the entire learning process of a neural network as it affects the full mathematical pipeline. But let’s also build some intuition as to why it is so important.
Consider the following equation, which describes the feed forward step of a 3-layer neural network. You can suppose that the bias is either 0 or is incorporated in the matrices (like the concise form we discussed above).
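$$\hat{\mathbf{y}} = \sigma(\sigma(\mathbf{x}\mathbf{W}^1)\mathbf{W}^2)$$
Here $\sigma$ is our activation function; it is the same for both layers, but that is an arbitrary choice. To simulate the absence of the activation function, let’s use $\sigma(z) = z$ as a linear activation function, which essentially does nothing.
$$\hat{\mathbf{y}} = \mathbf{x}\mathbf{W}^1\mathbf{W}^2$$
$$\hat{\mathbf{y}} = \mathbf{x}(\mathbf{W}^1\mathbf{W}^2)$$
$$\hat{\mathbf{y}} = \mathbf{x}\mathbf{W}$$
where $\mathbf{W} = \mathbf{W}^1\mathbf{W}^2$ is a single weight matrix.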
We can see that using a linear activation function (the same as not using an activation function) reduces the equation to a form which is essentially the same as linear regression. That was a linear model which could not learn complex non-linear relationships in the data (at least in the form of linear regression we have discussed).
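The collapse described above can be checked numerically. In the sketch below, the layer sizes, the random weights and the variable names are assumptions chosen for the example; biases are taken to be 0, as in the equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def identity(z):
    # Linear "activation" that does nothing.
    return z

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))    # 5 samples, 2 inputs each
W1 = rng.normal(size=(2, 3))
W2 = rng.normal(size=(3, 1))

# Two layers with a linear activation...
two_layer_linear = identity(identity(x @ W1) @ W2)

# ...give the same result as one layer with the combined matrix W = W1 W2.
W = W1 @ W2
single_layer = x @ W

# With a sigmoid activation the network no longer reduces to x W.
two_layer_sigmoid = sigmoid(sigmoid(x @ W1) @ W2)

print(np.allclose(two_layer_linear, single_layer))
print(np.allclose(two_layer_sigmoid, single_layer))
```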
So the power of a Neural Network to model non-linear relationships depends completely on the presence of a non-linear activation function, which is why these functions are sometimes referred to as ‘non-linearities’.