Machine Learning Homework #3

Conceptual Questions

1. The answers to these questions should be answerable without referring to external materials. Briefly justify your answers with a few words.

a. [2 points] Consider training kernel ridge regression with a Gaussian RBF kernel, $K(u, v) = \exp\left(-\frac{\lVert u - v \rVert_2^2}{2\sigma^2}\right)$. It seems to underfit the training set: should you increase or decrease $\sigma$? (A small code sketch of this kernel appears after part e.)

b. [2 points] True or False: Training deep neural networks requires minimizing a non-convex loss function, and therefore gradient descent might not reach the globally optimal solution.

c. [2 points] True or False: It is good practice to initialize all weights to zero when training a deep neural network.

d. [2 points] True or False: We use non-linear activation functions in a neural network’s hidden layers so that the network learns non-linear decision boundaries.

e. [2 points] True or False: Given a neural network, the time complexity of the backward pass in the backpropagation algorithm can be prohibitively larger than the relatively low time complexity of the forward pass.
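As referenced in part a, here is a minimal sketch of how the Gaussian RBF kernel could be computed; the function name and test vectors are illustrative, not part of the assignment:

```python
import numpy as np

def rbf_kernel(u, v, sigma):
    """Gaussian RBF kernel: K(u, v) = exp(-||u - v||_2^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

u, v = np.array([1.0, 2.0]), np.array([1.5, 1.0])
print(rbf_kernel(u, v, sigma=0.5))  # sigma controls how quickly similarity decays with distance
print(rbf_kernel(u, v, sigma=5.0))
```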

What to Submit:

Writeup: For each of parts a-e, 1-2 sentences containing your answer.

Coding

Introduction to PyTorch

2. PyTorch is a great tool for developing, deploying, and researching neural networks and other gradient-based algorithms. In this problem we will explore how this package is built and re-implement some of its core components. Start by reading the README.md file provided in the intro_pytorch subfolder; many of the problem statements overlap between this document, the READMEs, and the comments in the provided functions.

a. [10 points] You will start by implementing components of our own PyTorch modules. You can find these in the layers, losses, and optimizers folders. Almost every file there contains at least one problem function, including exact directions for what to implement. Finally, you should implement the functions in the train.py file.
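As a rough illustration of what one such component might look like (this is a generic sketch, not the assignment's actual API; the class name and attribute layout are assumptions):

```python
import torch

class LinearLayer:
    """Illustrative fully-connected layer computing y = x @ W + b."""

    def __init__(self, dim_in, dim_out):
        a = 1.0 / dim_in ** 0.5
        # requires_grad=True so autograd tracks gradients for an optimizer to use
        self.weight = torch.empty(dim_in, dim_out).uniform_(-a, a).requires_grad_()
        self.bias = torch.zeros(dim_out, requires_grad=True)

    def forward(self, x):
        return x @ self.weight + self.bias
```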

b. [5 points] Next we will use the module above to perform a hyperparameter search. Here we also treat the loss function as a hyperparameter. However, because cross-entropy and MSE are different, we use two different files: crossentropy_search.py and mean_squared_error_search.py. For each, you will need to build and train (in the provided order) 5 models:

A linear neural network (single layer, no activation function)

An NN with one hidden layer (2 units) and a Sigmoid activation after the hidden layer

An NN with one hidden layer (2 units) and a ReLU activation after the hidden layer

An NN with two hidden layers (2 units each) and Sigmoid and ReLU activations after the first and second hidden layers, respectively

An NN with two hidden layers (2 units each) and ReLU and Sigmoid activations after the first and second hidden layers, respectively

For each loss function, submit a plot of the losses from the training and validation sets. All models should be on the same plot (10 lines per plot), with two plots total (one for MSE, one for cross-entropy). A sketch of one such forward pass appears below.
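As referenced above, here is a hedged sketch of the forward pass for the fourth model (two 2-unit hidden layers, Sigmoid then ReLU); the parameter names and shapes are assumptions for illustration only:

```python
import torch
import torch.nn.functional as F

def forward_two_hidden(x, W0, b0, W1, b1, W2, b2):
    """Two 2-unit hidden layers: Sigmoid after the first, ReLU after the second."""
    h1 = torch.sigmoid(x @ W0 + b0)  # (n, d) -> (n, 2)
    h2 = F.relu(h1 @ W1 + b1)        # (n, 2) -> (n, 2)
    return h2 @ W2 + b2              # (n, 2) -> (n, k) raw outputs fed to the loss
```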

c. [5 points] For each loss function, report the best-performing architecture (best-performing is defined here as achieving the lowest validation loss at any point during training), and plot its guesses on the test set. You should use the plot_model_guesses function from the train.py file. Finally, report that model's accuracy on the test set.

d. [3 points] Is there a big gap in performance between the MSE and cross-entropy models? If so, explain why it occurred; if not, explain why the different loss functions achieve similar performance. Answer in 2-4 sentences.

What to Submit:

Part b: 2 plots (one per loss function), with 10 lines each, showing both the training and validation loss of each model. Make sure the plots are titled and have proper legends.

Part c: Names of the best-performing models (i.e., descriptions of their architectures), and their accuracy on the test set.

Part c: 2 scatter plots (one per loss function), with the predictions of the best-performing models on the test set.

Part d: 2-4 sentence written response to the provided questions.

Code on Gradescope through coding submission

Resources

For the next questions you will use PyTorch heavily. Please feel free to reference the PyTorch documentation when needed.

If you do not have access to a GPU, you might find Google Colaboratory useful. It allows you to use a cloud GPU for free. To enable it, make sure: “Runtime” → “Change runtime type” → “Hardware accelerator” is set to “GPU”. When submitting, please download and submit a .py version of your notebook.
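A common pattern for writing code that runs on either backend (a standard PyTorch idiom, not an assignment requirement):

```python
import torch

# Use the Colab GPU when the hardware accelerator is enabled, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(128, 784, device=device)  # tensors can be created directly on the device
```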

Neural Networks for MNIST

3. In previous homeworks, we used ridge regression to train a classifier for the MNIST data set, and logistic regression to distinguish between the digits 2 and 7.

In this problem, we will use PyTorch to build a simple neural network classifier for MNIST to further improve our accuracy.

We will implement two different architectures: a shallow but wide network, and a narrow but deeper network. For both architectures, we use $d$ to refer to the number of input features (in MNIST, $d = 28^2 = 784$), $h_i$ to refer to the dimension of the $i$-th hidden layer, and $k$ for the number of target classes (in MNIST, $k = 10$). For the non-linear activation, use ReLU. Recall from lecture that

$$\mathrm{ReLU}(x) = \begin{cases} x, & x \ge 0 \\ 0, & \text{otherwise.} \end{cases}$$

Weight Initialization

Consider a weight matrix $W \in \mathbb{R}^{n \times m}$ and a bias $b \in \mathbb{R}^n$. Note that here $m$ refers to the input dimension and $n$ to the output dimension of the transformation $x \mapsto Wx + b$. Define $\alpha = \frac{1}{\sqrt{m}}$. Initialize all your weight matrices and biases according to $\mathrm{Unif}(-\alpha, \alpha)$.
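A minimal sketch of this initialization, assuming the $\alpha = 1/\sqrt{m}$ reading above (the helper name is illustrative):

```python
import torch
from torch.nn.parameter import Parameter

def init_affine(m, n):
    """Draw W in R^{n x m} and b in R^n from Unif(-alpha, alpha) with alpha = 1/sqrt(m)."""
    alpha = 1.0 / m ** 0.5
    W = Parameter(torch.empty(n, m).uniform_(-alpha, alpha))
    b = Parameter(torch.empty(n).uniform_(-alpha, alpha))
    return W, b
```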

Training

For this assignment, use the Adam optimizer from torch.optim. Adam is a more advanced form of gradient descent that combines momentum and learning-rate scaling; it often converges faster than regular gradient descent in practice. You may use full-batch gradient descent or any form of stochastic gradient descent: you are still using Adam, but may pass it the full data set, a single datapoint, or a batch of data at each step. Use cross-entropy for the loss function and ReLU for the non-linearity.
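A minimal sketch of an Adam training step on synthetic data (shapes, learning rate, and step count are placeholders, not the assignment's settings):

```python
import torch
import torch.nn.functional as F
from torch.nn.parameter import Parameter

W = Parameter(torch.empty(10, 784).uniform_(-0.03, 0.03))
b = Parameter(torch.zeros(10))
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))  # one synthetic minibatch

optimizer = torch.optim.Adam([W, b], lr=1e-3)  # lr is a hyperparameter to tune
for step in range(100):
    optimizer.zero_grad()                   # clear gradients from the previous step
    loss = F.cross_entropy(x @ W.T + b, y)  # forward pass + cross-entropy loss
    loss.backward()                         # backward pass (backpropagation)
    optimizer.step()                        # Adam parameter update
```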

Implementing the Neural Networks

a. [10 points] Let $W_0 \in \mathbb{R}^{h \times d}$, $b_0 \in \mathbb{R}^h$, $W_1 \in \mathbb{R}^{k \times h}$, $b_1 \in \mathbb{R}^k$, and let $\sigma(z) : \mathbb{R} \to \mathbb{R}$ be some non-linear activation function applied element-wise. Given some $x \in \mathbb{R}^d$, the forward pass of the wide, shallow network can be formulated as:

$$F_1(x) := W_1 \sigma(W_0 x + b_0) + b_1$$

Use $h = 64$ for the number of hidden units and choose an appropriate learning rate. Train the network until it reaches 99% accuracy on the training data and provide a training plot (loss vs. epoch). Finally, evaluate the model on the test data and report both the accuracy and the loss.
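One possible from-scratch sketch of $F_1$, using only the torch.nn pieces allowed below; the initialization follows the scheme above, and this is a starting point, not a reference solution:

```python
import torch
import torch.nn.functional as F
from torch.nn.parameter import Parameter

class F1(torch.nn.Module):
    """Wide, shallow network: F1(x) = W1 relu(W0 x + b0) + b1."""

    def __init__(self, d=784, h=64, k=10):
        super().__init__()
        a0, a1 = d ** -0.5, h ** -0.5  # alpha = 1/sqrt(input dim) of each layer
        self.W0 = Parameter(torch.empty(h, d).uniform_(-a0, a0))
        self.b0 = Parameter(torch.empty(h).uniform_(-a0, a0))
        self.W1 = Parameter(torch.empty(k, h).uniform_(-a1, a1))
        self.b1 = Parameter(torch.empty(k).uniform_(-a1, a1))

    def forward(self, x):
        hidden = F.relu(x @ self.W0.T + self.b0)  # (n, d) -> (n, h)
        return hidden @ self.W1.T + self.b1       # (n, h) -> (n, k) logits
```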

b. [10 points] Let $W_0 \in \mathbb{R}^{h_0 \times d}$, $b_0 \in \mathbb{R}^{h_0}$, $W_1 \in \mathbb{R}^{h_1 \times h_0}$, $b_1 \in \mathbb{R}^{h_1}$, $W_2 \in \mathbb{R}^{k \times h_1}$, $b_2 \in \mathbb{R}^k$, and let $\sigma(z) : \mathbb{R} \to \mathbb{R}$ be some non-linear activation function. Given some $x \in \mathbb{R}^d$, the forward pass of the network can be formulated as:

$$F_2(x) := W_2 \sigma(W_1 \sigma(W_0 x + b_0) + b_1) + b_2$$

Use $h_0 = h_1 = 32$ and perform the same steps as in part a.

c. [5 points] Compute the total number of parameters of each network and report them. Then compare the numbers of parameters as well as the test accuracies the networks achieved. Is one of the approaches (wide and shallow vs. narrow and deep) better than the other? Give an intuition for why or why not.
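Note that each affine map with $W \in \mathbb{R}^{n \times m}$ and $b \in \mathbb{R}^n$ contributes $nm + n$ parameters. A generic way to count them in PyTorch (a standard idiom, assuming your networks subclass torch.nn.Module as sketched above):

```python
def count_parameters(model):
    """Sum of entries across all trainable parameters of a torch.nn.Module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```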

Using PyTorch: For your solution, you may not use any functionality from the torch.nn module except for torch.nn.functional.relu, torch.nn.functional.cross_entropy, torch.nn.parameter.Parameter, and torch.nn.Module. You must implement the networks $F_1$ and $F_2$ from scratch.

What to Submit:

Parts a-b: Provide a plot of the training loss versus epoch. In addition, evaluate the trained model on the test data and report the accuracy and loss.

Part c: Report the number of parameters for the network trained in part (a) and for the network trained in part (b). Provide the comparison of the two networks described in part (c) in 1-2 sentences.

Code on Gradescope through coding submission.

Administrative

[2 points] About how many hours did you spend on this homework? There is no right or wrong answer 🙂

