Homework 3 Solution


Conceptual Questions

A1. The answers to these questions should be answerable without referring to external materials. Briefly justify your answers with a few words.

a. [2 points] True or False: Given a data matrix $X \in \mathbb{R}^{n \times d}$ where $d$ is much smaller than $n$, if we project our data onto a $k$-dimensional subspace using PCA where $k = \mathrm{rank}(X)$, our projection will have 0 reconstruction error (we find a perfect representation of our data, with no information loss).

b. [2 points] True or False: The maximum-margin decision boundaries that support vector machines construct have the lowest generalization error among all linear classifiers.

c. [2 points] True or False: An individual observation $x_i$ can occur multiple times in a single bootstrap sample from a dataset $X$, even if $x_i$ only occurs once in $X$.

d. [2 points] True or False: Suppose that the SVD of a square $n \times n$ matrix $X$ is $X = USV^\top$, where $S$ is a diagonal $n \times n$ matrix. Then the rows of $V$ are equal to the eigenvectors of $X^\top X$.

e. [2 points] True or False: Performing PCA to reduce the feature dimensionality and then applying the Lasso results in an interpretable linear model.

f. [2 points] True or False: Choosing $k$ to minimize the $k$-means objective (see Equation (1) below) is a good way to find meaningful clusters.

g. [2 points] Say you trained an SVM classifier with an RBF kernel $K(u, v) = \exp\left(-\frac{\|u - v\|_2^2}{2\sigma^2}\right)$. It seems to underfit the training set: should you increase or decrease $\sigma$?


Kernels and the Bootstrap

A2. [5 points] Suppose that our inputs $x$ are one-dimensional and that our feature map is infinite-dimensional: $\phi(x)$ is a vector whose $i$th component is
\[ \frac{1}{\sqrt{i!}}\, x^i e^{-\frac{x^2}{2}} \]
for all nonnegative integers $i$. (Thus, $\phi$ is an infinite-dimensional vector.) Show that $K(x, x') = e^{-\frac{(x - x')^2}{2}}$ is a kernel function for this feature map, i.e.,
\[ \phi(x)^\top \phi(x') = e^{-\frac{(x - x')^2}{2}}. \]
Hint: Use the Taylor expansion of $e^z$. (This is the one-dimensional version of the Gaussian (RBF) kernel.)
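As a quick numerical sanity check (not a proof), one can compare a truncated version of this inner product against the claimed kernel value; the truncation length 40 below is an arbitrary choice.

    import numpy as np
    from math import factorial

    def phi(x, num_terms=40):
        """Truncated feature map: phi_i(x) = x**i * exp(-x**2 / 2) / sqrt(i!)."""
        i = np.arange(num_terms)
        return x ** i * np.exp(-x ** 2 / 2) / np.sqrt([factorial(j) for j in i])

    x, xp = 0.3, -0.7
    lhs = phi(x) @ phi(xp)               # truncated inner product
    rhs = np.exp(-(x - xp) ** 2 / 2)     # claimed kernel value
    print(lhs, rhs)                      # the two agree to many decimal places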

A3. This problem will get you familiar with kernel ridge regression using the polynomial and RBF kernels. First, let's generate some data. Let $n = 30$ and $f(x) = 4\sin(\pi x)\cos(6\pi x^2)$. For $i = 1, \ldots, n$ let each $x_i$ be drawn uniformly at random from $[0, 1]$ and $y_i = f(x_i) + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, 1)$.
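A minimal data-generation sketch (the seed and variable names are our own choices):

    import numpy as np

    rng = np.random.default_rng(0)        # fixed seed for reproducibility
    n = 30
    f = lambda x: 4 * np.sin(np.pi * x) * np.cos(6 * np.pi * x ** 2)
    x = rng.uniform(0, 1, size=n)
    y = f(x) + rng.standard_normal(n)     # additive N(0, 1) noise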

For any function $f$, the true error and the train error are respectively defined as
\[ E_{\text{true}}(f) = \mathbb{E}_{XY}\left[(f(X) - Y)^2\right], \qquad \widehat{E}_{\text{train}}(f) = \frac{1}{n}\sum_{i=1}^{n} (f(x_i) - y_i)^2. \]

Using kernel ridge regression, construct a predictor
\[ \widehat{\alpha} = \arg\min_{\alpha} \; \|K\alpha - y\|_2^2 + \lambda\, \alpha^\top K \alpha, \qquad \widehat{f}(x) = \sum_{i=1}^{n} \widehat{\alpha}_i\, k(x_i, x), \]
where $K_{i,j} = k(x_i, x_j)$ is a kernel evaluation and $\lambda$ is the regularization constant. Include any code you use for your experiments in your submission.
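A minimal sketch of this fit in NumPy. It uses the closed form $\widehat{\alpha} = (K + \lambda I)^{-1} y$, which is the standard minimizer of the objective above; the function and variable names are our own.

    import numpy as np

    def k_poly(X, Z, d):
        """Polynomial kernel matrix (1 + x z)^d for 1-D inputs."""
        return (1.0 + np.outer(X, Z)) ** d

    def k_rbf(X, Z, gamma):
        """RBF kernel matrix exp(-gamma (x - z)^2) for 1-D inputs."""
        return np.exp(-gamma * (X[:, None] - Z[None, :]) ** 2)

    def krr_fit(K, y, lam):
        """alpha-hat = (K + lam I)^{-1} y."""
        return np.linalg.solve(K + lam * np.eye(len(y)), y)

    def krr_predict(alpha, K_test_train):
        """f-hat(x) = sum_i alpha_i k(x_i, x) for each test point."""
        return K_test_train @ alpha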

a. [10 points] Using leave-one-out cross validation, find a good $\lambda$ and hyperparameter settings for the following kernels:

$k_{\text{poly}}(x, z) = (1 + x^\top z)^d$ where $d \in \mathbb{N}$ is a hyperparameter,

$k_{\text{rbf}}(x, z) = \exp(-\gamma \|x - z\|^2)$ where $\gamma > 0$ is a hyperparameter$^1$.

Report the values of $d$, $\gamma$, and the $\lambda$ values for both kernels.
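One possible leave-one-out grid search, reusing x, y, and k_rbf from the sketches above; the candidate grids are illustrative assumptions (the footnote's median-distance heuristic seeds the $\gamma$ grid), not values prescribed by the assignment.

    def loo_error(K, y, lam):
        """Mean leave-one-out squared error for kernel ridge regression with kernel matrix K."""
        n = len(y)
        errs = []
        for i in range(n):
            tr = np.delete(np.arange(n), i)
            alpha = np.linalg.solve(K[np.ix_(tr, tr)] + lam * np.eye(n - 1), y[tr])
            errs.append((K[i, tr] @ alpha - y[i]) ** 2)
        return np.mean(errs)

    # Illustrative grid search for the RBF kernel.
    gamma_med = 1.0 / np.median((x[:, None] - x[None, :]) ** 2)
    best = min((loo_error(k_rbf(x, x, g), y, lam), g, lam)
               for g in gamma_med * np.array([0.1, 0.5, 1.0, 2.0, 10.0])
               for lam in 10.0 ** np.arange(-6, 1))
    print(best)   # (LOO error, gamma, lambda)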

b. [10 points] Let $\widehat{f}_{\text{poly}}(x)$ and $\widehat{f}_{\text{rbf}}(x)$ be the functions learned using the hyperparameters you found in part a. For a single plot per function $\widehat{f} \in \{\widehat{f}_{\text{poly}}(x), \widehat{f}_{\text{rbf}}(x)\}$, plot the original data $\{(x_i, y_i)\}_{i=1}^n$, the true $f(x)$, and $\widehat{f}(x)$ (i.e., define a fine grid on $[0, 1]$ on which to plot the functions).

c. [5 points] We wish to build bootstrap percentile confidence intervals for $\widehat{f}_{\text{poly}}(x)$ and $\widehat{f}_{\text{rbf}}(x)$ for all $x \in [0, 1]$ from part b.$^2$ Use the non-parametric bootstrap with $B = 300$ bootstrap iterations to find 5% and 95% percentiles at each point $x$ on a fine grid over $[0, 1]$.

Specifically, for each bootstrap iteration $b \in \{1, \ldots, B\}$, draw uniformly at random with replacement $n$ samples from $\{(x_i, y_i)\}_{i=1}^n$, train an $\widehat{f}_b$ using the $b$th resampled dataset, and compute $\widehat{f}_b(x)$ for each $x$ in your fine grid; let the 5th percentile at point $x$ be the largest value $\nu$ such that $\frac{1}{B}\sum_{b=1}^{B} \mathbf{1}\{\widehat{f}_b(x) \leq \nu\} \leq .05$, and define the 95th percentile analogously.

Plot the 5th and 95th percentile curves on the plots from part b.
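A sketch of this percentile bootstrap; `fit_predict(x, y, grid)` is a stand-in for refitting kernel ridge regression with the part-a hyperparameters and evaluating it on the grid, and np.percentile's default interpolation differs slightly from the exact definition above.

    def bootstrap_bands(x, y, grid, fit_predict, B=300, seed=1):
        """5th/95th percentile curves of bootstrapped predictions on `grid`."""
        rng = np.random.default_rng(seed)
        n = len(x)
        preds = np.empty((B, len(grid)))
        for b in range(B):
            idx = rng.integers(0, n, size=n)               # draw n points with replacement
            preds[b] = fit_predict(x[idx], y[idx], grid)   # refit and predict on the grid
        return np.percentile(preds, 5, axis=0), np.percentile(preds, 95, axis=0)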

$^1$Given a dataset $x_1, \ldots, x_n \in \mathbb{R}^d$, a heuristic for choosing a range of $\gamma$ in the right ballpark is the inverse of the median of all $n^2$ squared distances $\|x_i - x_j\|_2^2$.

$^2$See Hastie, Tibshirani, Friedman Ch. 8.2 for a review of the bootstrap procedure.


d. [5 points] Repeat parts a, b, and c with $n = 300$, but use 10-fold CV instead of leave-one-out for part a.

e. [5 points] For this problem, use the $\widehat{f}_{\text{poly}}(x)$ and $\widehat{f}_{\text{rbf}}(x)$ learned in part d. Suppose $m = 1000$ additional samples $(x'_1, y'_1), \ldots, (x'_m, y'_m)$ are drawn i.i.d. the same way the first $n$ samples were drawn.

Use the non-parametric bootstrap with $B = 300$ to construct a confidence interval on
\[ \mathbb{E}\left[(Y - \widehat{f}_{\text{poly}}(X))^2 - (Y - \widehat{f}_{\text{rbf}}(X))^2\right] \]
(i.e., randomly draw with replacement $m$ samples denoted $\{(\tilde{x}'_i, \tilde{y}'_i)\}_{i=1}^m$ from $\{(x'_i, y'_i)\}_{i=1}^m$, compute
\[ \frac{1}{m}\sum_{i=1}^{m}\left[(\tilde{y}'_i - \widehat{f}_{\text{poly}}(\tilde{x}'_i))^2 - (\tilde{y}'_i - \widehat{f}_{\text{rbf}}(\tilde{x}'_i))^2\right], \]
and repeat this $B$ times) and find the 5% and 95% percentiles. Report these values.

Using this confidence interval, is there statistically significant evidence to suggest that one of $\widehat{f}_{\text{rbf}}$ and $\widehat{f}_{\text{poly}}$ is better than the other at predicting $Y$ from $X$? (Hint: does the confidence interval contain 0?)
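One way to organize this paired bootstrap, assuming the fitted predictors from part d are available as callables (names are illustrative):

    def paired_bootstrap_ci(x_new, y_new, f_poly, f_rbf, B=300, seed=2):
        """5%/95% percentiles of the bootstrapped difference in test squared errors."""
        rng = np.random.default_rng(seed)
        m = len(x_new)
        diffs = np.empty(B)
        for b in range(B):
            idx = rng.integers(0, m, size=m)    # resample the m additional points
            d = ((y_new[idx] - f_poly(x_new[idx])) ** 2
                 - (y_new[idx] - f_rbf(x_new[idx])) ** 2)
            diffs[b] = d.mean()
        return np.percentile(diffs, 5), np.percentile(diffs, 95)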

k-means clustering

A4. Given a dataset $x_1, \ldots, x_n \in \mathbb{R}^d$ and an integer $1 \leq k \leq n$, recall the following $k$-means objective function:
\[ \min_{\pi_1, \ldots, \pi_k} \; \sum_{i=1}^{k} \sum_{j \in \pi_i} \|x_j - \mu_i\|_2^2, \qquad \mu_i = \frac{1}{|\pi_i|} \sum_{j \in \pi_i} x_j, \qquad (1) \]
where $\pi_1, \ldots, \pi_k$ is a partition of $\{1, 2, \ldots, n\}$. The objective (1) is NP-hard$^3$ to find a global minimizer of. Nevertheless, Lloyd's algorithm, the commonly-used heuristic which we discussed in lecture, typically works well in practice.

a. [5 points] Implement Lloyd's algorithm for solving the $k$-means objective (1). Do not use any off-the-shelf implementations, such as those found in scikit-learn. Include your code in your submission.
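A compact NumPy sketch of Lloyd's algorithm for data stored as an $n \times d$ array; the initialization scheme and stopping rule are our own choices, not requirements of the assignment.

    import numpy as np

    def lloyds(X, k, iters=100, seed=0):
        """Lloyd's algorithm for objective (1): alternate assignments and mean updates."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # init at k data points
        objective = []
        for _ in range(iters):
            # squared distances via ||x||^2 + ||c||^2 - 2 x.c (avoids an n*k*d array)
            d2 = (X ** 2).sum(1)[:, None] + (centers ** 2).sum(1)[None, :] - 2 * X @ centers.T
            d2 = np.maximum(d2, 0.0)
            assign = d2.argmin(axis=1)
            objective.append(d2[np.arange(len(X)), assign].sum())    # value of objective (1)
            new_centers = np.array([X[assign == j].mean(0) if (assign == j).any() else centers[j]
                                    for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, assign, objective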

b. [5 points] Run the algorithm on the training dataset of MNIST with $k = 10$, plotting the objective function (1) as a function of the iteration number. Visualize (and include in your report) the cluster centers as $28 \times 28$ images.

c. [5 points] For $k = \{2, 4, 8, 16, 32, 64\}$ run the algorithm on the training dataset to obtain centers $\{\mu_i\}_{i=1}^k$. If $\{(x_i, y_i)\}_{i=1}^n$ and $\{(x'_i, y'_i)\}_{i=1}^m$ denote the training and test sets, respectively, plot the training error $\frac{1}{n}\sum_{i=1}^{n} \min_{j=1,\ldots,k} \|\mu_j - x_i\|_2^2$ and the test error $\frac{1}{m}\sum_{i=1}^{m} \min_{j=1,\ldots,k} \|\mu_j - x'_i\|_2^2$ as a function of $k$ on the same plot.
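One way to compute these errors with the sketch above (names and data variables such as X_train, X_test are our own):

    def kmeans_error(X, centers):
        """Mean squared distance from each point to its nearest center."""
        d2 = (X ** 2).sum(1)[:, None] + (centers ** 2).sum(1)[None, :] - 2 * X @ centers.T
        return np.maximum(d2, 0.0).min(axis=1).mean()

    # for k in [2, 4, 8, 16, 32, 64]: fit lloyds(X_train, k), then record
    # kmeans_error(X_train, centers) and kmeans_error(X_test, centers)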

B1. Intro to sample complexity

For $i = 1, \ldots, n$ let $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} P_{XY}$ where $y_i \in \{-1, 1\}$ and $x_i$ lives in some set $\mathcal{X}$ ($x_i$ is not necessarily a vector). The 0/1 loss, or risk, for a deterministic classifier $f : \mathcal{X} \to \{-1, 1\}$ is defined as:
\[ R(f) = \mathbb{E}_{XY}\left[\mathbf{1}(f(X) \neq Y)\right] \]
where $\mathbf{1}(\mathcal{E})$ is the indicator function for the event $\mathcal{E}$ (the function takes the value 1 if $\mathcal{E}$ occurs and 0 otherwise). The expectation is with respect to the underlying distribution $P_{XY}$ on $(X, Y)$. Unfortunately, we don't know $P_{XY}$ exactly, but we do have our i.i.d. samples $\{(x_i, y_i)\}_{i=1}^n$ drawn from it.

$^3$To be more precise, it is NP-hard both in $d$ when $k = 2$ and in $k$ when $d = 2$. See the references on the Wikipedia page for k-means for more details.


Define the empirical risk as
\[ \widehat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(f(x_i) \neq y_i), \]
which is just an empirical estimate of our loss. Suppose that a learning algorithm computes the empirical risk $\widehat{R}_n(f)$ for all $f \in \mathcal{F}$ and outputs the prediction function $\widehat{f}$ which is the one with the smallest empirical risk. (In this problem, we are assuming that $\mathcal{F}$ is finite.) Suppose that the best-in-class function $f^*$ (i.e., the one that minimizes the true 0/1 loss) is:
\[ f^* = \arg\min_{f \in \mathcal{F}} R(f). \]

a. [2 points] Suppose that for some $f \in \mathcal{F}$, we have $R(f) > \epsilon$. Show that $P(\widehat{R}_n(f) = 0) \leq e^{-n\epsilon}$. (You may use the fact that $1 - \epsilon \leq e^{-\epsilon}$.)
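One possible line of reasoning (a sketch, not necessarily the only accepted argument): since the samples are i.i.d., $\widehat{R}_n(f) = 0$ means $f$ classifies all $n$ independent draws correctly, and each draw is classified correctly with probability $1 - R(f) < 1 - \epsilon$, so
\[ P(\widehat{R}_n(f) = 0) = (1 - R(f))^n \leq (1 - \epsilon)^n \leq e^{-n\epsilon}, \]
using the given fact $1 - \epsilon \leq e^{-\epsilon}$ in the last step.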

b. [2 points] Use the union bound to show that
\[ P\left(\exists f \in \mathcal{F} \text{ s.t. } R(f) > \epsilon \text{ and } \widehat{R}_n(f) = 0\right) \leq |\mathcal{F}|\, e^{-n\epsilon}. \]
Recall that the union bound says that if $A_1, \ldots, A_k$ are events in a probability space, then
\[ P(A_1 \cup A_2 \cup \cdots \cup A_k) \leq \sum_{1 \leq i \leq k} P(A_i). \]

c. [2 points] Solve for the minimum $\epsilon$ such that $|\mathcal{F}|\, e^{-n\epsilon} \leq \delta$.

d. [4 points] Use this to show that, with probability at least $1 - \delta$,
\[ \widehat{R}_n(\widehat{f}) = 0 \implies R(\widehat{f}) - R(f^*) \leq \frac{\log(|\mathcal{F}|/\delta)}{n}, \]
where $\widehat{f} = \arg\min_{f \in \mathcal{F}} \widehat{R}_n(f)$.

Context: Note that among a larger set of functions $\mathcal{F}$ there is more likely to exist an $\widehat{f}$ such that $\widehat{R}_n(\widehat{f}) = 0$. However, this increased flexibility comes at the cost of a worse guarantee on the true error, reflected in the larger $|\mathcal{F}|$. This tradeoff quantifies how choosing an overly rich function class $\mathcal{F}$ can overfit. This sample complexity result is remarkable because it depends only on the number of functions in $\mathcal{F}$, not on what they look like. This is among the simplest results in a rich literature known as statistical learning theory. Using a similar strategy, one can use Hoeffding's inequality to obtain a generalization bound when $\widehat{R}_n(\widehat{f}) \neq 0$.

Neural Networks for MNIST

A5. In Homework 1, we used ridge regression to train a classifier for the MNIST data set. Students who did problem B.2 also used a random feature transform. In Homework 2, we used logistic regression to distinguish between the digits 2 and 7. Students who did problem B.4 extended this idea to multinomial logistic regression to distinguish between all 10 digits. In this problem, we will use PyTorch to build a simple neural network classifier for MNIST to further improve our accuracy.

We will implement two different architectures: a shallow but wide network, and a narrow but deeper network. For both architectures, we use $d$ to refer to the number of input features (in MNIST, $d = 28^2 = 784$), $h_i$ to refer to the dimension of the $i$th hidden layer, and $k$ for the number of target classes (in MNIST, $k = 10$). For the non-linear activation, use ReLU. Recall from lecture that
\[ \mathrm{ReLU}(x) = \begin{cases} x, & x \geq 0 \\ 0, & \text{otherwise.} \end{cases} \]


Weight Initialization

Consider a weight matrix $W \in \mathbb{R}^{n \times m}$ and $b \in \mathbb{R}^n$. Note that here $m$ refers to the input dimension and $n$ to the output dimension of the transformation $Wx + b$. Define $\alpha = \frac{1}{\sqrt{m}}$. Initialize all your weight matrices and biases according to $\mathrm{Unif}(-\alpha, \alpha)$.
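A minimal sketch of this initialization with raw tensors (parameters are created by hand since torch.nn layers are disallowed below; the helper name is our own):

    import torch

    def init_layer(in_dim, out_dim):
        """Uniform(-alpha, alpha) weight and bias with alpha = 1/sqrt(in_dim)."""
        alpha = 1.0 / in_dim ** 0.5
        W = torch.empty(out_dim, in_dim).uniform_(-alpha, alpha).requires_grad_()
        b = torch.empty(out_dim).uniform_(-alpha, alpha).requires_grad_()
        return W, b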

Training

For this assignment, use the Adam optimizer from torch.optim. Adam is a more advanced form of gradient descent that combines momentum and learning-rate scaling, and it often converges faster than plain gradient descent. You may use full-batch gradient descent or any form of stochastic gradient descent: you are still using Adam in either case, but you may pass it the full dataset, a single datapoint, or a mini-batch at each step. Use cross-entropy for the loss function and ReLU for the non-linearity.

Implementing the Neural Networks

a. [10 points] Let $W_0 \in \mathbb{R}^{h \times d}$, $b_0 \in \mathbb{R}^h$, $W_1 \in \mathbb{R}^{k \times h}$, $b_1 \in \mathbb{R}^k$, and $\sigma(z) : \mathbb{R} \to \mathbb{R}$ some non-linear activation function. Given some $x \in \mathbb{R}^d$, the forward pass of the wide, shallow network can be formulated as:
\[ F_1(x) = W_1\, \sigma(W_0 x + b_0) + b_1 \]
Use $h = 64$ for the number of hidden units and choose an appropriate learning rate. Train the network until it reaches 99% accuracy on the training data and provide a training plot (loss vs. epoch). Finally, evaluate the model on the test data and report both the accuracy and the loss.
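A minimal sketch of this architecture and training loop under the constraints stated below (only torch.optim.Adam, F.relu, and F.cross_entropy); the learning rate, batch size, epoch count, and the tensors X_train/y_train are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    h, d, k = 64, 784, 10
    W0, b0 = init_layer(d, h)            # from the initialization sketch above
    W1, b1 = init_layer(h, k)
    opt = torch.optim.Adam([W0, b0, W1, b1], lr=1e-3)   # learning rate chosen arbitrarily

    def F1(x):
        return F.relu(x @ W0.T + b0) @ W1.T + b1        # logits; cross_entropy applies softmax

    for epoch in range(50):                              # stop once train accuracy reaches 99%
        for i in range(0, len(X_train), 128):            # mini-batches of 128 (arbitrary)
            xb, yb = X_train[i:i + 128], y_train[i:i + 128]
            loss = F.cross_entropy(F1(xb), yb)
            opt.zero_grad()
            loss.backward()
            opt.step()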

b. [10 points] Let $W_0 \in \mathbb{R}^{h_0 \times d}$, $b_0 \in \mathbb{R}^{h_0}$, $W_1 \in \mathbb{R}^{h_1 \times h_0}$, $b_1 \in \mathbb{R}^{h_1}$, $W_2 \in \mathbb{R}^{k \times h_1}$, $b_2 \in \mathbb{R}^k$, and $\sigma(z) : \mathbb{R} \to \mathbb{R}$ some non-linear activation function. Given some $x \in \mathbb{R}^d$, the forward pass of the network can be formulated as:
\[ F_2(x) = W_2\, \sigma(W_1\, \sigma(W_0 x + b_0) + b_1) + b_2 \]
Use $h_0 = h_1 = 32$ and perform the same steps as in part a.
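The deeper forward pass follows the same pattern (a sketch, with init_layer and the training loop as in part a):

    h0 = h1 = 32
    W0, b0 = init_layer(d, h0)
    W1, b1 = init_layer(h0, h1)
    W2, b2 = init_layer(h1, k)

    def F2(x):
        z = F.relu(x @ W0.T + b0)
        z = F.relu(z @ W1.T + b1)
        return z @ W2.T + b2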

c. [5 points] Compute the total number of parameters of each network and report them. Then compare the number of parameters as well as the test accuracies the networks achieved. Is one of the approaches (wide and shallow vs. narrow and deep) better than the other? Give an intuition for why or why not.
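For reference, counting parameters directly from the shapes above (our own arithmetic, assuming $d = 784$ and $k = 10$): the wide network has $(784 \cdot 64 + 64) + (64 \cdot 10 + 10) = 50{,}240 + 650 = 50{,}890$ parameters, while the deeper network has $(784 \cdot 32 + 32) + (32 \cdot 32 + 32) + (32 \cdot 10 + 10) = 25{,}120 + 1{,}056 + 330 = 26{,}506$.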

Using PyTorch: For your solution, you may not use any functionality from the torch.nn module except for torch.nn.functional.relu and torch.nn.functional.cross_entropy. You must implement the networks $F$ from scratch. For starter code and a tutorial on PyTorch, refer to the section material here and problem B.4 on the previous homework.

PCA

Let's do PCA on the MNIST dataset and reconstruct the digits in the dimensionality-reduced PCA basis.

You will compute your PCA basis using the training dataset only, and evaluate the quality of the basis on the test set, similar to the k-means reconstructions above. The 50,000 training examples are each of size $28 \times 28$, so begin by flattening each example to a vector to obtain $X_{\text{train}} \in \mathbb{R}^{50{,}000 \times d}$ and $X_{\text{test}} \in \mathbb{R}^{10{,}000 \times d}$ for $d := 784$.

A6. Let $\mu \in \mathbb{R}^d$ denote the average of the training examples in $X_{\text{train}}$, i.e., $\mu = \frac{1}{50{,}000} X_{\text{train}}^\top \mathbf{1}$. Now let $\Sigma = (X_{\text{train}} - \mathbf{1}\mu^\top)^\top (X_{\text{train}} - \mathbf{1}\mu^\top)/50{,}000$ denote the sample covariance matrix of the training examples, and let $\Sigma = UDU^\top$ denote the eigenvalue decomposition of $\Sigma$.
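A minimal NumPy sketch of these quantities (X_train as a 50,000 × 784 array; variable names are our own), which also produces the eigenvalues asked for in part a below:

    import numpy as np

    mu = X_train.mean(axis=0)                            # average training example
    Xc = X_train - mu                                    # centered data
    Sigma = Xc.T @ Xc / X_train.shape[0]                 # sample covariance, d x d
    eigvals, U = np.linalg.eigh(Sigma)                   # eigh returns ascending eigenvalues
    eigvals, U = eigvals[::-1], U[:, ::-1]               # sort largest first
    print(eigvals[[0, 1, 9, 29, 49]], eigvals.sum())     # lambda_1, 2, 10, 30, 50 and the sum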

a. [2 points] If $\lambda_i$ denotes the $i$th largest eigenvalue of $\Sigma$, what are the eigenvalues $\lambda_1$, $\lambda_2$, $\lambda_{10}$, $\lambda_{30}$, and $\lambda_{50}$? What is the sum of the eigenvalues, $\sum_{i=1}^{d} \lambda_i$?


b. [5 points] Any example $x \in \mathbb{R}^d$ (including those from either the training or test set) can be approximated using just $\mu$ and the first $k$ (eigenvalue, eigenvector) pairs, for any $k = 1, 2, \ldots, d$. For any $k$, provide a formula for computing this approximation.
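One standard way to write such a rank-$k$ approximation (a sketch of the intended form, where $u_1, \ldots, u_k$ are the eigenvectors for the $k$ largest eigenvalues, i.e., the first $k$ columns of $U$):
\[ \widehat{x} = \mu + \sum_{i=1}^{k} u_i u_i^\top (x - \mu). \]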

c. [5 points] Using this approximation, plot the reconstruction error from $k = 1$ to $100$ (the $x$-axis is $k$ and the $y$-axis is the mean-squared reconstruction error) on the training set and the test set (using the $\mu$ and the basis learned from the training set). On a separate plot, plot $1 - \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$ from $k = 1$ to $100$.

d. [3 points] Now let us get a sense of what the top PCA directions are capturing. Display the first 10 eigenvectors as images, and provide a brief interpretation of what you think they capture.

e. [3 points] Finally, visualize a set of reconstructed digits from the training set for different values of $k$. In particular, provide the reconstructions for digits 2, 6, and 7 with values $k = 5, 15, 40, 100$ (just choose an image of each digit arbitrarily). Show the original image side-by-side with its reconstruction. Provide a brief interpretation, in terms of your perception of the quality of these reconstructions and the dimensionality you used.

