Problem Set #2 Solved

$30.00 $24.00

[15 points] Logistic Regression: Training stability In this problem, we will be delving deeper into the workings of logistic regression. The goal of this problem is to help you develop your skills debugging machine learning algorithms (which can be very di erent from debugging software in general). We have provided an implementation of logistic regression…

Rate this product

You’ll get a: zip file solution

 

Categorys:

Description

Rate this product
  1. [15 points] Logistic Regression: Training stability

In this problem, we will be delving deeper into the workings of logistic regression. The goal of this problem is to help you develop your skills debugging machine learning algorithms (which can be very di erent from debugging software in general).

We have provided an implementation of logistic regression in src/stability/stability.py, and two labeled datasets A and B in src/stability/ds1 a.csv and src/stability/ds1 b.csv.

Please do not modify the code for the logistic regression training algorithm for this problem. First, run the given logistic regression code to train two di erent models on A and B. You can run the code by simply executing python stability.py in the src/stability directory.

  1. [2 points] What is the most notable di erence in training the logistic regression model on datasets A and B?

  1. [5 points] Investigate why the training procedure behaves unexpectedly on dataset B, but not on A. Provide hard evidence (in the form of math, code, plots, etc.) to corroborate your hypothesis for the misbehavior. Remember, you should address why your explanation does not apply to A.

Hint: The issue is not a numerical rounding or over/under ow error.

  1. [5 points] For each of these possible modi cations, state whether or not it would lead to the provided training algorithm converging on datasets such as B. Justify your answers.

    1. Using a di erent constant learning rate.

    1. Decreasing the learning rate over time (e.g. scaling the initial learning rate by 1=t2, where t is the number of gradient descent iterations thus far).

    1. Linear scaling of the input features.

    1. Adding a regularization term k k22 to the loss function.

    1. Adding zero-mean Gaussian noise to the training data or labels.

  1. [3 points] Are support vector machines vulnerable to datasets like B? Why or why not? Give an informal justi cation.

CS229 Problem Set #2

3

  1. [22 points] Spam classi cation

In this problem, we will use the naive Bayes algorithm and an SVM to build a spam classi er.

In recent years, spam on electronic media has been a growing concern. Here, we’ll build a classi er to distinguish between real messages, and spam messages. For this class, we will be building a classi er to detect SMS spam messages. We will be using an SMS spam dataset developed by Tiago A. Almedia and Jose Mar a Gomez Hidalgo which is publicly available on http://www.dt.fee.unicamp.br/~tiago/smsspamcollection 1

We have split this dataset into training and testing sets and have included them in this assignment as src/spam/spam train.tsv and src/spam/spam test.tsv. See src/spam/spam readme.txt for more details about this dataset. Please refrain from redistributing these dataset les. The goal of this assignment is to build a classi er from scratch that can tell the di erence the spam and non-spam messages using the text of the SMS message.

  1. [5 points] Implement code for processing the the spam messages into numpy arrays that can

be fed into machine learning models. Do this by completing the get words, create dictionary, and transform text functions within our provided src/spam.py. Do note the correspond-ing comments for each function for instructions on what speci c processing is required.

The provided code will then run your functions and save the resulting dictionary into spam dictionary and a sample of the resulting training matrix into

spam sample train matrix.

In your writeup, report the vocabular size after the pre-processing step. You do not need to include any other output for this subquestion.

  1. [10 points] In this question you are going to implement a naive Bayes classi er for spam classi cation with multinomial event model and Laplace smoothing.

Code your implementation by completing the fit naive bayes model and predict from naive bayes model functions in src/spam/spam.py.

Now src/spam/spam.py should be able to train a Naive Bayes model, compute your predic-tion accuracy and then save your resulting predictions to spam naive bayes predictions. In your writeup, report the accuracy of the trained model on the test set.

Remark. If you implement naive Bayes the straightforward way, you will nd that the

computed p(xjy)

= i p(xijy) often equals zero. This is because p(xjy), which is the

product of many

numbers less than one, is a very small number. The standard computer

Q

representation of real numbers cannot handle numbers that are too small, and instead rounds them o to zero. (This is called \under ow.”) You’ll have to nd a way to compute Naive Bayes’ predicted class labels without explicitly representing very small numbers such as p(xjy). [Hint: Think about using logarithms.]

  1. [5 points] Intuitively, some tokens may be particularly indicative of an SMS being in a particular class. We can try to get an informal sense of how indicative token i is for the

SPAM class by looking at:

p(xj = i

y = 1)

= log

P (token i email is SPAM)

:

log

j

j

p(xj = i j

y = 0)

P (token i j email

is NOTSPAM)

Complete the get top five naive bayes words function within the provided code using the above formula in order to obtain the 5 most indicative tokens.

CS229 Problem Set #2

4

Report the top ve words in your writeup.

  1. [2 points] Support vector machines (SVMs) are an alternative machine learning model that we discussed in class. We have provided you an SVM implementation (using a radial basis function (RBF) kernel) within src/spam/svm.py (You should not need to modify that code).

One important part of training an SVM parameterized by an RBF kernel (a.k.a Gaussian kernel) is choosing an appropriate kernel radius parameter.

Complete the compute best svm radius by writing code to compute the best SVM radius which maximizes accuracy on the validation dataset. Report the best kernel radius you obtained in the writeup.

CS229 Problem Set #2

5

  1. [20 points] Bayesian Interpretation of Regularization

Background: In Bayesian statistics, almost every quantity is a random variable, which can either be observed or unobserved. For instance, parameters are generally unobserved random variables, and data x and y are observed random variables. The joint distribution of all the random variables is also called the model (e.g., p(x; y; )). Every unknown quantity can be esti-mated by conditioning the model on all the observed quantities. Such a conditional distribution over the unobserved random variables, conditioned on the observed random variables, is called the posterior distribution. For instance p( jx; y) is the posterior distribution in the machine learning context. A consequence of this approach is that we are required to endow our model parameters, i.e., p( ), with a prior distribution. The prior probabilities are to be assigned before we see the data|they capture our prior beliefs of what the model parameters might be before observing any evidence.

In the purest Bayesian interpretation, we are required to keep the entire posterior distribu-tion over the parameters all the way until prediction, to come up with the posterior predictive distribution, and the nal prediction will be the expected value of the posterior predictive dis-tribution. However in most situations, this is computationally very expensive, and we settle for a compromise that is less pure (in the Bayesian sense).

The compromise is to estimate a point value of the parameters (instead of the full distribution) which is the mode of the posterior distribution. Estimating the mode of the posterior distribution is also called maximum a posteriori estimation (MAP). That is,

MAP = arg max p( jx; y):

Compare this to the maximum likelihood estimation (MLE) we have seen previously:

MLE = arg max p(yjx; ):

In this problem, we explore the connection between MAP estimation, and common regularization techniques that are applied with MLE estimation. In particular, you will show how the choice of prior distribution over (e.g., Gaussian or Laplace prior) is equivalent to di erent kinds of regularization (e.g., L2, or L1 regularization). You will also explore how regularization strengths a ect generalization in part (d).

  1. [3 points] Show that MAP = argmax p(yjx; )p( ) if we assume that p( ) = p( jx). The assumption that p( ) = p( jx) will be valid for models such as linear regression where the input x are not explicitly modeled by . (Note that this means x and are marginally independent, but not conditionally independent when y is given.)

  1. [5 points] Recall that L2 regularization penalizes the L2 norm of the parameters while minimizing the loss (i.e., negative log likelihood in case of probabilistic models). Now we will show that MAP estimation with a zero-mean Gaussian prior over , speci cally N (0; 2I), is equivalent to applying L2 regularization with MLE estimation. Speci cally, show that for some scalar ,

MAP = arg

min

log p(y

x; ) +

jj

2

:

(1)

j

jj2

Also, what is the value of ?

  1. [7 points] Now consider a speci c instance, a linear regression model given by y = T x + where N (0; 2). Assume that the random noise (i) is independent for every training

CS229 Problem Set #2

6

example x(i). Like before, assume a Gaussian prior on this model such that N (0; 2I). For notation, let X be the design matrix of all the training example inputs where each row vector is one example input, and ~y be the column vector of all the example outputs.

Come up with a closed form expression for MAP.

(d) [5 points] Next, consider the Laplace distribution, whose density is given by

fL(zj ; b) = 2b exp

j

b

j

:

1

z

As before, consider a linear regression model given by y = xT + where N (0; 2). Assume a Laplace prior on this model, where each parameter i is marginally independent, and is distributed as i L(0; b).

Show that MAP in this case is equivalent to the solution of linear regression with L1 regularization, whose loss is speci ed as

J( ) = jjX ~yjj22 + jj jj1

Also, what is the value of ?

Note: A closed form solution for linear regression problem with L1 regularization does not exist. To optimize this, we use gradient descent with a random initialization and solve it numerically.

Remark: Linear regression with L2 regularization is also commonly called Ridge regression, and when L1 regularization is employed, is commonly called Lasso regression. These regularizations can be applied to any Generalized Linear models just as above (by replacing log p(yjx; ) with the appropriate family likelihood). Regularization techniques of the above type are also called weight decay, and shrinkage. The Gaussian and Laplace priors encourage the parameter values to be closer to their mean (i.e., zero), which results in the shrinkage e ect.

Remark: Lasso regression (i.e., L1 regularization) is known to result in sparse parameters, where most of the parameter values are zero, with only some of them non-zero.

CS229 Problem Set #2

7

  1. [15 points] Kernelizing the Perceptron

Let there be a binary classi cation problem with y 2 f0; 1g. The perceptron uses hypotheses of the form h (x) = g( T x), where g(z) = sign(z) = 1 if z 0, 0 otherwise. In this problem we will consider a stochastic gradient descent-like implementation of the perceptron algorithm where each update to the parameters is made using only one training example. However, unlike stochastic gradient descent, the perceptron algorithm will only make one pass through the entire training set. The update rule for this version of the perceptron algorithm is given by

(i+1) := (i) + (y(i+1) h (i) (x(i+1)))x(i+1)

where (i) is the value of the parameters after the algorithm has seen the rst i training examples.

Prior to seeing any training examples,

(0)

~

is initialized to 0.

  1. [3 points] Let K be a kernel corresponding to some very high-dimensional feature mapping . Suppose is so high-dimensional (say, 1-dimensional) that it’s infeasible to ever represent (x) explicitly. Describe how you would apply the \kernel trick” to the perceptron to make it work in the high-dimensional feature space , but without ever explicitly computing (x).

[Note: You don’t have to worry about the intercept term. If you like, think of as having the property that 0(x) = 1 so that this is taken care of.] Your description should specify:

    1. [1 points] How you will (implicitly) represent the high-dimensional parameter vector (i), including how the initial value (0) = 0 is represented (note that (i) is now a vector whose dimension is the same as the feature vectors (x));

    1. [1 points] How you will e ciently make a prediction on a new input x(i+1). I.e., how you will compute h (i) (x(i+1)) = g( (i)T (x(i+1))), using your representation of (i); and

    1. [1 points] How you will modify the update rule given above to perform an update to on a new training example (x(i+1); y(i+1)); i.e., using the update rule corresponding to the feature mapping :

(i+1) := (i) + (y(i+1) h (i) (x(i+1))) (x(i+1))

  1. [10 points] Implement your approach by completing the initial state, predict, and update state methods of src/perceptron/perceptron.py.

We provide three functions to be used as kernel, a dot-product kernel de ned as:

K(x; z) = x>z;

(2)

a radial basis function (RBF) kernel, de ned as:

K(x; z) = exp

x z

2

;

k

2 2

k2

(3)

and nally the following function:

K(x; z) = (

0

1

x = z

(4)

x = z

6

Note that the

last function is not a kernel

function

(since

its

corresponding matrix

is not a PSD

matrix). However, we are still

interested to

see

what happens when

CS229 Problem Set #2

8

the kernel is invalid. Run src/perceptron/perceptron.py to train kernelized per-ceptrons on src/perceptron/train.csv. The code will then test the perceptron on src/perceptron/test.csv and save the resulting predictions in the src/perceptron/ folder. Plots will also be saved in src/perceptron/.

Include the three plots (corresponding to each of the kernels) in your writeup, and indicate which plot belongs to which function.

  1. [2 points] One of the choices in Q4b completely fails, one works a bit, and one works well in classifying the points. Discuss the performance of di erent choices and why do they fail or perform well?

CS229 Problem Set #2

9

  1. [30 points] Neural Networks: MNIST image classi cation

In this problem, you will implement a simple neural network to classify grayscale images of handwritten digits (0 – 9) from the MNIST dataset. The dataset contains 60,000 training images and 10,000 testing images of handwritten digits, 0 – 9. Each image is 28 28 pixels in size, and is generally represented as a at vector of 784 numbers. It also includes labels for each example, a number indicating the actual digit (0 – 9) handwritten in that image. A sample of a few such images are shown below.

The data and starter code for this problem can be found in

  • src/mnist/nn.py

  • src/mnist/images train.csv

  • src/mnist/labels train.csv

  • src/mnist/images test.csv

  • src/mnist/labels test.csv

The starter code splits the set of 60,000 training images and labels into a set of 50,000 examples as the training set, and 10,000 examples for dev set.

To start, you will implement a neural network with a single hidden layer and cross entropy loss, and train it with the provided data set. Use the sigmoid function as activation for the hidden layer, and softmax function for the output layer. Recall that for a single example (x; y), the cross entropy loss is:

K

X

CS229 Problem Set #2

10

one-hot vector as described above. Let h be the number of hidden units in the neural network, so that weight matrices W [1] 2 Rd h and W [2] 2 Rh K . We also have biases b[1] 2 Rh and b[2] 2 RK . The forward propagation equations for a single input x(i) then are:

a(i) = W [1]>x(i) + b[1] 2 Rh

z(i) = W [2]>a(i) + b[2] 2 RK

y^(i) = softmax(z(i)) 2 RK

where is the sigmoid function.

For n training examples, we average the cross entropy loss over the n examples.

J(W [1]; W [2]; b[1]; b[2]) = 1

n

CE(y(i); y^(i)) =

1

n K

y(i) log y^(i):

X

X X

k

k

n

n

i=1

i=1 k=1

The starter code already converts labels into one hot representations for you.

Instead of batch gradient descent or stochastic gradient descent, the common practice is to use mini-batch gradient descent for deep learning tasks. In this case, the cost function is de ned as follows:

B

JMB = B1 X CE(y(i); y^(i))

i=1

where B is the batch size, i.e. the number of training example in each mini-batch.

CS229 Problem Set #2

11

we need 50 iterations to cover the entire training data. The images are pre-shu ed. So you don’t need to randomly sample the data, and can just create mini-batches sequentially.

Train the model with mini-batch gradient descent as described above. Run the training for 30 epochs. At the end of each epoch, calculate the value of loss function averaged over the entire training set, and plot it (y-axis) against the number of epochs (x-axis). In the same image, plot the value of the loss function averaged over the dev set, and plot it against the number of epochs.

Similarly, in a new image, plot the accuracy (on y-axis) over the training set, measured as the fraction of correctly classi ed examples, versus the number of epochs (x-axis). In the same image, also plot the accuracy over the dev set versus number of epochs.

Submit the two plots (one for loss vs epoch, another for accuracy vs epoch) in your writeup.

Also, at the end of 30 epochs, save the learnt parameters (i.e all the weights and biases) into a le, so that next time you can directly initialize the parameters with these values from the le, rather than re-training all over. You do NOT need to submit these parameters.

Hint: Be sure to vectorize your code as much as possible! Training can be very slow otherwise.

  1. [7 points] Now add a regularization term to your cross entropy loss. The loss function will become

JMB= B

CE(y(i); y^(i))!

+ jjW [1]jj2

+ jjW [2]jj2

1

B

i=1

X

Be careful not to regularize the bias/intercept term. Set to be 0.0001. Implement the regularized version and plot the same gures as part (a). Be careful NOT to include the regularization term to measure the loss value for plotting (i.e., regularization should only be used for gradient calculation for the purpose of training).

Submit the two new plots obtained with regularized training (i.e loss (without regularization term) vs epoch, and accuracy vs epoch) in your writeup.

Compare the plots obtained from the regularized model with the plots obtained from the non-regularized model, and summarize your observations in a couple of sentences.

As in the previous part, save the learnt parameters (weights and biases) into a di erent le so that we can initialize from them next time.

  1. [3 points] All this while you should have stayed away from the test data completely. Now that you have convinced yourself that the model is working as expected (i.e, the observations you made in the previous part matches what you learnt in class about regularization), it is nally time to measure the model performance on the test set. Once we measure the test set performance, we report it (whatever value it may be), and NOT go back and re ne the model any further.

Initialize your model from the parameters saved in part (a) (i.e, the non-regularized model), and evaluate the model performance on the test data. Repeat this using the parameters saved in part (b) (i.e, the regularized model).

Report your test accuracy for both regularized model and non-regularized model. Brie y (in one sentence) explain why this outcome makes sense” You should have accuracy close to 0.92870 without regularization, and 0.96760 with regularization. Note: these accuracies assume you implement the code with the matrix dimensions as speci ed in the comments,

CS229 Problem Set #2

12

which is not the same way as speci ed in your code. Even if you do not precisely these numbers, you should observe good accuracy and better test accuracy with regularization.

Problem Set #2 Solved
$30.00 $24.00