Description
This assignment carries a total of 4 marks for CS337 (Theory) and 11 marks for CS335 (Lab).
1 LASSO and ISTA
Lasso regression (Least Absolute Shrinkage and Selection Operator) uses the $\ell_1$ penalty term. It is a widely used tool for achieving sparse solutions to optimization problems. The $\ell_1$ penalty equals the sum of the absolute values of the coefficients.
$$L(w) = \sum_{i=1}^{n} \|y_i - x_{i:} w\|_2^2 + \lambda \|w\|_1$$
In matrix form (using the matrix $X$ here, instead of the notation used in class), LASSO optimizes

$$\min_w \; \|y - Xw\|_2^2 + \lambda \|w\|_1$$
Similar to ridge regression, a $\lambda$ value of zero recovers the basic OLS solution; however, for a suitable $\lambda$, lasso regression can drive some coefficients exactly to zero. The larger the value of $\lambda$, the more coefficients are shrunk to zero. This can eliminate some features entirely, giving a subset of predictors that helps mitigate multi-collinearity and model complexity. Predictors not shrunk towards zero are thereby flagged as important, so $\ell_1$ regularization allows for feature selection (sparse selection).
In this problem, we will implement the Iterative Soft-Thresholding Algorithm (ISTA) for LASSO.
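For reference, one standard form of the ISTA update, with step size $\eta$, takes a gradient step on the squared-error term and then applies elementwise soft-thresholding:

$$w^{(k+1)} = S_{\eta\lambda}\!\left(w^{(k)} - 2\eta\, X^\top \big(X w^{(k)} - y\big)\right), \qquad S_\tau(z)_j = \mathrm{sign}(z_j)\,\max(|z_j| - \tau,\, 0)$$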
1.1 CS 337: Theoretical Question
Prove that the solution to LASSO is the MAP estimate of linear regression under a Laplacian prior on the weights. (1.5 marks)
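Hint: assuming Gaussian noise $y_i = x_{i:} w + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ and i.i.d. Laplace priors on the weights, recall the two standard ingredients

$$p(w_j) = \frac{1}{2b}\exp\!\left(-\frac{|w_j|}{b}\right), \qquad \hat{w}_{\mathrm{MAP}} = \arg\max_w \, \big[\log p(y \mid X, w) + \log p(w)\big].$$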
1.2 CS 335: Lab Questions
- Complete the function ista() in the file p1.py. The file utils.py is the same as in the previous assignment, and you may reuse the code written before. Make sure you write an efficient vectorized implementation (5000 iterations should take less than 10 seconds). You have to stop the algorithm on convergence, i.e., when the norm of the difference between the weights of successive iterations falls below $10^{-4}$. You may change the values of the learning rate and the maximum number of iterations if needed. A hedged NumPy sketch of ISTA appears after this list. (3 marks)
- Plot the train and test MSEs for at least 15 different values of $\lambda$ ranging from 0.1 to 6, with learning rate 0.001 and maximum iterations 10,000, in a single figure. Explain the plot. What value of $\lambda$ is optimal according to you? Why? (1.5 marks)
- Create scatter plots of the weights obtained from LASSO and Ridge regression in separate figures and compare them. You can use the code and the value of $\lambda$ for Ridge from the previous assignment. For LASSO, use the optimal $\lambda$ found in the previous part and run the algorithm for enough iterations to achieve convergence. Briefly justify your observations. (1.5 marks)
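For illustration, a minimal vectorized NumPy sketch of ISTA follows. The signature, default values, and variable names here are assumptions made for the sketch and may differ from the interface expected in p1.py.

import numpy as np

def soft_threshold(z, tau):
    # Elementwise soft-thresholding: sign(z) * max(|z| - tau, 0)
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(X, y, lam=1.0, eta=1e-3, max_iter=10000, eps=1e-4):
    # Sketch only: the actual signature required in p1.py may differ.
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = 2.0 * X.T @ (X @ w - y)            # gradient of ||y - Xw||_2^2
        w_new = soft_threshold(w - eta * grad, eta * lam)
        if np.linalg.norm(w_new - w) < eps:       # stop on convergence
            return w_new
        w = w_new
    return w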
2 Multi-class Classification using 1-vs-rest Perceptron
A Perceptron keeps a weight vector $w_y$ corresponding to each class $y$. Given a feature vector $f$, we score each class $y$ with

$$\mathrm{score}(f, y) = \sum_i f_i \, w_{y,i}$$

where $i$ iterates over the dimensions in the data. Then we choose the class with the highest score as the predicted label for data instance $f$.
Learning the weights of the Perceptron
In the basic multi-class Perceptron, we scan the entire training data, one instance at a time. When we come to an instance $(f, y)$, we find the label with the highest score, $y' = \arg\max_{y''} \mathrm{score}(f, y'')$, breaking ties arbitrarily. We compare $y'$ to the true label $y$. If $y' = y$, we have correctly classified the instance, and we do nothing. Otherwise, we have wrongly predicted $y'$ instead of the true label $y$. This means that $w_y$ has scored $f$ lower, and/or $w_{y'}$ has scored $f$ higher, than what would have been ideally desirable. To avert this error in the future, we update these two weight vectors accordingly:

$$w_y \leftarrow w_y + f \qquad \text{and} \qquad w_{y'} \leftarrow w_{y'} - f$$
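A minimal NumPy sketch of this procedure is given below; the weight matrix W (one row per class) and the function signatures are illustrative and need not match the interface of p2.py.

import numpy as np

def pred(W, f):
    # W: (num_classes, num_features); f: (num_features,)
    # Computes score(f, y) = sum_i f_i * W[y, i] for all classes at once.
    return int(np.argmax(W @ f))   # argmax breaks ties by lowest index

def train(W, data, max_iter=10):
    # data: iterable of (f, y) pairs, where y is an integer class label
    for _ in range(max_iter):
        for f, y in data:
            y_hat = pred(W, f)
            if y_hat != y:         # mistake: promote true class, demote predicted
                W[y] += f
                W[y_hat] -= f
    return W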
2.1 CS 337: Theoretical Problem
An alternative to the 1-vs-rest perceptron defined above is a 1-vs-1 perceptron, one for each pair of classes. Compare this proposed 1-vs-1 approach to the previously described 1-vs-rest approach for multi-class classification. Outline and briefly justify the advantages and disadvantages of each. (1 mark)
2.2 CS 335: Lab Question
Complete the pred and train functions in p2.py. You can modify max_iter if needed. (3 marks)
3 Bias-Variance Trade-off
3.1 CS 337: Theoretical Questions
Indicate the effect of each of the following on the bias and/or variance of the learned model, along with informal reasoning. (1.5 marks)
- Increasing the value of $\lambda$ in lasso regression.
- Adding a larger number of training examples for a perceptron.
- Adding more features in the lasso regression task, such that the new features are linear functions of the original features. You can assume that LASSO is being solved using the ISTA algorithm.
3.2 CS 335: Lab Questions
Using the code you developed in Task 2, compute the test-set error for the perceptron using 10, 100, 1000, 10000, 50000, and 80000 samples (this can be achieved via a command-line option while running p2.py). Make sure you keep the test set fixed. Plot the test-set error vs. the number of training samples, and report your observations in terms of bias and variance. Also attach the plot in your answers.pdf file. A hedged sketch of the experiment loop is given below. (0.5 marks)
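The following self-contained sketch shows the shape of the experiment; the synthetic data, dimensions, and inlined perceptron loop here are stand-ins for whatever p2.py actually provides.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in data, just to make the loop runnable.
rng = np.random.default_rng(0)
C, D = 3, 5                                  # assumed class/feature counts
W_star = rng.normal(size=(C, D))

def make_split(n):
    F = rng.normal(size=(n, D))
    return F, np.argmax(F @ W_star.T, axis=1)

F_test, y_test = make_split(2000)            # fixed test set

sizes = [10, 100, 1000, 10000, 50000, 80000]
errors = []
for n in sizes:
    F_tr, y_tr = make_split(n)
    W = np.zeros((C, D))
    for _ in range(10):                      # perceptron epochs
        for f, y in zip(F_tr, y_tr):
            y_hat = int(np.argmax(W @ f))
            if y_hat != y:
                W[y] += f
                W[y_hat] -= f
    test_preds = np.argmax(F_test @ W.T, axis=1)
    errors.append(np.mean(test_preds != y_test))

plt.plot(sizes, errors, marker="o")
plt.xscale("log")
plt.xlabel("number of training samples")
plt.ylabel("test error")
plt.savefig("learning_curve.png")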
Complete the function prepare_data(X, degree) in the file p3.py to prepare data for higher-order polynomial regression, following the instructions in the function. We then fit 4 linear regression models on this augmented data, sampling 40 points at a time from the dataset. Theoretically, for what value of degree do you obtain the least train error on each run? Plot graphs of test error vs. degree for degree between 1 and 6, and of train error vs. degree, for the 4 different train-sample subsets, and explain your observations in terms of the bias-variance trade-off and overfitting. You may use the graph generated by running p3.py to develop your explanations. The degree of the fitted polynomial can be controlled with a command-line parameter. Add the plots of error vs. degree to your report as well. (1.5 marks)
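For orientation only, here is a sketch of one plausible reading of the helper; the authoritative spec is the docstring inside p3.py, which may differ (e.g., on whether a bias column or the zeroth power is included).

import numpy as np

def prepare_data(X, degree):
    # Sketch of one possible spec: treat X as a single feature column and
    # return the matrix whose columns are [x, x^2, ..., x^degree].
    # Follow the instructions inside p3.py for the real behaviour.
    x = np.asarray(X, dtype=float).reshape(-1, 1)
    return np.hstack([x ** d for d in range(1, degree + 1)])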