Description
All questions have multiple-choice answers ([a], [b], [c], …). You can collaborate with others, but do not discuss the selected or excluded choices in the answers. You can consult books and notes, but not other people’s solutions. Your solutions should be based on your own work. De nitions and notation follow the lectures.
Note about the homework
The goal of the homework is to facilitate a deeper understanding of the course material. The questions are not designed to be puzzles with catchy answers. They are meant to make you roll up your sleeves, face uncertainties, and ap-proach the problem from di erent angles.
The problems range from easy to di cult, and from practical to theoretical. Some problems require running a full experiment to arrive at the answer.
The answer may not be obvious or numerically close to one of the choices, but one (and only one) choice will be correct if you follow the instructions precisely in each problem. You are encouraged to explore the problem further by experimenting with variations on these instructions, for the learning bene t.
You are also encouraged to take part in the forum http://book.caltech.edu/bookforum
where there are many threads about each homework set. We hope that you will contribute to the discussion as well. Please follow the forum guidelines for posting answers (see the \BEFORE posting answers” announcement at the top there).
c 2012-2015 Yaser Abu-Mostafa. All rights reserved. No redistribution in any format. No translation or derivative products without written permission.
1
Validation
In the following problems, use the data provided in the les in.dta and out.dta for Homework # 6. We are going to apply linear regression with a nonlinear transforma-tion for classi cation (without regularization). The nonlinear transformation is given by 0 through 7 which transform (x1; x2) into
1 x1 x2 x21 x22 x1x2 jx1 x2j jx1 + x2j
To illustrate how taking out points for validation a ects the performance, we will consider the hypotheses trained on Dtrain (without restoring the full D for training after validation is done).
-
Split in.dta into training ( rst 25 examples) and validation (last 10 examples). Train on the 25 examples only, using the validation set of 10 examples to select between ve models that apply linear regression to 0 through k, with k = 3; 4; 5; 6; 7. For which model is the classi cation error on the validation set smallest?
-
-
k = 3
-
-
-
k = 4
-
-
-
k = 5
-
-
-
k = 6
-
-
-
k = 7
-
-
Evaluate the out-of-sample classi cation error using out.dta on the 5 models to see how well the validation set predicted the best of the 5 models. For which model is the out-of-sample classi cation error smallest?
-
-
k = 3
-
-
-
k = 4
-
-
-
k = 5
-
-
-
k = 6
-
-
-
k = 7
-
-
Reverse the role of training and validation sets; now training with the last 10 examples and validating with the rst 25 examples. For which model is the classi cation error on the validation set smallest?
-
-
k = 3
-
-
-
k = 4
-
2
-
-
-
k = 5
-
-
-
-
-
k = 6
-
-
-
-
-
k = 7
-
-
-
-
Once again, evaluate the out-of-sample classi cation error using out.dta on the 5 models to see how well the validation set predicted the best of the 5 models. For which model is the out-of-sample classi cation error smallest?
-
-
-
-
k = 3
-
-
-
-
-
k = 4
-
-
-
-
-
k = 5
-
-
-
-
-
k = 6
-
-
-
-
-
k = 7
-
-
-
-
What values are closest in Euclidean distance to the out-of-sample classi cation error obtained for the model chosen in Problems 1 and 3, respectively?
-
-
-
-
0.0, 0.1
-
-
-
-
-
0.1, 0.2
-
-
-
-
-
0.1, 0.3
-
-
-
-
-
0.2, 0.2
-
-
-
-
-
0.2, 0.3
-
-
Validation Bias
-
-
Let e1 and e2 be independent random variables, distributed uniformly over the interval [0; 1]. Let e = min(e1; e2). The expected values of e1; e2; e are closest to
-
-
-
-
0.5, 0.5, 0
-
-
-
-
-
0.5, 0.5, 0.1
-
-
-
-
-
0.5, 0.5, 0.25
-
-
-
-
-
0.5, 0.5, 0.4
-
-
-
-
-
0.5, 0.5, 0.5
-
-
Cross Validation
7. You are given the data points (x; y): ( 1; 0); ( ; 1); (1; 0); 0; and a choice between two models: constant f h0(x) = b g and linear f h1(x) = ax + b g. For which value of would the two models be tied using leave-one-out cross-validation with the squared error measure?
3
pp
-
-
3 + 4
-
pp
-
-
3 1 p p
-
9+4 6 p p
-
-
-
96
-
-
-
None of the above
-
PLA vs. SVM
Notice: Quadratic Programming packages sometimes need tweaking and have numeri-cal issues, and this is characteristic of packages you will use in practical ML situations. Your understanding of support vectors will help you get to the correct answers.
In the following problems, we compare PLA to SVM with hard margin1 on linearly separable data sets. For each run, you will create your own target function f and data set D. Take d = 2 and choose a random line in the plane as your target function
-
(do this by taking two random, uniformly distributed points on [ 1; 1] [ 1; 1] and taking the line passing through them), where one side of the line maps to +1
and the other maps to 1. Choose the inputs xn of the data set as random points in X = [ 1; 1] [ 1; 1], and evaluate the target function on each xn to get the corresponding output yn. If all data points are on one side of the line, discard the run and start a new run.
Start PLA with the all-zero vector and pick the misclassi ed point for each PLA iteration at random. Run PLA to nd the nal hypothesis gPLA and measure the disagreement between f and gPLA as P[f(x) 6= gPLA(x)] (you can either calculate this exactly, or approximate it by generating a su ciently large, separate set of points to evaluate it). Now, run SVM on the same data to nd the nal hypothesis gSVM by solving
-
min
1
wTw
2
w;b
s:t:
yn wTxn + b 1
using quadratic programming on the primal or the dual problem. Measure the dis-agreement between f and gSVM as P[f(x) 6= gSVM(x)], and count the number of support vectors you get in each run.
-
For N = 10, repeat the above experiment for 1000 runs. How often is gSVM better than gPLA in approximating f? The percentage of time is closest to:
-
20%
-
For hard margin in SVM packages, set C ! 1.
4
-
-
40%
-
-
-
60%
-
-
-
80%
-
-
-
100%
-
-
For N = 100, repeat the above experiment for 1000 runs. How often is gSVM better than gPLA in approximating f? The percentage of time is closest to:
-
-
10%
-
-
-
30%
-
-
-
50%
-
-
-
70%
-
-
-
90%
-
-
For the case N = 100, which of the following is the closest to the average number of support vectors of gSVM (averaged over the 1000 runs)?
-
-
2
-
-
-
3
-
-
-
5
-
-
-
10
-
-
-
20
-
5