Description
All questions have multiple-choice answers ([a], [b], [c], …). You can collaborate with others, but do not discuss the selected or excluded choices in the answers. You can consult books and notes, but not other people’s solutions. Your solutions should be based on your own work. De nitions and notation follow the lectures.
Note about the homework
The goal of the homework is to facilitate a deeper understanding of the course material. The questions are not designed to be puzzles with catchy answers. They are meant to make you roll up your sleeves, face uncertainties, and ap-proach the problem from di erent angles.
The problems range from easy to di cult, and from practical to theoretical. Some problems require running a full experiment to arrive at the answer.
The answer may not be obvious or numerically close to one of the choices, but one (and only one) choice will be correct if you follow the instructions precisely in each problem. You are encouraged to explore the problem further by experimenting with variations on these instructions, for the learning bene t.
You are also encouraged to take part in the forum http://book.caltech.edu/bookforum
where there are many threads about each homework set. We hope that you will contribute to the discussion as well. Please follow the forum guidelines for posting answers (see the \BEFORE posting answers” announcement at the top there).
c 2012-2015 Yaser Abu-Mostafa. All rights reserved. No redistribution in any format. No translation or derivative products without written permission.
1
Over tting and Deterministic Noise
-
-
Deterministic noise depends on H, as some models approximate f better than others. Assume H0 H and that f is xed. In general (but not necessarily in all cases), if we use H0 instead of H, how does deterministic noise behave?
-
In general, deterministic noise will decrease.
-
-
-
-
-
In general, deterministic noise will increase.
-
-
-
-
-
In general, deterministic noise will be the same.
-
-
-
-
-
There is deterministic noise for only one of H and H0.
-
-
Regularization with Weight Decay
In the following problems use the data provided in the les
http://work.caltech.edu/data/in.dta
http://work.caltech.edu/data/out.dta
as a training and test set respectively. Each line of the les corresponds to a two-dimensional input x = (x1; x2), so that X = R2, followed by the corresponding label from Y = f 1; 1g. We are going to apply Linear Regression with a non-linear transformation for classi cation. The nonlinear transformation is given by
(x1; x2) = (1; x1; x2; x21; x22; x1x2; jx1 x2j; jx1 + x2j):
Recall that the classi cation error is de ned as the fraction of misclassi ed points.
-
Run Linear Regression on the training set after performing the non-linear trans-formation. What values are closest (in Euclidean distance) to the in-sample and out-of-sample classi cation errors, respectively?
-
-
0.03, 0.08
-
-
-
0.03, 0.10
-
-
-
0.04, 0.09
-
-
-
0.04, 0.11
-
-
-
0.05, 0.10
-
-
3. Now add weight decay to Linear Regression, that is, add the term
7
w2
N
i=0
k
i
values to
to the squared in-sample error, using
= 10 . What are the closest
P
the in-sample and out-of-sample classi cation errors, respectively, for k =
3?
Recall that the solution for Linear Regression with Weight Decay was derived in class.
2
-
-
-
0.01, 0.02
-
-
-
-
-
0.02, 0.04
-
-
-
-
-
0.02, 0.06
-
-
-
-
-
0.03, 0.08
-
-
-
-
-
0.03, 0.10
-
-
-
-
Now, use k = 3. What are the closest values to the new in-sample and out-of-sample classi cation errors, respectively?
-
-
-
-
0.2, 0.2
-
-
-
-
-
0.2, 0.3
-
-
-
-
-
0.3, 0.3
-
-
-
-
-
0.3, 0.4
-
-
-
-
-
0.4, 0.4
-
-
-
-
What value of k, among the following choices, achieves the smallest out-of-sample classi cation error?
-
-
-
-
2
-
-
-
-
-
1
-
-
-
-
-
0
-
-
-
-
-
1
-
-
-
-
-
2
-
-
-
-
What value is closest to the minimum out-of-sample classi cation error achieved by varying k (limiting k to integer values)?
-
-
-
-
0.04
-
-
-
-
-
0.06
-
-
-
-
-
0.08
-
-
-
-
-
0.10
-
-
-
-
-
0.12
-
-
Regularization for Polynomials
Polynomial models can be viewed as linear models in a space Z, under a nonlinear transform : X ! Z. Here, transforms the scalar x into a vector z of Legendre
3
polynomials, z = (1; L1(x); L2(x); :::; LQ(x)). Our hypothesis set will be expressed as a linear combination of these polynomials,
-
HQ = (h j h(x) = wTz =
Q
wqLq(x)
);
=0
Xq
where L0(x) = 1.
-
Consider the following hypothesis set de ned by the constraint:
H(Q; C; Qo) = fh j h(x) = wTz 2 HQ; wq = C for q Qog;
which of the following statements is correct:
-
-
-
H(10; 0; 3) [ H(10; 0; 4) = H4
-
-
-
-
-
H(10; 1; 3) [ H(10; 1; 4) = H3
-
-
-
-
-
H(10; 0; 3) \ H(10; 0; 4) = H2
-
-
-
-
-
H(10; 1; 3) \ H(10; 1; 4) = H1
-
-
-
-
-
None of the above
-
-
Neural Networks
-
-
A fully connected Neural Network has L = 2; d(0) = 5; d(1) = 3; d(2) = 1. If only products of the form wij(l)x(il 1), wij(l) j(l), and x(il 1) j(l) count as operations (even for x(0l 1) = 1), without counting anything else, which of the following is the closest to the total number of operations in a single iteration of backpropagation (using SGD on one data point)?
-
-
-
-
30
-
-
-
-
-
35
-
-
-
-
-
40
-
-
-
-
-
45
-
-
-
-
-
50
-
-
Let us call every ‘node’ in a Neural Network a unit, whether that unit is an input variable or a neuron in one of the layers. Consider a Neural Network that has 10 input units (the constant x(0)0 is counted here as a unit), one output unit, and 36 hidden units (each x(0l) is also counted as a unit). The hidden units can be arranged in any number of layers l = 1; ; L 1, and each layer is fully connected to the layer above it.
4
-
What is the minimum possible number of weights that such a network can have?
-
-
46
-
-
-
47
-
-
-
56
-
-
-
57
-
-
-
58
-
-
What is the maximum possible number of weights that such a network can have?
-
-
386
-
-
-
493
-
-
-
494
-
-
-
509
-
-
-
510
-
5