Final Exam Solution


All questions have multiple-choice answers ([a], [b], [c], …). You can collaborate with others, but do not discuss the selected or excluded choices in the answers. You can consult books and notes, but not other people’s solutions. Your solutions should be based on your own work. Definitions and notation follow the lectures.

Note about the final

There are twice as many problems in this final as there are in a homework set, and some problems require packages that will need time to get to work properly.

Problems cover different parts of the course. To facilitate your search for relevant lecture parts, an indexed version of the lecture video segments can be found at the Machine Learning Video Library:

http://work.caltech.edu/library

To discuss the final, you are encouraged to take part in the forum

http://book.caltech.edu/bookforum

where there is a dedicated subforum for this final.

Please follow the forum guidelines for posting answers (see the “BEFORE posting answers” announcement at the top there).

Nonlinear transforms

1. The polynomial transform of order Q = 10 applied to X of dimension d = 2 results in a Z space of what dimensionality (not counting the constant coordinate x₀ = 1 or z₀ = 1)?

[a] 12

[b] 20

[c] 35

[d] 100

[e] None of the above
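One way to check a count like this is to note that the monomials of total degree at most Q in d variables number C(Q + d, d), including the constant. The sketch below assumes that counting argument; the function name `poly_transform_dim` is made up for illustration.

```python
from math import comb

def poly_transform_dim(Q: int, d: int) -> int:
    """Number of non-constant monomials of total degree <= Q in d variables."""
    # Monomials of degree <= Q in d variables number C(Q + d, d);
    # subtract 1 to leave out the constant coordinate z_0 = 1.
    return comb(Q + d, d) - 1

# Sanity check on a case that can be enumerated by hand:
# Q = 2, d = 2 gives x1, x2, x1*x2, x1^2, x2^2.
print(poly_transform_dim(2, 2))  # prints 5
```

The same function can be called with the Q and d of the problem once the small case agrees with a hand count.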

Bias and Variance

2. Recall that the average hypothesis ḡ was based on training the same model H on different data sets D to get g^(D) ∈ H, and taking the expected value of g^(D) w.r.t. D to get ḡ. Which of the following models H could result in ḡ ∉ H?

[a] A singleton H (H has one hypothesis)

[b] H is the set of all constant, real-valued hypotheses

[c] H is the linear regression model

[d] H is the logistic regression model

[e] None of the above

Overfitting

3. Which of the following statements is false?

[a] If there is overfitting, there must be two or more hypotheses that have different values of E_in.

[b] If there is overfitting, there must be two or more hypotheses that have different values of E_out.

[c] If there is overfitting, there must be two or more hypotheses that have different values of (E_out − E_in).

[d] We can always determine if there is overfitting by comparing the values of (E_out − E_in).

[e] We cannot determine overfitting based on one hypothesis only.

4. Which of the following statements is true?

[a] Deterministic noise cannot occur with stochastic noise.

[b] Deterministic noise does not depend on the hypothesis set.

[c] Deterministic noise does not depend on the target function.

[d] Stochastic noise does not depend on the hypothesis set.

[e] Stochastic noise does not depend on the target distribution.

Regularization

5. The regularized weight w_reg is a solution to:

$$\text{minimize} \quad \frac{1}{N} \sum_{n=1}^{N} \left( w^T x_n - y_n \right)^2 \quad \text{subject to} \quad w^T \Gamma^T \Gamma w \le C,$$

where Γ is a matrix. If w_lin^T Γ^T Γ w_lin ≤ C, where w_lin is the linear regression solution, then what is w_reg?

[a] w_reg = w_lin

[b] w_reg = Γ w_lin

[c] w_reg = Γ^T Γ w_lin

[d] w_reg = C Γ w_lin

[e] w_reg = C w_lin

6. Soft-order constraints that regularize polynomial models can be

[a] written as hard-order constraints

[b] translated into augmented error

[c] determined from the value of the VC dimension

[d] used to decrease both E_in and E_out

[e] None of the above is true

Regularized Linear Regression

We are going to experiment with linear regression for classification on the processed US Postal Service Zip Code data set from Homework 8. Download the data (extracted features of intensity and symmetry) for training and testing:

http://www.amlbook.com/data/zip/features.train

http://www.amlbook.com/data/zip/features.test

(the format of each row is: digit intensity symmetry). We will train two types of binary classifiers: one-versus-one (one digit is class +1 and another digit is class −1, with the rest of the digits disregarded), and one-versus-all (one digit is class +1 and the rest of the digits are class −1). When evaluating E_in and E_out, use binary classification error. Implement the regularized least-squares linear regression for classification that minimizes

$$\frac{1}{N} \sum_{n=1}^{N} \left( w^T z_n - y_n \right)^2 + \frac{\lambda}{N} w^T w$$

where w includes w₀.
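A minimal sketch of this setup, assuming NumPy and assuming the two data files have been downloaded to local paths. Setting the gradient of the regularized objective to zero gives the closed form w = (ZᵀZ + λI)⁻¹ Zᵀy, which is what `ridge_fit` solves. All function names here are illustrative, not part of the course material.

```python
import numpy as np

def load_features(path):
    """Read rows of 'digit intensity symmetry' into (digits, X)."""
    data = np.loadtxt(path)  # path to a locally saved copy of the file
    return data[:, 0].astype(int), data[:, 1:]

def one_vs_all_labels(digits, target):
    """+1 for the target digit, -1 for every other digit."""
    return np.where(digits == target, 1.0, -1.0)

def add_bias(X):
    """Prepend the constant coordinate so that z = (1, x1, x2)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def ridge_fit(Z, y, lam):
    """Minimize (1/N)||Zw - y||^2 + (lam/N) w^T w in closed form."""
    d = Z.shape[1]
    # Zero gradient gives (Z^T Z + lam I) w = Z^T y; w_0 is regularized too.
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

def binary_error(w, Z, y):
    """Fraction of points where sign(w^T z) disagrees with y."""
    return float(np.mean(np.sign(Z @ w) != y))
```

For a one-versus-one classifier, the rows for the two digits would be selected first and labeled +1 and −1 before calling `ridge_fit`.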

7. Set λ = 1 and do not apply a feature transform (i.e., use z = x = (1, x₁, x₂)). Which among the following classifiers has the lowest E_in?

[a] 5 versus all

[b] 6 versus all

[c] 7 versus all

[d] 8 versus all

[e] 9 versus all

8. Now, apply a feature transform z = (1, x₁, x₂, x₁x₂, x₁², x₂²), and set λ = 1. Which among the following classifiers has the lowest E_out?

[a] 0 versus all

[b] 1 versus all

[c] 2 versus all

[d] 3 versus all

[e] 4 versus all
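The second-order feature transform named in the question above can be sketched as follows, assuming NumPy and inputs stored as rows (x₁, x₂). The function name is hypothetical.

```python
import numpy as np

def second_order_transform(X):
    """Map rows (x1, x2) to (1, x1, x2, x1*x2, x1^2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1**2, x2**2])
```

The output already contains the constant coordinate, so no separate bias column should be added before fitting.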

9. If we compare using the transform versus not using it, and apply that to ‘0 versus all’ through ‘9 versus all’, which of the following statements is correct for λ = 1?

[a] Overfitting always occurs when we use the transform.

[b] The transform always improves the out-of-sample performance by at least 5% (E_out with transform ≤ 0.95 E_out without transform).

[c] The transform does not make any difference in the out-of-sample performance.

[d] The transform always worsens the out-of-sample performance by at least 5%.

[e] The transform improves the out-of-sample performance of ‘5 versus all,’ but by less than 5%.

10. Train the ‘1 versus 5’ classifier with z = (1, x₁, x₂, x₁x₂, x₁², x₂²) with λ = 0.01 and λ = 1. Which of the following statements is correct?

[a] Overfitting occurs (from λ = 1 to λ = 0.01).

[b] The two classifiers have the same E_in.

[c] The two classifiers have the same E_out.

[d] When λ goes up, both E_in and E_out go up.

[e] When λ goes up, both E_in and E_out go down.

Support Vector Machines

11. Consider the following training set generated from a target function f : X → {−1, +1} where X = R²

x₁ = (1, 0), y₁ = −1    x₂ = (0, 1), y₂ = −1    x₃ = (0, −1), y₃ = −1
x₄ = (−1, 0), y₄ = +1   x₅ = (0, 2), y₅ = +1    x₆ = (0, −2), y₆ = +1
x₇ = (−2, 0), y₇ = +1

Transform this training set into another two-dimensional space Z

z₁ = x₂² − 2x₁ − 1    z₂ = x₁² − 2x₂ + 1

Using geometry (not quadratic programming), what values of w (without w₀) and b specify the separating plane w^T z + b = 0 that maximizes the margin in the Z space? The values of w₁, w₂, b are:

[a] −1, 1, −0.5

[b] 1, −1, −0.5

[c] 1, 0, −0.5

[d] 0, 1, −0.5

[e] None of the above would work.
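Before reasoning geometrically, it can help to list the transformed points. The sketch below assumes NumPy and assumes the minus signs lost in extraction are as follows: y₁ = y₂ = y₃ = −1, x₃ = (0, −1), x₄ = (−1, 0), x₆ = (0, −2), x₇ = (−2, 0), z₁ = x₂² − 2x₁ − 1 and z₂ = x₁² − 2x₂ + 1. The helper `margin` is an illustrative addition for comparing candidate planes.

```python
import numpy as np

# Training points, with the minus signs as assumed in the lead-in.
X = np.array([[1, 0], [0, 1], [0, -1], [-1, 0],
              [0, 2], [0, -2], [-2, 0]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1, 1], dtype=float)

def to_z_space(X):
    """Apply z1 = x2^2 - 2 x1 - 1 and z2 = x1^2 - 2 x2 + 1 row by row."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x2**2 - 2 * x1 - 1, x1**2 - 2 * x2 + 1])

def margin(w, b, Z, y):
    """Smallest signed distance to the plane; negative if a point is misclassified."""
    return float(np.min(y * (Z @ w + b)) / np.linalg.norm(w))

Z = to_z_space(X)
for z, label in zip(Z, y):
    print(z, label)
```

Plotting or simply reading the printed pairs shows where the two classes sit in the Z space, and `margin` can be evaluated on any candidate (w, b).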

12. Consider the same training set of the previous problem, but instead of explicitly transforming the input space X, apply the hard-margin SVM algorithm with the kernel

K(x, x′) = (1 + x^T x′)²

(which corresponds to a second-order polynomial transformation). Set up the expression for L(α₁ ... α₇) and solve for the optimal α₁, ..., α₇ (numerically, using a quadratic programming package). The number of support vectors you get is in what range?

[a] 0-1

[b] 2-3

[c] 4-5

[d] 6-7

[e] >7
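Setting up the dual amounts to building the matrix of the quadratic term. The sketch below assumes NumPy and the same training points as the previous problem (with the minus signs restored by assumption, as noted in the code). It stops short of calling a solver, since the choice of quadratic programming package is left open here.

```python
import numpy as np

# Same seven points as the previous problem (signs are an assumption).
X = np.array([[1, 0], [0, 1], [0, -1], [-1, 0],
              [0, 2], [0, -2], [-2, 0]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1, 1], dtype=float)

def poly_kernel(X1, X2):
    """K(x, x') = (1 + x^T x')^2 for every pair of rows."""
    return (1.0 + X1 @ X2.T) ** 2

def dual_quadratic_matrix(X, y):
    """Q[i, j] = y_i y_j K(x_i, x_j), the quadratic term of the dual."""
    return np.outer(y, y) * poly_kernel(X, X)

def dual_objective(alpha, Q):
    """L(alpha) = sum(alpha) - 0.5 * alpha^T Q alpha (to be maximized)."""
    return float(np.sum(alpha) - 0.5 * alpha @ Q @ alpha)

Q = dual_quadratic_matrix(X, y)
# A QP package would now minimize 0.5 * a^T Q a - 1^T a
# subject to y^T a = 0 and a >= 0; that call is left out here.
```

After solving, support vectors are the points whose α exceeds a small numerical threshold, since solvers rarely return exact zeros.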

Radial Basis Functions

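We experiment with the RBF model, both in regular form (Lloyd + pseudo-inverse) with K centers:

$$\mathrm{sign}\left( \sum_{k=1}^{K} w_k \exp\left( -\gamma \, \|x - \mu_k\|^2 \right) + b \right)$$

(notice that there is a bias term), and in kernel form (using the RBF kernel in hard-margin SVM):

$$\mathrm{sign}\left( \sum_{\alpha_n > 0} \alpha_n y_n \exp\left( -\gamma \, \|x - x_n\|^2 \right) + b \right).$$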

The input space is X = [−1, 1] × [−1, 1] with uniform probability distribution, and the target is

f(x) = sign(x₂ − x₁ + 0.25 sin(πx₁))

which is slightly nonlinear in the X space. In each run, generate 100 training points at random using this target, and apply both forms of RBF to these training points. Here are some guidelines:

Repeat the experiment for as many runs as needed to get the answer to be stable (statistically away from flipping to the closest competing answer).

In case a data set is not separable in the ‘Z space’ by the RBF kernel using hard-margin SVM, discard the run but keep track of how often this happens, if ever.

When you use Lloyd’s algorithm, initialize the centers to random points in X and iterate until there is no change from iteration to iteration. If a cluster becomes empty, discard the run and repeat.
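The regular-form pipeline described above (random data, Lloyd's algorithm, pseudo-inverse with a bias term) can be sketched as follows, assuming NumPy. The target is taken to be sign(x₂ − x₁ + 0.25 sin(πx₁)) on [−1, 1] × [−1, 1]; the function names, the `None` return for an empty cluster, and the use of `np.allclose` as the "no change" test are choices made for this illustration only.

```python
import numpy as np

def target(X):
    """f(x) = sign(x2 - x1 + 0.25 sin(pi x1))."""
    return np.sign(X[:, 1] - X[:, 0] + 0.25 * np.sin(np.pi * X[:, 0]))

def generate_data(n, rng):
    """Draw n points uniformly from [-1, 1] x [-1, 1] and label them."""
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    return X, target(X)

def lloyd(X, K, rng):
    """Return K centers, or None if some cluster becomes empty."""
    centers = rng.uniform(-1.0, 1.0, size=(K, 2))
    while True:
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        if len(np.unique(assign)) < K:
            return None  # empty cluster: the caller discards the run
        new_centers = np.array([X[assign == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):
            return new_centers
        centers = new_centers

def rbf_features(X, centers, gamma):
    """Columns exp(-gamma ||x - mu_k||^2), preceded by a 1 for the bias b."""
    sq = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.hstack([np.ones((len(X), 1)), np.exp(-gamma * sq)])

def fit_regular_rbf(X, y, K, gamma, rng):
    """Lloyd + pseudo-inverse; returns (centers, weights) or None."""
    centers = lloyd(X, K, rng)
    if centers is None:
        return None
    Phi = rbf_features(X, centers, gamma)
    weights = np.linalg.pinv(Phi) @ y  # weights[0] is the bias b
    return centers, weights

def predict(X, centers, weights, gamma):
    """Evaluate the regular-form RBF classifier on the rows of X."""
    return np.sign(rbf_features(X, centers, gamma) @ weights)
```

E_out can then be estimated by calling `generate_data` with a large n for a fresh test set and comparing `predict` against its labels. The kernel form needs a separate SVM or QP package and is not covered by this sketch.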

13. For γ = 1.5, how often do you get a data set that is not separable by the RBF kernel (using hard-margin SVM)? Hint: Run the hard-margin SVM, then check that the solution has E_in = 0.

[a] ≤ 5% of the time

[b] > 5% but ≤ 10% of the time

[c] > 10% but ≤ 20% of the time

[d] > 20% but ≤ 40% of the time

[e] > 40% of the time

14. If we use K = 9 for regular RBF and take γ = 1.5, how often does the kernel form beat the regular form (excluding runs mentioned in Problem 13 and runs with empty clusters, if any) in terms of E_out?

[a] ≤ 15% of the time

[b] > 15% but ≤ 30% of the time

[c] > 30% but ≤ 50% of the time

[d] > 50% but ≤ 75% of the time

[e] > 75% of the time

15. If we use K = 12 for regular RBF and take γ = 1.5, how often does the kernel form beat the regular form (excluding runs mentioned in Problem 13 and runs with empty clusters, if any) in terms of E_out?

[a] ≤ 10% of the time

[b] > 10% but ≤ 30% of the time

[c] > 30% but ≤ 60% of the time

[d] > 60% but ≤ 90% of the time

[e] > 90% of the time

16. Now we focus on regular RBF only, with γ = 1.5. If we go from K = 9 clusters to K = 12 clusters (only 9 and 12), which of the following 5 cases happens most often in your runs (excluding runs with empty clusters, if any)? Up or down means strictly so.

[a] E_in goes down, but E_out goes up.

[b] E_in goes up, but E_out goes down.

[c] Both E_in and E_out go up.

[d] Both E_in and E_out go down.

[e] E_in and E_out remain the same.

17. For regular RBF with K = 9, if we go from γ = 1.5 to γ = 2 (only 1.5 and 2), which of the following 5 cases happens most often in your runs (excluding runs with empty clusters, if any)? Up or down means strictly so.

[a] E_in goes down, but E_out goes up.

[b] E_in goes up, but E_out goes down.

[c] Both E_in and E_out go up.

[d] Both E_in and E_out go down.

[e] E_in and E_out remain the same.

18. What is the percentage of time that regular RBF achieves E_in = 0 with K = 9 and γ = 1.5 (excluding runs with empty clusters, if any)?

[a] ≤ 10% of the time

[b] > 10% but ≤ 20% of the time

[c] > 20% but ≤ 30% of the time

[d] > 30% but ≤ 50% of the time

[e] > 50% of the time

Bayesian Priors

19. Let f ∈ [0, 1] be the unknown probability of getting a heart attack for people in a certain population. Notice that f is just a constant, not a function, for simplicity. We want to model f using a hypothesis h ∈ [0, 1]. Before we see any data, we assume that P(h = f) is uniform over h ∈ [0, 1] (the prior). We pick one person from the population, and it turns out that he or she had a heart attack. Which of the following is true about the posterior probability that h = f given this sample point?

[a] The posterior is uniform over [0, 1].

[b] The posterior increases linearly over [0, 1].

[c] The posterior increases nonlinearly over [0, 1].

[d] The posterior is a delta function at 1 (implying f has to be 1).

[e] The posterior cannot be evaluated based on the given information.
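A posterior like this can be explored numerically with Bayes' rule, posterior ∝ likelihood × prior. The sketch below assumes NumPy, a grid of candidate values for h, and that the likelihood of observing one heart attack when h = f is simply h. The grid size is arbitrary.

```python
import numpy as np

# Grid over candidate values h in [0, 1].
h = np.linspace(0.0, 1.0, 1001)
prior = np.ones_like(h)   # uniform prior density on [0, 1]
likelihood = h            # P(sampled person had a heart attack | h = f)

unnormalized = likelihood * prior
# Normalize with the trapezoidal rule so the density integrates to 1.
dh = h[1] - h[0]
area = np.sum((unnormalized[:-1] + unnormalized[1:]) / 2.0) * dh
posterior = unnormalized / area
```

Plotting `posterior` against `h` shows the shape of the posterior, which can then be compared with the listed choices.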

Aggregation

20. Given two learned hypotheses g₁ and g₂, we construct the aggregate hypothesis g given by g(x) = ½ (g₁(x) + g₂(x)) for all x ∈ X. If we use mean-squared error, which of the following statements is true?

[a] E_out(g) cannot be worse than E_out(g₁).

[b] E_out(g) cannot be worse than the smaller of E_out(g₁) and E_out(g₂).

[c] E_out(g) cannot be worse than the average of E_out(g₁) and E_out(g₂).

[d] E_out(g) has to be between E_out(g₁) and E_out(g₂) (including the end values of that interval).

[e] None of the above
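Statements of this kind can be probed with a quick Monte Carlo estimate before attempting a proof. The sketch below assumes NumPy; the target and the two hypotheses are made up purely for illustration, so a single run can only refute a statement, not establish it.

```python
import numpy as np

def mse(pred, y):
    """Mean-squared error between predictions and targets."""
    return float(np.mean((pred - y) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=10_000)
y = np.sin(np.pi * x)     # hypothetical target, for illustration only

g1 = 0.8 * x              # hypothetical learned hypotheses
g2 = -0.3 * x + 0.5
g = 0.5 * (g1 + g2)       # the aggregate hypothesis

e1, e2, e = mse(g1, y), mse(g2, y), mse(g, y)
print(e1, e2, e, 0.5 * (e1 + e2))
```

Swapping in other pairs of hypotheses and comparing the printed values against each candidate statement is a cheap way to look for counterexamples.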
