Homework 7 Solution

Description

5/5 – (2 votes)

Submission: You need to submit your solutions through MarkUs¹ as the PDF le hw7_writeup.pdf.

Neatness Point: One of the 10 points will be given for neatness. You will receive this point as long as we don’t have a hard time reading your solutions or understanding the structure of your code.

Late Submission: 10% of the marks will be deducted for each day late, up to a maximum of 3 days. After that, no submissions will be accepted.

Collaboration. Weekly homeworks are individual work. See the Course Information handout² for detailed policies.

[5pts] Representer Theorem. In this question, you’ll prove and apply a simpli ed version of the Representer Theorem, which is the basis for a lot of kernelized algorithms. Consider a linear model:

z = w^> (x)

= g(z);

where is a feature map and g is some function (e.g. identity, logistic, etc.). We are given a training set f(x⁽ⁱ⁾; t⁽ⁱ⁾)g^N_i=1. We are interested in minimizing the expected loss plus an L₂ regularization term:

1		N
		X_i
J (w) = _N		X_i
J (w) = _N		L(y⁽ⁱ⁾; t⁽ⁱ⁾) + ₂ kwk²;
	=1
where L is some loss function. Let	denote the feature matrix
	₌0		(x⁽¹⁾_…	)^>	1	:
		B	_(x(N)₎>		C
	@				A

Observe that this formulation captures a lot of the models we’ve covered in this course, including linear regression, logistic regression, and SVMs.

(a) [2pts] Show that the optimal weights must lie in the row space of .

Hint: Given a subspace S, a vector v can be decomposed as v = v_S +v_?, where v_S is the projection of v onto S, and v_? is orthogonal to S. (You may assume this fact without proof, but you can review it here³.) Apply this decomposition to w and see if you can show something about one of the two components.

https://markus.teach.cs.toronto.edu/csc411-2018-09

http://www.cs.toronto.edu/_~rgrosse/courses/csc411_f18/syllabus.pdf

https://metacademy.org/graphs/concepts/projection_onto_a_subspace

CSC411 Fall 2018 Homework 7

(b) [3pts] Another way of stating the result from part (a) is that w = ^> for some vector . Hence, instead of solving for w, we can solve for . Consider the vectorized form of

the L₂ regularized linear regression cost function:
J (w) =	1	kt wk² +		kwk²:

	2N		2

Substitute in w = ^> , to write the cost function as a function of . Determine the optimal value of . Your answer should be an expression involving , t, and the Gram

matrix K = ^>. For simplicity, you may assume that K is positive de nite. (The algorithm still works if K is merely PSD, it’s just a bit more work to derive.)

Hint: the cost function J ( ) is a quadratic function. Simplify the formula into the following form:

¹₂ ^>A + b^> + c;

for some positive de nite matrix A, vector b and constant c (which can be ignored). You may assume without proof that the minimum of such a quadratic function is given by = A ¹b.

[4pts] Compositional Kernels. One of the most useful facts about kernels is that they can be composed using addition and multiplication. I.e., the sum of two kernels is a kernel, and the product of two kernels is a kernel. We’ll show this in the case of kernels which represent dot products between nite feature vectors.

(a) [1pt] Suppose k₁(x; x⁰) = ₁(x)^> ₁(x⁰) and k₂(x; x⁰) = ₂(x)^> ₂(x⁰). Let k_S be the

sum kernel k_S(x; x⁰) = k₁(x; x⁰) + k₂(x; x⁰). Find a feature map _S such that k_S(x; x⁰) =

_S(x)^> _S(x⁰).

(b) [3pts] Suppose k₁(x; x⁰) = ₁(x)^> ₁(x⁰) and k₂(x; x⁰) = ₂(x)^> ₂(x⁰). Let k_P be

the product kernel k_P(x; x⁰) = k₁(x; x⁰) k₂(x; x⁰). Find a feature map _P such that

k_P(x; x⁰) = _P(x)^> _P(x⁰).

Hint: For inspiration, consider the quadratic kernel from Lecture 20, Slide 11.

Share this:

Share this:

Description

Share this:

Related products

ASSIGNMENT 03

Assignment-7 Binary Search Trees II:Solution

Homework 03 Solution

Lab 4: Bash Script and Bitwise Operations Solution

Problem Set 4 Solution