Homework 3 Solution

Description


Submission: You need to submit three files through MarkUs¹:

Your answers to Questions 1 and 2 as a PDF file titled hw3_writeup.pdf. You can produce the file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is readable.

Your completed code files q1.py and q2.py

Neatness Point: One of the 10 points will be given for neatness. You will receive this point as long as we don’t have a hard time reading your solutions or understanding the structure of your code.

Late Submission: 10% of the marks will be deducted for each day late, up to a maximum of 3 days. After that, no submissions will be accepted.

Collaboration. Weekly homeworks are individual work. See the Course Information handout² for detailed policies.

Data. In this assignment we will be working with the Boston Housing dataset³. This dataset contains 506 entries. Each entry consists of a house price and 13 features for houses within the Boston area. We suggest working in Python and using the scikit-learn package⁴ to load the data.

Starter Code. Starter code written in Python is provided for Question 2.

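1. [3pts] Robust Regression. One problem with linear regression using squared error loss is that it can be sensitive to outliers. Another loss function we could use is the Huber loss, parameterized by a hyperparameter $\delta$:

$$L_\delta(y, t) = H_\delta(y - t)$$

$$H_\delta(a) = \begin{cases} \frac{1}{2} a^2 & \text{if } |a| \le \delta \\ \delta \left( |a| - \frac{1}{2} \delta \right) & \text{if } |a| > \delta \end{cases}$$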

(a) [1pt] Sketch the Huber loss $L_\delta(y, t)$ and squared error loss $L_{SE}(y, t) = \frac{1}{2}(y - t)^2$ for $t = 0$, either by hand or using a plotting library. Based on your sketch, why would you expect the Huber loss to be more robust to outliers?
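As a starting point for the sketch, the two losses can be evaluated numerically. The following is a minimal sketch, assuming the usual Huber definition (quadratic for residuals with magnitude at most the threshold, linear beyond it). The function names `huber` and `squared_error` and the choice `delta = 1.0` are illustrative assumptions, not part of the assignment.

```python
import numpy as np


def huber(a, delta):
    """Huber function H_delta(a), applied elementwise to the residual a."""
    a = np.asarray(a, dtype=float)
    abs_a = np.abs(a)
    # Quadratic inside [-delta, delta], linear outside.
    return np.where(abs_a <= delta, 0.5 * a**2, delta * (abs_a - 0.5 * delta))


def squared_error(a):
    """Squared error loss (1/2) a^2, applied elementwise."""
    a = np.asarray(a, dtype=float)
    return 0.5 * a**2


delta = 1.0  # illustrative value, chosen only for this example
residuals = np.array([-10.0, -1.0, 0.0, 0.5, 10.0])  # y - t with t = 0
huber_vals = huber(residuals, delta)
se_vals = squared_error(residuals)
```

With these values, a residual of 10 costs 9.5 under the Huber loss but 50 under squared error, so a single outlier contributes far less to the total cost. Passing a dense grid such as `np.linspace(-5, 5, 200)` to both functions gives curves suitable for plotting.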

(b) [1pt] Just as with linear regression, assume a linear model:

$$y = w^\top x + b.$$

Give formulas for the partial derivatives $\partial L_\delta / \partial w$ and $\partial L_\delta / \partial b$. (We recommend you find a formula for the derivative $H'_\delta(a)$, and then give your answers in terms of $H'_\delta(y - t)$.)
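One way to organize the derivation is sketched below, assuming the standard Huber function (quadratic for $|a| \le \delta$, linear otherwise) and writing $a = y - t$ for the residual. The derivative of the Huber function is the residual itself in the quadratic region and a constant of magnitude $\delta$ in the linear region:

$$H'_\delta(a) = \begin{cases} a & \text{if } |a| \le \delta \\ \delta \, \operatorname{sign}(a) & \text{if } |a| > \delta \end{cases}$$

Applying the chain rule through $y = w^\top x + b$, where $\partial y / \partial w = x$ and $\partial y / \partial b = 1$:

$$\frac{\partial L_\delta}{\partial w} = H'_\delta(y - t) \, x, \qquad \frac{\partial L_\delta}{\partial b} = H'_\delta(y - t)$$

Both pieces of $H'_\delta$ agree at $|a| = \delta$, so the gradient is continuous, and its magnitude per example is bounded by $\delta \|x\|$ no matter how large the residual is.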

¹ https://markus.teach.cs.toronto.edu/csc411-2018-09

² http://www.cs.toronto.edu/~rgrosse/courses/csc411_f18/syllabus.pdf

³ http://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

⁴ http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html

CSC411 Homework 3

(c) [1pt] Write Python code to perform (full batch mode) gradient descent on this model. Assume the training dataset is given as a design matrix X and target vector y. Initialize w and b to all zeros. Your code should be vectorized, i.e. you should not have a for loop over training examples or input dimensions. You may find the function np.where helpful.

Submit your code as q1.py.
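The following is a minimal sketch of vectorized full-batch gradient descent for this model, not the reference solution. It assumes the cost is the average of the per-example Huber losses, and the names `huber_grad` and `gradient_descent`, the learning rate, the iteration count, and the synthetic demo data are all illustrative assumptions.

```python
import numpy as np


def huber_grad(a, delta):
    """Derivative of the Huber function: a inside [-delta, delta], delta*sign(a) outside."""
    return np.where(np.abs(a) <= delta, a, delta * np.sign(a))


def gradient_descent(X, y, delta, lr=0.01, num_iters=1000):
    """Full-batch gradient descent on the average Huber loss of a linear model.

    X is the (N, D) design matrix and y is the (N,) target vector.
    """
    N, D = X.shape
    w = np.zeros(D)  # weights initialized to zero
    b = 0.0          # bias initialized to zero
    for _ in range(num_iters):
        residual = X @ w + b - y              # prediction minus target, shape (N,)
        dL_dpred = huber_grad(residual, delta)
        grad_w = X.T @ dL_dpred / N           # average gradient w.r.t. w
        grad_b = dL_dpred.mean()              # average gradient w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic, noise-free demo data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + 0.5
w_hat, b_hat = gradient_descent(X_demo, y_demo, delta=1.0, lr=0.1, num_iters=5000)
```

The only loop is over iterations. The per-example work is handled by matrix products and `np.where`, which selects between the quadratic-region and linear-region derivatives elementwise.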

2. [6pts] Locally Weighted Regression.

(a) [2pts] Given $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$ and positive weights $a^{(1)}, \ldots, a^{(N)}$, show that the solution to the weighted least squares problem

$$w^* = \arg\min_w \frac{1}{2} \sum_{i=1}^{N} a^{(i)} \left( y^{(i)} - w^\top x^{(i)} \right)^2 + \frac{\lambda}{2} \|w\|^2 \qquad (1)$$

is given by the formula

$$w^* = \left( X^\top A X + \lambda I \right)^{-1} X^\top A y \qquad (2)$$

where $X$ is the design matrix (defined in class) and $A$ is a diagonal matrix where $A_{ii} = a^{(i)}$.

It may help you to review Section 3.1 of the csc321 notes⁵.
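A sketch of the key steps, assuming $\lambda > 0$ and writing the weighted sum in matrix form with $A = \operatorname{diag}(a^{(1)}, \ldots, a^{(N)})$:

$$J(w) = \frac{1}{2} (y - Xw)^\top A (y - Xw) + \frac{\lambda}{2} w^\top w$$

Taking the gradient with respect to $w$ and setting it to zero:

$$\nabla_w J(w) = -X^\top A (y - Xw) + \lambda w = 0 \quad \Longrightarrow \quad \left( X^\top A X + \lambda I \right) w = X^\top A y$$

Because the weights are positive, $X^\top A X$ is positive semidefinite, so under the assumption $\lambda > 0$ the matrix $X^\top A X + \lambda I$ is positive definite and therefore invertible. The objective is then strictly convex, and the stationary point is its unique minimizer.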

(b) [2pts] Locally reweighted least squares combines ideas from k-NN and linear regression. For each new test example $x$ we compute distance-based weights for each training example

$$a^{(i)} = \frac{\exp\left( -\|x - x^{(i)}\|^2 / 2\tau^2 \right)}{\sum_j \exp\left( -\|x - x^{(j)}\|^2 / 2\tau^2 \right)},$$

compute

$$w^* = \arg\min_w \frac{1}{2} \sum_{i=1}^{N} a^{(i)} \left( y^{(i)} - w^\top x^{(i)} \right)^2 + \frac{\lambda}{2} \|w\|^2,$$

and predict $\hat{y} = x^\top w^*$. Complete the implementation of locally reweighted least squares by providing the missing parts for q2.py.

Important things to notice while implementing: First, do not invert any matrix, use a linear solver (numpy.linalg.solve is one example). Second, notice that

$$\frac{\exp(A_i)}{\sum_j \exp(A_j)} = \frac{\exp(A_i - B)}{\sum_j \exp(A_j - B)},$$

but if we use $B = \max_j A_j$ it is much more numerically stable, as $\frac{\exp(A_i)}{\sum_j \exp(A_j)}$ overflows/underflows easily. This is handled automatically in the scipy package with the scipy.misc.logsumexp function⁶.
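The two implementation notes can be combined as in the following minimal sketch. It is not the contents of q2.py: the function name `lrls_predict`, its argument order, and the default regularization strength are assumptions made for illustration. The max-subtraction is done by hand rather than through a library call. In current SciPy releases the equivalent helper is available as `scipy.special.logsumexp`.

```python
import numpy as np


def lrls_predict(test_x, X_train, y_train, tau, lam=1e-5):
    """Predict the target for one test point with locally reweighted least squares.

    test_x: (D,) test input; X_train: (N, D) design matrix; y_train: (N,) targets.
    tau is the kernel bandwidth and lam the L2 regularization strength.
    """
    # Squared distances ||x - x^(i)||^2 to every training point, shape (N,).
    sq_dists = np.sum((X_train - test_x) ** 2, axis=1)

    # Exponents A_i = -||x - x^(i)||^2 / (2 tau^2).
    logits = -sq_dists / (2.0 * tau**2)

    # Subtract B = max_j A_j before exponentiating for numerical stability.
    logits -= logits.max()
    a = np.exp(logits)
    a /= a.sum()

    # Solve (X^T A X + lam I) w = X^T A y with a linear solver, no explicit inverse.
    XtA = X_train.T * a  # same as X^T @ diag(a), without building the N x N matrix
    lhs = XtA @ X_train + lam * np.eye(X_train.shape[1])
    rhs = XtA @ y_train
    w = np.linalg.solve(lhs, rhs)

    return test_x @ w
```

After the shift, the largest exponent is exactly zero, so the denominator is at least 1 and cannot underflow to zero even for very small bandwidths.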

(c) [1pt] Randomly hold out 30% of the dataset as a validation set. Compute the average loss for different values of $\tau$ in the range [10, 1000] on both the training set and the validation set. Plot the training and validation losses as a function of $\tau$ (using a log scale for $\tau$).
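The experimental setup can be sketched as follows, under several assumptions: the helper names `holdout_split` and `average_loss` are hypothetical, the loss is taken to be half the mean squared error, the number of bandwidth values (20) is arbitrary, and `predict_fn` stands for any function with the signature `predict_fn(test_x, X_train, y_train, tau)` that returns a scalar prediction.

```python
import numpy as np


def holdout_split(X, y, val_frac=0.3, seed=0):
    """Randomly split (X, y) into training and validation parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])
    n_val = int(round(val_frac * X.shape[0]))
    val_idx, train_idx = perm[:n_val], perm[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]


def average_loss(predict_fn, X_eval, y_eval, X_train, y_train, tau):
    """Average of (1/2)(prediction - target)^2 over the evaluation set.

    predict_fn is a placeholder for a per-example predictor fitted on the
    training data, such as a locally reweighted least squares routine.
    """
    preds = np.array([predict_fn(x, X_train, y_train, tau) for x in X_eval])
    return np.mean(0.5 * (preds - y_eval) ** 2)


# Bandwidths spaced evenly on a log scale between 10 and 1000.
taus = np.logspace(1, 3, num=20)
```

Calling `average_loss` once per value in `taus`, with the training set and then the validation set as the evaluation data, yields the two curves. Matplotlib's `plt.semilogx` is one way to draw them with a logarithmic horizontal axis.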