HW4 Solution

UCB CS 189, Spring 2020. All Rights Reserved. This may not be publicly shared without explicit permission.
  • Logistic Regression with Newton’s Method

Given examples $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ and associated labels $y_1, y_2, \ldots, y_n \in \{0, 1\}$, the cost function for unregularized logistic regression is

$$J(w) \triangleq -\sum_{i=1}^{n} \Bigl( y_i \ln s_i + (1 - y_i) \ln(1 - s_i) \Bigr)$$

where $s_i \triangleq s(x_i \cdot w)$, $w \in \mathbb{R}^d$ is a weight vector, and $s(\gamma) \triangleq 1/(1 + e^{-\gamma})$ is the logistic function.

Define the $n \times d$ design matrix $X$ (whose $i$th row is $x_i^T$), the label $n$-vector $y \triangleq [y_1 \; \ldots \; y_n]^T$, and $s \triangleq [s_1 \; \ldots \; s_n]^T$. For an $n$-vector $a$, let $\ln a \triangleq [\ln a_1 \; \ldots \; \ln a_n]^T$. The cost function can be rewritten in vector form as $J(w) = -y \cdot \ln s - (1 - y) \cdot \ln(1 - s)$.

Hint: Recall the matrix calculus identities $\nabla_x (y \cdot z) = (\nabla_x y)\, z + (\nabla_x z)\, y$; $\nabla_x f(y) = (\nabla_x y)\,(\nabla_y f(y))$; $\nabla_x\, g(y(x)) = (\nabla_x\, y(x))\,(\nabla_{y(x)}\, g)$; and $\nabla_x (Cx) = C^T$ (where $C$ is a constant matrix).
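
For reference while checking your own derivation, here is a hedged sketch of where these identities lead when applied to the vector-form cost above, using $s'(\gamma) = s(\gamma)\,(1 - s(\gamma))$; the shorthand $\Omega$ for the diagonal matrix is my own notation and is not taken from the handout:

$$\nabla_w J(w) = X^{T}\bigl(s - y\bigr), \qquad \nabla^2_w J(w) = X^{T}\,\Omega\, X, \qquad \Omega \triangleq \operatorname{diag}\bigl(s_i(1 - s_i)\bigr),$$

so one Newton iteration takes the form

$$w^{(t+1)} = w^{(t)} - \bigl(X^{T}\Omega X\bigr)^{-1} X^{T}\bigl(s - y\bigr),$$

with $s$ and $\Omega$ evaluated at $w^{(t)}$.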

1. Derive the gradient $\nabla_w J(w)$ of the cost $J(w)$ as a matrix-vector expression. Also derive all intermediate derivatives in matrix-vector form. Do NOT specify them in terms of their individual components.

2. Derive the Hessian $\nabla^2_w J(w)$ for the cost function $J(w)$ as a matrix-vector expression.

3. Write the matrix-vector update law for one iteration of Newton's method, substituting the gradient and Hessian of $J(w)$.

4. You are given four examples $x_1 = [0.2 \;\; 3.1]^T$, $x_2 = [1.0 \;\; 3.0]^T$, $x_3 = [-0.2 \;\; 1.2]^T$, $x_4 = [1.0 \;\; 1.1]^T$ with labels $y_1 = 1$, $y_2 = 1$, $y_3 = 0$, $y_4 = 0$. These points cannot be separated by a line passing through the origin. Hence, as described in lecture, append a 1 to each $x_{i \in [4]}$ and use a weight vector $w \in \mathbb{R}^3$ whose last component is the bias term (called $\alpha$ in lecture). Begin with initial weight $w^{(0)} = [\,1 \;\; 1 \;\; 0\,]^T$. For the following, state only the final answer with four digits after the decimal point. You may use a calculator or write a program to solve for these (a sketch of such a program appears after part (d) below), but do NOT submit any code for this part.

(a) State the value of $s^{(0)}$ (the initial value of $s$).


(b) State the value of $w^{(1)}$ (the value of $w$ after one iteration).

(c) State the value of $s^{(1)}$ (the value of $s$ after one iteration).

(d) State the value of $w^{(2)}$ (the value of $w$ after two iterations).
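
A minimal sketch of such a program for checking your four-decimal answers, assuming NumPy and the gradient/Hessian expressions sketched above. The data values and the sign of $x_3$ follow my reading of the problem statement above; double-check them against the original handout before relying on the printed numbers.

```python
import numpy as np

# Four training points with the fictitious (bias) feature appended as the last component.
X = np.array([[ 0.2, 3.1, 1.0],
              [ 1.0, 3.0, 1.0],
              [-0.2, 1.2, 1.0],
              [ 1.0, 1.1, 1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = np.array([1.0, 1.0, 0.0])   # w^(0); adjust if your reading of the handout differs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for t in range(2):                            # two Newton iterations: w^(1), w^(2)
    s = sigmoid(X @ w)                        # s^(t)
    grad = X.T @ (s - y)                      # gradient of J at w^(t)
    hess = X.T @ np.diag(s * (1 - s)) @ X     # Hessian of J at w^(t)
    w = w - np.linalg.solve(hess, grad)       # Newton step
    print(f"s^({t})   =", np.round(s, 4))
    print(f"w^({t+1}) =", np.round(w, 4))
```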

  • Wine Classification with Logistic Regression

The wine dataset data.mat consists of 6,497 sample points, each having 12 features. The description of these features is provided in data.mat. The dataset includes a training set of 6,000 sample points and a test set of 497 sample points. Your classifier needs to predict whether a wine is white (class label 0) or red (class label 1).

Begin by normalizing each feature and adding a fictitious dimension. Whenever hyperparameter values are required, it is recommended that you tune them with cross-validation.

Use of automatic logistic regression libraries/packages is prohibited for this question. If you are coding in Python, use scipy.special.expit to evaluate the logistic function; it is numerically stable and does not produce NaN or overflow exceptions.
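
A minimal preprocessing sketch, assuming the raw feature matrices have already been pulled out of data.mat into NumPy arrays; the function and variable names are my own, not part of the assignment:

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic function

def preprocess(X_train, X_test):
    """Normalize each feature using training statistics, then append a fictitious 1."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0                      # guard against constant features
    Z_train = (X_train - mu) / sigma
    Z_test = (X_test - mu) / sigma
    add_ones = lambda Z: np.hstack([Z, np.ones((Z.shape[0], 1))])
    return add_ones(Z_train), add_ones(Z_test)

# expit(X @ w) then plays the role of s(Xw) in the cost and gradient.
```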

1. Batch Gradient Descent Update. State the batch gradient descent update law for logistic regression with $\ell_2$ regularization. As this is a “batch” algorithm, each iteration should use every training example. You don’t have to show your derivation. You may reuse results from your solution to question 4.1.

2. Batch Gradient Descent Code. Choose reasonable values for the regularization parameter and step size (learning rate), specify your chosen values, and train your model from question 3.1. Plot the value of the cost function versus the number of iterations spent in training.
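
A minimal training-loop sketch under the update $w \leftarrow w - \epsilon\bigl(2\lambda w + X^{T}(s(Xw) - y)\bigr)$. The symbols $\epsilon$ and $\lambda$, the regularization convention $\lambda\|w\|^2$, and all names below are my own choices, not mandated by the problem:

```python
import numpy as np
from scipy.special import expit

def batch_gd(X, y, lam=0.1, eps=1e-3, iters=5000):
    """Batch gradient descent for l2-regularized logistic regression.

    Assumed cost: J(w) = lam*||w||^2 - y.ln(s) - (1-y).ln(1-s), with s = expit(X w).
    Returns the weights and the cost at every iteration (for plotting).
    """
    n, d = X.shape
    w = np.zeros(d)
    costs = []
    tiny = 1e-12                                   # avoid log(0) in the cost
    for _ in range(iters):
        s = expit(X @ w)
        grad = 2 * lam * w + X.T @ (s - y)         # full-batch gradient
        w -= eps * grad
        # For simplicity the fictitious-dimension weight is regularized too;
        # you may prefer to exclude it.
        cost = lam * (w @ w) - y @ np.log(s + tiny) - (1 - y) @ np.log(1 - s + tiny)
        costs.append(cost)
    return w, costs
```

Plotting the returned costs against the iteration index (e.g., with matplotlib) then produces the required figure.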

3. Stochastic Gradient Descent (SGD) Update. State the SGD update law for logistic regression with $\ell_2$ regularization. Since this is not a “batch” algorithm anymore, each iteration uses just one training example. You don’t have to show your derivation.

4. Stochastic Gradient Descent Code. Choose a suitable value for the step size (learning rate), specify your chosen value, and run your SGD algorithm from question 3.3. Plot the value of the cost function versus the number of iterations spent in training.

Compare your plot here with that of question 3.2. Which method converges more quickly? Briefly describe what you observe.
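
A minimal SGD sketch under the same assumed cost as the batch sketch above; scaling the regularizer by $1/n$ in the single-example gradient is a convention I chose so that $n$ SGD steps roughly match one batch step, not something the problem dictates:

```python
import numpy as np
from scipy.special import expit

def sgd(X, y, lam=0.1, eps=1e-3, iters=20000, seed=0):
    """SGD for l2-regularized logistic regression: one random example per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    costs = []
    tiny = 1e-12
    for _ in range(iters):
        i = rng.integers(n)
        s_i = expit(X[i] @ w)
        grad = 2 * lam * w / n + (s_i - y[i]) * X[i]   # single-example gradient
        w -= eps * grad
        s = expit(X @ w)                                # full cost, recorded only for plotting
        costs.append(lam * (w @ w) - y @ np.log(s + tiny)
                     - (1 - y) @ np.log(1 - s + tiny))
    return w, costs
```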

5. Instead of using a constant step size (learning rate) in SGD, you could use a step size that slowly shrinks from iteration to iteration. Run your SGD algorithm from question 3.3 with a step size $\epsilon_t = \delta/t$, where $t$ is the iteration number and $\delta$ is a hyperparameter you select empirically. Mention the value of $\delta$ chosen. Plot the value of the cost function versus the number of iterations spent in training.

How does this compare to the convergence of your previous SGD code?
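
A minimal variant of the SGD sketch above with the decaying step size; the name delta mirrors the hyperparameter $\delta$ I used in restating the question and is otherwise arbitrary:

```python
import numpy as np
from scipy.special import expit

def sgd_decaying(X, y, lam=0.1, delta=1.0, iters=20000, seed=0):
    """Same as sgd() above, but with step size delta/t at (1-indexed) iteration t."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    costs = []
    tiny = 1e-12
    for t in range(1, iters + 1):
        i = rng.integers(n)
        s_i = expit(X[i] @ w)
        grad = 2 * lam * w / n + (s_i - y[i]) * X[i]
        w -= (delta / t) * grad                 # shrinking step size
        s = expit(X @ w)
        costs.append(lam * (w @ w) - y @ np.log(s + tiny)
                     - (1 - y) @ np.log(1 - s + tiny))
    return w, costs
```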

6. Kaggle. Train your best classifier on the entire training set and submit your predictions on the test sample points to Kaggle. As always for Kaggle competitions, you are welcome to add or remove features, tweak the algorithm, and do pretty much anything you want to improve your Kaggle leaderboard performance, except that you may not replace logistic regression with a wholly different learning algorithm. Your code should output the predicted labels in a CSV file.

Report your Kaggle username and your best score, and briefly describe what your best classifier does to achieve that score.
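
A minimal sketch of the CSV output step; the column names and 1-based ids below are guesses, so match them to the competition's sample submission file:

```python
from scipy.special import expit

def write_predictions(X_test, w, path="predictions.csv"):
    """Write predicted labels (0 = white, 1 = red) in a simple Id,Category layout."""
    labels = (expit(X_test @ w) >= 0.5).astype(int)
    with open(path, "w") as f:
        f.write("Id,Category\n")                 # header is an assumption
        for i, label in enumerate(labels):
            f.write(f"{i + 1},{label}\n")
```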

  • Convergence of Batch Gradient Descent in Logistic Regression

In this problem, you will prove that batch gradient descent converges to a unique optimizer of the $\ell_2$-regularized logistic regression cost function.

Given sample points $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ and associated labels $y_1, y_2, \ldots, y_n \in \{0, 1\}$, define the design matrix $X$ (whose $i$th row is $x_i^T$), the label $n$-vector $y \triangleq [y_1 \; \ldots \; y_n]^T$, and $s(Xw) \triangleq [s_1 \; \ldots \; s_n]^T$ containing the values $s_{i \in [n]} \triangleq 1/(1 + e^{-x_i \cdot w})$. For any vector $a$, let $\ln a \triangleq [\ln a_1 \; \ldots \; \ln a_n]^T$. The cost function for $\ell_2$-regularized logistic regression is

$$J(w) \triangleq \lambda\,\|w\|_2^2 \;-\; y \cdot \ln s(Xw) \;-\; (1 - y) \cdot \ln\bigl(1 - s(Xw)\bigr)$$

where $\lambda > 0$ is your choice of the regularization parameter.
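
For reference in the parts below, differentiating this cost (under the regularizer convention written above, and reusing the unregularized gradient and Hessian sketched in the Newton's-method problem) gives

$$\nabla_w J(w) = 2\lambda w + X^{T}\bigl(s(Xw) - y\bigr), \qquad \nabla^2_w J(w) = 2\lambda I + X^{T}\,\Omega\, X, \qquad \Omega \triangleq \operatorname{diag}\bigl(s_i(1 - s_i)\bigr).$$

This is a sketch under the stated convention; if your convention is $\tfrac{\lambda}{2}\|w\|^2$, the factor of 2 on the regularization terms disappears.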

1. Let $w^{(t)}$ denote the value of $w$ at iteration $t$. The initial, arbitrary weight vector is $w^{(0)}$. State the gradient descent update rule for calculating the value of $w^{(t+1)}$ as a function $g(w^{(t)})$ of the previous weight vector $w^{(t)}$, with a constant step size (learning rate) $\epsilon > 0$.

2. Show that $J(\cdot)$ is strictly convex and $J(w)$ has a unique minimizer $w^*$.

Hint: $f(x)$ is strictly convex if its Hessian $\nabla^2_x f$ is positive definite everywhere.

3. Next, show that if the step size (learning rate) $\epsilon$ is a sufficiently small constant, then the update function $g(\cdot)$ is a contraction; i.e., there exists a constant $\rho \in (0, 1)$ such that for every two points $w_1, w_2$, $\|g(w_1) - g(w_2)\| \le \rho\,\|w_1 - w_2\|$. (A sketch of this argument appears after this problem's parts.)

Hint: The Mean Value Theorem and the Cauchy–Schwarz inequality might both help.

4. Finally, complete your proof by showing that if the step size is chosen as required in question 4.3, the weight converges to the unique minimizer; that is, $\lim_{t \to \infty} w^{(t)} = w^*$.

5. You can refine your proof and guarantee quicker convergence by tightening the contraction in question 4.3. Show that for a clever choice of $\epsilon$, which may depend on $X$ and $\lambda$, but crucially not on the weights $w^{(0)}$ and $w^*$, you can guarantee that
$$\|w^{(t)} - w^*\| \;\propto\; \exp\!\left(-\frac{8\lambda\, t}{\,8\lambda + \sum_i \|x_i\|^2\,}\right).$$
Argue that this is also the best exponential rate of convergence one can guarantee when using constant learning rates.

6. If we set $\lambda = 0$, we have the unregularized logistic regression problem. Now that your proof is complete, do you see why the condition $\lambda > 0$ is necessary? What are the reasons that your proof won't be valid anymore if you choose $\lambda = 0$?
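
For part 3, here is a hedged sketch of the standard argument under the gradient and Hessian stated after the cost definition above (the problem's intended route via the Mean Value Theorem reaches the same bound). Since $s_i(1 - s_i) \le \tfrac14$,
$$2\lambda I \;\preceq\; \nabla^2_w J(w) \;\preceq\; \Bigl(2\lambda + \tfrac14 \textstyle\sum_i \|x_i\|^2\Bigr) I \;=\; L\,I \quad \text{for all } w.$$
Writing $g(w) = w - \epsilon \nabla J(w)$ and using
$$g(w_1) - g(w_2) = \int_0^1 \Bigl(I - \epsilon\,\nabla^2 J\bigl(w_2 + \tau(w_1 - w_2)\bigr)\Bigr)\,d\tau\;(w_1 - w_2),$$
any constant step size $0 < \epsilon \le 1/L$ gives $\|g(w_1) - g(w_2)\| \le (1 - 2\lambda\epsilon)\,\|w_1 - w_2\|$, i.e., a contraction with $\rho = 1 - 2\lambda\epsilon \in (0, 1)$. Iterating from $w^{(0)}$ and using $g(w^*) = w^*$ yields $\|w^{(t)} - w^*\| \le \rho^{\,t}\,\|w^{(0)} - w^*\|$, and the particular choice $\epsilon = 1/L$ gives
$$\rho^{\,t} = \Bigl(1 - \tfrac{2\lambda}{L}\Bigr)^{t} \le \exp\!\Bigl(-\tfrac{8\lambda\, t}{8\lambda + \sum_i \|x_i\|^2}\Bigr),$$
matching the rate quoted in part 5. If your regularizer convention is $\tfrac{\lambda}{2}\|w\|^2$, replace $2\lambda$ by $\lambda$ throughout.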


  • A Bayesian Interpretation of Lasso

Suppose you are aware that the labels $y_{i \in [n]}$ corresponding to sample points $x_{i \in [n]} \in \mathbb{R}^d$ follow the density law

$$f(y \mid x, w) \triangleq \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(y - w \cdot x)^2/(2\sigma^2)},$$

where $\sigma > 0$ is a known constant and $w \in \mathbb{R}^d$ is a random parameter. Suppose further that experts have told you that

  • each component of $w$ is independent of the others, and

  • each component of $w$ has the Laplace distribution with location 0 and scale being a known constant $b$; that is, each component $w_i$ obeys the density law $f(w_i) = e^{-|w_i|/b}/(2b)$.

Assume the outputs $y_{i \in [n]}$ are independent of each other.

Your goal is to find the choice of parameter $w$ that is most likely given the input-output examples $(x_i, y_i)_{i \in [n]}$. This method of estimating parameters is called maximum a posteriori (MAP); Latin for “maximum [odds] from what follows.”

1. Derive the posterior probability density law $f(w \mid (x_i, y_i)_{i \in [n]})$ for $w$, up to a proportionality constant, by applying Bayes' Theorem and using the densities $f(y_i \mid x_i, w)$ and $f(w)$. Don't try to derive an exact expression for $f(w \mid (x_i, y_i)_{i \in [n]})$, as it is very involved.

2. Define the log-likelihood for MAP as $\ell(w) \triangleq \ln f(w \mid x_{i \in [n]}, y_{i \in [n]})$. Show that maximizing the MAP log-likelihood over all choices of $w$ is the same as minimizing $\sum_{i=1}^{n} (y_i - w \cdot x_i)^2 + \lambda \|w\|_1$, where $\|w\|_1 = \sum_{j=1}^{d} |w_j|$ and $\lambda$ is a constant. (A sketch of this computation follows.)
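
A hedged sketch of the computation for part 2; the specific constant that falls out, $\lambda = 2\sigma^2/b$, depends on the exact density conventions stated above. By Bayes' Theorem and independence,
$$f\bigl(w \mid (x_i, y_i)_{i \in [n]}\bigr) \;\propto\; \Bigl(\prod_{i=1}^{n} \tfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(y_i - w \cdot x_i)^2/(2\sigma^2)}\Bigr)\,\prod_{j=1}^{d} \tfrac{1}{2b}\, e^{-|w_j|/b},$$
so, up to additive constants,
$$\ell(w) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - w \cdot x_i)^2 \;-\; \frac{1}{b}\,\|w\|_1 + \text{const},$$
and maximizing $\ell(w)$ is the same as minimizing $\sum_{i=1}^{n}(y_i - w \cdot x_i)^2 + \frac{2\sigma^2}{b}\,\|w\|_1$, i.e., $\lambda = 2\sigma^2/b$.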

  • $\ell_1$-regularization, $\ell_2$-regularization, and Sparsity

You are given a design matrix $X$ (whose $i$th row is sample point $x_i^T$) and an $n$-vector of labels $y \triangleq [y_1 \; \ldots \; y_n]^T$. For simplicity, assume $X$ is whitened, so $X^T X = nI$. Do not add a fictitious dimension/bias term; for input $0$, the output is always $0$. Let $x_{*i}$ denote the $i$th column of $X$.

1. Show that the cost function for $\ell_1$-regularized least squares, $J_1(w) \triangleq \|Xw - y\|^2 + \lambda\|w\|_1$ (where $\lambda > 0$), can be rewritten as $J_1(w) = \|y\|^2 + \sum_{i=1}^{d} f(x_{*i}, w_i)$, where $f(\cdot, \cdot)$ is a suitable function whose first argument is a vector and second argument is a scalar.

2. Using your solution to question 6.1, derive necessary and sufficient conditions for the $i$th component of the optimizer $w^*$ of $J_1(\cdot)$ to satisfy each of these three properties: $w_i^* > 0$, $w_i^* = 0$, and $w_i^* < 0$.

3. For the optimizer $w^{\#}$ of the $\ell_2$-regularized least squares cost function $J_2(w) \triangleq \|Xw - y\|^2 + \lambda\|w\|^2$ (where $\lambda > 0$), derive a necessary and sufficient condition for $w_i^{\#} = 0$, where $w_i^{\#}$ is the $i$th component of $w^{\#}$.

4. A vector is called sparse if most of its components are 0. From your solutions to questions 6.2 and 6.3, which of $w^*$ and $w^{\#}$ is more likely to be sparse? Why? (A sketch of the relevant conditions follows.)
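
A hedged sketch of where these parts lead, using the whitening assumption $X^T X = nI$ (so the decoupling in part 1 is exact): expanding the squared norm gives
$$J_1(w) = \|y\|^2 + \sum_{i=1}^{d}\Bigl(n w_i^2 - 2 w_i\,(x_{*i} \cdot y) + \lambda |w_i|\Bigr),$$
so each component can be optimized separately. Checking the one-sided derivatives of the $i$th term gives
$$w_i^* > 0 \iff x_{*i} \cdot y > \tfrac{\lambda}{2}, \qquad w_i^* = 0 \iff |x_{*i} \cdot y| \le \tfrac{\lambda}{2}, \qquad w_i^* < 0 \iff x_{*i} \cdot y < -\tfrac{\lambda}{2},$$
while for ridge regression $w_i^{\#} = \dfrac{x_{*i} \cdot y}{n + \lambda}$, which is zero iff $x_{*i} \cdot y = 0$. The $\ell_1$ penalty therefore zeroes out every component whose correlation with $y$ falls below a threshold, which is why $w^*$ tends to be sparse.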

