HW4 Solution

UCB CS 189, Spring 2020. All Rights Reserved. This may not be publicly shared without explicit permission.
  • Logistic Regression with Newton’s Method

Given examples $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ and associated labels $y_1, y_2, \ldots, y_n \in \{0, 1\}$, the cost function for unregularized logistic regression is

$$J(w) \triangleq -\sum_{i=1}^{n} \Bigl( y_i \ln s_i + (1 - y_i) \ln(1 - s_i) \Bigr)$$

where $s_i \triangleq s(x_i \cdot w)$, $w \in \mathbb{R}^d$ is a weight vector, and $s(\gamma) \triangleq 1/(1 + e^{-\gamma})$ is the logistic function.

Define the $n \times d$ design matrix $X$ (whose $i$th row is $x_i^T$), the label $n$-vector $y \triangleq [y_1 \; \ldots \; y_n]^T$, and $s \triangleq [s_1 \; \ldots \; s_n]^T$. For an $n$-vector $a$, let $\ln a \triangleq [\ln a_1 \; \ldots \; \ln a_n]^T$. The cost function can be rewritten in vector form as $J(w) = -y \cdot \ln s - (1 - y) \cdot \ln(1 - s)$.

Hint: Recall the matrix calculus identities $\nabla_x (y \cdot z) = (\nabla_x y)\, z + (\nabla_x z)\, y$; $\nabla_x f(y) = (\nabla_x y)\,(\nabla_y f(y))$; $\nabla_x\, g(y(x)) = (\nabla_x\, y(x))\,(\nabla_{y(x)}\, g)$; and $\nabla_x (Cx) = C^T$ (where $C$ is a constant matrix).
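
For reference while checking your own derivation, here is a hedged sketch of where these identities lead when applied to the vector-form cost above, using $s'(\gamma) = s(\gamma)\,(1 - s(\gamma))$; the shorthand $\Omega$ for the diagonal matrix is my own notation and is not taken from the handout:

$$\nabla_w J(w) = X^{T}\bigl(s - y\bigr), \qquad \nabla^2_w J(w) = X^{T}\,\Omega\, X, \qquad \Omega \triangleq \operatorname{diag}\bigl(s_i(1 - s_i)\bigr),$$

so one Newton iteration takes the form

$$w^{(t+1)} = w^{(t)} - \bigl(X^{T}\Omega X\bigr)^{-1} X^{T}\bigl(s - y\bigr),$$

with $s$ and $\Omega$ evaluated at $w^{(t)}$.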

1. Derive the gradient $\nabla_w J(w)$ of the cost $J(w)$ as a matrix-vector expression. Also derive all intermediate derivatives in matrix-vector form. Do NOT specify them in terms of their individual components.

2. Derive the Hessian $\nabla^2_w J(w)$ for the cost function $J(w)$ as a matrix-vector expression.

3. Write the matrix-vector update law for one iteration of Newton's method, substituting the gradient and Hessian of $J(w)$.

4. You are given four examples $x_1 = [0.2 \;\; 3.1]^T$, $x_2 = [1.0 \;\; 3.0]^T$, $x_3 = [-0.2 \;\; 1.2]^T$, $x_4 = [1.0 \;\; 1.1]^T$ with labels $y_1 = 1$, $y_2 = 1$, $y_3 = 0$, $y_4 = 0$. These points cannot be separated by a line passing through the origin. Hence, as described in lecture, append a 1 to each $x_{i \in [4]}$ and use a weight vector $w \in \mathbb{R}^3$ whose last component is the bias term (called $\alpha$ in lecture). Begin with initial weight $w^{(0)} = [\,1 \;\; 1 \;\; 0\,]^T$. For the following, state only the final answer with four digits after the decimal point. You may use a calculator or write a program to solve for these (a sketch of such a program appears after part (d) below), but do NOT submit any code for this part.

(a) State the value of $s^{(0)}$ (the initial value of $s$).


(b) State the value of $w^{(1)}$ (the value of $w$ after one iteration).

(c) State the value of $s^{(1)}$ (the value of $s$ after one iteration).

(d) State the value of $w^{(2)}$ (the value of $w$ after two iterations).
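
A minimal sketch of such a program for checking your four-decimal answers, assuming NumPy and the gradient/Hessian expressions sketched above. The data values and the sign of $x_3$ follow my reading of the problem statement above; double-check them against the original handout before relying on the printed numbers.

```python
import numpy as np

# Four training points with the fictitious (bias) feature appended as the last component.
X = np.array([[ 0.2, 3.1, 1.0],
              [ 1.0, 3.0, 1.0],
              [-0.2, 1.2, 1.0],
              [ 1.0, 1.1, 1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = np.array([1.0, 1.0, 0.0])   # w^(0); adjust if your reading of the handout differs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for t in range(2):                            # two Newton iterations: w^(1), w^(2)
    s = sigmoid(X @ w)                        # s^(t)
    grad = X.T @ (s - y)                      # gradient of J at w^(t)
    hess = X.T @ np.diag(s * (1 - s)) @ X     # Hessian of J at w^(t)
    w = w - np.linalg.solve(hess, grad)       # Newton step
    print(f"s^({t})   =", np.round(s, 4))
    print(f"w^({t+1}) =", np.round(w, 4))
```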

  • Wine Classification with Logistic Regression

The wine dataset data.mat consists of 6,497 sample points, each having 12 features. The description of these features is provided in data.mat. The dataset includes a training set of 6,000 sample points and a test set of 497 sample points. Your classifier needs to predict whether a wine is white (class label 0) or red (class label 1).

Begin by normalizing each feature and adding a fictitious dimension. Whenever hyperparameter values are required, it is recommended that you tune them with cross-validation.

Use of automatic logistic regression libraries/packages is prohibited for this question. If you are coding in Python, use scipy.special.expit to evaluate the logistic function; it is numerically stable and does not produce NaN or overflow exceptions.
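
A minimal preprocessing sketch, assuming the raw feature matrices have already been pulled out of data.mat into NumPy arrays; the function and variable names are my own, not part of the assignment:

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic function

def preprocess(X_train, X_test):
    """Normalize each feature using training statistics, then append a fictitious 1."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0                      # guard against constant features
    Z_train = (X_train - mu) / sigma
    Z_test = (X_test - mu) / sigma
    add_ones = lambda Z: np.hstack([Z, np.ones((Z.shape[0], 1))])
    return add_ones(Z_train), add_ones(Z_test)

# expit(X @ w) then plays the role of s(Xw) in the cost and gradient.
```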

1. Batch Gradient Descent Update. State the batch gradient descent update law for logistic regression with $\ell_2$ regularization. As this is a “batch” algorithm, each iteration should use every training example. You don’t have to show your derivation. You may reuse results from your solution to question 4.1.

2. Batch Gradient Descent Code. Choose reasonable values for the regularization parameter and step size (learning rate), specify your chosen values, and train your model from question 3.1. Plot the value of the cost function versus the number of iterations spent in training.
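
A minimal training-loop sketch under the update $w \leftarrow w - \epsilon\bigl(2\lambda w + X^{T}(s(Xw) - y)\bigr)$. The symbols $\epsilon$ and $\lambda$, the regularization convention $\lambda\|w\|^2$, and all names below are my own choices, not mandated by the problem:

```python
import numpy as np
from scipy.special import expit

def batch_gd(X, y, lam=0.1, eps=1e-3, iters=5000):
    """Batch gradient descent for l2-regularized logistic regression.

    Assumed cost: J(w) = lam*||w||^2 - y.ln(s) - (1-y).ln(1-s), with s = expit(X w).
    Returns the weights and the cost at every iteration (for plotting).
    """
    n, d = X.shape
    w = np.zeros(d)
    costs = []
    tiny = 1e-12                                   # avoid log(0) in the cost
    for _ in range(iters):
        s = expit(X @ w)
        grad = 2 * lam * w + X.T @ (s - y)         # full-batch gradient
        w -= eps * grad
        # For simplicity the fictitious-dimension weight is regularized too;
        # you may prefer to exclude it.
        cost = lam * (w @ w) - y @ np.log(s + tiny) - (1 - y) @ np.log(1 - s + tiny)
        costs.append(cost)
    return w, costs
```

Plotting the returned costs against the iteration index (e.g., with matplotlib) then produces the required figure.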

3. Stochastic Gradient Descent (SGD) Update. State the SGD update law for logistic regression with $\ell_2$ regularization. Since this is not a “batch” algorithm anymore, each iteration uses just one training example. You don’t have to show your derivation.

4. Stochastic Gradient Descent Code. Choose a suitable value for the step size (learning rate), specify your chosen value, and run your SGD algorithm from question 3.3. Plot the value of the cost function versus the number of iterations spent in training.

Compare your plot here with that of question 3.2. Which method converges more quickly? Briefly describe what you observe.
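
A minimal SGD sketch under the same assumed cost as the batch sketch above; scaling the regularizer by $1/n$ in the single-example gradient is a convention I chose so that $n$ SGD steps roughly match one batch step, not something the problem dictates:

```python
import numpy as np
from scipy.special import expit

def sgd(X, y, lam=0.1, eps=1e-3, iters=20000, seed=0):
    """SGD for l2-regularized logistic regression: one random example per step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    costs = []
    tiny = 1e-12
    for _ in range(iters):
        i = rng.integers(n)
        s_i = expit(X[i] @ w)
        grad = 2 * lam * w / n + (s_i - y[i]) * X[i]   # single-example gradient
        w -= eps * grad
        s = expit(X @ w)                                # full cost, recorded only for plotting
        costs.append(lam * (w @ w) - y @ np.log(s + tiny)
                     - (1 - y) @ np.log(1 - s + tiny))
    return w, costs
```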

5. Instead of using a constant step size (learning rate) in SGD, you could use a step size that slowly shrinks from iteration to iteration. Run your SGD algorithm from question 3.3 with a step size $\epsilon_t = \delta/t$, where $t$ is the iteration number and $\delta$ is a hyperparameter you select empirically. Mention the value of $\delta$ chosen. Plot the value of the cost function versus the number of iterations spent in training.

How does this compare to the convergence of your previous SGD code?
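
A minimal variant of the SGD sketch above with the decaying step size; the name delta mirrors the hyperparameter $\delta$ I used in restating the question and is otherwise arbitrary:

```python
import numpy as np
from scipy.special import expit

def sgd_decaying(X, y, lam=0.1, delta=1.0, iters=20000, seed=0):
    """Same as sgd() above, but with step size delta/t at (1-indexed) iteration t."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    costs = []
    tiny = 1e-12
    for t in range(1, iters + 1):
        i = rng.integers(n)
        s_i = expit(X[i] @ w)
        grad = 2 * lam * w / n + (s_i - y[i]) * X[i]
        w -= (delta / t) * grad                 # shrinking step size
        s = expit(X @ w)
        costs.append(lam * (w @ w) - y @ np.log(s + tiny)
                     - (1 - y) @ np.log(1 - s + tiny))
    return w, costs
```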

6. Kaggle. Train your best classifier on the entire training set and submit your predictions on the test sample points to Kaggle. As always for Kaggle competitions, you are welcome to add or remove features, tweak the algorithm, and do pretty much anything you want to improve your Kaggle leaderboard performance, except that you may not replace logistic regression with a wholly different learning algorithm. Your code should output the predicted labels in a CSV file.

Report your Kaggle username and your best score, and briefly describe what your best classifier does to achieve that score.
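
A minimal sketch of the CSV output step; the column names and 1-based ids below are guesses, so match them to the competition's sample submission file:

```python
from scipy.special import expit

def write_predictions(X_test, w, path="predictions.csv"):
    """Write predicted labels (0 = white, 1 = red) in a simple Id,Category layout."""
    labels = (expit(X_test @ w) >= 0.5).astype(int)
    with open(path, "w") as f:
        f.write("Id,Category\n")                 # header is an assumption
        for i, label in enumerate(labels):
            f.write(f"{i + 1},{label}\n")
```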

  • Convergence of Batch Gradient Descent in Logistic Regression

In this problem, you will prove that batch gradient descent converges to a unique optimizer of the $\ell_2$-regularized logistic regression cost function.

Given sample points $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ and associated labels $y_1, y_2, \ldots, y_n \in \{0, 1\}$, define the design matrix $X$ (whose $i$th row is $x_i^T$), the label $n$-vector $y \triangleq [y_1 \; \ldots \; y_n]^T$, and $s(Xw) \triangleq [s_1 \; \ldots \; s_n]^T$ containing the values $s_{i \in [n]} \triangleq 1/(1 + e^{-x_i \cdot w})$. For any vector $a$, let $\ln a \triangleq [\ln a_1 \; \ldots \; \ln a_n]^T$. The cost function for $\ell_2$-regularized logistic regression is

$$J(w) \triangleq \lambda\,\|w\|_2^2 \;-\; y \cdot \ln s(Xw) \;-\; (1 - y) \cdot \ln\bigl(1 - s(Xw)\bigr)$$

where $\lambda > 0$ is your choice of the regularization parameter.
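
For reference in the parts below, differentiating this cost (under the regularizer convention written above, and reusing the unregularized gradient and Hessian sketched in the Newton's-method problem) gives

$$\nabla_w J(w) = 2\lambda w + X^{T}\bigl(s(Xw) - y\bigr), \qquad \nabla^2_w J(w) = 2\lambda I + X^{T}\,\Omega\, X, \qquad \Omega \triangleq \operatorname{diag}\bigl(s_i(1 - s_i)\bigr).$$

This is a sketch under the stated convention; if your convention is $\tfrac{\lambda}{2}\|w\|^2$, the factor of 2 on the regularization terms disappears.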

1. Let $w^{(t)}$ denote the value of $w$ at iteration $t$. The initial, arbitrary weight vector is $w^{(0)}$. State the gradient descent update rule for calculating the value of $w^{(t+1)}$ as a function $g(w^{(t)})$ of the previous weight vector $w^{(t)}$, with a constant step size (learning rate) $\epsilon > 0$.

2. Show that $J(\cdot)$ is strictly convex and $J(w)$ has a unique minimizer $w^*$.

Hint: $f(x)$ is strictly convex if its Hessian $\nabla^2_x f$ is positive definite everywhere.

3. Next, show that if the step size (learning rate) $\epsilon$ is a sufficiently small constant, then the update function $g(\cdot)$ is a contraction; i.e., there exists a constant $\rho \in (0, 1)$ such that for every two points $w_1, w_2$, $\|g(w_1) - g(w_2)\| \le \rho\,\|w_1 - w_2\|$. (A sketch of this argument appears after this problem's parts.)

Hint: The Mean Value Theorem and the Cauchy–Schwarz inequality might both help.

4. Finally, complete your proof by showing that if the step size is chosen as required in question 4.3, the weight converges to the unique minimizer; that is, $\lim_{t \to \infty} w^{(t)} = w^*$.

5. You can refine your proof and guarantee quicker convergence by tightening the contraction in question 4.3. Show that for a clever choice of $\epsilon$, which may depend on $X$ and $\lambda$, but crucially not on the weights $w^{(0)}$ and $w^*$, you can guarantee that
$$\|w^{(t)} - w^*\| \;\propto\; \exp\!\left(-\frac{8\lambda\, t}{\,8\lambda + \sum_i \|x_i\|^2\,}\right).$$
Argue that this is also the best exponential rate of convergence one can guarantee when using constant learning rates.

6. If we set $\lambda = 0$, we have the unregularized logistic regression problem. Now that your proof is complete, do you see why the condition $\lambda > 0$ is necessary? What are the reasons that your proof won't be valid anymore if you choose $\lambda = 0$?
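
For part 3, here is a hedged sketch of the standard argument under the gradient and Hessian stated after the cost definition above (the problem's intended route via the Mean Value Theorem reaches the same bound). Since $s_i(1 - s_i) \le \tfrac14$,
$$2\lambda I \;\preceq\; \nabla^2_w J(w) \;\preceq\; \Bigl(2\lambda + \tfrac14 \textstyle\sum_i \|x_i\|^2\Bigr) I \;=\; L\,I \quad \text{for all } w.$$
Writing $g(w) = w - \epsilon \nabla J(w)$ and using
$$g(w_1) - g(w_2) = \int_0^1 \Bigl(I - \epsilon\,\nabla^2 J\bigl(w_2 + \tau(w_1 - w_2)\bigr)\Bigr)\,d\tau\;(w_1 - w_2),$$
any constant step size $0 < \epsilon \le 1/L$ gives $\|g(w_1) - g(w_2)\| \le (1 - 2\lambda\epsilon)\,\|w_1 - w_2\|$, i.e., a contraction with $\rho = 1 - 2\lambda\epsilon \in (0, 1)$. Iterating from $w^{(0)}$ and using $g(w^*) = w^*$ yields $\|w^{(t)} - w^*\| \le \rho^{\,t}\,\|w^{(0)} - w^*\|$, and the particular choice $\epsilon = 1/L$ gives
$$\rho^{\,t} = \Bigl(1 - \tfrac{2\lambda}{L}\Bigr)^{t} \le \exp\!\Bigl(-\tfrac{8\lambda\, t}{8\lambda + \sum_i \|x_i\|^2}\Bigr),$$
matching the rate quoted in part 5. If your regularizer convention is $\tfrac{\lambda}{2}\|w\|^2$, replace $2\lambda$ by $\lambda$ throughout.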


  • A Bayesian Interpretation of Lasso

Suppose you are aware that the labels $y_{i \in [n]}$ corresponding to sample points $x_{i \in [n]} \in \mathbb{R}^d$ follow the density law

$$f(y \mid x, w) \triangleq \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(y - w \cdot x)^2/(2\sigma^2)},$$

where $\sigma > 0$ is a known constant and $w \in \mathbb{R}^d$ is a random parameter. Suppose further that experts have told you that

  • each component of $w$ is independent of the others, and

  • each component of $w$ has the Laplace distribution with location 0 and scale being a known constant $b$; that is, each component $w_i$ obeys the density law $f(w_i) = e^{-|w_i|/b}/(2b)$.

Assume the outputs $y_{i \in [n]}$ are independent of each other.

Your goal is to find the choice of parameter $w$ that is most likely given the input-output examples $(x_i, y_i)_{i \in [n]}$. This method of estimating parameters is called maximum a posteriori (MAP); Latin for “maximum [odds] from what follows.”

1. Derive the posterior probability density law $f(w \mid (x_i, y_i)_{i \in [n]})$ for $w$, up to a proportionality constant, by applying Bayes' Theorem and using the densities $f(y_i \mid x_i, w)$ and $f(w)$. Don't try to derive an exact expression for $f(w \mid (x_i, y_i)_{i \in [n]})$, as it is very involved.

2. Define the log-likelihood for MAP as $\ell(w) \triangleq \ln f(w \mid x_{i \in [n]}, y_{i \in [n]})$. Show that maximizing the MAP log-likelihood over all choices of $w$ is the same as minimizing $\sum_{i=1}^{n} (y_i - w \cdot x_i)^2 + \lambda \|w\|_1$, where $\|w\|_1 = \sum_{j=1}^{d} |w_j|$ and $\lambda$ is a constant. (A sketch of this computation follows.)
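
A hedged sketch of the computation for part 2; the specific constant that falls out, $\lambda = 2\sigma^2/b$, depends on the exact density conventions stated above. By Bayes' Theorem and independence,
$$f\bigl(w \mid (x_i, y_i)_{i \in [n]}\bigr) \;\propto\; \Bigl(\prod_{i=1}^{n} \tfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(y_i - w \cdot x_i)^2/(2\sigma^2)}\Bigr)\,\prod_{j=1}^{d} \tfrac{1}{2b}\, e^{-|w_j|/b},$$
so, up to additive constants,
$$\ell(w) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - w \cdot x_i)^2 \;-\; \frac{1}{b}\,\|w\|_1 + \text{const},$$
and maximizing $\ell(w)$ is the same as minimizing $\sum_{i=1}^{n}(y_i - w \cdot x_i)^2 + \frac{2\sigma^2}{b}\,\|w\|_1$, i.e., $\lambda = 2\sigma^2/b$.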

  • $\ell_1$-regularization, $\ell_2$-regularization, and Sparsity

You are given a design matrix $X$ (whose $i$th row is sample point $x_i^T$) and an $n$-vector of labels $y \triangleq [y_1 \; \ldots \; y_n]^T$. For simplicity, assume $X$ is whitened, so $X^T X = nI$. Do not add a fictitious dimension/bias term; for input $0$, the output is always $0$. Let $x_{*i}$ denote the $i$th column of $X$.

1. Show that the cost function for $\ell_1$-regularized least squares, $J_1(w) \triangleq \|Xw - y\|^2 + \lambda\|w\|_1$ (where $\lambda > 0$), can be rewritten as $J_1(w) = \|y\|^2 + \sum_{i=1}^{d} f(x_{*i}, w_i)$, where $f(\cdot, \cdot)$ is a suitable function whose first argument is a vector and second argument is a scalar.

2. Using your solution to question 6.1, derive necessary and sufficient conditions for the $i$th component of the optimizer $w^*$ of $J_1(\cdot)$ to satisfy each of these three properties: $w_i^* > 0$, $w_i^* = 0$, and $w_i^* < 0$.

3. For the optimizer $w^{\#}$ of the $\ell_2$-regularized least squares cost function $J_2(w) \triangleq \|Xw - y\|^2 + \lambda\|w\|^2$ (where $\lambda > 0$), derive a necessary and sufficient condition for $w_i^{\#} = 0$, where $w_i^{\#}$ is the $i$th component of $w^{\#}$.

4. A vector is called sparse if most of its components are 0. From your solutions to questions 6.2 and 6.3, which of $w^*$ and $w^{\#}$ is more likely to be sparse? Why? (A sketch of the relevant conditions follows.)
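
A hedged sketch of where these parts lead, using the whitening assumption $X^T X = nI$ (so the decoupling in part 1 is exact): expanding the squared norm gives
$$J_1(w) = \|y\|^2 + \sum_{i=1}^{d}\Bigl(n w_i^2 - 2 w_i\,(x_{*i} \cdot y) + \lambda |w_i|\Bigr),$$
so each component can be optimized separately. Checking the one-sided derivatives of the $i$th term gives
$$w_i^* > 0 \iff x_{*i} \cdot y > \tfrac{\lambda}{2}, \qquad w_i^* = 0 \iff |x_{*i} \cdot y| \le \tfrac{\lambda}{2}, \qquad w_i^* < 0 \iff x_{*i} \cdot y < -\tfrac{\lambda}{2},$$
while for ridge regression $w_i^{\#} = \dfrac{x_{*i} \cdot y}{n + \lambda}$, which is zero iff $x_{*i} \cdot y = 0$. The $\ell_1$ penalty therefore zeroes out every component whose correlation with $y$ falls below a threshold, which is why $w^*$ tends to be sparse.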

