
Course: Machine Learning (CS405) – Professor: Qi Hao

Question 1

Consider a data set in which each data point t_n is associated with a weighting factor r_n > 0, so that the sum-of-squares error function becomes

$$E_D(w) = \frac{1}{2} \sum_{n=1}^{N} r_n \{ t_n - w^{T} \phi(x_n) \}^2.$$

Find an expression for the solution w that minimizes this error function. Give two alternative interpretations of the weighted sum-of-squares error function in terms of (i) data-dependent noise variance and (ii) replicated data points.
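Interpretation (ii) can be checked numerically. The following is a minimal sketch, assuming a single basis function φ(x) = x with no bias term and made-up toy data; the helper name `weighted_least_squares_1d` is hypothetical. With integer weights, fitting the weighted problem should give the same w as fitting an unweighted problem in which each point is repeated r_n times.

```python
def weighted_least_squares_1d(xs, ts, rs):
    """Minimise 0.5 * sum_n r_n * (t_n - w * x_n)**2 over a scalar w.

    Assumes phi(x) = x and no bias, so the stationarity condition
    sum_n r_n * (t_n - w * x_n) * x_n = 0 can be solved for w directly.
    """
    numerator = sum(r * t * x for x, t, r in zip(xs, ts, rs))
    denominator = sum(r * x * x for x, r in zip(xs, rs))
    return numerator / denominator


# Toy data, made up for illustration only.
xs = [1.0, 2.0, 3.0]
ts = [1.1, 1.9, 3.2]
rs = [1, 3, 2]  # integer weighting factors r_n

w_weighted = weighted_least_squares_1d(xs, ts, rs)

# Replicate each point r_n times and fit with unit weights.
xs_rep = [x for x, r in zip(xs, rs) for _ in range(r)]
ts_rep = [t for t, r in zip(ts, rs) for _ in range(r)]
w_replicated = weighted_least_squares_1d(xs_rep, ts_rep, [1] * len(xs_rep))

print(w_weighted, w_replicated)
```

The two printed values should agree up to floating-point rounding. The general multi-feature case asked for in the question is not covered by this scalar sketch.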

Question 2

We saw in Section 2.3.6 that the conjugate prior for a Gaussian distribution with unknown mean and unknown precision (inverse variance) is a normal-gamma distribution. This property also holds for the case of the conditional Gaussian distribution p(t | x, w, β) of the linear regression model. If we consider the likelihood function,

$$p(t \mid X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^{T} \phi(x_n), \beta^{-1})$$

then the conjugate prior for w and b is given by

$$p(w, \beta) = \mathcal{N}(w \mid m_0, \beta^{-1} S_0) \, \mathrm{Gam}(\beta \mid a_0, b_0).$$

Show that the corresponding posterior distribution takes the same functional form, so that

$$p(w, \beta \mid t) = \mathcal{N}(w \mid m_N, \beta^{-1} S_N) \, \mathrm{Gam}(\beta \mid a_N, b_N),$$

and find expressions for the posterior parameters m_N, S_N, a_N, and b_N.

Machine Learning (CS405) – Homework #3

Question 3

Show that the integration over w in the Bayesian linear regression model gives the result

$$\int \exp\{-E(w)\} \, dw = \exp\{-E(m_N)\} \, (2\pi)^{M/2} \, |A|^{-1/2}.$$

Hence show that the log marginal likelihood is given by

$$\ln p(t \mid \alpha, \beta) = \frac{M}{2} \ln \alpha + \frac{N}{2} \ln \beta - E(m_N) - \frac{1}{2} \ln |A| - \frac{N}{2} \ln(2\pi).$$
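The integral identity can be sanity-checked numerically in the scalar case M = 1. This is only a sketch under stated assumptions: E(w) is taken to be exactly quadratic, E(w) = E(m_N) + ½ A (w − m_N)², and the values of m_N, A, and E(m_N) below are arbitrary illustrative numbers, not derived from any dataset.

```python
import math


def energy(w, m_n, a, e_min):
    """Quadratic E(w) = E(m_N) + 0.5 * A * (w - m_N)**2 for scalar w (M = 1)."""
    return e_min + 0.5 * a * (w - m_n) ** 2


def integrate_exp_neg_energy(m_n, a, e_min, half_width=12.0, steps=20000):
    """Trapezoidal approximation of the integral of exp(-E(w)) over w.

    The range covers half_width standard deviations (1 / sqrt(A)) on each side.
    """
    scale = 1.0 / math.sqrt(a)
    lo = m_n - half_width * scale
    hi = m_n + half_width * scale
    h = (hi - lo) / steps
    total = 0.5 * (math.exp(-energy(lo, m_n, a, e_min))
                   + math.exp(-energy(hi, m_n, a, e_min)))
    for k in range(1, steps):
        total += math.exp(-energy(lo + k * h, m_n, a, e_min))
    return total * h


m_n, a, e_min = 0.7, 2.5, 1.3  # arbitrary illustrative values
numeric = integrate_exp_neg_energy(m_n, a, e_min)
closed_form = math.exp(-e_min) * (2 * math.pi) ** 0.5 * a ** -0.5
print(numeric, closed_form)
```

A numerical match in one dimension is a check, not a proof; the question still asks for the general M-dimensional argument.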

Question 4

Consider real-valued variables X and Y. The Y variable is generated, conditional on X, from the following process:

$$\varepsilon \sim \mathcal{N}(0, \sigma^2)$$

$$Y = aX + \varepsilon$$

where every ε is an independent variable, called a noise term, which is drawn from a Gaussian distribution with mean 0 and standard deviation σ. This is a one-feature linear regression model, where a is the only weight parameter. The conditional probability of Y has distribution p(Y | X, a) ~ N(aX, σ²), so it can be written as

$$p(Y \mid X, a) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y - aX)^2 \right)$$

Assume we have a training dataset of n pairs (X_i, Y_i) for i = 1…n, and σ is known. Derive the maximum likelihood estimate of the parameter a in terms of the training examples X_i's and Y_i's. We recommend you start with the simplest form of the problem:

$$F(a) = \frac{1}{2} \sum_i (Y_i - aX_i)^2$$
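A derived estimate can be checked against a numerical minimiser of F(a). The sketch below uses plain gradient descent on made-up data; the function names, step size, and iteration count are assumptions chosen for this toy example, not part of the assignment.

```python
def objective(a, xs, ys):
    """F(a) = 0.5 * sum_i (Y_i - a * X_i)**2."""
    return 0.5 * sum((y - a * x) ** 2 for x, y in zip(xs, ys))


def minimise_objective(xs, ys, lr=0.01, steps=2000):
    """Plain gradient descent on F(a), starting from a = 0."""
    a = 0.0
    for _ in range(steps):
        # dF/da = -sum_i (Y_i - a * X_i) * X_i
        grad = -sum((y - a * x) * x for x, y in zip(xs, ys))
        a -= lr * grad
    return a


# Toy data, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

a_hat = minimise_objective(xs, ys)
print(a_hat, objective(a_hat, xs, ys))
```

The step size must stay below 2 divided by the sum of the X_i² for this iteration to converge, which holds for the toy data above. Plugging the data into a closed-form answer should reproduce the printed a_hat.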

Question 5

If a data point y follows the Poisson distribution with rate parameter θ, then the probability of a single observation y is

$$p(y \mid \theta) = \frac{\theta^{y} e^{-\theta}}{y!}, \quad \text{for } y = 0, 1, 2, \ldots$$

You are given data points y₁, . . . , y_n independently drawn from a Poisson distribution with parameter θ. Write down the log-likelihood of the data as a function of θ.
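For checking a hand-written log-likelihood, it can help to evaluate the sum of log-probabilities directly. A minimal sketch, assuming made-up count data and a hypothetical helper name `poisson_log_likelihood`; `math.lgamma(y + 1)` is used for ln(y!).

```python
import math


def poisson_log_likelihood(theta, ys):
    """Sum over i of ln p(y_i | theta) = y_i * ln(theta) - theta - ln(y_i!)."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    return sum(y * math.log(theta) - theta - math.lgamma(y + 1) for y in ys)


ys = [2, 0, 3, 1, 4]  # made-up counts
for theta in (1.0, 2.0, 3.0):
    print(theta, poisson_log_likelihood(theta, ys))
```

Evaluating the function on a grid of θ values gives a quick picture of where the log-likelihood peaks for a given sample.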


Question 6

Suppose you are given n observations, X₁, . . . , X_n, independent and identically distributed with a Gamma(α, λ) distribution. The following information might be useful for the problem.

(a) If X ~ Gamma(α, λ), then E[X] = α/λ and E[X²] = α(α+1)/λ².

(b) The probability density function of X ~ Gamma(α, λ) is

$$f_X(x) = \frac{1}{\Gamma(\alpha)} \lambda^{\alpha} x^{\alpha - 1} e^{-\lambda x},$$

where the function Γ is only dependent on α and not λ.

Suppose we are given a known, fixed value for α. Compute the maximum likelihood estimator for λ.
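A derived estimator can be compared with a brute-force maximiser of the log-likelihood. The sketch below assumes the rate parameterisation of the Gamma density (consistent with the moments given in (a)), made-up positive observations, and a coarse grid over λ; the helper name and grid spacing are illustrative choices.

```python
import math


def gamma_log_likelihood(lam, xs, alpha):
    """Log-likelihood of i.i.d. Gamma(alpha, lam) data, lam being the rate.

    Per-observation density assumed:
    lam**alpha * x**(alpha - 1) * exp(-lam * x) / Gamma(alpha).
    """
    n = len(xs)
    return (n * alpha * math.log(lam)
            + (alpha - 1) * sum(math.log(x) for x in xs)
            - lam * sum(xs)
            - n * math.lgamma(alpha))


xs = [0.8, 1.5, 2.2, 0.9, 1.6]  # made-up positive observations
alpha = 3.0                      # known, fixed shape parameter

# Coarse grid search over lam in (0, 10] with spacing 0.01.
grid = [0.01 * k for k in range(1, 1001)]
best_lam = max(grid, key=lambda lam: gamma_log_likelihood(lam, xs, alpha))
print(best_lam)
```

The grid maximiser is only accurate to the grid spacing, so it should match a closed-form estimator to about two decimal places.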


Program Question

In this question, we will use logistic regression to solve a binary classification problem. Given some information about a house, such as its area and the number of living rooms, would it be expensive? We would like to predict 1 if it is expensive, and 0 otherwise. We will use the hw3_house_sales.zip dataset.

We will first implement it with the Python scikit-learn package, and then implement it by updating the weights with gradient descent. We will derive the gradient formula, and use stochastic gradient descent and AdaGrad to calculate the weights.

(a) Logistic regression with Scikit. Fill in the logisticRegressionScikit() function using the Scikit toolbox.

Report the weights and prediction accuracy here in your submitted file.
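As a rough illustration of the scikit-learn calls involved, the sketch below fits `LogisticRegression` on a tiny hand-made dataset. The feature columns (area and number of living rooms) and all values are hypothetical stand-ins; the real column layout of hw3_house_sales.zip is not shown here, and the default solver and regularisation settings are an assumption rather than a requirement of the assignment.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data: [area in 1000 sq ft, number of living rooms].
X_train = [[0.8, 1], [1.0, 1], [1.2, 2], [1.5, 2],
           [2.8, 3], [3.0, 3], [3.4, 4], [4.0, 4]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = expensive, 0 = not expensive

clf = LogisticRegression()  # default settings (L2-regularised)
clf.fit(X_train, y_train)

weights = clf.coef_[0]                   # one weight per feature
bias = clf.intercept_[0]                 # intercept term
accuracy = clf.score(X_train, y_train)   # mean accuracy on the given data
print(weights, bias, accuracy)
```

`coef_`, `intercept_`, and `score` are the attributes and method that expose the weights and accuracy the question asks to report.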

(b) Gradient derivation. Assume a sigmoid is applied to a linear function of the input features:

$$h_w(x) = \frac{1}{1 + e^{-w^{T} x}}$$

Assume also that P(y = 1 | x; w) = h_w(x) and P(y = 0 | x; w) = 1 − h_w(x). Calculate the maximum likelihood estimation L(w) = P(Y | X; w), then formulate the stochastic gradient ascent rule. Please write out the log-likelihood, calculate the derivative, and write out the update formula step by step.

(c) Logistic regression with simple gradient descent. Fill in the LogisticRegressionSGD() function. To do that, two helper functions, sigmoid_activation(), to calculate the sigmoid function result, and model_optimize(), to calculate the gradient of w, will be needed. Both helper functions can be used in the following AdaGrad optimization function. Use a learning rate of 10⁻⁴ and run for 2000 iterations. Keep track of the accuracy every 100 iterations on the training set (no need to report). It will be used later.

Report the weights, training accuracy, and test accuracy here in your submitted file. Your final score will depend on correct sigmoid_activation(), model_optimize(), and LogisticRegressionSGD() functions.
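One possible shape for these three functions is sketched below in plain Python. The function names come from the assignment, but the signatures, the per-example update, the random sampling, and the toy data are all assumptions; the starter code for the homework may define them differently.

```python
import math
import random


def sigmoid_activation(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def model_optimize(w, x, y):
    """Gradient of one example's log-likelihood w.r.t. w: (y - h_w(x)) * x."""
    h = sigmoid_activation(sum(wj * xj for wj, xj in zip(w, x)))
    return [(y - h) * xj for xj in x]


def accuracy(w, X, Y):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = 0
    for x, y in zip(X, Y):
        p = sigmoid_activation(sum(wj * xj for wj, xj in zip(w, x)))
        hits += int((1 if p >= 0.5 else 0) == y)
    return hits / len(Y)


def LogisticRegressionSGD(X, Y, lr=1e-4, iterations=2000, seed=0):
    """Stochastic gradient ascent on the log-likelihood, one example per step."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    history = []  # training accuracy recorded every 100 iterations
    for it in range(1, iterations + 1):
        i = rng.randrange(len(X))
        grad = model_optimize(w, X[i], Y[i])
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
        if it % 100 == 0:
            history.append(accuracy(w, X, Y))
    return w, history


# Toy data: first column is a constant 1.0 acting as the bias feature.
X = [[1.0, -1.5], [1.0, -0.5], [1.0, 0.5], [1.0, 1.5]]
Y = [0, 0, 1, 1]
w, history = LogisticRegressionSGD(X, Y)
print(w, history[-1])
```

With a learning rate of 10⁻⁴ the weights move slowly, so feature scaling is likely to matter on the real dataset.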

(d) Logistic regression with AdaGrad. Fill in the LogisticRegressionAda() function. Use a learning rate of 10⁻⁴ and run for 2000 iterations. Keep track of the accuracy every 100 iterations on the training set (no need to report). It will be used later.

Report weights, training accuracy and test accuracy here in your submitted file.
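A minimal AdaGrad sketch follows, again in plain Python. The function name is taken from the assignment; the signature, the cyclic pass over examples, the small constant `eps`, and the toy data are assumptions made for illustration. Each weight gets its own effective step size, the base learning rate divided by the root of that weight's accumulated squared gradients.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def LogisticRegressionAda(X, Y, lr=1e-4, iterations=2000, eps=1e-8):
    """AdaGrad ascent on the log-likelihood, one example per iteration.

    The signature and the cyclic example order are assumptions for this sketch.
    """
    n_features = len(X[0])
    w = [0.0] * n_features
    grad_sq_sum = [0.0] * n_features  # running sum of squared gradients per weight
    for it in range(iterations):
        x, y = X[it % len(X)], Y[it % len(X)]
        h = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        for j in range(n_features):
            g = (y - h) * x[j]  # gradient of the log-likelihood for weight j
            grad_sq_sum[j] += g * g
            # Per-weight step: lr scaled by the root of the accumulated squares.
            w[j] += lr * g / (math.sqrt(grad_sq_sum[j]) + eps)
    return w


# Toy data: first column is a constant 1.0 acting as the bias feature.
X = [[1.0, -1.5], [1.0, -0.5], [1.0, 0.5], [1.0, 1.5]]
Y = [0, 0, 1, 1]
w_ada = LogisticRegressionAda(X, Y)
print(w_ada)
```

Because of the normalisation, the very first update to each weight has magnitude close to the learning rate regardless of the raw gradient size, which is one reason AdaGrad and plain SGD can converge at different speeds with the same nominal rate.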

(e) Comparison of Scikit, SGD and AdaGrad convergence. Plot the loss function of SGD and AdaGrad over 2000 iterations on both the training and test data. What do you observe? Which one has better accuracy on the test dataset? Why might that be the case?

Reference. The datasets and questions are from website and University of Pennsylvania.