Intro to Big Data Science: Assignment 4 Solution


Exercise 1

Log into "cookdata.cn", enroll in the corresponding course, and finish the online exercise there.

Exercise 2

The soft-margin support vector classifier (SVC) solves the following optimization problem:

$$
\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i, \quad
\text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \dots, n \tag{1}
$$
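For parts 1 and 2 below, a natural starting point is the Lagrangian of problem (1). The following is a minimal sketch; the multiplier $\alpha_i$ matches the notation used in the questions, while the symbol $\mu_i$ for the multiplier of $\xi_i \ge 0$ is a notational assumption:

$$
L(w, b, \xi, \alpha, \mu)
= \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i
- \sum_{i=1}^{n}\alpha_i\bigl[y_i(w^T x_i + b) - 1 + \xi_i\bigr]
- \sum_{i=1}^{n}\mu_i \xi_i,
\qquad \alpha_i \ge 0,\ \mu_i \ge 0.
$$

Setting $\partial L/\partial w$, $\partial L/\partial b$, and $\partial L/\partial \xi_i$ to zero gives the stationarity conditions used in part 1, and substituting them back into $L$ yields the dual objective asked for in part 2.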

1. Show that the KKT conditions are

$$
\begin{cases}
\alpha_i \ge 0, \\
y_i(w^T x_i + b) - 1 + \xi_i \ge 0, \\
\alpha_i\,\bigl[y_i(w^T x_i + b) - 1 + \xi_i\bigr] = 0, \\
\mu_i \ge 0, \\
\xi_i \ge 0, \\
\mu_i\,\xi_i = 0, \\
\sum_{i=1}^{n} \alpha_i y_i = 0, \\
w = \sum_{i=1}^{n} \alpha_i y_i x_i, \\
\alpha_i + \mu_i = C,
\end{cases}
$$

where $\alpha_i$ and $\mu_i$ are the Lagrange multipliers for the constraints $y_i(w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, respectively.

2. Show that the dual optimization problem is

$$
\begin{aligned}
\min_{\alpha}\quad & \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, x_i^T x_j - \sum_{i=1}^{n} \alpha_i, \\
\text{s.t.}\quad & 0 \le \alpha_i \le C, \quad i = 1, \dots, n, \\
& \sum_{i=1}^{n} \alpha_i y_i = 0.
\end{aligned}
$$

3. Properties of kernels:

   (a) Using the definition of kernel functions in SVM, prove that the kernel $K(x_i, x_j)$ is symmetric, where $x_i$ and $x_j$ are the feature vectors of the $i$-th and $j$-th examples.

   (b) Given $n$ training examples $x_i$, $i = 1, \dots, n$, the kernel matrix $A$ is an $n \times n$ square matrix, where $A(i, j) = K(x_i, x_j)$. Prove that the kernel matrix $A$ is positive semi-definite. (A numerical sanity check is sketched below.)
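The following is a small numerical sanity check for part 3(b), not a proof: it builds a Gaussian (RBF) kernel matrix on random data with NumPy and inspects its eigenvalues, which should be non-negative up to floating-point error. The choice of the RBF kernel, the bandwidth `gamma`, and the data sizes are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=0.5):
    """Kernel matrix A with A[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 examples, 3 features
A = rbf_kernel_matrix(X)

print("symmetric:", np.allclose(A, A.T))
# Eigenvalues of a positive semi-definite matrix are >= 0 (up to numerical noise).
eigvals = np.linalg.eigvalsh(A)
print("min eigenvalue:", eigvals.min())
```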

Exercise 3 (Linear Classifiers) We can also use a linear function $f_w(x) = w^T x$ for classification. The idea is as follows: if $f_w(x) > 0$, we assign 1 to the label $y$; if $f_w(x) < 0$, we assign $-1$. This can be regarded as 0/1-loss minimization:

$$
\min_{w}\ \sum_{i=1}^{n} \frac{1}{2}\bigl(1 - y_i\,\mathrm{sign}(f_w(x_i))\bigr).
$$

1. Given a two-class data set $\{(x_i, y_i)\}_{i=1}^{n}$, we assume that there is a vector $w$ satisfying $y_i\,\mathrm{sign}(f_w(x_i)) > 0$ for $i = 1, \dots, n$. Show that the 0/1-loss minimization can be formulated as a linear programming problem:

$$
\min_{w}\ 0, \quad \text{subject to } Aw \ge \mathbf{1},
$$

where $A_{ij} = y_i x_{ij}$, $\mathbf{1} = (1, \dots, 1)^T \in \mathbb{R}^{n}$, and the objective is a dummy, which means we do not have to minimize it.

2. Another way to solve the 0/1-loss minimization is to replace it by the $\ell_2$-loss (this is sometimes also called a surrogate loss):

$$
\min_{w}\ \sum_{i=1}^{n} \bigl(1 - y_i f_w(x_i)\bigr)^2 = \min_{w}\ \sum_{i=1}^{n} \bigl(y_i - f_w(x_i)\bigr)^2.
$$

Please give the analytical formula of the solution.

3. So far we have introduced two loss functions: $L_{0/1}(y, f) = \frac{1}{2}\bigl(1 - y\,\mathrm{sign}\, f\bigr)$ and $L_2(y, f) = (1 - y f)^2$. Show that the SVM can also be written as a loss minimization problem with the hinge loss function $L(y, f) = [1 - y f]_{+} = \max\{1 - y f,\, 0\}$ (the positive part of the function $1 - y f$). Please also plot these three loss functions in the same figure and check their differences (a plotting sketch is given below).
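A minimal plotting sketch for part 3, assuming the three loss functions above are written as functions of the margin $m = y f(x)$; the plotting range and styling are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Margin m = y * f(x); each loss is expressed as a function of the margin.
m = np.linspace(-2.0, 3.0, 500)

zero_one = 0.5 * (1.0 - np.sign(m))      # L_{0/1}(y, f) = (1/2)(1 - y sign f)
squared  = (1.0 - m) ** 2                # L_2(y, f)     = (1 - y f)^2
hinge    = np.maximum(1.0 - m, 0.0)      # hinge loss    = max{1 - y f, 0}

plt.figure(figsize=(6, 4))
plt.plot(m, zero_one, label="0/1 loss")
plt.plot(m, squared, label="squared loss")
plt.plot(m, hinge, label="hinge loss")
plt.xlabel("margin  y f(x)")
plt.ylabel("loss")
plt.ylim(-0.2, 4.0)
plt.legend()
plt.show()
```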

Exercise 4 (Logistic Regression)

We consider the following models of logistic regression for binary classification with the sigmoid function $g(z) = \frac{1}{1 + e^{-z}}$.

Model 1: $P(Y = 1 \mid X, w_1, w_2) = g(w_1 X_1 + w_2 X_2)$;

Model 2: $P(Y = 1 \mid X, w_0, w_1, w_2) = g(w_0 + w_1 X_1 + w_2 X_2)$. We have three training examples:

$$
x^{(1)} = (1, 1)^T, \quad x^{(2)} = (1, 0)^T, \quad x^{(3)} = (0, 0)^T;
\qquad
y^{(1)} = 1, \quad y^{(2)} = -1, \quad y^{(3)} = 1.
$$

1. Does it matter how the third example is labeled in Model 1? That is, would the learned value of $w = (w_1, w_2)$ be different if we changed the label of the third example to $-1$? Does it matter in Model 2? Briefly explain your answer. (Hint: think of the decision boundary on the 2D plane. A small numerical experiment is sketched after this exercise.)

2. Now suppose we train the logistic regression model (Model 2) on the $n$ training examples $x^{(1)}, \dots, x^{(n)}$ and labels $y^{(1)}, \dots, y^{(n)}$ by maximizing the penalized log-likelihood of the labels:

$$
\sum_{i} \log P\bigl(y^{(i)} \mid x^{(i)}, w\bigr) - \frac{\lambda}{2}\,\|w\|^2
= \sum_{i} \log g\bigl(y^{(i)} w^T x^{(i)}\bigr) - \frac{\lambda}{2}\,\|w\|^2
$$

For large $\lambda$ (strong regularization), the log-likelihood terms will behave as linear functions of $w$:

$$
\log g\bigl(y^{(i)} w^T x^{(i)}\bigr) \approx \frac{1}{2}\, y^{(i)} w^T x^{(i)}.
$$

Express the penalized log-likelihood using this approximation (with Model 1), and derive the expression for the MLE $\hat{w}$ in terms of $\lambda$ and the training data $\{x^{(i)}, y^{(i)}\}$. Based on this, explain how $w$ behaves as $\lambda$ increases. (We assume each $x^{(i)} = (x_1^{(i)}, x_2^{(i)})^T$ and $y^{(i)}$ is either $1$ or $-1$.)
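The following is a minimal numerical sketch related to part 1 (not the required written explanation): it fits essentially unregularized logistic regression on the three examples with and without an intercept, flips the label of the third example, and prints the learned weights so the effect of the bias term can be inspected. It assumes scikit-learn's LogisticRegression; the large `C` value stands in for "no regularization" and is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 0.0]])
y_orig    = np.array([1, -1, 1])
y_flipped = np.array([1, -1, -1])   # third label changed to -1

for name, fit_intercept in [("Model 1 (no intercept)", False),
                            ("Model 2 (with intercept)", True)]:
    for label_name, y in [("original labels", y_orig),
                          ("flipped third label", y_flipped)]:
        clf = LogisticRegression(fit_intercept=fit_intercept, C=1e6, max_iter=10000)
        clf.fit(X, y)
        print(f"{name}, {label_name}: w = {clf.coef_.ravel()}, b = {clf.intercept_}")
```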

Exercise 5 (Back propagation in neural network) In a neural network, we have one layer of inputs $x = \{x_i\}$, several hidden layers of hidden units $\{(z_j^{(l)}, a_j^{(l)})\}$, and a final layer of outputs $y$. Let $w_{ij}^{(l)}$ be the weight connecting unit $j$ in layer $l$ to unit $i$ in layer $l+1$, let $z_i^{(l)}$ and $a_i^{(l)}$ be the input and output of unit $i$ in layer $l$ before and after activation, respectively, and let $b_i^{(l)}$ be the bias (intercept) of unit $i$ in layer $l+1$. For an $L$-layer network with an input $x$ and an output $y$, the forward propagation is defined by the weighted sum and the nonlinear activation $f$:

$$
z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}, \qquad a^{(l+1)} = f\bigl(z^{(l+1)}\bigr), \qquad \text{for } l = 1, \dots, L-1,
$$
$$
a^{(1)} = x, \qquad h_{W,b}(x) = a^{(L)}.
$$

We use the squared error as our loss function:
$$
J(W, b; x, y) = \frac{1}{2}\,\bigl\|h_{W,b}(x) - y\bigr\|^2,
$$

then the penalized sample mean of the loss is

$$
J(W, b) = \frac{1}{n}\sum_{i=1}^{n} J\bigl(W, b; x^{(i)}, y^{(i)}\bigr)
+ \frac{\lambda}{2}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \bigl(w_{ji}^{(l)}\bigr)^2.
$$
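A minimal NumPy sketch of the forward pass and the penalized objective above; the sigmoid activation, the layer sizes, and the value of `lam` (standing for $\lambda$) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b):
    """Forward pass: a^(1) = x, z^(l+1) = W^(l) a^(l) + b^(l), a^(l+1) = f(z^(l+1))."""
    a = [x]
    z = [None]                     # z^(1) is unused; kept to align indices
    for Wl, bl in zip(W, b):
        z.append(Wl @ a[-1] + bl)
        a.append(sigmoid(z[-1]))
    return z, a

def penalized_cost(data, W, b, lam):
    """J(W,b) = (1/n) sum_i (1/2)||h(x_i) - y_i||^2 + (lam/2) * sum of squared weights."""
    n = len(data)
    data_term = sum(0.5 * np.sum((forward(x, W, b)[1][-1] - y) ** 2)
                    for x, y in data) / n
    reg_term = 0.5 * lam * sum(np.sum(Wl ** 2) for Wl in W)
    return data_term + reg_term

# Toy network: 3 inputs -> 4 hidden units -> 2 outputs (sizes are arbitrary).
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
W = [rng.normal(scale=0.5, size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
b = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]
data = [(rng.normal(size=3), rng.normal(size=2)) for _ in range(5)]
print("J(W, b) =", penalized_cost(data, W, b, lam=0.1))
```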

1. In order to optimize the parameters $W$ and $b$, we need to use the gradient descent method to update their values:

$$
w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \alpha\, \frac{\partial}{\partial w_{ij}^{(l)}} J(W, b),
\qquad
b_{i}^{(l)} \leftarrow b_{i}^{(l)} - \alpha\, \frac{\partial}{\partial b_{i}^{(l)}} J(W, b).
$$

The key point is to compute the partial derivatives

$$
\frac{\partial}{\partial w_{ij}^{(l)}} J(W, b; x, y)
\quad \text{and} \quad
\frac{\partial}{\partial b_{i}^{(l)}} J(W, b; x, y).
$$

Show that these two partial derivatives can be written in terms of the residual

$$
\delta_i^{(l+1)} = \frac{\partial}{\partial z_i^{(l+1)}} J(W, b; x, y):
$$
$$
\frac{\partial}{\partial w_{ij}^{(l)}} J(W, b; x, y) = a_j^{(l)}\, \delta_i^{(l+1)}
\quad \text{and} \quad
\frac{\partial}{\partial b_{i}^{(l)}} J(W, b; x, y) = \delta_i^{(l+1)}.
$$

2. Show that the residuals can be updated according to the following backward rule (a NumPy sketch of the forward and backward passes is given after this exercise):

$$
\delta_i^{(L)} = -\bigl(y_i - a_i^{(L)}\bigr)\, f'\bigl(z_i^{(L)}\bigr),
\quad \text{and} \quad
\delta_i^{(l)} = \Bigl(\sum_{j=1}^{s_{l+1}} w_{ji}^{(l)}\, \delta_j^{(l+1)}\Bigr) f'\bigl(z_i^{(l)}\bigr),
\quad \text{for } l = L-1, \dots, 2.
$$
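In the same spirit as the earlier sketch, here is a self-contained NumPy sketch of the residuals and gradients above for a single example, together with a finite-difference check of one weight derivative; the sigmoid activation and the layer sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(x, W, b):
    a, z = [x], [None]             # z[0] unused; kept to align indices with a
    for Wl, bl in zip(W, b):
        z.append(Wl @ a[-1] + bl)
        a.append(sigmoid(z[-1]))
    return z, a

def backprop(x, y, W, b):
    """Residuals delta and gradients of J = (1/2)||h(x) - y||^2 for one example."""
    z, a = forward(x, W, b)
    n_layers = len(a)
    delta = [None] * n_layers
    delta[-1] = -(y - a[-1]) * sigmoid_prime(z[-1])          # output-layer residual
    for l in range(n_layers - 2, 0, -1):                     # delta = (W^T delta_next) * f'(z)
        delta[l] = (W[l].T @ delta[l + 1]) * sigmoid_prime(z[l])
    grad_W = [np.outer(delta[l + 1], a[l]) for l in range(n_layers - 1)]  # dJ/dw_ij = a_j delta_i
    grad_b = [delta[l + 1] for l in range(n_layers - 1)]                  # dJ/db_i = delta_i
    return grad_W, grad_b

def loss(x, y, W, b):
    return 0.5 * np.sum((forward(x, W, b)[1][-1] - y) ** 2)

# Toy network and a finite-difference check on one weight (sizes are arbitrary).
rng = np.random.default_rng(1)
sizes = [3, 4, 2]
W = [rng.normal(scale=0.5, size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
b = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]
x, y = rng.normal(size=3), rng.normal(size=2)

grad_W, _ = backprop(x, y, W, b)
eps = 1e-6
W[0][1, 2] += eps; J_plus = loss(x, y, W, b)
W[0][1, 2] -= 2 * eps; J_minus = loss(x, y, W, b)
W[0][1, 2] += eps
print("backprop:", grad_W[0][1, 2], " finite difference:", (J_plus - J_minus) / (2 * eps))
```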

