Intro to Big Data Science: Assignment 4 Solution


Exercise 1

Log into "cookdata.cn", enroll in the corresponding course, and finish the online exercise there.

Exercise 2

The soft-margin support vector classifier (SVC) solves the following optimization problem:

$$
\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i, \quad
\text{s.t. } y_i(w^T x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \dots, n \tag{1}
$$
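For parts 1 and 2 below, a natural starting point is the Lagrangian of problem (1). The following is a minimal sketch; the multiplier $\alpha_i$ matches the notation used in the questions, while the symbol $\mu_i$ for the multiplier of $\xi_i \ge 0$ is a notational assumption:

$$
L(w, b, \xi, \alpha, \mu)
= \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i
- \sum_{i=1}^{n}\alpha_i\bigl[y_i(w^T x_i + b) - 1 + \xi_i\bigr]
- \sum_{i=1}^{n}\mu_i \xi_i,
\qquad \alpha_i \ge 0,\ \mu_i \ge 0.
$$

Setting $\partial L/\partial w$, $\partial L/\partial b$, and $\partial L/\partial \xi_i$ to zero gives the stationarity conditions used in part 1, and substituting them back into $L$ yields the dual objective asked for in part 2.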

1. Show that the KKT conditions are

$$
\begin{cases}
\alpha_i \ge 0, \\
y_i(w^T x_i + b) - 1 + \xi_i \ge 0, \\
\alpha_i\,\bigl[y_i(w^T x_i + b) - 1 + \xi_i\bigr] = 0, \\
\mu_i \ge 0, \\
\xi_i \ge 0, \\
\mu_i\,\xi_i = 0, \\
\sum_{i=1}^{n} \alpha_i y_i = 0, \\
w = \sum_{i=1}^{n} \alpha_i y_i x_i, \\
\alpha_i + \mu_i = C,
\end{cases}
$$

where $\alpha_i$ and $\mu_i$ are the Lagrange multipliers for the constraints $y_i(w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, respectively.

2. Show that the dual optimization problem is

$$
\begin{aligned}
\min_{\alpha}\quad & \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, x_i^T x_j - \sum_{i=1}^{n} \alpha_i, \\
\text{s.t.}\quad & 0 \le \alpha_i \le C, \quad i = 1, \dots, n, \\
& \sum_{i=1}^{n} \alpha_i y_i = 0.
\end{aligned}
$$

3. Properties of kernels:

   (a) Using the definition of kernel functions in SVM, prove that the kernel $K(x_i, x_j)$ is symmetric, where $x_i$ and $x_j$ are the feature vectors of the $i$-th and $j$-th examples.

   (b) Given $n$ training examples $x_i$, $i = 1, \dots, n$, the kernel matrix $A$ is an $n \times n$ square matrix, where $A(i, j) = K(x_i, x_j)$. Prove that the kernel matrix $A$ is positive semi-definite. (A numerical sanity check is sketched below.)
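The following is a small numerical sanity check for part 3(b), not a proof: it builds a Gaussian (RBF) kernel matrix on random data with NumPy and inspects its eigenvalues, which should be non-negative up to floating-point error. The choice of the RBF kernel, the bandwidth `gamma`, and the data sizes are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=0.5):
    """Kernel matrix A with A[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 examples, 3 features
A = rbf_kernel_matrix(X)

print("symmetric:", np.allclose(A, A.T))
# Eigenvalues of a positive semi-definite matrix are >= 0 (up to numerical noise).
eigvals = np.linalg.eigvalsh(A)
print("min eigenvalue:", eigvals.min())
```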

Exercise 3 (Linear Classifiers) We can also use a linear function $f_w(x) = w^T x$ for classification. The idea is as follows: if $f_w(x) > 0$, we assign 1 to the label $y$; if $f_w(x) < 0$, we assign $-1$. This can be regarded as 0/1-loss minimization:

$$
\min_{w}\ \sum_{i=1}^{n} \frac{1}{2}\bigl(1 - y_i\,\mathrm{sign}(f_w(x_i))\bigr).
$$

1. Given a two-class data set $\{(x_i, y_i)\}_{i=1}^{n}$, we assume that there is a vector $w$ satisfying $y_i\,\mathrm{sign}(f_w(x_i)) > 0$ for $i = 1, \dots, n$. Show that the 0/1-loss minimization can be formulated as a linear programming problem:

$$
\min_{w}\ 0, \quad \text{subject to } Aw \ge \mathbf{1},
$$

where $A_{ij} = y_i x_{ij}$, $\mathbf{1} = (1, \dots, 1)^T \in \mathbb{R}^{n}$, and the objective is a dummy, which means we do not have to minimize it.

2. Another way to solve the 0/1-loss minimization is to replace it by the $\ell_2$-loss (this is sometimes also called a surrogate loss):

$$
\min_{w}\ \sum_{i=1}^{n} \bigl(1 - y_i f_w(x_i)\bigr)^2 = \min_{w}\ \sum_{i=1}^{n} \bigl(y_i - f_w(x_i)\bigr)^2.
$$

Please give the analytical formula of the solution.

3. So far we have introduced two loss functions: $L_{0/1}(y, f) = \frac{1}{2}\bigl(1 - y\,\mathrm{sign}\, f\bigr)$ and $L_2(y, f) = (1 - y f)^2$. Show that the SVM can also be written as a loss minimization problem with the hinge loss function $L(y, f) = [1 - y f]_{+} = \max\{1 - y f,\, 0\}$ (the positive part of the function $1 - y f$). Please also plot these three loss functions in the same figure and check their differences (a plotting sketch is given below).
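A minimal plotting sketch for part 3, assuming the three loss functions above are written as functions of the margin $m = y f(x)$; the plotting range and styling are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Margin m = y * f(x); each loss is expressed as a function of the margin.
m = np.linspace(-2.0, 3.0, 500)

zero_one = 0.5 * (1.0 - np.sign(m))      # L_{0/1}(y, f) = (1/2)(1 - y sign f)
squared  = (1.0 - m) ** 2                # L_2(y, f)     = (1 - y f)^2
hinge    = np.maximum(1.0 - m, 0.0)      # hinge loss    = max{1 - y f, 0}

plt.figure(figsize=(6, 4))
plt.plot(m, zero_one, label="0/1 loss")
plt.plot(m, squared, label="squared loss")
plt.plot(m, hinge, label="hinge loss")
plt.xlabel("margin  y f(x)")
plt.ylabel("loss")
plt.ylim(-0.2, 4.0)
plt.legend()
plt.show()
```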

Exercise 4 (Logistic Regression)

We consider the following models of logistic regression for binary classification with the sigmoid function $g(z) = \frac{1}{1 + e^{-z}}$.

Model 1: $P(Y = 1 \mid X, w_1, w_2) = g(w_1 X_1 + w_2 X_2)$;

Model 2: $P(Y = 1 \mid X, w_0, w_1, w_2) = g(w_0 + w_1 X_1 + w_2 X_2)$. We have three training examples:

$$
x^{(1)} = (1, 1)^T, \quad x^{(2)} = (1, 0)^T, \quad x^{(3)} = (0, 0)^T;
\qquad
y^{(1)} = 1, \quad y^{(2)} = -1, \quad y^{(3)} = 1.
$$

1. Does it matter how the third example is labeled in Model 1? That is, would the learned value of $w = (w_1, w_2)$ be different if we changed the label of the third example to $-1$? Does it matter in Model 2? Briefly explain your answer. (Hint: think of the decision boundary on the 2D plane. A small numerical experiment is sketched after this exercise.)

2. Now suppose we train the logistic regression model (Model 2) on the $n$ training examples $x^{(1)}, \dots, x^{(n)}$ and labels $y^{(1)}, \dots, y^{(n)}$ by maximizing the penalized log-likelihood of the labels:

$$
\sum_{i} \log P\bigl(y^{(i)} \mid x^{(i)}, w\bigr) - \frac{\lambda}{2}\,\|w\|^2
= \sum_{i} \log g\bigl(y^{(i)} w^T x^{(i)}\bigr) - \frac{\lambda}{2}\,\|w\|^2
$$

For large $\lambda$ (strong regularization), the log-likelihood terms will behave as linear functions of $w$:

$$
\log g\bigl(y^{(i)} w^T x^{(i)}\bigr) \approx \frac{1}{2}\, y^{(i)} w^T x^{(i)}.
$$

Express the penalized log-likelihood using this approximation (with Model 1), and derive the expression for the MLE $\hat{w}$ in terms of $\lambda$ and the training data $\{x^{(i)}, y^{(i)}\}$. Based on this, explain how $w$ behaves as $\lambda$ increases. (We assume each $x^{(i)} = (x_1^{(i)}, x_2^{(i)})^T$ and $y^{(i)}$ is either $1$ or $-1$.)
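The following is a minimal numerical sketch related to part 1 (not the required written explanation): it fits essentially unregularized logistic regression on the three examples with and without an intercept, flips the label of the third example, and prints the learned weights so the effect of the bias term can be inspected. It assumes scikit-learn's LogisticRegression; the large `C` value stands in for "no regularization" and is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 0.0]])
y_orig    = np.array([1, -1, 1])
y_flipped = np.array([1, -1, -1])   # third label changed to -1

for name, fit_intercept in [("Model 1 (no intercept)", False),
                            ("Model 2 (with intercept)", True)]:
    for label_name, y in [("original labels", y_orig),
                          ("flipped third label", y_flipped)]:
        clf = LogisticRegression(fit_intercept=fit_intercept, C=1e6, max_iter=10000)
        clf.fit(X, y)
        print(f"{name}, {label_name}: w = {clf.coef_.ravel()}, b = {clf.intercept_}")
```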

Exercise 5 (Back propagation in neural network) In a neural network, we have one layer of inputs $x = \{x_i\}$, several hidden layers of hidden units $\{(z_j^{(l)}, a_j^{(l)})\}$, and a final layer of outputs $y$. Let $w_{ij}^{(l)}$ be the weight connecting unit $j$ in layer $l$ to unit $i$ in layer $l+1$, let $z_i^{(l)}$ and $a_i^{(l)}$ be the input and output of unit $i$ in layer $l$ before and after activation, respectively, and let $b_i^{(l)}$ be the bias (intercept) of unit $i$ in layer $l+1$. For an $L$-layer network with an input $x$ and an output $y$, the forward propagation is defined by the weighted sum and the nonlinear activation $f$:

$$
z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}, \qquad a^{(l+1)} = f\bigl(z^{(l+1)}\bigr), \qquad \text{for } l = 1, \dots, L-1,
$$
$$
a^{(1)} = x, \qquad h_{W,b}(x) = a^{(L)}.
$$

We use the squared error as our loss function:
$$
J(W, b; x, y) = \frac{1}{2}\,\bigl\|h_{W,b}(x) - y\bigr\|^2,
$$

then the penalized sample mean of the loss is

$$
J(W, b) = \frac{1}{n}\sum_{i=1}^{n} J\bigl(W, b; x^{(i)}, y^{(i)}\bigr)
+ \frac{\lambda}{2}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \bigl(w_{ji}^{(l)}\bigr)^2.
$$
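A minimal NumPy sketch of the forward pass and the penalized objective above; the sigmoid activation, the layer sizes, and the value of `lam` (standing for $\lambda$) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b):
    """Forward pass: a^(1) = x, z^(l+1) = W^(l) a^(l) + b^(l), a^(l+1) = f(z^(l+1))."""
    a = [x]
    z = [None]                     # z^(1) is unused; kept to align indices
    for Wl, bl in zip(W, b):
        z.append(Wl @ a[-1] + bl)
        a.append(sigmoid(z[-1]))
    return z, a

def penalized_cost(data, W, b, lam):
    """J(W,b) = (1/n) sum_i (1/2)||h(x_i) - y_i||^2 + (lam/2) * sum of squared weights."""
    n = len(data)
    data_term = sum(0.5 * np.sum((forward(x, W, b)[1][-1] - y) ** 2)
                    for x, y in data) / n
    reg_term = 0.5 * lam * sum(np.sum(Wl ** 2) for Wl in W)
    return data_term + reg_term

# Toy network: 3 inputs -> 4 hidden units -> 2 outputs (sizes are arbitrary).
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
W = [rng.normal(scale=0.5, size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
b = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]
data = [(rng.normal(size=3), rng.normal(size=2)) for _ in range(5)]
print("J(W, b) =", penalized_cost(data, W, b, lam=0.1))
```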

1. In order to optimize the parameters $W$ and $b$, we need to use the gradient descent method to update their values:

$$
w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \alpha\, \frac{\partial}{\partial w_{ij}^{(l)}} J(W, b),
\qquad
b_{i}^{(l)} \leftarrow b_{i}^{(l)} - \alpha\, \frac{\partial}{\partial b_{i}^{(l)}} J(W, b).
$$

The key point is to compute the partial derivatives

$$
\frac{\partial}{\partial w_{ij}^{(l)}} J(W, b; x, y)
\quad \text{and} \quad
\frac{\partial}{\partial b_{i}^{(l)}} J(W, b; x, y).
$$

Show that these two partial derivatives can be written in terms of the residual

$$
\delta_i^{(l+1)} = \frac{\partial}{\partial z_i^{(l+1)}} J(W, b; x, y):
$$
$$
\frac{\partial}{\partial w_{ij}^{(l)}} J(W, b; x, y) = a_j^{(l)}\, \delta_i^{(l+1)}
\quad \text{and} \quad
\frac{\partial}{\partial b_{i}^{(l)}} J(W, b; x, y) = \delta_i^{(l+1)}.
$$

2. Show that the residuals can be updated according to the following backward rule (a NumPy sketch of the forward and backward passes is given after this exercise):

$$
\delta_i^{(L)} = -\bigl(y_i - a_i^{(L)}\bigr)\, f'\bigl(z_i^{(L)}\bigr),
\quad \text{and} \quad
\delta_i^{(l)} = \Bigl(\sum_{j=1}^{s_{l+1}} w_{ji}^{(l)}\, \delta_j^{(l+1)}\Bigr) f'\bigl(z_i^{(l)}\bigr),
\quad \text{for } l = L-1, \dots, 2.
$$
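In the same spirit as the earlier sketch, here is a self-contained NumPy sketch of the residuals and gradients above for a single example, together with a finite-difference check of one weight derivative; the sigmoid activation and the layer sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(x, W, b):
    a, z = [x], [None]             # z[0] unused; kept to align indices with a
    for Wl, bl in zip(W, b):
        z.append(Wl @ a[-1] + bl)
        a.append(sigmoid(z[-1]))
    return z, a

def backprop(x, y, W, b):
    """Residuals delta and gradients of J = (1/2)||h(x) - y||^2 for one example."""
    z, a = forward(x, W, b)
    n_layers = len(a)
    delta = [None] * n_layers
    delta[-1] = -(y - a[-1]) * sigmoid_prime(z[-1])          # output-layer residual
    for l in range(n_layers - 2, 0, -1):                     # delta = (W^T delta_next) * f'(z)
        delta[l] = (W[l].T @ delta[l + 1]) * sigmoid_prime(z[l])
    grad_W = [np.outer(delta[l + 1], a[l]) for l in range(n_layers - 1)]  # dJ/dw_ij = a_j delta_i
    grad_b = [delta[l + 1] for l in range(n_layers - 1)]                  # dJ/db_i = delta_i
    return grad_W, grad_b

def loss(x, y, W, b):
    return 0.5 * np.sum((forward(x, W, b)[1][-1] - y) ** 2)

# Toy network and a finite-difference check on one weight (sizes are arbitrary).
rng = np.random.default_rng(1)
sizes = [3, 4, 2]
W = [rng.normal(scale=0.5, size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
b = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]
x, y = rng.normal(size=3), rng.normal(size=2)

grad_W, _ = backprop(x, y, W, b)
eps = 1e-6
W[0][1, 2] += eps; J_plus = loss(x, y, W, b)
W[0][1, 2] -= 2 * eps; J_minus = loss(x, y, W, b)
W[0][1, 2] += eps
print("backprop:", grad_W[0][1, 2], " finite difference:", (J_plus - J_minus) / (2 * eps))
```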

