Intro to Big Data Science: Assignment 6 Solution


Exercise 1

Log in to “cookdata.cn” and enroll in the course “êâ‰Æ Ú”. Finish the online exercise there.

Exercise 2

Recall the definition of information entropy, $H(P) = -\sum_{i=1}^{n} p_i \log p_i$, which represents the maximal information contained in the probability distribution $P$. Let $X$ and $Y$ be two random variables. The entropy $H(X, Y)$ for the joint distribution of $(X, Y)$ is defined similarly. The conditional entropy is defined as:

$$H(X \mid Y) = \sum_{j} P(Y = y_j)\, H(X \mid Y = y_j) = -\sum_{j} P(Y = y_j) \sum_{i} P(X = x_i \mid Y = y_j) \log P(X = x_i \mid Y = y_j).$$

1. Show that $H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$.

2. The mutual information (information gain) is defined as $I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$. Show that if $X$ and $Y$ are independent, then $I(X; Y) = 0$.


3. Define the Kullback–Leibler divergence as $D_{KL}(P \,\|\, Q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}$. Show that $I(X; Y) = D_{KL}\big(p(X, Y) \,\|\, p(X)\,p(Y)\big)$.

4. (Optional) Furthermore, show that $D_{KL}(P \,\|\, Q) \ge 0$ for any $P$ and $Q$ by using Jensen's inequality. As a result, $I(X; Y) \ge 0$. (A quick numerical check of these identities is sketched below.)
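The identities above can be checked numerically. Below is a minimal numpy sketch on a toy 2×3 joint distribution; the table values are arbitrary illustrative numbers (not part of the assignment), and natural logarithms are used, which does not affect the identities.

```python
import numpy as np

# Toy joint distribution p(X, Y): a 2x3 table with arbitrary illustrative values.
pxy = np.array([[0.10, 0.20, 0.10],
                [0.25, 0.15, 0.20]])
px = pxy.sum(axis=1)                      # marginal p(X)
py = pxy.sum(axis=0)                      # marginal p(Y)

def H(p):
    """Entropy -sum p log p of a probability vector (natural log)."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

# Conditional entropy H(X|Y) = sum_j P(Y=y_j) H(X | Y=y_j), straight from the definition.
H_x_given_y = sum(py[j] * H(pxy[:, j] / py[j]) for j in range(len(py)))

# Chain rule: H(X, Y) = H(Y) + H(X|Y).
print("H(X,Y)        =", H(pxy.ravel()))
print("H(Y) + H(X|Y) =", H(py) + H_x_given_y)

# Mutual information equals the KL divergence to the product of marginals.
I_xy = H(px) - H_x_given_y
kl = (pxy * np.log(pxy / np.outer(px, py))).sum()
print("I(X;Y) =", I_xy, "  D_KL(p(X,Y) || p(X)p(Y)) =", kl)
```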

Exercise 3 (EM Algorithm, you may need to carefully read Section 8.5.2 in the book “Elements of Statistical Learning” before solving this problem)

Imagine a class where the probability that a student gets an “A” grade is $P(A) = \tfrac{1}{2}$, a “B” grade is $P(B) = \mu$, a “C” grade is $P(C) = 2\mu$, and a “D” grade is $P(D) = \tfrac{1}{2} - 3\mu$. We are told that $c$ students got a “C” and $d$ students got a “D”. We don't know how many students got exactly an “A” or exactly a “B”, but we do know that $h$ students got either an “A” or a “B”. Let $a$ be the number of students getting an “A” and $b$ the number of students getting a “B”; therefore $a$ and $b$ are unknown parameters with $a + b = h$. Our goal is to use expectation maximization (EM) to obtain a maximum likelihood estimate of $\mu$.

1. Use the Multinoulli (categorical) distribution to compute the log-likelihood function $l(\mu, a, b)$.

2. Expectation step: Given $\hat{\mu}^{(m)}$, compute the expected values $\hat{a}^{(m)}$ and $\hat{b}^{(m)}$ of $a$ and $b$, respectively.

3. Maximization step: Plug $\hat{a}^{(m)}$ and $\hat{b}^{(m)}$ into the log-likelihood function $l(\mu, a, b)$ and calculate the maximum likelihood estimate $\hat{\mu}^{(m+1)}$ of $\mu$, as a function of $\hat{\mu}^{(m)}$.

4. Will iterating between the E-step and the M-step always converge to a local optimum of $\mu$ (which may or may not also be a global optimum)? Briefly explain why. (A numerical sketch of the EM iteration follows this list.)
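For intuition about the E/M updates described above, here is a minimal numpy sketch of the iteration. The counts h, c, d are hypothetical placeholders (the assignment does not fix them), and the M-step is carried out by a dense grid search over the feasible range of μ rather than the closed-form update the exercise asks you to derive.

```python
import numpy as np

# Hypothetical observed counts (not from the assignment): c "C" grades,
# d "D" grades, and h students who got either "A" or "B".
h, c, d = 20, 10, 10

mu = 0.05                                        # initial guess for mu
grid = np.linspace(1e-6, 1/6 - 1e-6, 100000)     # feasible range: 1/2 - 3*mu > 0

for _ in range(50):
    # E-step: split the h A-or-B students in proportion to P(A) = 1/2 and P(B) = mu.
    a_hat = h * 0.5 / (0.5 + mu)
    b_hat = h * mu  / (0.5 + mu)
    # M-step: maximize the expected complete-data log-likelihood over mu
    # (up to an additive constant), here by grid search instead of the
    # closed form the exercise asks you to derive.
    ell = (a_hat * np.log(0.5) + b_hat * np.log(grid)
           + c * np.log(2 * grid) + d * np.log(0.5 - 3 * grid))
    mu = grid[np.argmax(ell)]

print("EM estimate of mu:", mu)
```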

Problem 4 (Spectral Clustering)

1. We consider the 2-clustering problem, in which we have $N$ data points $x_{1:N}$ to be grouped into two clusters, denoted by $A$ and $B$. Given the $N \times N$ affinity matrix $W$ (remember that in class we define the affinity matrix so that the diagonal entries are zero for undirected graphs), consider the following two problems:

Min-cut: minimize $\sum_{i \in A} \sum_{j \in B} W_{ij}$;

Normalized cut: minimize
$$\frac{\sum_{i \in A} \sum_{j \in B} W_{ij}}{\sum_{i \in A} \sum_{j=1}^{N} W_{ij}} + \frac{\sum_{i \in A} \sum_{j \in B} W_{ij}}{\sum_{i \in B} \sum_{j=1}^{N} W_{ij}}.$$

2. The data points are shown in Figure (a) above. The grid unit is 1. Let $W_{ij} = e^{-\|x_i - x_j\|_2^2}$; give the clustering results of min-cut and normalized cut, respectively (please draw a rough sketch and give the separation boundary in the answer book).

3. The data points are shown in Figure (b) above. The grid unit is 1. Let $W_{ij} = e^{-\|x_i - x_j\|_2^2 / (2\sigma^2)}$; describe the clustering results of the min-cut algorithm for $\sigma^2 = 50$ and $\sigma^2 = 0.5$, respectively (please draw a rough sketch and give the separation boundaries for each case of $\sigma^2$ in the answer book).

[Figure: panels (a) and (b) show the two data point configurations referred to above.]

4. Now back to the setting of the 2-clustering problem shown in Figure (a). The grid unit is 1.

a) If we use the Euclidean distance to construct the affinity matrix $W$ as follows:
$$W_{ij} = \begin{cases} 1, & \text{if } \|x_i - x_j\|_2^2 \le \sigma^2; \\ 0, & \text{otherwise,} \end{cases}$$
what $\sigma^2$ value would you choose? Briefly explain.

b) The next step is to compute the first $k = 2$ dominant eigenvectors of the affinity matrix $W$. For the value of $\sigma^2$ you chose in the previous question, can you analytically compute the eigenvalues corresponding to the first two eigenvectors? If yes, compute and report the eigenvalues. If not, briefly explain. (A small numerical sketch related to this computation follows.)
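To build intuition for how σ² shapes the affinity structure, the sketch below constructs the Gaussian affinity matrix of part 3 (with zero diagonal, as defined in class) and inspects its two dominant eigenvectors. The point coordinates are made-up stand-ins, since the actual points are given only in the figures.

```python
import numpy as np

# Hypothetical 2D points standing in for the figure; replace with the real coordinates.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])

sigma2 = 0.5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # ||x_i - x_j||_2^2
W = np.exp(-sq_dists / (2 * sigma2))                        # Gaussian affinity
np.fill_diagonal(W, 0.0)                                    # zero diagonal, as defined in class

# Two dominant eigenvalues/eigenvectors of the symmetric affinity matrix W.
eigvals, eigvecs = np.linalg.eigh(W)                        # ascending order
top2_vals = eigvals[-2:][::-1]
top2_vecs = eigvecs[:, -2:][:, ::-1]
print("two dominant eigenvalues:", top2_vals)

# A crude 2-way split: threshold the second dominant eigenvector at 0
# (one simple heuristic, not the full normalized-cut procedure).
labels = (top2_vecs[:, 1] > 0).astype(int)
print("cluster labels:", labels)
```

Thresholding the second dominant eigenvector is only one heuristic for a 2-way split; the questions about min-cut and normalized cut above should still be answered analytically.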

Exercise 5 (Dimensionality Reduction)

    1. (PCA vs. LDA) Plot the directions of the first PCA component (plot (a)) and the first LDA component (plot (b)) in the following figures, respectively.

    2. (PCA and SVD) Given 6 data points in 5-D space: (1, 1, 1, 0, 0), (-3, -3, -3, 0, 0), (2, 2, 2, 0, 0), (0, 0, 0, -1, -1), (0, 0, 0, 2, 2), (0, 0, 0, -1, -1), we can represent these data points by a $6 \times 5$ matrix $X$, where each row corresponds to a data point:

$$X = \begin{pmatrix} 1 & 1 & 1 & 0 & 0 \\ -3 & -3 & -3 & 0 & 0 \\ 2 & 2 & 2 & 0 & 0 \\ 0 & 0 & 0 & -1 & -1 \\ 0 & 0 & 0 & 2 & 2 \\ 0 & 0 & 0 & -1 & -1 \end{pmatrix}$$

a) What is the sample mean of the data set?

[Figure for part 1: (c) first PCA component, (d) first LDA component.]

b) What is the SVD of the data matrix, $X = U D V^T$, where $U$ and $V$ satisfy $U^T U = V^T V = I_2$? Note that the SVD for this matrix must take the following form, where $a, b, c, d, \sigma_1, \sigma_2$ are the parameters you need to determine.

$$X = \begin{pmatrix} a & 0 \\ -3a & 0 \\ 2a & 0 \\ 0 & b \\ 0 & -2b \\ 0 & b \end{pmatrix} \times \begin{pmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{pmatrix} \times \begin{pmatrix} c & c & c & 0 & 0 \\ 0 & 0 & 0 & d & d \end{pmatrix}$$

c) What is the first principal component for the original data points?

d) If we want to project the original data points $\{x_i\}_{i=1}^{6}$ into 1-D space using the principal component you chose, what is the sample variance of the projected data $\{\hat{x}_i\}_{i=1}^{6}$?

e) For the projected data in d), if we now represent them in the original 5-D space, what is the reconstruction error $\frac{1}{6}\sum_{i=1}^{6}\|x_i - \hat{x}_i\|_2^2$? (A short numerical sketch of parts a)–e) follows.)
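If you want to check your hand computations for parts a)–e), a short numpy sketch follows. Note that whether "sample variance" uses the 1/N or 1/(N−1) convention is an assumption here; the code uses 1/N, so adjust to whichever convention your course adopts.

```python
import numpy as np

# Data matrix from the exercise: each row is one of the six 5-D points.
X = np.array([[ 1,  1,  1,  0,  0],
              [-3, -3, -3,  0,  0],
              [ 2,  2,  2,  0,  0],
              [ 0,  0,  0, -1, -1],
              [ 0,  0,  0,  2,  2],
              [ 0,  0,  0, -1, -1]], dtype=float)

print("sample mean:", X.mean(axis=0))          # part a)

# Thin SVD: X = U @ np.diag(s) @ Vt; only two singular values are nonzero here.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print("singular values:", s)                   # part b)

v1 = Vt[0]                                     # first principal direction (part c)
z = X @ v1                                     # 1-D projection of the data
print("first principal component direction:", v1)
print("sample variance of projection:", z.var())    # part d), 1/N convention

X_hat = np.outer(z, v1)                        # rank-1 reconstruction in 5-D
print("reconstruction error:", ((X - X_hat) ** 2).sum(axis=1).mean())   # part e)
```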

Exercise 6 (PCA as factor analysis and SVD, optional)

PCA of a set of data in $\mathbb{R}^p$ provides a sequence of best linear approximations to those data, of all ranks $q \le p$. Denote the observations by $x_1, x_2, \ldots, x_N$, and consider the rank-$q$ linear model for representing them:
$$f(\lambda) = \mu + V_q \lambda,$$
where $\mu$ is a location vector in $\mathbb{R}^p$, $V_q$ is a $p \times q$ matrix with $q$ orthogonal unit vectors as columns, and $\lambda$ is a $q$-vector of parameters. If we can find such a model, then we can reconstruct each $x_i$ by a low-dimensional coordinate vector $\lambda_i$ through
$$x_i = f(\lambda_i) + \varepsilon_i = \mu + V_q \lambda_i + \varepsilon_i, \tag{1}$$


where $\varepsilon_i \in \mathbb{R}^p$ are noise terms. PCA then amounts to minimizing this reconstruction error by the least squares method:
$$\min_{\mu, \{\lambda_i\}, V_q} \sum_{i=1}^{N} \|x_i - \mu - V_q \lambda_i\|^2.$$

1. Assume $V_q$ is known and treat $\mu$ and $\lambda_i$ as unknowns. Show that the least squares problem
$$\min_{\mu, \{\lambda_i\}} \sum_{i=1}^{N} \|x_i - \mu - V_q \lambda_i\|^2$$
is minimized by
$$\hat{\mu} = \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \tag{2}$$
$$\hat{\lambda}_i = V_q^T (x_i - \bar{x}). \tag{3}$$
Also show that the solution for $\hat{\mu}$ is not unique, and give a family of solutions for $\hat{\mu}$.

2. For the standard solution (2), we are left with solving
$$\min_{V_q} \sum_{i=1}^{N} \|(x_i - \bar{x}) - V_q V_q^T (x_i - \bar{x})\|^2 = \min_{V_q} \operatorname{Tr}\big( \tilde{X} (I_p - V_q V_q^T) \tilde{X}^T \big). \tag{4}$$
Here we introduce the centered sample matrix
$$\tilde{X} = \Big(I_N - \frac{1}{N} J_N\Big) X = \begin{pmatrix} (x_1 - \bar{x})^T \\ \vdots \\ (x_N - \bar{x})^T \end{pmatrix} \in \mathbb{R}^{N \times p},$$

where $I_N$ is the $N \times N$ identity matrix and $J_N$ is the $N \times N$ matrix whose entries are all 1's. Recall the singular value decomposition (SVD) from linear algebra: $\tilde{X} = U D V^T$. Here $U$ is an $N \times p$ orthogonal matrix ($U^T U = I_p$) whose columns $u_j$ are called the left singular vectors; $V$ is a $p \times p$ orthogonal matrix ($V^T V = I_p$) with columns $v_j$ called the right singular vectors; and $D$ is a $p \times p$ diagonal matrix with diagonal elements $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$, known as the singular values.

Show that the solution $V_q$ to problem (4) consists of the first $q$ columns of $V$. (Then the optimal $\hat{\lambda}_i$ is given by the $i$-th row of the first $q$ columns of $UD$.) (See the sketch below.)
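As an informal numerical check of this claim (not a proof), the sketch below compares the reconstruction error attained by the first q right singular vectors of the centered data matrix against a randomly chosen orthonormal $V_q$; the data are random placeholders.

```python
import numpy as np

# Numerical check: among p x q matrices with orthonormal columns, the first q
# right singular vectors of the centered data matrix minimize the reconstruction
# error. Data here are random placeholders, not from the assignment.
rng = np.random.default_rng(0)
N, p, q = 50, 6, 2
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))   # arbitrary data

xbar = X.mean(axis=0)
Xc = X - xbar                                            # centered sample matrix X~

def recon_error(Vq):
    """Sum of squared residuals  sum_i ||(x_i - xbar) - Vq Vq^T (x_i - xbar)||^2."""
    return ((Xc - Xc @ Vq @ Vq.T) ** 2).sum()

# SVD solution: first q right singular vectors.
U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
V_svd = Vt[:q].T
print("error with first q right singular vectors:", recon_error(V_svd))

# Any random orthonormal V_q should do no better.
Q, _ = np.linalg.qr(rng.normal(size=(p, q)))
print("error with a random orthonormal V_q:     ", recon_error(Q))
```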

Remark: The model (1), in general, gives the factor analysis model in multivariate statistics:
$$x = \mu + V_q \lambda + \varepsilon.$$
In traditional factor analysis, the $\lambda_j$ with $j = 1, \ldots, q$ are assumed to be Gaussian and uncorrelated, as are the $\varepsilon_i$ with $i = 1, \ldots, p$. Independent Component Analysis (ICA), by contrast, assumes that the $\lambda_j$ with $j = 1, \ldots, q$ are non-Gaussian and independent. Because of this independence, ICA is particularly useful for separating mixed signals.
