Intro to Big Data Science: Assignment 6 Solution


Exercise 1

Log in to “cookdata.cn” and enroll in the course “êâ‰Æ Ú”. Finish the online exercise there.

Exercise 2

Recall the definition of information entropy, $H(P) = -\sum_{i=1}^{n} p_i \log p_i$, which represents the maximal information contained in the probability distribution $P$. Let $X$ and $Y$ be two random variables. The entropy $H(X, Y)$ for the joint distribution of $(X, Y)$ is defined similarly. The conditional entropy is defined as:

$$H(X \mid Y) = \sum_{j} P(Y = y_j)\, H(X \mid Y = y_j) = -\sum_{j} P(Y = y_j) \sum_{i} P(X = x_i \mid Y = y_j) \log P(X = x_i \mid Y = y_j).$$

1. Show that $H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$.

2. The mutual information (information gain) is defined as $I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$. Show that if $X$ and $Y$ are independent, then $I(X; Y) = 0$.


3. Define the Kullback–Leibler divergence as $D_{KL}(P \,\|\, Q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}$. Show that $I(X; Y) = D_{KL}\big(p(X, Y) \,\|\, p(X)\,p(Y)\big)$.

4. (Optional) Furthermore, show that $D_{KL}(P \,\|\, Q) \ge 0$ for any $P$ and $Q$ by using Jensen's inequality. As a result, $I(X; Y) \ge 0$. (A quick numerical check of these identities is sketched below.)
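The identities above can be checked numerically. Below is a minimal numpy sketch on a toy 2×3 joint distribution; the table values are arbitrary illustrative numbers (not part of the assignment), and natural logarithms are used, which does not affect the identities.

```python
import numpy as np

# Toy joint distribution p(X, Y): a 2x3 table with arbitrary illustrative values.
pxy = np.array([[0.10, 0.20, 0.10],
                [0.25, 0.15, 0.20]])
px = pxy.sum(axis=1)                      # marginal p(X)
py = pxy.sum(axis=0)                      # marginal p(Y)

def H(p):
    """Entropy -sum p log p of a probability vector (natural log)."""
    p = p[p > 0]
    return -(p * np.log(p)).sum()

# Conditional entropy H(X|Y) = sum_j P(Y=y_j) H(X | Y=y_j), straight from the definition.
H_x_given_y = sum(py[j] * H(pxy[:, j] / py[j]) for j in range(len(py)))

# Chain rule: H(X, Y) = H(Y) + H(X|Y).
print("H(X,Y)        =", H(pxy.ravel()))
print("H(Y) + H(X|Y) =", H(py) + H_x_given_y)

# Mutual information equals the KL divergence to the product of marginals.
I_xy = H(px) - H_x_given_y
kl = (pxy * np.log(pxy / np.outer(px, py))).sum()
print("I(X;Y) =", I_xy, "  D_KL(p(X,Y) || p(X)p(Y)) =", kl)
```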

Exercise 3 (EM Algorithm, you may need to carefully read Section 8.5.2 in the book “Elements of Statistical Learning” before solving this problem)

Imagine a class where the probability that a student gets an “A” grade is $P(A) = \tfrac{1}{2}$, a “B” grade is $P(B) = \mu$, a “C” grade is $P(C) = 2\mu$, and a “D” grade is $P(D) = \tfrac{1}{2} - 3\mu$. We are told that $c$ students got a “C” and $d$ students got a “D”. We don't know how many students got exactly an “A” or exactly a “B”, but we do know that $h$ students got either an “A” or a “B”. Let $a$ be the number of students getting an “A” and $b$ the number of students getting a “B”; therefore $a$ and $b$ are unknown parameters with $a + b = h$. Our goal is to use expectation maximization (EM) to obtain a maximum likelihood estimate of $\mu$.

1. Use the Multinoulli (categorical) distribution to compute the log-likelihood function $l(\mu, a, b)$.

2. Expectation step: Given $\hat{\mu}^{(m)}$, compute the expected values $\hat{a}^{(m)}$ and $\hat{b}^{(m)}$ of $a$ and $b$, respectively.

3. Maximization step: Plug $\hat{a}^{(m)}$ and $\hat{b}^{(m)}$ into the log-likelihood function $l(\mu, a, b)$ and calculate the maximum likelihood estimate $\hat{\mu}^{(m+1)}$ of $\mu$, as a function of $\hat{\mu}^{(m)}$.

4. Will iterating between the E-step and the M-step always converge to a local optimum of $\mu$ (which may or may not also be a global optimum)? Briefly explain why. (A numerical sketch of the EM iteration follows this list.)
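For intuition about the E/M updates described above, here is a minimal numpy sketch of the iteration. The counts h, c, d are hypothetical placeholders (the assignment does not fix them), and the M-step is carried out by a dense grid search over the feasible range of μ rather than the closed-form update the exercise asks you to derive.

```python
import numpy as np

# Hypothetical observed counts (not from the assignment): c "C" grades,
# d "D" grades, and h students who got either "A" or "B".
h, c, d = 20, 10, 10

mu = 0.05                                        # initial guess for mu
grid = np.linspace(1e-6, 1/6 - 1e-6, 100000)     # feasible range: 1/2 - 3*mu > 0

for _ in range(50):
    # E-step: split the h A-or-B students in proportion to P(A) = 1/2 and P(B) = mu.
    a_hat = h * 0.5 / (0.5 + mu)
    b_hat = h * mu  / (0.5 + mu)
    # M-step: maximize the expected complete-data log-likelihood over mu
    # (up to an additive constant), here by grid search instead of the
    # closed form the exercise asks you to derive.
    ell = (a_hat * np.log(0.5) + b_hat * np.log(grid)
           + c * np.log(2 * grid) + d * np.log(0.5 - 3 * grid))
    mu = grid[np.argmax(ell)]

print("EM estimate of mu:", mu)
```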

Problem 4 (Spectral Clustering)

1. We consider the 2-clustering problem, in which we have $N$ data points $x_{1:N}$ to be grouped into two clusters, denoted by $A$ and $B$. Given the $N \times N$ affinity matrix $W$ (remember that in class we define the affinity matrix so that the diagonal entries are zero for undirected graphs), consider the following two problems:

Min-cut: minimize $\sum_{i \in A} \sum_{j \in B} W_{ij}$;

Normalized cut: minimize
$$\frac{\sum_{i \in A} \sum_{j \in B} W_{ij}}{\sum_{i \in A} \sum_{j=1}^{N} W_{ij}} + \frac{\sum_{i \in A} \sum_{j \in B} W_{ij}}{\sum_{i \in B} \sum_{j=1}^{N} W_{ij}}.$$

2. The data points are shown in Figure (a) above. The grid unit is 1. Let $W_{ij} = e^{-\|x_i - x_j\|_2^2}$; give the clustering results of min-cut and normalized cut, respectively (please draw a rough sketch and give the separation boundary in the answer book).

3. The data points are shown in Figure (b) above. The grid unit is 1. Let $W_{ij} = e^{-\|x_i - x_j\|_2^2 / (2\sigma^2)}$; describe the clustering results of the min-cut algorithm for $\sigma^2 = 50$ and $\sigma^2 = 0.5$, respectively (please draw a rough sketch and give the separation boundaries for each case of $\sigma^2$ in the answer book).

[Figure: panels (a) and (b) show the two data point configurations referred to above.]

4. Now back to the setting of the 2-clustering problem shown in Figure (a). The grid unit is 1.

a) If we use the Euclidean distance to construct the affinity matrix $W$ as follows:
$$W_{ij} = \begin{cases} 1, & \text{if } \|x_i - x_j\|_2^2 \le \sigma^2; \\ 0, & \text{otherwise,} \end{cases}$$
what $\sigma^2$ value would you choose? Briefly explain.

b) The next step is to compute the first $k = 2$ dominant eigenvectors of the affinity matrix $W$. For the value of $\sigma^2$ you chose in the previous question, can you analytically compute the eigenvalues corresponding to the first two eigenvectors? If yes, compute and report the eigenvalues. If not, briefly explain. (A small numerical sketch related to this computation follows.)
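To build intuition for how σ² shapes the affinity structure, the sketch below constructs the Gaussian affinity matrix of part 3 (with zero diagonal, as defined in class) and inspects its two dominant eigenvectors. The point coordinates are made-up stand-ins, since the actual points are given only in the figures.

```python
import numpy as np

# Hypothetical 2D points standing in for the figure; replace with the real coordinates.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])

sigma2 = 0.5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # ||x_i - x_j||_2^2
W = np.exp(-sq_dists / (2 * sigma2))                        # Gaussian affinity
np.fill_diagonal(W, 0.0)                                    # zero diagonal, as defined in class

# Two dominant eigenvalues/eigenvectors of the symmetric affinity matrix W.
eigvals, eigvecs = np.linalg.eigh(W)                        # ascending order
top2_vals = eigvals[-2:][::-1]
top2_vecs = eigvecs[:, -2:][:, ::-1]
print("two dominant eigenvalues:", top2_vals)

# A crude 2-way split: threshold the second dominant eigenvector at 0
# (one simple heuristic, not the full normalized-cut procedure).
labels = (top2_vecs[:, 1] > 0).astype(int)
print("cluster labels:", labels)
```

Thresholding the second dominant eigenvector is only one heuristic for a 2-way split; the questions about min-cut and normalized cut above should still be answered analytically.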

Exercise 5 (Dimensionality Reduction)

    1. (PCA vs. LDA) Plot the directions of the first PCA component (plot (a)) and the first LDA component (plot (b)) in the following figures, respectively.

    2. (PCA and SVD) Given 6 data points in 5-D space: (1, 1, 1, 0, 0), (-3, -3, -3, 0, 0), (2, 2, 2, 0, 0), (0, 0, 0, -1, -1), (0, 0, 0, 2, 2), (0, 0, 0, -1, -1), we can represent these data points by a $6 \times 5$ matrix $X$, where each row corresponds to a data point:

$$X = \begin{pmatrix} 1 & 1 & 1 & 0 & 0 \\ -3 & -3 & -3 & 0 & 0 \\ 2 & 2 & 2 & 0 & 0 \\ 0 & 0 & 0 & -1 & -1 \\ 0 & 0 & 0 & 2 & 2 \\ 0 & 0 & 0 & -1 & -1 \end{pmatrix}$$

a) What is the sample mean of the data set?

[Figure for part 1: (c) first PCA component, (d) first LDA component.]

b) What is the SVD of the data matrix, $X = U D V^T$, where $U$ and $V$ satisfy $U^T U = V^T V = I_2$? Note that the SVD for this matrix must take the following form, where $a, b, c, d, \sigma_1, \sigma_2$ are the parameters you need to determine.

$$X = \begin{pmatrix} a & 0 \\ -3a & 0 \\ 2a & 0 \\ 0 & b \\ 0 & -2b \\ 0 & b \end{pmatrix} \times \begin{pmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{pmatrix} \times \begin{pmatrix} c & c & c & 0 & 0 \\ 0 & 0 & 0 & d & d \end{pmatrix}$$

c) What is the first principal component for the original data points?

d) If we want to project the original data points $\{x_i\}_{i=1}^{6}$ into 1-D space using the principal component you chose, what is the sample variance of the projected data $\{\hat{x}_i\}_{i=1}^{6}$?

e) For the projected data in d), if we now represent them in the original 5-D space, what is the reconstruction error $\frac{1}{6}\sum_{i=1}^{6}\|x_i - \hat{x}_i\|_2^2$? (A short numerical sketch of parts a)–e) follows.)
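If you want to check your hand computations for parts a)–e), a short numpy sketch follows. Note that whether "sample variance" uses the 1/N or 1/(N−1) convention is an assumption here; the code uses 1/N, so adjust to whichever convention your course adopts.

```python
import numpy as np

# Data matrix from the exercise: each row is one of the six 5-D points.
X = np.array([[ 1,  1,  1,  0,  0],
              [-3, -3, -3,  0,  0],
              [ 2,  2,  2,  0,  0],
              [ 0,  0,  0, -1, -1],
              [ 0,  0,  0,  2,  2],
              [ 0,  0,  0, -1, -1]], dtype=float)

print("sample mean:", X.mean(axis=0))          # part a)

# Thin SVD: X = U @ np.diag(s) @ Vt; only two singular values are nonzero here.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print("singular values:", s)                   # part b)

v1 = Vt[0]                                     # first principal direction (part c)
z = X @ v1                                     # 1-D projection of the data
print("first principal component direction:", v1)
print("sample variance of projection:", z.var())    # part d), 1/N convention

X_hat = np.outer(z, v1)                        # rank-1 reconstruction in 5-D
print("reconstruction error:", ((X - X_hat) ** 2).sum(axis=1).mean())   # part e)
```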

Exercise 6 (PCA as factor analysis and SVD, optional)

PCA of a set of data in $\mathbb{R}^p$ provides a sequence of best linear approximations to those data, of all ranks $q \le p$. Denote the observations by $x_1, x_2, \ldots, x_N$, and consider the rank-$q$ linear model for representing them:
$$f(\lambda) = \mu + V_q \lambda,$$
where $\mu$ is a location vector in $\mathbb{R}^p$, $V_q$ is a $p \times q$ matrix with $q$ orthogonal unit vectors as columns, and $\lambda$ is a $q$-vector of parameters. If we can find such a model, then we can reconstruct each $x_i$ by a low-dimensional coordinate vector $\lambda_i$ through
$$x_i = f(\lambda_i) + \varepsilon_i = \mu + V_q \lambda_i + \varepsilon_i, \tag{1}$$


where $\varepsilon_i \in \mathbb{R}^p$ are noise terms. PCA then amounts to minimizing this reconstruction error by the least squares method:
$$\min_{\mu, \{\lambda_i\}, V_q} \sum_{i=1}^{N} \|x_i - \mu - V_q \lambda_i\|^2.$$

1. Assume $V_q$ is known and treat $\mu$ and $\lambda_i$ as unknowns. Show that the least squares problem
$$\min_{\mu, \{\lambda_i\}} \sum_{i=1}^{N} \|x_i - \mu - V_q \lambda_i\|^2$$
is minimized by
$$\hat{\mu} = \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \tag{2}$$
$$\hat{\lambda}_i = V_q^T (x_i - \bar{x}). \tag{3}$$
Also show that the solution for $\hat{\mu}$ is not unique, and give a family of solutions for $\hat{\mu}$.

2. For the standard solution (2), we are left with solving
$$\min_{V_q} \sum_{i=1}^{N} \|(x_i - \bar{x}) - V_q V_q^T (x_i - \bar{x})\|^2 = \min_{V_q} \operatorname{Tr}\big( \tilde{X} (I_p - V_q V_q^T) \tilde{X}^T \big). \tag{4}$$
Here we introduce the centered sample matrix
$$\tilde{X} = \Big(I_N - \frac{1}{N} J_N\Big) X = \begin{pmatrix} (x_1 - \bar{x})^T \\ \vdots \\ (x_N - \bar{x})^T \end{pmatrix} \in \mathbb{R}^{N \times p},$$

where $I_N$ is the $N \times N$ identity matrix and $J_N$ is the $N \times N$ matrix whose entries are all 1's. Recall the singular value decomposition (SVD) from linear algebra: $\tilde{X} = U D V^T$. Here $U$ is an $N \times p$ orthogonal matrix ($U^T U = I_p$) whose columns $u_j$ are called the left singular vectors; $V$ is a $p \times p$ orthogonal matrix ($V^T V = I_p$) with columns $v_j$ called the right singular vectors; and $D$ is a $p \times p$ diagonal matrix with diagonal elements $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$, known as the singular values.

Show that the solution $V_q$ to problem (4) consists of the first $q$ columns of $V$. (Then the optimal $\hat{\lambda}_i$ is given by the $i$-th row of the first $q$ columns of $UD$.) (See the sketch below.)
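As an informal numerical check of this claim (not a proof), the sketch below compares the reconstruction error attained by the first q right singular vectors of the centered data matrix against a randomly chosen orthonormal $V_q$; the data are random placeholders.

```python
import numpy as np

# Numerical check: among p x q matrices with orthonormal columns, the first q
# right singular vectors of the centered data matrix minimize the reconstruction
# error. Data here are random placeholders, not from the assignment.
rng = np.random.default_rng(0)
N, p, q = 50, 6, 2
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))   # arbitrary data

xbar = X.mean(axis=0)
Xc = X - xbar                                            # centered sample matrix X~

def recon_error(Vq):
    """Sum of squared residuals  sum_i ||(x_i - xbar) - Vq Vq^T (x_i - xbar)||^2."""
    return ((Xc - Xc @ Vq @ Vq.T) ** 2).sum()

# SVD solution: first q right singular vectors.
U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
V_svd = Vt[:q].T
print("error with first q right singular vectors:", recon_error(V_svd))

# Any random orthonormal V_q should do no better.
Q, _ = np.linalg.qr(rng.normal(size=(p, q)))
print("error with a random orthonormal V_q:     ", recon_error(Q))
```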

Remark: The model (1), in general, gives the factor analysis model in multivariate statistics:
$$x = \mu + V_q \lambda + \varepsilon.$$
In traditional factor analysis, the $\lambda_j$ with $j = 1, \ldots, q$ are assumed to be Gaussian and uncorrelated, as are the $\varepsilon_i$ with $i = 1, \ldots, p$. Independent Component Analysis (ICA), by contrast, assumes that the $\lambda_j$ with $j = 1, \ldots, q$ are non-Gaussian and independent. Because of this independence, ICA is particularly useful for separating mixed signals.
