Homework 01 Solution

CSC 591/791, Fall 2019
This assignment contains 6 questions. Please read and follow the instructions below.

DUE DATE: Oct 4th, 11:45 PM

TOTAL NUMBER OF POINTS: 135

NO PARTIAL CREDIT will be given, so provide concise answers.

Clearly list your team ID, each team member's name, and Unity IDs at the top of your submission.

Submit only a single PDF file per group, containing your answers.

  1. (45 points) [Song Ju] [Expectation Maximization]

You are running a Naive Bayes classifier using 3 binary feature variables X1, X2, X3 to predict the status of a nuclear power plant Y (0 for "Normal" and 1 for "Malfunction"). Note that the actual status of the nuclear power plant Y cannot be directly observed; it can only be estimated through sensors such as the core temperatures. In this simplified scenario, let's assume there is only one sensor, Z, with binary values: 0 for "Normal" vs. 1 for "Abnormal". One day you realize some of the sensor values Z are missing. Based on the nuclear power plant's manual, the probability of a missing value is much higher when Y = 1 than when Y = 0. More specifically, the exact values from the sensor specifications are:

P(Z = missing | Y = 1) = 0.15,  P(Z = 1 | Y = 1) = 0.85

P(Z = missing | Y = 0) = 0.03,  P(Z = 0 | Y = 0) = 0.97

(a) (5 points) Draw a Bayes net that represents this problem, with a node Y that is the unobserved label, a sensor node Z that is either a copy of Y or has the value "missing", and the three features X1, X2, X3.

(b) (5 points) What is the probability of Y = 1 given that our sensor data Z is missing, i.e., P(Y = 1 | Z = "missing")? Derive your answer in terms of θ_{Y=1}, denoting P(Y = 1).
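As a numeric sanity check of the kind of computation part (b) asks for (not the requested symbolic derivation), a minimal Python sketch applying Bayes' rule to the missingness model; the prior θ_{Y=1} = 0.1 is a hypothetical value, not given in the problem:

# Bayes' rule on the missingness model, with a hypothetical prior P(Y=1) = 0.1
theta_y1 = 0.1                     # assumed prior P(Y = 1); not part of the problem
p_miss_given_y1 = 0.15             # P(Z = missing | Y = 1), from the sensor spec
p_miss_given_y0 = 0.03             # P(Z = missing | Y = 0), from the sensor spec

numerator = p_miss_given_y1 * theta_y1
evidence = numerator + p_miss_given_y0 * (1 - theta_y1)
posterior = numerator / evidence   # P(Y = 1 | Z = missing)
print(f"P(Y=1 | Z=missing) = {posterior:.4f}")   # ~0.357 for this prior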

(c) (5 points) Based on the sensor data Z, we approximate the counts and probabilities of each Xi variable under the condition Y = 1, as shown below.

Count(X1 = 1 | Y = 1) = 20,  Count(X2 = 1 | Y = 1) = 24,  Count(X3 = 1 | Y = 1) = 4

P(X1 = 1 | Y = 1) = 1/2,  P(X2 = 1 | Y = 1) = θ,  P(X3 = 1 | Y = 1) = 1/3

Assume X1|Y, X2|Y, X3|Y are all Bernoulli variables. Please apply the log-likelihood to calculate θ.
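For reference, the generic Bernoulli maximum-likelihood step this part relies on, written in LaTeX; here k would be a count such as Count(X2 = 1 | Y = 1) = 24, and N the number of Y = 1 examples, which the problem leaves implicit:

% Bernoulli log-likelihood for k successes in N conditionally i.i.d. draws
\ell(\theta) = \log \prod_{i=1}^{N} \theta^{x_i}(1-\theta)^{1-x_i}
             = k \log\theta + (N-k)\log(1-\theta), \qquad k = \sum_{i=1}^{N} x_i
% setting the derivative to zero yields the MLE
\frac{d\ell}{d\theta} = \frac{k}{\theta} - \frac{N-k}{1-\theta} = 0
\quad\Longrightarrow\quad \hat\theta_{\mathrm{MLE}} = \frac{k}{N}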

(d) (15 points) Despite the approximation in part (c), we would like to theoretically learn the best choice of parameters for P(Y), P(X1|Y), P(X2|Y), and P(X3|Y). Assume Y, X1|Y, X2|Y, X3|Y are all Bernoulli variables, and let θ denote the collection of the following parameters:

θ_{Y=y} = P(Y = y);
θ_{X1=x1|Y=y} = P(X1 = x1 | Y = y);
θ_{X2=x2|Y=y} = P(X2 = x2 | Y = y);
θ_{X3=x3|Y=y} = P(X3 = x3 | Y = y).

Write the log-likelihood of X, Y and Z given θ, in terms of θ and P(Z|Y), first for a single example (X1 = x1, X2 = x2, X3 = x3, Z = z, Y = y), then for n i.i.d. examples (X1^i = x1^i, X2^i = x2^i, X3^i = x3^i, Z^i = z^i, Y^i = y^i) for i = 1, ..., n.
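One consistent reading of the factorization being asked for, sketched in LaTeX from the Bayes net of part (a) (Y the parent of X1, X2, X3, and Z):

% single example (x_1, x_2, x_3, z, y)
\log P(x, y, z \mid \theta)
  = \log \theta_{Y=y}
  + \sum_{j=1}^{3} \log \theta_{X_j = x_j \mid Y = y}
  + \log P(Z = z \mid Y = y)
% n i.i.d. examples: the same expression summed over i = 1, ..., n
\log P(\mathbf{X}, \mathbf{Y}, \mathbf{Z} \mid \theta)
  = \sum_{i=1}^{n} \Big[ \log \theta_{Y=y^i}
  + \sum_{j=1}^{3} \log \theta_{X_j = x_j^i \mid Y = y^i}
  + \log P(Z = z^i \mid Y = y^i) \Big]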

(e) (15 points) Provide the E-step and M-step for performing expectation maximization of θ for this problem. For the (t+1)-th iteration, in the E-step compute the distribution Q^{t+1}(Y | Z, X) using

Q^{t+1}(Y = 1 | Z, X) = E[Y | Z, X1, X2, X3; θ^t],

using your Bayes net from part (a) and the conditional probability from part (b) for the unobserved class label Y of a single example.

In the M-step, compute:

θ^{t+1} = argmax_θ Σ_{i=1}^{n} Σ_y Q(Y^i = y | Z^i, X^i) log P(X1^i, X2^i, X3^i, Y^i, Z^i | θ),

using all of the examples (X1^1, X2^1, X3^1, Y^1, Z^1), ..., (X1^n, X2^n, X3^n, Y^n, Z^n). Note: it is OK to leave your answers in terms of Q(Y | Z, X).
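Not the requested derivation, but a minimal NumPy sketch of what the two steps compute. It assumes (consistent with the sensor spec, where P(Z = 1 | Y = 1) = 1 - 0.15 and P(Z = 0 | Y = 0) = 1 - 0.03) that an observed Z is an exact copy of Y, and encodes a missing Z as -1; e_step and m_step are our own names:

import numpy as np

# P(Z = missing | Y = y), fixed by the sensor specification (indexed by y)
P_Z_MISSING = np.array([0.03, 0.15])

def e_step(X, Z, theta_y, theta_x):
    """Q(Y=1 | Z, X) for each example.
    X: (n, 3) binary features; Z: length-n ints in {0, 1, -1}, -1 = missing;
    theta_y: P(Y=1); theta_x[j, y]: P(X_j = 1 | Y = y)."""
    q = np.empty(len(X))
    for i in range(len(X)):
        joint = np.empty(2)
        for y in (0, 1):
            prior = theta_y if y == 1 else 1.0 - theta_y
            # Naive Bayes: features are conditionally independent given Y
            feat = np.prod(np.where(X[i] == 1, theta_x[:, y], 1.0 - theta_x[:, y]))
            if Z[i] == -1:
                pz = P_Z_MISSING[y]
            else:  # an observed Z is a copy of Y, so it pins Y down exactly
                pz = (1.0 - P_Z_MISSING[y]) if Z[i] == y else 0.0
            joint[y] = prior * feat * pz
        q[i] = joint[1] / joint.sum()
    return q

def m_step(X, q):
    """Closed-form M-step: expected-count MLEs for the Bernoulli parameters."""
    theta_y = q.mean()
    theta_x = np.empty((3, 2))
    for j in range(3):
        theta_x[j, 1] = (q * X[:, j]).sum() / q.sum()
        theta_x[j, 0] = ((1 - q) * X[:, j]).sum() / (1 - q).sum()
    return theta_y, theta_x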

2. (20 points) [Song Ju] Gaussian Mixture Model (GMM)

Consider the set of training data in the graph below; let's assume it contains three clusters. For the GMM, the means and variances of the three Gaussians are μ0 and σ0, μ1 and σ1, and μ2 and σ2, respectively. Additionally, we have π0, π1, π2 to denote the mixture proportions of the three Gaussians (i.e., p(x) = π0 N(μ0, σ0 I) + π1 N(μ1, σ1 I) + π2 N(μ2, σ2 I)), where I is the identity matrix and π0 + π1 + π2 = 1. We will also use θ to refer to the entire collection of parameters (μ0, μ1, μ2, σ0, σ1, σ2, π0, π1, π2) defining the mixture model p(x).
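Before parts (a) and (b), a minimal NumPy sketch (an illustration only, not part of the question) of one EM iteration for this 3-component mixture, treating each σ_k as a scalar variance so that the covariance is σ_k I:

import numpy as np

def gmm_em_iteration(X, pi, mu, sigma):
    """One EM iteration for p(x) = sum_k pi_k N(mu_k, sigma_k I).
    X: (n, d) data; pi: (K,) mixture weights; mu: (K, d) means;
    sigma: (K,) scalar variances (covariance sigma_k * I)."""
    n, d = X.shape
    K = len(pi)
    resp = np.empty((n, K))
    for k in range(K):
        sq = ((X - mu[k]) ** 2).sum(axis=1)
        # spherical Gaussian density N(x; mu_k, sigma_k I)
        resp[:, k] = pi[k] * np.exp(-0.5 * sq / sigma[k]) / (2 * np.pi * sigma[k]) ** (d / 2)
    resp /= resp.sum(axis=1, keepdims=True)   # E-step: responsibilities
    Nk = resp.sum(axis=0)                     # effective counts per component
    pi_new = Nk / n                           # M-step: weights ...
    mu_new = (resp.T @ X) / Nk[:, None]       # ... means ...
    sigma_new = np.array([(resp[:, k] * ((X - mu_new[k]) ** 2).sum(axis=1)).sum()
                          / (d * Nk[k]) for k in range(K)])   # ... and variances
    return pi_new, mu_new, sigma_new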

(a) (10 points) Would K-Means (K = 3) and our 3-cluster GMM trained using EM produce the same cluster centers (means) for the data set above? Justify your answer. (An answer without any justification will get zero points.)

(b) (10 points) In the following, we apply EM to train our 3-cluster GMM on the data below. The '+' points indicate the current means μ0, μ1, and μ2 of the three Gaussians after the k-th iteration of EM.

(b.1) (3 points) On the figure, draw the directions in which μ0, μ1 and μ2 will move in the next EM iteration.

(b.2) (3 points) Will the marginal likelihood of the data, ∏_j P(x^j | θ), increase or decrease on the next EM iteration? Explain your reasoning.

(b.3) (4 points) Will the estimate of π0 increase or decrease on the next EM iteration? Explain your reasoning.

3. (26 points) [Farzaneh Khoshnevisan] Semi-supervised Learning

Consider the following figure, which contains labeled (L) (class 1: black circles, class 2: hollow circles) and unlabeled (U) (blue squares) data. In this question, you will use two semi-supervised methods, S3VM and co-training, to utilize the unlabeled data for further improvement of an SVM classifier.


(a) (10 points) Explain how the semi-supervised SVM (S3VM) would perform on this data compared to the supervised SVM, by plotting the separating hyperplanes produced by both algorithms. For the SVM, please draw the margin boundaries and the separating hyperplane with solid lines. Using a different color, draw the margin boundaries and the separating hyperplane of the S3VM with dashed lines.

(b) (10 points) In applying co-training, g1 and g2 are SVM classifiers, and p_i/n_i represents the number of positive/negative points to label at iteration i. In this example, assume that p1 = n1 = 2 and p2 = n2 = 1. (A sketch of the labeling step appears after part (c).)

(b.1) (5 points) What is the main underlying assumption in applying co-training? How can we apply it to the above data (what are the two classifiers)?

(b.2) (5 points) Identify the label of each point x_i ∈ U and the final SVM separating hyperplane after applying 2 iterations of semi-supervised co-training. Assume that at iteration i, the p_i and n_i points are labeled based on the farthest distance from the separating hyperplane.

(c) (6 points) Compare the separating hyperplanes produced by S3VM from part (a) and co-training from part (b.2). Explain why you think these two algorithms perform similarly or differently.
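As referenced in part (b), a minimal scikit-learn sketch of the confidence-based labeling step. Two hedges: it is a single-view simplification (true co-training trains g1 and g2 on two separate feature views, each labeling points for the other), and the function name, the -1/+1 label encoding, and the linear kernel are our own assumptions:

import numpy as np
from sklearn.svm import SVC

def labeling_round(XL, yL, XU, p, n):
    """Fit a linear SVM on the labeled pool, then move the p unlabeled points
    farthest on the positive side and the n farthest on the negative side of
    the hyperplane into the labeled pool with their predicted labels."""
    clf = SVC(kernel="linear").fit(XL, yL)
    scores = clf.decision_function(XU)      # signed distance to the hyperplane
    order = np.argsort(scores)
    picked = np.concatenate([order[-p:], order[:n]])
    XL = np.vstack([XL, XU[picked]])
    yL = np.concatenate([yL, np.where(scores[picked] > 0, 1, -1)])
    XU = np.delete(XU, picked, axis=0)
    return XL, yL, XU

# two rounds with p1 = n1 = 2 and p2 = n2 = 1, as in the question:
# XL, yL, XU = labeling_round(XL, yL, XU, 2, 2)
# XL, yL, XU = labeling_round(XL, yL, XU, 1, 1)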

4. (15 points) [Farzaneh Khoshnevisan] Generalized Sequential Pattern

Consider a data sequence

S = ⟨{A, B} {B, C} {D, E} {A, D} {B, E, F}⟩

and the following time constraints:

min gap = 0 (the interval between the last event in e_i and the first event in e_{i+1} is > 0)
max gap = 2 (the interval between the first event in e_i and the last event in e_{i+1} is <= 2)
max span = 5 (the interval between the first event in e_1 and the last event in e_last is <= 5)
ws = 1 (the time between the first and last events in e_i is <= 1)

For each of the sequences w = (e_1, ..., e_last) below, determine whether it is a subsequence of S and, if not, which constraint excludes it. (A sketch of a constraint checker follows the list.)


(a) (3 points) w = ⟨{A} {B} {C} {D} {E}⟩

(b) (3 points) w = ⟨{B} {D} {E}⟩

(c) (3 points) w = ⟨{D, E} {D, E}⟩

(d) (3 points) w = ⟨{A} {C, D, E} {A, F}⟩

(e) (3 points) w = ⟨{A, B, C, D} {E, F}⟩
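As referenced above, a minimal Python sketch of a timing-constrained containment check. The timestamps are our assumption (the problem gives none): the five elements of S are taken to occur at consecutive times t = 1, ..., 5, and the function names are ours:

from itertools import product

# assumed timestamps: one element of S per time step t = 1..5
S = {1: {"A", "B"}, 2: {"B", "C"}, 3: {"D", "E"}, 4: {"A", "D"}, 5: {"B", "E", "F"}}
MIN_GAP, MAX_GAP, MAX_SPAN, WS = 0, 2, 5, 1

def placements(elem):
    """All (first, last) event times at which elem can occur: every event is
    assigned a timestamp of S containing it, within a window of width <= WS."""
    choices = [[t for t in S if ev in S[t]] for ev in elem]
    return sorted({(min(ts), max(ts)) for ts in product(*choices)
                   if max(ts) - min(ts) <= WS})

def contains(w):
    """True iff w (a list of sets) is a subsequence of S under all constraints."""
    def extend(i, prev_first, prev_last, start):
        if i == len(w):
            return True
        for s, e in placements(w[i]):
            if i == 0:
                if extend(1, s, e, s):
                    return True
            elif (s - prev_last > MIN_GAP        # min gap, as defined above
                  and e - prev_first <= MAX_GAP  # max gap, as defined above
                  and e - start <= MAX_SPAN      # max span
                  and extend(i + 1, s, e, start)):
                return True
        return False
    return extend(0, None, None, None)

print(contains([{"B"}, {"D"}, {"E"}]))   # checks part (b)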

5. (15 points) [Farzaneh Khoshnevisan] Generalized Sequential Pattern

Consider the following frequent 3-sequences:

⟨{1, 2, 3}⟩, ⟨{1, 3} {4}⟩, ⟨{1} {2, 4}⟩, ⟨{1} {4} {5}⟩, ⟨{2, 3} {4}⟩, ⟨{2, 3} {5}⟩, ⟨{2, 4} {4}⟩, ⟨{2} {4, 5}⟩, ⟨{3} {4, 5}⟩, ⟨{3} {4} {5}⟩, ⟨{4} {4, 5}⟩.

(A sketch of the GSP join step follows part (c).)

(a) (5 points) List all the candidate 4-sequences produced by the candidate generation step of the GSP algorithm.

(b) (5 points) List all the candidate 4-sequences pruned during the candidate pruning step of the GSP algorithm (assuming no timing constraints).

(c) (5 points) List all the candidate 4-sequences pruned during the candidate pruning step of the GSP algorithm (assuming max gap = 1).
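As referenced above, a minimal Python sketch of the join step of GSP candidate generation relevant to part (a); elements are encoded as sorted tuples, and drop_first, drop_last, and gsp_join are our own names:

def drop_first(seq):
    """Remove the first event of the first element."""
    head = seq[0][1:]
    return ([head] if head else []) + list(seq[1:])

def drop_last(seq):
    """Remove the last event of the last element."""
    tail = seq[-1][:-1]
    return list(seq[:-1]) + ([tail] if tail else [])

def gsp_join(s1, s2):
    """s1 and s2 join when dropping the first event of s1 and the last event
    of s2 leaves the same (k-1)-sequence; the last event of s2 is then added
    to s1, either as its own element or merged into s1's last element."""
    if drop_first(s1) != drop_last(s2):
        return None
    ev = s2[-1][-1]
    if len(s2[-1]) == 1:                    # the event formed its own element in s2
        return list(s1) + [(ev,)]
    return list(s1[:-1]) + [tuple(sorted(s1[-1] + (ev,)))]

F3 = [((1, 2, 3),), ((1, 3), (4,)), ((1,), (2, 4)), ((1,), (4,), (5,)),
      ((2, 3), (4,)), ((2, 3), (5,)), ((2, 4), (4,)), ((2,), (4, 5)),
      ((3,), (4, 5)), ((3,), (4,), (5,)), ((4,), (4, 5))]
candidates = {tuple(c) for a in F3 for b in F3 if (c := gsp_join(a, b))}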

6. (14 points) [Farzaneh Khoshnevisan] Decision Theory

Imagine you want to purchase a start-up company. With a 5% chance the company's value will go up and you will make $10,000K, and with a 95% chance the company will fail and you will lose $600K. Additionally, it would cost you $200K to hire a group of experts for consultation, who will help you determine how promising the company will be. This group of experts is known to be accurate 75% of the time.

Your goal is to maximize the expected value of your decision. What, if any, is the best action (or actions) you should take, and what is your expected value? Assume that you are risk-neutral.

Draw a decision tree that supports your conclusion and show all the probabilities and utility values for every node.
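One way to organize the expected-value arithmetic behind such a decision tree, as a hedged sketch: it assumes the experts' 75% accuracy is symmetric across both outcomes, and all amounts are in $K:

P_UP, GAIN, LOSS, FEE, ACC = 0.05, 10_000, 600, 200, 0.75

def ev_buy(p_up):
    """Expected value of buying, given a belief p_up that the company goes up."""
    return p_up * GAIN - (1 - p_up) * LOSS

# without consulting: buy blind vs. do nothing
ev_blind = ev_buy(P_UP)

# with consulting: update the belief on each possible expert report (Bayes' rule)
p_say_up = ACC * P_UP + (1 - ACC) * (1 - P_UP)     # P(experts say "up")
p_up_if_up = ACC * P_UP / p_say_up                 # P(up | experts say "up")
p_up_if_down = (1 - ACC) * P_UP / (1 - p_say_up)   # P(up | experts say "down")

# after each report, take the better of buying and walking away
ev_consult = (p_say_up * max(ev_buy(p_up_if_up), 0)
              + (1 - p_say_up) * max(ev_buy(p_up_if_down), 0) - FEE)

print(f"buy blind: {ev_blind:+.1f}K  do nothing: +0.0K  consult first: {ev_consult:+.1f}K")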
