Homework 01 Solution

CSC 591/791, Fall 2019
This assignment contains 6 questions. Please read and follow the instructions below.

DUE DATE: Oct 4th, 11:45 PM

TOTAL NUMBER OF POINTS: 135

NO PARTIAL CREDIT will be given, so provide concise answers.

Clearly list your team ID, each team member's name, and Unity IDs at the top of your submission.

Submit only a single PDF file per group, containing your answers.

  1. (45 points) [Song Ju] [Expectation Maximization]

You are running a Naive Bayes classifier using 3 binary feature variables X1, X2, X3 to predict the status of a nuclear power plant Y (0 for "Normal" and 1 for "Malfunction"). Note that the actual status of the nuclear power plant Y cannot be directly observed; it can only be estimated through sensors such as the core temperatures. In this simplified scenario, let's assume there is only one sensor, Z, with binary values: 0 for "Normal" vs. 1 for "Abnormal". One day you realize some of the sensor values Z are missing. Based on the nuclear power plant's manual, the probability of a missing value is much higher when Y = 1 than when Y = 0. More specifically, the exact values from the sensor specifications are:

P(Z = missing | Y = 1) = 0.15,  P(Z = 1 | Y = 1) = 0.85

P(Z = missing | Y = 0) = 0.03,  P(Z = 0 | Y = 0) = 0.97

(a) (5 points) Draw a Bayes net that represents this problem, with a node Y that is the unobserved label, a sensor node Z that is either a copy of Y or has the value "missing", and the three features X1, X2, X3.

(b) (5 points) What is the probability of Y = 1 given that our sensor data Z is missing, i.e., P(Y = 1 | Z = "missing")? Derive your answer in terms of θ_{Y=1}, denoting P(Y = 1).
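As a numeric sanity check of the kind of computation part (b) asks for (not the requested symbolic derivation), a minimal Python sketch applying Bayes' rule to the missingness model; the prior θ_{Y=1} = 0.1 is a hypothetical value, not given in the problem:

# Bayes' rule on the missingness model, with a hypothetical prior P(Y=1) = 0.1
theta_y1 = 0.1                     # assumed prior P(Y = 1); not part of the problem
p_miss_given_y1 = 0.15             # P(Z = missing | Y = 1), from the sensor spec
p_miss_given_y0 = 0.03             # P(Z = missing | Y = 0), from the sensor spec

numerator = p_miss_given_y1 * theta_y1
evidence = numerator + p_miss_given_y0 * (1 - theta_y1)
posterior = numerator / evidence   # P(Y = 1 | Z = missing)
print(f"P(Y=1 | Z=missing) = {posterior:.4f}")   # ~0.357 for this prior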

(c) (5 points) Based on the sensor data Z, we approximate the counts and probabilities of each Xi variable under the condition Y = 1, as shown below.

Count(X1 = 1 | Y = 1) = 20,  Count(X2 = 1 | Y = 1) = 24,  Count(X3 = 1 | Y = 1) = 4

P(X1 = 1 | Y = 1) = 1/2,  P(X2 = 1 | Y = 1) = θ,  P(X3 = 1 | Y = 1) = 1/3

Assume X1|Y, X2|Y, X3|Y are all Bernoulli variables. Please apply the log-likelihood to calculate θ.
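For reference, the generic Bernoulli maximum-likelihood step this part relies on, written in LaTeX; here k would be a count such as Count(X2 = 1 | Y = 1) = 24, and N the number of Y = 1 examples, which the problem leaves implicit:

% Bernoulli log-likelihood for k successes in N conditionally i.i.d. draws
\ell(\theta) = \log \prod_{i=1}^{N} \theta^{x_i}(1-\theta)^{1-x_i}
             = k \log\theta + (N-k)\log(1-\theta), \qquad k = \sum_{i=1}^{N} x_i
% setting the derivative to zero yields the MLE
\frac{d\ell}{d\theta} = \frac{k}{\theta} - \frac{N-k}{1-\theta} = 0
\quad\Longrightarrow\quad \hat\theta_{\mathrm{MLE}} = \frac{k}{N}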

(d) (15 points) Despite the approximation in part (c), we would like to theoretically learn the best choice of parameters for P(Y), P(X1|Y), P(X2|Y), and P(X3|Y). Assume Y, X1|Y, X2|Y, X3|Y are all Bernoulli variables, and let θ denote the collection of the following parameters:

θ_{Y=y} = P(Y = y);
θ_{X1=x1|Y=y} = P(X1 = x1 | Y = y);
θ_{X2=x2|Y=y} = P(X2 = x2 | Y = y);
θ_{X3=x3|Y=y} = P(X3 = x3 | Y = y).

Write the log-likelihood of X, Y and Z given θ, in terms of θ and P(Z|Y), first for a single example (X1 = x1, X2 = x2, X3 = x3, Z = z, Y = y), then for n i.i.d. examples (X1^i = x1^i, X2^i = x2^i, X3^i = x3^i, Z^i = z^i, Y^i = y^i) for i = 1, ..., n.
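One consistent reading of the factorization being asked for, sketched in LaTeX from the Bayes net of part (a) (Y the parent of X1, X2, X3, and Z):

% single example (x_1, x_2, x_3, z, y)
\log P(x, y, z \mid \theta)
  = \log \theta_{Y=y}
  + \sum_{j=1}^{3} \log \theta_{X_j = x_j \mid Y = y}
  + \log P(Z = z \mid Y = y)
% n i.i.d. examples: the same expression summed over i = 1, ..., n
\log P(\mathbf{X}, \mathbf{Y}, \mathbf{Z} \mid \theta)
  = \sum_{i=1}^{n} \Big[ \log \theta_{Y=y^i}
  + \sum_{j=1}^{3} \log \theta_{X_j = x_j^i \mid Y = y^i}
  + \log P(Z = z^i \mid Y = y^i) \Big]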

(e) (15 points) Provide the E-step and M-step for performing expectation maximization of θ for this problem. For the (t+1)-th iteration, in the E-step compute the distribution Q^{t+1}(Y | Z, X) using

Q^{t+1}(Y = 1 | Z, X) = E[Y | Z, X1, X2, X3; θ^t],

using your Bayes net from part (a) and the conditional probability from part (b) for the unobserved class label Y of a single example.

In the M-step, compute:

θ^{t+1} = argmax_θ Σ_{i=1}^{n} Σ_y Q(Y^i = y | Z^i, X^i) log P(X1^i, X2^i, X3^i, Y^i, Z^i | θ),

using all of the examples (X1^1, X2^1, X3^1, Y^1, Z^1), ..., (X1^n, X2^n, X3^n, Y^n, Z^n). Note: it is OK to leave your answers in terms of Q(Y | Z, X).
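Not the requested derivation, but a minimal NumPy sketch of what the two steps compute. It assumes (consistent with the sensor spec, where P(Z = 1 | Y = 1) = 1 - 0.15 and P(Z = 0 | Y = 0) = 1 - 0.03) that an observed Z is an exact copy of Y, and encodes a missing Z as -1; e_step and m_step are our own names:

import numpy as np

# P(Z = missing | Y = y), fixed by the sensor specification (indexed by y)
P_Z_MISSING = np.array([0.03, 0.15])

def e_step(X, Z, theta_y, theta_x):
    """Q(Y=1 | Z, X) for each example.
    X: (n, 3) binary features; Z: length-n ints in {0, 1, -1}, -1 = missing;
    theta_y: P(Y=1); theta_x[j, y]: P(X_j = 1 | Y = y)."""
    q = np.empty(len(X))
    for i in range(len(X)):
        joint = np.empty(2)
        for y in (0, 1):
            prior = theta_y if y == 1 else 1.0 - theta_y
            # Naive Bayes: features are conditionally independent given Y
            feat = np.prod(np.where(X[i] == 1, theta_x[:, y], 1.0 - theta_x[:, y]))
            if Z[i] == -1:
                pz = P_Z_MISSING[y]
            else:  # an observed Z is a copy of Y, so it pins Y down exactly
                pz = (1.0 - P_Z_MISSING[y]) if Z[i] == y else 0.0
            joint[y] = prior * feat * pz
        q[i] = joint[1] / joint.sum()
    return q

def m_step(X, q):
    """Closed-form M-step: expected-count MLEs for the Bernoulli parameters."""
    theta_y = q.mean()
    theta_x = np.empty((3, 2))
    for j in range(3):
        theta_x[j, 1] = (q * X[:, j]).sum() / q.sum()
        theta_x[j, 0] = ((1 - q) * X[:, j]).sum() / (1 - q).sum()
    return theta_y, theta_x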

2. (20 points) [Song Ju] Gaussian Mixture Model (GMM)

Consider the set of training data in the graph below; let's assume it contains three clusters. For the GMM, the means and variances of the three Gaussians are μ0 and σ0, μ1 and σ1, and μ2 and σ2, respectively. Additionally, we have π0, π1, π2 to denote the mixture proportions of the three Gaussians (i.e., p(x) = π0 N(μ0, σ0 I) + π1 N(μ1, σ1 I) + π2 N(μ2, σ2 I)), where I is the identity matrix and π0 + π1 + π2 = 1. We will also use θ to refer to the entire collection of parameters (μ0, μ1, μ2, σ0, σ1, σ2, π0, π1, π2) defining the mixture model p(x).
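Before parts (a) and (b), a minimal NumPy sketch (an illustration only, not part of the question) of one EM iteration for this 3-component mixture, treating each σ_k as a scalar variance so that the covariance is σ_k I:

import numpy as np

def gmm_em_iteration(X, pi, mu, sigma):
    """One EM iteration for p(x) = sum_k pi_k N(mu_k, sigma_k I).
    X: (n, d) data; pi: (K,) mixture weights; mu: (K, d) means;
    sigma: (K,) scalar variances (covariance sigma_k * I)."""
    n, d = X.shape
    K = len(pi)
    resp = np.empty((n, K))
    for k in range(K):
        sq = ((X - mu[k]) ** 2).sum(axis=1)
        # spherical Gaussian density N(x; mu_k, sigma_k I)
        resp[:, k] = pi[k] * np.exp(-0.5 * sq / sigma[k]) / (2 * np.pi * sigma[k]) ** (d / 2)
    resp /= resp.sum(axis=1, keepdims=True)   # E-step: responsibilities
    Nk = resp.sum(axis=0)                     # effective counts per component
    pi_new = Nk / n                           # M-step: weights ...
    mu_new = (resp.T @ X) / Nk[:, None]       # ... means ...
    sigma_new = np.array([(resp[:, k] * ((X - mu_new[k]) ** 2).sum(axis=1)).sum()
                          / (d * Nk[k]) for k in range(K)])   # ... and variances
    return pi_new, mu_new, sigma_new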

(a) (10 points) Would K-Means (K = 3) and our 3-cluster GMM trained using EM produce the same cluster centers (means) for the data set above? Justify your answer. (An answer without any justification will get zero points.)

(b) (10 points) In the following, we apply EM to train our 3-cluster GMM on the data below. The '+' points indicate the current means μ0, μ1, and μ2 of the three Gaussians after the k-th iteration of EM.

(b.1) (3 points) On the figure, draw the directions in which μ0, μ1 and μ2 will move in the next EM iteration.

(b.2) (3 points) Will the marginal likelihood of the data, ∏_j P(x^j | θ), increase or decrease on the next EM iteration? Explain your reasoning.

(b.3) (4 points) Will the estimate of π0 increase or decrease on the next EM iteration? Explain your reasoning.

3. (26 points) [Farzaneh Khoshnevisan] Semi-supervised Learning

Consider the following figure, which contains labeled (L) (class 1: black circles, class 2: hollow circles) and unlabeled (U) (blue squares) data. In this question, you will use two semi-supervised methods, S3VM and co-training, to utilize the unlabeled data for further improvement of an SVM classifier.


(a) (10 points) Explain how the semi-supervised SVM (S3VM) would perform on this data compared to the supervised SVM, by plotting the separating hyperplanes produced by both algorithms. For the SVM, please draw the margin boundaries and the separating hyperplane with solid lines. Using a different color, draw the margin boundaries and the separating hyperplane of the S3VM with dashed lines.

(b) (10 points) In applying co-training, g1 and g2 are SVM classifiers, and p_i/n_i represents the number of positive/negative points to label at iteration i. In this example, assume that p1 = n1 = 2 and p2 = n2 = 1. (A sketch of the labeling step appears after part (c).)

(b.1) (5 points) What is the main underlying assumption in applying co-training? How can we apply it to the above data (what are the two classifiers)?

(b.2) (5 points) Identify the label of each point x_i ∈ U and the final SVM separating hyperplane after applying 2 iterations of semi-supervised co-training. Assume that at iteration i, the p_i and n_i points are labeled based on the farthest distance from the separating hyperplane.

(c) (6 points) Compare the separating hyperplanes produced by S3VM from part (a) and co-training from part (b.2). Explain why you think these two algorithms perform similarly or differently.
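As referenced in part (b), a minimal scikit-learn sketch of the confidence-based labeling step. Two hedges: it is a single-view simplification (true co-training trains g1 and g2 on two separate feature views, each labeling points for the other), and the function name, the -1/+1 label encoding, and the linear kernel are our own assumptions:

import numpy as np
from sklearn.svm import SVC

def labeling_round(XL, yL, XU, p, n):
    """Fit a linear SVM on the labeled pool, then move the p unlabeled points
    farthest on the positive side and the n farthest on the negative side of
    the hyperplane into the labeled pool with their predicted labels."""
    clf = SVC(kernel="linear").fit(XL, yL)
    scores = clf.decision_function(XU)      # signed distance to the hyperplane
    order = np.argsort(scores)
    picked = np.concatenate([order[-p:], order[:n]])
    XL = np.vstack([XL, XU[picked]])
    yL = np.concatenate([yL, np.where(scores[picked] > 0, 1, -1)])
    XU = np.delete(XU, picked, axis=0)
    return XL, yL, XU

# two rounds with p1 = n1 = 2 and p2 = n2 = 1, as in the question:
# XL, yL, XU = labeling_round(XL, yL, XU, 2, 2)
# XL, yL, XU = labeling_round(XL, yL, XU, 1, 1)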

4. (15 points) [Farzaneh Khoshnevisan] Generalized Sequential Pattern

Consider a data sequence

S = ⟨{A, B} {B, C} {D, E} {A, D} {B, E, F}⟩

and the following time constraints:

min gap = 0 (the interval between the last event in e_i and the first event in e_{i+1} is > 0)
max gap = 2 (the interval between the first event in e_i and the last event in e_{i+1} is <= 2)
max span = 5 (the interval between the first event in e_1 and the last event in e_last is <= 5)
ws = 1 (the time between the first and last events in e_i is <= 1)

For each of the sequences w = (e_1, ..., e_last) below, determine whether it is a subsequence of S and, if not, which constraint excludes it. (A sketch of a constraint checker follows the list.)


(a) (3 points) w = ⟨{A} {B} {C} {D} {E}⟩

(b) (3 points) w = ⟨{B} {D} {E}⟩

(c) (3 points) w = ⟨{D, E} {D, E}⟩

(d) (3 points) w = ⟨{A} {C, D, E} {A, F}⟩

(e) (3 points) w = ⟨{A, B, C, D} {E, F}⟩
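As referenced above, a minimal Python sketch of a timing-constrained containment check. The timestamps are our assumption (the problem gives none): the five elements of S are taken to occur at consecutive times t = 1, ..., 5, and the function names are ours:

from itertools import product

# assumed timestamps: one element of S per time step t = 1..5
S = {1: {"A", "B"}, 2: {"B", "C"}, 3: {"D", "E"}, 4: {"A", "D"}, 5: {"B", "E", "F"}}
MIN_GAP, MAX_GAP, MAX_SPAN, WS = 0, 2, 5, 1

def placements(elem):
    """All (first, last) event times at which elem can occur: every event is
    assigned a timestamp of S containing it, within a window of width <= WS."""
    choices = [[t for t in S if ev in S[t]] for ev in elem]
    return sorted({(min(ts), max(ts)) for ts in product(*choices)
                   if max(ts) - min(ts) <= WS})

def contains(w):
    """True iff w (a list of sets) is a subsequence of S under all constraints."""
    def extend(i, prev_first, prev_last, start):
        if i == len(w):
            return True
        for s, e in placements(w[i]):
            if i == 0:
                if extend(1, s, e, s):
                    return True
            elif (s - prev_last > MIN_GAP        # min gap, as defined above
                  and e - prev_first <= MAX_GAP  # max gap, as defined above
                  and e - start <= MAX_SPAN      # max span
                  and extend(i + 1, s, e, start)):
                return True
        return False
    return extend(0, None, None, None)

print(contains([{"B"}, {"D"}, {"E"}]))   # checks part (b)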

5. (15 points) [Farzaneh Khoshnevisan] Generalized Sequential Pattern

Consider the following frequent 3-sequences:

⟨{1, 2, 3}⟩, ⟨{1, 3} {4}⟩, ⟨{1} {2, 4}⟩, ⟨{1} {4} {5}⟩, ⟨{2, 3} {4}⟩, ⟨{2, 3} {5}⟩, ⟨{2, 4} {4}⟩, ⟨{2} {4, 5}⟩, ⟨{3} {4, 5}⟩, ⟨{3} {4} {5}⟩, ⟨{4} {4, 5}⟩.

(A sketch of the GSP join step follows part (c).)

(a) (5 points) List all the candidate 4-sequences produced by the candidate generation step of the GSP algorithm.

(b) (5 points) List all the candidate 4-sequences pruned during the candidate pruning step of the GSP algorithm (assuming no timing constraints).

(c) (5 points) List all the candidate 4-sequences pruned during the candidate pruning step of the GSP algorithm (assuming max gap = 1).
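As referenced above, a minimal Python sketch of the join step of GSP candidate generation relevant to part (a); elements are encoded as sorted tuples, and drop_first, drop_last, and gsp_join are our own names:

def drop_first(seq):
    """Remove the first event of the first element."""
    head = seq[0][1:]
    return ([head] if head else []) + list(seq[1:])

def drop_last(seq):
    """Remove the last event of the last element."""
    tail = seq[-1][:-1]
    return list(seq[:-1]) + ([tail] if tail else [])

def gsp_join(s1, s2):
    """s1 and s2 join when dropping the first event of s1 and the last event
    of s2 leaves the same (k-1)-sequence; the last event of s2 is then added
    to s1, either as its own element or merged into s1's last element."""
    if drop_first(s1) != drop_last(s2):
        return None
    ev = s2[-1][-1]
    if len(s2[-1]) == 1:                    # the event formed its own element in s2
        return list(s1) + [(ev,)]
    return list(s1[:-1]) + [tuple(sorted(s1[-1] + (ev,)))]

F3 = [((1, 2, 3),), ((1, 3), (4,)), ((1,), (2, 4)), ((1,), (4,), (5,)),
      ((2, 3), (4,)), ((2, 3), (5,)), ((2, 4), (4,)), ((2,), (4, 5)),
      ((3,), (4, 5)), ((3,), (4,), (5,)), ((4,), (4, 5))]
candidates = {tuple(c) for a in F3 for b in F3 if (c := gsp_join(a, b))}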

6. (14 points) [Farzaneh Khoshnevisan] Decision Theory

Imagine you want to purchase a start-up company. With a 5% chance the company's value will go up and you will make $10,000K, and with a 95% chance the company will fail and you will lose $600K. Additionally, it would cost you $200K to hire a group of experts for consultation, who will help you determine how promising the company will be. This group of experts is known to be accurate 75% of the time.

Your goal is to maximize the expected value of your decision. What, if any, is the best action (or actions) you should take, and what is your expected value? Assume that you are risk-neutral.

Draw a decision tree that supports your conclusion and show all the probabilities and utility values for every node.
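One way to organize the expected-value arithmetic behind such a decision tree, as a hedged sketch: it assumes the experts' 75% accuracy is symmetric across both outcomes, and all amounts are in $K:

P_UP, GAIN, LOSS, FEE, ACC = 0.05, 10_000, 600, 200, 0.75

def ev_buy(p_up):
    """Expected value of buying, given a belief p_up that the company goes up."""
    return p_up * GAIN - (1 - p_up) * LOSS

# without consulting: buy blind vs. do nothing
ev_blind = ev_buy(P_UP)

# with consulting: update the belief on each possible expert report (Bayes' rule)
p_say_up = ACC * P_UP + (1 - ACC) * (1 - P_UP)     # P(experts say "up")
p_up_if_up = ACC * P_UP / p_say_up                 # P(up | experts say "up")
p_up_if_down = (1 - ACC) * P_UP / (1 - p_say_up)   # P(up | experts say "down")

# after each report, take the better of buying and walking away
ev_consult = (p_say_up * max(ev_buy(p_up_if_up), 0)
              + (1 - p_say_up) * max(ev_buy(p_up_if_down), 0) - FEE)

print(f"buy blind: {ev_blind:+.1f}K  do nothing: +0.0K  consult first: {ev_consult:+.1f}K")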
