CSC 4850/6850: Machine Learning Homework 1

  • Parameter Estimation [15 points]

This question uses a probability distribution called the Poisson distribution. A discrete random variable X follows a Poisson distribution with parameter λ if

$$ p(X \mid \lambda) = \frac{\lambda^{X} e^{-\lambda}}{X!}. $$

The Poisson distribution is a useful discrete distribution that can be used to model the number of occurrences of something per unit of time. For example, if a bank teller sits at a counter, the number of customers arriving in each interval, say 30 minutes, follows a Poisson distribution.

Here, we will estimate the parameter λ from n observations $\{X_1, \ldots, X_i, \ldots, X_n\}$ (e.g., the number of customers for the i-th teller in 30 minutes), which we assume are drawn i.i.d. from the Poisson distribution.
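
As a quick numerical illustration (not part of what you must submit), the pmf above can be evaluated directly; the rate λ = 2 in the sketch below is an arbitrary value chosen only for illustration:

    # Minimal sketch: evaluate p(X | lambda) = lambda^X * e^(-lambda) / X!
    # for a few counts, with lambda = 2 chosen arbitrarily for illustration.
    import math

    lam = 2.0
    for x in range(6):
        p = lam**x * math.exp(-lam) / math.factorial(x)
        print(f"p(X={x} | lambda={lam}) = {p:.4f}")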

  1. (3 points) Compute the log-likelihood for observations $\{X_1, \ldots, X_n\}$.

  2. (4 points) Compute the MLE for λ.

  3. (8 points) Now let’s be Bayesian and put a prior distribution over the parameter λ. Your extensive experience in statistics tells you that a good prior distribution for λ is a Gamma distribution:

$$ p(\lambda \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\beta \lambda}, \qquad \lambda > 0. $$

It is well known that the mode of λ in a Gamma distribution with parameters α and β is (α − 1)/β for α > 1. Recall that the mode in statistics represents the value that appears most often (i.e., the maximum of the probability density function). Then, compute the MAP estimate for λ. (Hint: $\lambda^{\sum_i X_i + \alpha - 1} e^{-(n + \beta)\lambda}$ can be represented by a Gamma distribution; the short sketch below shows where this expression comes from.)
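
For orientation (a sketch, not a required part of your answer): the MAP estimate maximizes the posterior, which by Bayes’ rule is proportional to the likelihood times the prior. Dropping factors that do not depend on λ,

$$ p(\lambda \mid X_1, \ldots, X_n) \;\propto\; \left( \prod_{i=1}^{n} \frac{\lambda^{X_i} e^{-\lambda}}{X_i!} \right) \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda^{\alpha - 1} e^{-\beta \lambda} \;\propto\; \lambda^{\sum_i X_i + \alpha - 1}\, e^{-(n + \beta)\lambda}, $$

which is exactly the expression in the hint.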

  • Linear Regression and LOOCV [20 points]

In class, you learned about using cross-validation to estimate a learning algorithm’s true error. A solution that provides the best estimate of this true error is Leave-One-Out Cross Validation (LOOCV), but it can take a really long time to compute the LOOCV error. In this problem, you will derive an algorithm for efficiently computing the LOOCV error for linear regression using the Hat Matrix. (Unfortunately, such an efficient trick may not be easily found for other learning methods.)

Assume that there are r given training examples, $(X_1, Y_1), (X_2, Y_2), \ldots, (X_r, Y_r)$, where each input data point $X_i$ has n real-valued features. Regression aims to learn to predict Y from X. The linear regression model assumes that the output Y is a linear combination of the input features plus Gaussian noise, with weights given by β.

We can write this in matrix form by stacking the data points as the rows of a matrix X, so that $x_{ij}$ is the j-th feature of the i-th data point. Then, writing Y, β, and ε as column vectors, we can write the matrix form of the linear regression model as:

$$ Y = X\beta + \varepsilon $$

where:

$$ Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_r \end{bmatrix}, \quad X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{r1} & x_{r2} & \cdots & x_{rn} \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_n \end{bmatrix}, \quad \text{and} \quad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_r \end{bmatrix}. $$

Assume that $\varepsilon_i$ is normally distributed with variance σ². We saw in class that the maximum likelihood estimate of the model parameters β (which also happens to minimize the sum of squared prediction errors) is given by the Normal equation:

$$ \hat{\beta} = (X^T X)^{-1} X^T Y. $$

Define $\hat{Y}$ to be the vector of predictions using $\hat{\beta}$ if we were to plug in the original training set X:

$$ \hat{Y} = X\hat{\beta} = X (X^T X)^{-1} X^T Y = H Y $$

where we define $H = X (X^T X)^{-1} X^T$ (H is often called the Hat Matrix).

As mentioned above, $\hat{\beta}$ also minimizes the sum of squared errors:

$$ \mathrm{SSE} = \sum_{i=1}^{r} (Y_i - \hat{Y}_i)^2. $$

Now recall that the Leave-One-Out Cross Validation score is defined to be:

$$ \mathrm{LOOCV} = \sum_{i=1}^{r} \left( Y_i - \hat{Y}_i^{(-i)} \right)^2 $$

where $\hat{Y}^{(-i)}$ is the estimator of Y after removing the i-th observation (i.e., it minimizes $\sum_{j \neq i} (Y_j - \hat{Y}_j^{(-i)})^2$).

1. (2 points) Write $\hat{Y}_i$ in terms of H and Y.

2. (5 points) Show that $\hat{Y}^{(-i)}$ is also the estimator which minimizes SSE for Z, where

$$ Z_j = \begin{cases} \hat{Y}_i^{(-i)}, & j = i \\ Y_j, & j \neq i. \end{cases} $$

3. (3 points) Write $\hat{Y}_i^{(-i)}$ in terms of H and Z. (By definition, $\hat{Y}_i^{(-i)} = Z_i$, but give an answer that includes both H and Z.)

4. (5 points) Show that $\hat{Y}_i - \hat{Y}_i^{(-i)} = H_{ii} Y_i - H_{ii} \hat{Y}_i^{(-i)}$, where $H_{ii}$ denotes the i-th element along the diagonal of H.

5. (5 points) Show that

$$ \mathrm{LOOCV} = \sum_{i=1}^{r} \left( \frac{Y_i - \hat{Y}_i}{1 - H_{ii}} \right)^2. $$

Note: This closed form for the LOOCV of linear regression shows that its algorithmic complexity is very low, since it needs only a single fit of the model. (A quick numerical check of the identity is sketched below.)
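
The sketch below checks the part-5 identity on synthetic data (the sizes r = 20 and n = 3 are arbitrary choices for illustration), comparing the naive LOOCV obtained by refitting r times against the hat-matrix formula:

    # Sanity-check sketch: naive LOOCV (refit with one row held out, r times)
    # vs. the closed-form hat-matrix expression from part 5.
    import numpy as np

    rng = np.random.default_rng(0)
    r, n = 20, 3                                   # arbitrary sizes for illustration
    X = rng.normal(size=(r, n))
    Y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=r)

    # Single fit: hat matrix, fitted values, and the closed-form LOOCV.
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    Y_hat = H @ Y
    loocv_closed = np.sum(((Y - Y_hat) / (1.0 - np.diag(H))) ** 2)

    # Naive LOOCV: refit with the i-th example removed, predict the held-out point.
    loocv_naive = 0.0
    for i in range(r):
        keep = np.arange(r) != i
        beta_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ Y[keep])
        loocv_naive += (Y[i] - X[i] @ beta_i) ** 2

    print(loocv_closed, loocv_naive)               # the two values should agree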

  • Decision Trees [15 points]

Consider the following set of training examples for the unknown target function $(X_1, X_2) \to Y$. Each row indicates the values observed and how many times that set of values was observed. For example, $(+, T, T)$ was observed 3 times, while $(-, T, T)$ was never observed.

Y    X1    X2    Count
+    T     T     3
+    T     F     4
+    F     T     4
+    F     F     1
−    T     T     0
−    T     F     1
−    F     T     3
−    F     F     5

1. (3 points) What is the sample entropy H(Y) for this training data (with logarithms base 2)?

2. (4 points) What are the information gains $IG(X_1) = H(Y) - H(Y \mid X_1)$ and $IG(X_2) = H(Y) - H(Y \mid X_2)$ for this sample of training data? (A small computational sketch for these quantities follows this list.)

3. (8 points) Draw the decision tree based on the information gain (without post-pruning) from this sample of training data.
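
A helper sketch (not a required part of the submission) for computing these quantities from the table of counts; the tuple encoding of the rows below is just one convenient representation:

    # Sketch: sample entropy H(Y) and information gains IG(X1), IG(X2)
    # computed from (Y, X1, X2, count) rows of the table above.
    from collections import Counter
    from math import log2

    rows = [('+','T','T',3), ('+','T','F',4), ('+','F','T',4), ('+','F','F',1),
            ('-','T','T',0), ('-','T','F',1), ('-','F','T',3), ('-','F','F',5)]
    total = sum(c for *_, c in rows)

    def entropy(counts):
        s = sum(counts)
        return -sum(c/s * log2(c/s) for c in counts if c > 0)

    def cond_entropy(attr):                 # H(Y | X_attr), attr is 1 or 2
        h = 0.0
        for v in ('T', 'F'):
            y_counts = Counter()
            for y, x1, x2, c in rows:
                if (x1, x2)[attr - 1] == v:
                    y_counts[y] += c
            h += sum(y_counts.values()) / total * entropy(list(y_counts.values()))
        return h

    H_Y = entropy([sum(c for y, *_, c in rows if y == s) for s in ('+', '-')])
    print(H_Y, H_Y - cond_entropy(1), H_Y - cond_entropy(2))   # H(Y), IG(X1), IG(X2)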

  • Logistic Regression [25 points]

For this problem, you need to download the Breast Cancer dataset from the course webpage. The description of this dataset is at https://rpubs.com/kstahl/wdbc_ann. I have removed the records with missing values for you. Here, you will obtain learning curves (accuracy vs. training data size). Implement a logistic regression classifier, with the assumption that each attribute value for a particular record is independently generated. You should submit the code electronically to iCollege.

  1. (10 points) Briefly describe how you implement it by giving the pseudocode. The pseudocode must include equations for estimating the classification parameters and for classifying a new example. Remember, this should not be a printout of your code, but a high-level outline.

  2. (15 points) Plot a learning curve: the accuracy vs. the size of the training data. Generate six points on the curve, using [.01 .02 .03 .125 .625 1] fractions of your training set and testing on the full test set each time. Average your results over 5 random splits of the data into a training and a test set (always keep 2/3 of the data for training and 1/3 for testing, but randomize which points go to the training set and which to the test set). This averaging will make your results less dependent on the order of records in the file. Specify your choice of regularization parameters and keep those parameters constant for these tests. A typical choice of constants would be λ = 0 (no regularization). (One possible shape of this evaluation loop is sketched below.)
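
A sketch of one way to structure the experiment, assuming an un-regularized logistic regression trained by batch gradient ascent; the file name breast_cancer.csv and its layout (feature columns followed by a 0/1 label) are assumptions about how you store the data, not something the assignment specifies:

    # Sketch: logistic regression by gradient ascent + the learning-curve protocol
    # (six training fractions, 5 random 2/3-1/3 splits, accuracy averaged over splits).
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logreg(X, y, lr=0.1, iters=500):
        Xb = np.hstack([np.ones((len(y), 1)), X])            # prepend a bias column
        w = np.zeros(Xb.shape[1])
        for _ in range(iters):
            w += lr * Xb.T @ (y - sigmoid(Xb @ w)) / len(y)  # log-likelihood gradient
        return w

    def accuracy(w, X, y):
        Xb = np.hstack([np.ones((len(y), 1)), X])
        return np.mean((sigmoid(Xb @ w) >= 0.5) == y)

    data = np.loadtxt('breast_cancer.csv', delimiter=',')    # assumed file name/format
    X, y = data[:, :-1], data[:, -1]
    fractions = [0.01, 0.02, 0.03, 0.125, 0.625, 1.0]
    curve = np.zeros(len(fractions))
    for split in range(5):
        idx = np.random.default_rng(split).permutation(len(y))
        n_train = (2 * len(y)) // 3
        train, test = idx[:n_train], idx[n_train:]
        for k, frac in enumerate(fractions):
            sub = train[:max(1, int(frac * n_train))]
            curve[k] += accuracy(train_logreg(X[sub], y[sub]), X[test], y[test]) / 5.0
    print(list(zip(fractions, curve)))                       # learning-curve points

Feature scaling and regularization are omitted here to keep the sketch short; in practice you would likely want to standardize the features before training.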

  • AdaBoost [25 points]

For this problem, you need to download the Bupa Liver Disorder dataset that is available on the course website. The description of this dataset is at https://archive.ics.uci.edu/ml/datasets/liver+disorders. Here, you will predict whether an individual has a liver disorder (indicated by the selector feature) based on the results of a number of blood tests and levels of alcohol consumption. Implement the AdaBoost algorithm using a decision stump as the weak classifier. You should submit the code electronically to iCollege.

AdaBoost trains a sequence of classifiers. Each classifier is trained on the same set of training data $(x_i, y_i)$, $i = 1, \ldots, m$, but with the significance $D_t(i)$ of each example $\{x_i, y_i\}$ weighted differently. At each iteration, a classifier $h_t(x) \to \{-1, 1\}$ is trained to minimize the weighted classification error $\sum_{i=1}^{m} D_t(i)\, I(h_t(x_i) \neq y_i)$, where I is the indicator function (0 if the predicted and actual labels match, and 1 otherwise). The overall prediction of the AdaBoost algorithm is a linear combination of these classifiers, $H_T(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right)$.

A decision stump is a decision tree with a single node (a depth-1 decision tree). It corresponds to a single threshold in one of the features and predicts the class for examples falling above and below the threshold respectively, $h_t(x) = C_1 I(x_j \geq c) + C_2 I(x_j < c)$, where $x_j$ is the j-th component of the feature vector x. Unlike in class, where we split on information gain, for this algorithm split the data based on the weighted classification accuracy described above, and find the class assignments $C_1, C_2 \in \{-1, 1\}$, threshold c, and feature choice j that maximize this accuracy. (A stump-search sketch appears right after this paragraph, and a sketch of the corresponding boosting loop appears at the end of this problem.)
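
A sketch of one brute-force way to find the stump that maximizes weighted accuracy (the function names best_stump and stump_predict are my own, introduced only for illustration):

    # Sketch: exhaustive search for h(x) = C1*I(x_j >= c) + C2*I(x_j < c)
    # maximizing the weighted accuracy sum_i D(i)*I(pred_i == y_i),
    # with labels y in {-1, +1} and weights D summing to 1.
    import numpy as np

    def best_stump(X, y, D):
        best_acc, best_params = -np.inf, None
        for j in range(X.shape[1]):
            for c in np.unique(X[:, j]):            # candidate thresholds from the data
                above = X[:, j] >= c
                for C1 in (-1, 1):
                    for C2 in (-1, 1):
                        pred = np.where(above, C1, C2)
                        acc = np.sum(D * (pred == y))
                        if acc > best_acc:
                            best_acc, best_params = acc, (j, c, C1, C2)
        return best_params

    def stump_predict(params, X):
        j, c, C1, C2 = params
        return np.where(X[:, j] >= c, C1, C2)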

  1. (10 points) Using all of the data for training, display the selected feature component j, threshold c, and class label $C_1$ of the decision stump $h_t(x)$ used in each of the first 10 boosting iterations ($t = 1, 2, \ldots, 10$).

  2. (15 points) Use 90% of the dataset for training and 10% for testing. Average your results over 50 random splits of the data into training sets and test sets. Limit the number of boosting iterations to 100. In a single plot, show:

      • the average training error after each boosting iteration
      • the average test error after each boosting iteration
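
For completeness, a minimal sketch of a standard AdaBoost training loop built on the best_stump / stump_predict sketch above; the choice $\alpha_t = \frac{1}{2} \ln\frac{1 - \epsilon_t}{\epsilon_t}$ and the exponential weight update are the usual textbook ones, not quantities prescribed by the problem text:

    # Sketch: standard AdaBoost loop using best_stump / stump_predict from the
    # earlier sketch; labels y are in {-1, +1}.
    import numpy as np

    def adaboost(X, y, T=100):
        m = len(y)
        D = np.full(m, 1.0 / m)                      # D_1(i): uniform initial weights
        stumps, alphas = [], []
        for t in range(T):
            stump = best_stump(X, y, D)
            pred = stump_predict(stump, X)
            eps = np.clip(np.sum(D * (pred != y)), 1e-12, 1 - 1e-12)   # weighted error
            alpha = 0.5 * np.log((1 - eps) / eps)
            D = D * np.exp(-alpha * y * pred)        # up-weight misclassified examples
            D /= D.sum()
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def predict(stumps, alphas, X):
        # H_T(x) = sign(sum_t alpha_t * h_t(x))
        return np.sign(sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas)))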
