CSC 4850/6850: Machine Learning Homework 2


Instructions: There are 2 written questions plus 2 programming questions. Submit your written answers as a paper and your code to iCollege.

1. Feature Maps, Kernels, and SVM [25 points]

You are given a data set D in the figure below, with data from a single feature $X_1$ and corresponding label $Y \in \{+, -\}$. The data set contains three positive examples at $X_1 = \{-3, -2, 3\}$ and three negative examples at $X_1 = \{-1, 0, 1\}$. (Red circles and blue squares represent + and $-$, respectively.)

1.1 Finite Features and SVMs

  1. (2 points) Can this data set (in its current feature space) be perfectly separated using a linear separator? Why or why not?

  2. (2 points) Let us define the simple feature map $\phi(u) = (u, u^2)$, which transforms the data into two-dimensional space. Apply $\phi$ to the data and plot the points in the new two-dimensional feature space. (A small plotting sketch follows this list.)

  3. (2 points) Can a linear separator perfectly separate the points in the new two-dimensional feature space induced by $\phi$? Why or why not?

  4. (4 points) Construct a maximum-margin separating hyperplane. This hyperplane will be a line, which can be parameterized by its normal equation, i.e. $w_1 Y_1 + w_2 Y_2 + c = 0$ for appropriate choices of $w_1$, $w_2$, $c$. Here, $(Y_1, Y_2) = \phi(X_1)$ is the result of applying the feature map $\phi$ to the original feature $X_1$. Give the values of $w_1$, $w_2$, $c$. Also, explicitly compute the margin of your hyperplane. You do not need to solve a quadratic program to find the maximum-margin hyperplane; instead, let your geometric intuition guide you.

  5. (2 points) On the plot of the transformed points, plot the separating hyperplane and the margin, and circle the support vectors.

  6. (2 points) Draw the decision boundary corresponding to the separating hyperplane in the original one-dimensional feature space.
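For the plotting parts of this question, a minimal MATLAB sketch along these lines may help; the variable names are illustrative and the data values are the ones listed above.

% Sketch: apply the feature map phi(u) = (u, u^2) and plot the transformed data.
Xpos = [-3 -2 3];                      % positive examples (values from the figure)
Xneg = [-1  0 1];                      % negative examples
phi  = @(u) [u(:), u(:).^2];           % feature map: u -> (u, u^2)

Ppos = phi(Xpos);                      % transformed positives, one row per point
Pneg = phi(Xneg);                      % transformed negatives

figure; hold on;
plot(Ppos(:,1), Ppos(:,2), 'ro', 'MarkerFaceColor', 'r');
plot(Pneg(:,1), Pneg(:,2), 'bs', 'MarkerFaceColor', 'b');
xlabel('Y_1 = X_1'); ylabel('Y_2 = X_1^2');
legend('+', '-', 'Location', 'best');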

1.2 Infinite Feature Spaces and Kernel Magic

Let's define a new (infinitely) more complicated feature transformation $\Phi_\infty$:

$$\Phi_\infty(x) = \left\{ e^{-x^2/2},\ e^{-x^2/2}\,x,\ \frac{e^{-x^2/2}\,x^2}{\sqrt{2!}},\ \dots,\ \frac{e^{-x^2/2}\,x^i}{\sqrt{i!}},\ \dots \right\} \tag{1}$$

You can think of this feature transformation as taking some finite feature vector and producing an infinite-dimensional feature vector, rather than the simple two-dimensional feature vector used in the earlier part of this problem.

  1. (2 points) Can we directly apply this feature transformation to the data? Put another way, can we explicitly construct $\Phi_\infty(x)$? (This is nearly rhetorical and not a trick question.)

  2. (4 points) We know that we can express a linear classifier using only inner products of support vectors in the transformed feature space. It would be great if we could somehow use the feature space obtained by the feature transformation $\Phi_\infty$. However, to do this we must be able to compute the inner product of examples in this infinite vector space. Let's define the inner product between two infinite vectors $a = \{a_1, \dots, a_i, \dots\}$ and $b = \{b_1, \dots, b_i, \dots\}$ as the infinite sum given in the following equation:

$$k(a, b) = a \cdot b = \sum_{i=1}^{\infty} a_i b_i \tag{2}$$

Can we explicitly compute $k(a, b)$? What is the explicit form of $k(a, b)$? (Hint: you may want to use the Taylor series expansion of $e^x$, which is given in the following equation. A small numerical check of this inner product appears after this list.)

$$e^x = \lim_{n \to \infty} \sum_{i=0}^{n} \frac{x^i}{i!} \tag{3}$$

  3. (2 points) With such a high-dimensional feature space, should we be concerned about overfitting?

  4. (3 points) Suppose we translate the inputs, $x' = x + x_0$ for some arbitrary $x_0$, before using the infinite kernel above in an SVM. Will the predictions change? Explain why or why not.
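As a purely numerical sanity check of Equations (1) and (2) (not a substitute for the derivation asked for above), the sketch below truncates the infinite feature map to its first N components and evaluates the truncated inner product for a pair of scalars; the partial sums should stabilize quickly as N grows. All names here are illustrative.

% Sketch: truncate Phi_infinity in Eq. (1) to N components and evaluate Eq. (2).
phiInf = @(x, N) exp(-x^2/2) * (x.^(0:N-1)) ./ sqrt(factorial(0:N-1));

a = 0.7;  b = -1.2;                    % two arbitrary scalar inputs
for N = [5 10 20 40]
    kab = dot(phiInf(a, N), phiInf(b, N));   % truncated version of Eq. (2)
    fprintf('N = %2d  ->  k(a, b) is approximately %.10f\n', N, kab);
end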

2. Naive Bayes Classifier [15 points]

You are running a Naive Bayes classifier for a classification problem with one (unobserved) binary class variable Y (e.g., whether it's too hot for your dog in here) and 3 binary feature variables $X_1, X_2, X_3$. The class value is never directly seen but is approximately observed using a sensor (e.g., you see your dog panting). Let Z be the binary variable representing the sensor values. One morning (your dog is out to play) you realize the sensor value is missing in some of the examples. From the sensor specifications (that come with your dog), you learn that the probability of missing values is four times higher when Y = 1 than when Y = 0. More specifically, the exact values from the sensor specifications are:

$$P(Z = \text{missing} \mid Y = 1) = 0.08, \qquad P(Z = 1 \mid Y = 1) = 0.92$$

$$P(Z = \text{missing} \mid Y = 0) = 0.02, \qquad P(Z = 0 \mid Y = 0) = 0.98$$

  1. (5 points) Draw a Bayes net that represents this problem with a node Y that is the unobserved label, a node Z that is either a copy of Y or has the value “missing”, and the three features X1,X2,X3.

  2. (5 points) What is the probability of the unobserved class label being 1 given no other information, i.e., $P(Y = 1 \mid Z = \text{missing})$? Derive the quantity using Bayes' rule and write your final answer in terms of $\theta_{Y=1}$, our estimate of $P(Y = 1)$. (The general form of Bayes' rule for this query is written out after this list.)

  3. (5 points) We would like to learn the best choice of parameters for $P(Y)$, $P(X_1 \mid Y)$, $P(X_2 \mid Y)$, and $P(X_3 \mid Y)$. Write the log-joint probability of $X$, $Y$, and $Z$ given your Bayes net, first for a single example $(X_1 = x_1, X_2 = x_2, X_3 = x_3, Z = z, Y = y)$, then for $n$ i.i.d. examples $(X_1^i = x_1^i, X_2^i = x_2^i, X_3^i = x_3^i, Z^i = z^i, Y^i = y^i)$ for $i = 1, \dots, n$.
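For reference on item 2 above, the general form of Bayes' rule for this kind of query is given below; it is a starting point, not the worked answer, and substituting the sensor specification values and $\theta_{Y=1}$ is the exercise.

$$P(Y = 1 \mid Z = \text{missing}) = \frac{P(Z = \text{missing} \mid Y = 1)\, P(Y = 1)}{P(Z = \text{missing} \mid Y = 1)\, P(Y = 1) + P(Z = \text{missing} \mid Y = 0)\, P(Y = 0)}$$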

3. K-means Clustering [20 points]

In this problem, we will be working with the partial digits data set. The data file digitdata_partial.mat we provide contains 1,000 observations of 157 pixels (a subset of the original 784) from handwritten digits (either 1, 3, 5, or 7). The variable X is a 1,000 × 157 matrix. The 1,000-dimensional vector Y contains the true number for each image. Your programming assignment is to implement the K-means clustering algorithm on this partial digit data. Define convergence as no change in label assignment from one step to the next, or 20 iterations, whichever comes first. Since there are 157 features, this algorithm may take a couple of minutes to run.
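Below is a minimal K-means sketch in MATLAB, assuming digitdata_partial.mat provides X (1,000 × 157) and Y as described above; apart from the file and variable names taken from the handout, everything is illustrative.

% Minimal K-means sketch: random initialization; stop when the label
% assignment no longer changes or after 20 iterations (the rule above).
load('digitdata_partial.mat');         % provides X (1000 x 157) and Y
X = double(X);                         % ensure double precision
K = 3;                                 % example number of clusters
n = size(X, 1);
centers = X(randperm(n, K), :);        % initialize centers at K random points
labels  = zeros(n, 1);

for iter = 1:20
    % Assignment step: squared Euclidean distance to each center.
    D = zeros(n, K);
    for k = 1:K
        D(:, k) = sum((X - centers(k, :)).^2, 2);
    end
    [~, newLabels] = min(D, [], 2);
    if isequal(newLabels, labels), break; end   % converged: labels unchanged
    labels = newLabels;
    % Update step: each center becomes the mean of its assigned points.
    for k = 1:K
        if any(labels == k)
            centers(k, :) = mean(X(labels == k, :), 1);
        end
    end
end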

When you have implemented the algorithm please submit the following:

  1. (10 points) A plot of the sum of the within-group sum of squares versus $k$, for $k = 1, 2, 3, 4, 5$.

The goal of clustering can be thought of as minimizing the variation within groups and consequently maximizing the variation between groups. A good model has a low sum of squares within each group. We define the sum of squares in the traditional way. Let $C_k$ be the $k$-th cluster and let $\mu_k$ be the empirical mean of the observations $x_i$ in cluster $C_k$. Then the within-group sum of squares for cluster $C_k$ is defined as:

$$SS(k) = \sum_{i \in C_k} (x_i - \mu_k)^2$$

If there are $K$ clusters in total, then the "sum of within-group sum of squares" is just the sum of all $K$ of these individual $SS(k)$ terms. (A short sketch for computing both quantities requested in this problem appears after this list.)

  2. (10 points) A plot of the total mistake rate versus $k$, for $k = 1, 2, 3, 4, 5$.

Given that we know the actual label for each data point, we can analyze how well the clustering recovered it. For cluster $C_k$, let its assignment be the majority vote among its members' true labels. For example, if one cluster had 270 observations labeled one, 50 labeled three, 9 labeled five, and 0 labeled seven, then that cluster would be assigned the value one and would have 50 + 9 + 0 = 59 mistakes, for a mistake rate of 59/(270 + 59) = 17.93%. If we add up the total number of "mistakes" for each cluster and divide by the total number of observations (1,000), we get the total mistake rate (see the sketch below).
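Given the cluster assignment labels from a run like the sketch above, together with X and Y, the two quantities requested in items 1 and 2 could be computed roughly as follows; names are illustrative.

% Sum of within-group sum of squares: squared distance of every point to
% its own cluster mean, summed over all K clusters.
totalSS = 0;
for k = 1:K
    members = X(labels == k, :);
    mu_k    = mean(members, 1);
    totalSS = totalSS + sum(sum((members - mu_k).^2));
end

% Total mistake rate: assign each cluster its majority true label, count
% every point whose true label differs, and divide by the total number of
% observations (1,000 here).
mistakes = 0;
for k = 1:K
    trueLabels = Y(labels == k);
    mistakes   = mistakes + sum(trueLabels ~= mode(trueLabels));
end
mistakeRate = mistakes / numel(Y);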

4. Principal Components Analysis [40 points]

In this problem, we will be working with the full digits data set. The data file digitdata.mat we provide contains 60,000 handwritten digits between 0 and 9. Each digit is a 28 × 28 grayscale image represented as a 784-dimensional vector. The variable X is a 60,000 × 784 matrix. The 60,000-dimensional vector Y contains the true number for each image.

A very common technique for dimensionality reduction is principal components analysis, typically referred to as PCA. Here you will need to implement PCA. Because this is a relatively large data set, consider using the functions cov and eig or eigs. Please submit in your write-up a copy of all plots for this problem.
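A minimal PCA sketch using cov and eig, as suggested; it assumes digitdata.mat provides X (60,000 × 784) as described, and everything other than the suggested function names is illustrative.

% PCA sketch: center the data, form the covariance matrix, and take its
% eigenvectors, sorted by decreasing eigenvalue.
load('digitdata.mat');                 % provides X (60000 x 784) and Y
X  = double(X);
mu = mean(X, 1);
Xc = X - mu;                           % center each pixel (column)
[V, D] = eig(cov(Xc));                 % columns of V are eigenvectors
[evals, order] = sort(diag(D), 'descend');
V = V(:, order);                       % principal components, largest first

% First 9 principal components as 28 x 28 images, rescaled to 0..255.
figure; colormap(gray(256));
for i = 1:9
    v = V(:, i);
    v = 255 * (v - min(v)) / (max(v) - min(v));
    subplot(3, 3, i);
    image(reshape(v, 28, 28));
    axis off;
end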

  1. (15 points) Plot the first 9 principal components as images. You will probably need the functions image, colormap('Gray'), subplot, and reshape(v,28,28). To plot the principal components, rescale each vector so that its values range between 0 and 255 (as in the sketch above).

  2. (10 points) Plot the eigenvalues in decreasing order. From the plot, how many eigenvectors do you believe are necessary to approximately represent the images?

  3. (10 points) Using the first 1, 2, 5, 10, 21, 44, 94, 200, and 784 principal components, plot the reconstruction of the first 2 digits in the data set. Use subplot(3,3,i) to save a tree of the natural kind. Does the approximation get better with more principal components? (A reconstruction sketch follows this list.)
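Continuing from the PCA sketch above, one possible way to reconstruct a digit from its first m principal components is shown below for the first digit only; the second digit is handled the same way.

% Reconstruction sketch: project a centered digit onto the first m principal
% components, map back to pixel space, and add the mean back in.
ms = [1 2 5 10 21 44 94 200 784];
x  = Xc(1, :);                         % first digit, already centered
figure; colormap(gray(256));
for i = 1:numel(ms)
    m    = ms(i);
    Vm   = V(:, 1:m);                  % first m principal components
    xhat = (x * Vm) * Vm' + mu;        % reconstruction in pixel space
    img  = 255 * (xhat - min(xhat)) / (max(xhat) - min(xhat));
    subplot(3, 3, i);
    image(reshape(img, 28, 28));
    axis off; title(sprintf('m = %d', m));
end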

