Name: HW3 Solution
SKU: 5662
Price: 35.00 USD
Availability: InStock

Description

Gaussian Classification

Let $f(x \mid C_i) \sim \mathcal{N}(\mu_i, \sigma^2)$ for a two-class, one-dimensional classification problem with classes $C_1$ and $C_2$, $P(C_1) = P(C_2) = 1/2$, and $\mu_2 > \mu_1$.

Find the Bayes optimal decision boundary and the corresponding Bayes decision rule.

The Bayes error is the probability of misclassification,

$$P_e = P(\text{misclassified as } C_1 \mid C_2)\, P(C_2) + P(\text{misclassified as } C_2 \mid C_1)\, P(C_1).$$

Show that the Bayes error associated with this decision rule is

$$P_e = \frac{1}{\sqrt{2\pi}} \int_a^{\infty} e^{-z^2/2}\, dz, \quad \text{where } a = \frac{\mu_2 - \mu_1}{2\sigma}.$$
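The target expression can be checked numerically before attempting the proof. The sketch below evaluates the integral through the complementary error function and compares it with a simulation. The function names, the example parameters, and the midpoint threshold used in the simulation are assumptions made for illustration, not part of the assignment.

```python
import math
import random


def bayes_error(mu1: float, mu2: float, sigma: float) -> float:
    """Evaluate P_e = (1/sqrt(2*pi)) * integral from a to infinity of exp(-z^2/2) dz."""
    a = (mu2 - mu1) / (2.0 * sigma)
    # The upper tail of a standard normal at a equals 0.5 * erfc(a / sqrt(2)).
    return 0.5 * math.erfc(a / math.sqrt(2.0))


def simulated_error(mu1: float, mu2: float, sigma: float,
                    n: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the misclassification rate."""
    rng = random.Random(seed)
    threshold = 0.5 * (mu1 + mu2)  # assumption: decision boundary at the midpoint
    errors = 0
    for _ in range(n):
        from_c2 = rng.random() < 0.5              # equal priors
        x = rng.gauss(mu2 if from_c2 else mu1, sigma)
        predicted_c2 = x > threshold
        errors += predicted_c2 != from_c2
    return errors / n


if __name__ == "__main__":
    print("closed form:", bayes_error(0.0, 2.0, 1.0))
    print("simulation: ", simulated_error(0.0, 2.0, 1.0))
```

If the two printed numbers agree to a couple of decimal places, the formula and the assumed boundary are at least consistent with each other; the written derivation is still required.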

Isocontours of Normal Distributions

Let $f(\mu, \Sigma)$ be the probability density function of a normally distributed random variable in $\mathbb{R}^2$. Write code to plot the isocontours of the following functions, each on its own separate figure. You’re free to use any plotting libraries available in your programming language; for instance, in Python you can use Matplotlib.

1. $f(\mu, \Sigma)$, where $\mu = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$ and $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix}$.

2. $f(\mu, \Sigma)$, where $\mu = \begin{bmatrix} -1 \\ 2 \end{bmatrix}$ and $\Sigma = \begin{bmatrix} 2 & 1 \\ 1 & 4 \end{bmatrix}$.

3. $f(\mu_1, \Sigma_1) - f(\mu_2, \Sigma_2)$, where $\mu_1 = \begin{bmatrix} 0 \\ 2 \end{bmatrix}$, $\mu_2 = \begin{bmatrix} 2 \\ 0 \end{bmatrix}$ and $\Sigma_1 = \Sigma_2 = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix}$.

4. $f(\mu_1, \Sigma_1) - f(\mu_2, \Sigma_2)$, where $\mu_1 = \begin{bmatrix} 0 \\ 2 \end{bmatrix}$, $\mu_2 = \begin{bmatrix} 2 \\ 0 \end{bmatrix}$, $\Sigma_1 = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix}$ and $\Sigma_2 = \begin{bmatrix} 2 & 1 \\ 1 & 4 \end{bmatrix}$.

5. $f(\mu_1, \Sigma_1) - f(\mu_2, \Sigma_2)$, where $\mu_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$, $\mu_2 = \begin{bmatrix} -1 \\ -1 \end{bmatrix}$, $\Sigma_1 = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}$ and $\Sigma_2 = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$.
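One possible way to organize the plotting code is to evaluate the density on a grid with NumPy and hand the result to Matplotlib's `contour`. The sketch below does that. The function names, the grid range, and the number of contour levels are illustrative assumptions, and the density is computed directly from the bivariate normal formula rather than taken from a library.

```python
import numpy as np


def gaussian_pdf_grid(mu, sigma, xs, ys):
    """Evaluate the N(mu, sigma) density on the grid spanned by xs and ys."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    gx, gy = np.meshgrid(xs, ys)
    diff = np.stack([gx - mu[0], gy - mu[1]], axis=-1)   # shape (len(ys), len(xs), 2)
    # Solve sigma @ s = diff for every grid point instead of inverting sigma.
    solved = np.linalg.solve(sigma, diff.reshape(-1, 2).T).T.reshape(diff.shape)
    quad = np.sum(diff * solved, axis=-1)                # (x - mu)^T sigma^{-1} (x - mu)
    norm = 2.0 * np.pi * np.sqrt(np.linalg.det(sigma))   # sqrt((2*pi)^2 * |sigma|)
    return gx, gy, np.exp(-0.5 * quad) / norm


def plot_isocontours(gx, gy, z, title):
    """Draw one contour figure; Matplotlib is imported here so the math runs without it."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.contour(gx, gy, z, levels=10)   # assumption: 10 levels is enough detail
    ax.set_title(title)
    ax.set_xlabel("x1")
    ax.set_ylabel("x2")
    ax.set_aspect("equal")
    return fig
```

For the difference-of-densities items, call `gaussian_pdf_grid` twice on the same grid (for example `np.linspace(-6, 6, 301)` for both axes) and subtract the two returned arrays before passing the result to `plot_isocontours`.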

Eigenvectors of the Gaussian Covariance Matrix

Consider two one-dimensional random variables $X_1 \sim \mathcal{N}(3, 9)$ and $X_2 \sim \frac{1}{2} X_1 + \mathcal{N}(4, 4)$, where $\mathcal{N}(\mu, \sigma^2)$ is a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Write a program that draws $n = 100$ random two-dimensional sample points from $(X_1, X_2)$ such that the $i$th value sampled from $X_2$ is calculated based on the $i$th value sampled from $X_1$. In your code, make sure to choose and set a fixed random number seed for whatever random number generator you use, so your simulation is reproducible, and document your choice of random number seed and random number generator in your write-up. For each of the following parts, include the corresponding output of your program.

(a) Compute the mean (in $\mathbb{R}^2$) of the sample.

(b) Compute the $2 \times 2$ covariance matrix of the sample.

(c) Compute the eigenvectors and eigenvalues of this covariance matrix.

(d) On a two-dimensional grid with a horizontal axis for $X_1$ with range $[-15, 15]$ and a vertical axis for $X_2$ with range $[-15, 15]$, plot

(i) all $n = 100$ data points, and

(ii) arrows representing both covariance eigenvectors. The eigenvector arrows should originate at the mean and have magnitudes equal to their corresponding eigenvalues.

(e) Let $U = [v_1 \; v_2]$ be a $2 \times 2$ matrix whose columns are the eigenvectors of the covariance matrix, where $v_1$ is the eigenvector with the larger eigenvalue. We use $U^\top$ as a rotation matrix to rotate each sample point from the $(X_1, X_2)$ coordinate system to a coordinate system aligned with the eigenvectors. (As $U^\top = U^{-1}$, the matrix $U$ reverses this rotation, moving back from the eigenvector coordinate system to the original coordinate system.) Center your sample points by subtracting the mean $\mu$ from each point; then rotate each point by $U^\top$, giving $x_{\text{rotated}} = U^\top (x - \mu)$. Plot these rotated points on a new two-dimensional grid, again with both axes having range $[-15, 15]$.

In your plots, clearly label the axes and include a title. Moreover, make sure the horizontal and vertical axes have the same scale! The aspect ratio should be one.
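The numerical parts of this problem could be sketched as follows, leaving the plotting aside. The seed value `189`, the function names, and the use of `np.cov` (which divides by n − 1 by default) are assumptions chosen for illustration; whichever generator and normalization you use should be stated in the write-up.

```python
import numpy as np

SEED = 189  # hypothetical seed; any fixed, documented value works


def draw_sample(n=100, seed=SEED):
    """Draw n points from (X1, X2) with X2 built from the matching X1 draw."""
    rng = np.random.default_rng(seed)                        # NumPy's default generator (PCG64)
    x1 = rng.normal(loc=3.0, scale=3.0, size=n)              # N(3, 9): standard deviation 3
    x2 = 0.5 * x1 + rng.normal(loc=4.0, scale=2.0, size=n)   # N(4, 4): standard deviation 2
    return np.column_stack([x1, x2])


def sample_statistics(points):
    """Return mean, covariance, and eigenpairs sorted by decreasing eigenvalue."""
    mean = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)        # assumption: unbiased estimate (n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return mean, cov, eigvals[order], eigvecs[:, order]


def rotate(points, mean, eigvecs):
    """Apply x_rotated = U^T (x - mean) to every row of points."""
    return (points - mean) @ eigvecs


if __name__ == "__main__":
    pts = draw_sample()
    mean, cov, eigvals, eigvecs = sample_statistics(pts)
    print("mean:", mean)
    print("covariance:\n", cov)
    print("eigenvalues:", eigvals)
    print("eigenvectors (columns):\n", eigvecs)
```

A quick sanity check on the rotation is that the covariance of the rotated points should be diagonal, with the eigenvalues on the diagonal.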

Classification and Risk

Suppose we have a classification problem with classes labeled $1, \ldots, c$ and an additional “doubt” category labeled $c + 1$. Let $r : \mathbb{R}^d \to \{1, \ldots, c + 1\}$ be a decision rule. Define the loss function

$$L(r(x) = i, y = j) = \begin{cases} 0 & \text{if } i = j, \quad i, j \in \{1, \ldots, c\}, \\ \lambda_r & \text{if } i = c + 1, \\ \lambda_s & \text{otherwise}, \end{cases}$$

where $\lambda_r \ge 0$ is the loss incurred for choosing doubt and $\lambda_s \ge 0$ is the loss incurred for making a misclassification. Hence the risk of classifying a new data point $x$ as class $i \in \{1, 2, \ldots, c + 1\}$ is

$$R(r(x) = i \mid x) = \sum_{j=1}^{c} L(r(x) = i, y = j)\, P(Y = j \mid x).$$

(a) Show that the following policy obtains the minimum risk when $\lambda_r \le \lambda_s$.

(i) Choose class $i$ if $P(Y = i \mid x) \ge P(Y = j \mid x)$ for all $j$ and $P(Y = i \mid x) \ge 1 - \lambda_r / \lambda_s$;

(ii) Choose doubt otherwise.

(b) What happens if $\lambda_r = 0$? What happens if $\lambda_r > \lambda_s$? Explain why this is consistent with what one would expect intuitively.
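The policy above can be tried out on concrete numbers. The sketch below compares the risk of the most probable class, $\lambda_s (1 - P(Y = i \mid x))$, with the risk of doubt, $\lambda_r$, which follows from the risk definition when the posteriors sum to one. The function name `decide` and the example posteriors are made up for illustration.

```python
from typing import Sequence


def decide(posteriors: Sequence[float], lam_r: float, lam_s: float) -> int:
    """Return a class label in 1..c, or c + 1 for doubt.

    posteriors[i - 1] is P(Y = i | x); the values are assumed to sum to 1.
    """
    c = len(posteriors)
    best = max(range(c), key=lambda i: posteriors[i])
    risk_best_class = lam_s * (1.0 - posteriors[best])  # every wrong label costs lam_s
    risk_doubt = lam_r                                   # doubt costs lam_r regardless of y
    # Equivalent to the threshold test P(Y = i | x) >= 1 - lam_r / lam_s when lam_s > 0.
    return best + 1 if risk_best_class <= risk_doubt else c + 1


if __name__ == "__main__":
    print(decide([0.9, 0.1], lam_r=0.2, lam_s=1.0))          # confident: picks class 1
    print(decide([0.6, 0.4], lam_r=0.2, lam_s=1.0))          # below threshold 0.8: doubt (3)
    print(decide([0.4, 0.35, 0.25], lam_r=2.0, lam_s=1.0))   # doubt costs more than any error
```

Varying `lam_r` between 0 and values above `lam_s` in this sketch is a convenient way to build intuition for the two limiting cases the question asks about.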

Maximum Likelihood Estimation

Let $X_1, \ldots, X_n \in \mathbb{R}^d$ be $n$ sample points drawn independently from a multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$.

(a) Suppose the normal distribution has an unknown diagonal covariance matrix

$$\Sigma = \begin{bmatrix} \sigma_1^2 & & & \\ & \sigma_2^2 & & \\ & & \ddots & \\ & & & \sigma_d^2 \end{bmatrix}$$

and an unknown mean $\mu$. Derive the maximum likelihood estimates, denoted $\hat{\mu}$ and $\hat{\sigma}_i$, for $\mu$ and $\sigma_i$. Show all your work.

(b) Suppose the normal distribution has a known covariance matrix $\Sigma$ and an unknown mean $A\mu$, where $\Sigma$ and $A$ are known $d \times d$ matrices, $\Sigma$ is positive definite, and $A$ is invertible. Derive the maximum likelihood estimate, denoted $\hat{\mu}$, for $\mu$.
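As a possible starting point for part (a), the log-likelihood can be written out explicitly for the diagonal case. The notation $X_{k,i}$ for the $i$th coordinate of sample point $X_k$ is introduced here for convenience and does not appear in the problem statement.

```latex
\ell(\mu, \sigma_1, \ldots, \sigma_d)
  = \sum_{k=1}^{n} \ln f(X_k)
  = -\frac{nd}{2} \ln(2\pi)
    - n \sum_{i=1}^{d} \ln \sigma_i
    - \sum_{k=1}^{n} \sum_{i=1}^{d} \frac{(X_{k,i} - \mu_i)^2}{2 \sigma_i^2}
```

The second term uses $\ln |\Sigma| = \sum_{i=1}^{d} 2 \ln \sigma_i$ for a diagonal $\Sigma$. Because the sum separates over coordinates, setting the partial derivatives with respect to each $\mu_i$ and $\sigma_i$ to zero can be done one coordinate at a time.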

Covariance Matrices and Decompositions

As described in lecture, the covariance matrix $\operatorname{Var}(R) \in \mathbb{R}^{d \times d}$ for a random variable $R \in \mathbb{R}^d$ with mean $\mu$ is

$$\operatorname{Var}(R) = \operatorname{Cov}(R, R) = E[(R - \mu)(R - \mu)^\top] = \begin{bmatrix} \operatorname{Var}(R_1) & \operatorname{Cov}(R_1, R_2) & \cdots & \operatorname{Cov}(R_1, R_d) \\ \operatorname{Cov}(R_2, R_1) & \operatorname{Var}(R_2) & \cdots & \operatorname{Cov}(R_2, R_d) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{Cov}(R_d, R_1) & \operatorname{Cov}(R_d, R_2) & \cdots & \operatorname{Var}(R_d) \end{bmatrix},$$

where $\operatorname{Cov}(R_i, R_j) = E[(R_i - \mu_i)(R_j - \mu_j)]$ and $\operatorname{Var}(R_i) = \operatorname{Cov}(R_i, R_i)$.

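If the random variable $R$ is sampled from the multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$ with the PDF

$$f(x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}}\, e^{-\left((x - \mu)^\top \Sigma^{-1} (x - \mu)\right)/2},$$

then $\operatorname{Var}(R) = \Sigma$.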

Given $n$ points $X_1, X_2, \ldots, X_n$ sampled from $\mathcal{N}(\mu, \Sigma)$, we can estimate $\Sigma$ with the maximum likelihood estimator

$$\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)(X_i - \mu)^\top,$$

which is also known as the covariance matrix of the sample.

(a) The estimate $\hat{\Sigma}$ makes sense as an approximation of $\Sigma$ only if $\hat{\Sigma}$ is invertible. Under what circumstances is $\hat{\Sigma}$ not invertible? Make sure your answer is complete; i.e., it includes all cases in which the covariance matrix of the sample is singular. Express your answer in terms of the geometric arrangement of the sample points $X_i$.

(b) Suggest a way to fix a singular covariance matrix estimator $\hat{\Sigma}$ by replacing it with a similar but invertible matrix. Your suggestion may be a kludge, but it should not change the covariance matrix too much. Note that infinitesimal numbers do not exist; if your solution uses a very small number, explain how to calculate a number that is sufficiently small for your purposes.

(c) Consider the normal distribution $\mathcal{N}(0, \Sigma)$ with mean $\mu = 0$. Consider all vectors of length 1; i.e., any vector $x$ for which $\|x\| = 1$. Which vector(s) $x$ of length 1 maximizes the PDF $f(x)$? Which vector(s) $x$ of length 1 minimizes $f(x)$? Your answers should depend on the properties of $\Sigma$. Explain your answer.
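To experiment with parts (a) and (b), it can help to build a singular sample covariance on purpose and then repair it. The sketch below centers on the sample mean (an assumption, since the true mean is usually unknown) and uses one common kludge, adding a small multiple of the identity scaled by the average variance. The function names and the default `scale` are hypothetical choices, not a prescribed answer.

```python
import numpy as np


def sample_covariance(points):
    """Covariance of the sample, dividing by n and centering on the sample mean."""
    centered = points - points.mean(axis=0)
    return centered.T @ centered / len(points)


def regularize(cov, scale=1e-6):
    """Return cov + eps * I, with eps chosen relative to the average variance."""
    d = cov.shape[0]
    # Tie eps to the data's own scale so the perturbation stays relatively small;
    # fall back to machine epsilon if every variance is zero.
    eps = scale * max(np.trace(cov) / d, np.finfo(float).eps)
    return cov + eps * np.eye(d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(3, 5))   # only 3 points in R^5
    cov = sample_covariance(pts)
    print("rank before:", np.linalg.matrix_rank(cov))
    print("rank after: ", np.linalg.matrix_rank(regularize(cov)))
```

With fewer points than dimensions the estimate cannot have full rank, which makes this a convenient test case; other degenerate arrangements of the points are worth trying as well.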

Gaussian Classifiers for Digits and Spam

In this problem, you will build classifiers based on Gaussian discriminant analysis. Unlike Homework 1, you are NOT allowed to use any libraries for out-of-the-box classification (e.g. sklearn). You may use anything in numpy and scipy.

The training and test data can be found with this homework. Don’t use the training/test data from Homework 1, as they have changed for this homework. Submit your predicted class labels for the test data on the Kaggle competition website and be sure to include your Kaggle display name and scores in your writeup. Also be sure to include an appendix of your code at the end of your writeup.

Taking pixel values as features (no new features yet, please), fit a Gaussian distribution to each digit class using maximum likelihood estimation. This involves computing a mean and a covariance matrix for each digit class, as discussed in lecture.

Hint: You may, and probably should, contrast-normalize the images before using their pixel values. One way to normalize is to divide the pixel values of an image by the $\ell_2$-norm of its pixel values.
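The normalization hint and the per-class fit can be sketched with NumPy alone. The function names and the convention that each image is one flattened row of a float array are assumptions; the covariance here divides by the class size, matching the maximum likelihood form.

```python
import numpy as np


def l2_normalize(images):
    """Divide each row (one flattened image) by its l2-norm."""
    images = np.asarray(images, dtype=float)
    norms = np.linalg.norm(images, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # leave all-zero images unchanged
    return images / norms


def fit_class_gaussians(features, labels):
    """Return {class label: (mean vector, covariance matrix)} for each class."""
    params = {}
    for c in np.unique(labels):
        rows = features[labels == c]
        mean = rows.mean(axis=0)
        centered = rows - mean
        params[int(c)] = (mean, centered.T @ centered / len(rows))
    return params


if __name__ == "__main__":
    x = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
    y = np.array([0, 0, 1, 1])
    for label, (mean, cov) in fit_class_gaussians(l2_normalize(x), y).items():
        print(label, mean, cov.shape)
```

For real image data the covariance matrices are large (one row and column per pixel), so it is worth checking their shape and rank before using them in a classifier.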

(Written answer.) Visualize the covariance matrix for a particular class (digit). How do the diagonal terms compare with the off-diagonal terms? What do you conclude from this?

Classify the digits in the test set on the basis of posterior probabilities with two different approaches.

1. Linear discriminant analysis (LDA). Model the class conditional probabilities as Gaussians $\mathcal{N}(\mu_C, \Sigma)$ with different means $\mu_C$ (for class $C$) and the same covariance matrix $\Sigma$, which you compute by averaging the 10 covariance matrices from the 10 classes.

To implement LDA, you will sometimes need to compute a matrix-vector product of the form $\Sigma^{-1} x$ for some vector $x$. You should not try to compute the inverse of $\Sigma$ (nor the determinant of $\Sigma$). Instead, you should find a way to solve the implied linear system without computing the inverse.

Hold out 10,000 randomly chosen training points for a validation set. Classify each image in the validation set into one of the 10 classes (with a 0-1 loss function). Compute the error rate and plot it over the following numbers of randomly chosen training points: 100, 200, 500, 1,000, 2,000, 5,000, 10,000, 30,000, 50,000. (Expect some variation in your error rate when few training points are used.)

2. Quadratic discriminant analysis (QDA). Model the class conditional probabilities as Gaussians $\mathcal{N}(\mu_C, \Sigma_C)$, where $\Sigma_C$ is the estimated covariance matrix for class $C$. (If any of these covariance matrices turn out singular, implement the trick you described in Q7(b). You are welcome to use k-fold cross validation to choose the right constant(s) for that trick.) Repeat the same tests and error rate calculations you did for LDA.

3. (Written answer.) Which of LDA and QDA performed better? Why?
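One way to follow the advice about avoiding explicit inverses is to pass the covariance matrix to a linear solver. The sketch below scores classes with the usual Gaussian discriminant functions; the function names, argument layout, and the use of `np.linalg.slogdet` for the QDA log-determinant are assumptions for illustration. Both functions expect nonsingular covariance matrices, so any regularization has to happen beforehand.

```python
import numpy as np


def lda_predict(points, means, shared_cov, priors):
    """points: (n, d); means: (k, d); shared_cov: (d, d); priors: (k,).

    Returns the row index into `means` of the highest-scoring class per point.
    """
    # Solve shared_cov @ W = means^T once; column c of W is Sigma^{-1} mu_c.
    weights = np.linalg.solve(shared_cov, means.T)                  # (d, k)
    offsets = -0.5 * np.sum(means.T * weights, axis=0) + np.log(priors)
    scores = points @ weights + offsets                             # (n, k)
    return np.argmax(scores, axis=1)


def qda_predict(points, means, covs, priors):
    """Same layout as lda_predict, but covs holds one (d, d) matrix per class."""
    scores = []
    for mean, cov, prior in zip(means, covs, priors):
        diff = points - mean
        solved = np.linalg.solve(cov, diff.T).T                     # Sigma_C^{-1} (x - mu_C)
        quad = np.sum(diff * solved, axis=1)
        _, logdet = np.linalg.slogdet(cov)                          # log |Sigma_C| without overflow
        scores.append(-0.5 * quad - 0.5 * logdet + np.log(prior))
    return np.argmax(np.column_stack(scores), axis=1)
```

Solving one system with all class means as right-hand sides keeps the LDA cost low, since the same factorization of the shared covariance is reused for every class.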

Using the mnist data.mat, train your best classifier for the training data and classify the images in the test data. Submit your labels to the online Kaggle competition. Record your optimum prediction rate in your submission. You are welcome to compute extra features for the Kaggle competition. If you do so, please describe your implementation in your assignment. Please use extra features only for the Kaggle portion of the assignment.

In your submission, include plots of error rate versus number of training examples for both LDA and QDA. Similarly, include a plot of validation error versus the number of training points for each digit. Plot all the 10 curves on the same graph as shown in Figure 1. Which digit is easiest to classify? Include written answers where indicated.

Figure 1: Sample graph with 10 plots

Next, apply LDA or QDA (your choice) to spam. Submit your test results to the online Kaggle competition. Record your optimum prediction rate in your submission. If you use additional features (or omit features), please describe them.

Optional: If you use the defaults, expect relatively low classification rates. The TAs suggest using a bag-of-words model. You may use third-party packages to implement that if you wish. Also, normalizing your vectors might help.
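A bag-of-words featurizer can be written with the standard library alone. The sketch below is one minimal version; the tokenization regex, the vocabulary size, the function names, and the example messages are all illustrative assumptions rather than the TAs' recommended setup.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z']+")  # assumption: lowercase alphabetic tokens only


def build_vocabulary(documents, max_words=1000):
    """Return the max_words most frequent tokens across all documents."""
    counts = Counter()
    for text in documents:
        counts.update(TOKEN.findall(text.lower()))
    return [word for word, _ in counts.most_common(max_words)]


def bag_of_words(text, vocabulary, normalize=True):
    """Count vocabulary words in text, optionally scaled to unit l2-norm."""
    counts = Counter(TOKEN.findall(text.lower()))
    vector = [float(counts[word]) for word in vocabulary]
    if normalize:
        norm = sum(v * v for v in vector) ** 0.5
        if norm > 0:
            vector = [v / norm for v in vector]
    return vector


if __name__ == "__main__":
    docs = ["Free money now", "Meeting at noon", "free free offer"]
    vocab = build_vocabulary(docs, max_words=5)
    print(vocab)
    print(bag_of_words("free free money", vocab))
```

The resulting vectors can be stacked into a feature matrix and passed to the same LDA or QDA code used for the digits.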
