CSE -Assignment 3: Non‐Parametric Inference Solution


1. MSE in terms of bias (Total 5 points)

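For some estimator $\hat{\theta}$ of a parameter $\theta$, show that $\mathrm{MSE}(\hat{\theta}) = \mathrm{bias}^2(\hat{\theta}) + \mathrm{Var}(\hat{\theta})$. Show your steps clearly.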
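
The identity can be sanity-checked numerically before it is proved. The sketch below is only an illustration, not a substitute for the derivation: it assumes a hypothetical Bernoulli parameter `theta = 0.3`, uses the sample mean as the estimator, and computes MSE, squared bias, and variance from simulated estimates. The function name `mse_decomposition` is made up for this example.

```python
import random


def mse_decomposition(estimates, theta):
    """Return (mse, bias_sq, var) computed from a list of simulated estimates."""
    m = len(estimates)
    mean_est = sum(estimates) / m
    mse = sum((t - theta) ** 2 for t in estimates) / m
    bias_sq = (mean_est - theta) ** 2
    var = sum((t - mean_est) ** 2 for t in estimates) / m
    return mse, bias_sq, var


random.seed(0)
theta = 0.3  # hypothetical true parameter, chosen only for illustration
n = 20       # samples per simulated data set

# Each estimate is the sample mean of n Bernoulli(theta) draws.
estimates = [
    sum(random.random() < theta for _ in range(n)) / n
    for _ in range(5000)
]

mse, bias_sq, var = mse_decomposition(estimates, theta)
print(mse, bias_sq + var)  # the two numbers agree up to rounding error
```

The agreement is exact (up to floating point) because the same algebra that proves the identity for expectations also holds for averages over the simulated estimates.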

2. Practice with empirical CDF (eCDF) (Total 5 points)

Using the first 10 samples from the collisions.csv file on the class website, carefully draw the eCDF by hand. Make sure the x‐ and y‐axis clearly indicate the sample points and their corresponding eCDF. Your plot must have y‐limits from 0 to 1, and x‐limits from smallest sample to the largest sample.

3. Programming fun with $\hat{F}$ (Total 15 points)

For this question, we require some programming; you should only use Python. You may use the scripts provided on the class website as templates. Do not use any libraries or functions to bypass the programming effort. Please submit your code as usual in your zip/tar file repo on BB. Provide sufficient documentation so the code can be evaluated. Also attach each plot as a separate sheet (or image) to your submission upload. All plots must be neat, legible (large fonts), with appropriate legends, axis labels, titles, etc.

(a) Write a program to plot $\hat{F}$ (empirical CDF or eCDF) given a list of samples as input. Your plot must have y‐limits from 0 to 1, and x‐limits from 0 to the largest sample. Show the input points as crosses on the x‐axis. (2 points)
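
A minimal sketch of the computation behind such a plot, assuming the usual definition of the eCDF as the fraction of samples less than or equal to x. The helper names `ecdf` and `ecdf_at` are hypothetical, and plotting is left out so the block stays free of third-party libraries.

```python
def ecdf(samples):
    """Return (xs, ys): distinct sorted sample values and the eCDF at each one."""
    n = len(samples)
    xs, ys = [], []
    count = 0
    for value in sorted(samples):
        count += 1
        if xs and xs[-1] == value:
            # Repeated value: the jump at this point grows.
            ys[-1] = count / n
        else:
            xs.append(value)
            ys.append(count / n)
    return xs, ys


def ecdf_at(samples, x):
    """Fraction of samples that are <= x."""
    return sum(1 for s in samples if s <= x) / len(samples)


xs, ys = ecdf([3, 1, 2, 2])
print(xs, ys)
```

The pairs `(xs, ys)` describe a right-continuous step function, so a step plot that holds each value until the next sample point (for example matplotlib's `step` with `where="post"`) is one way to draw it.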

(b) Use an integer random number generator with range [1, 99] to draw n=10, 100, and 1000 samples. Feed these as input to (a) to draw three plots. What do you observe? (3 points)

(c) Modify (a) above so that it takes as input a collection of lists of samples; that is, a 2‐D array of sorts where each row is a list of samples (as in (a)). The program should now compute the average $\hat{F}$ across the rows and plot it. That is, for a given x point, first compute $\hat{F}(x)$ for each row (student), then average them all out across rows, and plot the average $\hat{F}(x)$ for x. Repeat for all input points, x. Show all input points as crosses on the x‐axis. (2 points)
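
The row-averaging step can be sketched as follows. This is one possible reading, assuming every row is evaluated at every distinct input point pooled across all rows; the names `ecdf_at` and `average_ecdf` are hypothetical.

```python
def ecdf_at(samples, x):
    """Fraction of samples that are <= x."""
    return sum(1 for s in samples if s <= x) / len(samples)


def average_ecdf(rows):
    """Average the per-row eCDFs at every distinct input point.

    rows: list of lists, one list of samples per row.
    Returns (points, averages), with points sorted in increasing order.
    """
    points = sorted({value for row in rows for value in row})
    averages = [
        sum(ecdf_at(row, x) for row in rows) / len(rows)
        for x in points
    ]
    return points, averages


points, averages = average_ecdf([[1, 2], [2, 3]])
print(points, averages)
```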

(d) Use the same integer random number generator from (b) to draw n=10 samples for m=10, 100, 1000 rows. Feed these as input to (c) to draw three plots. What do you observe? (3 points)

(e) Modify the program from (a) to now also add 95% Normal‐based CI lines for $\hat{F}$, given a list of samples as input. Draw a plot showing $\hat{F}$ and the CI lines for the a3_q3.dat data file (799 samples) on the class website. Use x‐limits of 0 to 2, and y‐limits of 0 to 1. (2 points)

(f) Modify the program from (e) to also add 95% DKW‐based CI lines for $\hat{F}$. Draw a single plot showing $\hat{F}$ and both sets of CI lines (Normal and DKW) for the a3_q3.dat data. Which CI is tighter? (3 points)
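
The two interval types can be sketched at a single point as below. The sketch assumes the standard forms: the Normal-based interval uses the estimated standard error sqrt(F̂(1 − F̂)/n) with z = 1.96, and the DKW band uses ε = sqrt(ln(2/α)/(2n)). Clipping both to [0, 1] is a presentation choice made here, and the function names `normal_ci` and `dkw_ci` are hypothetical.

```python
import math


def normal_ci(f_hat, n, z=1.96):
    """Pointwise Normal-based interval for the eCDF value f_hat from n samples."""
    half_width = z * math.sqrt(f_hat * (1.0 - f_hat) / n)
    return max(f_hat - half_width, 0.0), min(f_hat + half_width, 1.0)


def dkw_ci(f_hat, n, alpha=0.05):
    """DKW band around the eCDF value f_hat from n samples."""
    eps = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
    return max(f_hat - eps, 0.0), min(f_hat + eps, 1.0)


# Example at a point where the eCDF equals 0.5, with n = 799 samples.
lo_n, hi_n = normal_ci(0.5, 799)
lo_d, hi_d = dkw_ci(0.5, 799)
print((lo_n, hi_n), (lo_d, hi_d))
```

Note that the DKW half-width does not depend on F̂(x), while the Normal-based half-width shrinks toward zero as F̂(x) approaches 0 or 1.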

4. Plug‐in estimates (Total 10 points)

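(a) Show that the plug‐in estimator of the variance of X is $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$, where $\bar{X}_n$ is the sample mean, $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. (2 points)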

(b) Show that the bias of $\hat{\sigma}^2$ is $-\sigma^2/n$, where $\sigma^2$ is the true variance. (3 points)

(c) The kurtosis for a RV X with mean $\mu$ and variance $\sigma^2$ is defined as $\mathrm{Kurt}[X] = E[(X - \mu)^4] / \sigma^4$. Derive the plug‐in estimate of the kurtosis in terms of the sample data. (3 points)

(d) The plug‐in estimator idea also extends to two RVs. Consider the correlation $\rho = E[(X - \mu_X)(Y - \mu_Y)] / (\sigma_X \sigma_Y)$, where $\sigma_X$ is the standard deviation for RV X. Assuming n i.i.d. observations for X and Y that appear in pairs as {(X1, Y1), (X2, Y2), …, (Xn, Yn)}, derive the plug‐in estimator for ρ. (Hint: What is the ePMF for the event X=X1 AND Y=Y1? What about for the event X=X1 AND Y=Y2?) (2 points)
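
Once derived, plug-in estimates are straightforward to compute. The sketch below assumes the standard definitions (variance as the mean squared deviation with divisor n, kurtosis as the fourth central moment over the squared variance, and correlation as covariance over the product of standard deviations), each with the population expectation replaced by a sample average. All function names are hypothetical.

```python
import math


def plugin_mean(xs):
    return sum(xs) / len(xs)


def plugin_variance(xs):
    """Mean squared deviation from the sample mean (divisor n, not n - 1)."""
    m = plugin_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def plugin_kurtosis(xs):
    """Fourth central sample moment divided by the squared plug-in variance."""
    m = plugin_mean(xs)
    fourth = sum((x - m) ** 4 for x in xs) / len(xs)
    return fourth / plugin_variance(xs) ** 2


def plugin_correlation(xs, ys):
    """Sample covariance (divisor n) over the product of plug-in std deviations."""
    mx, my = plugin_mean(xs), plugin_mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / math.sqrt(plugin_variance(xs) * plugin_variance(ys))


print(plugin_variance([1, 2, 3, 4]))
print(plugin_kurtosis([1, 2, 3, 4]))
print(plugin_correlation([1, 2, 3], [2, 4, 6]))
```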

5. Consistency of eCDF (Total 10 points)

Let D={X1, X2, …, Xn} be a set of i.i.d. samples with true CDF F. Let $\hat{F}$ be the eCDF for D, as defined in class.

(a) Derive E($\hat{F}$) in terms of F. Start by writing the expression for $\hat{F}$ at some α. (3 points)

(b) Show that bias($\hat{F}$) = 0. (2 points)

(c) Derive se($\hat{F}$) in terms of F and n. (3 points)

(d) Show that $\hat{F}$ is a consistent estimator. (2 points)
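
A small simulation can serve as a check on these derivations. The sketch assumes Uniform(0, 1) data, for which F(α) = α, and compares the simulated mean and standard deviation of the eCDF at α with α itself and with sqrt(F(α)(1 − F(α))/n). The values α = 0.3, n = 50, and 4000 trials are arbitrary choices, and the function names are hypothetical.

```python
import math
import random


def ecdf_at(samples, alpha):
    """Fraction of samples that are <= alpha."""
    return sum(1 for s in samples if s <= alpha) / len(samples)


def simulate_ecdf(alpha, n, trials, rng):
    """Simulated mean and standard deviation of the eCDF at alpha for Uniform(0, 1) data."""
    values = [
        ecdf_at([rng.random() for _ in range(n)], alpha)
        for _ in range(trials)
    ]
    mean = sum(values) / trials
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / trials)
    return mean, sd


rng = random.Random(1)
alpha, n = 0.3, 50
mean, sd = simulate_ecdf(alpha, n, 4000, rng)
theory_se = math.sqrt(alpha * (1 - alpha) / n)
print(mean, sd, theory_se)
```

Rerunning with larger n shows the simulated standard deviation shrinking roughly like 1/sqrt(n), which is the behavior the consistency argument relies on.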

6. Properties of estimators (Total 10 points)

(a) Find the bias, se, and MSE in terms of θ for $\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} X_i$, where Xi are i.i.d. ~ Bernoulli(θ). Hint: Follow the same steps as in class, assuming the true distribution is unknown. Only at the end use the fact that the unknown distribution is Bernoulli(θ) to get the final answers in terms of θ. (5 points)

(b) Derive the Normal‐based (1‐α) CI for θ. Explain why Normal‐based CIs are applicable here. (5 points)
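
For a concrete picture of such an interval, the sketch below assumes the estimator is the sample mean of 0/1 observations and uses the usual estimated standard error sqrt(θ̂(1 − θ̂)/n). The quantile comes from Python's standard-library `statistics.NormalDist`; the function name `bernoulli_normal_ci` and the example data are made up.

```python
import math
from statistics import NormalDist


def bernoulli_normal_ci(xs, alpha=0.05):
    """Normal-based (1 - alpha) interval for the Bernoulli parameter.

    xs: list of 0/1 observations.
    """
    n = len(xs)
    theta_hat = sum(xs) / n
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 quantile
    se_hat = math.sqrt(theta_hat * (1.0 - theta_hat) / n)
    return theta_hat - z * se_hat, theta_hat + z * se_hat


# Hypothetical data: 30 successes out of 100 trials.
lo, hi = bernoulli_normal_ci([1] * 30 + [0] * 70)
print(lo, hi)
```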

7. Kernel density estimation (Total 15 points)

This question asks you to implement a kernel density estimator (KDE) from scratch and evaluate it on a sample dataset, a3_q7.csv. Do not use built‐in KDE functions. You may, however, use built‐in pdf functions to evaluate a pdf at a point. The formal definition of the KDE, which estimates the pdf, is:

$$\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \qquad (1)$$

where K(.) is the kernel function, which should be a smooth, symmetric, and valid density function. The parameter h > 0 is the smoothing bandwidth, which controls the amount of smoothing.

(a) For the a3_q7.csv dataset, the true distribution is Normal(0.5, 0.01) (the mean μ is 0.5 and the variance is 0.01). The task here is to implement a KDE function using the Normal distribution as the kernel, normal_kde(x,h,D) in Python, where x is the point at which the pdf is to be estimated, h is the bandwidth, and D is the list of data points. Implement the function in normal_kde.py by first computing $K\left(\frac{x - x_i}{h}\right)$ for all data points xi in the given dataset, where K(u) is the pdf of the standard Normal at point u = $\frac{x - x_i}{h}$, and then summing up all K() values and dividing by nh, where n is the number of data points, as in Equation (1) above. Submit your code. (3 points)
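
The steps just described (evaluate the standard Normal pdf at each scaled distance, sum, divide by nh) can be sketched as below. This is a minimal illustration rather than a reference solution; the helper `standard_normal_pdf` is a hypothetical name, while `normal_kde(x, h, D)` follows the signature given in the question.

```python
import math


def standard_normal_pdf(u):
    """pdf of the standard Normal distribution at u."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def normal_kde(x, h, D):
    """Kernel density estimate at x with bandwidth h over the data points D."""
    n = len(D)
    total = sum(standard_normal_pdf((x - xi) / h) for xi in D)
    return total / (n * h)


# With a single data point at 0 and h = 1, the estimate is the standard Normal pdf.
print(normal_kde(0.0, 1.0, [0.0]))
```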

(b) Obtain the p.d.f. for x = {0, 0.01, 0.02, …, 1} and compute the sample mean and sample variance (use the result of Q4(a) as needed) for h = 0.0001, 0.0005, 0.001, 0.005, 0.05. Report the deviation (as a percentage difference with respect to the true mean or variance) of the estimates from the original distribution (Normal(0.5, 0.01)) in each of the 5 cases. Show on a single plot the pdf of the original Normal and the KDE estimates of the pdf for all 5 bandwidths. Include this plot in your submission. Which of the h values performs best? (6 points)
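
The question leaves open how the mean and variance are obtained from pdf values on a grid. One possible interpretation, sketched below as an assumption, is to normalize the pdf values into weights over the grid points and take the weighted mean and weighted variance; the percentage deviation is then measured against the true value. The names `grid_moments`, `percent_deviation`, and `normal_pdf` are hypothetical, and the true Normal(0.5, 0.01) pdf stands in for a KDE output.

```python
import math


def grid_moments(xs, pdf_values):
    """Weighted mean and variance over grid points xs, with weights from pdf_values."""
    total = sum(pdf_values)
    if total == 0:
        # Can happen when a very small bandwidth puts no mass on any grid point.
        raise ValueError("pdf is zero at every grid point")
    weights = [p / total for p in pdf_values]
    mean = sum(x * w for x, w in zip(xs, weights))
    var = sum((x - mean) ** 2 * w for x, w in zip(xs, weights))
    return mean, var


def percent_deviation(estimate, true_value):
    """Absolute difference as a percentage of the true value."""
    return 100.0 * abs(estimate - true_value) / abs(true_value)


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


xs = [i / 100 for i in range(101)]                  # 0, 0.01, ..., 1
true_pdf = [normal_pdf(x, 0.5, 0.1) for x in xs]    # variance 0.01 means sigma 0.1
mean, var = grid_moments(xs, true_pdf)
print(mean, var, percent_deviation(mean, 0.5), percent_deviation(var, 0.01))
```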

(c) Repeat (a) and (b) above when using the uniform kernel, uniform_kde(x,h,D) (implement as uniform_kde.py), with the function K(u) described as K(u) = ½ for −1 ≤ u ≤ 1 (and K(u) = 0 otherwise), where u = $\frac{x - x_i}{h}$, and the triangular kernel, triangular_kde(x,h,D) (implement as triangular_kde.py), described as K(u) = 1‐|u| for |u| ≤ 1 (and K(u) = 0 otherwise), where u = $\frac{x - x_i}{h}$. Repeat all parts of (b) for these two kernels for all 5 bandwidth values and report the percentage deviation from the original mean and variance, plot the KDE estimates, and report the best bandwidth for each kernel choice. (6 points)
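
The two kernels can share one generic estimator, as in the sketch below. It assumes K(u) = ½ on [−1, 1] for the uniform kernel and K(u) = 1 − |u| on [−1, 1] for the triangular kernel, both zero elsewhere, with u = (x − xi)/h. The helper `kde` and the two kernel function names are hypothetical; `uniform_kde` and `triangular_kde` follow the signatures in the question.

```python
def uniform_kernel(u):
    """K(u) = 1/2 for |u| <= 1, and 0 otherwise."""
    return 0.5 if abs(u) <= 1 else 0.0


def triangular_kernel(u):
    """K(u) = 1 - |u| for |u| <= 1, and 0 otherwise."""
    return 1.0 - abs(u) if abs(u) <= 1 else 0.0


def kde(x, h, D, kernel):
    """Generic kernel density estimate at x with bandwidth h over data D."""
    return sum(kernel((x - xi) / h) for xi in D) / (len(D) * h)


def uniform_kde(x, h, D):
    return kde(x, h, D, uniform_kernel)


def triangular_kde(x, h, D):
    return kde(x, h, D, triangular_kernel)


print(uniform_kde(0.0, 1.0, [0.0, 2.0]))   # only the point at 0 contributes
print(triangular_kde(0.5, 1.0, [0.0]))
```

Because both kernels have bounded support, the estimate is exactly zero at any x farther than h from every data point, which matters for the smallest bandwidths.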
