Description
Problem 1 [30%]
This problem examines the use and assumptions of LDA and QDA. We will be using the dataset Default from ISLR.
- Split the data into a training set (70%) and a test set (30%). Then compare the classification error of LDA, QDA, and logistic regression when predicting default as a function of features of your choice. Which method appears to work best?
- Report the confusion table for each classification method. Make sure to label which dimension is the predicted class and which one is the true class. What do you observe?
- Are the LDA assumptions satisfied when predicting default as a function of balance only (i.e., default ~ balance)? You can use qqnorm and qqline to examine whether the conditional class distributions are normally distributed. Also examine the standard deviations of the class distributions. Are the QDA assumptions satisfied?
- Would you ever want to use LDA in place of QDA even when you suspect that some of the LDA assumptions are violated (e.g., different conditional standard deviations)?
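The bookkeeping for the first two parts (the 70/30 split, the confusion table, and the classification error) can be sketched in plain Python. This is only a minimal sketch on made-up labels, not the Default data. The helper names `train_test_split_indices`, `confusion_table`, and `error_rate` are hypothetical, and fitting the actual models is left out.

```python
import random
from collections import Counter

def train_test_split_indices(n, train_fraction=0.7, seed=0):
    """Shuffle the indices 0..n-1 and split them into train and test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * n))
    return indices[:cut], indices[cut:]

def confusion_table(y_true, y_pred):
    """Count (predicted, true) pairs; the key order fixes which dimension is which."""
    return Counter(zip(y_pred, y_true))

def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs from the true class."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# Hypothetical labels, for illustration only.
y_true = ["No", "No", "No", "Yes", "Yes", "No", "No", "Yes", "No", "No"]
y_pred = ["No", "No", "No", "No", "Yes", "No", "Yes", "No", "No", "No"]

train_idx, test_idx = train_test_split_indices(len(y_true))
table = confusion_table(y_true, y_pred)
for predicted in ("No", "Yes"):
    for true in ("No", "Yes"):
        print(f"predicted={predicted:3s} true={true:3s} count={table[(predicted, true)]}")
print("error rate:", error_rate(y_true, y_pred))
```

Keying the table by (predicted, true) makes the labelling of the two dimensions explicit, which is what the second part asks for.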
Hint: Check out the tidyverse for a collection of packages that can help with data manipulation, and see the RStudio cheatsheets for a convenient and concise reference to the methods. This is entirely optional!
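For readers working in Python rather than R, the per-class summary needed to compare the conditional standard deviations can be computed with the standard library alone. The sketch below uses made-up balances, not values from the Default data, and the helper name `class_summaries` is hypothetical. LDA assumes a common variance across classes, so clearly different per-class standard deviations are the thing to look for.

```python
from collections import defaultdict
from statistics import mean, stdev

def class_summaries(values, labels):
    """Group values by class label and return {label: (mean, sample std dev)}."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return {label: (mean(vals), stdev(vals)) for label, vals in groups.items()}

# Hypothetical balances, not taken from the Default data.
balance = [200.0, 400.0, 600.0, 1500.0, 1900.0, 2300.0]
default = ["No", "No", "No", "Yes", "Yes", "Yes"]

summaries = class_summaries(balance, default)
for label, (m, s) in summaries.items():
    print(f"default={label}: mean={m:.1f}, sd={s:.1f}")
```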
Problem 2 [30%]
Using the MNIST dataset, fit classification models in order to predict the digit 1 (vs all others).
- Compare the classification error for each one of these methods:
  - Logistic regression
  - K-NN with 2 reasonable choices of k
  - LDA
- Explore at least one transformation of the features (predictors), such as considering their combinations, and run the methods from part 1 on the data.
- Which one of the methods works the best?
Make sure to split the data into a training set and a test set. There is no need to run on the entire dataset; a subsample of, say, 10,000 data points is OK.
Hint: There is a file in the GitLab repository, assignments/mnist_simple.Rmd, which you can use as a starting point. If you are using Python, please check out this package. If you have trouble getting started, please do not hesitate to ask the instructor, the TAs, or Piazza for help.
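To make the "digit 1 vs all others" setup concrete, here is a minimal K-NN sketch in plain Python. It is not the starter file and does not load MNIST: the two-feature "images", the labels, and the helper name `knn_predict` are all made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Predict the majority label among the k training points nearest to query."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-feature "images" and digit labels, for illustration only.
pixels = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9), (0.2, 0.1)]
digits = [1, 1, 7, 3, 1]

# One-vs-rest target: True for the digit 1, False for every other digit.
is_one = [d == 1 for d in digits]

near_ones = knn_predict(pixels, is_one, (0.05, 0.05), k=3)
near_others = knn_predict(pixels, is_one, (0.95, 0.95), k=1)
print(near_ones, near_others)
```

With a binary target, an odd k avoids tied votes, which is one thing to keep in mind when picking the two values of k.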
Problem O2 [optional]
This problem can be substituted for Problem 2 above, for up to 5 points of extra credit. The better score from Problems 2 and O2 will be considered.
Solve Exercises 1.11 and 1.13 in [Bishop, C. M. (2006). Pattern Recognition and Machine Learning].
Problem 3 [20%]
Logistic regression uses the logistic function to predict class probabilities:
$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
This is equivalent to assuming a linear model for the prediction of the log-odds:
$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$$
Using algebraic manipulation, prove that these two expressions are identical. See Section 4.3 in ISLR and equations (4.2) & (4.3) for more context.
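Before writing the proof, it can help to check numerically that the two expressions agree. The sketch below is only a sanity check, not a substitute for the algebra. The coefficient values are arbitrary, and the function names `logistic` and `log_odds` are chosen here for illustration.

```python
import math

def logistic(beta0, beta1, x):
    """p(X) = e^(beta0 + beta1*X) / (1 + e^(beta0 + beta1*X))."""
    z = math.exp(beta0 + beta1 * x)
    return z / (1.0 + z)

def log_odds(p):
    """log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Arbitrary coefficients, chosen only for illustration.
beta0, beta1 = -1.5, 0.8
for x in (-2.0, 0.0, 3.0):
    p = logistic(beta0, beta1, x)
    # The last two printed values should agree up to floating-point error.
    print(x, p, log_odds(p), beta0 + beta1 * x)
```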
Problem 4 [20%]
This problem examines the differences between LDA and QDA.
- For an arbitrary training set, would you expect LDA or QDA to work better on the training set?
- If the Bayes decision boundary between the two classes is linear, would you expect LDA or QDA to work better on the training set? What about the test set?
- As the sample size increases, do you expect the prediction accuracy of QDA relative to LDA to increase or decrease?
- True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is more flexible and can model a linear decision boundary. Justify your answer.
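One concrete way to think about the flexibility difference is to count covariance parameters. With p predictors, a covariance matrix has p(p+1)/2 free entries. LDA estimates one such matrix shared by all classes, while QDA estimates one per class. The sketch below is only this bookkeeping; the function names are hypothetical.

```python
def covariance_parameters(p):
    """Free entries in a symmetric p-by-p covariance matrix."""
    return p * (p + 1) // 2

def lda_covariance_parameters(p, k):
    """LDA shares a single covariance matrix across all k classes."""
    return covariance_parameters(p)

def qda_covariance_parameters(p, k):
    """QDA estimates a separate covariance matrix for each of the k classes."""
    return k * covariance_parameters(p)

# Compare the two counts for a few illustrative problem sizes (two classes).
for p in (1, 10, 50):
    print(p, lda_covariance_parameters(p, 2), qda_covariance_parameters(p, 2))
```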