Description
Consider the email spam classification problem of Murphy Problem 8.1. Suppose you intend to use a linear perceptron classifier on that data (not logistic regression, as directed in Problem 8.1). In the parts below, unless stated otherwise, assume the dataset of N = 4601 samples is split into NTr = 3000 samples for training and NTest = 1601 samples for testing. Also, for the tolerance δ in the VC generalization bound, use δ = 0.1 (for a certainty of 0.9). The parts below have short answers.
Hint: You may use the fact that if H is the set of linear perceptron classifiers in D dimensions (D features), then dVC(H) = D + 1. (This will be proved in Problem 2.)
a) What is the VC dimension of the hypothesis set?
b) Express the upper bound on the out-of-sample error as

   Eout(hg) ≤ Ein(hg) + εvc.

   For Ein(hg) measured on the training data, use dVC from part (a) to get a value for εvc.
c) To get a lower εvc, suppose you reduce the number of features to D = 10, and also increase the training set size to NTr = 10,000. Now what is εvc?
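The εvc in parts (b) and (c) can be checked numerically from the AML-style VC bound εvc = sqrt((8/N) ln(4 mH(2N)/δ)) with the polynomial growth-function bound mH(N) ≤ N^dVC + 1. A minimal sketch; note that D = 57 features for the spam data (so dVC = 58) is an assumption taken from the UCI spambase dataset, not stated in the problem:

```python
from math import log, sqrt

def eps_vc(n, dvc, delta=0.1):
    """VC generalization bound with the polynomial growth-function
    bound m_H(N) <= N**dvc + 1, i.e.
    eps = sqrt((8/n) * ln(4*((2n)**dvc + 1)/delta))."""
    return sqrt((8.0 / n) * log(4.0 * ((2.0 * n) ** dvc + 1.0) / delta))

# Part (b): assuming D = 57 spambase features -> dVC = 58; NTr = 3000.
print(eps_vc(3000, 58))    # well above 1, i.e. a vacuous bound

# Part (c): reduced features D = 10 -> dVC = 11; NTr = 10,000.
print(eps_vc(10_000, 11))  # roughly 0.3
```

The D = 57 case illustrates why part (c) asks about shrinking the feature set: with dVC = 58 and only 3000 training samples, the bound exceeds 1 and says nothing.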
d) Suppose that you had control over the number of training samples NTr (by collecting more email data). How many training samples would ensure a generalization error of εvc = 0.1, again with probability 0.9 (the same tolerance δ = 0.1), and using the reduced feature set (10 features)?
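Part (d) requires solving the implicit inequality N ≥ (8/ε²) ln(4((2N)^dVC + 1)/δ), since N appears on both sides. A simple fixed-point iteration converges quickly; this is a sketch that carries over the polynomial growth-function bound from the hint:

```python
from math import log

def n_for_eps(eps, dvc, delta=0.1, n0=1000.0, iters=50):
    """Fixed-point iteration for the implicit sample-complexity bound
    N >= (8/eps**2) * ln(4*((2N)**dvc + 1)/delta).
    Starting guess n0 and iteration count are arbitrary choices."""
    n = n0
    for _ in range(iters):
        n = (8.0 / eps ** 2) * log(4.0 * ((2.0 * n) ** dvc + 1.0) / delta)
    return n

# Part (d): eps = 0.1, reduced feature set D = 10 -> dVC = 11, delta = 0.1.
print(round(n_for_eps(0.1, 11)))  # on the order of 10^5 samples
```

The iteration converges because the right-hand side grows only logarithmically in N.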
e) Instead, suppose you use the test set to measure Ein(hg), so let's call it Etest(hg). What is the hypothesis set now? What is its cardinality?
f) Continuing from part (e), use the bound:

   Eout(hg) ≤ Etest(hg) + ε.

   Use the original feature set and the original test set, so that NTest = 1601. Give an appropriate expression for ε and calculate it numerically.
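Since the test set evaluates only the single final hypothesis hg, the finite-hypothesis-set Hoeffding bound ε = sqrt((1/(2N)) ln(2M/δ)) applies with M = 1. A quick sketch of the computation:

```python
from math import log, sqrt

def eps_hoeffding(n, m=1, delta=0.1):
    """Hoeffding bound for a finite hypothesis set of size m:
    Eout <= Etest + sqrt((1/(2n)) * ln(2m/delta))."""
    return sqrt(log(2.0 * m / delta) / (2.0 * n))

# Part (f): M = 1 (only h_g is evaluated on the test set), NTest = 1601.
print(eps_hoeffding(1601))  # roughly 0.03
```

Note how much tighter this is than the VC bound of part (b), even with fewer samples, because no growth function is involved.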
AML Exercise 2.4 (page 52). In addition to the hints given in the book, you can solve the problem by following the steps outlined below.