Description
-
In class we derived the generalization-error bound for a C-class problem with C > 2, from the training-set error, based on the growth function ℋ(2 ). In this problem, you will derive the generalization-error bound for a C-class problem from the test-set error and from a validation-set error with finite M.
Throughout this problem:
let ‘ denote the (in-sample) unnormalized confusion matrix based on dataset , so that entry ) ‘ *#$= [number of data points labelled = that were misclassified as
ℎ= ];
also, let ) out*#$ = [(ℎ = ) ( = )] be the th entry of the out-of-sample confusion matrix out .
-
For a given single hypothesis h for the C-class problem (so ℎ ∈ {1,2, ⋯ , }) tested using dataset that has N data points, give an expression for the total number of points that were misclassified mis , in terms of the entries ) ‘ *#$ . Also give an expression for the error rate on , (ℎ) , in terms of the entries
)’ *#$.
For the out-of-sample confusion matrix, give an expression for the total probability of error (ℎ ≠ ) in terms of the entries of out . Hint: are the events for ) out* #$ and ) out* -. mutually exclusive?
Use these results to give expressions for= [incorrect classification] and = percent misclassified by h on .
Apply Hoeffding Inequality to µ and n.
Write the resulting expression in terms of and out . Reformulate to give an expression in the following form:
[ out(ℎ) ≤ (ℎ) + ( )] ≥ 1 − d .
in which you fill in for ( ). Hint: this is similar to what we did in Lecture 7 for the C = 2 case.
Is this a generalization-error bound for test-set error, for a C > 2 class problem?
Comment: As you may have observed in the Midterm Assignment Pr. 1, the generalization-error bound based on a test set can be much tighter than the bound based on a training set and its VC dimension.
Hint: does the same technique applying a union bound that we did for the 2-class problem (Lecture 7) apply?
-
This problem concerns the generalization error bound in a transfer learning problem, as given in Lecture 13 (v2.1), Eq. (6).
In this problem you will study the effects of varying 2, 3, and a on the cross-domain generalization error bound.
Throughout this problem, let ε45 be everything in the cross-domain generalization-error bound (RHS of Lecture 13 (v2.1) Eq. (6)), except omitting 2,3∗. Note that 2,3∗ is a constant of the parameters we will be varying.
Also throughout this problem, use the values 89 = 10, δ = 0.1, ℋ:ℋ = 0.1 . However, leave them as variables until you are ready to plot, or until you are asked for a number.
-
-
Give the simplified number (to two decimal digits) for ε45, for the following cases:
-
-
-
-
3 = 1, 2 = 100, = 0.1, 0.5, 0.9
-
3 = 10, 2 = 1000, = 0.1, 0.5, 0.9
-
3 = 100, 2 = 10000, = 0.1, 0.5, 0.9
-
-
-
-
-
3 = 1000, 2 = 100000, = 0.1, 0.5, 0.9
-
-
-
Tip: put these in a table for easy viewing.
(v) Do any of these sets of numbers assure some degree of generalization (i.e.,
ε45 < 0.5
, assuming
2,3
≈ 0)
?
If so, which?
∗
Comment: As in the supervised learning case, these bounds can be very loose,
but evidence indicates the functional dependence of
ε45
on its variables still
generally apply.
2 = 1000
3 =
(b)
For this
part,
let
and plot
ε45
vs. a
for
10, 100, 1000, 10000
(4 curves on one plot), over
0≤ ≤1
. Answer: what
, 3
approximate value of a is optimal for each value of
?
Try to explain the
dependence
of
of a.
on a for
different values of
3
and
any
difference in
optimal values
ε45
(c)
For this part, let 3 = 100 and plot
ε45 vs. a for 2 = 10, 100, 1000, 10000
-
4 curves on one plot), over 0 ≤ ≤ 1. Answer: what approximate value of a is optimal for each value of 3 ? Try to explain the dependence of ε45 on a for different values of 2 , and any difference in optimal values of a.
Common default values for a are α = 0.5 and α = β.
-
-
-
In terms of minimizing the cross-domain generalization-error bound, which default choice looks better (based on your answers to (b) and (c) above)? Is that choice reasonably consistent with your results of (b) and (c)?
-
Give algebraic expressions for ε45(α = 0.5) and ε45(α = β). Compare them algebraically: can you draw any conclusions about which is lower?
-
-
-
-
-
Plot ;<( = 0.5) vs. N for = 0.01, 0.1, 0.5, for 1000 ≤≤ 100000
-
-
(3 curves on 1 plot). Repeat for ;<( = ). What conclusions can you draw from the plots?
-
(a) Is it possible to have a covariate shift while satisfying all of:
2( | ) = 3( | ), 2( ) = 3( ), 2( | ) = 3( | ) ?
If no, prove your answer; if yes, justify your answer.
-
Is it possible to have a covariate shift while satisfying:
2( | ) = 3( | ) ?
If no, prove your answer; if yes, justify your answer.
-
Is it possible to have a concept shift while satisfying:
2( ) = 3( ) and 2( ) = 3( ) ?
If no, prove your answer; if yes, justify your answer.
p. 3 of 3