1 Feature Selection [60 pts]
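Given a dataset $S = \{(Y^i, X^i)\}_{i=1}^n$ of $n$ instances, with features $X = (X_1, \dots, X_d) \in \mathbb{R}^d$ and labels $Y \in \mathcal{Y} = \{1, \dots, K\}$, consider the following feature selection procedure.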
For each value of the label $Y = k$
– Estimate the density $p(Y = k)$
For each feature $X_i$, $i = 1, \dots, d$
– Estimate its density $p(X_i)$
– For each value of the label $Y = k$, estimate the density $p(X_i \mid Y = k)$
– Score feature $X_i$ using
$$I(X_i, Y) = \sum_{x_i \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x_i, y) \log_2 \frac{p(x_i, y)}{p(x_i)\, p(y)} \tag{1}$$
where $\mathcal{X}$ and $\mathcal{Y}$ denote the support sets of $X_i$ and $Y$.
Choose the features $X_i$ with high scores $I(X_i, Y)$.
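The procedure above can be sketched in Python. This is a minimal illustration, not a reference solution, and it rests on two assumptions that the handout does not state: every feature is discrete (real-valued features would first need binning or a density estimator), and all probabilities are plug-in estimates from counts. The joint $p(x_i, y)$ is estimated directly, which for count-based estimates equals $p(x_i \mid y)\, p(y)$. The score computed is the mutual information between $X_i$ and $Y$. The names `mutual_information`, `select_features`, and the parameter `m` (number of features to keep) are hypothetical.

```python
import numpy as np


def mutual_information(x, y):
    """Plug-in estimate of I(X_i, Y) in bits for two discrete 1-D arrays."""
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    x_vals, x_idx = np.unique(x, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)

    # Joint counts, normalized to the joint probabilities p(x_i, y).
    joint = np.zeros((len(x_vals), len(y_vals)))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()

    # Marginals p(x_i) (column vector) and p(y) (row vector).
    p_x = joint.sum(axis=1, keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)

    # Sum only over cells with p(x_i, y) > 0, treating 0 * log 0 as 0.
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (p_x @ p_y)[nz])))


def select_features(X, y, m):
    """Score every column of X against y and return the m best column indices."""
    scores = np.array([mutual_information(X[:, i], y) for i in range(X.shape[1])])
    top = np.argsort(scores)[::-1][:m]  # indices sorted by descending score
    return top, scores


# Toy data (made up for illustration): feature 0 copies the label,
# feature 1 is independent of it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
top, scores = select_features(X, y, m=1)
```

On this toy data, feature 0 scores 1 bit and feature 1 scores 0 bits, so feature 0 is selected.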
Insight: Informativeness of a feature
We are uncertain about label Y before seeing any input.
– Suppose we quantify this uncertainty using the entropy $H(Y)$, defined as
$$H(Y) = -\sum_{y \in \mathcal{Y}} p(y) \log_2 p(y) \tag{2}$$
where $\mathcal{Y}$ denotes the support set of $Y$.
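As a quick numerical check of the entropy definition, the sketch below estimates $H(Y)$ from a list of observed labels. It assumes $p(y)$ is estimated by relative frequency, which the handout does not specify, and the function name `entropy` is hypothetical.

```python
import numpy as np


def entropy(y):
    """Plug-in estimate of H(Y) in bits from a 1-D array of observed labels."""
    _, counts = np.unique(np.asarray(y).ravel(), return_counts=True)
    p = counts / counts.sum()  # relative frequencies, all strictly positive
    return float(-np.sum(p * np.log2(p)))


h_balanced = entropy([0, 0, 1, 1])   # two equally likely labels
h_constant = entropy([1, 1, 1, 1])   # a single label value
h_four_way = entropy([0, 1, 2, 3])   # four equally likely labels
```

Two equally likely labels give 1 bit, a constant label gives 0 bits, and four equally likely labels give 2 bits: the more evenly the labels are spread, the larger the uncertainty before seeing any input.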