Description
Problem 1 (30pts). Fundamental Knowledge
- Clearly state the difference between supervised learning and unsupervised learning.
- Explain the usage of the training set, validation set, and test set in a learning task. Also explain why we need a validation set.
- Suppose we have a dataset of people in which we record their heights h_i as well as the lengths of their left arms l_i and right arms r_i. Suppose h_i ∼ N(10, 2) (in unspecified units), l_i ∼ N(ρ_i h_i, 0.02), and r_i ∼ N(ρ_i h_i, σ^2), with ρ_i ∼ Unif(0.6, 0.7). Is using both arms necessarily a better choice than using only one arm to approximate h_i? What if σ^2 = 0.02? Explain the intuition.
- Let X ∈ R^{n×d} be a full column rank matrix. Explain why X^T X is positive definite using the SVD. (Hint: the singular matrices are orthonormal.)
- Consider the problem
      min_θ ∥Xθ − y∥_2^2 + λ∥θ∥_2^2.
  Suppose X is full column rank; write down its optimal solution θ∗.
Problem 2 (30pts). Least Squares without Full Column Rank
Consider the problem
    min_θ ∥Xθ − y∥_2^2,
where X ∈ R^{n×d}, θ ∈ R^d, y ∈ R^n.
(1) Given
    X = [1 0],  y = 1.
Draw the figure of the objective function using Python.
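For part (1), a minimal plotting sketch is given below. It assumes the reading X = [1 0] (a 1×2 matrix, so n = 1 < d = 2) and y = 1; with that choice the objective ∥Xθ − y∥_2^2 = (θ_1 − 1)^2 is a surface over (θ_1, θ_2), flat along the θ_2 direction.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt

# Assumed reading of the problem data: X = [1 0] (n = 1 < d = 2), y = 1.
X = np.array([[1.0, 0.0]])
y = np.array([1.0])

# Grid over (theta_1, theta_2) and the objective ||X theta - y||_2^2.
t1, t2 = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
obj = (X[0, 0] * t1 + X[0, 1] * t2 - y[0]) ** 2  # = (theta_1 - 1)^2 here

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(t1, t2, obj, cmap="viridis")
ax.set_xlabel(r"$\theta_1$")
ax.set_ylabel(r"$\theta_2$")
ax.set_zlabel("objective")
plt.savefig("p2_objective.png")
```

The flat valley along θ_2 previews part (2): the minimizer is not unique.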
(2) The thin SVD of X is given by
    X = V [Σ_1  0] [U_1  U_2]^T = V Σ_1 U_1^T.
Show that when n < d, the optimal solutions are non-unique. Derive the expression of the optimal solutions using the thin SVD. (Hint: Let A := V Σ_1 and z := U_1^T θ. Solve min_z ∥Az − y∥_2^2 first, then solve U_1^T θ = z.)
Problem 3 (50pts). A Robust LP Formulation
Suppose we have the generative linear regression model
    y = Xθ⋆ + ϵ,
where ϵ is the error term and ϵ ∼ N(0, Σ). The maximum likelihood estimator for θ is:
    θ̂_LS = argmin_{θ ∈ R^d} ∥Xθ − y∥_2^2 = (X^T X)^{−1} X^T y.
(a) Suppose the error term ϵ = [ε_1, ε_2, · · · , ε_n]^T follows the Laplace distribution, i.e. ε_i ∼ L(0, b) i.i.d., i = 1, 2, · · · , n, and the probability density function is P(ε_i) = (1/(2b)) e^{−|ε_i|/b} for some b > 0. Under the MLE principle, what is the learning problem? Please write out the derivation process. (15 points)
Figure 1: PDF of Laplace distribution
(b) Huber smoothing. L1-norm minimization
    θ̂_L1 = argmin_θ ∥Xθ − y∥_1
is one possible formulation for robust regression. However, it is nondifferentiable. We utilize a smoothing technique to approximately solve the L1-norm minimization; the Huber function is one possibility. Its definition and sketch are shown below:
    h_µ(z) = { |z|,               |z| ≥ µ
             { z^2/(2µ) + µ/2,    |z| ≤ µ
Then, for z ∈ R^n,
    H_µ(z) = Σ_{j=1}^n h_µ(z_j).
By Huber smoothing, the L1-norm minimization can be approximated by
    min_θ H_µ(Xθ − y).
Let
    f(θ) = H_µ(Xθ − y);
find the gradient ∇f(θ). (10 points)
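The symbolic derivation is the exercise itself, but a finite-difference comparison is a useful way to validate an answer. The sketch below assumes the piecewise Huber definition above (quadratic for |z| ≤ µ, |z| for |z| ≥ µ); `huber` and `grad_f` are illustrative names, not library functions.

```python
import numpy as np

def huber(z, mu):
    # h_mu(z): z^2/(2 mu) + mu/2 where |z| <= mu, and |z| where |z| >= mu
    return np.where(np.abs(z) <= mu, z**2 / (2 * mu) + mu / 2, np.abs(z))

def grad_f(theta, X, y, mu):
    # Chain rule: grad f(theta) = X^T psi(X theta - y), where
    # psi(z) = z/mu on |z| <= mu and sign(z) on |z| >= mu,
    # i.e. psi(z) = clip(z/mu, -1, 1) componentwise.
    z = X @ theta - y
    return X.T @ np.clip(z / mu, -1.0, 1.0)

# Compare against central finite differences on random data.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
y = rng.standard_normal(5)
theta = rng.standard_normal(3)
mu = 0.1
f = lambda t: huber(X @ t - y, mu).sum()
g = grad_f(theta, X, y, mu)
eps = 1e-6
g_fd = np.array([(f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
                 for e in np.eye(3)])
print(np.max(np.abs(g - g_fd)))  # agreement up to finite-difference error
```

Because h_µ is continuously differentiable (the two branches match in value and slope at |z| = µ), the two gradients should agree closely.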
Figure 2: Huber smoothing
Gradient descent for minimizing f(θ). The gradient descent algorithm is shown in the following table.
    Input: observed data X, y, initialization θ_0, Huber smoothing parameter µ,
           total iteration number T, learning rate α.
    for k = 0, 1, · · · , T − 1, do
        θ_{k+1} = θ_k − α∇f(θ_k)
    end for
    return θ_T
The data set is generated by the linear model
    y = Xθ⋆ + ϵ_1 + ϵ_2,
where ϵ_1 ∈ R^n follows a Gaussian distribution and ϵ_2 contains outliers. Given the observed data (x, y) = {(x_1, y_1), (x_2, y_2), · · · , (x_n, y_n)} and the true value θ⋆,
(1) calculate the estimation θ̂_LS by using linear least squares and compute ∥θ̂_LS − θ⋆∥_2. (5 points)
(2) suppose n = 1000, d = 50; use Python to implement the gradient descent algorithm to minimize f(θ). The parameters are set as µ = 10^{−5}, α = 0.001, T = 1000. Plot the error ∥θ_k − θ⋆∥_2 as a function of the iteration number. You can download the data {y, X, θ⋆} from Blackboard. (20 points)
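The Blackboard data is not reproduced here, so the sketch below substitutes a synthetic stand-in of the same shape (n = 1000, d = 50, Gaussian noise plus a few large outliers) and uses the stated hyperparameters. The data-generation choices and all variable names are illustrative, not the official dataset.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt

# Synthetic stand-in for the Blackboard data {y, X, theta_star}.
rng = np.random.default_rng(42)
n, d = 1000, 50
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
eps1 = 0.1 * rng.standard_normal(n)        # Gaussian noise eps_1
eps2 = np.zeros(n)                         # sparse outliers eps_2
out = rng.choice(n, size=20, replace=False)
eps2[out] = 10.0 * rng.standard_normal(20)
y = X @ theta_star + eps1 + eps2

mu, alpha, T = 1e-5, 0.001, 1000           # stated hyperparameters

def grad_f(theta):
    # gradient of H_mu(X theta - y): X^T clip((X theta - y)/mu, -1, 1)
    return X.T @ np.clip((X @ theta - y) / mu, -1.0, 1.0)

theta = np.zeros(d)                        # theta_0 = 0
errors = []
for k in range(T):
    theta = theta - alpha * grad_f(theta)
    errors.append(np.linalg.norm(theta - theta_star))

plt.plot(errors)
plt.xlabel("iteration k")
plt.ylabel(r"$\|\theta_k - \theta^\star\|_2$")
plt.savefig("huber_gd_error.png")
print(errors[0], errors[-1])               # the error should shrink
```

With µ this small, the Huber objective is essentially the L1 loss, so the iterates should be far less sensitive to the outliers in ϵ_2 than the least-squares estimate from part (1).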