Description
- (15 points) Backpropagation for autoencoders. In an autoencoder, we seek to reconstruct the original data after some operation that reduces the data's dimensionality. We may be interested in reducing the data's dimensionality to gain a more compact representation of the data.
  For example, consider $x \in \mathbb{R}^n$. Further, consider $W \in \mathbb{R}^{m \times n}$, where $m < n$, so that $Wx$ is of lower dimensionality than $x$. One way to design $W$ so that $Wx$ still contains key features of $x$ is to minimize the following expression with respect to $W$:

  $$L = \frac{1}{2} \left\lVert W^\top W x - x \right\rVert_2^2$$

  (To be complete, autoencoders also have a nonlinearity in each layer, i.e., the loss is $\frac{1}{2} \left\lVert f(W^\top f(Wx)) - x \right\rVert_2^2$. However, we'll work with the linear example.)
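  As a quick sanity check on the notation, here is a minimal NumPy sketch (shapes and data chosen arbitrarily for illustration; not part of the assignment) that evaluates this loss:

  ```python
  import numpy as np

  n, m = 8, 3                      # ambient and reduced dimensionality, m < n
  rng = np.random.default_rng(0)
  x = rng.standard_normal(n)       # data vector x in R^n
  W = rng.standard_normal((m, n))  # weight matrix W in R^{m x n}

  # L = 1/2 * ||W^T W x - x||_2^2
  r = W.T @ (W @ x) - x            # reconstruction residual
  L = 0.5 * (r @ r)
  print(L)
  ```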
  - (3 points) In words, describe why this minimization finds a $W$ that ought to preserve information about $x$.
  - (3 points) Draw the computational graph for $L$.
  - (3 points) In the computational graph, there should be two paths to $W$. How do we account for these two paths when calculating $\nabla_W L$? Your answer should include a mathematical argument.
  - (6 points) Calculate the gradient $\nabla_W L$. (A numerical check you can use to verify your result is sketched below.)
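  Not part of the assignment, but whatever expression you derive for $\nabla_W L$ can be verified numerically. A minimal finite-difference sketch (shapes and the loss definition simply repeat the setup above):

  ```python
  import numpy as np

  def loss(W, x):
      """L = 1/2 * ||W^T W x - x||_2^2, the linear autoencoder loss."""
      r = W.T @ (W @ x) - x
      return 0.5 * (r @ r)

  def numerical_grad(W, x, eps=1e-6):
      """Central finite differences over each entry of W."""
      G = np.zeros_like(W)
      for i in range(W.shape[0]):
          for j in range(W.shape[1]):
              Wp, Wm = W.copy(), W.copy()
              Wp[i, j] += eps
              Wm[i, j] -= eps
              G[i, j] = (loss(Wp, x) - loss(Wm, x)) / (2 * eps)
      return G

  rng = np.random.default_rng(0)
  W = rng.standard_normal((3, 8))
  x = rng.standard_normal(8)
  # Compare this against your analytic gradient, entry by entry.
  print(numerical_grad(W, x))
  ```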
- (20 points) Backpropagation for the Gaussian-process latent variable model. An important component of unsupervised learning is visualizing high-dimensional data in low-dimensional spaces. One such nonlinear algorithm is GP-LVM, from Lawrence, NIPS 2004. GP-LVM optimizes the maximum likelihood of a probabilistic model. We won't get into the details here, but rather get to the bottom line: in this paper, a log-likelihood has to be differentiated with respect to a matrix to derive the optimal parameters.

  To do so, we will apply the chain rule for multivariate derivatives via backpropagation. The log-likelihood is:
  $$L = c - \frac{D}{2} \log |K| - \frac{1}{2} \operatorname{tr}\!\left(K^{-1} Y Y^\top\right)$$

  where $K = XX^\top + \beta^{-1} I$ and $c$ is a constant. To solve this, we'll take the derivatives with respect to the two terms with dependencies on $X$:
  $$L_1 = -\frac{D}{2} \log \left| XX^\top + \beta^{-1} I \right|$$

  $$L_2 = -\frac{1}{2} \operatorname{tr}\!\left( \left( XX^\top + \beta^{-1} I \right)^{-1} Y Y^\top \right)$$
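  For intuition, here is a minimal NumPy sketch (shapes, $\beta$, and data are arbitrary illustrations, not from the assignment) that evaluates $L_1$ and $L_2$:

  ```python
  import numpy as np

  N, q, D = 20, 2, 5                # number of points, latent dim, data dim
  beta = 2.0                        # noise precision (illustrative value)
  rng = np.random.default_rng(1)
  X = rng.standard_normal((N, q))   # latent positions
  Y = rng.standard_normal((N, D))   # observed high-dimensional data

  K = X @ X.T + (1.0 / beta) * np.eye(N)

  _, logdetK = np.linalg.slogdet(K)                  # numerically stable log|K|
  L1 = -0.5 * D * logdetK
  L2 = -0.5 * np.trace(np.linalg.solve(K, Y @ Y.T))  # tr(K^{-1} Y Y^T) without forming K^{-1}
  print(L1, L2)
  ```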
  Hint: To receive full credit, you will be required to show all work. You may use the following matrix derivative without proof:

  $$\frac{\partial L}{\partial K} = -K^{-\top} \, \frac{\partial L}{\partial K^{-1}} \, K^{-\top}$$
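  For context only (the identity may be used without proof), it follows from the differential of the matrix inverse:

  $$d(K^{-1}) = -K^{-1} \, dK \, K^{-1}, \qquad dL = \operatorname{tr}\!\left( \left(\frac{\partial L}{\partial K^{-1}}\right)^{\!\top} d(K^{-1}) \right) = \operatorname{tr}\!\left( \left( -K^{-\top} \frac{\partial L}{\partial K^{-1}} K^{-\top} \right)^{\!\top} dK \right)$$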
  - (3 points) Draw a computational graph for $L_1$.
  - (6 points) Compute $\frac{\partial L_1}{\partial X}$.
  - (3 points) Draw a computational graph for $L_2$.
  - (6 points) Compute $\frac{\partial L_2}{\partial X}$.
  - (2 points) Compute $\frac{\partial L}{\partial X}$.
- (40 points) 2-layer neural network. Complete the two-layer neural network Jupyter notebook. Print out the entire workbook and relevant code and submit it as a PDF to Gradescope. Download the CIFAR-10 dataset, as you did in HW #2.
- (25 points) General FC neural network. Complete the FC Net Jupyter notebook. Print out the entire workbook and relevant code and submit it as a PDF to Gradescope.