Description
Problem 1 (8 + 2 = 10 pts): Apply a Variational Autoencoder (VAE) to the Fashion MNIST dataset.
- Use a minimum of 3 convolutional layers in the encoder and 3 deconvolutional layers (Conv2DTranspose/upsampling) in the decoder.
- Display how the latent space clusters the different classes of the training data.
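For reference, a minimal sketch of one way to structure this in Keras/TensorFlow 2 is shown below, loosely following the Keras VAE example linked under Useful links. The layer sizes, latent dimension, number of epochs, and batch size are illustrative assumptions, not requirements.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

latent_dim = 2  # a 2-D latent space makes the cluster plot easy to draw

class Sampling(layers.Layer):
    """Reparameterisation trick: z = mean + exp(0.5 * log_var) * epsilon."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        eps = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

# Encoder: 3 convolutional layers, then mean / log-variance heads
enc_in = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(enc_in)
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2D(128, 3, strides=1, padding="same", activation="relu")(x)
x = layers.Flatten()(x)
z_mean = layers.Dense(latent_dim, name="z_mean")(x)
z_log_var = layers.Dense(latent_dim, name="z_log_var")(x)
z = Sampling()([z_mean, z_log_var])
encoder = keras.Model(enc_in, [z_mean, z_log_var, z], name="encoder")

# Decoder: 3 Conv2DTranspose layers back up to 28x28x1
dec_in = keras.Input(shape=(latent_dim,))
x = layers.Dense(7 * 7 * 64, activation="relu")(dec_in)
x = layers.Reshape((7, 7, 64))(x)
x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
dec_out = layers.Conv2DTranspose(1, 3, strides=1, padding="same", activation="sigmoid")(x)
decoder = keras.Model(dec_in, dec_out, name="decoder")

class VAE(keras.Model):
    def __init__(self, encoder, decoder, **kwargs):
        super().__init__(**kwargs)
        self.encoder, self.decoder = encoder, decoder

    def train_step(self, data):
        with tf.GradientTape() as tape:
            z_mean, z_log_var, z = self.encoder(data)
            recon = self.decoder(z)
            # Per-image reconstruction loss summed over pixels
            recon_loss = tf.reduce_mean(tf.reduce_sum(
                keras.losses.binary_crossentropy(data, recon), axis=(1, 2)))
            # KL divergence between the approximate posterior and N(0, I)
            kl_loss = -0.5 * tf.reduce_mean(tf.reduce_sum(
                1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1))
            loss = recon_loss + kl_loss
        grads = tape.gradient(loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        return {"loss": loss, "reconstruction_loss": recon_loss, "kl_loss": kl_loss}

(x_train, y_train), _ = keras.datasets.fashion_mnist.load_data()
x_train = x_train.astype("float32")[..., None] / 255.0

vae = VAE(encoder, decoder)
vae.compile(optimizer=keras.optimizers.Adam())
vae.fit(x_train, epochs=30, batch_size=128)

# Latent-space clustering: encode the training set and colour points by class
z_mean_train, _, _ = encoder.predict(x_train, batch_size=256)
plt.scatter(z_mean_train[:, 0], z_mean_train[:, 1], c=y_train, s=2, cmap="tab10")
plt.colorbar()
plt.xlabel("z[0]")
plt.ylabel("z[1]")
plt.show()
```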
Problem 2 (2 + 8 = 10 pts): Use the IMDB movie review dataset:
- Perform text preprocessing (a minimal preprocessing sketch appears after the model list below):
  - Tokenization
  - Stopword removal
  - HTML tag removal
  - Conversion to lower case
  - Lemmatization/stemming
- Build the following sentiment analysis models and create a performance comparison table (see the pipeline sketch below):
  - TF-IDF + GaussianNB
  - Word2Vec (CBOW) + GaussianNB
  - GloVe + GaussianNB
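A minimal preprocessing sketch is shown below. It assumes the Kaggle CSV release of the dataset with `review` and `sentiment` columns and the NLTK toolkit; the file path, column names, and choice of lemmatization over stemming are assumptions to adjust to your setup.

```python
import re
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Resources needed by word_tokenize, the stopword list, and the lemmatizer
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet", "omw-1.4"):
    nltk.download(resource, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)                  # HTML tag removal
    text = text.lower()                                   # convert to lower case
    tokens = word_tokenize(text)                          # tokenization
    tokens = [t for t in tokens if t.isalpha()]           # drop punctuation/numbers
    tokens = [t for t in tokens if t not in stop_words]   # stopword removal
    return [lemmatizer.lemmatize(t) for t in tokens]      # lemmatization

df = pd.read_csv("IMDB Dataset.csv")                      # hypothetical path
df["tokens"] = df["review"].apply(preprocess)
df["label"] = (df["sentiment"] == "positive").astype(int)
```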
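The three pipelines could then be wired up roughly as follows, reusing `df["tokens"]` and `df["label"]` from the preprocessing sketch above. It assumes scikit-learn's GaussianNB, a gensim Word2Vec model trained in CBOW mode (`sg=0`), and pre-trained GloVe vectors loaded through gensim's downloader; the `glove-wiki-gigaword-100` vector set, the TF-IDF feature cap, and averaging word vectors per review are illustrative choices.

```python
import numpy as np
import pandas as pd
import gensim.downloader as api
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score

X_train_tok, X_test_tok, y_train, y_test = train_test_split(
    df["tokens"], df["label"], test_size=0.2, random_state=42)

def avg_vector(tokens, kv, dim):
    """Average the word vectors of the tokens present in the vocabulary."""
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

results = {}

# 1) TF-IDF + GaussianNB (GaussianNB needs dense arrays, so cap the vocabulary)
tfidf = TfidfVectorizer(max_features=3000, dtype=np.float32)
Xtr = tfidf.fit_transform(X_train_tok.apply(" ".join)).toarray()
Xte = tfidf.transform(X_test_tok.apply(" ".join)).toarray()
pred = GaussianNB().fit(Xtr, y_train).predict(Xte)
results["TF-IDF + GaussianNB"] = (accuracy_score(y_test, pred), f1_score(y_test, pred))

# 2) Word2Vec (CBOW) trained on the training reviews + GaussianNB
w2v = Word2Vec(sentences=X_train_tok.tolist(), vector_size=100, window=5, sg=0, min_count=2)
Xtr = np.vstack([avg_vector(t, w2v.wv, 100) for t in X_train_tok])
Xte = np.vstack([avg_vector(t, w2v.wv, 100) for t in X_test_tok])
pred = GaussianNB().fit(Xtr, y_train).predict(Xte)
results["Word2Vec (CBOW) + GaussianNB"] = (accuracy_score(y_test, pred), f1_score(y_test, pred))

# 3) Pre-trained GloVe vectors + GaussianNB
glove = api.load("glove-wiki-gigaword-100")
Xtr = np.vstack([avg_vector(t, glove, 100) for t in X_train_tok])
Xte = np.vstack([avg_vector(t, glove, 100) for t in X_test_tok])
pred = GaussianNB().fit(Xtr, y_train).predict(Xte)
results["GloVe + GaussianNB"] = (accuracy_score(y_test, pred), f1_score(y_test, pred))

# Performance comparison table
print(pd.DataFrame(results, index=["accuracy", "f1"]).T)
```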
BONUS (2 pts): In your own words, explain what Byte Pair Encoding (BPE) is and list the steps involved in BPE tokenization. Apply BPE to 5%, 10%, 15%, and 20% of the IMDB training dataset, and compare the results in terms of the number of generated tokens for the varying dataset sizes.
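One possible way to run the BPE comparison with the Hugging Face `tokenizers` library (linked under BPE resources) is sketched below, reusing `df["review"]` from the preprocessing sketch. The target vocabulary size and the decision to report both the learned vocabulary size and the total number of encoded tokens are assumptions.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe(corpus, vocab_size=10000):
    # Train a BPE tokenizer from scratch on the given list of raw review strings
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tok.train_from_iterator(corpus, trainer=trainer)
    return tok

reviews = df["review"].tolist()  # raw reviews from the preprocessing step above
for frac in (0.05, 0.10, 0.15, 0.20):
    subset = reviews[: int(frac * len(reviews))]
    tok = train_bpe(subset)
    n_tokens = sum(len(tok.encode(r).tokens) for r in subset)
    print(f"{int(frac * 100):>3}% of training data: "
          f"vocab size = {tok.get_vocab_size()}, total encoded tokens = {n_tokens}")
```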
Useful links:
- Data download:
  - Fashion MNIST (Keras): https://keras.io/api/datasets/fashion_mnist/
  - CIFAR10 (Keras): https://keras.io/api/datasets/cifar10/
  - CIFAR10 (PyTorch): https://pytorch.org/vision/stable/generated/torchvision.datasets.CIFAR10.html
  - IMDB movie reviews: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
- Keras VAE example: https://keras.io/examples/generative/vae/
- Gensim Word2Vec: https://radimrehurek.com/gensim/models/word2vec.html
- BPE resources:
  - Hugging Face BPE: https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt
  - https://www.geeksforgeeks.org/byte-pair-encoding-bpe-in-nlp/
  - https://towardsdatascience.com/training-bpe-wordpiece-and-unigram-tokenizers-from-scratch-using-hugging-face-3dd174850713
You are required to submit:
- An MS Word/PDF/scanned document:
  - Include all the steps of your calculations.
  - Attach screenshots of the code output.
  - Include the summary of the model.
  - Include a table listing all the hyperparameters you selected: activation function in the hidden and output layers, weight initializer, number of hidden layers, neurons in hidden layers, loss function, optimizer, number of epochs, batch size, learning rate, and evaluation metric.
- Source code:
  - Python (Jupyter Notebook).
  - Ensure it is well organized with comments and proper indentation.
  - Failure to submit the source code will result in a deduction of 5 points.
- Format your filenames as follows: “your_last_name_HW1.pdf” for the document and “your_last_name_HW1_source_code.ipynb” for the source code.
- Before submitting the source code, please double-check that it runs without any errors.
- Submit the files separately.
- Do not compress them into a zip file.
- Homework submitted more than 24 hours late will not be accepted for credit.