Homework 1 Solution

  • The Monty Hall Problem [10 pts]

You are competing in a TV game show and the host asks you to choose one of 3 doors: behind one is a car; behind the others are goats. After you pick a door, the host reveals one of the other two doors, behind which there is a goat. As a second chance, you can now switch your selected door if you want.

As an example, assume that you have selected Door 1 initially. The host then reveals that Door 3 has a goat behind it. Now you can either stay with Door 1 or switch to Door 2.

Question 1.1 [4 pts] Explain how switching your door affects your chances of winning the car.

Question 1.2 [6 pts] Now assume the TV show has N doors instead of 3, while the other rules are kept the same, i.e. only one door has the car and the others have goats. The host opens a door with a goat behind it after you choose one of the N doors, and then you have the chance to switch your door with one of the remaining N−2 doors, or you can choose not to switch and keep your original door. In this case, prove or disprove explicitly whether switching from the original door always increases your winning chance.

Figure 1: The Monty Hall Problem Example
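For intuition (the question still asks for an explicit argument), a quick Monte Carlo simulation can estimate the win rates of the two strategies. This is a minimal sketch assuming the standard rules, with the host always opening a goat door that the player did not pick:

```python
import random

def play(n_doors=3, switch=False):
    """Simulate one round of the N-door Monty Hall game; return True on a win."""
    car = random.randrange(n_doors)
    pick = random.randrange(n_doors)
    # The host opens one door that is neither the player's pick nor the car.
    opened = random.choice([d for d in range(n_doors) if d not in (pick, car)])
    if switch:
        # Switch uniformly at random to one of the remaining N-2 closed doors.
        pick = random.choice([d for d in range(n_doors) if d not in (pick, opened)])
    return pick == car

trials = 100_000
for n in (3, 10):
    stay = sum(play(n, switch=False) for _ in range(trials)) / trials
    swap = sum(play(n, switch=True) for _ in range(trials)) / trials
    print(f"N={n}: stay ~ {stay:.3f}, switch ~ {swap:.3f}")
```

For N = 3 this reproduces the familiar 1/3-versus-2/3 split; running it for larger N hints at how the switching advantage behaves as N grows.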

  • MLE and MAP [15 pts]

Suppose that 20 identical coins are flipped independently in an experiment where 7 out of 20 coin flips are observed to be heads. The probability of heads is $\theta$.

Question 2.1 [5 pts] Find the MLE estimate of $\theta$.

Question 2.2 [7 pts] Assume that before the 20-coin-flip experiment we flip $N_1 + N_2$ coins in order to construct a prior belief on the parameter $\theta$. The outcomes of the $N_1 + N_2$ flips are as follows:

$N_1$ heads are observed
$N_2$ tails are observed

Using a Beta($\alpha, \beta$) distribution [1] as the prior on $\theta$, find the MAP estimate of $\theta$ as a function of $N_1$ and $N_2$ explicitly.

Question 2.3 [3 pts] Use a Beta($\alpha, \beta$) distribution as the prior on $\theta$ to show the relation between the MLE and MAP estimates of $\theta$ if no experiment is conducted a priori. Show your steps explicitly.
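For reference, the standard pieces behind Questions 2.1 to 2.3 are the Binomial likelihood and the Beta prior density (a sketch of the setup, using the symbols defined above):

```latex
% Likelihood of h heads in n independent flips with P(heads) = \theta:
P(D \mid \theta) = \binom{n}{h}\,\theta^{h}\,(1-\theta)^{n-h},
\qquad n = 20,\ h = 7.

% Beta prior density on \theta, with normalizing constant B(\alpha, \beta):
p(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)},
\qquad \theta \in [0, 1].

% The MLE maximizes P(D \mid \theta); the MAP estimate maximizes the
% posterior p(\theta \mid D) \propto P(D \mid \theta)\, p(\theta).
```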

  • DrugBank Drug Target Identifiers Data Set [25 pts]

Biological sciences are becoming data-rich and information-intensive. Machine learning applications have become very useful and popular tools for resolving important questions in biology by enabling the analysis of vast amounts of biological information. In this question, you will be introduced to one of these biological data sets, DrugBank, and gain intuition about the feasibility of the kNN algorithm using this data set.

Data set

The data set contains 4765 unique proteins and 136 unique drugs that target those proteins [2]. Your job is to predict whether a protein is pharmacologically active, i.e. directly related to the mechanism of action of at least one of the associated drugs, by using the target information of drugs.

The data has been split into training and validation subsets with a ratio of 3812/953, i.e. 20% of the proteins are taken for the validation set. You will use the following files:

question-3-train-features.csv
question-3-train-labels.csv
question-3-valid-features.csv
question-3-valid-labels.csv
question-3-train-protein-index.csv
question-3-valid-protein-index.csv
question-3-drug-index.csv

The files that end with features.csv contain the features, and the files ending with labels.csv contain the ground truth labels.

In the feature files, each row contains the feature vector for a protein. The j-th term in row i is the target information of the j-th drug for the i-th protein. The label files include the ground truth labels for the corresponding proteins, where the order of the proteins (rows) is the same as in the features file; that is, the i-th row in both files corresponds to the same protein. Each protein is labeled as either active (1) or inactive (0).

The files ending with index.csv are the index files, in which the first element of the j-th row is the protein or drug that the j-th row or column represents in the feature files.

Question 3.1 [5 pts] Take a quick look at your feature files and decide on a distance metric for your kNN classifier. Briefly explain your motivation for choosing this distance metric. What would happen if you used another metric to calculate distances?

Question 3.2 [12 pts] Train your kNN classifier for each $k \in \{1, 3, 5, 10, 20, 50, 100, 200\}$. You do not have to use all values if you observe a trend in the outputs. Report accuracy and precision on the validation set for each of your training sessions. You may use plots or tables. Comment on your results. What did you expect, and are the results in line with or contradictory to your expectations?

In addition, make note of the training time and validation time of your classifier for the next question.

Question 3.3 [8 pts] Comment on the run times that you noted in the previous question. You may use plots or tables to support your comments. Find the complexity of the brute-force kNN algorithm and report it using big-O notation. In which cases would using kNN not be a feasible solution?
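As a reference point for the run-time discussion, below is a minimal brute-force kNN sketch (NumPy assumed; Hamming distance is used purely as an illustrative choice for binary features, since Question 3.1 asks you to justify your own metric). The per-query cost of the distance pass is O(n·d) for n training rows and d features, which is where the brute-force complexity comes from.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Brute-force kNN with Hamming distance for binary feature vectors."""
    preds = np.empty(len(X_query), dtype=y_train.dtype)
    for i, q in enumerate(X_query):
        # O(n * d): compare the query against every training row.
        dists = np.count_nonzero(X_train != q, axis=1)
        # Indices of the k smallest distances (argpartition avoids a full sort).
        nearest = np.argpartition(dists, k)[:k]
        # Majority vote over the labels of the k nearest neighbors.
        preds[i] = np.bincount(y_train[nearest]).argmax()
    return preds
```

With the 3812 × 136 training matrix in this question, a validation pass over 953 proteins is cheap; the same loop becomes impractical when the number of training points or features grows by a few orders of magnitude.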

  • Sentiment Analysis on Emails [50 pts]

As a computer scientist working for an online science magazine, your job is to analyze online data to classify emails from your subscribers according to their topics.

Data set

Your data set is a preprocessed and modified version of the 20 News Group Data Set [3]. It is based on 4000 real emails about 4 different topics in science. The emails have been preprocessed in the following ways:

Stop word removal: Words like "and", "the", and "of" are very common in all English sentences and are therefore not very predictive. These words have been removed from the emails.

Removal of non-words: Numbers and punctuation have been removed. All white space (tabs, newlines, spaces) has been trimmed to a single space character.

Removal of infrequent words: Words that occur only once in the whole data set have been removed from the emails in order to reduce the size of the data.


The data has already been split into two subsets: a 3200-email subset for training and an 800-email subset for testing (consider this your validation set and imagine there is another test set which is not given to you). Features have been generated for you. You will use the following files:

question-4-train-features.csv
question-4-train-labels.csv
question-4-test-features.csv
question-4-test-labels.csv

question-4-vocab.txt

The files that end with features.csv contain the features, and the files ending with labels.csv contain the ground truth labels.

In the feature files, each row contains the feature vector for an email. The j-th term in row i is the occurrence information of the j-th vocabulary word in the i-th email. The size of the vocabulary is 37358. The label files include the ground truth label for the corresponding email (label 0 is medicine, label 1 is space, label 2 is cryptology, and label 3 is electronics); the order of the emails (rows) is the same as in the features file, that is, the i-th row in both files corresponds to the same email. Each email is labeled as either medicine, space, cryptology, or electronics.

The file ending with vocab.txt is the vocabulary file, in which the j-th word (feature) in the file corresponds to the j-th feature in both the train and test sets.
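For orientation, loading the four Question 4 files might look like the sketch below; it assumes the CSVs are plain comma-separated numbers with no header row, which you should verify against the actual files.

```python
import numpy as np

# Feature matrices: one row per email, one column per vocabulary word.
X_train = np.loadtxt("question-4-train-features.csv", delimiter=",")
y_train = np.loadtxt("question-4-train-labels.csv", dtype=int)
X_test = np.loadtxt("question-4-test-features.csv", delimiter=",")
y_test = np.loadtxt("question-4-test-labels.csv", dtype=int)

# Sanity checks against the sizes stated in the handout.
assert X_train.shape == (3200, 37358) and y_train.shape == (3200,)
assert X_test.shape == (800, 37358) and y_test.shape == (800,)
```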

Bag-of-Words Representation and Multinomial Naive Bayes Model

Notice that the bag-of-words document representation assumes that the probability of a word appearing in an email is conditionally independent of the word position given the class of the email. If we have a particular email document $D_i$ with $n_i$ words in it, we can compute the probability that $D_i$ comes from class $y_k$ as:

$$P(D_i \mid Y = y_k) = P(X_1 = x_1, X_2 = x_2, \ldots, X_{n_i} = x_{n_i} \mid Y = y_k) = \prod_{j=1}^{n_i} P(X_j = x_j \mid Y = y_k) \qquad (4.1)$$

In Eq. (4.1), $X_j$ represents the j-th position in email $D_i$ and $x_j$ represents the actual word that appears in the j-th position of the email, whereas $n_i$ represents the number of positions in the email. As a concrete example, we might have a first email ($D_1$) which contains 200 words ($n_1 = 200$). The document might be a space email ($y_k = 1$) and the 15th position in the email might have the word "saturn" ($x_{15}$ = "saturn").

In the above formulation, the feature vector $\vec{X}$ has a length that depends on the number of words in the email, $n_i$. That means the feature vector for each email will be of a different size. Also, the above formal definition of a feature vector $\vec{x}$ for an email says that $x_j = k$ if the j-th word in this email is the k-th word in the dictionary. This does not exactly match our feature files, where the j-th term in row i is the number of occurrences of the j-th dictionary word in email i. As shown in the lecture slides, we can slightly change the representation, which makes it easier to implement:

$$P(D_i \mid Y = y_k) = \prod_{j=1}^{V} P(X_j \mid Y = y_k)^{t_{w_j,i}} \qquad (4.2)$$

where $V$ is the size of the vocabulary, $X_j$ represents the occurrence of the j-th vocabulary word, and $t_{w_j,i}$ denotes how many times word $w_j$ appears in email $D_i$. As a concrete example, we might have a vocabulary of size 1309, $V = 1309$. The first email ($D_1$) might be about space ($y_k = 1$), the 80-th word in the vocabulary, $w_{80}$, might be "planet", and $t_{w_{80},1} = 2$ would say that the word "planet" appears 2 times in email $D_1$. Contemplate why these two models (Eq. (4.1) and Eq. (4.2)) are equivalent.
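A hint for that last sentence: since each position-wise factor in Eq. (4.1) depends only on which vocabulary word occupies the position, the $n_i$ factors can be grouped by word, one group per vocabulary entry, each raised to that word's count:

```latex
\prod_{j=1}^{n_i} P(X_j = x_j \mid Y = y_k)
  \;=\; \prod_{w=1}^{V} P(X_w \mid Y = y_k)^{\,t_{w,i}} .
```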


In the classification problem, we are interested in the probability distribution over the email classes (in this case medicine, space, cryptology, and electronics emails) given a particular email $D_i$. We can use Bayes' rule to write:

$$P(Y = y_k \mid D_i) = \frac{P(Y = y_k) \prod_{j=1}^{V} P(X_j \mid Y = y_k)^{t_{w_j,i}}}{\sum_{k} P(Y = y_k) \prod_{j=1}^{V} P(X_j \mid Y = y_k)^{t_{w_j,i}}} \qquad (4.3)$$

Note that, for the purposes of classification, we can actually ignore the denominator here and write:

$$P(Y = y_k \mid D_i) \propto P(Y = y_k) \prod_{j=1}^{V} P(X_j \mid Y = y_k)^{t_{w_j,i}} \qquad (4.4)$$

$$\hat{y}_i = \arg\max_{y_k} P(Y = y_k \mid D_i) = \arg\max_{y_k} P(Y = y_k) \prod_{j=1}^{V} P(X_j \mid Y = y_k)^{t_{w_j,i}} \qquad (4.5)$$

Question 4.1 [2 points] Explain why the denominator can be ignored in Eq. (4.3).

Probabilities are floating point numbers between 0 and 1, so when you are programming it is usually not a good idea to use actual probability values, as this might cause numerical underflow issues. As the logarithm is a strictly monotonic function on [0, 1] and all of the inputs are probabilities that must lie in [0, 1], it does not affect which of the classes achieves the maximum. Taking the logarithm gives us:

$$\hat{y}_i = \arg\max_{y_k} \left( \log P(Y = y_k) + \sum_{j=1}^{V} t_{w_j,i} \log P(X_j \mid Y = y_k) \right) \qquad (4.6)$$

where $\hat{y}_i$ is the predicted label for the i-th example.
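A minimal sketch of the decision rule in Eq. (4.6), assuming the word counts and learned log-parameters are already available as NumPy arrays (the names here are hypothetical, not part of the handout):

```python
import numpy as np

def predict(counts, log_prior, log_theta):
    """Log-space Naive Bayes decision rule of Eq. (4.6).

    counts:    (n_docs, V) word counts t_{w_j, i}
    log_prior: (n_classes,) values of log P(Y = y_k)
    log_theta: (n_classes, V) values of log P(X_j | Y = y_k); -inf allowed
    """
    n_docs = counts.shape[0]
    scores = np.repeat(log_prior[None, :].astype(float), n_docs, axis=0)
    for k in range(log_prior.shape[0]):
        with np.errstate(invalid="ignore"):  # 0 * -inf yields nan, masked below
            term = counts * log_theta[k]
        # Enforce the 0 * log 0 = 0 convention mentioned in Question 4.3.
        scores[:, k] += np.where(counts == 0, 0.0, term).sum(axis=1)
    # Note: argmax breaks ties toward the lowest class index; Question 4.3's
    # tie-break toward "space" must be handled separately.
    return scores.argmax(axis=1)
```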

The parameters to learn and their MLE estimators are as follows:

$$\theta_{j \mid y=\text{space}} = \frac{T_{j,\,y=\text{space}}}{\sum_{j=1}^{V} T_{j,\,y=\text{space}}} \qquad
\theta_{j \mid y=\text{medicine}} = \frac{T_{j,\,y=\text{medicine}}}{\sum_{j=1}^{V} T_{j,\,y=\text{medicine}}}$$

$$\theta_{j \mid y=\text{electronics}} = \frac{T_{j,\,y=\text{electronics}}}{\sum_{j=1}^{V} T_{j,\,y=\text{electronics}}} \qquad
\theta_{j \mid y=\text{cryptology}} = \frac{T_{j,\,y=\text{cryptology}}}{\sum_{j=1}^{V} T_{j,\,y=\text{cryptology}}}$$

$$\pi_{y=\text{space}} = P(Y = \text{space}) = \frac{N_{\text{space}}}{N}$$

$T_{j,\,y=\text{space}}$ is the number of occurrences of the word j in space emails in the training set, including multiple occurrences of the word in a single email.

$T_{j,\,y=\text{medicine}}$ is the number of occurrences of the word j in medicine emails in the training set, including multiple occurrences of the word in a single email.

$T_{j,\,y=\text{electronics}}$ is the number of occurrences of the word j in electronics emails in the training set, including multiple occurrences of the word in a single email.

$T_{j,\,y=\text{cryptology}}$ is the number of occurrences of the word j in cryptology emails in the training set, including multiple occurrences of the word in a single email.

$N_{\text{space}}$ is the number of space emails in the training set. $N$ is the total number of emails in the training set.

$\pi_{y=\text{space}}$ estimates the probability that any particular email will be about space.

$\theta_{j \mid y=\text{space}}$ estimates the probability that a particular word in a space email will be the j-th word of the vocabulary, $P(X_j \mid Y = \text{space})$.

$\theta_{j \mid y=\text{medicine}}$ estimates the probability that a particular word in a medicine email will be the j-th word of the vocabulary, $P(X_j \mid Y = \text{medicine})$.

$\theta_{j \mid y=\text{electronics}}$ estimates the probability that a particular word in an electronics email will be the j-th word of the vocabulary, $P(X_j \mid Y = \text{electronics})$.

$\theta_{j \mid y=\text{cryptology}}$ estimates the probability that a particular word in a cryptology email will be the j-th word of the vocabulary, $P(X_j \mid Y = \text{cryptology})$.

Question 4.2 [3 points] How many parameters do we need to estimate for this model?

Question 4.3 (Coding) [25 points] Train a Naive Bayes classifier using all of the data in the training set (question-4-train-features.csv and question-4-train-labels.csv). Test your classifier on the test data (question-4-test-features.csv and question-4-test-labels.csv), and report the testing accuracy and confusion matrix, as well as how many wrong predictions were made. In estimating the model parameters, use the above MLE estimator. If it arises in your code, define $0 \log 0 = 0$ (note that $a \log 0$ for $a \neq 0$ is as it is, that is, $-\infty$). In case of ties, you should predict "space". What did your classifier end up predicting? Why is using the MLE estimate a bad idea in this situation?

Question 4.4 (Coding) [5 points] Extend your classifier so that it can compute a MAP estimate of the parameters using a fair Dirichlet prior. This corresponds to additive smoothing. The prior is fair in the sense that it assumes that each word appears an additional $\alpha$ times in the training set:

$$\theta_{j \mid y=\text{space}} = \frac{T_{j,\,y=\text{space}} + \alpha}{\sum_{j=1}^{V} T_{j,\,y=\text{space}} + \alpha V} \qquad
\theta_{j \mid y=\text{medicine}} = \frac{T_{j,\,y=\text{medicine}} + \alpha}{\sum_{j=1}^{V} T_{j,\,y=\text{medicine}} + \alpha V}$$

$$\theta_{j \mid y=\text{electronics}} = \frac{T_{j,\,y=\text{electronics}} + \alpha}{\sum_{j=1}^{V} T_{j,\,y=\text{electronics}} + \alpha V} \qquad
\theta_{j \mid y=\text{cryptology}} = \frac{T_{j,\,y=\text{cryptology}} + \alpha}{\sum_{j=1}^{V} T_{j,\,y=\text{cryptology}} + \alpha V}$$

$$\pi_{y=\text{space}} = P(Y = \text{space}) = \frac{N_{\text{space}}}{N}$$

For this question, set $\alpha = 1$. Train your classifier using all of the training set, have it classify all of the test set, and report the test-set accuracy and confusion matrix. Comment on the results.
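A minimal sketch of the smoothed estimators above, assuming `X_train` is the (3200, 37358) count matrix and `y_train` the label vector from the earlier loading sketch; setting `alpha=0` recovers the plain MLE of Question 4.3:

```python
import numpy as np

def fit_multinomial_nb(X, y, n_classes=4, alpha=1.0):
    """Estimate log-priors and (optionally smoothed) log-likelihood parameters."""
    V = X.shape[1]
    log_prior = np.empty(n_classes)
    log_theta = np.empty((n_classes, V))
    for k in range(n_classes):
        T_k = X[y == k].sum(axis=0)  # T_{j, y=k}: word counts over class-k emails
        log_prior[k] = np.log((y == k).sum() / len(y))  # pi_k = N_k / N
        with np.errstate(divide="ignore"):  # log 0 = -inf only when alpha = 0
            log_theta[k] = np.log((T_k + alpha) / (T_k.sum() + alpha * V))
    return log_prior, log_theta
```

With $\alpha = 1$ every $\theta_{j \mid y}$ is strictly positive, so the $-\infty$ terms and the $0 \log 0$ convention from Question 4.3 no longer come into play.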

Question 4.5 [2 points] What would happen as we increase $\alpha$ in the fair Dirichlet prior? What would you end up predicting when $\alpha \to \infty$ is used? Explain clearly.

Question 4.6 (Coding) [5 points] Using the question-4-vocab.txt file, find the 20 most commonly used words in each email class in the training set and comment on them. Do you think the most common words are as expected? Is the model that you have constructed interpretable? Explain clearly.

Question 4.7 (Coding) [6 points] For each class of emails, find 1 instance in the test set that has the highest probability and 1 instance that has the lowest probability. You have to report 8 instances in total (you can simply report the indices of these instances in the test set provided to you). Briefly discuss why these instances are easy or hard to predict.

Question 4.8 [2 points] Describe the differences between the Bernoulli Naive Bayes model and the Multinomial Naive Bayes model. To use the Bernoulli model for email class prediction, what would you change in your previous code? Explain clearly.


References

  1. Beta distribution. https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading14a.pdf

  2. DrugBank – Drug Target Identifiers Data Set. https://www.drugbank.ca/releases/latest#protein-identifiers

  3. 20 News Group Data Set. http://qwone.com/~jason/20Newsgroups/

  4. Ng, A. Y., and Jordan, M. I. "On Discriminative vs. Generative Classifiers: A comparison of logistic regression and Naive Bayes."

  5. Manning, C. D., Raghavan, P., and Schutze, H. (2008). Introduction to Information Retrieval. New York: Cambridge University Press. http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html

  6. CMU Lecture Notes. http://www.cs.cmu.edu/~epxing/Class/10701-10s/Lecture/lecture5.pdf

