Description
Meeting Schedule (5 points)
You are required to set at least two virtual meeting times to work with you group on this homework. You should decide these times at the beginning of the homework period. When you submit your homework, you should include the dates, times, and attendees of these meetings.
Note: If you feel you are unable to nd times when everyone can meet, please make a note on Piazza, at least 1 week before the deadline, and we will work out an arrangement.
1
Homework 4 Last Updated: April 20, 2020
Problems
-
SVM Theory (20 points CSC 522 / 8 points CSC 422) [Yang Shi].
-
-
Support vector machines (SVM) learn a decision boundary leading to the largest margin between classes. In this question, you will train a SVM on a tiny dataset with 4 data points, shown in Figure 1. This dataset consists of two points with Class 1 (y = 1) and two points with Class 2 (y = -1). Each data point has two non-class attributes: X1 and X2.
-
10
9
8
7
X2
6
5
4
3
3 4 5 6 7 8 9
X1
Figure 1: SVM data points for 1(a)
-
-
Find the weight vector w and bias w0 for the decision boundary of a hard-margin SVM. What is the equation corresponding to this decision boundary?
-
-
-
Circle the support vectors and draw the decision boundary.
-
-
(12 points) Required for CSC522, Extra Credit for CSC 422): You are given 1-dimensional data points Xi; i 2 [1; 2; 3; 4; 5; 6] as shown in Table 1 ,also shown in Figure 2 in this question.
-3 -2 -1 0 1 2
Figure 2: SVM data points for 1(b)
-
Data ID
x
y
X1
-3
-1
X2
-2
-1
X3
-1
1
X4
0
1
X5
1
1
X6
2
-1
Table 1: Six Data Points
Use this data to answer the following questions:
i. Calculate the equation for the decision boundary of a hard-margin SVM, or if this is not possible, explain why in 1-2 sentences.
-
If you were to train a soft-margin SVM on this data, would you select a C value where C ! 0 or C ! 1. Explain why in 1 sentence.
2
Homework 4 Last Updated: April 20, 2020
-
Imagine you want to transform the 6 given data points to a higher dimensional space. You decide to use the kernel function K(Xi; Xj ) = (1 + 2XiXj )2, which is equal to (Xi) (Xj ). What is the function (X)? How many dimensions is the transformed data?
-
Use the function (X) to calculate (Xi) for i 2 [1; 2; 3; 4; 5; 6]. Graph these data points in the higher-dimensional space. (Hint: If the data is more than 2-dimensional, can you simplify your visualization to show it in 2D?)
-
Is it possible to linearly separate the data in the higher-dimensional space? If so, draw the decision boundary in your graph. If not, explain why. Note: You do not have to calculate the weights, just draw the decision boundary.
-
You train a hard-margin SVM on the higher-dimensional data using a library and it gives you the following Lagrange multiplier for your data1: 2 = 0:1, 3 = 0:1, 5 = 0:1, 6 = 0:1. What are the remaining Lagrange multipliers, 1 and 4? Justify your answer in 1-2 sentences. (Hint: This should not require any math to calculate.)
-
Recall that the SVM’s prediction (using the Kernel transformation) for a data point Z can be de ned with the following equation:
X
f(Z) = sign( iyi( (Xi) (Z)) + w0) (1)
i
You are given w0 = -2/3. You are now asked to classify a new test data point, Z, using the SVM de ned earlier by the Lagrange multipliers. You do not know what Z’s attributes are, but you do know: K(X2; Z) = 25, K(X3; Z) = 4, K(X5; Z) = 16, K(X6; Z) = 49. Classify Z using the SVM. (Hint: If you nd yourself trying to solve for Z’s x value, you are doing it wrong.)
-
K-Means Clustering (14 points) [Ge Gao] Use the K-means clustering algorithm with Euclidean Dis-tance to cluster the 10 data points in Figure 3 into 3 clusters. Suppose that the initial seed centroids are at points: C, I and H. The data are also given in tabular format in Table 2.
-
-
After each iteration of k-means, report the coordinates of the new centroids and which cluster each data point belongs to. Stop when the algorithm converges and clearly label on the graph where the algorithm converges. To report your work, either:
-
-
-
-
Draw the result clusters and the new centroid at the end of each round (including the rst round). You can use the image hw4q2 start.jpg, included with your homework, to mark clusters and centroids. Additionally, indicate the coordinates (x; y) alongside corresponding centroids.
-
-
-
-
-
Give your answer in tabular format with the following attributes: Round (e.g. Round 1, 2, etc), Points (e.g. fA, B, Cg), and Cluster ID. Also report the centroids for each cluster after each round.
-
-
-
Note, these are not the actual Lagrange multipliers for the SVM, but assume they are for the purposes of this question.
3
Homework 4 Last Updated: April 20, 2020
Figure 3: K-means Clustering (a)
-
Point
x
y
A
9
5
B
3
1
C
7
4
D
1
9
E
2
2
F
4
6
G
8
8
H
9
2
I
6
6
J
3
3
Table 2: K-means Clustering (b)
-
-
How many rounds are needed for the K-means clustering algorithm to converge?
-
-
Hierarchical Clustering (18 points) [Ge Gao] We will use the same dataset as in Question 2 (shown in Figure 3) for Hierarchical Clustering. The Euclidean Distance matrix between each pair of the datapoints is given in Figure 4 below:
-
-
Perform single link hierarchical clustering. Show your work at each iteration by giving the inter-cluster distances. Report your results by drawing a corresponding dendrogram. The dendrogram should clearly show the order and the height in which the clusters are merged. If possible, use a program to construct your dendrogram (e.g. PowerPoint, LucidChart2, or VisualParadigm3). Scanned hand drawings will also be accepted if they are very clear.
-
-
-
Perform complete link hierarchical clustering on the dataset. As above, show your calculations and report the corresponding dendrogram.
-
4
Homework 4 Last Updated: April 20, 2020
-
If we assume there are two clusters, will the single or complete link approach give a better clus-tering? Justify your answer.
-
Consider the single-link hierarchical clustering with 3 clusters. Compare the quality of this single link clustering with your nal K-means clustering in Question 2. To evaluate the quality of each clustering, calculate its corresponding Sum of Squared Error (SSE). Based on this measure, which clustering (k-means, single link), is best? Do you agree with this assessment? Explain why in 1-2 sentences. Note: you may want to write some code to help speed up these calculations, which you can include in lieu of showing your work.
Figure 4: Hierarchical Clustering Dataset
-
Association Rule Mining (6 points) [Ge Gao]. Consider the following market basket transactions shown in the Table 3 below.
-
Transaction ID
Bread
Milk
Butter
Eggs
Beer
Cola
1
0
0
0
1
0
1
2
1
1
0
0
0
1
3
1
0
1
0
1
0
4
0
1
0
1
0
0
5
0
0
1
0
1
1
6
1
1
1
0
0
1
7
1
1
0
1
0
0
8
0
1
1
1
0
1
9
1
0
1
1
1
0
10
1
1
0
0
0
1
Table 3: For each transaction (row), a 1 indicates that a given item was present in that transaction, and a 0 indicates that it was not.
-
What is the maximum number of unique itemsets that can be extracted from this data set (in-cluding itemsets that have 0 support)? Brie y explain your answer in 1-2 sentences.
-
What is the maximum number of association rules that can be extracted from this data set (including rules that have zero support)? Brie y explain your answer in 2-3 sentences.
-
Compute the support of the itemset: fEggs; Colag?
-
Compute the support and con dence of association rule: fBreadg ! fButterg?
-
Given min support = 0.3 and min con dence = 0.6, identify all valid association rules of the form fA; Bg ! fCg.
-
In a di erent dataset, the support of the rule fag ! fbg is 0.46, and the support of the rule fa; cg ! fb; dg is 0.23. What can we say for sure about the support of the rule fag ! fb; dg. Explain in 1-2 sentences.
5
Homework 4 Last Updated: April 20, 2020
-
Apriori algorithm (12 points) [Yang Shi].
Consider the data set shown in Table 4 and answer the following questions using apriori algorithm.
-
TID
Items
t1
A,B,C,D
t2
A,B,C,E
t3
A,B
t4
A,C,E
t5
A,C,D,E
t6
B,C,D
t7
B,C,E
t8
C,D,E
Table 4: Apriori algorithm
-
-
Show (compute) each step of frequent itemset generation process using the apriori algorithm, with a minimum support count of 3.
-
-
-
Show the lattice structure for the data given in table above, and mark each node in the lattice as either F: Frequent, IC: Infrequent due insu cient support count, or IP: Infrequent due to pruning (we do not need to calculate the support count). (Scanned hand-drawing is acceptable as long as it is clear.)
-
-
Clustering and SVM (25 points) [Yang Shi].
In this question, you will be performing a variety of machine learning operations – clustering and classi cation.
-
-
PART-1: Clustering (15 points)
-
-
-
-
Dataset description: You are provided a dataset with 2 variables (x; y). Your data is stored in the le data/clustering-sample.csv.
-
-
-
-
-
Note: The TA will use a di erent version of data/clustering-sample.csv. The format (variables x ; y, will be similar, but TA’s le may contain di erent number of data points, and may look visually di erent than the le supplied to you. Please ensure you take this into account, and do not hard code any dimensions/outputs.
-
-
-
-
-
In this exercise, you will apply three di erent types of clustering methods to the dataset supplied to you, and then compare their results:
-
-
-
-
-
-
Clustering: You will write code in the function alda cluster() to manually implement KMeans, Single Link and Complete Link clustering. Detailed instructions for implemen-tation and allowed packages have been provided in hw4.R and in hw4 checker.R.
-
-
-
-
-
-
-
SSE Calculation: In the function alda calculate sse(), we have given you code to calculate the total SSE for a given clustering on a given dataset. The total SSE is the sum of squared error, summed over each cluster. To calculate the SSE for a cluster, rst calculate the centroid for the cluster, then calculate to total Euclidean distance (error) from each point to that centroid. Read over the code to check your understanding.
-
-
-
-
-
-
-
Analysis-1: KMeans Elbow Plot: You will write code in the function alda kmeans elbow plot() to generate an elbow plot to help you choose a value of k for k-means clustering. Note
-
-
-
that you can use alda calculate sse() to calculate the SSE for each value of k. Gener-ate this elbow plot and save it as instructed in hw4.R, and place this plot in your PDF. Using this plot, report what you think is the best value of k and your justi cation for this choice in the PDF. Detailed instructions for implementation and allowed packages have been provided in hw4.R and in hw4 checker.R.
-
-
-
-
Analysis-2: Comparison of three clustering methods in terms of SSE: Use the method alda calculate sse() for calculating SSE. You have been given code in hw4 checker.R which prints the SSE values for all the three clustering methods (with k = 2 for k-means). Purely based on SSE, which clustering method do you think is the best? Report this answer in your PDF.
-
-
-
6
Homework 4 Last Updated: April 20, 2020
E. Analysis-3: Visual comparison of three clustering methods: I’ve given you code for visualizing the clusters in hw4 checker.R, which plots the data with their cluster assignments for k = 2 for all the three clustering methods. Based on these plots, which clustering method do you think is the best? Report this answer in your PDF.
F. Analysis-4: Visual comparison vs SSE: Is your answer for visual comparison (C) the same as for SSE (D) the same? What conclusions can you draw from this? How do visualizations compare with numeric measures of cluster quality? Are numeric measures always reliable? Explain your answers in 3-4 sentences.
-
PART-2: Classi cation (10 Points)
-
i. Dataset description: You are provided a dataset with 5 variables. Variables x1
x4
refer to the independent variables, while variable class is your class variable. Training data is stored in the le data/classification-train.csv, and test data is stored in the le data/classification-test.csv.
-
Note: The TA will use a di erent version of data/classification-test.csv. The format (independent variables x1 x4, dependent variable class) will be similar, but TA’s le may contain di erent number of data points than the le supplied to you. Please ensure you take this into account, and do not hard code any dimensions.
-
In this exercise, you will apply an Support Vector Machine (SVM) classi cation method to the dataset supplied to you:
-
-
Support Vector Machine: You will use the e1071 library for training and testing an SVM model. In this exercise, you will use tune function to tune svm models under the selected kernel. After tuning, you will report the best model and the prediction results on the test set generated by the best model. Implement the function alda svm() and tune each hyperparameter as instructed in the comments (Note: some of the hyperparam-eters are kernel-speci c). Detailed requirements for implementation, along with relevant functions and allowed packages have been provided in hw4.R and hw4 checker.R.
-
-
-
Compare the kernels: After getting the predictions for each kernels with hyperparam-eter tuning, you will rst use the plotting functions to visualize the prediction results, and compare these results. Which kernel(s) do you think worked best for this dataset? Report the best kernel you identi ed and your reason in your PDF. Also, you are required to implement a function classification compare accuracy, to calculate the accuracy of every kernels. The details are also included in hw4.R and hw4 checker.R.
-
NOTE: Your entire solution hw4.R should not take more than 2 minutes to run. Any solution taking longer will be awarded a zero.
7