Description
Problems
Data Properties (18 points) [Ge Gao]
Answer the following questions about attribute types.
- Classify the following attributes as nominal, ordinal, interval, or ratio. Also classify them as binary¹, discrete, or continuous. If necessary, give a few examples of values that might appear for this attribute to justify your answer. If you make any assumptions in your answer, you must state them explicitly.
  - Diastolic blood pressure measured in units of millimeters of mercury
  - Apartment number (101, 203, 411, etc.)
  - Species of birds (sparrows, warblers, ducks, etc.)
  - A record of whether or not a CSC student has attended a required seminar (Yes or No)
  - Temperature in Kelvin
  - Number of marbles in a bag
  - Income ($)
  - Movie seat number (A1, A2, B1, etc.)
  - Day of the month
  - Project group number (G01, G02, G03, etc.)
- Table 1 is a dataset with 6 attributes describing students. For each of the following statistics/operations, list all of the dataset’s attributes where we can apply that operation: mode, median, Pearson correlation, mean, standard deviation, z-score normalization, binary discretization (into a “high” and “low” group). If you make any assumptions in your answer, you must state them explicitly.
¹ Binary attributes are a special case of discrete attributes.
ADLA - Spring 2020, Homework 1, Last Updated: January 17, 2020
Table 1: Students Dataset
Course | StudentID | GroupID | # of Teammates | Grade | Letter
STAT 501 | 001 | G11 | 3 | 92.1 | A-
STAT 505 | 002 | G13 | 3 | 89.2 | B+
STAT 511 | 005 | G02 | 2 | 93.6 | A
CS 516 | 007 | S03 | 2 | 95.0 | A
CS 522 | 202 | S03 | 3 | 85.3 | B
CS 589 | 203 | G02 | 2 | 82.4 | B-
PSY 501 | 003 | G06 | 3 | 78.2 | C+
PSY 505 | 003 | S02 | 3 | 86.7 | B
PSY 516 | 391 | S07 | 3 | 93.1 | A
PSY 530 | 226 | G08 | 2 | 96.2 | A
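The operations named in the question can be tried out directly in R. The sketch below is only an illustration of what each operation computes, using the Grade and Letter columns of Table 1; the variable names are hypothetical, and splitting at the median is just one assumed way to form the "high" and "low" groups.

```r
# Grade and Letter columns copied from Table 1.
grade <- c(92.1, 89.2, 93.6, 95.0, 85.3, 82.4, 78.2, 86.7, 93.1, 96.2)
letter <- c("A-", "B+", "A", "A", "B", "B-", "C+", "B", "A", "A")

# Mode: the most frequent value, found by counting occurrences.
mode_value <- names(which.max(table(letter)))

# Median, mean, and standard deviation of a numeric column.
grade_median <- median(grade)
grade_mean <- mean(grade)
grade_sd <- sd(grade)

# z-score normalization: subtract the mean, divide by the standard deviation.
grade_z <- (grade - grade_mean) / grade_sd

# Binary discretization (assumed rule): split at the median.
grade_level <- ifelse(grade > grade_median, "high", "low")

mode_value
round(grade_z, 2)
table(grade_level)
```

Whether an operation is meaningful for a given attribute is a separate question from whether R will compute it, which is the point of the exercise.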
- Longitude is a measure of how far east or west you are on the globe, ranging from -180 to 180, with 0 passing through Greenwich, England. Give an example of a situation where it would make sense to treat Longitude as an interval attribute. Then give an example of when it would make sense to treat it as a ratio attribute. Briefly justify each answer.
Data Transformation and Data Quality (12 points) [Ge Gao]
In a blood test, results for 3 measures (A1, A2, A3) were collected for 12 patients. Table 2 shows the measures recorded for each patient after the test. NA indicates missing data.
Table 2: Medical Measures
Patient | A1 | A2 | A3
1 | 233 | 6 | 48
2 | 229 | 11 | 44
3 | 226 | NA | 43
4 | 243 | NA | 41
5 | 249 | 6 | 38
6 | NA | 6 | NA
7 | 253 | 6 | 39
8 | 257 | 7 | 44
9 | 251 | 7 | 44
10 | 251 | NA | 43
11 | 249 | 7 | 41
12 | 253 | 6 | NA
- Evaluate the following strategies for dealing with missing data (NA) in the medical experiment above. Give an advantage and a disadvantage of each strategy, and state which you would choose. Briefly justify your answers in terms of the data above.
  - Strategy 1: Remove the patients with any missing values.
  - Strategy 2: Estimate the missing value of an attribute by taking the average value of the other participants for that attribute.
- Identify a possible outlier in the dataset and justify why it should be considered an outlier. Under what circumstances would it make sense not to consider it an outlier?
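The two strategies can be sketched mechanically in R. This is a minimal illustration on the values of Table 2, not an evaluation of either strategy; the names `measures` and `impute_mean` are hypothetical.

```r
# Table 2 from the text; NA marks a missing measurement.
measures <- data.frame(
  A1 = c(233, 229, 226, 243, 249, NA, 253, 257, 251, 251, 249, 253),
  A2 = c(6, 11, NA, NA, 6, 6, 6, 7, 7, NA, 7, 6),
  A3 = c(48, 44, 43, 41, 38, NA, 39, 44, 44, 43, 41, NA)
)

# Strategy 1: keep only the patients with no missing values.
complete_rows <- measures[complete.cases(measures), ]

# Strategy 2: replace each NA with the mean of the observed values in its column.
impute_mean <- function(column) {
  column[is.na(column)] <- mean(column, na.rm = TRUE)
  column
}
imputed <- as.data.frame(lapply(measures, impute_mean))

nrow(complete_rows)
colSums(is.na(imputed))
```

Comparing `nrow(complete_rows)` with `nrow(measures)` shows how much data the first strategy discards, which is a useful starting point for the discussion.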
Sampling (7 points) [Ge Gao]
- State the sampling method used in the following scenarios and give a reason for your answer. Choose from the following options: simple random sample with replacement, simple random sample without replacement, stratified sampling, progressive/adaptive sampling.
  - Data is collected in an experiment until a predictive model reaches 90% accuracy.
  - To learn the average GPAs of students at NC State University, the population was divided into the following groups: Freshman, Sophomore, Junior, and Senior. 5% of students from each group were selected for the study.
  - From the following population, {1, 1, 2, 2, 5}, a sample {1, 2, 2, 2, 5} was collected.
- The U.S. Congress is made up of 2 chambers: 1) a Senate of 100 members, with 2 members from each state, and 2) a House of Representatives of 435 members, with members from each state proportional to that state’s population. For example, Alaska has 2 Senators and 1 House representative, while Florida has 2 Senators and 27 House representatives. Both the Senate and the House are conducting surveys of their constituents, which they want to reflect the makeup of each chamber. You suggest that they use stratified sampling for this survey, sending surveys to a certain number of people from each state. Each survey will be sent to 1200 participants.
  - Why is stratified sampling appropriate here?
  - For the Senate survey, how many surveys would you recommend sending to people in Alaska?
  - For the House survey, how many surveys would you recommend sending to people in Florida?
  - What are some advantages of the “Senate” approach and the “House” approach to stratified sampling?
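For reference, three of the sampling methods named above can be sketched with base R's `sample` function. The population below is entirely hypothetical (200 made-up student IDs), and the 10% fraction is an arbitrary choice for illustration.

```r
set.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 200 students tagged with a class year.
population <- data.frame(
  id = 1:200,
  year = rep(c("Freshman", "Sophomore", "Junior", "Senior"),
             times = c(80, 60, 40, 20))
)

# Simple random sampling, with and without replacement.
with_replacement <- sample(population$id, size = 20, replace = TRUE)
without_replacement <- sample(population$id, size = 20, replace = FALSE)

# Stratified sampling: draw the same fraction (10%) from every stratum.
strata <- split(population$id, population$year)
stratified <- lapply(strata, function(ids) {
  sample(ids, size = round(0.10 * length(ids)))
})

lengths(stratified)
```

Drawing the same fraction from each stratum keeps the sample's group proportions equal to the population's; drawing the same count from each stratum would instead weight every group equally.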
Dimensionality Reduction (12 points) [Ge Gao]
In this problem, you will analyze the PCA results on the BeijingPM2.5 dataset. Figure 1 shows the eigenvalue scree plot and the principal components from PCA on the scaled raw dataset. The dataset was then normalized using z-scores, and Figure 2 shows the eigenvalue scree plot and the principal components from PCA on the dataset after normalization.
Figure 1: PCA1 on Raw Dataset
Figure 2: PCA2 on Normalized Dataset
Please answer the following questions:
- In Figure 1, what is the most reasonable number of principal components to retain? Briefly justify your choice.
- Based on the table in Figure 1, do you think that performing PCA was useful? Why or why not? If not, what properties of the dataset caused PCA to be less useful?
- In Figure 2, what is the most reasonable number of principal components to retain for dimensionality reduction? Briefly justify your choice. Hint: There may be more than one reasonable answer.
- If you were to use the results in Figure 2 for feature selection, which of the original attributes would you select? Briefly justify your answer.
- Explain the difference between PCA1 and PCA2. Which one would you use for analysis and why?
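The effect of z-score normalization on PCA can be reproduced on any dataset whose attributes have very different scales. The sketch below uses R's built-in USArrests table as a stand-in, since the BeijingPM2.5 data is not included here; the helper name `explained` is hypothetical.

```r
# Stand-in data: the built-in USArrests table, NOT the BeijingPM2.5 data.
pca_raw <- prcomp(USArrests, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components;
# dividing by their sum gives the proportion of variance explained.
explained <- function(pca) pca$sdev^2 / sum(pca$sdev^2)

round(explained(pca_raw), 3)
round(explained(pca_scaled), 3)
```

On unscaled data, the attribute with the largest variance tends to dominate the first component, so the two scree plots can look very different even though the underlying data is the same.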
Discretization (12 points) [Ge Gao]
Consider the following dataset:
PATIENT | CHLORIDE | POTASSIUM | DATE | NORMAL
1 | 105 | 4.1 | 01/17/2005 | yes
2 | 97 | 3.8 | 01/17/2005 | yes
3 | 91 | 3.2 | 01/17/2005 | no
4 | 104 | 4.1 | 01/18/2005 | yes
5 | 111 | 5.6 | 01/18/2005 | no
6 | 108 | 3.8 | 01/18/2005 | yes
7 | 95 | 2.7 | 01/18/2005 | no
8 | 97 | 4.6 | 01/19/2005 | yes
9 | 99 | 3.9 | 01/19/2005 | yes
10 | 97 | 3.5 | 02/02/2005 | yes
11 | 98 | 4.7 | 02/02/2005 | yes
12 | 102 | 3.7 | 02/02/2005 | yes
13 | 109 | 6.0 | 02/02/2005 | no
14 | 90 | 6.5 | 02/04/2005 | no
15 | 103 | 5.0 | 02/04/2005 | yes
- Discretize the attribute CHLORIDE by binning it into 5 equal-width intervals (the range of each interval should be the same). Show your work.
- Discretize the attribute POTASSIUM by binning it into 5 equal-depth intervals (the number of items in each interval should be the same). Show your work.
- Consider the following new approach to discretizing a numeric attribute: Given the mean ($\bar{x}$) and the standard deviation ($\sigma$) of the attribute values, bin the attribute values into the following intervals: $[\bar{x} + (k - 1)\sigma,\ \bar{x} + k\sigma)$, for all integer values $k$, i.e. $k = \ldots, -4, -3, -2, -1, 0, 1, 2, \ldots$ Assume that the mean of the attribute CHLORIDE above is $\bar{x} = 100$ and that the standard deviation is $\sigma = 6$. Discretize CHLORIDE using this new approach. Show your work.
- For each of the above discretization approaches, explain its advantages and disadvantages and when you would want to use it.
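The three binning rules can be sketched in a few lines of R. The vector `values` and the bin count `n_bins` below are hypothetical, chosen only to keep the arithmetic easy to follow, and the rank-based equal-depth rule is one assumed way to break the data into equally sized groups.

```r
# Hypothetical values, not taken from the dataset above.
values <- c(2, 4, 5, 7, 8, 10, 11, 13, 16, 20)
n_bins <- 3

# Equal-width: split the range [min, max] into bins of identical width.
width_breaks <- seq(min(values), max(values), length.out = n_bins + 1)
equal_width <- cut(values, breaks = width_breaks, include.lowest = TRUE)

# Equal-depth: rank the values and give each bin (nearly) the same count.
ranks <- rank(values, ties.method = "first")
equal_depth <- ceiling(ranks * n_bins / length(values))

# Mean/standard-deviation bins: value v falls in bin k when
# mean + (k - 1) * sd <= v < mean + k * sd, so k = floor((v - mean) / sd) + 1.
sd_bin <- floor((values - mean(values)) / sd(values)) + 1

table(equal_width)
table(equal_depth)
table(sd_bin)
```

Note that equal-width bins can end up with very uneven counts, while equal-depth bins can have very uneven widths; the third rule produces bins whose number depends on how spread out the data is.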
Distance Metrics (14 points) [Yang Shi]
- A true distance metric has three properties: a) positive definiteness, b) symmetry, c) triangle inequality. Now consider the following distance functions:
  - Euclidean distance between two numeric vectors
  - Hamming distance between two numeric vectors
  - Cosine distance between two numeric vectors, defined as 1 minus the cosine similarity: $d(A, B) = 1 - \frac{A \cdot B}{\|A\| \, \|B\|}$
  For each distance function, state whether it has each property. If so, give a short explanation of why. If not, give a counterexample, including two pairs of items, the distance between them, and how it violates the given property.
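When hunting for counterexamples, it can help to test candidate vectors numerically. The sketch below is one possible helper, with hypothetical names (`hamming`, `check_triple`) and hypothetical vectors; passing the check on one triple proves nothing in general, but a single failing triple is a valid counterexample.

```r
# Hamming distance: number of positions where two equal-length vectors differ.
hamming <- function(p, q) sum(p != q)

# TRUE when d satisfies symmetry and the triangle inequality on this one triple.
check_triple <- function(d, a, b, c) {
  symmetric <- isTRUE(all.equal(d(a, b), d(b, a)))
  triangle <- d(a, c) <= d(a, b) + d(b, c) + 1e-12  # small tolerance for rounding
  symmetric && triangle
}

# Hypothetical vectors used only for illustration.
a <- c(1, 0, 1, 1)
b <- c(1, 1, 0, 1)
c_vec <- c(0, 1, 0, 0)

check_triple(hamming, a, b, c_vec)
```

Any function of two vectors can be passed in place of `hamming`, so the same helper can be reused for each distance function under study.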
- CSC522: Required / CSC422: Extra Credit (6 points)
  A 1-nearest-neighbor (1-NN) classifier labels a new item $y$ in the test dataset $Y$ by finding the closest item $x$ in the training dataset $X$ and returning the label of $x$.
  Assume we have a distance function $d$ that is very expensive to calculate for any $d(x, y)$ where $x \in X$ and $y \in Y$. However, because we can pre-calculate the distance between any two items in $X$, $d(x_i, x_j)$ is relatively cheap to calculate for any $x_i, x_j \in X$.
  To classify a new item $y$, our 1-NN algorithm will have to make $|X|$ comparisons between $y$ and some $x_i$, since it has to compare $y$ to every item $x_i \in X$ to find $y$’s closest neighbor. However, if $d$ is a true distance metric, we may be able to reduce the number of comparisons by skipping some of them.
  - What property of distance metrics allows us to skip some $d(x_i, y)$ comparisons in the 1-NN algorithm?
  - What strategy could we use to reduce the number of $d(x_i, y)$ comparisons? Give one example with values for $y$, $x_1$, and $x_2$ that illustrates that strategy. (Hint: it may help to draw out the positions of $x_1$, $x_2$, and $y$ in a 2D space.)
  - Does this strategy reduce the number of $d(x_i, y)$ comparisons in the best case? What about the worst case?
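As a baseline for the questions above, the brute-force 1-NN procedure described in the problem can be sketched as follows. The function name `nearest_neighbor_label`, the Euclidean helper, and the training points are all hypothetical; this version deliberately makes all $|X|$ comparisons and does not attempt to skip any.

```r
# Brute-force 1-NN: compares y against every training item,
# so it makes one call to d per row of train_x.
nearest_neighbor_label <- function(train_x, train_labels, y, d) {
  distances <- apply(train_x, 1, function(x) d(x, y))
  train_labels[which.min(distances)]
}

# Euclidean distance, used here only as an example of a true metric.
euclid <- function(p, q) sqrt(sum((p - q)^2))

# Hypothetical 2D training set with two labels.
train_x <- rbind(c(0, 0), c(1, 0), c(5, 5), c(6, 5))
train_labels <- c("a", "a", "b", "b")

nearest_neighbor_label(train_x, train_labels, c(4, 4), euclid)
```

Counting how many times `d` is called in this baseline gives the number that any pruning strategy has to beat.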
Similarity, Dissimilarity and Normalization (25 points) [Yang Shi]
R Programming Submission Instructions
- Make sure you clearly list each team member’s name and Unity ID at the top of your submission.
- Your code should be named hw1.R. Add this file, along with a README, to the zip file mentioned on the first page.
- Failure to follow naming conventions or the programming-related instructions specified below may result in your submission not being graded.
- If the instructions are unclear, please post your questions on Piazza.
Programming-related instructions
- Carefully read which function names the instructor has requested. In this homework and the following ones, if your code does not follow the requested naming format, you will not receive credit.
- For each function, both the input and output formats are provided in hw1.R. Function calls are specified in hw1_checker.R. Please ensure that you follow the correct input and output formats. Once again, if you do not follow the requested format, you will not receive credit. The comments in hw1.R clearly state which functions you need to implement.
- You are free to write your own functions to handle sub-tasks, but the TA will only call the functions he has requested. If the requested functions do not run, do not return the correct values, or do not finish running in the specified time, you will not receive full credit.
- DO NOT set the working directory (setwd function) or clear memory (rm(list=ls(all=T))) in your code. The TA(s) will do so in their own autograder.
- The TA will have an autograder which will first run source('hw1.R'), then call each of the functions requested in the homework and compare the results with the correct solution.
- Your code should be clearly documented.
- To test your code, step through the hw1_checker.R file. If you update your code, make sure to run source('./hw1.R') again to update your function definitions. You can also check the “Source on save” option in RStudio to do this automatically on save.
- You can also check your functions manually by running them in the console with smaller inputs. Calculating the distances usually takes no longer than 20 seconds.
Question
Dataset You are given the following dataset(s):
- Iris dataset [1]. You are provided a subset of the Iris dataset. Each line in iris.csv is a five-element vector representing a single sample from the dataset. The first four columns are the attributes of the sample: “sepal length”, “sepal width”, “petal length”, and “petal width”. The fifth column is the class value, specifying which class of iris plant the sample is from. In total, there are 60 sample points.
Part 1: Distance Measurement
Before doing analysis, you will need to look through the data file and write a function named read_data to read the dataset in as a dataframe.
1a) Using the data provided in iris.csv, you are to implement the distance/similarity measurements defined below. The inputs will be two vectors of the same length:
(a) euclidean: $\text{euclidean}(P, Q) = \sqrt{\sum_i (P_i - Q_i)^2}$, where $P$ and $Q$ are vectors of equal length.
(b) cosine: $\text{cosine}(P, Q) = 1 - \frac{\sum_i P_i Q_i}{\|P\| \, \|Q\|}$, where $P$ and $Q$ are vectors of equal length, and $\|P\| = \sqrt{\sum_i P_i^2}$.
(c) $L_\infty$: $L_\infty(P, Q) = \max_i |P_i - Q_i|$, where $P$ and $Q$ are vectors of equal length.
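As a quick sanity check for an implementation, the three measures can be worked by hand on a small pair of vectors. The vectors $P = (3, 4)$ and $Q = (4, 3)$ below are hypothetical and were chosen only because the arithmetic is short.

```latex
\begin{align*}
P &= (3, 4), \qquad Q = (4, 3) \\
\text{euclidean}(P, Q) &= \sqrt{(3 - 4)^2 + (4 - 3)^2} = \sqrt{2} \approx 1.414 \\
\text{cosine}(P, Q) &= 1 - \frac{3 \cdot 4 + 4 \cdot 3}{\sqrt{3^2 + 4^2}\,\sqrt{4^2 + 3^2}}
  = 1 - \frac{24}{25} = 0.04 \\
L_\infty(P, Q) &= \max\left(|3 - 4|,\ |4 - 3|\right) = 1
\end{align*}
```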
1b) Your goal is to investigate how useful each distance function is in telling apart flowers of different species. Ideally, a distance measure should be large for flowers of different species and relatively small for flowers of the same species.
To help you with this task, we have provided you with a function: inter_intra_species_dist. This function calculates the distance between each pair of flowers in the provided iris dataset, using the specified distance function. It then averages the following properties for each flower species:
- mean_intra_dis: The average distance between flowers of this (same) species.
- mean_inter_dis: The average distance between flowers of this species and other (different) species.
- ratio: The ratio mean_intra_dis / mean_inter_dis.
In your PDF report, use this function to answer the following question: Which of the distance metrics that you implemented is most useful for differentiating iris species? Why do you think it is most useful?
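To make the intra/inter ratio concrete, here is a hypothetical re-implementation of the idea. It is not the provided function: it pools all species into one overall ratio instead of reporting per-species values, and it uses a 30-row slice of R's built-in iris data as a stand-in for iris.csv. The names `intra_inter_ratio` and `euclid` are assumptions.

```r
# Hypothetical sketch; the provided function may differ in its details.
intra_inter_ratio <- function(features, species, d) {
  n <- nrow(features)
  # Pairwise distance matrix computed with the supplied distance function.
  pair_dist <- Vectorize(function(i, j) d(features[i, ], features[j, ]))
  dist_matrix <- outer(1:n, 1:n, pair_dist)
  same <- outer(species, species, "==")
  off_diagonal <- row(dist_matrix) != col(dist_matrix)
  mean_intra <- mean(dist_matrix[same & off_diagonal])
  mean_inter <- mean(dist_matrix[!same])
  c(mean_intra_dis = mean_intra,
    mean_inter_dis = mean_inter,
    ratio = mean_intra / mean_inter)
}

euclid <- function(p, q) sqrt(sum((p - q)^2))

# Built-in iris data: 10 rows per species, standing in for iris.csv.
subset_rows <- c(1:10, 51:60, 101:110)
features <- as.matrix(iris[subset_rows, 1:4])
species <- as.character(iris$Species[subset_rows])

result <- intra_inter_ratio(features, species, euclid)
result
```

A smaller ratio means same-species flowers sit closer together relative to flowers of different species, which is the property the question asks you to compare across distance functions.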
Part 2: Principal Component Analysis
In this part, you will implement a function, principal_component_analysis, to calculate the principal components (PCs) of a dataset. You are encouraged to leverage the existing R function prcomp. The input of this function is an iris dataframe; note that the final column is a nominal value, which cannot be included in the PCA calculation. The output of the function is a vector of the weights (the eigenvector) of the first principal component. Hint: Use ?prcomp for more information on how to use the function.
After calculating the PC, the next step is to write a function, principal_component_calculation, to calculate a PC value given a data object and the component weights.
Part 3: Principal Component Distance
We want to see whether the first PC meaningfully captures the differences between iris species. Implement the pc1_distance function, which takes in two data objects (vectors) and a set of PC weights, and returns the distance between those two vectors in the dimension of the first PC, i.e. the absolute difference between their PC values.
Part 4: Comparing Distances
Now we want to compare our PC1 distance to traditional Euclidean distance. In your PDF, use the inter_intra_species_dist function to answer the following question: Which of the two distance metrics (euclidean, PC1) is most useful for differentiating iris species? Why do you think it is most useful?
Note: hw1.R has already been provided for you, with the function definitions. Complete all the functions requested in hw1.R. Please note that hw1_checker.R is there to help you understand how to run the code, and includes necessary implementations that you are not required to write. DO NOT submit hw1_checker.R. Also, please note that the TA may be using a dataset different from yours, so do not hard-code your solutions.
Also, it is recommended that you read up on vectorized operations in R. Any submission that takes more than 5 minutes to run on a standard university machine (32 GB RAM, i7 processor) will receive a zero grade. Also, please ensure that all the libraries are correctly loaded using the require method.
Allowed Packages: R Base, plyr. No other packages are allowed.
References
[1] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.