Automated Learning and Data Analysis Solution


Problems

  1. Data Properties (18 points) [Ge Gao]

Answer the following questions about attribute types.

    1. Classify the following attributes as nominal, ordinal, interval or ratio. Also classify them as binary¹, discrete or continuous. If necessary, give a few examples of values that might appear for this attribute to justify your answer. If you make any assumptions in your answer, you must state them explicitly.

      1. Diastolic blood pressure measured in units of millimeters of mercury

      1. Apartment number (101, 203, 411, etc.)

      1. Species of birds (sparrows, warblers, ducks, etc.)

      1. A record of whether or not a CSC student has attended a required seminar (Yes or No)

      1. Temperature in Kelvin

      1. Number of marbles in a bag

      1. Income ($)

      1. Movie seat number (A1, A2, B1, etc.)

      1. Day of the month

      1. Project group number (G01, G02, G03, etc.)

    1. Table 1 is a dataset with 6 attributes describing students. For each of the following statistics/operations, list all of the dataset’s attributes to which that operation can be applied: mode, median, Pearson correlation, mean, standard deviation, z-score normalization, binary discretization (into a “high” and “low” group). If you make any assumptions in your answer, you must state them explicitly.

  ¹ Binary attributes are a special case of discrete attributes.


ADLA – Spring 2020

Homework 1

Last Updated: January 17, 2020

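Table 1: Students Dataset

| Course   | StudentID | GroupID | # of Teammates | Grade | Letter |
|----------|-----------|---------|----------------|-------|--------|
| STAT 501 | 001       | G11     | 3              | 92.1  | A-     |
| STAT 505 | 002       | G13     | 3              | 89.2  | B+     |
| STAT 511 | 005       | G02     | 2              | 93.6  | A      |
| CS 516   | 007       | S03     | 2              | 95.0  | A      |
| CS 522   | 202       | S03     | 3              | 85.3  | B      |
| CS 589   | 203       | G02     | 2              | 82.4  | B-     |
| PSY 501  | 003       | G06     | 3              | 78.2  | C+     |
| PSY 505  | 003       | S02     | 3              | 86.7  | B      |
| PSY 516  | 391       | S07     | 3              | 93.1  | A      |
| PSY 530  | 226       | G08     | 2              | 96.2  | A      |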
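As a point of reference for the operations named above, the following is a minimal R sketch of z-score normalization and binary discretization applied to the Grade column of Table 1. It assumes Grade is treated as a numeric attribute, and it uses the median as the high/low cut point; both choices are illustrative assumptions, not something the assignment prescribes.

```r
# Grade column of Table 1 (assumed here to be a numeric attribute).
grade <- c(92.1, 89.2, 93.6, 95.0, 85.3, 82.4, 78.2, 86.7, 93.1, 96.2)

# z-score normalization: subtract the mean, divide by the standard deviation.
z <- (grade - mean(grade)) / sd(grade)

# Binary discretization into "high" and "low", split at the median
# (the cut point is an assumption made for illustration).
grade_level <- ifelse(grade >= median(grade), "high", "low")

print(round(z, 2))
print(table(grade_level))
```

Both operations rely on differences between values being meaningful, which is why they do not carry over directly to identifier-like columns.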

    1. Longitude is a measure of how far east or west you are on the globe, ranging from -180 to 180, with 0 passing through Greenwich, England. Give an example of a situation where it would make sense to treat longitude as an interval attribute. Then give an example of when it would make sense to treat it as a ratio attribute. Briefly justify each answer.

  1. Data Transformation and Data Quality (12 points) [Ge Gao]

In a blood test, the results of 3 measures (A1, A2, A3) were collected for 12 patients. Table 2 shows the measures recorded for each patient after the test. NA is used to indicate missing data.

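Table 2: Medical Measures

| Patient | A1  | A2 | A3 |
|---------|-----|----|----|
| 1       | 233 | 6  | 48 |
| 2       | 229 | 11 | 44 |
| 3       | 226 | NA | 43 |
| 4       | 243 | NA | 41 |
| 5       | 249 | 6  | 38 |
| 6       | NA  | 6  | NA |
| 7       | 253 | 6  | 39 |
| 8       | 257 | 7  | 44 |
| 9       | 251 | 7  | 44 |
| 10      | 251 | NA | 43 |
| 11      | 249 | 7  | 41 |
| 12      | 253 | 6  | NA |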

    1. Evaluate the following strategies for dealing with missing data (NA) from the medical experiment above. Give an advantage and a disadvantage of each strategy, and state which you would choose. Briefly justify your answers in terms of the data above.

      1. Strategy 1: Remove the patients with any missing values.

      1. Strategy 2: Estimate the value of missing data for an attribute by taking the average value of other participants for that attribute.

    1. Identify a possible outlier in the dataset and justify why it should be considered an outlier. Under what circumstances would it make sense not to consider it an outlier?
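To make the two missing-data strategies above concrete, here is a minimal R sketch that applies both to the values of Table 2. The data frame name `measures` and the helper `impute_mean` are hypothetical names introduced only for this illustration.

```r
# Table 2 as a data frame; NA marks a missing measurement.
measures <- data.frame(
  patient = 1:12,
  A1 = c(233, 229, 226, 243, 249, NA, 253, 257, 251, 251, 249, 253),
  A2 = c(6, 11, NA, NA, 6, 6, 6, 7, 7, NA, 7, 6),
  A3 = c(48, 44, 43, 41, 38, NA, 39, 44, 44, 43, 41, NA)
)

# Strategy 1: drop every patient that has at least one missing value.
complete_only <- measures[complete.cases(measures), ]

# Strategy 2: replace each NA with the mean of the observed values
# of the same attribute (hypothetical helper).
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
imputed <- measures
imputed[c("A1", "A2", "A3")] <- lapply(measures[c("A1", "A2", "A3")], impute_mean)

print(nrow(complete_only))   # number of patients that survive Strategy 1
print(imputed)
```

Comparing the number of rows kept by Strategy 1 with the values filled in by Strategy 2 is one way to ground the advantage/disadvantage discussion in the data.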

  1. Sampling (7 points) [Ge Gao]

    1. State the sampling method used in the following scenarios and give a reason for your answer. Choose from the following options: simple random sample with replacement, simple random sample without replacement, stratified sampling, progressive/adaptive sampling.

      1. Data is collected in an experiment until a predictive model reaches 90% accuracy.

      1. To learn the average GPAs of students at NC State University, the population was divided into the following groups: Freshman, Sophomore, Junior, and Senior. 5% of students from each group were selected for the study.


      1. From the following population, {1, 1, 2, 2, 5}, a sample {1, 2, 2, 2, 5} was collected.

    1. The U.S. Congress is made up of 2 chambers: 1) a Senate of 100 members, with 2 members from each state, and 2) a House of Representatives of 435 members, with members from each state proportional to that state’s population. For example, Alaska has 2 Senators and 1 House representative, while Florida has 2 Senators and 27 House representatives. Both the Senate and the House are conducting surveys of their constituents, which they want to reflect the makeup of each chamber. You suggest that they use stratified sampling for this survey, sending surveys to a certain number of people from each state. Each survey will be sent to 1200 participants.

      1. Why is stratified sampling appropriate here?

      1. For the Senate survey, how many surveys would you recommend sending to people in Alaska?

      1. For the House survey, how many surveys would you recommend sending to people in Florida?

      1. What are some advantages of the “Senate” approach and the “House” approach to stratified sampling?
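The arithmetic behind the two allocation schemes can be sketched in R as follows. This is one plausible reading of the problem, assuming the "Senate" approach gives every state an equal share and the "House" approach allocates surveys in proportion to House seats; all variable names are hypothetical.

```r
total_surveys <- 1200

# "Senate" approach (assumed: equal allocation, 50 states as strata).
n_states <- 50
senate_per_state <- total_surveys / n_states

# "House" approach (assumed: allocation proportional to House seats).
house_seats_total <- 435
florida_seats <- 27
florida_surveys <- total_surveys * florida_seats / house_seats_total

print(senate_per_state)          # surveys per state under equal allocation
print(round(florida_surveys))    # Florida's share under proportional allocation
```

Under proportional allocation the result is generally not a whole number, so some rounding rule has to be chosen and stated.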

  1. Dimensionality Reduction (12 points) [Ge Gao]

In this problem, you will analyze the PCA results on the BeijingPM2.5 dataset. Figure 1 shows the eigenvalue scree plot and the principal components from PCA on the scaled raw dataset. The dataset was then normalized using z-scores, and Figure 2 shows the eigenvalue scree plot and the principal components from PCA on the dataset after normalization.

Figure 1: PCA1 on Raw Dataset


Figure 2: PCA2 on Normalized Dataset

Please answer the following questions:

    1. In Figure 1, what is the most reasonable number of principal components to retain? Briefly justify your choice.

    1. Based on the table in Figure 1, do you think that performing PCA was useful? Why or why not? If not, what properties of the dataset caused PCA to be less useful?

    1. In Figure 2, what is the most reasonable number of principal components to retain for dimensionality reduction? Briefly justify your choice. Hint: There may be more than one reasonable answer.

    1. If you were to use the results in Figure 2 for feature selection, which of the original attributes would you select? Briefly justify your answer.

    1. Explain the difference between PCA1 and PCA2. Which one would you use for analysis and why?

  1. Discretization (12 points) [Ge Gao] Consider the following dataset:


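| PATIENT | CHLORIDE | POTASSIUM | DATE       | NORMAL |
|---------|----------|-----------|------------|--------|
| 1       | 105      | 4.1       | 01/17/2005 | yes    |
| 2       | 97       | 3.8       | 01/17/2005 | yes    |
| 3       | 91       | 3.2       | 01/17/2005 | no     |
| 4       | 104      | 4.1       | 01/18/2005 | yes    |
| 5       | 111      | 5.6       | 01/18/2005 | no     |
| 6       | 108      | 3.8       | 01/18/2005 | yes    |
| 7       | 95       | 2.7       | 01/18/2005 | no     |
| 8       | 97       | 4.6       | 01/19/2005 | yes    |
| 9       | 99       | 3.9       | 01/19/2005 | yes    |
| 10      | 97       | 3.5       | 02/02/2005 | yes    |
| 11      | 98       | 4.7       | 02/02/2005 | yes    |
| 12      | 102      | 3.7       | 02/02/2005 | yes    |
| 13      | 109      | 6.0       | 02/02/2005 | no     |
| 14      | 90       | 6.5       | 02/04/2005 | no     |
| 15      | 103      | 5.0       | 02/04/2005 | yes    |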

  1. Discretize the attribute CHLORIDE by binning it into 5 equal-width intervals (the range of each interval should be the same). Show your work.

  1. Discretize the attribute POTASSIUM by binning it into 5 equal-depth intervals (the number of items in each interval should be the same). Show your work.

  1. Consider the following new approach to discretizing a numeric attribute: Given the mean ($\bar{x}$) and the standard deviation ($\sigma$) of the attribute values, bin the attribute values into the following intervals:

     $[\bar{x} + (k-1)\sigma,\ \bar{x} + k\sigma)$ for all integer values $k$, i.e. $k = \ldots, -4, -3, -2, -1, 0, 1, 2, \ldots$

     Assume that the mean of the attribute CHLORIDE above is $\bar{x} = 100$ and that the standard deviation is $\sigma = 6$. Discretize CHLORIDE using this new approach. Show your work.

    1. For each of the above discretization approaches, explain its advantages and disadvantages and when you would want to use it.
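The three discretization schemes described above can be sketched in base R as follows. This is an illustrative sketch, not a model answer: it assumes left-closed equal-width intervals with the maximum placed in the last bin, and it breaks ties in the equal-depth scheme by order of appearance. All variable names are hypothetical.

```r
chloride  <- c(105, 97, 91, 104, 111, 108, 95, 97, 99, 97, 98, 102, 109, 90, 103)
potassium <- c(4.1, 3.8, 3.2, 4.1, 5.6, 3.8, 2.7, 4.6, 3.9, 3.5, 4.7, 3.7, 6.0, 6.5, 5.0)
n_bins <- 5

# Equal-width: every interval spans (max - min) / n_bins units.
# pmin() puts the maximum value into the last bin instead of a sixth one.
width <- (max(chloride) - min(chloride)) / n_bins
width_bin <- pmin(floor((chloride - min(chloride)) / width) + 1, n_bins)

# Equal-depth: sort by value and put the same number of items in each bin.
per_bin <- length(potassium) / n_bins
depth_bin <- ceiling(rank(potassium, ties.method = "first") / per_bin)

# Mean/standard-deviation bins: v falls in [x_bar + (k-1)*sigma, x_bar + k*sigma)
# exactly when k = floor((v - x_bar) / sigma) + 1.
x_bar <- 100
sigma <- 6
k <- floor((chloride - x_bar) / sigma) + 1

print(data.frame(chloride, width_bin, k))
print(data.frame(potassium, depth_bin))
```

Printing the bin index next to each value makes it easy to check the hand-computed intervals against the code.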

  1. Distance Metrics (14 points) [Yang Shi]

    1. A true distance metric has three properties: a) positive definiteness, b) symmetry, c) triangle inequality. Now consider the following distance functions:

      1. Euclidean distance between two numeric vectors

      1. Hamming distance between two numeric vectors

      1. Cosine distance between two numeric vectors, defined as 1 minus the cosine similarity: $d(A, B) = 1 - \frac{A \cdot B}{\|A\| \, \|B\|}$

For each distance function, state whether it has each property. If so, give a short explanation of why. If not, give a counterexample, including two pairs of items, the distance between them, and how it violates the given property.

  1. CSC522: Required / CSC422: Extra Credit (6 points)

A 1-nearest-neighbor (1-NN) classifier labels a new item $y$ in the test dataset $Y$ by finding the closest item $x$ in the training dataset $X$, and returning the label of $x$.

Assume we have a distance function $d$ that is very expensive to calculate for any $d(x, y)$ where $x \in X$ and $y \in Y$. However, because we can pre-calculate the distance between any two items in $X$, $d(x_i, x_j)$ is relatively cheap to calculate for any $x_i, x_j \in X$.

To classify a new item $y$, our 1-NN algorithm will have to make $|X|$ comparisons between $y$ and some $x_i$, since it has to compare $y$ to every item $x_i \in X$ to find $y$’s closest neighbor. However, if $d$ is a true distance metric, we may be able to reduce the number of comparisons we have to make by skipping some of them.

  1. What property of distance metrics allows us to skip some $d(x_i, y)$ comparisons in the 1-NN algorithm?

  1. What strategy could we use to reduce the number of $d(x_i, y)$ comparisons? Give one example with values for $y$, $x_1$, and $x_2$ that illustrates that strategy. (Hint: it may help to draw out the positions of $x_1$, $x_2$ and $y$ in a 2D space.)


    1. Does this strategy reduce the number of $d(x_i, y)$ comparisons in the best case? What about the worst case?
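One way such a skipping strategy could look is sketched below in R. It is an illustration under stated assumptions, not the expected answer: it uses Euclidean distance as a stand-in for the expensive metric, takes the precomputed training distances as a matrix `D`, and the names `euclid` and `one_nn_pruned` are hypothetical. The lower bound $d(x_i, y) \ge |d(x_i, x_j) - d(x_j, y)|$ used in the loop follows from the triangle inequality.

```r
# Stand-in for the expensive distance (assumption: Euclidean).
euclid <- function(a, b) sqrt(sum((a - b)^2))

one_nn_pruned <- function(X, D, y, d = euclid) {
  known <- rep(NA_real_, nrow(X))   # d(x_i, y) values computed so far
  best <- Inf
  best_i <- NA_integer_
  n_calls <- 0
  for (i in seq_len(nrow(X))) {
    done <- which(!is.na(known))
    # Lower bound on d(x_i, y) from every x_j already compared with y.
    lower <- if (length(done) > 0) max(abs(D[i, done] - known[done])) else 0
    if (lower >= best) next         # x_i cannot beat the current best: skip it
    known[i] <- d(X[i, ], y)        # the expensive call
    n_calls <- n_calls + 1
    if (known[i] < best) {
      best <- known[i]
      best_i <- i
    }
  }
  list(index = best_i, distance = best, expensive_calls = n_calls)
}

X <- rbind(c(0, 0), c(10, 0), c(0, 10))   # training items x1, x2, x3
D <- as.matrix(dist(X))                   # cheap, precomputed d(x_i, x_j)
res <- one_nn_pruned(X, D, y = c(1, 0))
print(res)
```

In this toy example the first training item happens to be very close to `y`, so the bound rules out the other two; with a less fortunate ordering or layout, fewer (possibly zero) comparisons would be skipped.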

  1. Similarity, Dissimilarity and Normalization (25 points) [Yang Shi]

R Programming Submission Instructions

Make sure you clearly list each team member’s name and Unity ID at the top of your submission. Your code should be named hw1.R. Add this file, along with a README, to the zip file mentioned on the first page.

Failure to follow naming conventions or the programming-related instructions specified below may result in your submission not being graded.

If the instructions are unclear, please post your questions on Piazza.

Programming related instructions

Carefully read which function names have been requested by the instructor. In this homework and the following ones, if your code does not follow the naming format requested by the instructor, you will not receive credit.

For each function, both the input and output formats are provided in hw1.R. Function calls are specified in hw1_checker.R. Please ensure that you follow the correct input and output formats. Once again, if you do not follow the format requested, you will not receive credit. The comments in hw1.R clearly state which functions you need to implement.

You are free to write your own functions to handle sub-tasks, but the TA will only call the functions he has requested. If the requested functions do not run, do not return the correct values, or do not finish running in the specified time, you will not receive full credit.

DO NOT set working directory (setwd function) or clear memory (rm(list=ls(all=T))) in your code. TA(s) will do so in their own auto grader.

The TA will have an autograder which will first run source(hw1.R), then call each of the functions requested in the homework and compare with the correct solution.

Your code should be clearly documented.

To test your code, step through the hw1_checker.R file. If you update your code, make sure to run source(‘./hw1.R’) again to update your function definitions. You can also check the “Source on save” option in RStudio to do this automatically on save.

You can also check your functions manually by running them in the console with smaller inputs. Calculating the distances usually takes no longer than 20 seconds.

Question

Dataset You are given the following dataset(s):

Iris dataset [1]. You are provided a subset of the Iris dataset. Each line in iris.csv represents a five-element vector, representing a single sample from your dataset. The values in the first four columns are the attributes of the sample, respectively “sepal length”, “sepal width”, “petal length”, “petal width”. The fifth column is the class value, specifying which class of iris plant the sample is from. In total, there are 60 sample points.

Part 1: Distance Measurement Before doing analysis, you will need to look through the data file, and write a function named read_data to read the dataset in as a dataframe.

1a) Using the data provided in iris.csv, you are to implement the distance/similarity measurements defined below. The inputs will be two vectors of the same length:


(a) euclidean: $\text{euclidean}(P, Q) = \sqrt{\sum_i (P_i - Q_i)^2}$, where $P$ and $Q$ are vectors of equal length.

(b) cosine: $\text{cosine}(P, Q) = 1 - \frac{\sum_i P_i Q_i}{\|P\| \, \|Q\|}$, where $P$ and $Q$ are vectors of equal length, and $\|P\| = \sqrt{\sum_i P_i^2}$.

(c) $L_\infty$: $L_\infty(P, Q) = \max_i |P_i - Q_i|$, where $P$ and $Q$ are vectors of equal length.
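A minimal R sketch of the three measures listed above, written with vectorized operations, is shown below. The names `euclidean` and `cosine` come from the assignment text; `l_inf` is a hypothetical name chosen here, and the exact names and input/output formats required for grading are whatever hw1.R specifies.

```r
# Euclidean distance: square root of the summed squared differences.
euclidean <- function(P, Q) {
  sqrt(sum((P - Q)^2))
}

# Cosine distance: 1 minus the cosine similarity of P and Q.
cosine <- function(P, Q) {
  1 - sum(P * Q) / (sqrt(sum(P^2)) * sqrt(sum(Q^2)))
}

# L-infinity distance: largest absolute coordinate difference.
# (The function name is an assumption; check hw1.R for the required one.)
l_inf <- function(P, Q) {
  max(abs(P - Q))
}

print(euclidean(c(0, 0), c(3, 4)))
print(cosine(c(1, 0), c(0, 1)))
print(l_inf(c(1, 5), c(4, 3)))
```

Each function works on whole vectors at once, which avoids explicit loops over coordinates.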

1b) Your goal is to investigate how useful each distance function is in telling apart flowers of different species. Ideally, a distance measure should be large for flowers of different species, and relatively smaller for flowers of the same species.

To help you with this task, we have provided you with a function: inter_intra_species_dist. This function calculates the distance between each pair of flowers in the provided iris dataset, using the specified distance function. It then averages the following properties for each flower species:

  1. mean_intra_dis: The average distance between flowers of this (same) species.

  1. mean_inter_dis: The average distance between flowers of this species and other (different) species.

  1. ratio: The ratio of mean_intra_dis / mean_inter_dis

In your PDF report, use this function to answer the following question: Which of the distance metrics that you implemented is most useful for differentiating iris species? Why do you think it is most useful?

Part 2: Principal Component Analysis In this part, you will need to implement a function, principal_component_analysis, to calculate the principal components (PCs) of a dataset. You are encouraged to leverage the existing R function prcomp. The input of this function is an iris dataframe; note that the final column is a nominal value, which cannot be included in the PCA calculation. The output of the function is a vector of the weights (the eigenvector) of the first principal component. Hint: Use ?prcomp for more information on how to use the function.

After calculating the PC, the next step is to write a function, principal_component_calculation, to calculate a PC value given a data object and the component weights.
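The two functions described in Part 2 could be sketched as follows. This is a hedged illustration: it uses R's built-in `iris` data as a stand-in for the provided iris.csv subset, leaves `prcomp` at its defaults (centering, no scaling), and computes the PC value as a plain dot product without centering the data object. The argument names `data`, `p`, and `component_weights` are assumptions; the required signatures are the ones given in hw1.R.

```r
principal_component_analysis <- function(data) {
  numeric_part <- data[, -ncol(data)]   # drop the final, nominal class column
  pca <- prcomp(numeric_part)           # defaults assumed: center = TRUE, scale. = FALSE
  as.vector(pca$rotation[, 1])          # weights (eigenvector) of the first PC
}

principal_component_calculation <- function(p, component_weights) {
  # PC value taken as the weighted sum of the attributes (an assumption).
  sum(as.numeric(p) * component_weights)
}

w <- principal_component_analysis(iris)  # built-in iris used as a stand-in
pc_value <- principal_component_calculation(iris[1, 1:4], w)
print(w)
print(pc_value)
```

Whether to scale the attributes before PCA changes the weights, so that choice should be made deliberately rather than left to the default.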

Part 3: Principal Component Distance

We want to see whether the first PC meaningfully captures the differences between iris species. Implement the pc1_distance function, which takes in two data objects (vectors) and a set of PC weights, and returns the distance between those two vectors in the dimension of the first PC, i.e. the absolute difference between their PC values.
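The PC1 distance described above reduces to projecting both vectors onto the weights and taking the absolute difference. A minimal R sketch follows; the argument names and the example weights are illustrative assumptions, not values derived from the iris data.

```r
pc1_distance <- function(p, q, component_weights) {
  # Project each data object onto the first PC, then compare the projections.
  pc_p <- sum(p * component_weights)
  pc_q <- sum(q * component_weights)
  abs(pc_p - pc_q)
}

# Illustrative weights only (a unit vector, not real PCA output).
w_example <- c(0.5, 0.5, 0.5, 0.5)
d_example <- pc1_distance(c(1, 2, 3, 4), c(2, 2, 2, 2), w_example)
print(d_example)
```

Because both projections use the same weights, any centering offset applied to the data would cancel out in the difference.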

Part 4: Comparing Distances

Now we want to compare our PC1 distance to traditional Euclidean distance. In your PDF, use the inter_intra_species_dist function to answer the following question: Which of the two distance metrics (euclidean, PC1) is most useful for differentiating iris species? Why do you think it is most useful?

Note: hw1.R has already been provided for you, with the function definitions. Complete all the functions requested in hw1.R. Please note that hw1_checker.R is there to help you understand how to run the code, along with the necessary implementations that you are not required to write. DO NOT submit hw1_checker.R. Also, please note that the TA may be using a dataset different from yours, so do not hard-code your solutions.

Also, it is recommended that you read up on vectorized operations in R. Any submission that takes more than 5 minutes to run on a standard university machine (32 GB RAM, i7 processor) will receive a zero grade. Also, please ensure that all the libraries are correctly loaded using the require method.

Allowed Packages: R Base, plyr. No other packages are allowed.

References

  1. R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugenics, vol. 7, no. 2, pp. 179-188, 1936.
