HW5 Solution

  • Random Forest Motivation

Ensemble learning is a general technique to combat overfitting by combining the predictions of many varied models into a single prediction based on their average or majority vote.

  (a) The motivation of averaging. Consider a set of uncorrelated random variables $\{Y_i\}_{i=1}^{n}$, each with mean $\mu$ and variance $\sigma^2$. Calculate the expectation and variance of their average. (In the context of ensemble methods, these $Y_i$ are analogous to the prediction made by classifier $i$.)

  (b) Ensemble Learning – Bagging. In lecture, we covered bagging (Bootstrap Aggregating). Bagging is a randomized method for creating many different learners from the same data set.

Given a training set of size n, generate B random subsamples of size n′ by sampling with replacement. Some points may be chosen multiple times, while some may not be chosen at all. If n′ = n, around 63% of the points are chosen, and the remaining 37% are called out-of-bag (OOB) samples. (A short simulation sketch illustrating this appears after this list.)

    1. Why 63%?

    2. If we use bagging to train our model, how should we choose the hyperparameter B? Recall that B is the number of subsamples; typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set.

  (c) In part (a), we see that averaging reduces variance for uncorrelated classifiers. Real-world predictions will of course not be completely uncorrelated, but reducing correlation will generally reduce the final variance. Reconsider a set of correlated random variables $\{Z_i\}_{i=1}^{n}$. Suppose $\forall\, i \neq j$, $\operatorname{Corr}(Z_i, Z_j) = \rho$. Calculate the variance of their average.

  (d) Is a random forest of stumps (trees with a single feature split, i.e., height 1) a good idea in general? Does the performance of a random forest of stumps depend much on the number of trees? Think about the bias of each individual tree and the bias of the average of all these random stumps.
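
As a quick illustration of the out-of-bag fraction mentioned in the bagging part above, here is a minimal simulation sketch in Python with NumPy. The sample size, number of subsamples, and seed are arbitrary choices for illustration, not part of the assignment.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000                      # training set size
B = 200                         # number of bootstrap subsamples

oob_fractions = []
for _ in range(B):
    # sample n indices with replacement (n' = n)
    idx = rng.integers(0, n, size=n)
    unique = np.unique(idx).size
    oob_fractions.append(1 - unique / n)

# empirically close to 1/e (about 0.37), i.e. roughly 63% of points are chosen
print("mean OOB fraction:", np.mean(oob_fractions))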

  • Decision Trees for Classification

In this problem, you will implement decision trees and random forests for classification on two datasets: 1) the spam dataset, and 2) a Titanic dataset used to predict Titanic survivors. The data is provided with the assignment.

In lectures, you were given a basic introduction to decision trees and how such trees are trained. You were also introduced to random forests. Feel free to research different decision tree techniques online. You do not have to implement boosting, though it might help with Kaggle.


3.1 Implement Decision Trees

See the Appendix for more information. You are not allowed to use any off-the-shelf decision tree implementation. Some of the datasets are not “cleaned,” i.e., there are missing values, so you can use external libraries for data preprocessing and tree visualization (in fact, we recommend it). Be aware that some of the later questions might require special functionality that you need to implement (e.g., max depth stopping criterion, visualizing the tree, tracing the path of a sample through the tree). You can use any programming language you wish as long as we can read and run your code with minimal effort. In this part of your writeup, include your decision tree code.

3.2 Implement Random Forests

You are not allowed to use any off-the-shelf random forest implementation. If you architected your code well, this part should be a (relatively) easy encapsulation of the previous part. In this part of your writeup, include your random forest code.

3.3 Describe implementation details

We aren’t looking for an essay; 1–2 sentences per question is enough.

  1. How did you deal with categorical features and missing values?

  2. What was your stopping criterion?

  3. How did you implement random forests?

  4. Did you do anything special to speed up training?

  5. Anything else cool you implemented?

3.4 Performance Evaluation

For each of the 2 datasets, train both a decision tree and a random forest and report your training and validation accuracies. You should be reporting 8 numbers (2 datasets × 2 classifiers × training/validation). In addition, for both datasets, train your best model and submit your predictions to Kaggle. Include your Kaggle display name and your public scores on each dataset. You should be reporting 2 Kaggle scores.

3.5 Writeup Requirements for the Spam Dataset

  1. (Optional) If you use any other features or feature transformations, explain what you did in your report. You may choose to use something like bag-of-words. You can implement any custom feature extraction code in featurize.py, which will save your features to a .mat file.

  2. For your decision tree, and for a data point of your choosing from each class (spam and ham), state the splits (i.e., which feature and which value of that feature to split on) your decision tree made to classify it. An example of what this might look like:

    (a) (“viagra”) ≥ 2

    (b) (“thanks”) < 1

    (c) (“nigeria”) ≥ 3

    (d) Therefore this email was spam.

    (a) (“budget”) ≥ 2

    (b) (“spreadsheet”) ≥ 1

    (c) Therefore this email was ham.

  3. Generate a random 80/20 training/validation split. Train decision trees with varying maximum depths (try going from depth = 1 to depth = 40) with all other hyperparameters fixed. Plot your validation accuracies as a function of the depth. Which depth had the highest validation accuracy? Write 1–2 sentences explaining the behavior you observe in your plot. If you find that you need to plot more depths, feel free to do so.
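
A minimal sketch of such a depth sweep follows, assuming a DecisionTree class with the train/predict interface described in the Suggested Architecture appendix, a max_depth hyperparameter (the parameter name is an assumption), and spam features/labels already loaded as NumPy arrays.

import numpy as np
import matplotlib.pyplot as plt

# Assumes `features` and `labels` are NumPy arrays already loaded from the spam data.
rng = np.random.default_rng(0)
perm = rng.permutation(len(labels))
split = int(0.8 * len(labels))
train_idx, val_idx = perm[:split], perm[split:]

depths = range(1, 41)
val_accs = []
for d in depths:
    tree = DecisionTree(max_depth=d)                 # max_depth name is an assumption
    tree.train(features[train_idx], labels[train_idx])
    preds = tree.predict(features[val_idx])
    val_accs.append(np.mean(preds == labels[val_idx]))

plt.plot(list(depths), val_accs, marker="o")
plt.xlabel("maximum depth")
plt.ylabel("validation accuracy")
plt.title("Validation accuracy vs. tree depth (spam)")
plt.savefig("depth_sweep.png")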

3.6 Writeup Requirements for the Titanic Dataset

Train a very shallow decision tree (for example, a depth 3 tree, although you may choose any depth that looks good) and visualize your tree. Include for each non-leaf node the feature name and the split rule, and include for leaf nodes the class your decision tree would assign. You can use any visualization method you want, from simple printing to an external library; the rcviz library on github works well.
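
If you prefer simple printing over an external library, a recursive pretty-printer along these lines is enough. This is a sketch assuming the Node fields described in the appendix, plus a hypothetical feature_names list for readable output.

def print_tree(node, feature_names, depth=0):
    """Recursively print a trained tree: split rules at internal nodes,
    class labels at leaves."""
    indent = "  " * depth
    if node.label is not None:                        # leaf node
        print(f"{indent}predict: {node.label}")
        return
    feat_idx, thresh = node.split_rule
    print(f"{indent}({feature_names[feat_idx]}) < {thresh}?")
    print_tree(node.left, feature_names, depth + 1)   # branch taken when the rule is true
    print_tree(node.right, feature_names, depth + 1)  # branch taken when it is false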


  • Appendix

Data Processing for Titanic

Here’s a brief overview of the fields in the Titanic dataset. You will need to preprocess the dataset into a form usable by your decision tree code.

  1. survived: the label we want to predict. 1 indicates the person survived, whereas 0 indicates the person died.

  2. pclass: Measure of socioeconomic status. 1 is upper, 2 is middle, 3 is lower.

  3. age: Fractional if less than 1.

  4. sex: Male/female.

  5. sibsp: Number of siblings/spouses aboard the Titanic.

  6. parch: Number of parents/children aboard the Titanic.

  7. ticket: Ticket number.

  8. fare: Fare.

  9. cabin: Cabin number.

  10. embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

You will face two challenges you did not have to deal with in previous datasets:

  1. Categorical variables. Most of the data you’ve dealt with so far has been continuous-valued. Some features in this dataset represent types/categories. Here are two possible ways to deal with categorical variables:

    (a) (Easy) In the feature extraction phase, map categories to binary variables. For example, suppose feature 2 takes on three possible values: ‘TA’, ‘lecturer’, and ‘professor’. In the data matrix, these categories would be mapped to three binary variables. These would be columns 2, 3, and 4 of the data matrix. Column 2 would be a boolean feature {0, 1} representing the TA category, and so on. In other words, ‘TA’ is represented by [1, 0, 0], ‘lecturer’ is represented by [0, 1, 0], and ‘professor’ is represented by [0, 0, 1]. Note that this expands the number of columns in your data matrix. This is called “vectorizing,” or “one-hot encoding,” the categorical feature.

    (b) (Hard, but more generalizable) Keep the categories as strings or map the categories to indices (e.g., ‘TA’, ‘lecturer’, ‘professor’ get mapped to 0, 1, 2). Then implement functionality in decision trees to determine split rules based on the subsets of categorical variables that maximize information gain. You cannot treat these as normal continuous-valued features because ordering has no meaning for these categories (the fact that 0 < 1 < 2 has no significance when 0, 1, 2 are discrete categories).

  2. Missing values. Some data points are missing features. In the csv files, these are represented by the value ‘?’. You have several approaches:


    (a) (Easiest) If a data point is missing some features, remove it from the data matrix (this is useful for your first code draft, but your submission must not do this).

    (b) (Easy) Infer the value of the feature from all the other values of that feature (e.g., fill it in with the mean, median, or mode of the feature. Think about which of these is the best choice and why).

    (c) (Hard, but more powerful) Use k-nearest neighbors to impute feature values based on the nearest neighbors of a data point. In your distance metric you will need to define the distance to a missing value.

    (d) (Hardest, but more powerful) Implement within your decision tree functionality to handle missing feature values based on the current node. There are many ways this can be done. You might infer missing values based on the mean/median/mode of the feature values of data points sorted to the current node. Another possibility is assigning probabilities to each possible value of the missing feature, then sorting fractional (weighted) data points to each child (you would need to associate each data point with a weight in the tree).

For Python:

It is recommended you use the following classes to write, read, and process data:

csv.DictReader

sklearn.feature_extraction.DictVectorizer (vectorizing categorical variables)

(There’s also sklearn.preprocessing.OneHotEncoder, but it’s much less clean)

sklearn.preprocessing.LabelEncoder (if you choose to discretize but not vectorize categorical variables)

sklearn.preprocessing.Imputer (for inferring missing feature values in the preprocessing phase; note that in scikit-learn 0.22 and later this class was removed in favor of sklearn.impute.SimpleImputer)

If you use csv.DictReader, it will automatically parse out the header line in the csv file (first line of the file) and assign values to fields in a dictionary. This can then be consumed by DictVectorizer to binarize categorical variables.
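
For example, here is a minimal preprocessing sketch along those lines. The file name titanic_training.csv, the decision to drop the ticket and cabin columns, and the use of sklearn.impute.SimpleImputer (the modern replacement for sklearn.preprocessing.Imputer) are illustrative assumptions, not part of the assignment.

import csv
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.impute import SimpleImputer

# Read rows as dictionaries keyed by the csv header line.
with open("titanic_training.csv") as f:               # hypothetical file name
    rows = list(csv.DictReader(f))

labels = np.array([int(r.pop("survived")) for r in rows])

numeric = {"pclass", "age", "sibsp", "parch", "fare"}
for r in rows:
    r.pop("ticket", None)                             # dropped here only to keep the sketch small
    r.pop("cabin", None)
    for k, v in list(r.items()):
        if k in numeric:
            r[k] = np.nan if v == "?" else float(v)   # numeric fields: mark missing as NaN
        elif v == "?":
            r[k] = "missing"                          # categorical fields: sentinel category

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)                           # one-hot encodes the string-valued fields

X = SimpleImputer(strategy="mean").fit_transform(X)   # fill the NaNs left in numeric columns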

To speed up your work, you might want to store your cleaned features in a file, so that you don’t need to preprocess every time you run your code.
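
For instance, a tiny caching sketch, assuming your cleaned features and labels are NumPy arrays X and labels (the file name features.npz is arbitrary):

import numpy as np

np.savez("features.npz", X=X, y=labels)     # run once after preprocessing
cached = np.load("features.npz")            # on later runs, skip preprocessing
X, labels = cached["X"], cached["y"]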

Approximate Expected Performance

For spam, using the base features and a regular decision tree, we got 74.4% test accuracy. With a random forest, we got around 75% test accuracy on Titanic. You can get better performance. This is a general ballpark range of what to expect; we will post cutoffs on Piazza.

Suggested Architecture

This is a complicated coding project. You should put in some thought about how to structure your program so your decision trees don’t end up as horrific forest fires of technical debt. Here is a rough, optional spec that only covers the barebones decision tree structure. This is only for your benefit: writing clean code will make your life easier, but we won’t grade you on it. There are many different ways to implement this.

Your decision trees ideally should have a well-encapsulated interface like this:


classifier = DecisionTree(params)

classifier.train(train_data, train_labels)

predictions = classifier.predict(test_data)

where train_data and test_data are 2D matrices (rows are data, columns are features).

A decision tree (or DecisionTree) is a binary tree composed of Nodes. You first initialize it with the necessary parameters (which depend on what techniques you implement). As you train your tree, your tree should create and configure Nodes to use for classification and store these nodes internally. Your DecisionTree will store the root node of the resulting tree so you can use it in classification.

Each Node has left and right pointers to its children, which are also nodes, though some (like leaf nodes) won’t have any children. Each node has a split rule that, during classification, tells you when you should continue traversing to the left or to the right child of the node. Leaf nodes, instead of containing a split rule, should simply contain a label of what class to classify a data point as. Leaf nodes can either be a special configuration of regular Nodes or an entirely different class.

Node fields:

split_rule: A length 2 tuple that details what feature to split on at a node, as well as the threshold value at which you should split. The former can be encoded as an integer index into your data point’s feature vector.

left: The left child of the current node.

right: The right child of the current node.

label: If this field is set, the Node is a leaf node, and the field contains the label you should assign to a data point that reaches this node during your classification tree traversal. Typically, the label is the mode of the labels of the training data points arriving at this node.
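
A minimal Node container matching the fields above might look like the following sketch (the constructor signature and the is_leaf helper are illustrative choices, not part of the spec):

class Node:
    """One node of a binary decision tree."""
    def __init__(self, split_rule=None, left=None, right=None, label=None):
        self.split_rule = split_rule   # (feature_index, threshold) for internal nodes
        self.left = left               # child for points with feature value < threshold
        self.right = right             # child for points with feature value >= threshold
        self.label = label             # set only for leaf nodes

    def is_leaf(self):
        return self.label is not None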

DecisionTree methods:

entropy(labels): A method that takes in the labels of the data stored at a node and computes the entropy of the label distribution.

information_gain(features, labels, threshold): A method that takes in a feature column of the data, the labels, and a threshold, and computes the information gain of a split using that threshold.

gini_impurity(labels): A method that takes in the labels of the data stored at a node and computes the Gini impurity of the label distribution (an alternative to entropy).

gini_purification(features, labels, threshold): A method that takes in a feature column of the data, the labels, and a threshold, and computes the drop in Gini impurity of a split using that threshold.

segmenter(data, labels): A method that takes in data and labels. When called, it finds the best split rule for a Node using the entropy measure and input data. There are many different types of segmenters you might implement, each with a different method of choosing a threshold. The usual method is exhaustively trying lots of different threshold values from the data and choosing the combination of split feature and threshold with the lowest entropy value. The final split rule uses the split feature with the lowest entropy value and the threshold chosen by the segmenter. Be careful how you implement this method! Your classifier might train very slowly if you implement this poorly.


train(data, labels): Grows a decision tree by constructing nodes. Using the entropy and segmenter methods, it attempts to find a configuration of nodes that best splits the input data. This function figures out the split rules that each node should have and figures out when to stop growing the tree and insert a leaf node. There are many ways to implement this, but eventually your DecisionTree should store the root node of the resulting tree so you can use the tree for classification later on. Since the height of your DecisionTree shouldn’t be astronomically large (you may want to cap the height; if you do, the max height would be a hyperparameter), this method is best implemented recursively.

predict(data): Given a data point, traverse the tree to find the best label to classify the data point as. Start at the root node you stored and evaluate split rules at each node as you traverse until you reach a leaf node, then choose that leaf node’s label as your output label.
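
Putting these methods together, one possible skeleton is sketched below. It assumes the Node class sketched earlier; the max_depth cap and the exhaustive threshold search are example choices rather than requirements, and the sketch ignores refinements such as Gini impurity, categorical splits, and missing values.

import numpy as np

class DecisionTree:
    def __init__(self, max_depth=10):
        self.max_depth = max_depth
        self.root = None

    @staticmethod
    def entropy(labels):
        # H = -sum_c p_c * log2(p_c) over the classes present at this node
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(self, feature, labels, threshold):
        # Entropy before the split minus the weighted entropy of the two children.
        left, right = labels[feature < threshold], labels[feature >= threshold]
        if len(left) == 0 or len(right) == 0:
            return 0.0
        n = len(labels)
        children = (len(left) / n) * self.entropy(left) + (len(right) / n) * self.entropy(right)
        return self.entropy(labels) - children

    def segmenter(self, data, labels):
        # Exhaustively try each feature and each observed value as a threshold.
        best_gain, best_rule = 0.0, None
        for j in range(data.shape[1]):
            for t in np.unique(data[:, j]):
                gain = self.information_gain(data[:, j], labels, t)
                if gain > best_gain:
                    best_gain, best_rule = gain, (j, t)
        return best_rule

    def train(self, data, labels):
        self.root = self._grow(data, labels, depth=0)

    def _grow(self, data, labels, depth):
        # Stop when the node is pure, too deep, or no split improves entropy.
        if depth >= self.max_depth or len(np.unique(labels)) == 1:
            return Node(label=self._majority(labels))
        rule = self.segmenter(data, labels)
        if rule is None:
            return Node(label=self._majority(labels))
        j, t = rule
        mask = data[:, j] < t
        left = self._grow(data[mask], labels[mask], depth + 1)
        right = self._grow(data[~mask], labels[~mask], depth + 1)
        return Node(split_rule=rule, left=left, right=right)

    @staticmethod
    def _majority(labels):
        values, counts = np.unique(labels, return_counts=True)
        return values[np.argmax(counts)]

    def predict(self, data):
        return np.array([self._classify(x, self.root) for x in data])

    def _classify(self, x, node):
        while node.label is None:
            j, t = node.split_rule
            node = node.left if x[j] < t else node.right
        return node.label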

Random forests can be implemented without code duplication by storing groups of decision trees. You will have to train each tree on different subsets of the data (data bagging) and train nodes in each tree on different subsets of features (attribute bagging). Most of this functionality should be handled by a random forest class, except attribute bagging, which may need to be implemented in the decision tree class. Hopefully, the spec above gives you a good jumping-off point as you start to implement your decision trees. Again, it’s highly recommended to think through design before coding.
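
A random forest wrapper over the DecisionTree class sketched above might look like the following sketch. For brevity it samples a random feature subset once per tree rather than per node, which is a simplification of the per-node attribute bagging described above; labels are assumed to be non-negative integers for the majority vote.

import numpy as np

class RandomForest:
    def __init__(self, n_trees=100, max_depth=10, n_features=None, seed=0):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.n_features = n_features      # features sampled per tree (simplification)
        self.rng = np.random.default_rng(seed)
        self.trees = []                   # list of (tree, feature_indices) pairs

    def train(self, data, labels):
        n, d = data.shape
        k = self.n_features or max(1, int(np.sqrt(d)))
        for _ in range(self.n_trees):
            rows = self.rng.integers(0, n, size=n)            # data bagging (with replacement)
            cols = self.rng.choice(d, size=k, replace=False)  # attribute bagging (per tree)
            tree = DecisionTree(max_depth=self.max_depth)
            tree.train(data[np.ix_(rows, cols)], labels[rows])
            self.trees.append((tree, cols))

    def predict(self, data):
        # Majority vote over the individual trees' predictions.
        votes = np.stack([tree.predict(data[:, cols]) for tree, cols in self.trees])
        return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])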

Happy hacking!

B Submission Instructions

Please submit:

  • a PDF write-up containing your answers, plots, and code to Gradescope;

  • a .zip file of your code and a README explaining how to run your code to Gradescope; and

  • your two CSV files of predictions to Kaggle.

HW5, © UCB CS 189, Spring 2020. All Rights Reserved. This may not be publicly shared without explicit permission.
