Description
Problem 1: Prediction.
Use the Hitters dataset provided in Homework 2.
- [Code] Fill in the function read_data, which takes in the filename as a string and returns a pandas DataFrame. Hint: you may find the read_csv function from the pandas library useful. (See the first sketch after this list.)
- [Code] Fill in the function data_preprocess to complete all preprocessing steps mentioned in Problem 2 of Homework 2. You may reuse the code you wrote in Homework 2. Note: do not include the Player column.
- [Code] Fill in the function data_split that, given features and labels, splits the data into a train/test split (80% training, 20% test). The function returns 4 items in the following order: x_train, x_test, y_train, y_test.
- [Code + Written] Fill in the function train_ridge_regression to implement a Ridge Regression model for the Hitters dataset that returns a dictionary with lambda_vals as keys and the corresponding mean accuracy as values. Repeat the training process n times, training the model each time for max_iter iterations with all lambda_vals (hyperparameter tuning; a tuning-loop sketch follows this list). Describe your hyperparameter tuning procedure and report in lambda_val the optimal lambda, i.e. the one that gives the highest accuracy.
- [Code + Written] Fill in the function train_lasso to implement a Lasso-regularized Logistic Regression for the Hitters dataset that returns a dictionary with lambda_vals as keys and the corresponding mean accuracy as values. Repeat the training process n times, training each time for max_iter iterations with all lambda_vals (hyperparameter tuning). Describe your hyperparameter tuning procedure and report in lambda_val the optimal lambda, i.e. the one that gives the highest accuracy.
- [Code] Fill in the function ridge_coefficients that returns a tuple of (trained ridge model with max_iter iterations and alpha set to the optimal lambda value, model coefficients).
- [Code] Fill in the function lasso_coefficients that returns a tuple of (trained lasso model with max_iter iterations and alpha set to the optimal lambda value, model coefficients).
- [Code + Written] Fill in the function ridge_area_under_curve that returns the area-under-curve measurement. Plot the ROC curve of the Ridge model (see the ROC sketch after this list). Include axis labels, a legend, and a title in the plot. Any missing item in the plot will result in loss of points.
- [Code + Written] Fill in the function lasso_area_under_curve that returns the area-under-curve measurement. Plot the ROC curve of the Lasso model. Include axis labels, a legend, and a title in the plot. Any missing item in the plot will result in loss of points.
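The sketch below shows one way the data-handling functions might fit together. It is a minimal sketch under stated assumptions, not the official solution: the preprocessing steps (dropping Player, dropping rows with missing values, one-hot encoding) and the NewLeague_N label column stand in for whatever you actually did in Homework 2, and random_state is illustrative.

import pandas as pd
from sklearn.model_selection import train_test_split

def read_data(filename):
    """Read the Hitters CSV into a pandas DataFrame."""
    return pd.read_csv(filename)

def data_preprocess(df):
    """Assumed Homework 2 steps -- substitute your own:
    drop Player, drop missing rows, one-hot encode categoricals."""
    df = df.drop(columns=["Player"], errors="ignore")
    df = df.dropna()
    df = pd.get_dummies(df, drop_first=True)
    # Hypothetical label column; replace with your Homework 2 target.
    y = df["NewLeague_N"]
    X = df.drop(columns=["NewLeague_N"])
    return X, y

def data_split(X, y):
    """80/20 split, returned as x_train, x_test, y_train, y_test."""
    return train_test_split(X, y, test_size=0.2, random_state=0)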
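Next, a sketch of the hyperparameter-tuning loop for train_ridge_regression (train_lasso is analogous, swapping in a lasso model). Assumptions: scikit-learn's Ridge, with alpha playing the role of lambda and .score() standing in for "mean accuracy"; the candidate lambda_vals and the signature are illustrative, so match the template's actual definitions.

import numpy as np
from sklearn.linear_model import Ridge

def train_ridge_regression(x_train, y_train, x_test, y_test,
                           lambda_vals=(0.01, 0.1, 1, 10, 100),
                           n=10, max_iter=1000):
    """Train n models per candidate lambda; record the mean test score."""
    results = {}
    for lam in lambda_vals:
        scores = []
        for _ in range(n):
            model = Ridge(alpha=lam, max_iter=max_iter)
            model.fit(x_train, y_train)
            scores.append(model.score(x_test, y_test))
        results[lam] = float(np.mean(scores))
    return results

The optimal lambda is then max(results, key=results.get); ridge_coefficients and lasso_coefficients can refit one model at that alpha with max_iter iterations and return (model, model.coef_).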
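Finally, a sketch of the AUC computation and ROC plot, assuming a binary 0/1 test label so the model's continuous predictions can be treated as decision scores; the template's actual return shape for ridge_area_under_curve may differ.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def ridge_area_under_curve(model, x_test, y_test):
    """ROC points and AUC from the model's continuous predictions."""
    scores = model.predict(x_test)
    fpr, tpr, _ = roc_curve(y_test, scores)
    return fpr, tpr, auc(fpr, tpr)

# model: a trained ridge model, e.g. from ridge_coefficients
fpr, tpr, area = ridge_area_under_curve(model, x_test, y_test)
plt.plot(fpr, tpr, label=f"Ridge (AUC = {area:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False positive rate")          # axis labels ...
plt.ylabel("True positive rate")
plt.title("ROC curve of the Ridge model")  # ... title ...
plt.legend()                               # ... and legend: all required
plt.show()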
Problem 2: Decision Trees.
For this problem, you'll be coding up regression and classification trees from scratch. Trees are a special class of graphs with only directed edges and no cycles; they fall under the category of directed acyclic graphs, or DAGs. Specifically, trees are DAGs in which each child node has exactly one parent node.
Since trees are easy to design recursively, it is very important that you're familiar with recursion. It is therefore highly recommended that you brush up on recursion and tree-based search algorithms such as depth-first search (DFS) and breadth-first search (BFS).
- You are NOT allowed to use machine learning libraries such as scikit-learn to build regression and classification trees for this assignment.
- You are required to fill out the sections of the code marked "YOUR CODE HERE".
- Download the datasets noisy_sin_subsample_2.csv and new_circle_data.csv from:
  https://github.com/jermwatt/machine_learning_refined/blob/gh-pages/mlrefined_datasets/nonlinear_superlearn_datasets/noisy_sin_subsample_2.csv
  https://github.com/jermwatt/machine_learning_refined/blob/gh-pages/mlrefined_datasets/nonlinear_superlearn_datasets/new_circle_data.csv
- You may add any number of additional supporting functions within functions that you deem necessary.
- Use the Node class as a node of the decision trees. DO NOT change the class and function definitions.
Below is a suggested sequence of steps you may want to think along for building regression and classification trees.
- Define a criterion for splitting (see the sketch that follows this list). This criterion assigns a score to a split.
  - For regression trees, this would be the mean squared error.
  - For classification trees, this would be the Gini index or entropy.
- Create the split.
  - Split the dataset by iterating over all the rows and feature columns.
  - Evaluate all the splits using the splitting criterion.
  - Choose the best split.
- Build the tree.
  - Terminal nodes: decide when to stop growing the tree, e.g. when the maximum allowed depth is reached, or when a leaf is empty or has only 1 element.
  - Recursive splitting: once a split is created, you can split it further recursively by calling the same splitting function on it.
  - Building a tree: create a root node and apply recursive splitting on it.
- Make predictions with the tree.
  - For a given data point, make a prediction using the tree.
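A minimal from-scratch sketch of the splitting machinery described above (no scikit-learn, per the rules). The function names are illustrative and X, y are assumed to be NumPy arrays; your implementation must live inside the provided TreeRegressor/TreeClassifier methods.

import numpy as np

def mse_criterion(y):
    """Regression score: MSE of a leaf that predicts the mean label."""
    return float(np.mean((y - y.mean()) ** 2)) if y.size else 0.0

def gini_criterion(y):
    """Classification score: Gini impurity of a leaf."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def best_split(X, y, criterion):
    """Try every (feature, observed value) pair as a threshold and
    keep the split with the lowest weighted child score."""
    best_feature, best_threshold, best_score = None, None, np.inf
    for j in range(X.shape[1]):
        for threshold in np.unique(X[:, j]):
            mask = X[:, j] <= threshold
            left, right = y[mask], y[~mask]
            if left.size == 0 or right.size == 0:
                continue  # degenerate split; skip it
            score = (left.size * criterion(left)
                     + right.size * criterion(right)) / y.size
            if score < best_score:
                best_feature, best_threshold, best_score = j, threshold, score
    return best_feature, best_threshold, best_score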
(1) Growing a maximum-depth regression tree
The recursive procedure for growing a deep regression tree is illustrated in the figure below. We begin (on the left) by fitting a stump to the original dataset. As we move from left to right the recursion proceeds, with each leaf of the preceding tree split in order to create the next, deeper tree. As can be seen in the rightmost panel, a tree with a maximum depth of four is capable of representing the training data perfectly.
Fill in the functions marked with YOUR CODE HERE in the TreeRegressor class.
Figure 1: Regressor
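The recursion shown in the figure can be sketched as below, continuing the best_split/mse_criterion sketch from the list above; the dict-based node is purely illustrative, since the template's Node class must be used unchanged.

def build_tree(X, y, depth=0, max_depth=4, criterion=mse_criterion):
    # Terminal node: maximum depth reached, or leaf is (near-)empty/pure.
    if depth >= max_depth or y.size <= 1:
        return {"leaf": True, "value": float(y.mean())}
    feature, threshold, _ = best_split(X, y, criterion)
    if feature is None:  # no valid split exists
        return {"leaf": True, "value": float(y.mean())}
    mask = X[:, feature] <= threshold
    # Recursive splitting: grow each child from its own partition.
    return {"leaf": False, "feature": feature, "threshold": threshold,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth, criterion),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, criterion)}

def predict_one(node, x):
    """Walk from the root to a leaf, following each split test."""
    while not node["leaf"]:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["value"]

For the classification tree, the leaf value would instead be the majority class and gini_criterion the splitting score.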
(2) Growing a two-class classification tree
The figure below shows the growth of a tree to a maximum depth of seven on a two-class classification dataset. As the tree grows, note how many parts of the input space stop changing as the leaves on the deeper branches become pure. By the time we reach a maximum depth of seven, there is considerable overfitting. Fill in the functions marked with YOUR CODE HERE in the TreeClassifier class.
Note: function definitions and comments for each function provide a description of the problems the functions are supposed to address.
Figure 2: Classifier