Description
You are responsible for adhering to the Aggie Honor Code and student rules regarding original work. You must write and submit your own code to Mimir. Plagiarism checks will be used to check the code in addition to a manual review from the grader.
That said, we recognize that there is a certain degree of collaboration and referencing involved in the process of problem solving. In the starter code, there is a provided bibliography.txt where you will track all of the references you use. Use references generally to help think through a problem or find information about an error. This is especially true if you talk with peers: focus on the high-level aspects of the problem. Do not share code! Do not copy code!
Overview
About this Assignment
We have spent some time talking about tree structures and the types of applications they support. Trees are widely used for algorithms that require logarithmic runtime because each branch divides the data, reducing the problem to smaller and smaller subtrees. In the labs, you have implemented a fully-operational binary search tree, and you have also done several exercises involving balancing trees, recursive operations such as traversals, and the operations for special trees such as heaps. While we will discuss many more special trees in upcoming lectures and labs, this assignment focuses on heaps and decision trees.
A heap is a tree used to implement a priority queue. That is, it stores information in such a way that the maximum (or minimum) element is always at the root of each subtree. We have seen examples of min-heaps in class. This assignment requires you to implement a max-heap in order to support the training phase of a decision tree.
A decision tree is a special type of binary tree used to model a decision process. Each node of a decision tree is a decision point: based on some parameter, we follow either the left child or the right child. Decision trees are one popular form of supervised machine learning.
Machine Learning and Decision Trees – A Short Introduction
The general topic of machine learning is outside the scope of this course, but in a nutshell, machine learning is about creating algorithms and structures able to learn from observations and respond “intelligently.” Unsupervised machine learning takes unlabeled data and tries to interpret it by grouping similar observations together. Supervised learning involves providing large amounts of labeled data to an algorithm which is able to learn the relationship between the observations and the outcomes. Supervised learning is widely used today to power a variety of tools like speech recognition, computer vision and image processing, natural language processing, and more. Decision trees are a form of supervised learning that examines the available data and creates decision points that help it best categorize the outcomes.
Consider the following simple dataset:
Weight   | Height  | Coat Color | Breed
8.1 lbs  | 9.1 in  | black      | Cat
11 lbs   | 10.1 in | brown      | Dog
10.2 lbs | 10.5 in | tabby      | Cat
9.2 lbs  | 8.9 in  | white      | Cat
10.1 lbs | 12 in   | black      | Dog
We might create a very simple decision tree as follows. [Figure: a small decision tree whose root tests a single threshold, such as Height > 10 in, with Cat and Dog outcomes at the leaves.]
This is a small example of applying decision trees to a binary dataset, that is, a set of data with two possible outcomes. Decision trees can support more outcomes than that, but we will focus on binary problems for this assignment, in particular, data for health-based applications.
Decision trees are fairly unsophisticated by themselves. They simply create a hierarchy of choices that describe the data. However, they can become very powerful when used in combination with each other. This grouping of decision trees into a single model is called a random forest, and it is a common type of machine learning classifier (an algorithm which creates a data model to “classify” or identify a particular outcome, e.g., cat vs. dog).
Decision trees are trained using information gain. By looking at the data and considering all possible choices for decision points, the goal is to find a decision that will tell us the most about the problem. In this example, we would want to find an attribute and data point that helps us to separate the most cats from dogs. Height > 10 is an example of a single threshold that can split most data points.
You will not need to code the information gain function for this assignment! It is well beyond the scope of this class, although you can read more about it here. To see a fun discussion and application of information gain, watch 3Blue1Brown’s video analyzing Wordle using information theory.
Getting Started
In this assignment, you will first implement an array-based heap. Next, you will implement a decision tree. We will leverage concepts from earlier assignments, as well as structures provided by the standard template library (STL) in C++.
- Go to the assignment on Mimir and download the starter code.
- There is a bibliography.txt file where you should track the references you use for this assignment.
- The following files are fully implemented and will not need to be edited:
    - DecisionLogic.h – header file for the Decision class and helper functions such as the information gain calculator
    - DecisionLogic.cpp – source code which implements the helper functions to read the data into parallel vectors (essentially a table) and can calculate information gain for a specific attribute
    - runner.cpp – a program which loads input data from provided arguments and trains and tests a decision tree
- The MaxHeap, DNode, and DTree are unimplemented.
- There are provided datasets for detecting diabetes and cancer based on a set of readings. You will use these with the runner to test your decision tree.
    - The diabetes dataset comes from a Coursera class [1] and the cancer dataset is the "Wisconsin Breast Cancer Database" from the University of Wisconsin Hospitals, Madison [2].
- As an example, once you have a compiled program, you can run it with the data for cancer detection with the following (assuming you named the program runner):

    ./runner train_cancer.csv test_cancer.csv
Implement the MaxHeap
- As we discussed in lecture, heaps are tree-based structures which can be built on top of trees or on top of arrays. While arrays and other linear structures are generally more work to use as a basis for generic trees, heaps can be implemented on top of arrays quite easily because heaps must be complete trees.
- You may use vector for your heap. You may also use your own dynamic array or tree-based implementation. In either case, your heap must be compatible with the function signatures defined in Part A.
Implement the DNode and DTree
- The decision tree will look quite similar to trees we have seen so far, but it will have highly specialized operations given that its sole purpose is to serve as a classifier.
- The decision tree will need to use the max-heap from the previous step, so you will find it easiest to start with the heap.
We are not requiring a makefile for this assignment since it is straightforward to compile the sources together, unlike in the previous assignment where we had intermediate build operations to generate the library. That said, you are encouraged to use a makefile to practice building them.
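If you do want that practice, a minimal makefile might look like the following; the source list here is an assumption, so match it against what the starter code actually contains.

```make
# Hypothetical minimal makefile; adjust the source list to the starter code.
CXX = g++
CXXFLAGS = -std=c++17 -Wall

runner: runner.cpp DecisionLogic.cpp
	$(CXX) $(CXXFLAGS) runner.cpp DecisionLogic.cpp -o runner

clean:
	rm -f runner
```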
Part A: MaxHeap
Task
Implement a templated max-heap in MaxHeap.h. Your max-heap will ultimately be used for tracking the maximum information gain used by the decision tree, but we will test it independently as a fully functioning implementation of a generalized heap. Your heap only needs a single template type, not a key-value pair, since we’ll be using a separate object to hold the key-value pair together.
You must implement the following functions inside the MaxHeap class. For the purposes of testing, make each of these functions public!
MaxHeap Class Function Signatures | Description
MaxHeap() | Default constructor
MaxHeap(vector<T>) | Constructor accepts a vector of generic objects in any order and converts it into a valid max-heap via heapify
int getParent(int) const | Given the index of the current node, returns the index of the parent node
int getLeft(int) const | Given the index of the current node, returns the index of the left child node
int getRight(int) const | Given the index of the current node, returns the index of the right child node
bool empty() const | Boolean indicates if the heap is empty or not
void upheap(int) | Given the index of a node, propagate it up the max-heap if the parent is a smaller value
void downheap(int) | Given the index of a node, propagate it down the max-heap if a child has a larger value
void heapify() | Heapify the current structure to ensure the rules of a heap are maintained; used in conjunction with the parameterized constructor that accepts a vector
Part B: DNode and DTree
Task
First, implement the data members for the DNode struct.

DNode Struct Data Members | Description
int depth | Depth of the node, used to track its position in the tree and to use for printing. Root is at depth = 0; its children at depth = 1; and so on.
DNode* parent | Pointer to parent
DNode* left | Pointer to left child
DNode* right | Pointer to right child
Next, implement the data and function members for the DTree class in DTree.h. Again, large amounts of logic have been provided for you through the already-implemented code. However, the DTree class is not without some challenges. It is very similar to the trees we have seen before, including some concepts from labs, but it is slightly unique to this application.
DTree Class Data Members | Description
DNode* root | Root of the tree
int maxDepth = 10 | Default max depth of the tree
double delta = 0.000005 | A very small delta and its default value, used for comparing against the information gain to determine if too little has been learned from this level and we should not branch any further
DTree Class Function Signatures | Description

Decision getMaxInformationGain(
        vector<string>& attributes,
        vector< vector<double> >& data,
        vector<int>& outcomes,
        vector<int>& instances)

    This function iterates through the attributes vector and calls the provided getInformationGain() for each attribute. If you are passing the attribute in attributes.at(i), you will pass its corresponding data from data.at(i) to the getInformationGain() function, along with the set of outcomes and instances being considered.

    This function must utilize your max-heap to save the Decisions and return the one with the max information gain at the end! Notice that the Decision class overloads several comparison operators so it works with the heap.

void train(
        vector<string>& attributes,
        vector< vector<double> >& data,
        vector<int>& outcomes,
        vector<int>& instances)

    This function takes the set of attributes, all data, and the list of outcomes and instances so that it can build the decision tree. It should call getMaxInformationGain to get the Decision that should be turned into the next node.

    You may find it easiest to create a recursive helper: first populate the root from the decision with the max gain and then call a recursive helper to populate the left and right children.

    A Decision keeps track of the instances that should be considered at each level. Left children use the "instancesBelow" list for instances, and right children use the "instancesAbove" list. Remember to track depth and parent when creating a new node (you can add these as parameters to your recursive helper!).

    You stop adding new children in a branch when you either reach maxDepth or the information gain of the max decision is below the delta.

int classify(
        vector<string>& attributes,
        vector<double>& data)

    Given a single instance, which is a list of attributes and their corresponding data points in parallel vectors, your goal is to return the decision tree's guess for the likely class or outcome. The return is an integer since we are using binary class labels of 0 and 1.

    The classify operation is very similar to search for a binary search tree:
    - You start at the root and check the Decision's attribute label. You scan the attributes vector until you find the correct label.
    - Next, you compare the data for that attribute (at the same index in the parallel data vector) against the Decision point's threshold.
    - If the recorded data is less than or equal to the threshold, you go down the left subtree. Else, you go down the right subtree.
    - This operation continues until you reach the bottom of the branch (reach nullptr).

    Classify can be done recursively but is also quite easy to do iteratively since it is like a linked list traversal.

string levelOrderTraversal()

    Finally, in order to understand the structure the tree has built, we can use a level-order traversal. In class, we went through an example of this algorithm with detailed pseudocode. You can implement it exactly from that example with one modification to how we print the data. Add the following line in your function when the current node is dequeued so that you can print it with an indentation level corresponding to its depth:
Interesting Links
In addition to using our heap to support the decision tree, heaps can be useful for other machine learning algorithms as well. For instance, the popular k-Nearest Neighbors (kNN) model classifies a data point by finding the labeled data points closest to it. Sorting is central to finding those nearest neighbors, and quicksort is among the fastest and most favored methods. However, since portions of heapsort can be performed in parallel across multiple processors, heapsort has been shown to be a faster method for certain kNN implementations.
For further reading about k-Nearest Neighbors, look at this Towards Data Science article (which is in fact the original source of our diabetes dataset for this problem). Also look at this research paper on fast heapsort-based kNNs. You will see many of the concepts we have been discussing in this class, such as sorting algorithms, Big-O complexity, and more, and you can see an example of how they are used for practical purposes.