Description
You are responsible for adhering to the Aggie Honor Code and student rules regarding original work. You must write and submit your own code to Mimir. Plagiarism checks will be used to check the code in addition to a manual review from the grader.
That said, we recognize that there is a certain degree of collaboration and referencing involved in the process of problem solving. In the starter code, there is a provided bibliography.txt where you will track all of the references you use. Use references generally to help think through a problem or find information about an error. This is especially true if you talk with peers: focus on the high-level aspects of the problem. Do not share code! Do not copy code!
Overview
About this Assignment
We have spent some time talking about tree structures and the types of applications they support. Trees are widely used for algorithms that require logarithmic runtime because each branch divides the data, reducing the problem to smaller and smaller subtrees. In the labs, you have implemented a fully-operational binary search tree, and you have also done several exercises involving balancing trees, recursive operations such as traversals, and the operations for special trees such as heaps. While we will discuss many more special trees in upcoming lectures and labs, this assignment focuses on heaps and decision trees.
A heap is a tree used to implement a priority queue. That is, it stores information in such a way that the maximum (or minimum) element is always at the root of each subtree. We have seen examples of min-heaps in class. This assignment requires you to implement a max-heap in order to support the training phase of a decision tree.
A decision tree is a special type of binary tree used to model a decision process. Each node of a decision tree is a decision point: based on some parameter, we follow either the left child or the right child. Decision trees are one popular form of supervised machine learning.
Machine Learning and Decision Trees – A Short Introduction
The general topic of machine learning is outside the scope of this course, but in a nutshell, machine learning is about creating algorithms and structures able to learn from observations and respond “intelligently.” Unsupervised machine learning takes unlabeled data and tries to interpret it by grouping similar observations together. Supervised learning involves providing large amounts of labeled data to an algorithm which is able to learn the relationship between the observations and the outcomes. Supervised learning is widely used today to power a variety of tools like speech recognition, computer vision and image processing, natural language processing, and more. Decision trees are a form of supervised learning that examines the available data and creates decision points that help it best categorize the outcomes.
Consider the following simple dataset:
Weight   | Height  | Coat Color | Breed
8.1 lbs  | 9.1 in  | black      | Cat
11 lbs   | 10.1 in | brown      | Dog
10.2 lbs | 10.5 in | tabby      | Cat
9.2 lbs  | 8.9 in  | white      | Cat
10.1 lbs | 12 in   | black      | Dog
We might create a very simple decision tree as follows. [Figure: a small decision tree whose root tests a single threshold, such as Height > 10 in, with Cat and Dog outcomes at the leaves.]
This is a small example of applying decision trees to a binary dataset, that is, a set of data with two possible outcomes. Decision trees can support more outcomes than that, but we will focus on binary problems for this assignment, in particular, data for health-based applications.
Decision trees are fairly unsophisticated by themselves. They simply create a hierarchy of choices that describe the data. However, they can become very powerful when used in combination with each other. This grouping of decision trees into a single model is called a random forest, and it is a common type of machine learning classifier (an algorithm which creates a data model to “classify” or identify a particular outcome, e.g., cat vs. dog).
Decision trees are trained using information gain. By looking at the data and considering all possible choices for decision points, the goal is to find a decision that will tell us the most about the problem. In this example, we would want to find an attribute and data point that helps us to separate the most cats from dogs. Height > 10 is an example of a single threshold that can split most data points.
You will not need to code the information gain function for this assignment! It is well beyond the scope of this class, although you can read more about it here. To see a fun discussion and application of information gain, watch 3Blue1Brown’s video analyzing Wordle using information theory.
Getting Started
In this assignment, you will first implement an array-based heap. Next, you will implement a decision tree. We will leverage concepts from earlier assignments, as well as structures provided by the standard template library (STL) in C++.
- Go to the assignment on Mimir and download the starter code.
- There is a bibliography.txt file where you should track the references you use for this assignment.
- The following files are fully implemented and will not need to be edited:
    - DecisionLogic.h – header file for the Decision class and helper functions such as the information gain calculator
    - DecisionLogic.cpp – source code which implements the helper functions to read the data into parallel vectors (essentially a table) and can calculate information gain for a specific attribute
    - runner.cpp – a program which loads input data from provided arguments and trains and tests a decision tree
- The MaxHeap, DNode, and DTree are unimplemented.
- There are provided datasets for detecting diabetes and cancer based on a set of readings. You will use these with the runner to test your decision tree.
    - The diabetes dataset comes from a Coursera class [1] and the cancer dataset is the "Wisconsin Breast Cancer Database" from the University of Wisconsin Hospitals, Madison [2].
- As an example, once you have a compiled program, you can run it with the data for cancer detection with the following (assuming you named the program runner):

    ./runner train_cancer.csv test_cancer.csv
Implement the MaxHeap
- As we discussed in lecture, heaps are tree-based structures which can be built on top of trees or on top of arrays. While arrays and other linear structures are generally more work to use as a basis for generic trees, heaps can be implemented on top of arrays quite easily because heaps must be complete trees.
- You may use vector for your heap. You may also use your own dynamic array or tree-based implementation. In either case, your heap must be compatible with the function signatures defined in Part A.
Implement the DNode and DTree
- The decision tree will look quite similar to trees we have seen so far, but it will have highly specialized operations given that its sole purpose is to serve as a classifier.
- The decision tree will need to use the max-heap from the previous step, so you will find it easiest to start with the heap.
We are not requiring a makefile for this assignment since it is straightforward to compile the sources together, unlike in the previous assignment where we had intermediate build operations to generate the library. That said, you are encouraged to use a makefile to practice building them.
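If you do want that practice, a minimal makefile might look like the following; the source list here is an assumption, so match it against what the starter code actually contains.

```make
# Hypothetical minimal makefile; adjust the source list to the starter code.
CXX = g++
CXXFLAGS = -std=c++17 -Wall

runner: runner.cpp DecisionLogic.cpp
	$(CXX) $(CXXFLAGS) runner.cpp DecisionLogic.cpp -o runner

clean:
	rm -f runner
```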
Part A: MaxHeap
Task
Implement a templated max-heap in MaxHeap.h. Your max-heap will ultimately be used for tracking the maximum information gain used by the decision tree, but we will test it independently as a fully functioning implementation of a generalized heap. Your heap only needs a single template type, not a key-value pair, since we’ll be using a separate object to hold the key-value pair together.
You must implement the following functions inside the MaxHeap class. For the purposes of testing, make each of these functions public!
MaxHeap Class Function Signatures | Description
MaxHeap() | Default constructor
MaxHeap(vector<T>) | Constructor accepts a vector of generic objects in any order and converts it into a valid max-heap via heapify
int getParent(int) const | Given the index of the current node, returns the index of the parent node
int getLeft(int) const | Given the index of the current node, returns the index of the left child node
int getRight(int) const | Given the index of the current node, returns the index of the right child node
bool empty() const | Boolean indicates if the heap is empty or not
void upheap(int) | Given the index of a node, propagate it up the max-heap if the parent is a smaller value
void downheap(int) | Given the index of a node, propagate it down the max-heap if a child has a larger value
void heapify() | Heapify the current structure to ensure the rules of a heap are maintained; used in conjunction with the parameterized constructor that accepts a vector
Part B: DNode and DTree
Task
First, implement the data members for the DNode struct.

DNode Struct Data Members | Description
int depth | Depth of the node, used to track its position in the tree and to use for printing. Root is at depth = 0; its children at depth = 1; and so on.
DNode* parent | Pointer to parent
DNode* left | Pointer to left child
DNode* right | Pointer to right child
Next, implement the data and function members for the DTree class in DTree.h. Again, large amounts of logic have been provided for you through the already-implemented code. However, the DTree class is not without some challenges. It is very similar to the trees we have seen before, including some concepts from labs, but it is slightly unique to this application.
DTree Class Data Members | Description
DNode* root | Root of the tree
int maxDepth = 10 | Default max depth of the tree
double delta = 0.000005 | A very small delta and its default value, used for comparing against the information gain to determine if too little has been learned from this level and we should not branch any further
DTree Class Function Signatures | Description

Decision getMaxInformationGain(
        vector<string>& attributes,
        vector< vector<double> >& data,
        vector<int>& outcomes,
        vector<int>& instances)

    This function iterates through the attributes vector and calls the provided getInformationGain() for each attribute. If you are passing the attribute in attributes.at(i), you will pass its corresponding data from data.at(i) to the getInformationGain() function, along with the set of outcomes and instances being considered.

    This function must utilize your max-heap to save the Decisions and return the one with the max information gain at the end! Notice that the Decision class overloads several comparison operators so it works with the heap.

void train(
        vector<string>& attributes,
        vector< vector<double> >& data,
        vector<int>& outcomes,
        vector<int>& instances)

    This function takes the set of attributes, all data, and the list of outcomes and instances so that it can build the decision tree. It should call getMaxInformationGain to get the Decision that should be turned into the next node.

    You may find it easiest to create a recursive helper: first populate the root from the decision with the max gain and then call a recursive helper to populate the left and right children.

    A Decision keeps track of the instances that should be considered at each level. Left children use the "instancesBelow" list for instances, and right children use the "instancesAbove" list. Remember to track depth and parent when creating a new node (you can add these as parameters to your recursive helper!).

    You stop adding new children in a branch when you either reach maxDepth or the information gain of the max decision is below the delta.

int classify(
        vector<string>& attributes,
        vector<double>& data)

    Given a single instance, which is a list of attributes and their corresponding data points in parallel vectors, your goal is to return the decision tree's guess for the likely class or outcome. The return is an integer since we are using binary class labels of 0 and 1.

    The classify operation is very similar to search for a binary search tree:
    - You start at the root and check the Decision's attribute label. You scan the attributes vector until you find the correct label.
    - Next, you compare the data for that attribute (at the same index in the parallel data vector) against the Decision point's threshold.
    - If the recorded data is less than or equal to the threshold, you go down the left subtree. Else, you go down the right subtree.
    - This operation continues until you reach the bottom of the branch (reach nullptr).

    Classify can be done recursively but is also quite easy to do iteratively since it is like a linked list traversal.

string levelOrderTraversal()

    Finally, in order to understand the structure the tree has built, we can use a level-order traversal. In class, we went through an example of this algorithm with detailed pseudocode. You can implement it exactly from that example with one modification to how we print the data. Add the following line in your function when the current node is dequeued so that you can print it with an indentation level corresponding to its depth:
Interesting Links
In addition to using our heap to support the decision tree, heaps can be useful for other machine learning algorithms as well. For instance, the popular k-Nearest Neighbors (kNN) model classifies a data point by finding the labeled data points closest to it. Sorting is central to finding those nearest neighbors, and quicksort is among the fastest and most favored methods. However, since portions of heapsort can be performed in parallel across multiple processors, heapsort has been shown to be a faster method for certain kNN implementations.
For further reading about k-Nearest Neighbors, look at this Towards Data Science article (which is in fact the original source of our diabetes dataset for this problem). Also look at this research paper on fast heapsort-based kNNs. You will see many of the concepts we have been discussing in this class, such as sorting algorithms, Big-O complexity, and more, and you can see an example of how they are used for practical purposes.