Machine Learning Assignment 02 Solution

$30.00 $24.00

In this assignment,you will implement the algorithm for constructing K-D tr ees (covered in lecture 10), and then nd the neighbor sets for some data, along with the nearest neighbor in that set. Note that we will not be doing the more complex task of nding the nearest neighbor in the entire data-set, which involves…

5/5 – (2 votes)

You’ll get a: zip file solution

 

Description

5/5 – (2 votes)

In this assignment,you will implement the algorithm for constructing K-D tr ees (covered in lecture 10), and then nd the neighbor sets for some data, along with the nearest neighbor in that set. Note that we will not be doing the more complex task of nding the nearest neighbor in the entire data-set, which involves back-tracking up the tree and using bounding-box information. Instead, we will simply be classifying new data according to the median values used in building the tree.

For full points, your program will work as follows:

  1. (5 pts.) The program will execute from the command line, and handle all input/output using standard system IO (in Java, e.g., this is System.in and System.out, and in C++ you would use cin and cout, etc.).

When the program is executed, it is given two command-line arguments, in order:

    1. A data- le from which to read. You can assume the le exists in the same directory as the executing program.

    1. A minimal set-size (this is the parameter S used by the algorithm to decide when to stop splitting data). You can assume this will be a positive integer value.

Note that these arguments are supplied when we execute the code; unlike the prior program, they are not something that the program prompts the user for after it begins to run. As an example, in a Java context, we may execute the program as follows:

java KD_Builder 5d.txt 10

This would cause the program to build a tree using the 5-dimensional data set found in the le 5d.txt, and splitting that data up until each set of points had 10 or fewer members. Note: The assignment folder contains a sample runs of the program so you can see what is expected in this step and later steps; your output should be much the same, if not identical.

  1. (44 pts.) The program will then execute as follows:

    1. The program will read in its data-set from the le given in the rst command-line argument. Each data-set consists of:

      1. An integer value d that gives the dimensionality of the data.

      1. One or more lines, each consisting of d oating-point values v; 0 v < 1, that de ne data-points. No two data-points will be identical, although some may have the same values on some dimensions.

Sample data- les have been given for data with between 2 and 5 dimensions; your code should be able to handle any positive number of dimensions up to 100-dimensional data. (You can write your own code to produce such data-sets if you want to test this.)

  1. Once your program has read in the data, it will construct a K-D tree, using the input parameter S to determine when it has divided the data into small enough subsets. When building this tree, you will follow the algorithm covered in lecture, placing elements that fall at or below a median value on the left-hand branch below a dividing node, and those that fall above the median on the right-hand branch.

1

Note 01: While you may assume that the user gives you a minimal set-size value that is positive (S 1), that value may actually be higher than the number of data-points in your input set. When this happens, the algorithm produces a \tree” consisting of a single root-node, containing all of the data; your code should account for this possibility. An example of where this occurs is in the le testRun_2d_small-0.txt.

Note 02: Depending upon the data, and the value S, it is possible that all data-points in a given subset have the same value for the dimension that is being used to divide that subset. When this occurs, one of the sets X or X+ will be empty; your code should account for this possibility, too.

  1. After constructing the tree, the program will prompt the user as to whether or not it should display a print-out of the nal subsets of data, stored in the leaves of the tree. If the user chooses to display the leaves, each will be printed with the following information:

The path of left (L) or right (R) branches that need to be followed to reach the corresponding leaf-node.

The minimal and maximal points that de ne the bounding-box for all data in that set (that is the tightest points that de ne opposite corners of the region surrounding each data-point).

A listing of each data-point in the set.

Leaves should be printed from left-most to right-most across the tree. See the samples of this output in the four testRun_2d_small-x.txt les.

    1. After printing the tree (or not, if the user decides not to do so), the program will prompt the user as to whether or not to use the tree to classify some data-set. If the user chooses to do so, the program will prompt them for the name of a le. Again, you can assume that the le exists in the same directory as the executing code, and that it contains data of the same dimensionality as that used to build the tree. The program will then take each data-point from the testing le, and nd the root-node containing the subset of its possible nearest neighbors, by going left or right according to the median values.

When that subset is found, it will be printed out. The program will also nd the member of that set that is closest to the given data-point, identifying that neighbor, and printing out the distance to it. (This distance is the usual Euclidean, i.e., the L2 norm.) See the various sample runs for examples of how this ought to be displayed.

Note 01: Because some leaf-nodes may be empty, your code may nd that a new data-point ends up with no nearest neighbors in its set. This must be handled, and indicated. An example of where this occurs is in the le testRun_2d_small-3.txt.

Note 02: While you can assume that there are no duplicate data-points in the input (otherwise some implementations and runs of the algorithm would never terminate cor-rectly), it is possible for the testing data to contain a point identical to one found in the data used to build the tree. When this happens, a point will be identical in value to its nearest neighbor, and the distance between them will be 0. An example of where this occurs is in the le testRun_2d_small-3.txt.

  1. (1 pt.) When done nding nearest neighbors for the testing data, the program should wish the user goodbye, since is always important to be polite.

2

You will hand in a single compressed archive containing the following:

  1. Complete source-code for your program. This can be written in any language you choose, but should follow proper software engineering principles for said language, including meaningful comments and appropriate code style. Code should be in its own folder within the archive, to keep it organized and separate from the rest of the submission.

  1. A README le that provides instructions for how the program can be run. It is expected that this will consist of the command-line instructions necessarily. If you have completed the nal part of the assignment (the optional verbose mode), then your instructions should include the commands needed to activate that mode and see the tree print-out at the end.

Your code will be tested using input les distinct from the ones provided, although they will be in the same basic format.

3

Machine Learning Assignment 02 Solution
$30.00 $24.00