CSC 4780/6780 Homework 12

  • Classifying tumors

In the US, about 288,000 cases of breast cancer will be diagnosed this year. The tumors are either malignant (bad) or benign (not bad). The University of Wisconsin has released a dataset with 30 metrics for 569 actual tumors and whether they were malignant or benign.

You will use this dataset to develop a system that predicts whether a tumor is malignant or benign based on these 30 metrics.

Here is where the data is from: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

You will use a support vector classifier with a non-linear kernel to do this.

  • Training

Create a program called breast_train.py that reads in train_breast.csv. Separate the inputs and target (diagnosis) into different numpy arrays.

Make the output boolean: True if the tumor is malignant.
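
One way the loading step might look is sketched below. It assumes the target column is named diagnosis and marks malignant tumors with "M" (as in the UCI release), and it assumes pandas is among the imports in your template; check both before relying on it. The in-memory CSV and the helper name load_inputs_and_target are hypothetical stand-ins.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for train_breast.csv (hypothetical columns and values).
SAMPLE_CSV = """diagnosis,radius_mean,texture_mean
M,17.99,10.38
B,13.54,14.36
B,12.45,15.70
M,20.57,17.77
"""


def load_inputs_and_target(source, target_col="diagnosis", malignant_label="M"):
    """Return (X, y): X is a float array of inputs, y is True where malignant."""
    df = pd.read_csv(source)
    # Comparing against the malignant label yields a boolean array.
    y = (df[target_col] == malignant_label).to_numpy()
    X = df.drop(columns=[target_col]).to_numpy(dtype=float)
    return X, y


X, y = load_inputs_and_target(io.StringIO(SAMPLE_CSV))
print(f"X shape = {X.shape}, y shape={y.shape}")
```

In the real program you would pass the path to the CSV file instead of the StringIO object.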

Make a StandardScaler and fit it to the data. Use it to standardize the training data.

Use GridSearchCV to find the kernel and C that give the best F1 score for sklearn.svm.SVC:

  • Try the following kernels: Linear, Radial Basis, Polynomial, and Sigmoid.

  • Try the following values for C: 0.5, 1.0, 2.0, 3.0, 4.0.
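
The search described above could be set up roughly as follows. This is only a sketch on synthetic data from make_classification (a stand-in for the standardized training set); the kernel strings are the names sklearn.svm.SVC uses for the four kernels listed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real training data (assumption: 30 inputs).
X, y = make_classification(n_samples=200, n_features=30, random_state=0)
X_std = StandardScaler().fit_transform(X)

# Every combination of kernel and C is scored with cross-validated F1.
param_grid = {
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
    "C": [0.5, 1.0, 2.0, 3.0, 4.0],
}
search = GridSearchCV(SVC(), param_grid, scoring="f1")
search.fit(X_std, y)
print(f"Best parameters = {search.best_params_}")
```

The winning combination is available afterwards as search.best_params_.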

Create a new SVC with the best parameters. Fit it to all the training data. Print the amount of time fitting took.
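
Timing the final fit might be done with time.perf_counter, as in this sketch. The data is synthetic, and best_params is a placeholder for whatever the grid search actually returned (the values here are copied from the sample output).

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the standardized training data.
X, y = make_classification(n_samples=200, n_features=30, random_state=0)
best_params = {"C": 3.0, "kernel": "rbf"}  # placeholder for the grid search result

classifier = SVC(**best_params)
start = time.perf_counter()
classifier.fit(X, y)
elapsed = time.perf_counter() - start
print(f"Fitting took {elapsed:.6f} seconds with d={X.shape[1]} input.")
```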

Write out the StandardScaler and the SVC to classifier.pkl
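
A minimal sketch of the pickling step is shown below. Storing the two objects as a tuple is an assumption (a dict or list would work just as well), and the temporary directory is used only so the example does not overwrite a real classifier.pkl.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data so the example is self-contained.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X[:, 0] > 0

scaler = StandardScaler().fit(X)
classifier = SVC(kernel="rbf", C=1.0).fit(scaler.transform(X), y)

# Write both objects to a single pickle file.
path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")
with open(path, "wb") as f:
    pickle.dump((scaler, classifier), f)

# Reading them back returns the objects in the same order.
with open(path, "rb") as f:
    loaded_scaler, loaded_classifier = pickle.load(f)
```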

Print out the accuracy and confusion matrix for the training data.
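
The two metrics can be computed with sklearn.metrics, as in this sketch. The label arrays are made up for illustration. With boolean labels, row 0 and column 0 of the matrix correspond to False (benign) and row 1 and column 1 to True (malignant); rows are true labels and columns are predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (True means malignant).
y_true = np.array([False, False, False, True, True, True, True, False])
y_pred = np.array([False, False, False, True, True, False, True, False])

accuracy = accuracy_score(y_true, y_pred)
confusion = confusion_matrix(y_true, y_pred)
print(f"Accuracy = {accuracy * 100:.2f}%")
print(confusion)
```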

It should look something like this when it runs:

> python3 breast_train.py

X shape = (516, 30), y shape=(516,)

Best parameters = {'C': 3.0, 'kernel': 'rbf'}

Fitting took 0.003047 seconds with d=30 input.

Accuracy on training data = 98.84%

Confusion on training data:

[[320 0]

[ 6 190]]

  • Testing

Create a program called breast_test.py that reads in test_breast.csv. Separate the 30 inputs and the 1 output into different numpy arrays.

Make the output boolean: True if the tumor is malignant.

Read in the StandardScaler and SVC from classifier.pkl.

(Do not refit the StandardScaler! We are translating/scaling the test data exactly like we did the training data.)
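
To illustrate the point, the sketch below fits a scaler on synthetic training data and then only calls transform on the test data. The result matches subtracting the training mean and dividing by the training scale, which is what "exactly like we did the training data" amounts to.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and testing inputs.
rng = np.random.default_rng(1)
X_train = rng.normal(loc=10.0, scale=2.0, size=(200, 3))
X_test = rng.normal(loc=10.0, scale=2.0, size=(20, 3))

scaler = StandardScaler().fit(X_train)  # fitted once, on training data only
X_test_std = scaler.transform(X_test)  # transform only; nothing is refitted

# The same operation written out with the statistics learned in training.
expected = (X_test - scaler.mean_) / scaler.scale_
```

Calling fit or fit_transform on the test data would replace mean_ and scale_ with test-set statistics, so the classifier would see inputs on a different scale than it was trained on.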

Print out the accuracy and confusion matrix for the testing data. Also make a nice confusion matrix diagram called test_confusion.png.
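
One possible way to draw the diagram is sklearn's ConfusionMatrixDisplay, sketched below. This assumes matplotlib and ConfusionMatrixDisplay are among your template's imports. The matrix values are copied from the sample output, the "Benign"/"Malignant" labels and the title are illustrative choices, and the file is written to a temporary directory only to keep the example side-effect free.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay

# Example matrix taken from the sample output (rows: true, columns: predicted).
confusion = np.array([[37, 0], [1, 15]])

fig, ax = plt.subplots()
display = ConfusionMatrixDisplay(
    confusion_matrix=confusion, display_labels=["Benign", "Malignant"]
)
display.plot(ax=ax, cmap="Blues")
ax.set_title("Confusion on testing data")

out_path = os.path.join(tempfile.mkdtemp(), "test_confusion.png")
fig.savefig(out_path)
plt.close(fig)
print(f"Wrote {os.path.basename(out_path)}")
```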

It should look something like this when it runs:

> python3 breast_test.py

X shape = (53, 30), y shape=(53,)

Accuracy on testing data = 98.11%

Confusion on testing data:

[[37 0]

[ 1 15]]

Wrote test_confusion.png

And test_confusion.png should look something like this:

  • Reduce dimension with PCA

Maybe the model takes too long to fit. (Not in this case, but that is a common reason that PCA is used.)

Let’s use principal component analysis to reduce the dimension of the input from 30 to 11. This will make it easier for the classifier, but the change may damage our accuracy a little.

Duplicate breast_train.py and breast_test.py to breast_train_pca.py and breast_test_pca.py respectively.

In breast_train_pca.py, compute a W matrix that will reduce the 30-dimensional input to 11 dimensions using principal component analysis. Train the SVC classifier using only the 11-dimensional data. Write the scaler, W, and the SVC classifier out to pca_classifier.pkl.

Do not use sklearn’s PCA. Do it explicitly using matrix operations and numpy.linalg.svd.
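
A sketch of the explicit computation follows, assuming the inputs have already been standardized so that every column has zero mean (the synthetic data here is standardized by hand). The rows of Vt returned by numpy.linalg.svd are the principal directions, ordered by decreasing singular value, so W is the transpose of the first 11 rows.

```python
import numpy as np

# Synthetic stand-in for the standardized training inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

k = 11  # target dimension

# Thin SVD: X_std = U @ diag(S) @ Vt, with singular values in decreasing order.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)

# Columns of W are the top-k principal directions; W has shape (30, 11).
W = Vt[:k].T

# Project the 30-dimensional rows down to k dimensions.
X_reduced = X_std @ W
print(f"W shape = {W.shape}, reduced shape = {X_reduced.shape}")
```

Because the columns of W are orthonormal, the projected data equals the first 11 columns of U scaled by the matching singular values.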

When breast_train_pca.py runs:

> python3 breast_train_pca.py

X shape = (516, 30), y shape=(516,)

Best parameters = {'C': 2.0, 'kernel': 'rbf'}

Fitting took 0.002278 seconds with d=11 input.

Accuracy on training data = 98.45%

Confusion on training data:

[[318 2]

[ 6 190]]

Notice that the fit happens a little faster.

In breast_test_pca.py, read the scaler, the W matrix, and the SVC classifier from pca_classifier.pkl. Use them to classify the testing data. Print the accuracy and confusion matrix. Write out the confusion matrix diagram as test_confusion_pca.png.
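
At prediction time the three objects are applied in the same order as in training: standardize, project, classify. The sketch below builds all three on synthetic data rather than loading them from a pickle; predict_with_pca is a hypothetical helper name, not something from the template.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def predict_with_pca(X, scaler, W, classifier):
    """Standardize with the already-fitted scaler, project with W, then classify."""
    return classifier.predict(scaler.transform(X) @ W)


# Synthetic stand-ins for the training and testing sets.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 30))
y_train = X_train[:, 0] + X_train[:, 1] > 0
X_test = rng.normal(size=(15, 30))

# Training side: scaler, W, and classifier are all derived from training data only.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
_, _, Vt = np.linalg.svd(X_train_std, full_matrices=False)
W = Vt[:11].T
classifier = SVC(kernel="rbf", C=2.0).fit(X_train_std @ W, y_train)

# Testing side: nothing is refitted or recomputed.
predictions = predict_with_pca(X_test, scaler, W, classifier)
print(predictions.shape)
```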

When breast_test_pca.py runs:

> python3 breast_test_pca.py

X shape = (53, 30), y shape=(53,)

Accuracy on testing data = 94.34%

Confusion on testing data:

[[36 1]

[ 2 14]]

Wrote test_confusion_pca.png

Notice that the accuracy is a little lower.

  • Criteria for success

If your name is Fred Jones, you will turn in a zip file called HW12_Jones_Fred.zip of a directory called HW12_Jones_Fred. It will contain:

  • breast_train.py

  • breast_test.py

  • classifier.pkl

  • test_confusion.png

  • breast_train_pca.py

  • breast_test_pca.py

  • pca_classifier.pkl

  • test_confusion_pca.png

  • train_breast.csv

  • test_breast.csv

Be sure to format your Python code with black before you submit it.

We would run your code like this:

cd HW12_Jones_Fred

python3 breast_train.py

python3 breast_test.py

python3 breast_train_pca.py

python3 breast_test_pca.py

Do this work by yourself. Stack Overflow is OK. A hint from another student is OK. Looking at another student’s code is not OK.

The template files for the Python programs have import statements. Do not use any frameworks not in those import statements.
