CSC 4780/6780 Homework 03

  • What are we doing?

Linear regression is a very common method of making predictions. You should learn both ways of solving for the coefficients: matrix inversion and gradient descent.

You have decided to create a company called “Zillom” that estimates the price that a house will sell for. I have given you a spreadsheet (properties.xlsx) with the features and prices of 519 houses that have sold recently in Cleveland. The first five columns are the features you will use to predict prices:

  • sqft_hvac: Indoor square footage

  • sqft_yard: Outdoor square footage

  • bedrooms: Number of bedrooms in the house

  • bathrooms: Number of bathrooms in the house

  • miles_to_school: Number of miles children would need to walk to the nearest elementary school

You are going to use this spreadsheet (and linear regression!) to create a formula for predicting the sale price of any house in Cleveland.

(I got Stable Diffusion running, and I asked it to make “an Edward Hopper painting of a realtor in front of a modern house” for you. I’ve included three of the images in this document.)

  • Write programs that do linear regression

You are going to create three Python programs:

  • linreg_mi.py uses matrix inversion to come up with the formula.

  • linreg_scikit.py uses scikit-learn to find the formula.

  • linreg_gd.py uses gradient descent to converge upon the formula.

All three take the filename of the spreadsheet as an argument:

> python3 linreg_mi.py properties.xlsx
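
The argument handling at the top of each script might look something like this minimal sketch (the starter files may already do the equivalent; the usage message is my own):

import sys

# Pull the spreadsheet filename off the command line.
if len(sys.argv) != 2:
    print(f"Usage: python3 {sys.argv[0]} <spreadsheet.xlsx>")
    sys.exit(1)

filename = sys.argv[1]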

The program will read the Excel spreadsheet that has the features of the houses and the price each sold for. (property_id will be the index for the dataframe; ignore it in the calculations.)

All three will find the hyperplane that minimizes the L2 error for those 519 data points. Each program will output those coefficients as a formula for predicting house prices:

predicted price = $32,362.85 + ($85.61 x sqft_hvac) + ($2.73 x sqft_yard) +

($59,195.07 x bedrooms) + ($9,599.24 x bathrooms) +

($-17,421.84 x miles_to_school)
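
For the matrix-inversion program, the coefficients come from the normal equations. A sketch (not necessarily the exact code in linreg_mi.py), assuming X already carries its leading column of 1s:

import numpy as np

def solve_normal_equations(X, Y):
    # B = (X^T X)^{-1} X^T Y minimizes the sum of squared errors;
    # B[0] is the intercept because column 0 of X is all 1s.
    return np.linalg.inv(X.T @ X) @ X.T @ Y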

They will also output the $R^2$ score for the fit. What is $R^2$?

  • $R^2$

We usually speak of the inputs for a prediction as the matrix $X$, where each row $x_i$ is the input for one data point.

We usually speak of the vector of correct answers (“the ground truth”) as $Y$, where each element $y_i$ is the output for one data point. The mean of $Y$ is usually denoted $\bar{y}$.

Your linear regression will create a set of coefficients $B$. For each input $x_i$, you can use $B$ to create a prediction $\hat{y}_i$.

The dumbest linear function for estimating would be just the constant function that returns the mean of $Y$. This would be equivalent to asking “How much will this house sell for?” and getting the answer, “Well, I’m going to ignore all the features of the house, and tell you that the average price of these 519 houses is $603,139.95.” The sum of squared errors for this dumb approach would be

$$\sum_{i=1}^{n} (y_i - \bar{y})^2$$

Your predictions $\hat{y}_i$ should have a smaller sum of squared errors:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The $R^2$ score for a set of predictions is:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

If the data is basically linear without much noise, $R^2$ will be close to 1: you have a good fit.

If the data is not linear or is very noisy, $R^2$ will be close to 0: your fit is terrible, about as bad as ignoring all the features and just using the mean.
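
Here is one way the score function in util.py might compute this with numpy; a minimal sketch assuming X carries the leading column of 1s, so that X @ B gives the predictions:

import numpy as np

def score(B, X, Y):
    # R^2 = 1 - (squared error of the predictions) / (squared error of the mean)
    Y_hat = X @ B
    ss_res = np.sum((Y - Y_hat) ** 2)
    ss_mean = np.sum((Y - Y.mean()) ** 2)
    return 1.0 - ss_res / ss_mean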

  • Steps

You will edit two files: util.py and linreg_gd.py.

4.1 util.py

You need to write three functions in util.py. When they are done correctly, linreg_mi.py and linreg_scikit.py will run unchanged. The three functions are:

  • read_excel_data, which reads in the Excel file and returns X, Y, and labels. Y is a one-dimensional numpy array containing the last column of the spreadsheet. X is a two-dimensional numpy array that contains the data in the other columns and whose first column is filled with 1s. labels is a list of strings from the header in the spreadsheet.

  • format_prediction, which takes B (the vector of coefficients) and the labels that you created in read_excel_data. It returns a string like this:

predicted price = $32,362.85 + ($85.61 x sqft_hvac) + ($2.73 x sqft_yard) +

($59,195.07 x bedrooms) + ($9,599.24 x bathrooms) +

($-17,421.84 x miles_to_school)

  • score, which takes B, X, and Y and returns the $R^2$ score.

When util.py is done, you should be able to run linreg_mi.py and linreg_scikit.py.
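
Here is a hedged sketch of the first two functions (score was sketched in the $R^2$ section above). I am assuming the price is the last column and that labels should hold only the feature headers; adjust if your spreadsheet differs:

import numpy as np
import pandas as pd

def read_excel_data(filename):
    # property_id becomes the dataframe index and stays out of the math.
    df = pd.read_excel(filename, index_col=0)
    labels = list(df.columns[:-1])        # feature headers only (my assumption)
    Y = df.iloc[:, -1].to_numpy()         # last column: sale price
    features = df.iloc[:, :-1].to_numpy()
    X = np.hstack([np.ones((len(df), 1)), features])  # prepend the 1s column
    return X, Y, labels

def format_prediction(B, labels):
    # B[0] is the intercept; B[1:] pairs up with the feature labels.
    terms = [f"${B[0]:,.2f}"]
    for coef, label in zip(B[1:], labels):
        terms.append(f"(${coef:,.2f} x {label})")
    return "predicted price = " + " + ".join(terms)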

4.2 linreg_gd.py

In this program, you will use gradient descent to minimize the squared error. The result should be very nearly the same as that of linreg_mi.py.

The features in properties.xlsx are on very different scales (2 bathrooms vs. 50,000-square-foot yards). As a result, convergence would take a very, very long time if you don’t first standardize the features.

The first step is to find the mean and the standard deviation of each column of X. Use those to make each feature have a mean of 0 and a standard deviation of 1.
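
A minimal sketch of that step, assuming X is the matrix from read_excel_data; note that column 0 (the 1s) is skipped, since its standard deviation is 0:

import numpy as np

M = X[:, 1:].mean(axis=0)             # per-feature means
S = X[:, 1:].std(axis=0)              # per-feature standard deviations
X_prime = X.copy()
X_prime[:, 1:] = (X[:, 1:] - M) / S   # standardized features; 1s column untouched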

Then start with a guess of zero for all the coefficients. Do the following many times:

  • Calculate the gradient

  • Update your guess. (Multiply the gradient by -0.001 and add to the last guess.)

  • Compute and record the new mean squared error

When the gradient gets small (and thus the changes to the coefficients get small), stop. It should take a few hundred iterations. A sketch of this loop appears below.
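
This sketch assumes X_prime and Y from the earlier steps. I take the gradient of the total squared error (dividing by n also works, but then the -0.001 step is far too timid); the stopping threshold below is my own guess, so tune it until convergence takes a few hundred iterations:

import numpy as np

B_prime = np.zeros(X_prime.shape[1])   # start with a guess of all zeros
mse_history = []                       # record the MSE at every step for err.png

for iteration in range(100_000):       # safety cap; it should stop far sooner
    residuals = X_prime @ B_prime - Y
    gradient = 2.0 * X_prime.T @ residuals   # gradient of sum((XB - Y)^2)
    B_prime = B_prime - 0.001 * gradient     # update the guess
    mse_history.append(np.mean((X_prime @ B_prime - Y) ** 2))
    if np.linalg.norm(gradient) < 1.0:       # "the gradient got small"
        break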

The coefficients that you have calculated are for standardized inputs. Using the means and standard deviations you computed earlier, adjust them to work on unstandardized data. (The math for this is in the next section.)

Calculate and display the $R^2$ score.

Plot the mean squared error vs. iteration count. This will be most interesting if you use log scaling for both the x and y axes. Save it as err.png. Mine looks like this: [figure: example err.png plot]
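
A matplotlib sketch of that plot, assuming the mse_history list from the loop above:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(range(1, len(mse_history) + 1), mse_history)
ax.set_xscale("log")                  # log scaling on both axes
ax.set_yscale("log")
ax.set_xlabel("Iteration")
ax.set_ylabel("Mean squared error")
fig.savefig("err.png")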

4.3 Standardizing and compensating for standardizing

Using the matrix $X$, you will calculate the vector of means $M = [m_1, m_2, \dots, m_d]$ and standard deviations $S = [s_1, s_2, \dots, s_d]$.

Then you will create a new matrix $X'$ in which each entry has been standardized. The entries in column $j$ are given by

$$x'_j = \frac{x_j - m_j}{s_j}$$

Now the matrix $X'$ has two nice properties:

  • The mean of every column is 0.

  • The standard deviation of every column is 1.

When you use those numbers to do linear regression, you will get a vector $B' = [b'_0, b'_1, b'_2, \dots, b'_d]$ which can be used for predictions like this:

$$\hat{y} = b'_0 + b'_1 x'_1 + \dots + b'_d x'_d$$

where the inputs have been standardized using the $M$ and $S$ that you calculated from the training data.

However, we really want the vector $B = [b_0, b_1, b_2, \dots, b_d]$ so that we can put non-standardized data $[x_1, \dots, x_d]$ into the formula

$$\hat{y} = b_0 + b_1 x_1 + \dots + b_d x_d$$

Using the definition of $x'_j$ from above, we have:

$$\hat{y} = b'_0 + b'_1 \frac{x_1 - m_1}{s_1} + \dots + b'_d \frac{x_d - m_d}{s_d}$$

Expanding and sorting, we get:

$$\hat{y} = \left( b'_0 - b'_1 \frac{m_1}{s_1} - \dots - b'_d \frac{m_d}{s_d} \right) + \frac{b'_1}{s_1} x_1 + \dots + \frac{b'_d}{s_d} x_d$$

Thus,

$$b_0 = b'_0 - b'_1 \frac{m_1}{s_1} - \dots - b'_d \frac{m_d}{s_d}$$

and for $0 < j \le d$:

$$b_j = \frac{b'_j}{s_j}$$

Use those for your final answer and the $R^2$ calculation.
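
In numpy terms, assuming the B_prime, M, and S arrays from the earlier sketches, the conversion might look like this:

import numpy as np

B = np.empty_like(B_prime)
B[1:] = B_prime[1:] / S                          # b_j = b'_j / s_j
B[0] = B_prime[0] - np.sum(B_prime[1:] * M / S)  # shift the intercept by the means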

  • Scikit-Learn

In a job situation, you will use the sklearn implementation 99% of the time. It uses singular value decomposition and pseudo-inverses, so it is usually faster and more reliable than the matrix inversion approach.
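
The heart of such a script can be a few lines. A sketch, assuming X still carries its leading column of 1s (so sklearn's own intercept is turned off):

from sklearn.linear_model import LinearRegression

# fit_intercept=False because column 0 of X already plays the intercept's role.
reg = LinearRegression(fit_intercept=False)
reg.fit(X, Y)
B = reg.coef_    # same layout as the matrix-inversion coefficients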

  • Criteria for success

If your name is Fred Jones, you will turn in a zip file called HW03_Jones_Fred.zip of a directory called HW03_Jones_Fred. It will contain:

  • linreg_mi.py (You don’t need to edit this.)

  • linreg_scikit.py (You don’t need to edit this.)

  • linreg_gd.py (Add about 22 lines of code.)

  • util.py (Add about 20 lines of code.)

  • err.png

  • properties.xlsx (You don’t need to edit.)

Be sure to format your Python code with black before you submit it.

We will run your code like this:

cd HW03_Jones_Fred

python3 linreg_mi.py properties.xlsx

python3 linreg_scikit.py properties.xlsx

python3 linreg_gd.py properties.xlsx

We expect the following output:

> python3 linreg_mi.py properties.xlsx

Read 519 rows, 5 features from 'properties.xlsx'.

predicted price = $32,362.85 + ($85.61 x sqft_hvac) + ($2.73 x sqft_yard) +

($59,195.07 x bedrooms) + ($9,599.24 x bathrooms) + ($-17,421.84 x miles_to_school)

R2 = 0.875699

> python3 linreg_scikit.py properties.xlsx

Read 519 rows, 5 features from 'properties.xlsx'.

predicted price = $32,362.85 + ($85.61 x sqft_hvac) + ($2.73 x sqft_yard) +

($59,195.07 x bedrooms) + ($9,599.24 x bathrooms) + ($-17,421.84 x miles_to_school)

R2 = 0.875699

> python3 linreg_gd.py properties.xlsx

Read 519 rows, 5 features from 'properties.xlsx'.

Took 352 iterations to converge

predicted price = $32,362.82 + ($85.61 x sqft_hvac) + ($2.73 x sqft_yard) +

($59,196.55 x bedrooms) + ($9,598.99 x bathrooms) + ($-17,421.85 x miles_to_school)

R2 = 0.875699

You will get 4 points for a well-written util.py that enables linreg_mi.py and linreg_scikit.py to get the right answer.

You will get 5 points for a well-written linreg_gd.py that uses gradient descent and gets approximately the same answer as linreg_mi.py.

You will get 1 point for an err.png that looks right.

Do this work by yourself. Stack Overflow is OK. A hint from another student is OK. Looking at another student’s code is not OK.

  • Extra help

A good video on gradient descent and linear regression: https://youtu.be/sDv4f4s2SB8
