Name: CSC 4780/6780 Homework 4

Read

At this point you should have read up to the start of Chapter 8: Probability, Distributions, and Sampling.

Scrape a webpage

(6 points) Create a Python program called scrape.py that takes a date in ISO format and an output filename as arguments:

> python3 scrape.py 2022-10-02 result.xlsx
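One possible way to read those two arguments is sketched here with the standard library's argparse and date.fromisoformat. The names parse_args, day, and outfile are illustrative choices, not something the assignment requires.

```python
import argparse
from datetime import date


def parse_args(argv):
    """Parse two positional arguments: an ISO date and an output path."""
    parser = argparse.ArgumentParser(description="List events on a given date.")
    # date.fromisoformat raises ValueError for strings it cannot parse,
    # and argparse reports that as a usage error.
    parser.add_argument("day", type=date.fromisoformat, help="date as YYYY-MM-DD")
    parser.add_argument("outfile", help="path of the .xlsx file to create")
    return parser.parse_args(argv)


# Passing the list explicitly keeps the example self-contained.
args = parse_args(["2022-10-02", "result.xlsx"])
print(args.day, args.outfile)
```

In the real script you would call parse_args(sys.argv[1:]) after importing sys.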

The program will then create an Excel spreadsheet that lists the names of the events that will happen on that date and their URLs. It should look like this when you open it in Excel:

Behind the scenes, your program will

fetch the web page at https://discoveratlanta.com/events/all/

parse the result using BeautifulSoup and html.parser

step through each article inspecting the dates of the events

skip articles that do not contain the desired date

for articles that have the desired date, note the title and the URL

make a dataframe with all the titles and URLs

write the dataframe to an ExcelWriter

resize the columns to be a reasonable width

write it to the file named on the command line
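The parse-and-filter steps above might be sketched as follows. The HTML here is a made-up stand-in: the data-dates attribute, the tag layout, and the example.com URLs are assumptions, so inspect the real page before choosing selectors. Fetching the page over the network is left out to keep the sketch self-contained.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real page's tags and attribute names may differ.
SAMPLE_HTML = """
<article data-dates="2022-10-01,2022-10-02">
  <h4><a href="https://example.com/jazz">Jazz Night</a></h4>
</article>
<article data-dates="2022-10-05">
  <h4><a href="https://example.com/market">Farmers Market</a></h4>
</article>
"""


def events_on(html, wanted):
    """Return (title, url) pairs for articles whose dates include `wanted`."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for article in soup.find_all("article"):
        dates = article.get("data-dates", "").split(",")
        if wanted not in dates:
            continue  # skip articles that do not contain the desired date
        link = article.find("a")
        if link is None:
            continue  # no link means no title or URL to record
        found.append((link.get_text(strip=True), link.get("href")))
    return found


events = events_on(SAMPLE_HTML, "2022-10-02")
print(events)
```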

You are putting data into only 2 columns. Don’t include the dataframe’s index in the Excel file.
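A minimal sketch of the dataframe and ExcelWriter steps, assuming the openpyxl engine is installed. The column headings, sheet name, widths, and sample rows are placeholders, and the file goes to a temporary directory only so the example leaves nothing behind.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the scraped titles and URLs.
events = [
    ("Jazz Night", "https://example.com/jazz"),
    ("Farmers Market", "https://example.com/market"),
]
df = pd.DataFrame(events, columns=["Title", "URL"])

outfile = os.path.join(tempfile.mkdtemp(), "result.xlsx")
with pd.ExcelWriter(outfile, engine="openpyxl") as writer:
    # index=False keeps the dataframe's index out of the spreadsheet.
    df.to_excel(writer, sheet_name="Events", index=False)
    sheet = writer.sheets["Events"]  # the underlying openpyxl worksheet
    sheet.column_dimensions["A"].width = 40
    sheet.column_dimensions["B"].width = 60
```

With a different engine such as xlsxwriter, the column-width calls are different, so treat the last three lines as specific to openpyxl.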

Analyze the residual from the last exercise

(4 points) My solution to last week’s regression problem (linreg_scikit.py and util.py) is in this directory. Extend it to save a histogram of the residual as res_hist.png.
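Saving the histogram could look roughly like this, assuming matplotlib is available. The residual values are random stand-in data, and the function name save_histogram, the bin count, and the axis labels are arbitrary choices.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # draw to a file; no display is needed
import matplotlib.pyplot as plt
import numpy as np


def save_histogram(residual, path):
    """Save a histogram of the residual values to `path` as an image."""
    fig, ax = plt.subplots()
    ax.hist(residual, bins=20)
    ax.set_xlabel("Residual (USD)")
    ax.set_ylabel("Number of properties")
    fig.savefig(path)
    plt.close(fig)


# Stand-in data; in linreg_scikit.py this would be actual minus predicted price.
rng = np.random.default_rng(0)
residual = rng.normal(loc=0.0, scale=90_000.0, size=519)
hist_path = os.path.join(tempfile.mkdtemp(), "res_hist.png")
save_histogram(residual, hist_path)
```

In the homework itself you would pass "res_hist.png" as the path.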

Extend linreg_scikit.py again to use scipy’s kstest to confirm that the residual really resembles a normal distribution. The test returns a P-value; if the P-value is less than 0.05, you can assume the residual is normally distributed.
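Here is a small sketch of calling scipy.stats.kstest on a residual array. The data is randomly generated stand-in data, and comparing against a normal distribution with the sample's own mean and standard deviation is an assumption, since the assignment does not say how the reference distribution should be parameterized. The sketch only computes and prints the P-value.

```python
import numpy as np
from scipy.stats import kstest

# Stand-in data; the real residual comes from the regression.
rng = np.random.default_rng(0)
residual = rng.normal(loc=0.0, scale=90_000.0, size=519)

# Assumption: the reference is a normal distribution with the sample's
# mean and standard deviation, passed through `args` as (loc, scale).
result = kstest(residual, "norm", args=(residual.mean(), residual.std()))
print(f"Kolmogorov-Smirnov: P-value = {result.pvalue}")
```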

Now that you know it is a normal distribution, extend linreg_scikit.py yet again to print your confidence like this: “68% of the estimates done with this formula will be within $89.12 of the correct price. 95% will be within $140.19 of the correct price.”
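One way to get the two figures, sketched under the assumption that they are one and two standard deviations of the residual. That reading matches the expected output under "What to turn in", where the 95% figure is exactly twice the 68% figure. The function name, the residual values, and the choice of population standard deviation are all illustrative.

```python
import statistics


def confidence_bounds(residual):
    """Return (one sigma, two sigma) for the residual values."""
    # Assumption: population standard deviation; a sample standard
    # deviation (statistics.stdev) would give a slightly larger value.
    sigma = statistics.pstdev(residual)
    return sigma, 2 * sigma


# Made-up residual values for illustration only.
one, two = confidence_bounds([-2.0, 0.0, 2.0])
print(f"68% of predictions with this formula will be within ${one:,.2f} of the actual price.")
print(f"95% of predictions with this formula will be within ${two:,.2f} of the actual price.")
```

The `:,.2f` format spec adds the thousands separators seen in the expected output.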

What to turn in

If your name is Fred Jones, you will turn in a zip file called HW04_Jones_Fred.zip of a directory called HW04_Jones_Fred. It will contain:

scrape.py

result.xlsx

linreg_scikit.py

util.py

properties.xlsx

res_hist.png

Be sure to format your Python code with black before you submit it.

We will run your code like this:

cd HW04_Jones_Fred

python3 scrape.py 2022-10-02 result.xlsx

python3 linreg_scikit.py properties.xlsx

The output from the second program should look like this:

> python3 linreg_scikit.py properties.xlsx

Read 519 rows, 5 features from ’properties.xlsx’.

predicted price = $32,362.85 + ($85.61 x sqft_hvac) + ($2.73 x sqft_yard) + ($59,195.07 x bedrooms) + ($9,599.24 x bathrooms) + ($-17,421.84 x miles_to_school)

Kolmogorov-Smirnov: P-value = 4.154181404788638e-129

The residual follows a normal distribution.

68% of predictions with this formula will be within $91,849.54 of the actual price.

95% of predictions with this formula will be within $183,699.08 of the actual price.

And should generate a histogram like this:

Do this work by yourself. Stack Overflow is OK. A hint from another student is OK. Looking at another student’s code is not OK.

Extra help

Here is a nice tutorial on Beautiful Soup: https://youtu.be/87Gx3U0BDlo

Getting ahead: Soon we will be doing classification. Here is a good discussion of metrics for the quality of a classifier: https://www.youtube.com/watch?v=8d3JbbSj-I8
