Name: Assignment_8 Solution
SKU: 1459
Price: 30.00 USD
Availability: InStock

Description

5/5 – (2 votes)

Readings:

The mid-term will assume you have read at least the high points of the below. In “Free Books and Chapters”

Make sure to read/skim “Spark – The Definitive Guide (Excerpts)”
Make sure to read/skim “Spark – Python API – SQL & DataFrames”
Also take a look at the two Spark Cheat Sheets

For this assignment you will be using your Hadoop environment including the pyspark CLI.

Some basic notes:

We will again be using files generated by the program TestDataGen. But even though the files this program generates end is the ‘.txt’ suffix, I want you to treat them as if they were ‘.csv’ files.
In fact, if you like, when you copy them to HDFS you can change their suffixes from ‘.txt’ to ‘.csv’. But this is not necessary to complete the exercises.

Exercise 1)

Step A

Start up a Hadoop cluster as previously, but instead of choosing the “Core Hadoop” configuration chose the “Spark” configuration (see below), otherwise proceed as before.

Step B

Use the TestDataGen program from previous assignments to generate new data files.

Copy the files to the directory “/user/hadoop” in HDFS

Step C

Load the ‘foodratings’ file as a ‘csv’ file into a DataFrame called foodratings. When doing so specify a schema having fields of the following names and types:

Field Name	Field Type
name	String
food1	Integer
food2	Integer
food3	Integer
food4	Integer
placeid	Integer

As the results of this exercise provide the magic number, the code you execute and screen shots of the following commands:

foodratings.printSchema()

foodratings.show(5)

Exercise 2)

Load the ‘foodplaces’ file as a ‘csv’ file into a DataFrame called foodplaces. When doing so specify a schema having fields of the following names and types:

Field Nampee	Field Type
placeid	Integer
placename	String

As the results of this exercise provide the code you execute and screen shots of the following commands:

foodratings.printSchema()

foodratings.show(5)

Exercise 3)

Step A

Step B

Use a SQL query on the table “foodratingsT” to create a new DataFrame called foodratings_ex3a holding records which meet the following condition: food2 < 25 and food4 > 40. Remember, when defining conditions in your code use maximum parentheses.