Assignment #3 Solution

ns are not accepted and will result in a late penalty of 10% deductions / day in the assignment.

Disclaimer: This assignment requires students to work with the Spark framework for unstructured data processing, MongoDB for data storage, and the Neo4j graph database for visualization. Submissions related to this assignment will not be used for commercial purposes.

Objective:

  • The objective of this assignment is to understand Big Data processing problems and NoSQL databases (document and graph).

Plagiarism Policy:

  • This assignment is an individual task. Collaboration of any type amounts to a violation of the academic integrity policy and will be reported to the AIO.

  • Content cannot be copied verbatim from any source(s). Please understand the concept and write in your own words. In addition, cite the actual source. Failing to do so will be considered as plagiarism and/or cheating.

  • The Dalhousie Academic Integrity policy applies to all material submitted as part of this course. Please understand the policy, which is available at: https://www.dal.ca/dept/university_secretariat/academic-integrity.html

Assignment Rubric

Scale: Excellent (25%), Proficient (15%), Marginal (5%), Unacceptable (0%).

Completeness including Citation (this rubric applied to Problem #2)

  • Excellent: All required tasks are completed.

  • Proficient: Submission highlights task completion. However, missed some tasks in between, which created a disconnection.

  • Marginal: Some tasks are completed, which are disjoint in nature.

  • Unacceptable: Incorrect and irrelevant.

Correctness (this rubric applied to Problem #1)

  • Excellent: All parts of the given tasks are correct.

  • Proficient: Most of the given tasks are correct. However, some portions need minor modifications.

  • Marginal: Most of the given tasks are incorrect. The submission requires major modifications.

  • Unacceptable: Incorrect and unacceptable.

Novelty (this rubric applied to Problem #1)

  • Excellent: The submission contains novel contribution in key segments, which is a clear indication of application knowledge.

  • Proficient: The submission lacks novel contributions. There is some evidence of novelty; however, it is not significant.

  • Marginal: The submission does not contain novel contributions. However, there is evidence of some effort.

  • Unacceptable: There is no novelty.

Clarity (this rubric applied to Problem #1)

  • Excellent: The written or graphical materials and developed applications provide a clear picture of the concept and highlight the clarity.

  • Proficient: The written or graphical materials and developed applications do not show a clear picture of the concept. There is room for improvement.

  • Marginal: The written or graphical materials and developed applications fail to prove the clarity. Background knowledge is needed.

  • Unacceptable: Failed to prove the clarity. Need proper background knowledge to perform the tasks.

Summer 2021 saurabh.dey@dal.ca

Citation:

McKinney, B. (2018). The impact of program-wide discussion board grading rubrics on students’ and faculty satisfaction. Online Learning, 22(2), 289-299.

This assignment requires you to submit your program code on GitLab and a single PDF file on Brightspace.

Problem #1: This problem contains three tasks.

Task 1: Cluster Setup – Apache Spark Framework on GCP

(if no GCP credit available – Hadoop or Spark setup in personal Linux machine)

Using your GCP cloud account, configure and initialize Apache Spark cluster.

Create a flowchart or write ½ page explanation on how you completed the task, include this part in your PDF file.

Task 2: Data Extraction and Preprocessing Engine: Sources – NewsAPI

Steps for NewsAPI Operation

Step 1: Visit the news API https://newsapi.org/

Step 2: Create a developer account

Step 3: Search keywords – “Canada”, “University”, “Dalhousie”, “Halifax”, “Canada Education”, “Moncton”, “Toronto”, “


Step 3: Write a well-formed script/program using Java to extract data (Extraction Engine) from NewsAPI.

(Do not use any online program code or scripts that are not part of the official API documentation and specification.)

Step 4: You need to include an appropriate pseudocode of your data extraction program in the PDF file.

Step 5: The captured raw data should be kept (programmatically) in files. Each file should not contain more than 5 news articles. These files will be needed for “Problem #1-Task 3”

Step 6: Your program (Filtration Engine) should automatically clean and transform the data stored in the files, and then upload each record to a new MongoDB database, myMongoNews.

  1. For cleaning and transformation: remove special characters, URLs, emoticons, etc.

  2. Write your own regular expression logic. You cannot use libraries such as jsoup or JTidy.

Step 7: You need to include a flowchart of Step 6 in the PDF file.
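The five-article limit in Step 5 and the regex-based cleaning in Step 6 could be sketched roughly as follows. This is only an illustration under assumptions: the class name `FiltrationSketch`, the method names, and the three patterns are hypothetical, and the patterns would need tuning against the raw data actually captured. File writing and the MongoDB upload are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class; names and patterns are assumptions, not part of the handout.
class FiltrationSketch {

    // Matches http(s) links and bare www. links up to the next whitespace.
    private static final Pattern URL = Pattern.compile("(https?://|www\\.)\\S+");
    // Matches anything that is not a letter, digit, or whitespace
    // (this also removes text emoticons such as ":)" and emoji symbols).
    private static final Pattern SPECIAL = Pattern.compile("[^\\p{L}\\p{N}\\s]");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    /** Removes URLs first, then special characters, then collapses whitespace. */
    static String clean(String raw) {
        String text = URL.matcher(raw).replaceAll(" ");
        text = SPECIAL.matcher(text).replaceAll(" ");
        return SPACES.matcher(text).replaceAll(" ").trim();
    }

    /** Splits articles into batches of at most batchSize, one batch per output file. */
    static <T> List<List<T>> batches(List<T> articles, int batchSize) {
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < articles.size(); i += batchSize) {
            int end = Math.min(i + batchSize, articles.size());
            result.add(new ArrayList<>(articles.subList(i, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(clean("Dalhousie opens :) new lab! See https://example.com/a?b=1"));
        System.out.println(batches(List.of(1, 2, 3, 4, 5, 6, 7), 5));
    }
}
```

URLs are removed before special characters on purpose: stripping punctuation first would break a link into fragments that the URL pattern no longer recognizes.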

Task 3: Data Processing using Spark – MapReduce (written in Java) to perform count

Step 1: Write a MapReduce program (WordCounter Engine) to count (frequency count) the following substrings or words. Your MapReduce should perform the frequency count on the stored raw news files (titles and contents of the news articles)

  1. “Canada”, “Nova Scotia”, “education”, “higher”, “learning”, “city”, “accommodation”, “price” – (case sensitive)

  2. You need to include a flowchart/algorithm of your MapReduce program in the PDF file.

Step 2: In your PDF file, report the words that have highest and lowest frequencies (it must be computed programmatically).
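The shape of the frequency count and of the highest/lowest report can be illustrated in plain Java. This is a stand-in that mimics the map and reduce phases with ordinary collections; it is not Spark API code, and the class and method names are hypothetical. It assumes substring matching (so “Nova Scotia” counts as one key, and “city” would also match inside a longer word) and resolves ties by keeping the first keyword in list order.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java stand-in for the WordCounter Engine; not Spark code.
class WordCounterSketch {

    static final List<String> KEYWORDS = List.of("Canada", "Nova Scotia", "education",
            "higher", "learning", "city", "accommodation", "price");

    /** Counts non-overlapping, case-sensitive occurrences of keyword in text. */
    static int occurrences(String text, String keyword) {
        int count = 0;
        int from = text.indexOf(keyword);
        while (from >= 0) {
            count++;
            from = text.indexOf(keyword, from + keyword.length());
        }
        return count;
    }

    /** "Map": emit (keyword, count) per document. "Reduce": sum the counts per keyword. */
    static Map<String, Integer> countAll(List<String> documents) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String keyword : KEYWORDS) {
            totals.put(keyword, 0);
        }
        for (String doc : documents) {
            for (String keyword : KEYWORDS) {
                totals.merge(keyword, occurrences(doc, keyword), Integer::sum);
            }
        }
        return totals;
    }

    /** Keyword with the largest total; the first one wins on a tie. */
    static String highest(Map<String, Integer> totals) {
        return Collections.max(totals.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    /** Keyword with the smallest total; the first one wins on a tie. */
    static String lowest(Map<String, Integer> totals) {
        return Collections.min(totals.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        Map<String, Integer> totals = countAll(List.of(
                "Canada Canada education", "Canada city price price"));
        System.out.println(totals);
        System.out.println("highest: " + highest(totals) + ", lowest: " + lowest(totals));
    }
}
```

In an actual Spark job the same two phases would be expressed with the framework's own transformations; the point here is only that the per-document counts are independent and the per-keyword sum is associative, which is what makes the problem fit MapReduce.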

Problem #2: This problem contains one task

Task 1: Data Visualization using Graph Database – Neo4j for graph generation

Step 1: Explore the Neo4j graph database, understand the concept, and learn the Cypher query language.

Step 2: Visit the Nova Scotia parks website that you used in Assignment 1.

Step 3: Using Cypher, create graph nodes with

  • names of each region (e.g. Cape Breton Island Parks) as node, and

  • names of parks as nodes.

You should add properties to the nodes. For adding properties, you should check the dataset that you used in Assignment 1. For example, location, street name, size, etc. could be added as properties.

  1. All regions are parts of Nova Scotia, so all regions should be connected using edges.

  2. Each region has multiple parks, and therefore, there should be edges between parks and the region.

  3. Once the graph is constructed on Neo4j, use Cypher to find which region has the most parks. Provide the screenshot in the PDF file.

  4. Include all your Cypher statements (graph construction, find query, etc.) and the generated graph image in the PDF file.
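To keep to one language, the Cypher for the graph can be shown as strings assembled in Java. This is a hedged sketch: the labels `Province`, `Region`, and `Park`, the relationship types `PART_OF` and `LOCATED_IN`, the class name, and the park names are all assumptions for illustration. It also assumes that "connecting all regions" is done through a shared Nova Scotia node, which is one possible reading of the requirement. Node properties such as location or size are omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical generator of Cypher statements; labels and relationship types are assumptions.
class ParkGraphSketch {

    /** Wraps a value as a Cypher string literal, escaping backslashes and single quotes. */
    static String quote(String value) {
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    /** Builds one statement per node to create, linking regions to the province and parks to regions. */
    static List<String> buildStatements(Map<String, List<String>> parksByRegion) {
        List<String> cypher = new ArrayList<>();
        cypher.add("MERGE (:Province {name: 'Nova Scotia'})");
        for (Map.Entry<String, List<String>> entry : parksByRegion.entrySet()) {
            String region = quote(entry.getKey());
            cypher.add("MATCH (ns:Province {name: 'Nova Scotia'}) "
                    + "MERGE (r:Region {name: " + region + "}) "
                    + "MERGE (r)-[:PART_OF]->(ns)");
            for (String park : entry.getValue()) {
                cypher.add("MATCH (r:Region {name: " + region + "}) "
                        + "MERGE (p:Park {name: " + quote(park) + "}) "
                        + "MERGE (p)-[:LOCATED_IN]->(r)");
            }
        }
        return cypher;
    }

    /** Counts parks per region and keeps the region with the largest count. */
    static final String REGION_WITH_MOST_PARKS =
            "MATCH (p:Park)-[:LOCATED_IN]->(r:Region) "
            + "RETURN r.name AS region, count(p) AS parks "
            + "ORDER BY parks DESC LIMIT 1";

    public static void main(String[] args) {
        // Placeholder park names; real names come from the Assignment 1 dataset.
        Map<String, List<String>> data = Map.of(
                "Cape Breton Island Parks", List.of("Example Park A", "Example Park B"));
        buildStatements(data).forEach(System.out::println);
        System.out.println(REGION_WITH_MOST_PARKS);
    }
}
```

`MERGE` is used instead of `CREATE` so that re-running the statements does not duplicate nodes or edges.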


Assignment 3 Submission Format:

1) Compress all your reports/files into a single .zip file and give it a meaningful name.

You are free to choose any meaningful file name, preferably – BannerId_Lastname_firstname_5408_A3 but avoid generic names like assignment-3.

2) Submit your reports only in PDF format.

Please avoid submitting .doc/.docx and submit only the PDF version. You can merge all the reports into a single PDF. You should also include output (if any) and test cases (if any) in the PDF file.

3) Your executable code/script needs to be submitted on https://git.cs.dal.ca/
