Description
Assignments are to be uploaded via the Blackboard portal
Note: There may be short quiz questions about the readings or articles, as well as other questions, during the class period in which they are due.
- Obtain our texts
- Tom White. 2015. Hadoop: The Definitive Guide (4th ed.). O’Reilly Media, Inc. (TW)
- Pramod J. Sadalage and Martin Fowler. 2012. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Addison-Wesley. (PS)
- Read from (TW)
- Chapter 1 (note this chapter is also on Blackboard “Free Books and Chapters” so you don’t need to wait for the book to arrive)
- Chapter 3
- (2 points) Submit very brief answers (or bullet points) to the following questions:
- (Remote/Online Students Only) What location or time zone are you in when you attend the course?
- Describe any prior experience you might have with use of public cloud, data mining, machine learning, statistics, data science and big data.
- Share any big data interests and personal learning goals for the course.
- Indicate if there are additional topics in the scope of the course of special interest to you.
- Do you have any anticipated personal issues such as expected absences or other necessary accommodations with course impact? (Of course, these will be held in strictest confidence.)
- Read the following article in the Articles section on Blackboard:
- The Parable of Google Flu (just 3 pages!)
- (5 points) Answer each of the following questions about the article in just one to three sentences each:
- What was the problem with the Google flu detection algorithm?
- What is big data hubris?
- What approach could have been used to improve the Google flu detection algorithm?
- What is “algorithm dynamics”?
- What aspect of algorithm dynamics impacted the Google flu detection algorithm?
- (5 points) Set up an Amazon Web Services (AWS) cloud account, if you don’t already have one (see below for details), and then follow the tutorial on how to work with a storage service called S3. Since we will do most of our assignments using AWS, this will get you started. Later in the course we will come to understand S3 as one critical element of a big data processing architecture known as the “data lake.”
- The tutorial is in an accompanying pdf file called “AWS01.pdf”. The same information is available at the URL:
https://docs.aws.amazon.com/AmazonS3/latest/gsg/AmazonS3Basics.html
- If you already have an AWS account, it is fine to use that one: just log in to it during the tutorial. Otherwise, create a new one.
- If you are able to and prefer to, you can set up an AWS educational account, which comes with free usage credits. Proceed as follows:
- Step 1: Access the AWS Educate website at https://aws.amazon.com/education/awseducate/ and click Apply Now.
- Step 2: Click to Apply for AWS Educate for Students.
- Step 3: Enter the information requested on the AWS Educate Student Application form.
- Step 4: Verify your email address and complete a captcha to verify that you are not a robot.
- Step 5: Click-through to accept AWS Educate Terms and Conditions.
- After the application is submitted:
- You will receive an email indicating that the application was received.
- AWS Educate reviews the application and performs any necessary validation.
- After you are accepted, a welcome message is forwarded to your email address. The message includes a link for the AWS Educate Student portal and an AWS credit code.
- If you need to or prefer to set up a standard AWS account for the first time, follow the instructions in “AWS01.pdf”. DO NOT select a support plan. Support plans are costly and you don’t need one. See this URL for more details: https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/
- An overview of the S3 storage service is included in a section below for your reference.
- To receive credit for this question, provide a screenshot showing the S3 bucket you have created. The bucket should be named something like “YourIITId-CSP554”, entered in lowercase, since S3 bucket names cannot contain uppercase letters.
- When asked to upload an object to the S3 bucket you have created, just use any text file you have handy (even this one).
- Make sure to follow the instructions in the pdf file for deleting your bucket at the end of the assignment so you do not incur additional costs.
Overview of Amazon S3
Amazon S3 is a simple web service that you can use to store and retrieve any amount of data. It is used as part of a big data architecture called a “data lake” that we will discuss later.
Advantages to Amazon S3
Amazon S3 is intentionally built with a minimal feature set that focuses on simplicity and robustness. Following are some of the advantages of the Amazon S3 service:
- Create Buckets – Create and name a bucket that stores data. Buckets are the fundamental container in Amazon S3 for data storage.
- Store data in Buckets – Store an infinite amount of data in a bucket. Upload as many objects as you like into an Amazon S3 bucket. Each object can contain up to 5 TB of data. Each object is stored and retrieved using a unique developer-assigned key. Data stored in S3 is “write once, read many.” This means that once written, data can’t be appended to or updated.
- Download data – Download your data or enable others to do so. Download your data any time you like or allow others to do the same.
- Permissions – Grant or deny access to others who want to upload or download data into your Amazon S3 bucket. Grant upload and download permissions to three types of users. Authentication mechanisms can help keep data secure from unauthorized access.
- Standard interfaces – Use a standards-based REST interface designed to work with any Internet-development toolkit.
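
To make the “write once, read many” point above concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the S3 API: the names ToyBucket, put, get, and append_by_rewrite are hypothetical and exist only for illustration. It shows that, under this model, "appending" to stored data really means reading the whole object and writing a complete replacement under the same key.

```python
class ToyBucket:
    """Toy in-memory stand-in for a bucket (hypothetical, not the S3 API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        # A write always stores a complete object; any earlier object
        # under the same key is replaced as a whole.
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._objects[key]


def append_by_rewrite(bucket: ToyBucket, key: str, extra: bytes) -> None:
    # There is no in-place append in this model: read the whole object,
    # build the new content locally, and write it back as a new object.
    bucket.put(key, bucket.get(key) + extra)


bucket = ToyBucket("youriitid-csp554")
bucket.put("notes.txt", b"first line\n")
append_by_rewrite(bucket, "notes.txt", b"second line\n")
print(bucket.get("notes.txt"))  # b'first line\nsecond line\n'
```

The design point is that the cost of a small change is a full rewrite of the object, which is one reason big data systems tend to write new objects rather than edit existing ones.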
Amazon S3 Concepts
This section describes key concepts and terminology you need to understand to use Amazon S3 effectively. They are presented in the order you will most likely encounter them.
Buckets
A bucket is a container for objects stored in Amazon S3. Every object is contained in a bucket. For example, if the object named photos/puppy.jpg is stored in the johnsmith bucket, then it is addressable using the URL http://johnsmith.s3.amazonaws.com/photos/puppy.jpg
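
The addressing rule in the example above can be sketched in a few lines of Python. The function name object_url is hypothetical, and the sketch assumes the same URL shape as the johnsmith example (bucket name as the host prefix, key as the path); it does not contact AWS.

```python
from urllib.parse import quote


def object_url(bucket: str, key: str) -> str:
    """Build a URL of the form shown above: http://<bucket>.s3.amazonaws.com/<key>."""
    # Percent-encode characters that are unsafe in a URL path, but keep "/"
    # so a key such as "photos/puppy.jpg" stays readable.
    return f"http://{bucket}.s3.amazonaws.com/{quote(key, safe='/')}"


print(object_url("johnsmith", "photos/puppy.jpg"))
# http://johnsmith.s3.amazonaws.com/photos/puppy.jpg
```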
Buckets serve several purposes: they organize the Amazon S3 namespace at the highest level, they identify the account responsible for storage and data transfer charges, they play a role in access control, and they serve as the unit of aggregation for usage reporting.
You can configure buckets so that they are created in a specific region. For more information, see Buckets and Regions. You can also configure a bucket so that every time an object is added to it, Amazon S3 generates a unique version ID and assigns it to the object. For more information, see Versioning.
Objects
Objects are the fundamental entities stored in Amazon S3. Objects consist of object data and metadata. The data portion is opaque to Amazon S3. The metadata is a set of name-value pairs that describe the object. These include some default metadata, such as the date last modified, and standard HTTP metadata, such as Content-Type. You can also specify custom metadata at the time the object is stored.
An object is uniquely identified within a bucket by a key (name) and a version ID. For more information, see Keys and Versioning.
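
As a rough mental model of the description above, an object can be pictured as a small record holding opaque data, name-value metadata, and the key and version ID that identify it. The following Python sketch is illustrative only: the class StoredObject, its fields, and the "owner" metadata entry are assumptions made for this example, not part of any AWS library.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class StoredObject:
    """Illustrative model of an object inside one bucket (hypothetical)."""

    key: str                           # name of the object within its bucket
    data: bytes                        # opaque payload; the store never inspects it
    metadata: Dict[str, str] = field(default_factory=dict)  # name-value pairs
    version_id: Optional[str] = None   # None when versioning is not in use

    def identity(self) -> Tuple[str, Optional[str]]:
        # Within a bucket, the key plus the version ID identifies the object.
        return (self.key, self.version_id)


puppy = StoredObject(
    key="photos/puppy.jpg",
    data=b"...image bytes...",
    metadata={"Content-Type": "image/jpeg", "owner": "johnsmith"},
)
print(puppy.identity())  # ('photos/puppy.jpg', None)
```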
Keys
A key is the unique identifier for an object within a bucket. Every object in a bucket has exactly one key. Because the combination of a bucket, key, and version ID uniquely identifies each object, Amazon S3 can be thought of as a basic data map between “bucket + key + version” and the object itself. Every object in Amazon S3 can be uniquely addressed through the combination of the web service endpoint, bucket name, key, and optionally, a version. For example, in the URL http://doc.s3.amazonaws.com/2006-03-01/AmazonS3.wsdl, “doc” is the name of the bucket and “2006-03-01/AmazonS3.wsdl” is the key.
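
The “data map” idea can be sketched as follows. The helper split_object_url is a hypothetical name, and it assumes URLs shaped like the example above (bucket name in front of .s3.amazonaws.com). The dictionary is only a toy illustration of the “bucket + key + version” mapping, with made-up placeholder content, and says nothing about how S3 is implemented.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import unquote, urlparse

ENDPOINT_SUFFIX = ".s3.amazonaws.com"  # endpoint used in the example URL


def split_object_url(url: str) -> Tuple[str, str]:
    """Split a URL of the form shown above into (bucket, key)."""
    parsed = urlparse(url)
    host = parsed.netloc
    if not host.endswith(ENDPOINT_SUFFIX):
        raise ValueError(f"not a bucket-style S3 URL: {url}")
    bucket = host[: -len(ENDPOINT_SUFFIX)]
    # Drop only the single "/" that separates the host from the key.
    key = unquote(parsed.path[1:])
    return bucket, key


bucket, key = split_object_url("http://doc.s3.amazonaws.com/2006-03-01/AmazonS3.wsdl")
print(bucket)  # doc
print(key)     # 2006-03-01/AmazonS3.wsdl

# Toy "bucket + key + version" map; None stands for "no version ID".
store: Dict[Tuple[str, str, Optional[str]], bytes] = {
    (bucket, key, None): b"placeholder content",
}
print(store[("doc", "2006-03-01/AmazonS3.wsdl", None)])  # b'placeholder content'
```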