CS513: Theory & Practice of Data Cleaning Final Project


The goal of the group project is to conduct an end-to-end data cleaning project, using the various tools and techniques that we have covered throughout the course. In addition to the main tools that we used in class (i.e., RegEx, OpenRefine, Datalog, SQL, and Python), you are welcome to use other tools. For example, you may want to use any of the research prototypes mentioned (e.g., YesWorkflow or the OpenRefine companion tools such as or2yw), or commercial tools (e.g., Trifacta Wrangler, Tableau, etc.). In your report you will then document how you used them.

Project Phases. The project is organized into two phases, each with its own set of deliverables.

See Coursera for the timeline and details.

Project Phase-I

During this phase your team needs to . . .

  1. Identify a dataset D of interest. You are required to use one of the provided datasets (CD database; NYPL historic menus; PPP loan applications; . . . ).

  2. Describe the dataset D. For example, you can provide a conceptual model (ER diagram) that depicts the entity types and relationship types, or an ontology that illustrates the main classes and their relationships. Or you can provide a database schema that illustrates and explains the structure and contents of the dataset. You should also add a short narrative, i.e., one or more paragraphs in English to describe the origin of the data and any relevant metadata (e.g., a temporal or spatial extent). A dataset about farmers markets, e.g., can be described with a relational schema (e.g., CREATE TABLE statements); the narrative would then explain what the different columns (attributes) mean. Other metadata may describe, e.g., the spatial extent of the data (only Illinois markets? All of the Midwest? Or the US?), and the temporal extent (for which period is the data correct?), etc.

  3. Develop three use cases. By use case we mean a written scenario describing a hypothetical data analysis. (If helpful, you can think of these as queries or questions asked of the dataset; see Additional Information below.)

    a. Target (main) use case: U1 is a use case for D such that data cleaning is necessary and sufficient to support the data analysis. Thus, after performing data cleaning, your cleaned dataset D′ is fit-for-purpose (i.e., for U1).

    b. “Zero data cleaning” use case: U0 should be a use case that requires “zero data cleaning”, i.e., D is “good enough as it is”.

    c. “Never enough” use case: U2 is a use case for which the given dataset D is “never (good) enough”, i.e., no amount of data cleaning or wrangling will make D suitable for U2 (even though at first sight one might think so).

    Note: The purpose of the corner cases U0 (data cleaning is not necessary) and U2 (data cleaning is not sufficient) is to reinforce the concept that data cleaning should be done with a purpose in mind, i.e., a use case such as your main use case U1, where data cleaning really makes a difference.

  4. List obvious data quality problems (i.e., which are easy to spot during Phase-I). In order for your dataset D and main use case U1 to match, data cleaning must be necessary and sufficient to implement U1. You need to support this claim by documenting data quality problems that your inspection of D has revealed and that need to be addressed before U1 can be tackled.

  5. Devise an initial plan that outlines how you intend to clean the dataset in Phase-II. A typical plan for the overall project will include the following steps: S1: description of dataset D and matching use case U1; S2: profiling of D to identify the quality problems P that need to be addressed to support U1; S3: performing the data cleaning process using one or more tools to address the problems P (here you should describe which tools you are planning to use, e.g., OpenRefine, Python, etc.); S4: checking that your new dataset D′ is an improved version of D, e.g., by documenting that certain problems P are now absent and that U1 is now supported; S5: documenting the types and amount of changes that have been executed on D to obtain D′.

You should also include a tentative assignment of tasks to team members (who does what)!
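The dataset-description step above can include CREATE TABLE statements. A minimal sketch, assuming a hypothetical farmers-markets table (table name, columns, and values are invented for illustration, not taken from any provided dataset):

```python
import sqlite3

# Hypothetical relational schema for a farmers-markets dataset.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE market (
        market_id    INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        city         TEXT,
        state        TEXT,     -- two-letter code, e.g. 'IL' (spatial extent)
        season_start TEXT,     -- ISO dates (temporal extent)
        season_end   TEXT
    )
""")
conn.execute("INSERT INTO market VALUES "
             "(1, 'Urbana Market', 'Urbana', 'IL', '2024-05-04', '2024-10-26')")
print(conn.execute("SELECT name, state FROM market").fetchall())
# [('Urbana Market', 'IL')]
```

The accompanying narrative would then explain each column and the spatial/temporal extent encoded by the state and season columns.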

Additional Information

Regarding (3): How do you specify data analysis use cases? Generally speaking, you can simply explain the use case in a short paragraph. You might also want to be more specific and phrase use cases as questions: What is it that we want to know from (or about) the data?

In particular, a use case may be a set of database queries Q1, . . . , Qn against the dataset D (e.g., how many farmers markets offer bakery goods in addition to vegetables and fruits?). On the other hand, use cases may also be more general, e.g., you could state that you’d like to develop a web application that serves a particular purpose.
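Such a query-style use case can be made directly executable. A minimal sketch, assuming a hypothetical farmers-markets table with an invented semicolon-separated products column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE market (name TEXT, products TEXT)")
conn.executemany("INSERT INTO market VALUES (?, ?)", [
    ("Urbana Market", "vegetables;fruits;bakery"),
    ("Springfield Market", "vegetables;fruits"),
])

# U1 phrased as a query: how many markets offer bakery goods
# in addition to vegetables and fruits?
q = """
    SELECT COUNT(*) FROM market
    WHERE products LIKE '%bakery%'
      AND products LIKE '%vegetables%'
      AND products LIKE '%fruits%'
"""
print(conn.execute(q).fetchone()[0])  # 1
```

Being this precise makes the necessary-and-sufficient argument below checkable rather than rhetorical.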

The advantage of specifying a use case U as one or more queries QU is that you can be very precise about when data cleaning is necessary and sufficient for U: if running QU on the original (“dirty”) data D would result in an answer A = QU(D) that is incorrect and/or misleading, then data cleaning is necessary. Conversely, data cleaning is sufficient if the answer A′ = QU(D′) on the cleaned dataset D′ is correct (and not misleading).

In (4) above, how do you document data quality problems? One simple way is to include (copy-pasted) snippets of “dirty data” in your Phase-I report (you can also use screenshots for illustration) and then explain what the problem is in narrative form.
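Beyond copy-pasted snippets, a few lines of profiling code can quantify a problem. A sketch, assuming a toy column with inconsistent state spellings (the values are invented for illustration):

```python
from collections import Counter

# Toy "dirty" column -- the kind of snippet worth documenting in Phase-I.
states = ["IL", "Ill.", "Illinois", "IL", "il", "IL"]

# Normalize lightly (strip trailing periods, lowercase) to expose variants
# that denote the same real-world value.
variants = Counter(s.strip(".").lower() for s in states)
print(variants)  # Counter({'il': 4, 'ill': 1, 'illinois': 1})
```

A count like this turns “the state column is messy” into a concrete, reproducible observation for the report.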

How do you describe your plan in (5)? A short list of your planned steps S1, . . . , S5 will do during Phase-I. In Phase-II, you should also include a workflow diagram for the actual data cleaning steps that you performed (e.g., with YesWorkflow or any other diagramming tool). Of course your Phase-I plan and your actual Phase-II workflow might be different.
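YesWorkflow recovers a workflow diagram from specially formatted comments embedded in a script. A minimal sketch of the annotation style (block and port names here are illustrative, not prescribed):

```python
# @begin clean_dataset @desc End-to-end cleaning workflow
# @in raw_rows
# @out clean_rows

# @begin normalize_state
# @in raw_rows
# @out clean_rows
def normalize_state(rows):
    # Collapse spelling variants such as 'Illinois' to the code 'IL'.
    return [r.replace("Illinois", "IL") for r in rows]
# @end normalize_state

# @end clean_dataset

print(normalize_state(["Urbana,Illinois"]))  # ['Urbana,IL']
```

Running the YesWorkflow tool over such a file extracts the @begin/@in/@out structure and renders it as a Graphviz diagram.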

Project Phase-II

During this phase you will execute the plans you’ve come up with in Phase-I (possibly adjusting course, based on what you find when actually working with the data . . . ).

What to Submit

Phase-I:

  • A single PDF file with your Phase-I report (in narrative form) with all 5 elements described above.

Phase-II:

  • A single PDF file with your Phase-II report. This report should include:

    – A description of the actual data cleaning workflow W that was performed, and a comparison with the original Phase-I plan: e.g., were you able to execute the steps as planned, and if not, what did you have to change and why?

    – A narrative that ties all steps together and explains the motivation (use case U1), the rationale for the design of the overall workflow W, and the tools used.

    – Documentation that data quality was improved, e.g., through running “before queries” QU(D) and “after queries” QU(D′) on D (original) and D′ (cleaned), respectively.

    – A summary of the data changes ∆D resulting from the overall workflow W: D → D′.

    – A summary of findings, problems encountered, and lessons learned, including possible next steps (e.g., how would you implement the main use case U1?).
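The before/after queries and the ∆D summary can be produced by one small script. A sketch, assuming toy original (d) and cleaned (d_clean) tables with a single inconsistent column (all names and values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (city TEXT, state TEXT)")        # original D
conn.execute("CREATE TABLE d_clean (city TEXT, state TEXT)")  # cleaned D'
conn.executemany("INSERT INTO d VALUES (?, ?)",
                 [("Urbana", "IL"), ("Urbana", "Illinois")])
conn.executemany("INSERT INTO d_clean VALUES (?, ?)",
                 [("Urbana", "IL"), ("Urbana", "IL")])

# "Before" and "after" answers for a query Q_U that counts distinct states.
before = conn.execute("SELECT COUNT(DISTINCT state) FROM d").fetchone()[0]
after = conn.execute("SELECT COUNT(DISTINCT state) FROM d_clean").fetchone()[0]

# Cell-level change summary for Delta D (rows compared positionally).
changed = sum(
    1 for (a,), (b,) in zip(conn.execute("SELECT state FROM d"),
                            conn.execute("SELECT state FROM d_clean"))
    if a != b
)
print(before, after, changed)  # 2 1 1
```

Here the before-answer (2 states) is misleading, the after-answer (1 state) is correct, and one cell was changed: exactly the evidence the report asks for.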

  • Supplementary Materials. In addition to the project report, you need to provide the following supplementary materials (as a single ZIP file):

  1. Workflow Model: For the overall workflow model W, when using YesWorkflow, provide the text file that has the YW annotations (e.g., Workflow.yw), and the generated Graphviz (dot) file (e.g., Workflow.gv). For other diagramming tools, provide a source file (e.g., PPTX, . . . ) and a PDF file (Workflow.pdf).

  2. OpenRefine Operation History: If you used OpenRefine, then include a copy of the operation history (copy-paste it into a JSON file named OpenRefineHistory.json). If you also want to visualize the OpenRefine history, you can use the or2yw tool.

  3. Other History: If you are using an alternative tool (instead of, or in addition to, OpenRefine), please provide an analogous file (OtherToolHistory.json) and other provenance information if available for that tool: e.g., include Python (or R) scripts, Jupyter notebook files, etc.

  4. Queries: A copy of the queries written in SQL or Datalog to profile the dataset and check the integrity constraints (copy-paste them into a text file named queries.txt).

  5. Original (“dirty”) and Cleaned Datasets: Please do not provide the datasets in the ZIP file. Rather, upload the raw and cleaned datasets in a Box folder and share the link in a plain text file (DataLinks.txt).
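As an illustration of the kind of integrity-constraint query that belongs in queries.txt, the following sketch checks a hypothetical functional dependency city → state on a toy table (names and data are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE market (city TEXT, state TEXT)")
conn.executemany("INSERT INTO market VALUES (?, ?)",
                 [("Urbana", "IL"), ("Urbana", "Illinois")])

# Integrity constraint: city functionally determines state.
# Violations are cities associated with more than one distinct state value.
violations = conn.execute("""
    SELECT city, COUNT(DISTINCT state)
    FROM market
    GROUP BY city
    HAVING COUNT(DISTINCT state) > 1
""").fetchall()
print(violations)  # [('Urbana', 2)]
```

Run before cleaning, such a query documents the problem; run after cleaning (when it should return no rows), it documents the fix.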
