Due Dates
- Proposals Due: 11:59pm 02/06/2020 (submit via the Google Form)
- Data Due: 11:59pm 03/10/2020
Handin: cs1951a_handin project_data
- Analysis Due: 11:59pm 04/14/2020
Handin: cs1951a_handin project_analysis
- Poster Due: 11:59pm 05/07/2020
Handin: cs1951a_handin project_poster
- Presentations: 05/07/2020-05/11/2020
Learning Goals
The final project is a way for you to explore and gather insights from some dataset(s) of your choosing as an actual data scientist would.
In groups of four, you will extract, process, analyze and create visualizations of your data and present the insights from your work at a symposium at the end of the semester.
Along the way, you will deliver small reports on aspects of your project. Each group will be assigned a mentor TA who will evaluate and guide the team. Project proposals and final presentations will be given to Ellie and the HTAs.
The goal of the project is for each group to properly define a hypothesis or engineering project and explore
a dataset fully.
Project Types
Generally speaking, projects fall into one of two categories: hypothesis-testing and prediction. We provide a general rubric meant to encompass both types of projects, but keep in mind that you will be evaluated on the appropriateness of the tools you use, given your stated goal. E.g. projects which aim to test a hypothesis should use different ML/stats techniques than projects which aim to make predictions. The expectation is that you present a coherent, motivated final product.
Hypothesis-Testing Projects: The goal of the project is to make a generalizable claim about the world. Your hypothesis should be based on some prior observation or informed intuition. The best projects in this category are motivated by existing theories, evidence, or models from other scientific disciplines (e.g. public health, political science, sociology).
Before pitching the project, you should ask yourself: Why is this hypothesis interesting? Who would care about the conclusions of my study? How would these findings potentially affect our understanding of the world at large? Successful projects from last year which fall into this category are below:
Prediction Projects: The goal of the project is to make accurate predictions about future/unseen events. The best projects in this category are motivated by useful commercial or social applications. "If we could predict X, we'd be able to do Y better."
Before pitching the project, you should put yourself in the role of a data science consultant and ask yourself: Who would be hiring me to do this project? Why would the stakeholders care about my ability to do this well? What is the realistic setting in which my predictions will be used, and can I make my evaluation match that setting? Successful projects from last year which fall into this category are below:
- RideShare Analysis tried to predict, for a new city with no bike share program, the best locations to place new bike hubs and how demand would change as a function of, e.g., weather or day.
- Music Recommendation Systems (there were several) tried to generate playlists based on individuals' past music preferences and/or natural language descriptions of their tastes.
Capstone Projects: Groups which are doing capstone can do either a hypothesis-testing or a prediction project, but will be required to have a substantial engineering and UI/interactive component. The project pitch should define a clear system or software product spec. The end project should include something "demoable", e.g. a web application or local application that interacts with users in real time.
Project Requirements
Note that every project will be required to meet a minimal threshold in each of three components: a data component, an ML/stats component, and a visualization component. But not every project will invest equally in all three. Projects which take on more ambitious data scraping/cleaning efforts will be forgiven for having skimpier visualizations. Projects with ambitious visualizations or UIs will be forgiven if their statistics are more basic. But the project still needs to be complete: you have to understand your data and be able to answer intelligently when asked about the limitations of your methods and/or findings.
- Group of 4 students
- Project must have a clear hypothesis or prediction goal, and the group must be able to answer the questions laid out in the section above.
- Each project must create a database, and perform some non-trivial data collection and cleaning in order to do so. You may either scrape your own data or use existing data/APIs. If you use existing data/APIs, you will be required to join at least two datasets to create your database (a minimal merge sketch appears after this list).
- Each project must include a minimum of five analysis components:
- Use at least two machine learning or statistical analysis techniques to analyze your data, explain what you did, and describe the inferences you uncovered.
- Provide at least two distinct visualizations of your data or final results. This means two different techniques: if you use a bar chart to analyze one aspect of your data, you may use bar charts again, but the second use will not count as a distinct visualization.
- The fifth component can be either an additional stats/ML technique or an additional visualization.
- Every project will be hosted on the course website. Each deliverable will be posted
as a link on the page and final posters will be shown there.
- Each group will be paired with another group; the two groups will give each other feedback and questions throughout the semester.
- Ethical Considerations:
- Throughout the final project process, you will be expected to think critically and write about where your data is coming from, how you're analyzing and visualizing your data, and potential positive or negative consequences of your results.
- Capstone Requirements:
- All group members must agree to be held to the capstone standard. Even if not everyone in the group is taking the course as a capstone, the entire project will receive the capstone evaluation and be graded accordingly.
- If you choose to use this course as a capstone, you will extend your project to have a full-fledged
web application with an interactive component. For example, previous capstones have included web
UIs for plotting roadtrips across the United States and restaurant recommendation apps.
- Please see each section below for more details on each deliverable!
- We have included a stencil HTML file you can use for each deliverable here
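For groups using existing data/APIs, the join requirement usually amounts to merging tables on a shared key. Below is a minimal pandas sketch, assuming two hypothetical CSVs (cities.csv and weather.csv) that share a city column; your file names, columns, and join logic will differ.

import pandas as pd

# Hypothetical inputs: two datasets that share a "city" column.
cities = pd.read_csv("cities.csv")    # e.g. city, state, population
weather = pd.read_csv("weather.csv")  # e.g. city, date, avg_temp

# An inner join keeps only rows whose city appears in both datasets;
# use how="left" instead if you want to keep unmatched rows.
combined = pd.merge(cities, weather, on="city", how="inner")

# Persist the joined table so later analysis starts from a single dataset.
combined.to_csv("combined.csv", index=False)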
Proposal
Due: 11:59pm 02/06/2020
Every group of four will present their project proposal to Ellie and the HTAs. Each group should submit the Google Form and then look for their assigned slot (emailed out). The presentation should take 1-2 minutes and summarize the following:
- What is the project goal? If a hypothesis-testing project: what is the hypothesis, where did it come from, and why is it interesting? If a prediction project: what are you trying to predict, who would be the stakeholders/who cares if you do this well, and how do you measure your success?
- Where do you plan to get your data? It's okay if this is not a sure thing, as long as you have an idea of where to start. E.g. "we are going to try to crawl blahblah.com" is fine.
- What are the major technical challenges you anticipate?
- What ethical problems could you foresee arising (either in the course of doing the project or, if you succeed, in the existence of the technology/result itself)?
Your direction may change as the course goes on: this is okay and why we are starting so early. Until the data deliverable, you are allowed to change your goals and discuss your evolving strategies by consulting with your group’s mentor TA. You will be assigned a mentor TA shortly after your proposal is approved.
Data
Due: 11:59pm 03/10/2020
Handin: cs1951a_handin project_data
By the first check-in, you should have collected your data and cleaned it. What this involves will vary across projects. If you are scraping, you should have already written and run your scraper (exceptions apply if your project requires the scraper to run continuously, e.g. to get updated data throughout the semester). If you are using existing datasets, you should have finished any cleaning, joining, or organizing you need to do. In either case, you should have the data in some clean, queryable form such as SQL, Pandas, JSON, or a similar datastore, and be in a position to begin analysis and/or modeling.
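For instance, if your cleaning happens in pandas, a few lines suffice to persist the result as a SQLite database the whole group can query with SQL. A minimal sketch, with placeholder file and table names:

import sqlite3
import pandas as pd

df = pd.read_csv("combined.csv")  # your cleaned dataset (placeholder name)

# Write the cleaned table into a SQLite file so teammates can query it with SQL.
with sqlite3.connect("project.db") as conn:
    df.to_sql("observations", conn, if_exists="replace", index=False)
    # Sanity check: read a few rows back out.
    print(pd.read_sql("SELECT * FROM observations LIMIT 5", conn))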
Concretely, you should have the following on your submission:
- A complete data spec describing your collected data. This can be in the form of a README/simple text file, but it should describe the full data format, including assumptions about data types, assumptions about keys and cross-references, whether fields are "required" or optional, etc. For example, this is a good README for data in a CSV format (a description of the corresponding data is here), and this repo has a good example of a README for data in JSON format.
- A link to your full data in downloadable form. Any distribution method is okay; pick what makes sense for your project. E.g. Google Drive, Dropbox, GitHub, or a link from a personal website are all fine.
- A sample of your data (e.g. 10 - 100 rows) that we can easily open and view on our computers.
- A concise tech report in html format, answering the following questions:
- Where is the data from?
- How did you collect your data?
- Is the source reputable?
- How did you generate the sample? Is it comparably small or large? Is it representative or is it likely to exhibit some kind of sampling bias?
- Are there any other considerations you took into account when collecting your data? This is open-ended based on your data; feel free to leave this blank. (Example: If it's user data, is it public/are they consenting to have their data used? Is the data potentially skewed in any direction?)
- How clean is the data? Does this data contain what you need in order to complete the project you proposed to do? (Each team will have to go about answering this question differently, but use the following questions as a guide; a sketch of some quick pandas checks appears after this list. Graphs and tables are highly encouraged if they allow you to answer these questions more succinctly.)
- How many data points are there total? How many are there in each group you care about (e.g. if you are dividing your data into positive/negative examples, are they split evenly)? Do you think this is enough data to do what you hope to do?
- Are there missing values? Do these occur in fields that are important for your project's goals?
- Are there duplicates? Do these occur in fields that are important for your project's goals?
- How is the data distributed? Is it uniform or skewed? Are there outliers? What are the min/max values? (focus on the fields that are most relevant to your project goals)
- Are there any data type issues (e.g. words in fields that were supposed to be numeric)? Where are these coming from? (E.g. a bug in your scraper? User input?) How will you fix them?
- Do you need to throw any data away? What data? Why? Any reason this might affect the analyses you are able to run or the conclusions you are able to draw?
- Summarize any challenges or observations you have made since collecting your data.
Then, discuss your next steps and how your data collection has impacted the
type of analysis you will perform. (approximately 3-5 sentences)
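A few pandas one-liners will answer most of the count/missingness/duplicate questions above. A minimal sketch, with placeholder file and column names:

import pandas as pd

df = pd.read_csv("combined.csv")      # placeholder dataset

print(len(df))                        # total number of data points
print(df["label"].value_counts())     # size of each group you care about (placeholder column)
print(df.isna().sum())                # missing values per column
print(df.duplicated().sum())          # number of fully duplicated rows
print(df.describe())                  # min/max/mean/etc. for numeric columns

# Type issues: coerce a supposedly numeric column and inspect rows that fail.
bad = df[pd.to_numeric(df["price"], errors="coerce").isna() & df["price"].notna()]
print(bad.head())                     # e.g. words in a field that should be numeric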
Handin:
You should make a directory: ~/course/cs1951a/project_data
This directory should contain a file data.html with your answers to the above questions, presenting your data. Any other relevant graphs, data, or needed materials should be included in the directory.
After you have handed in your data.html, confirm the handin worked by checking that your page appears on the course website.
After you have handed in your data.html, your mentor TA will reach out to schedule a 15-minute check-in to review your submission. They will also tell you which partner group you will be providing feedback to over the semester. Please send at least 2 pieces of feedback or questions to your partner group within 5 days of your meeting. Be sure to CC your mentor TA.
Analysis
Due: 11:59pm 04/14/2020
Handin: cs1951a_handin project_analysis
The second check-in will focus on your analysis. At this point, each group should have used statistics or a machine learning technique to answer one of their primary questions, and you are expected to have visualized this result. The visualization may be a table, graph, or some other means of conveying the result of the question. We do not expect you to provide a complete solution for any prediction or hypothesis goal with a single ML or statistical model. We want to see you document all discussion, iteration, and evaluation as you investigate your question.
Concretely you should have the following on your project page:
- A defined hypothesis or prediction task, with clearly stated metrics for success.
- Why did you use this statistical test or ML algorithm? Which other tests did you consider or
evaluate? How did you measure success or failure? Why that metric/value? What challenges did you
face evaluating the model? Did you have to clean or restructure your data?
- What is your interpretation of the results? Do you accept or reject the hypothesis, or are you satisfied with your prediction accuracy? For prediction projects, we expect you to argue why you got the accuracy/success metric you have. Intuitively, how do you react to the results? Are you confident in the results?
- For your visualization, why did you pick this graph? What alternative ways might you communicate the result? Were there any challenges visualizing the results? If so, what were they? Will your visualization require text to provide context, or is it standalone (either is fine, but it's important to recognize which type your visualization is)?
- Full results + graphs (at least 1 stats/ml test and at least 1 visualization; a minimal example of this pairing is sketched after this list). Depending on your model/test/project, we would ideally like you to show us your full process so we can evaluate how you conducted the test!
- If you did a statistics test, are there any confounding trends or variables you might be observing?
- If you did a machine learning model, why did you choose this machine learning technique? Does your data have any sensitive/protected attributes that could affect your machine learning model?
- A discussion of how you will visualize/explain your results on a poster, and a discussion of future directions.
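To make the minimum concrete, here is a sketch of one stats test paired with one visualization, assuming a hypothesis that a numeric outcome differs between two groups. All file, column, and group names are placeholders; the test you run should match your actual hypothesis.

import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("combined.csv")  # placeholder dataset

# Hypothetical hypothesis: the outcome differs between groups A and B.
a = df.loc[df["group"] == "A", "outcome"].dropna()
b = df.loc[df["group"] == "B", "outcome"].dropna()

# Welch's t-test (does not assume equal variances); compare the p-value
# against a significance level chosen *before* running the test.
t_stat, p_val = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_val:.4f}")

# One visualization of the same comparison: overlapping histograms.
plt.hist(a, bins=30, alpha=0.5, label="Group A")
plt.hist(b, bins=30, alpha=0.5, label="Group B")
plt.xlabel("outcome")
plt.ylabel("count")
plt.legend()
plt.savefig("group_comparison.png", dpi=300)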
You should have a directory: ~/course/cs1951a/project_analysis
This directory should contain a file analysis.html with your answers to the above questions, presenting your data. Any other relevant graphs, data, or needed materials should be included in the directory.
After you have handed in your analysis.html, confirm the handin worked by checking that your page appears on the course website.
Unlike previous deliverables, we will not be requiring peer feedback given the challenges presented by the pandemic.
Midterm Feedback
No deliverable or handin
After you hand in your analysis document, the HTAs + Ellie will send out sign-ups for providing feedback. Each group will be assigned a short time slot for a Zoom call. Not all members must be present, especially given the challenges of potentially coordinating across timezones. More details will come when we release sign-ups.
Final Presentations
Poster session: Cancelled :(
Presentations: 05/08/2020-05/11/2020, times TBA
Poster Due: 11:59pm 05/07/2020
Handin: cs1951a_handin project_poster
Overview
You are required to complete:
- PDF Poster (see details below)
- One Page Abstract (see details below)
- Interactive code (only if capstone)
Additional Relevant Rubrics/Discussion/Notes on Grading:
Presentation Rubric
Final Project Expectations
General Project Expectations
Poster Requirements
- 42 inches x 31.5 inches
- PDF format
- at least 300 dpi
Your poster presentation is a chance for you to share your findings and accomplishments with your peers!
When preparing a poster, it is important to remember that although your group has spent a lot of time and effort becoming familiar with the domain knowledge necessary to understand your results, many others in the class do not necessarily share this background.
It is important to design your posters to communicate your results in a clear, concise way.
Making a good academic poster that quickly and effectively delivers the key points about your project takes careful planning.
Here are some links about making a good academic poster that we found helpful:
- This slideshow from Cornell
- This link from NYU
- This comprehensive list of do's and don'ts
- This poster from a Brown CS research group
Your poster should contain:
- A name for your project
- The names of your group members
- Dataset information and collection details
- Problem statement / hypothesis
- Methodology
- Results & visualization
- Potential significance/ramifications for the field your data comes from, and/or relevant limitations of the analysis
It is important that your presentation tells a good story and focuses on the most interesting aspects of your project. We are evaluating how well you are able to articulate clear claims (positive or negative) and back up those claims with sound data scientific analysis.
Abstract Requirements
We want every group to write a summary of their work which could be sent out as a one-page standalone document.
We provide two example formats below and encourage you to follow these templates exactly.
The goals here are:
- Speak to a 3rd party audience who has not seen your project yet
- Motivate your project goal
- Clearly define claims or questions
- Present the answers you found for those claims/questions
This document can be written in LaTeX, Word, or any other medium. It should be titled abstract.pdf.
Here are examples we have for hypothesis and prediction projects.
- Hypothesis Example: here
- Prediction Example: here
- Hypothesis Stencil: here
- Prediction Stencil: here
Capstone Requirements
If you are completing the capstone requirements, you should hand in the code for your interactive tool with your poster.
To reiterate, your interactive element can be anything that takes user input and updates/adjusts in response. It can be a command-line tool, web app, D3 visualization, or any other interface.
You will be expected to demo this element as part of your presentation.
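The bar for "interactive" is lower than it may sound: even a command-line loop that re-queries your data on each user input qualifies. A minimal sketch of that shape, with placeholder file and column names (a web app or D3 interface would replace the input/print loop with routes or event handlers):

import pandas as pd

df = pd.read_csv("combined.csv")  # placeholder dataset with a "city" column

# Minimal interactive loop: take user input, update the output accordingly.
while True:
    query = input("Enter a city (or 'quit'): ").strip()
    if query.lower() == "quit":
        break
    matches = df[df["city"].str.contains(query, case=False, regex=False, na=False)]
    print(matches.head(10) if not matches.empty else "No matches found.")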
Handin
You should have a directory:
~/course/cs1951a/project_poster
This directory should have the following files:
- poster.pdf
- abstract.pdf
- Any other files you think are necessary to turn in
After you have handed in your poster.pdf, confirm the handin worked by checking that your poster appears on the course website.
Presentation
The HTAs will make a Piazza post to schedule your poster presentation. You will be presenting your poster for 5-10 minutes. A good poster pitch takes only 4-5 minutes and should use the poster to underline your main points; do not make the judge just read the poster. A good poster presentation tells a story and engages the listener. Focus on motivating your project, then define your hypothesis or prediction, explain how you arrived at your metric of success or test, and present your conclusion. Throughout, you should note any outliers, controls, or interesting aspects that contributed to your final project.
Be prepared to answer questions from the HTAs + Ellie regarding your results, process, and accomplishments.
Below is the rubric that will be used to evaluate your project:
60 | Big problems, e.g. missing large portions of the deliverable |
70 | The team's analysis suggests misunderstanding of the dataset and/or problem, usually in one of the following ways: 1) the methods chosen do not make sense given the stated goal of the problem (e.g. proposed to test a hypothesis but then focus effort on training a classifier) 2) the techniques used are significantly misused (e.g. testing on training data, interpreting r^2 as evidence for/against a hypothesis) 3) presentation of the results raises many questions that go unanswered (see examples listed in description of 75) 4) discussion of results/descriptions of methods are flatly incorrect 5) other errors that similarly indicate a lack of understanding. |
75 | The team is familiar with the data and/or problem, but hasn't demonstrated a clear understanding of both in conjunction (e.g. how dataset properties/processing decisions relate to the problem). The team knows how to apply applicable techniques from class to test hypotheses/demonstrate model performance, but some discussion or presentation of the results raises many questions that go unanswered. E.g. charts show gaps/skews/outliers that are not discussed or addressed, metrics take extreme values that are not explained, results behave unintuitively (test acc > train acc) with no comment made or explanation provided, p-vals/coefficients are misinterpreted, etc. Discussion overall is weak. |
80 | The team is familiar with the data and/or problem, but hasn't demonstrated a clear understanding of both in conjunction (e.g. how dataset properties/processing decisions relate to the problem). The team knows how to apply applicable techniques from class to test hypothesis/demonstrate model performance, but rationale isn't entirely clear or convincing. Results are not clearly discussed in relation to the overall goal of the project, and/or discussion feels "out of the box". Team has a plan for where to go next. |
85 | The team is familiar with the data and the problem, and knows how to apply applicable techniques from class to test hypotheses/demonstrate model performance, with rationale as to why those techniques are chosen. Results are related to the overall goal of the project, but the discussion feels "out of the box" and/or lacks depth. Team has a good plan for where to go next. |
90 | The team understands the data and the problem, and knows how to run the first most logical test needed to test hypothesis/demonstrate model performance. Results are interpreted in relation to the overall goal of the project, but maybe not as deeply as would be ideal (e.g. some outliers/data quirks are not commented on). Interpretation is precise and scientific, claims are evidence-based or hedged appropriately. Team has a good, informed plan for where to go next. |
95 | Clear demonstration that the team understands the data and the problem, and knows how to run not just the first most logical test, but also a logical follow-up (including motivation for follow-up analysis, refinement of the question when needed). Obvious weirdness in the data (e.g. outliers, skews) is noted and explained. Results are interpreted "one level up" from the literal output of the test, and discussed in relation to the overall goal of the project. Interpretation is precise and scientific; claims are evidence-based or hedged appropriately. Good plan for next steps. |
Rubric
Below is an approximate breakdown of the weight of the project deliverable components. However, your project will be graded holistically. If your final product is A-level, you can receive an A on the project overall, even if earlier components were weak. (The point of early feedback is to learn and grow, after all.) Focus on doing good work, and you will be fine.
Points | Component | Grader
5 | Project Proposal | Ellie + HTAs
10 | Data Deliverable | Ellie + HTAs
20 | Analysis Deliverable | Ellie + HTAs
50 | Poster Presentation | Ellie + HTAs
15 | Mentor TA Evaluation | Mentor TA