Due Dates


Learning Goals

The final project is a way for you to explore and gather insights from some dataset(s) of your choosing as an actual data scientist would. In groups of four, you will extract, process, analyze and create visualizations of your data and present the insights from your work at a symposium at the end of the semester. Along the way, you will deliver small reports on aspects of your project. Each group will be assigned a mentor TA who will evaluate and guide each team. Group proposals and final presentations will be given to Ellie and the HTAs.

The goal of the project is for each group to properly define a hypothesis or engineering project and explore a dataset fully.


Project Types

Generally speaking, projects fall into one of two categories: hypothesis-testing and prediction. We provide a general rubric meant to encompass both types of projects, but keep in mind that you will be evaluated on the appropriateness of the tools you use, given your stated goal. For example, projects that aim to test a hypothesis should use different ML/stats techniques than projects that aim to make predictions. The expectation is that you present a coherent, motivated final product.

Hypothesis-Testing Projects: The goal of the project is to make a generalizable claim about the world. Your hypothesis should be based on some prior observation or informed intuition. The best projects in this category are motivated by existing theories, evidence, or models from other scientific disciplines (e.g. public health, political science, sociology). Before pitching the project, you should ask yourself: Why is this hypothesis interesting? Who would care about the conclusions of my study? How would these findings potentially affect our understanding of the world at large? Successful projects from last year which fall into this category are below:

Prediction Projects: The goal of the project is to make accurate predictions about future/unseen events. The best projects in this category are motivated by useful commercial or social applications. "If we could predict X, we'd be able to do Y better." Before pitching the project, you should put yourself in the role of a data science consultant and ask yourself: Who would be hiring me to do this project? Why would the stakeholders care about my ability to do this well? What is the realistic setting in which my predictions will be used, and can I make my evaluation match that setting? Successful projects from last year which fall into this category are below:

Capstone Projects: Groups which are doing capstone can do either a hypothesis-testing or a prediction project, but will be required to have a substantial engineering and UI/interactive component. The project pitch should define a clear system or software product spec. The end project should include something "demoable", e.g. a web application or local application that interacts with users in real time.


Project Requirements

Note that every project will be required to meet a minimal threshold in each of three components: a data component, an ML/stats component, and a visualization component. But not every project will invest equally in all three sections. Projects which take on more ambitious data scraping/cleaning efforts will be forgiven for having skimpier visualizations. Projects with ambitious visualizations or UIs will be forgiven if their statistics are more basic. But the project still needs to be complete--you have to understand your data, and be able to answer intelligently when asked about the limitations of your methods and/or findings.


Proposal

Due: 11:59pm 02/06/2020
Handin: Form

Every group of four will present their project proposal to Ellie and the HTAs. Each group should submit the Google form and then look for their assigned slot (emailed out). The presentation should take 1-2 minutes and summarize the following:

Your direction may change as the course goes on: this is okay, and it is why we are starting so early. Until the data deliverable, you are allowed to change your goals and discuss your evolving strategies by consulting with your group’s mentor TA. You will be assigned a mentor TA shortly after your proposal is approved.

Data

Due: 11:59pm 03/10/2020
Handin: cs1951a_handin project_data

By the first check-in, you should have collected your data and cleaned it. What this involves will vary from project to project. If you are scraping, you should have already written and run your scraper (exceptions apply if your project requires the scraper to run continuously, e.g. to get updated data throughout the semester). If you are using existing datasets, you should have finished any cleaning, joining, or organizing you need to do. In either case, you should have the data in an easily accessible form such as a SQL database, a Pandas DataFrame, a JSON file, or a similar datastore, and be in a position to begin analysis and/or modeling.
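
As a rough illustration of what "cleaned and in a clean access method" might look like, here is a minimal sketch using only Python's standard library. The records, field names, table name, and the helper functions clean_row and load_clean_rows are all hypothetical placeholders with made-up values; your own cleaning rules and storage choice (SQL, Pandas, JSON, etc.) will depend on your data.

```python
import sqlite3

# Hypothetical raw records with made-up values, standing in for
# scraped or downloaded data.
RAW_ROWS = [
    {"city": " Providence ", "year": "2019", "population": "1,200"},
    {"city": "providence", "year": "2019", "population": "1,200"},  # duplicate once cleaned
    {"city": "Boston", "year": "2019", "population": ""},  # missing value
    {"city": "Boston", "year": "2018", "population": "3,400"},
]


def clean_row(row):
    """Normalize one raw record; return None if it cannot be used."""
    city = row["city"].strip().title()
    population = row["population"].replace(",", "").strip()
    if not city or not population.isdigit():
        return None
    return (city, int(row["year"]), int(population))


def load_clean_rows(raw_rows, conn):
    """Clean the rows, drop duplicates, and store them in a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS city_population ("
        "city TEXT NOT NULL, year INTEGER NOT NULL, "
        "population INTEGER NOT NULL, PRIMARY KEY (city, year))"
    )
    # A set removes exact duplicates; unusable rows come back as None.
    cleaned = {clean_row(r) for r in raw_rows} - {None}
    conn.executemany(
        "INSERT OR REPLACE INTO city_population VALUES (?, ?, ?)",
        sorted(cleaned),
    )
    conn.commit()
    return len(cleaned)


conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the data
n_rows = load_clean_rows(RAW_ROWS, conn)
rows = conn.execute(
    "SELECT city, year, population FROM city_population ORDER BY city, year"
).fetchall()
print(n_rows, rows)
```

Once the data lives in a table like this, analysis can start from a plain SQL query rather than from the raw files, which also makes it easier to report how many records were dropped during cleaning and why.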

Concretely, you should have the following in your submission: