cs1951a_install sql
/course/cs1951a/pub/sql/stencil/data/
cs1951a_handin sql
The cs1951a_install sql
script installs all the stencil code you will need to your cs1951a course directory, which is ~/course/cs1951a/
.
/course/cs1951a/pub/sql/stencil/data/
and copy them, or you can run:
cp -r /course/cs1951a/pub/sql/stencil/data ~/course/cs1951a/sql/
Throughout the course, we will be learning techniques that allow us to better
understand and work with Big Data. These tools are extremely powerful and can be
used in a myriad of ways, spanning from collecting user data online to creating
predictive machine learning models. As a result, many professionals have felt the
need to create ethical standards that all data scientists should follow. For example,
the ACM (Association for Computing Machinery), which is the world's largest education and
scientific computing society, recently updated their Code of Ethics to address recent
advances in technology that have allowed computing to spread to all aspects of life.
Read the following articles and respond to the questions in writeup.txt
.
SQLite is installed on all department machines. It can be accessed from the command line using sqlite3
.
Running sqlite3 somedb.db
from your terminal will launch an environment that will allow you to type your SQL queries directly into the terminal. You can exit this environment by pushing Ctrl+D or by typing .exit
and pressing enter.
As a more explicit example, to open a sql environment where you can query the movies.db
database, you can type:
$ sqlite3 movies.db
To execute a SQL statement that you have saved in a solution file, you can run the following command:
$ sqlite3 movies.db < sql_solutions.sql
This part of the assignment builds off the exercises you completed in the lab. If you have not yet completed the first half of the lab, please do so before starting this assignment. The database and schema are described again below, but are the same from the lab.
We have provided a database named people.db
with the name, age, ID, and occupation of some Brown students and alumni.
Here is the schema:
people_main(ID INTEGER, name TEXT, occupation TEXT, age INTEGER) people_likes(ID1 INTEGER, ID2 INTEGER) people_friends(ID1 INTEGER, ID2 INTEGER)
In the people_main table, ID is a unique identifier for a particular student or alumni. name, occupation and age correspond to the person's first name, occupation and age.
In the people_friends table, each (ID1, ID2) pair indicates that the particular person with ID1 is friends with the person with ID2 (and vice versa). The friendship is mutual, and if (ID1, ID2) is in the table, it is guaranteed that (ID2, ID1) exists in the table.
In the people_likes table, each (ID1, ID2) pair indicates that the student or alumni with ID1 likes the person with ID2. The (ID1, ID2) pair in the table does not guarantee that the (ID2, ID1) pair also exists in the table.
Your job is to write SQL queries for the data being requested:
part1_problem1.sql
Hint: Use a LEFT JOIN
… the following website is quite useful: http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
part1_problem2.sql
Hint: The LIMIT
statement will come in handy!
part1_problem3.sql
Hint: You'll need to take a look at the HAVING
function.
part1_problem4.sql
You can do this problem without using two SELECT
statements, but doing so makes sure that all your answers are distinct. Also, future problems will require using two SELECT
statements, so practicing here isn't a bad idea!
part1_problem5.sql
Time to join stuff!
For this part of the assignment, you will be using the TMDB Movie Dataset, which has been exported to the movies.db
database. The database
schema is as follows:
movies(budget INTEGER, homepage TEXT, id INTEGER, original_language TEXT, original_title TEXT, overview TEXT, popularity REAL, release_date TEXT, revenue REAL, runtime INTEGER, status TEXT, tagline TEXT, title TEXT, vote_average REAL, vote_count INTEGER) scores(review TEXT, min_score INTEGER, max_score INTEGER)
We encourage you to use the WITH
operator, which lets you divide your query into separate queries. As an example, we can define a subquery and
use it in another query as follows:
WITH subquery AS ( SELECT original_title, vote_average FROM movies ) SELECT original_title FROM subquery );
You can add days to a particular day
by using the date
function.
For example, in order to add 3 days to to '2012-07-16', you can use date('2012-07-16', '+3 days')
Hint: The UNION
statement should come in handy.
Hint: You may want to look into CASE
statements and the LIKE
operator.
Hint: You may want to look into the EXISTS
operator. Additionally, think about possible edge cases.
scores
table. For example, movies with a vote average between 2 and 3.9 (inclusive) are reviewed as 'poor',
whereas movies with a vote average between 9 and 10 (inclusive) are reviewed as 'excellent'.
If a movie is reviewed as 'awful' or 'poor' then original_title should read 'do not watch'.
Results should be ordered by id (ascending).
For example, the output should have the following format:
'Snow White' | 8.7 | 'great' 'Toy Story' | 9.3 | 'must see' 'do not watch' | 2.3 | 'poor'
Hint: Look into the BETWEEN
statement and how it can be used in a join.
(10 points) Write a SQL query that, for each original_language that has more than 2 movies , finds the number of movies that were reviewed as 'poor' and the number of movies that were reviewed as 'good'.
Like in the 4th question, you will need to look at the scores
table to see what the review categories are and use the vote_average field of a movie to determine which review category it falls under. Your query should return 3 columns (original_language, num_poor which is the number of 'poor' movies for that language and num_good which should be the number of 'good' movies for the language. Your results should be ordered by number of 'good' movies (descending) and then number of 'poor' movies (ascending). Remember to only include languages that have more than 2 movies!
We have provided you with the athletes.db
database, although querying it is not necessary at all. The schema is as follows:
school_athletes(ID INTEGER, name TEXT, school TEXT, performance_score INTEGER, team TEXT)
For the query below, explain why the given query might not be the most efficient way to accomplish the task. Write out an optimized version of the query in writeup.txt
. Explain what steps you took to optimize it and why your version would be more efficient.
(6 points) The SQL query to optimize is as follows:
SELECT ID, name FROM school_athletes AS athletes WHERE school = 'Brown' and performance_score > ( SELECT AVG(performance_score) FROM school_athletes WHERE team = athletes.team );
(4 points) Consider two tables. Table A is very long with 1 billion rows and 5 columns. Table B is very wide with 1000 rows and 10,000 columns. If we were to join the two tables and want to make sure the join is performant, how should we best filter the tables? Assume that we can select from each table and then join the results. Specifically, state the table in which we should use WHERE heavily and the table in which we should be careful about what values we use in our SELECT.
The assignment directory present in your course directory, which is
~/course/cs1951a/sqlwould be the one used to hand in the solution files. Please make sure that this directory contains the following:
cs1951a_handin sql
to turn in your assignment.Please do not hand in the data files!
Made with ♥ by Jens and Esteban, updated for Spring 2020 by Mounika
Movie Database from: https://www.kaggle.com/tmdb/tmdb-movie-metadata