Assignment #2 « Data Scraping and Cleaning »

Brief

Due date: February 13, 2019 at 11:59pm
Stencil: cs1951a_install scraping
Handin: cs1951a_handin scraping
Files to submit: assignment.py, written_questions.txt, and all SQL queries

Overview

Assignment

First, you'll need to collect some stock data. We'll make use of investing.com to collect information on the most active stocks in the market, collecting this data through web scraping. We'll supplement this with historical data about these stocks gathered through API requests.

You'll then be responsible for cleaning the data, creating a database from it, and analyzing stocks by querying your database.

If you are developing from a department machine, you will need to use the course virtual environment to import the requests and BeautifulSoup (bs4) libraries. The venv can be activated by typing source /course/cs1951a/venv/bin/activate. To deactivate this virtual environment, simply type deactivate.

If you are working remotely, your final handin should still work in the course virtual environment.

Part 1: Web Scraping

35 points

To get started, we’re going to want to collect some data on the most active stocks in the market. Conveniently, investing.com publishes this exact data. To collect this data, you’ll make use of web scraping.

For purposes of this assignment, we've made a copy of this page to keep the data static. Note, some of the data in our static copy is intentionally modified from real stock data to ensure you've cleaned your data and handled edge cases. As such, you will scrape from this URL: https://cs.brown.edu/courses/csci1951-a/resources/stocks_scraping_2020.html

Before scraping, you'll need your code to access this webpage. You should make use of the requests library to make an HTTP request and collect the HTML. If you're not familiar with the requests library, you can read about it here.
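
As a minimal sketch of that request, assuming the requests library is installed: the helper name fetch_html and the 10-second timeout below are illustrative choices, not part of the stencil.

```python
import requests

# Static copy of the page provided for this assignment.
URL = "https://cs.brown.edu/courses/csci1951-a/resources/stocks_scraping_2020.html"


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as a string (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html(URL) should give you a string you can hand to your scraping code.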

Once you have accessed the HTML and assigned it to some variable, you'll want to scrape it, collecting the following for each stock in the table.

company name
stock symbol
price
change percentage
volume
HQ state

You'll use Beautiful Soup, a Python package, to scrape the HTML. This will require looking at the HTML structure of the investing.com page. You can select various HTML elements on a page by tag name, class name, and/or id. Using inspect element on your web browser, you can check what HTML tags and classes contain the relevant information.
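
To illustrate selecting by tag name, class name, and id, here is a small sketch on a toy table. The id "drinks" and the class names "row", "name", and "price" are made up for this example; the real page will use different ones, which you can find with inspect element.

```python
from bs4 import BeautifulSoup

# Toy HTML; the id and class names here are invented for illustration
# and will not match the real page.
html = """
<table id="drinks">
  <tr class="row"><td class="name">Coffee</td><td class="price">3.50</td></tr>
  <tr class="row"><td class="name">Tea</td><td class="price">2.25</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Select the table by id, then its rows and cells by tag name and class name.
table = soup.find("table", id="drinks")
rows = []
for tr in table.find_all("tr", class_="row"):
    name = tr.find("td", class_="name").get_text(strip=True)
    price = tr.find("td", class_="price").get_text(strip=True)
    rows.append((name, price))

print(rows)
```

Note that every extracted value is still a string at this point.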

Note: You should collect information from the 50 most active stocks in the investing.com table. This is what the investing.com HTML will contain by default.

Hint: All extracted information will be strings. You’ll want to make sure the price and percent change are floats (e.g., "24.5%" should become the float 24.5) and that volume is an integer. You should also parse the percent change to be a decimal, rather than a percentage.
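
One possible shape for this cleaning step is sketched below. The function names are hypothetical, the K/M/B volume suffixes are an assumption about how volumes might be abbreviated, and "decimal" is assumed to mean the percentage divided by 100; check the page and the later parts of the assignment before relying on any of these.

```python
def parse_price(text):
    """'1,234.50' -> 1234.5 (removes thousands separators)."""
    return float(text.replace(",", "").strip())


def parse_percent(text):
    """'+24.5%' -> 0.245 (assumes "decimal" means the percentage / 100)."""
    return float(text.replace("%", "").replace(",", "").strip()) / 100


# Assumed volume suffixes; check the page to see which ones actually occur.
SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_volume(text):
    """'1.5M' -> 1500000, '12,345' -> 12345."""
    text = text.replace(",", "").strip()
    if text and text[-1].upper() in SUFFIXES:
        # round() guards against floating-point error before converting to int.
        return round(float(text[:-1]) * SUFFIXES[text[-1].upper()])
    return int(text)
```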

Another Hint: You will probably want to look ahead to the queries you will ultimately ask in Part 4--this will affect the type of cleaning you need to do.

A web scraping example

Consider the following simple HTML page with an unordered list:

<html>
    <body>
        <h1>Welcome to My Website</h1>
        <ul>
            <li>Coffee</li>
            <li>Tea</li>
            <li>Coke</li>
        </ul>
    </body>
</html>

Imagine we want to get the items in the list. The ul tag indicates an unordered list. We’ll then want to get each list item (list items are in li tags). Specifically, we’ll want to extract the text inside each list item. To do this, we’ll use the following code, where page.text is the HTML of the page.

from bs4 import BeautifulSoup

soup = BeautifulSoup(page.text, 'html.parser')
items = soup.find("ul").find_all("li")

You’ll notice that items is a list of three items, since there are three list items in the unordered list. You’ll also see that items[0].text will give you the text of the first list item!

Part 2: API Requests

30 points

Rather than using web scraping to collect this data, we’ll make use of an API. You’ll make requests to this API using Python’s requests library. IEX Trading offers an API with various endpoints that offer information about stocks. Click here for a walkthrough on creating a free account.

Note: You have a limited number of API calls you can make with a free account. You'll start receiving 402 errors if you reach the limit. If this happens, use a different email address to make a new (free) account. A sneaky trick is to add a suffix to your email address using a "+": for example, if your email is ellie_pavlick@brown.edu, you can sign up with ellie_pavlick+1@brown.edu, ellie_pavlick+2@brown.edu, and so on. This means you don't have to create any new email accounts; you can reuse the one you already have to make as many new free accounts as you need. You probably won't reach the limit, but to check if you're close, you can click the "Message Use" tab on the left side of the API console site. Free accounts give you 50,000 messages, and 1 API call usually costs more than 1 message.

We’re going to want to collect two pieces of information for each stock in investing.com's most active stock table:

the average closing price of each of the most active stocks over the last 5 days
the number of articles recently written about each stock

To do this, you’ll want to make use of the chart endpoint to collect the historical stock pricing. Then, you will want to parse through the data and average the closing price for each day. IMPORTANT: Set the parameter "chartCloseOnly" to True when you request from the chart endpoint to avoid immediately reaching your API call limit! (Read about URL parameters here.)
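
The sketch below shows how URL parameters are attached to a request and how the closing prices could be averaged. BASE_URL and the token are placeholders, not the real endpoint; writing the parameter value as the string "true" and the "close" key in each per-day record are assumptions to confirm against the API documentation.

```python
from urllib.parse import urlencode

# Placeholder values: substitute the real endpoint URL and your own token.
BASE_URL = "https://example.com/chart"
# The value is written as a string here; confirm the expected form in the API docs.
params = {"chartCloseOnly": "true", "token": "YOUR_TOKEN"}

# URL parameters are appended after "?" as key=value pairs joined by "&".
# Passing params=params to requests.get(BASE_URL, ...) does this for you.
full_url = BASE_URL + "?" + urlencode(params)


def average_close(days):
    """Average the "close" field over a list of per-day dicts.

    Assumes each element looks like {"date": ..., "close": 12.3}.
    Returns None when there is nothing to average.
    """
    closes = [day["close"] for day in days if day.get("close") is not None]
    if not closes:
        return None
    return sum(closes) / len(closes)


sample = [{"close": 10.0}, {"close": 11.0}, {"close": 12.0}]
print(full_url)
print(average_close(sample))
```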

Using the news endpoint, you should get the articles for a specific stock. Then, you should count how many articles were returned by the API. We will define "recently" as since February 1, 2020 00:00:00. The news API includes the Unix epoch time for each article in milliseconds. You should calculate the epoch time of February 1, 2020 00:00:00 and convert it to milliseconds (multiply by 1000). Feel free to hardcode this value and to use a website to calculate it rather than calculating it by hand.
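
If you would rather compute the cutoff in code, a sketch follows. It assumes the cutoff is meant in UTC (the handout does not name a time zone) and that each article is a dict with a "datetime" key holding epoch milliseconds; both are assumptions to verify, and count_recent is a hypothetical helper name.

```python
from datetime import datetime, timezone

# Assumption: the cutoff is interpreted in UTC; change tzinfo if a different
# time zone is intended.
cutoff = datetime(2020, 2, 1, 0, 0, 0, tzinfo=timezone.utc)
# Epoch seconds, converted to milliseconds.
CUTOFF_MS = int(cutoff.timestamp()) * 1000


def count_recent(articles, cutoff_ms=CUTOFF_MS):
    """Count articles whose millisecond timestamp is at or after the cutoff.

    Assumes each article is a dict with a "datetime" key in epoch milliseconds.
    """
    return sum(1 for a in articles if a.get("datetime", 0) >= cutoff_ms)
```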

Hint: Some stocks from investing.com are not listed on major stock exchanges, and thus the IEX Trading API does not have data on them. In this case, the IEX Trading API will return a 404 status code. Your program should handle this error by disregarding stocks from investing.com if they are not present in the IEX Trading API. That is, these stocks should not be added to the database. You can check the status code of a request by checking requests.get(...).status_code.
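
One way to structure the 404 check is sketched here. The helper name data_or_none is hypothetical, and the stand-in objects only mimic the status_code attribute and json() method of a requests response so the logic can be tried without network access.

```python
from types import SimpleNamespace


def data_or_none(response):
    """Return parsed JSON for a 200 response, or None if the stock is unknown (404)."""
    if response.status_code == 404:
        return None
    # Surface any other failure (for example 402) instead of hiding it.
    if response.status_code != 200:
        raise RuntimeError(f"unexpected status {response.status_code}")
    return response.json()


# Stand-ins with the same attributes as a requests response object.
found = SimpleNamespace(status_code=200, json=lambda: [{"close": 10.0}])
missing = SimpleNamespace(status_code=404, json=lambda: None)

kept = {}
for symbol, response in [("AAA", found), ("ZZZ", missing)]:
    data = data_or_none(response)
    if data is None:
        continue  # skip stocks the API does not know about
    kept[symbol] = data

print(sorted(kept))
```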

Part 3: Databases

15 points

You now realize that to truly harness the data, you need to turn it into a database you can query. Using the provided stencil, create a database with these tables:

companies
- symbol, a string of the stock symbol that is the primary key of this table
- name, a string of the company name
- location, a string of the company's HQ location
quotes
- symbol, a string of the stock symbol that is the primary key of this table
- price, the current stock price, a number
- avg_price, the average closing price over the last five days, a number
- num_articles, the number of recent articles about this stock
- volume, the volume of this stock as a number
- change_pct, the percent change in the stock’s price today, as a decimal

Working with databases in Python

To create a connection to the database, and a cursor, we include the following lines in the stencil:

# Create connection to database
conn = sqlite3.connect('data.db')
c = conn.cursor()

We also prepare the database for you by clearing out relevant tables if they already exist. This allows you to run your code multiple times and replace your old version of data.

# Delete tables if they exist
c.execute('DROP TABLE IF EXISTS "companies";')
c.execute('DROP TABLE IF EXISTS "quotes";')

To create a database table, you'd do something like this:

c.execute('CREATE TABLE person(person_id int not null, name text)')
conn.commit()

To insert a row into a table, you'd do something like this:

c.execute('INSERT INTO person VALUES (?, ?)', (some_variable, another_variable))
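
Putting these pieces together on the toy person table, a self-contained sketch might look like the following. It uses an in-memory database so it never touches data.db, and the sample rows are invented.

```python
import sqlite3

# An in-memory database keeps this example from touching data.db.
conn = sqlite3.connect(":memory:")
c = conn.cursor()

c.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# "?" placeholders let sqlite3 handle quoting of the values.
people = [(1, "Ada"), (2, "Grace")]
c.executemany("INSERT INTO person VALUES (?, ?)", people)
conn.commit()

rows = c.execute("SELECT person_id, name FROM person ORDER BY person_id").fetchall()
print(rows)
conn.close()
```

Using placeholders rather than string formatting also keeps company names containing quotes or apostrophes from breaking the INSERT statement.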

Part 4: Queries

20 points

Each SQL statement should be stored in its own file: query1.sql, query2.sql, etc.

Write a SQL statement to return the symbol and name of the stock with the biggest percent gain relative to its five day average price. This should be calculated as the current price divided by the average price.
Write a SQL statement to return the name of the stock with the highest price that has fewer than 5 articles.
Write a SQL statement to return the symbol and name of all stocks with prices above $35 and where the absolute difference between the current price and 5 day average price is less than $1. Your results should be sorted by the absolute difference between current price and 5 day average price, in ascending order.
Write a SQL statement to return each state and number of companies headquartered in that state (no need to include states not in your dataset). Order alphabetically by state (A -> Z).
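
Since each statement lives in its own .sql file, it can be handy to run a file against a database from Python to check the result. The sketch below does this with a toy table and a toy query file that are unrelated to the assignment's tables; run_query_file is a hypothetical helper, not part of the stencil.

```python
import os
import sqlite3
import tempfile


def run_query_file(conn, path):
    """Execute the single SQL statement stored in `path` and return all rows."""
    with open(path) as f:
        sql = f.read()
    return conn.execute(sql).fetchall()


# Toy database and query file, unrelated to the assignment's tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?)",
    [("Ada", "London"), ("Alan", "London"), ("Grace", "Arlington")],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_query.sql")
    with open(path, "w") as f:
        f.write("SELECT name FROM person WHERE city = 'London' ORDER BY name;")
    result = run_query_file(conn, path)

print(result)
```

For your own queries, you would connect to data.db instead and pass query1.sql, query2.sql, and so on.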

Part 5: Written Questions

10 points

Read these two articles and use them to answer the following questions in written_questions.txt

What do you think could be seen as a positive use of web scraping? What do you think could be seen as a negative use of web scraping? (2-4 sentences)
What was hiQ’s argument on why they should be allowed to scrape LinkedIn? What was LinkedIn’s argument on why hiQ should not be able to scrape LinkedIn? (1-2 sentences)
In September 2019, a U.S. court ruled in favor of hiQ, stating that "Scraping data from a website likely doesn’t violate anti-hacking laws as long as the data is public." Essentially, "the Ninth Circuit’s ruling would appear to affirm that it is us that owns our data. Any platforms we share that data with are merely licensed to use it, they don’t own it outright."
Do you agree or disagree with this decision? Why or why not? (2-4 sentences)
In the case of public social media accounts like LinkedIn, Twitter, or Yik Yak, do you think it’s ethical to scrape public user posts/data and use it for either data science research, for-profit, or law enforcement purposes? Why or why not? Does it matter what purpose the data is used for? (3-5 sentences)

If you're interested in reading more on this topic, here's a recent news article related to web scraping! (This is completely optional)

Handing In

Your ~/course/cs1951a/scraping directory must contain the following:

assignment.py
data.db
query1.sql
query2.sql
query3.sql
query4.sql
written_questions.txt

Then run: cs1951a_handin scraping to submit the files in that directory.

Credits

Made with ♥ by Jacob Meltzer and Tanvir Shahriar (2019 TAs), updated by Natalie Delworth and Nazem Aldroubi (2020 TAs)
