Assignment 7 « Natural Language Processing »

Brief

Due date: April 23, 2020 at 11:59pm
Stencil: cs1951a_install nlp

cp -r /course/cs1951a/pub/nlp/stencil ~/course/cs1951a/nlp

Data: cp -r /course/cs1951a/pub/nlp/bbc ~/course/cs1951a/nlp
Handin: cs1951a_handin nlp
Files to submit: nlp.py and README.txt

Overview

Oftentimes, we will have to work with a text-based dataset instead of a numerical one. Being able to make sense of this data is of great interest to us for such tasks such as sentiment analysis (determining whether a review was positive or negative), text summarization, and translation between languages. This collection of methods is called Natural Language Processing (NLP).

In this assignment, we’ll be going over topic modeling, which is an algorithm that analyzes a collection of documents (also known as a corpus) to determine what each document is about.

Assignment

In this assignment, we will be sorting different documents according to 5 categories: business, entertainment, politics, technology, and sports. In order to sort them, you will implement a topic modeling algorithm to group documents based on these categories.

Note that in this assignment, we'll be using many library functions that rely on randomness such as shuffling and KMeans. In order to preserve the "randomness" of these algorithms while still making the results deterministic (meaning that you'll get the same results every time), you should always set random_state=0 whenever you can. The stencil code will let you know when you should be setting a random state, and failure to do so will result in a point deduction. You can read more about this here.

In addition, the libraries that we'll be using (NLTK and scikit-learn) are extremely well documented and also have a lot of examples on use them. As a result, this handout will not go over how to use the NLTK/scikit-learn functions for the sake of brevity.

Part 1: Data Cleaning

5 points

Text data usually requires more preprocessing than numerical data. This section will outline several techniques used to process text. Even though there are many steps, the NLTK (Natural Language Toolkit) library thankfully does most of the work for us.

For this part of the assignment, you'll be filling out a function called process_document which performs the following operations on a document:

Consistent Casing

When dealing with text data, it’s important to convert our text to either all lower case or all upper case (though typically lower cased is used for legibility). This is to ensure that we don’t mistakenly treat two words with different casing but with the same characters as distinct.

For instance, we want to treat “Data”, "DATA", and “data” all as the same word, which we can do by converting all of our text to lower case.

Tokenization

Tokenization refers to the act of breaking up a document into discrete words. This is more complicated than simply splitting on all whitespace.

If we just split on whitespace, then the document (with all words converted to lower case) “i like to sleep. sleep is good.” won’t be tokenized correctly since “sleep.” and “sleep” will be treated as two distinct words.

There are many other edge cases that can come up during tokenization. NLTK comes with tokenizers that we can use, but we’ll be using the RegexpTokenizer to remove any non-alphanumeric characters, such as punctuation and whitespace.

Removing Common Words

When working with text data, we oftentimes want to remove common words that don’t contribute much to the meaning of sentences such as “the” and “is”. These words are referred to as stop words.

This is especially important for topic modeling, which relies on using word frequencies to generate the topics (more on this later). Since stop words appear so often, if we didn’t remove them, all of our topics may be determined by stop words, which isn’t helpful since these words don’t have much semantic meaning and typically aren’t specific to a certain topic.

Stemming

Consider words like “warmer” and “warmest.” Even though they’re technically distinct words, they share a common root: “warm”, so it makes sense to consider them all the same in NLP.

This process of removing word suffixes is referred to as stemming. Just like with tokenization, NLTK comes with built-in stemmers. For this assignment, we’ll be using the SnowballStemmer, which algorithmically removes word suffixes based on orthography (spelling).

Part 2: Data Loading

40 points

Using the process_document function written in Part 1, we can now read in all of our documents. Fill out the read_data function, which returns four things:

documents: a mapping from document IDs to a bag of words (see the "Bag of Words Representation" subsection below)
document_word_counts: a mapping from words to number of documents it appears in, NOT the total number of times it appears across all documents
word_ids: a mapping from words to unique integer IDs
labels: a mapping from document IDs to labels (one of 'e', 'p', 't', 's', or 'b', for entertainment, politics, technology, sports, or business, repsectively). You can get this label from the filename variable. Each file is named using following scheme: L_XXX.txt where L is a one character label and XXX is a three digit ID. Do NOT use these IDs when calculating documents though, since these IDs are not guaranteed to be unique across the different labels (i.e. we can have both a b_001.txt and s_001.txt

Bag of Words Representation

For this algorithm, we can disregard syntax (word order) and treat each document as a bag of words, meaning that after preprocessing our text, we can represent each document as a dictionary mapping words to their counts.

For example, if we had this document:

Belle likes to read. Beast likes to throw tantrums.

Our bag of words (disregarding preprocessing) would look like this:

{
    'Belle': 1,
    'likes': 2,
    'to': 2,
    'read': 1,
    'Beast': 1,
    'throw': 1,
    'tantrums': 1
}

Part 3: Topic Modeling

20 points

To complete this part, follow the TODOs in the LSA function to fill it in.

Topic modeling allows us to classify a text document into a certain topic. To give a concrete example, consider this piece of text:

It is called high-definition - HD for short - and it is already hugely popular in Japan and the US. It is set, according to analysts, to do for images what CDs did for sound. Different equipment able to receive HD signals is needed though and is expensive. But Europe's gamers may be the early adopters to drive demand. Europeans will have to wait until at least 2006 until they see mainstream HDTV.

If asked what this piece of text is about, you may say it's about TVs or technology but not about food or fashion. This is in essence the heart of what we are trying to do in topic modeling.

We will group documents from a corpus of BBC (British Broadcasting Corporation) news articles into 3 of 100 different possible topics each. Keep in mind that we specify that our model should generate 100 topics, but since computers don’t have an innate understanding of language, there’s no easy way for us to specify what these 100 topics should be. The model will instead figure out 100 general categories that the words fall under, and looking at the words in each category can give you a sense of what each topic might be. There’s no way for us to truly know what each topic represents, however. In other words, instead of giving us concrete topics like “sports” and “animals,” we just end up with 100 abstract topics: topic0, topic1, …, topic 99.

While there are several ways performing of topic modelling, we are going to be implementing one called Latent Semantic Analysis.

Latent Semantic Analysis (LSA)

tf-idf Scores

Once we have each document represented as a bag of words, we need to calculate term frequency-inverse document frequency (tf-idf) scores, which is a metric to measure how important each word is in a document.

We can store our tf-idf scores in a matrix (W) in which the number of rows (r) is equal to the total number of documents and the number of columns (c) is equal to the number of words that we have across all of our documents. This means that we need to assign each document a unique integer id between 0 and r-1, inclusive, and each word a unique integer id between 0 and c-1, inclusive.

tf-idf scores are composed of two parts: the term frequency (tf) score and the inverse document frequency (idf) score. Given a document i and word j, the tf score for W_i,j is equal to the number of times word j appears in document i divided by total number of words in document i. As the name term frequency implies, this just measures how often word j appears in document document i, relative to the other words in document i.

However, just because a word appears frequently doesn't necessarily mean that it's important to the meaning of the document. While we want to find the words that appear frequently in a document from the tf score, we also want to give more weight to words that appear infrequently in other documents. For example, if we had an article about avocado prices in the US, words like "avocado" and "US" might both have high tf scores, but we want to weigh "avocado" higher since it's a rarer word overall (less likely to appear in any given document) and thus is more likely to contribute to the overall meaning of the document.

This is why we also need to calculate an inverse document frequency (idf) score to figure out which words are rare across our entire set of documents. The idf score of a word j is equal to log(total number of documents / number of documents containing j). Note that log here refers to natural log, not base-10 log.

We then multiply the tf and idf scores to calculate the tf-idf score for W_i,j.

Matrix Factorization

Now that we have our matrix of tf-idf scores, we can transform it using matrix factorization (hint: refer to the matrix factorization lab!). This will give us a new matrix in which the rows represent documents and the columns represent topics from which you can use the elements with the highest values to figure out which topics correspond to which documents.

In other words, the shape of the matrix is (number of documents, number of topics) and the topic that most strongly corresponds to a document d will be the argmax of the row of d.

Since linear algebra is not a pre-requisite for this class, we will not be going over the math behind this, nor will you be expected to understand how it works, but you can read more here if you're interested!

Finally, we want to assign each document's top "topics_per_document" topics to that document's ID, according to our new matrix, using a dictionary.

Part 4: Classification

15 points

Now that we've processed our data and assigned topics to each document, we can finally train a classifier! We'll be exploring four different classification algorithms: k-Nearest Neighbors (KNN), decision trees, support vector machines (SVM), and multi-layer perceptrons (a basic form of a neural network).

We won't be asking you to implement any of these algorithms from scratch. Instead, we'll be using the scikit-learn (sklearn) library's implementations.

More Data Processing

Since classification is a supervised machine learning algorithm, we'll need training data and testing data. sklearn has a function called train_test_split that will split our data for us. Please use a split of 90% training and 10% testing. It also conveniently can also shuffle our data for us, which you should do with a random state of 0.

Training and Evaluation

Once we have training and testing data, we can finally train our models and see how they do. Train an instance of each of the four different algorithms listed above, and see how well they do on the testing data.

For KNN, set the number of neighbors used to classify to be 3. You can use the default parameters for the other three classifiers (with the exception of random_state).

Follow the instructions and complete the classify function.

The scores of these classifiers are printed in the main function. You will not be graded on these scores, however, if your score is too far off, it might be from an error in an another portion of the assignment that you will be graded on. For reference, this is the performance of the TA solution: K-Nearest Neighbors Accuracy: 0.673, Decision Tree Accuracy: 0.771, SVM Accuracy: 0.422, Multi-Layer Perceptron Accuracy: 0.395

Part 5: Clustering

10 points

What if we didn’t know which categories our documents fell into? If we wanted to put our documents into 4 groups (without using any labels), we need to come up with a new way to organize them.

Because we no longer have labels for our data, we need to switch from classification, which is a supervised algorithm, to clustering, which is unsupervised.

Recall from your previous assignment that you can use the k-means algorithm in order to group together similar inputs. We’ll be using k-means again in order to cluster our documents based on their topics in order to figure out which documents are similar to each other.

We’ll be using the KMeans class from the sklearn library to do this. Use k-means in order to predict which cluster each of the documents is in, as well as to calculate the center of each cluster. Then use the provided plot_clusters function to visualize your clusters.

Part 6: Written Questions

10 points

In February 2019, OpenAI created a transformer-based language-generator model called GPT-2 that is capable of generating near human-level writing ability.

There has been continuous debate on whether releasing this model was irresponsible since it could be able to auto-generate fake news. However in November, OpenAI decided to publish the full version of the model (previously, only much smaller model architectures were released). Read the following article and respond to the written questions in the file README.txt.

You can try out the model here. Try inputting in different types of paragraphs - poems, emails, news articles, etc. How good is the model at generating new text? Are there any mistakes? (2 points).
What types of positive and negative effects might come from use of this algorithm? Do you think that OpenAI should have released this model? Why or why not? (8 points)

Handing In

The folder you hand in must contain the following:

nlp.py
README.txt - containing anything you'd like the TAs to know about your handin as well as the answers to the written questions.

Then run: cs1951a_handin nlp.

Do not hand in the data files!

Credits

This is a new assignment created by Alex Jang (ajang) in Spring 2019, and updated by Nam Do (ndo3) and Maggie Wu (mwu27) in Spring 2020!

Source for the BBC News dataset: D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.