Where is the data from?

How did you collect your data?

We used a web-scraping script to collect data that did not appear in the existing datasets: for example, information about the release date, budget, US gross revenue, and world gross revenue. Using information contained in the available (and callable) IMDb datasets (tables such as crews, filterCombined, filteredTitle, namebasics, principals, etc.), we executed several joins with the scraped data to create a collective database. Twitter data was collected from a combination of publicly available tweets and tweets harvested from Twitter using its API, and was subsequently joined to the database as well.
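The join step can be sketched as a minimal pandas example. The column names (tconst, budget, us_gross) and the tiny tables are illustrative assumptions, not the actual schema:

```python
import pandas as pd

# Hypothetical IMDb-style title table, keyed by the title identifier "tconst".
titles = pd.DataFrame({
    "tconst": ["tt0001", "tt0002", "tt0003"],
    "primaryTitle": ["Movie A", "Movie B", "Movie C"],
})

# Scraped financial fields that are absent from the bulk IMDb datasets.
scraped = pd.DataFrame({
    "tconst": ["tt0001", "tt0002"],
    "budget": [50_000_000, 1_200_000],
    "us_gross": [120_000_000, 900_000],
})

# A left join keeps every IMDb title, even when no financials were scraped;
# unmatched rows get NaN for the scraped columns.
combined = titles.merge(scraped, on="tconst", how="left")
print(combined.shape)  # (3, 4)
```

A left join (rather than inner) preserves titles with missing financials, which matters later when deciding how to handle zero/unknown budgets.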

Is the source reputable?

IMDb has a very large, complete, and well-known dataset, so the source is reputable. Twitter is likewise reputable with regard to the integrity of its tweets.

How did you generate the sample? Is it comparably small or large? Is it representative or is it likely to exhibit some kind of sampling bias?

We collected information on movies released after 2008. Eleven years of data is not a small dataset, but it can still carry bias. Because the data starts in 2008, it is representative of people's taste in movies since 2008. Some kinds of movies may have been preferred before 2008 but less so afterward; a decade later, such movies could become popular again out of nostalgia. When predicting for that kind of movie, the model would carry the bias that audiences have already lost interest in it, which may no longer be true.

Are there any other considerations you took into account when collecting your data? This is open-ended based on your data; feel free to leave this blank. (Example: If it's user data, is it public/are they consenting to have their data used? Is the data potentially skewed in any direction?)

IMDb datasets are objective, so we do not expect them to be skewed in any particular direction. Twitter data, however, reflects personal preference, since individuals have different tastes across kinds of movies. There could also be comments written as advertising, and comments made simply because someone likes a specific actor in a movie rather than the movie itself. These factors will pose challenges for the construction of our model.

How clean is the data? Does this data contain what you need in order to complete the project you proposed to do? (Each team will have to go about answering this question differently, but use the following questions as a guide. Graphs and tables are highly encouraged if they allow you to answer these questions more succinctly.)

How many data points are there total? How many are there in each group you care about (e.g. if you are dividing your data into positive/negative examples, are they split evenly)? Do you think this is enough data to do what you hope to do?

We have around 88,213 movies in the data. Because IMDb does not provide gross figures for some movies, and low-budget, low-revenue movies are underreported, our model may later carry some bias against those movies. For now, we do not want to remove the entries with low budget and zero gross, since doing so would further influence our model. The data is also not evenly divided into low-budget and high-budget sets: many entries have no budget information and are recorded as 0, so for now they are treated as low budget. We discussed this with our professor, and since she suggested we keep as much data as possible, we are keeping all the near-zero entries.
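The low/high split described above can be illustrated with a small stand-in for the full table. The $10M cutoff and the values here are assumptions for illustration only; zero doubles as "unknown" because missing budgets were recorded as 0:

```python
import pandas as pd

# Tiny stand-in for the ~88,000-row table; values are made up.
movies = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "budget": [0, 0, 500_000, 40_000_000, 150_000_000],
})

# An assumed cutoff separates "low" from "high" budget; entries with
# budget recorded as 0 (unknown) fall into the low-budget group.
movies["budget_band"] = movies["budget"].apply(
    lambda b: "high" if b >= 10_000_000 else "low"
)
print(movies["budget_band"].value_counts().to_dict())  # {'low': 3, 'high': 2}
```

Counting the bands this way makes the imbalance visible before deciding whether to rebalance or keep everything, as the professor suggested.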

Are there missing values? Do these occur in fields that are important for your project's goals?

Since IMDb does not provide budget and gross information for some smaller movies, we are missing gross and budget values for those titles. This will affect the prediction of a movie's box office. We hope to rectify this as best we can with additional sources.
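Patching gaps from an additional source could look like the following sketch. The two tables and their values are hypothetical; `combine_first` keeps the primary value wherever it exists and fills gaps from the supplement:

```python
import pandas as pd
import numpy as np

# Hypothetical primary table with missing budgets.
primary = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "budget": [5_000_000, np.nan, np.nan],
}).set_index("tconst")

# Hypothetical supplemental source covering some of the gaps.
supplemental = pd.DataFrame({
    "tconst": ["tt2", "tt3"],
    "budget": [np.nan, 2_000_000],
}).set_index("tconst")

# Keep primary values; fill missing ones from the supplement where possible.
patched = primary.combine_first(supplemental)
print(int(patched["budget"].isna().sum()))  # 1 budget still unknown
```

Even after patching, some entries stay missing, so the model will still need a policy for unknown budgets.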

Are there duplicates? Do these occur in fields that are important for your project's goals?

After cleaning the data and sorting it by movie id, we do not have any duplicates.
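The sort-and-deduplicate step can be sketched like this; the rows are made up and the id column name is an assumption:

```python
import pandas as pd

# Hypothetical rows containing one duplicated movie id.
rows = pd.DataFrame({
    "tconst": ["tt2", "tt1", "tt2", "tt3"],
    "primaryTitle": ["B", "A", "B", "C"],
})

# Sort by movie id, then keep only the first occurrence of each id.
deduped = (
    rows.sort_values("tconst")
        .drop_duplicates(subset="tconst", keep="first")
        .reset_index(drop=True)
)
print(len(deduped))  # 3
```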

How is the data distributed? Is it uniform or skewed? Are there outliers? What are the min/max values? (focus on the fields that are most relevant to your project goals)

The data is skewed, with a large portion of the values clustered around 0. The following scatterplot shows the distribution. All minimum values are 0; the maximum budget is 356,000,000, the maximum US gross is 936,662,225, and the maximum world gross is 2,797,800,564. (The x-axis matches the field named in the title; the y-axis is counts.)

Are there any data type issues (e.g. words in fields that were supposed to be numeric)? Where are these coming from? (E.g. a bug in your scraper? User input?) How will you fix them?

First, some missing values appear as \N while others appear as NaN, a mismatch, so we converted them all to the NaN type. Second, we have not yet decided whether to use int or string for “runtimeminutes”; for now, the data type is string.
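The normalization can be sketched as follows. The column name and sample values are assumptions; `errors="coerce"` turns anything non-numeric into NaN rather than raising, which is one option if we later settle on a numeric type:

```python
import pandas as pd
import numpy as np

# Hypothetical column mixing the literal "\N" marker with "NaN" strings.
df = pd.DataFrame({"runtimeMinutes": ["90", r"\N", "NaN", "121"]})

# Normalize both missing-value spellings to a real NaN.
df = df.replace({r"\N": np.nan, "NaN": np.nan})

# If a numeric type is chosen later, coercion handles any leftovers.
df["runtimeMinutes"] = pd.to_numeric(df["runtimeMinutes"], errors="coerce")
print(df["runtimeMinutes"].tolist())  # [90.0, nan, nan, 121.0]
```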

Do you need to throw any data away? What data? Why? Any reason this might affect the analyses you are able to run or the conclusions you are able to draw?

We are considering it but are not yet sure. We would like to remove the zero entries for budget and gross, since IMDb provides no information for many of those movies. For now we keep them, because removing them would influence the analyses: some niche kinds of movies do not have a huge gross or budget, but that does not mean they are unpopular or will remain so.

Summarize any challenges or observations you have made since collecting your data. Then, discuss your next steps and how your data collection has impacted the type of analysis you will perform. (approximately 3-5 sentences)

First, we found several datasets among the IMDb dataset dumps, but since they do not include budget, gross, or averageRating information, we needed to write a scraper to get that information from the IMDb website. Also, during the data-cleaning stage there was a large amount of dirty data to clean. Since there are more than 80,000 data points, we wrote code to separate the data into batches, so that each of us could run one batch separately and we could then combine all the data together.
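The batch split can be sketched like this; the 100-row table and the batch size are placeholders standing in for the real ~80,000 rows:

```python
import pandas as pd

# Hypothetical stand-in for the full table.
df = pd.DataFrame({"tconst": [f"tt{i:07d}" for i in range(100)]})

# Slice into fixed-size chunks so each teammate processes one chunk,
# then recombine the results in order.
batch_size = 25
batches = [df.iloc[i:i + batch_size] for i in range(0, len(df), batch_size)]
recombined = pd.concat(batches, ignore_index=True)

print([len(b) for b in batches])  # [25, 25, 25, 25]
print(len(recombined))            # 100
```

Recombining with `ignore_index=True` restores a clean 0..N-1 index, so the merged result matches a single-pass run.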

The next step is to use sentiment analysis on the Twitter data to build a machine learning model that takes this data as input, then train the model to predict a movie’s box-office revenue. Using Twitter API calls to get the tweets proved challenging for several reasons. Most notably, there is a lot of Twitter space to search for tweets -- even for well-known celebrities, this is still similar to finding a needle in a haystack. A task like this therefore requires a lot of Twitter API calls, which quickly depletes the API call allotment. Given that premium APIs are more expensive, this poses a challenge for designing the data collection pipeline most effectively. A related challenge is that, for the free Twitter APIs, the date range over which you can search appears to be restricted. Improving upon the current set of sentiment data with more efficient, high-throughput Twitter API calls and supplemental publicly available tweets is a clear goal going forward.
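Once tweets are scored, turning them into a per-movie model feature could look like the sketch below. The sentiment scores and ids are entirely made up, and averaging per movie is one assumed aggregation choice among several:

```python
import pandas as pd

# Hypothetical scored tweets: one row per tweet, score in [-1, 1].
tweets = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt2", "tt2", "tt2"],
    "sentiment": [0.8, 0.4, -0.2, 0.1, -0.5],
})

# Aggregate to one feature row per movie: mean sentiment plus tweet count,
# so the model can also weight movies with more Twitter coverage.
features = tweets.groupby("tconst")["sentiment"].agg(["mean", "count"])
print(features)
```

Keeping the tweet count alongside the mean guards against treating a movie with two tweets the same as one with thousands.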



We organized all our data into the Google Drive folder "data_final" and related files.

Google Drive