We used a web scraping script to collect data that did not appear in the existing dataset: for example, information about the release date, budget, US gross revenue, and world gross revenue. Using information contained in the available (and callable) IMDb datasets (tables such as crews, filterCombined, filteredTitle, namebasics, principals, etc.), we executed several joins with the scraped data to create a collective database. Twitter data was collected from a combination of publicly available tweets and tweets harvested via the Twitter API, and was subsequently joined to the database as well.
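As a rough illustration, the join step can be sketched with pandas; the file names and the tconst key column here are assumptions about our table layout, not the exact code we ran:

```python
import pandas as pd

# Load the scraped fields and the callable IMDb tables (file names are illustrative)
scraped = pd.read_csv("scraped_movies.csv")    # release date, budget, US gross, world gross
titles = pd.read_csv("filteredTitle.csv")      # basic title information
crews = pd.read_csv("crews.csv")               # director/writer IDs per title
principals = pd.read_csv("principals.csv")     # principal cast per title

# Join everything on the shared IMDb title identifier (tconst) into one collective table
combined = (titles
            .merge(scraped, on="tconst", how="left")
            .merge(crews, on="tconst", how="left")
            .merge(principals, on="tconst", how="left"))
combined.to_csv("combined_database.csv", index=False)
```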
IMDb has a very large, complete, and well-known dataset, so the source is reputable. Twitter is likewise reputable with regard to the integrity of its tweets.
We collect information on movies released after 2008. Eleven years of data is not a small dataset, but it can still carry some bias. Because the data starts in 2008, it reflects people's taste in movies since 2008: some kinds of movies that were popular before 2008 may be less popular within our window. Such movies could become popular again out of nostalgia, however, and for movies of this kind the model would be biased toward assuming audiences have lost interest when in fact they have not.
The IMDb datasets are objective, so we do not expect them to skew in any particular direction. Twitter data, however, inevitably reflects personal preference, since individuals have different tastes in movies. Some tweets may also be advertising in disguise, and others may be written because the author likes a specific actor in a movie rather than the movie itself. These issues will pose challenges for the construction of our model.
We have around 88,213 movies in our data. Because IMDb does not provide gross figures for some movies, particularly low-budget, low-revenue ones, our model will later carry some bias with respect to those movies. For now, we do not want to remove the entries with low budget and 0 gross, since doing so would further skew the model. The data also cannot be split evenly into low-budget and high-budget sets, because many entries have a budget recorded as 0 (i.e., missing), and for now these are treated as low budget. We discussed this with our professor, and since she suggested we keep as much data as possible, we keep all the near-zero entries.
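A minimal sketch of how the near-zero entries can be kept but flagged (the column names budget and usGross are illustrative):

```python
import pandas as pd

df = pd.read_csv("combined_database.csv")

# Keep every row, but flag entries whose budget/gross is recorded as 0,
# since a 0 here usually means "not reported" rather than a true zero
df["budget_reported"] = df["budget"] > 0
df["gross_reported"] = df["usGross"] > 0
print(df[["budget_reported", "gross_reported"]].mean())  # share of rows with real values
```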
Since IMDb does not provide budget and gross information for some smaller movies, those entries will have missing budget and gross values, which will affect box office prediction. We hope to rectify this as best we can with additional sources.
After cleaning the data and sorting by movie ID, we have no duplicates.
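The check itself is simple; a sketch, assuming the movie ID column is named tconst:

```python
import pandas as pd

df = pd.read_csv("combined_database.csv").sort_values("tconst")
# No movie ID should appear more than once after cleaning
assert not df["tconst"].duplicated().any(), "duplicate movie IDs found"
```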
The data is skewed, with a large share of values clustered around 0. The following scatterplots show the distributions. All minimum values are 0; the maximum budget is 356,000,000, the maximum US gross is 936,662,225, and the maximum world gross is 2,797,800,564. (The x-axis matches each plot's title; the y-axis shows counts.)
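A sketch of how such count plots can be produced (here rendered as histograms, since the y-axis shows counts; column names are assumed):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("combined_database.csv")
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["budget", "usGross", "worldGross"]):
    ax.hist(df[col].dropna(), bins=50)
    ax.set_title(col)        # x-axis matches the title
    ax.set_ylabel("counts")  # y-axis shows counts
plt.tight_layout()
plt.show()
```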
First, missing values appeared inconsistently as both \N and NaN, so we converted them all to NaN. Second, we have not yet decided whether to store "runtimeMinutes" as an int or a string; for now its data type is string.
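This normalization can be done at load time; a minimal sketch, assuming tab-separated IMDb files:

```python
import pandas as pd

# IMDb dumps use "\N" for missing values; treat both "\N" and "NaN" as missing on load
df = pd.read_csv("filteredTitle.csv", sep="\t",
                 na_values=["\\N", "NaN"], keep_default_na=True,
                 dtype={"runtimeMinutes": str})  # undecided, so keep as string for now
```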
We are considering this but are not yet sure. We would like to remove some of the zero entries for budget and gross, since many movies simply lack this information on IMDb. For now we keep those entries, because removing them would skew the analyses: some niche kinds of movies do not have a huge gross or budget, but that does not mean they are not, or will not be, popular.
First, we found several datasets among the IMDb data dumps, but since they do not include budget, gross, or average rating information, we needed to write a scraper to pull those fields from the IMDb website. Also, during data cleaning there was a large amount of dirty data to handle. Since there are more than 80,000 data points, we wrote code to split the work into batches, so that each of us could run a batch separately and then combine all the data together.
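A minimal sketch of the batching step, assuming the IDs live in a tconst column and the work is split four ways:

```python
import numpy as np
import pandas as pd

ids = pd.read_csv("filteredTitle.csv", sep="\t")["tconst"].tolist()

# Split the 80,000+ IDs into batches so each team member can scrape one batch
batches = np.array_split(ids, 4)
for i, batch in enumerate(batches):
    pd.Series(batch).to_csv(f"batch_{i}.csv", index=False, header=False)

# Each member runs the scraper on their batch; results are concatenated afterwards:
# combined = pd.concat([pd.read_csv(f"scraped_batch_{i}.csv") for i in range(4)])
```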
The next step is to use sentiment analysis of the Twitter data to build a machine learning model and train it to predict a movie's box office revenue. Collecting tweets through the Twitter API proved challenging for several reasons. Most notably, there is a lot of Twitter to search: even for well-known celebrities, finding relevant tweets is like finding a needle in a haystack. A task like this therefore requires many API calls, which quickly depletes the call allotment, and since premium APIs are expensive, this makes it hard to design the data collection pipeline cost-effectively. A related challenge is that, for the free Twitter APIs, the date range that can be searched appears to be restricted. Improving the current set of sentiment data with more effective, higher-throughput Twitter API calls and supplemental publicly available tweets is a clear goal going forward.
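For reference, a rate-limit-aware search sketch using tweepy's v2 Client; the bearer token and query are placeholders, and the free "recent search" endpoint only reaches roughly the past seven days, which matches the date-range restriction noted above:

```python
import tweepy

# Bearer token is a placeholder; wait_on_rate_limit pauses instead of erroring
# when the API call allotment is exhausted
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN", wait_on_rate_limit=True)

tweets = []
# Paginate to collect up to 500 recent English tweets mentioning an example movie
for page in tweepy.Paginator(client.search_recent_tweets,
                             query='"Avengers: Endgame" lang:en -is:retweet',
                             max_results=100, limit=5):
    tweets.extend(page.data or [])
print(len(tweets), "tweets collected")
```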
We organized all our data into the Google Drive folder "data_final" and related files.