The four batches of data for a single year were imported and combined into a single data frame. Movies that were not released in the United States were dropped; a movie released in multiple countries was kept as long as the United States was among them. Movies without box office information were also dropped. The data frame was then filtered so that each row had at least seven non-null features. At this point the data frame was exported to a CSV named cleaned_2016_data for later use if required. Starting from this cleaned data, we then transformed it into the format needed for our models.
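The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the code used in the project: the function name clean_movies is hypothetical, the column names country and box_office are borrowed from the feature list, and it assumes that country is a comma-separated string in which the United States appears as "USA" and that missing values have already been read in as nulls.

```python
import pandas as pd


def clean_movies(batches, min_non_null=7):
    """Combine batch data frames and apply the cleaning rules (illustrative sketch)."""
    # Stack the batches for one year into a single data frame.
    df = pd.concat(batches, ignore_index=True)

    # Keep movies whose country list includes the United States.
    # Assumption: the United States is written as "USA" in the country string.
    released_in_us = df["country"].fillna("").str.contains("USA", regex=False)
    df = df[released_in_us]

    # Drop movies that have no box office information.
    df = df.dropna(subset=["box_office"])

    # Keep only rows with at least `min_non_null` non-null features.
    df = df.dropna(thresh=min_non_null)

    return df.reset_index(drop=True)
```

Under these assumptions, the export step would be a call such as clean_movies(batches).to_csv("cleaned_2016_data.csv", index=False), where the .csv extension is also assumed.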
First, the columns were filtered to include only title, box_office, country, genre, production, rating, and runtime. Next, we transformed the data so that all of the features were numerical values. Country contained all of the countries in which a movie was released, and genre contained all of the genres under which it was categorized; both features were transformed into counts of their respective values. Rating contained both 'Not Rated' and 'Unrated'. We determined that these were equivalent and replaced all 'Not Rated' values with 'Unrated'. Any 'na' values in the rating column were also replaced with 'Unrated'. Runtime included the unit 'min' with each value, so it was stripped down to an integer, and any 'na' values were replaced with 0. Any null values for production were filled with 'NaN'. Box office was converted to an integer by removing dollar signs and commas. A year column was added, with its value set to the year the data represented, so that we can filter by year later when it is time to model. The 2016 data frame was exported as a CSV called 'train_data_2016' and the 2017 data frame as 'test_data_2017', since 2016 will be used to train and 2017 will be used to test.
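One way these transformations might look in pandas is sketched below. The function name transform_movies is hypothetical, and the sketch assumes that country and genre are comma-separated strings, that runtime looks like '120 min', and that box office looks like '$1,234,567'; the actual raw formats may differ.

```python
import pandas as pd

KEEP = ["title", "box_office", "country", "genre", "production", "rating", "runtime"]


def transform_movies(df, year):
    """Apply the feature transformations described above (illustrative sketch)."""
    out = df[KEEP].copy()

    # Replace the country and genre lists with counts of their entries.
    # Assumption: entries are separated by commas.
    for col in ["country", "genre"]:
        out[col] = out[col].fillna("").apply(
            lambda s: len([part for part in s.split(",") if part.strip()])
        )

    # Treat 'Not Rated', 'na', and nulls as 'Unrated'.
    out["rating"] = out["rating"].fillna("Unrated").replace(
        {"Not Rated": "Unrated", "na": "Unrated"}
    )

    # Strip the 'min' unit; anything non-numeric (such as 'na') becomes 0.
    runtime = out["runtime"].astype(str).str.replace("min", "", regex=False).str.strip()
    out["runtime"] = pd.to_numeric(runtime, errors="coerce").fillna(0).astype(int)

    # Fill missing production values with the string 'NaN'.
    out["production"] = out["production"].fillna("NaN")

    # Remove dollar signs and commas, then convert box office to an integer.
    box_office = out["box_office"].astype(str).str.replace(r"[$,]", "", regex=True)
    out["box_office"] = pd.to_numeric(box_office).astype(int)

    # Record the year so the combined data can be filtered later.
    out["year"] = year
    return out
```

Calling transform_movies(cleaned_2016, 2016) on a cleaned data frame (cleaned_2016 is a placeholder name) would give the frame that is then written out as 'train_data_2016'.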
The logistic regression model followed the same cleaning as above but required one further step. Our target, 'box office', was initially continuous rather than categorical. We therefore sorted the data frame by 'box office' in ascending order, then set the top twenty box offices to a value of 1 and the rest to 0. The 2016 data frame was exported as a CSV called 'train.log_data_2016' and the 2017 data frame as 'test.log_data_2017', since 2016 will be used to train and 2017 will be used to test.
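The binarization step could be sketched as follows. This is an illustration under stated assumptions: it reads "top twenty" as the twenty highest box office values, and it writes the label to a separate, hypothetically named column top_box_office rather than overwriting box_office, which may not match what was actually done. The function name add_top_n_label is also hypothetical.

```python
import pandas as pd


def add_top_n_label(df, n=20, target="box_office", label="top_box_office"):
    """Label the n highest values of `target` with 1 and the rest with 0 (sketch)."""
    # Sort ascending, so the highest box office values end up in the last rows.
    out = df.sort_values(target, ascending=True).reset_index(drop=True)

    # Start with every row labeled 0.
    out[label] = 0

    # Mark the last n rows (the n largest values) with 1.
    start = max(len(out) - n, 0)
    out.loc[out.index[start:], label] = 1
    return out
```

Ties at the cutoff are broken by sort order in this sketch; a rank-based rule would be an alternative if ties matter.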