Our models are only as good as our datasets, and our datasets were limited in size. For each year we webscraped more than 3,000 titles, but after eliminating those without data for 'box office' we were left with about 250 titles per year, a large reduction in our data size. Furthermore, when we investigated this gap, we found our box office numbers did not always match the movie-specific box office gross incomes we found while researching. This could be because the numbers found online covered multiple countries, spanned a longer period of time, or included only opening-weekend figures. Our data's accuracy is also limited by the accuracy of the original webscraped data; for example, the scrape for 2016 returned a few movies that were actually released in 2015.
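As a rough sketch of this filtering step (the file and column names here are hypothetical, not our actual schema), dropping titles without a usable 'box office' value in pandas might look like:

```python
import pandas as pd

# Hypothetical file and column names -- our real schema may differ.
movies = pd.read_csv("movies_2016.csv")

# Coerce the scraped box office strings to numbers; titles the API
# returned with 'N/A' or blanks become NaN and are then dropped.
movies["box_office"] = pd.to_numeric(movies["box_office"], errors="coerce")
movies = movies.dropna(subset=["box_office"])

print(f"{len(movies)} titles remain with box office data")
```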
Another limitation was the number of API calls we were allowed. Given our timetable, we only had the time and resources to compile data for 2016-2018. We had planned to use 2018 to validate our model, but it had even less 'box office' data than the 2016 and 2017 datasets: instead of approximately 250 records, it had only 25. With such a dramatic drop in size, we decided it would not be meaningful to use the 2018 dataset for validation. In hindsight, we wish we had pulled the data for 2015 instead of 2018, but we had no way of knowing that 2018 would have such limited 'box office' data.
Our current model has very promising results, though we are curious whether it would remain as accurate with larger and more complete datasets. Given more time, we could compile complete datasets for 2000-2015, enabling us to train and test the model on 2000-2017 data versus our current scope of 2016-2017. In addition, training and testing models beyond linear and logistic regression would be beneficial.
Over the past two weeks, we worked to compile more data to see how our models would perform. We compiled data for 2013, 2014, and 2015, which enabled us to train on 2013-2016 rather than only 2016; we still used 2017 to test the models. The logistic model originally had a train score of 0.917 and a test score of 0.903. When training on 2013-2016, the train score slightly decreased to 0.909, but the test score rose to 0.917. Thus, with the additional years, our model was testing better than it trained.
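As a minimal sketch of this year-based split (the feature columns and success label below are assumptions for illustration, not our exact inputs), the scikit-learn call pattern would be roughly:

```python
from sklearn.linear_model import LogisticRegression

# 'movies' is assumed to hold all years in one frame, with a 'year'
# column, numeric features, and a binary success label.
feature_cols = ["runtime", "metascore", "imdb_rating"]  # illustrative
train = movies[movies["year"].between(2013, 2016)]  # inclusive bounds
test = movies[movies["year"] == 2017]

clf = LogisticRegression(max_iter=1000)
clf.fit(train[feature_cols], train["success"])

print("train accuracy:", clf.score(train[feature_cols], train["success"]))
print("test accuracy:", clf.score(test[feature_cols], test["success"]))
```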
The original multiple linear regression model trained on 2016 with a score of 0.619 and tested on 2017 with a score of 0.443. When we expanded the training set to 2013-2016, the train score was 0.465. When we tested on 2017, however, the score was -7.260. A negative R² score is not a display artifact: it means the model's predictions were worse than simply predicting the mean box office value for every movie, so the expanded model effectively had no predictive power on the 2017 data. We think the model performed so much worse over a longer time period because the noise in the data increased dramatically. Furthermore, this amplifies our lack of financial data about the movies, which most likely has a larger effect and would serve as a better predictor of box office amount. We would be interested in reducing the noise and pursuing a more complex model, such as a decision tree or random forest, to see if that would improve performance.
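If we were to try a random forest, a minimal sketch reusing the hypothetical split above might look like the following; it also illustrates how scikit-learn's R² score goes negative whenever a model predicts worse than a constant guess:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Regress on the raw box office figure instead of a success label.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(train[feature_cols], train["box_office"])

pred = rf.predict(test[feature_cols])
# r2_score < 0 means the model does worse than always predicting
# the mean box office of the test set.
print("test R^2:", r2_score(test["box_office"], pred))
```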