How much money a movie makes is a large factor in whether it will be deemed successful or not? A model that could predict its box office amount would be extremely useful to movie franchises.
Predicting an exact box office gross income can be challenging since it is continuous data. Changing our question from a continuous outcome to a categorical outcome could prove insightful and potentially be a more accurate model.
Originally, we wanted to predict the box office gross income for movies being released in 2020, though as we dove into the data we realized there was not enough public information to create a meaningful model. We began to look at previous years that contained more complete information. We settled on using 2016 as our train data set and 2017 as our test data set for our model.
There were three phases to creating the necessary datasets to run our models. We began by web scraping to create comprehensive lists of movies released each year. Next, data was compiled through API calls to the OMDb API. Finally, the data was cleaned and transformed to meet the needs of our models.
A logistical regression model was created to answer our categorical question: Is the box office gross income going to be in the top 20 for the year? A multiple linear regression model was created to answer our continuous question: Can we predict the box office gross income? After creating our two models we began to examine the advantages and disadvantages of one model over another. This is shown in our comparison page.
Throughout this venture, we have found limitations as well as further steps that need to be taken.