Multiple Linear Regression

The multiple linear regression model uses the following features to predict the continuous target 'box office': country, genre, runtime, production and rating. The model requires numerical features, so we had to transform production and rating, the only features that were still non-numerical. We combined the 2016 and 2017 datasets into one dataframe and one-hot encoded these two features with the pandas function get_dummies. The years had to be combined before encoding so that the train and test datasets would end up with the same feature columns; encoding each year separately would create columns only for the categories present in that year. After one-hot encoding, the dataframe was filtered by year so that we could train the model on the 2016 data and test it on the 2017 data. We used LinearRegression from sklearn to create the model. It had a training score of 0.6195 and a testing score of 0.4431.
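The workflow described above can be sketched roughly as follows. This is a minimal illustration and not our actual code: the toy rows, the studio names, and the column names (year, runtime, production, rating, box_office) are made up, and country and genre are left out for brevity on the assumption that they were already numerical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-ins for the 2016 and 2017 datasets (all values are invented).
movies_2016 = pd.DataFrame({
    "year": [2016] * 4,
    "runtime": [95, 120, 110, 130],
    "production": ["StudioA", "StudioB", "StudioA", "StudioC"],
    "rating": ["PG", "R", "PG-13", "R"],
    "box_office": [50.0, 120.0, 80.0, 150.0],
})
movies_2017 = pd.DataFrame({
    "year": [2017] * 3,
    "runtime": [100, 125, 105],
    "production": ["StudioB", "StudioD", "StudioA"],
    "rating": ["PG", "R", "G"],
    "box_office": [70.0, 130.0, 60.0],
})

# Combine both years first so that get_dummies sees every category
# and both years end up with identical one-hot columns.
combined = pd.concat([movies_2016, movies_2017], ignore_index=True)
encoded = pd.get_dummies(combined, columns=["production", "rating"], dtype=float)

# Split back by year: 2016 for training, 2017 for testing.
train = encoded[encoded["year"] == 2016]
test = encoded[encoded["year"] == 2017]

# Every column except the year and the target is a feature.
feature_cols = [c for c in encoded.columns if c not in ("year", "box_office")]
X_train, y_train = train[feature_cols], train["box_office"]
X_test, y_test = test[feature_cols], test["box_office"]

model = LinearRegression().fit(X_train, y_train)

# LinearRegression.score returns the coefficient of determination (R^2).
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"train score: {train_score:.4f}, test score: {test_score:.4f}")
```

In this sketch StudioD and the G rating appear only in the 2017 rows, yet the 2016 training frame still has columns for them (filled with zeros) because the encoding was done on the combined dataframe.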