Article

Finding Nemo: Predicting Movie Performances by Machine Learning Methods

1
Statistics Discipline, Division of Science and Mathematics, University of Minnesota-Morris, Morris, MN 56267, USA
2
Department of Biostatistics and Data Science, University of Texas Health Science Center, Houston, TX 77030, USA
3
Department of Marketing, California State University, Los Angeles 5151 State University Dr, Los Angeles, CA 90032, USA
4
Department of Big Data and Statistics, Cheongju University, Chungbuk 28503, Korea
5
Askew School of Public Administration and Policy, Florida State University, Tallahassee, FL 32306-2250, USA
*
Author to whom correspondence should be addressed.
J. Risk Financial Manag. 2020, 13(5), 93; https://doi.org/10.3390/jrfm13050093
Submission received: 11 April 2020 / Revised: 30 April 2020 / Accepted: 2 May 2020 / Published: 9 May 2020
(This article belongs to the Special Issue Machine Learning Applications in Finance)

Abstract

Analyzing the success of movies has always been a popular research topic in the film industry. Artificial intelligence and machine learning methods have been applied to model the financial success of movies. The contribution of this research is to combine Bayesian variable selection with machine learning methods for forecasting the return on investment (ROI). We also compare machine learning methods, including the quantile regression model, on movie performance data in terms of in-sample and out-of-sample forecasting.

1. Introduction

The movie industry has been growing worldwide over the past several decades. Competition in the global box office market is becoming increasingly complex, according to the annual report of the Motion Picture Association of America. The expansion of the movie market and the accompanying competition have encouraged research from various approaches. Legoux et al. (2016) showed that a movie with excellent reviews has a greater chance of remaining longer in theaters than one with poor, fair, or good reviews, even after controlling for the previous week’s box office revenue. Industry decision makers need a highly accurate model for predicting the success of a movie, because they aim to reduce the probability of making wrong decisions in the green-lighting process, that is, the process of formally approving movie production. Forecasting movie success is not easy because the movie industry often depends on complex issues such as social and economic factors. Therefore, previous research has employed various methods to help film producers and distributors predict the economic success of a film. Sharda and Delen (2006) considered MPAA rating, competition, star value, genre, special effects, sequel, and number of screens on the initial day of release, using logistic regression, discriminant analysis, classification and regression trees, and neural networks.
Eliashberg et al. (2009) employed classification and regression trees and neural networks with movie scripts. Lee and Chang (2009) employed a Bayesian belief network and a causal belief network with early box-office data, release season, and box-office revenue. Zhang and Skiena (2009) employed multilayer BP neural networks with nation, director, performer, propaganda, content category, month, week, festival, competition, cinema number, and screen number. Du et al. (2014) employed support vector machines (SVM) and neural networks with microblog posting counts and content. Lash and Zhao (2016) suggested a decision support system to help movie investment decisions at the early stages of movie production by using social network analysis and a text mining technique. The system automatically extracted several sets of features for predicting movie profitability, including “who” is in the cast, “what” a movie is about, “when” a movie will be released, and “hybrid” features that match “who” with “what” and “when” with “what”. Ho et al. (2017) investigated the probability that an individual-level decrease in preference over time is due to the well-known decrease in a movie’s revenue after opening. Machine learning is a widely used approach and has been repeatedly used to build prediction models, for example by Du et al. (2014) and Lee et al. (2018). Lee et al. (2018) used an ensemble approach, which had rarely been used in predicting box office performance. Machine learning can provide systematic support for decision-making; for example, Galvão and Henriques (2018) predicted the profit of a movie using neural networks, regression, and decision trees. Kim et al. (2017) performed box office forecasting considering the competitive environment and word-of-mouth in social networks in the Korean film market. Lu (2019) combined qualitative and quantitative analysis through the analytic hierarchy process method to establish a movie box office prediction model, using actual data from the Chinese film market.
Holesh (2019) looked for patterns of film performance by genre, charted these performances by genre and by year, and showed through regression analysis that consumers do have an expected response to certain genres over others. Zhang et al. (2019), Hur et al. (2016), and Kim et al. (2015) used social network analysis and text mining for movie industry analysis. Oh et al. (2017) showed that online consumer engagement behavior (CEB) affects future economic performance: CEB on Facebook and YouTube correlates positively with movie box-office revenue, and social media-based CEB is critical to improving the economic performance of movie firms. Çağlıyor et al. (2019) aimed to design a forecast model using different machine learning algorithms, such as support vector regression (SVR), artificial neural networks (ANN), decision tree regression (DT), and linear regression (LR), to estimate the theatrical success of US movies in Turkey before their market entry. Liu and Xie (2019) and Quader et al. (2017) also used machine learning to predict box office performance.
Previous studies have produced divergent results, possibly because researchers concentrated on building new algorithms and classification methods rather than on interpreting the findings. In this study, we use a Bayesian variable selection method to select the variables that are important for ROI, an approach that has not been studied in previous movie industry research. With the selected variables, we analyze and compare quantile regression, multivariate adaptive regression splines, support vector machine, and neural network methods to form an accurate prediction model of ROI using major film forecasting variables (such as the number of theater screenings, number of running weeks, critics’ reviews, production budget, and genres). The contribution of our research is therefore a method that combines Bayesian variable selection with machine learning methods, including quantile regression, for extremely skewed ROI data, in which many films have a low ROI and a few are very successful.
The layout of the article is as follows. In Section 2, we describe the Hollywood data we collected. In Section 3, we describe the linear model and machine learning methods used to model ROI, such as multivariate adaptive regression splines, support vector machines (SVM), and neural networks. In Section 4, we first apply Bayesian variable selection to choose the important variables for ROI and then apply the selected variables to the machine learning methods for modeling ROI. We then compare the proposed methods in terms of mean absolute percentage error (MAPE). In Section 5, concluding remarks are presented.

2. Data and Description

To perform the analysis, we rely mainly on information concerning 2010–2015 movie titles and genres collected from IMDb. Corresponding information regarding box office performance, critics’ reviews, and production budget was retrieved from Box Office Mojo and Metacritic. The complete data set contains a total of 719 movies categorized under 24 distinct film genres.
The descriptions of the employed variables are shown below:
  • Genre: The category of a particular movie. Multiple classifications are available.
    (Adventure, Animation, Children, Comedy, Crime, Documentary, Drama, Fantasy, Horror, Musical, Mystery, Romance, Thriller, War, Western)
  • Running Weeks: The length of a theater run for a particular movie, given in weeks.
  • Box Office: The total revenue (United States Dollar) of a particular movie from U.S. domestic theaters.
  • Number of Theaters: The total number of theaters screening a particular movie.
  • Meta Score: A weighted average score of published critic reviews of a particular movie.
  • Budget: The total production cost (United States Dollar) of a particular movie.
Figure 1 shows the roadmap of data analysis. As depicted in Figure 1, Steps 1 through 4 are data extraction, data preprocessing, data integration, and feature selection. Then, a regression analysis is performed.
Table 1 shows the summary statistics of the key variables used in the analysis. There were 719 movies between 2010 and 2015. The mean number of audiences per movie was 7.6 million with a standard deviation of 9.95 million. The IQR was 9.5 million for the audiences. The mean total revenue of a movie (i.e., Box Office) was $61.3 million with a standard deviation of $80.9 million. The mean total production cost (Budget) was $47.5 million with a standard deviation of $51.9 million. Thus, on average, each movie generated an operating income of about $13.8 million. The mean metascore was 51.61 and the mean number of theaters that showed a movie was 2253. Most variables used in the analysis had highly skewed distributions.
As seen in Table 2, “R” was the most frequent rating (n = 330) for movies between 2010 and 2015, followed by “PG-13” (n = 282). These two ratings accounted for about 85 percent of all ratings. There are very few movies rated “NC-17” or “G”.
As data sets have become larger, more complicated, and more highly correlated across variables, machine learning research has become very popular over the last two decades, because machine learning techniques can be applied to big, complicated, and highly correlated data that are difficult to handle with generalized linear regression methods. Recently, many variants of machine learning techniques have been applied to the economic success of the movie industry, but this research has mostly focused on classification methods. We therefore propose machine learning regression methods for modeling the financial success of movies, which is the main motivation of this paper.
With the movie data described in this section, we define the financial success of a movie as ROI (return on investment), computed from the box office and budget variables as follows:
$$\mathrm{ROI} = \frac{\text{Box office} - \text{Budget}}{\text{Budget}} \times 100\%$$
In terms of the film industry’s marketing viewpoint, we focus on modeling ROI with important variables selected by the Bayesian variable selection method. The higher the ROI is, the more profitable a movie is, and vice versa.
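As a minimal illustration of the ROI definition, the following Python sketch computes ROI in percent. It is not the R code used for the analysis; the function name `roi_percent` and the dollar figures are hypothetical and chosen only to show a break-even movie, a loss, and a hit.

```python
def roi_percent(box_office: float, budget: float) -> float:
    """Return on investment in percent: (box office - budget) / budget * 100."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    return (box_office - budget) / budget * 100.0


# Hypothetical figures in US dollars, not taken from the data set.
examples = {
    "break-even": (50e6, 50e6),
    "loss": (20e6, 50e6),
    "hit": (150e6, 50e6),
}

for label, (box_office, budget) in examples.items():
    print(f"{label}: ROI = {roi_percent(box_office, budget):.1f}%")
```

Because the budget is the denominator, ROI cannot fall below −100% but has no upper bound, which is consistent with a right-skewed distribution.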
Table 3 shows that the 25th percentile of ROI is negative (−65.57%), so at least a quarter of the movies did not recover their production budget. In Figure 2, the distribution of ROI is skewed to the right. This means that most movies fall in the low range of ROI, with a few exceptions spread over a long right tail of higher ROI.

3. Bayesian Variable Selection and Machine Learning Methods

We use Bayesian variable selection and statistical machine learning methods in this research, and we apply the Bayesian variable selection method to Hollywood movie data. In this section, we briefly explain Bayesian variable selection. Objective Bayesian methods for hypothesis testing and variable selection in linear models are considered in Garcia-Donato and Forte (2018), who also introduce functions to compute several types of model averaging estimates and predictions weighted by posterior probabilities. Their R package, BayesVarSel, contains exact algorithms to perform fast computations in problems of small to moderate size and heuristic sampling methods to solve large problems. We applied the GibbsBvs function in the ‘BayesVarSel’ R package (Garcia-Donato and Forte 2018), with the gZellner prior, 10,000 iterations, and a burn-in of 1000 iterations, to the variables described in Section 2.
Quantile regression is an extension of classical regression that offers information on the whole conditional distribution of the response variable. Whereas classical regression approximates the conditional mean, quantile regression approximates the conditional quantile functions of a response variable Y given a set of variables X. The quantile regression model can capture information associated with shifts in the location, scale, and shape of the conditional distribution. It is useful when heteroskedasticity is involved and in homogeneous regression models where the usual parametric assumptions do not hold. No error distribution is imposed in quantile regression. Quantile regression estimators share the equivariance properties of ordinary least squares estimators, but equivariance to monotone transformations is specific to quantile regression. Davino et al. (2014) provide an excellent source for the various properties of quantile regression as well as many computer algorithms.
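To make the idea of approximating a conditional quantile concrete, the sketch below uses the standard check (pinball) loss that quantile regression minimizes, here in the simplest case of fitting a single constant. It is a Python illustration under stated assumptions, not the estimation routine used in this paper: the helper names and the ROI values are hypothetical, and the search is restricted to the observed data values.

```python
def check_loss(residual: float, tau: float) -> float:
    """Check (pinball) loss: rho_tau(u) = u * (tau - 1[u < 0])."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def best_constant(values, tau):
    """Return the observed value that minimizes the total check loss at level tau."""
    def total_loss(candidate):
        return sum(check_loss(v - candidate, tau) for v in values)
    return min(sorted(values), key=total_loss)


# Hypothetical right-skewed ROI values in percent (illustrative only).
roi = [-80.0, -60.0, -20.0, 10.0, 40.0, 120.0, 900.0]

for tau in (0.25, 0.50, 0.75):
    print(f"tau = {tau:.2f}: minimizer = {best_constant(roi, tau)}")
```

The minimizers are sample quantiles, and the single extreme value of 900 does not move the fitted median, which is one reason quantile regression suits heavily skewed responses.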
Friedman (1991) introduced multivariate adaptive regression splines (MARS), a non-parametric regression technique that automatically models nonlinearities and interactions between variables. MARS builds models of the form
$$\hat{f}(x) = \sum_{i=1}^{n} C_i B_i(x)$$
where the model is a weighted sum of the basis functions $B_i(x)$, and the $C_i$ are constant coefficients. To apply MARS to Hollywood movie data, we used the earth function with its default settings in the ‘earth’ R package.
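The weighted sum of basis functions can be sketched in a few lines of Python. In MARS the basis functions are typically hinge functions of the form max(0, x − t) or max(0, t − x) for a knot t (or products of them). The knot, the coefficients, and the explicit intercept term below are hypothetical illustration values, not fitted to the movie data and not produced by the ‘earth’ package.

```python
def hinge(x: float, knot: float, direction: int = 1) -> float:
    """Hinge basis function: max(0, x - knot) for +1, max(0, knot - x) for -1."""
    return max(0.0, direction * (x - knot))


def mars_predict(x: float, intercept: float, terms) -> float:
    """Evaluate intercept + sum_i C_i * B_i(x) with hinge basis functions.

    `terms` is a list of (coefficient, knot, direction) triples.
    """
    return intercept + sum(c * hinge(x, k, d) for c, k, d in terms)


# Hypothetical model: one knot at 2000 theaters with different slopes on each side.
terms = [(0.05, 2000.0, 1), (-0.02, 2000.0, -1)]

for theaters in (500.0, 2000.0, 3500.0):
    print(theaters, round(mars_predict(theaters, 10.0, terms), 2))
```

The pair of mirrored hinges gives a piecewise-linear response with a change of slope at the knot, which is how the model captures nonlinearity without a parametric form.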
Smola and Schölkopf (2004) described the SVM algorithm as a nonlinear generalization of the Generalized Portrait algorithm (Vapnik and Lerner 1963; Hastie et al. 2009). In the context of the film industry, SVM has been a good modeling direction for predicting the economic success of a film. In machine learning, SVMs are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. To apply SVM regression to Hollywood movie data, we used the ksvm function in the ‘kernlab’ R package with its default settings unless stated otherwise. We used the radial basis function (RBF) kernel, which is a popular kernel function in kernelized learning algorithms, especially in support vector machine classification.
We also set the cost parameter to 5. A greater cost parameter penalizes large residuals more heavily; the resulting decrease in bias gives a more flexible model with fewer misclassifications. The cross-validation error is 3.
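Two ingredients of SVM regression with an RBF kernel can be written down directly: the Gaussian kernel, here in the parameterization exp(−σ‖x − y‖²), and the ε-insensitive loss, which ignores residuals inside a tube of half-width ε. The Python sketch below only evaluates these two functions; it does not fit a model, and the values of σ and ε and the input vectors are hypothetical rather than the settings used in this paper.

```python
import math


def rbf_kernel(x, y, sigma: float) -> float:
    """Gaussian RBF kernel: k(x, y) = exp(-sigma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * squared_distance)


def eps_insensitive(residual: float, epsilon: float) -> float:
    """Epsilon-insensitive loss: zero inside the tube, linear outside it."""
    return max(0.0, abs(residual) - epsilon)


# Hypothetical standardized feature vectors (e.g., audiences, theaters).
movie_a = [0.2, -0.1]
movie_b = [1.5, 0.8]

print("k(a, a) =", rbf_kernel(movie_a, movie_a, sigma=0.5))
print("k(a, b) =", round(rbf_kernel(movie_a, movie_b, sigma=0.5), 4))
print("loss for residual 0.05:", eps_insensitive(0.05, epsilon=0.1))
print("loss for residual -0.50:", round(eps_insensitive(-0.50, epsilon=0.1), 4))
```

In the fitted model, the cost parameter scales how strongly residuals outside the tube are penalized relative to the smoothness of the fitted function.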
Kaur and Nidhi (2013) built a mathematical model for predicting the success class (flop, hit, or super hit) of Indian movies. To accomplish this, they developed a methodology in which the historical data of each component that affects the success or failure of a movie (e.g., actor, actress, director, music) is given a weightage and then, based on multiple thresholds computed from descriptive statistics of the dataset of each component, is assigned a class label (flop, hit, or super hit). The dataset is then subjected to a neural network-based learning algorithm to automate the process, and the results are evaluated in terms of the match between actual and predicted class labels. The results indicate that the strategy of recognizing the class of success is extremely effective and accurate, as is evident from the classification matrix. In machine learning and cognitive science, an artificial neural network (ANN) is a network inspired by biological neural networks that is used to estimate functions that can depend on a large number of inputs and are generally unknown. To apply a single-hidden-layer neural network to Hollywood movie data, we used the nnet function in the ‘nnet’ R package, with a single layer with five neurons. We set the number of neurons in the hidden layer to 20 for the 2010–2015 data and to 10 for the 2010–2014 data. We set the weight decay parameter to 1 and switched on linear output units.
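The structure of a single-hidden-layer network with a linear output unit can be sketched as a forward pass: each hidden neuron applies a logistic function to a weighted sum of the inputs, and the output is a linear combination of the hidden activations. The Python code below is only an illustration of that structure; the weights are hypothetical, no training or weight decay is performed, and it is not the ‘nnet’ implementation.

```python
import math


def logistic(z: float) -> float:
    """Logistic (sigmoid) activation used by the hidden units."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, hidden, output) -> float:
    """Forward pass of a single-hidden-layer network with a linear output.

    `hidden` is a list of (bias, weights) pairs, one per hidden neuron.
    `output` is a (bias, weights) pair with one weight per hidden neuron.
    """
    activations = [
        logistic(bias + sum(w * xi for w, xi in zip(weights, x)))
        for bias, weights in hidden
    ]
    out_bias, out_weights = output
    return out_bias + sum(w * a for w, a in zip(out_weights, activations))


# Hypothetical network with two inputs and two hidden neurons.
hidden = [(0.1, [0.4, -0.3]), (-0.2, [0.7, 0.5])]
output = (1.0, [2.0, -1.5])

print(round(forward([0.5, 1.0], hidden, output), 4))
```

With a linear output unit the prediction is unbounded, which is what a regression target such as ROI requires; a logistic output would confine predictions to the interval (0, 1).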

4. Empirical Results

In this section, we compare the traditional linear regression, which requires the several assumptions mentioned previously, with popular machine learning methods for modeling ROI, using in-sample and out-of-sample forecasting.
We select the most important predictive variables that determine ROI by using the Bayesian variable selection method, one of the most popular machine learning methods. Figure 3 and Table 4 display the importance level of the predictors for ROI during the years 2010–2015. More useful variables achieve higher accuracy.
Based on the Bayesian variable selection method, we selected three important variables, Audiences, Theaters, and Horror, as the explanatory variables for modeling the output variable ROI during the years 2010–2015.
The histogram of ROI during the years 2010–2015 in Figure 2 shows an extremely skewed distribution: there are many films with a low ROI, and some are highly successful. Traditional regression analysis is not appropriate for these Hollywood data, so we used quantile regression (QR) at the 25th (QR25), 50th (QR50), and 75th (QR75) quantiles.
In Table 5, the outputs of quantile regression clearly show that ROI increases statistically significantly as the number of theaters increases, because the formula of ROI is based on two variables (Budget and Box Office). The interesting findings from QR50 and QR75 in Table 5 are that ROI is statistically significantly higher for the horror genre and that the intercept is positive and statistically significant, which means the average ROI during the years 2010–2015 increased.
From Table 6, the neural network (NNet) model has the smallest RMSE (root-mean-square error) for ROI with the 2010–2015 data (in-sample forecasting) compared with the RMSEs of QR25, QR50, QR75, MARS, and SVM. In terms of in-sample forecasting, the machine learning methods MARS, SVM, and NNet are superior to quantile regression, and NNet is the best of the three with these Hollywood data.
Using the Bayesian variable selection method, we also selected the most important predictive variables that determine ROI for the years 2010–2014. Table 7 and Figure 4 show that Audiences and Theaters are the selected explanatory variables for modeling the output variable ROI during the years 2010–2014.
In Table 8, the outputs of quantile regression clearly show that ROI increases statistically significantly as the number of theaters increases. However, the interesting finding from QR25 and QR50 in Table 8 is that the intercept is negative and statistically significant. This means the average ROI during the years 2010–2014 decreased.
We also divided the data into a training set (years 2010–2014) and a test set (year 2015) to compare the forecasting accuracy of the QR25, QR50, QR75, MARS, SVM, and neural network models. As a measure of the prediction accuracy of a forecasting method, we employed the mean absolute percentage error (MAPE), which is used as a loss function for regression problems in machine learning. MAPE is defined as
$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{O_i - P_i}{O_i} \right|$$
where $O_i$ is the actual value and $P_i$ is the forecast value. The absolute value in this formula is summed over every forecasted point and divided by the number of fitted points $n$.
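The two accuracy measures used in this section, MAPE for out-of-sample forecasts and RMSE for in-sample fits, can be computed as in the following Python sketch. The function names and the actual and forecast values are hypothetical; the guard against a zero actual value is an assumption added here because the ratio in MAPE is undefined when the actual ROI is exactly zero.

```python
import math


def mape(actual, forecast) -> float:
    """Mean absolute percentage error: mean of |(O_i - P_i) / O_i|."""
    if any(o == 0 for o in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return sum(abs((o - p) / o) for o, p in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast) -> float:
    """Root-mean-square error: sqrt of the mean squared difference."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(actual, forecast)) / len(actual)
    )


# Hypothetical actual and forecast ROI values in percent.
actual = [100.0, -50.0, 200.0]
forecast = [110.0, -40.0, 150.0]

print("MAPE:", round(mape(actual, forecast), 4))
print("RMSE:", round(rmse(actual, forecast), 4))
```

MAPE weights each error relative to the size of the actual value, so a model can rank differently under MAPE than under RMSE, which is dominated by the largest absolute errors.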
In Table 9, among the six models above, we can clearly see that the QR50 model has the smallest MAPE compared with the other five models (QR25, QR75, MARS, SVM, and NNet) in terms of ROI. For a graphical comparison of the forecasts, we used boxplots of the absolute percentage errors for each model in Figure 5. Table 10 shows that the QR50 model has the smallest median and interquartile range (IQR) among the six forecasting models. The results in Table 10 agree with Figure 5.
To test statistically for differences between models in the absolute percentage errors of ROI, we use the Wilcoxon rank sum test and the median test. For the Wilcoxon rank sum test, we rank all $N$ observations from the two samples combined, where $n_1$ and $n_2$ are the two sample sizes and $N = n_1 + n_2$. The sum $W$ of the ranks for the first sample is the Wilcoxon rank sum statistic. If the two populations have the same continuous distribution, then $W$ has mean
$$\mu_W = \frac{n_1 (N + 1)}{2}$$
and its standard deviation is
$$\sigma_W = \sqrt{\frac{n_1 n_2 (N + 1)}{12}}.$$
The Wilcoxon rank sum test rejects the hypothesis that the two populations have identical distributions when the rank sum W is far from its mean.
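The rank sum statistic, its mean, and its standard deviation can be computed as in the Python sketch below, which also standardizes W and applies a two-sided normal approximation. The sample values are hypothetical, and the normal approximation without tie or continuity correction is an assumption of this sketch rather than a description of the software used for Table 11.

```python
import math


def midranks(values):
    """Ranks 1..N of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def wilcoxon_rank_sum(sample1, sample2):
    """Return (W, mu_W, sigma_W, z, two-sided p) using a normal approximation."""
    n1, n2 = len(sample1), len(sample2)
    total = n1 + n2
    ranks = midranks(list(sample1) + list(sample2))
    w = sum(ranks[:n1])
    mu_w = n1 * (total + 1) / 2
    sigma_w = math.sqrt(n1 * n2 * (total + 1) / 12)
    z = (w - mu_w) / sigma_w
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w, mu_w, sigma_w, z, p_value


# Hypothetical absolute percentage errors for two forecasting models.
errors_a = [0.10, 0.20, 0.30]
errors_b = [0.40, 0.50, 0.60, 0.70]

w, mu_w, sigma_w, z, p_value = wilcoxon_rank_sum(errors_a, errors_b)
print(f"W = {w}, mean = {mu_w}, sd = {sigma_w:.3f}, z = {z:.3f}, p = {p_value:.4f}")
```

For samples this small an exact test would normally be preferred; the approximation is shown only because it follows directly from the mean and standard deviation given above.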
When the distribution may not be normal, we state the hypotheses in terms of population medians rather than means.
$$H_0: \mathrm{median}_1 = \mathrm{median}_2$$
$$H_a: \mathrm{median}_1 \neq \mathrm{median}_2$$
In Table 11, we used the Wilcoxon rank sum test and the median test to show the differences between QR50 and each of the other five models, based on the absolute percentage errors of ROI for each model. Table 11 shows that there are statistically significant differences between QR50 and three of the five models (QR75, MARS, and NNet), but no statistically significant difference between QR50 and QR25 or between QR50 and SVM, by both the Wilcoxon rank sum test and the median test.
In Table 9, we showed that QR50 has the smallest MAPE compared with the other five models (QR25, QR75, MARS, SVM, and NNet) in terms of ROI. In Table 10, QR50 has the smallest median and IQR of the absolute percentage errors of ROI among the six forecasting models. Therefore, in terms of out-of-sample forecasting for ROI, we can conclude that the QR50 model is superior to the QR25, QR75, MARS, SVM, and NNet models, even though the differences between QR50 and QR25 and between QR50 and SVM are not statistically significant at the 5% significance level.

5. Conclusions

We applied modern statistical methods to Hollywood movie data. Rather than using all variables in our data, we used only the important predictive variables for ROI, chosen by the Bayesian variable selection method. With this approach, we can avoid not only possible measurement error in the Hollywood dataset but also unnecessary statistical conditions such as multicollinearity and independence among the explanatory variables for ROI. Our results showed that the neural network model for ROI is overall superior to the other well-known machine learning methods in terms of RMSE for in-sample forecasting, and that the median quantile regression model for ROI is overall superior to the well-known machine learning methods in terms of MAPE for out-of-sample forecasting. For future research, we will apply quantile regression and machine learning methods to Hollywood movie keyword count data generated by text mining to examine the relationship between movie title keywords and ROI.

Author Contributions

Conceptualization, I.K., S.L., and J.-M.K.; methodology, J.-M.K., S.L.; software, L.X. and J.-M.K.; validation, S.L. and J.-M.K.; formal analysis, L.X., S.L. and J.-M.K.; investigation, I.K. and K.-H.L.; resources, I.K. and J.-M.K.; data curation, I.K., L.X. and S.L.; writing—original draft preparation, L.X., S.L., I.K., K.-H.L., and J.-M.K.; writing—review and editing, I.K., J.-M.K., and K.-H.L.; visualization, L.X.; supervision, S.L., and I.K.; project administration, K.-H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

We are thankful to two anonymous referees for their meaningful comments and constructive suggestions that have improved the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Çağlıyor, Sandy, Başar Öztayşi, and Selime Sezgin. 2019. Forecasting Box Office Performances Using Machine Learning Algorithms. In International Conference on Intelligent and Fuzzy Systems. Cham: Springer, pp. 257–64. Available online: https://link.springer.com/chapter/10.1007/978-3-030-23756-1_32 (accessed on 30 April 2020).
  2. Davino, Cristina, Marilena Furno, and Domenico Vistocco. 2014. Quantile Regression: Theory and Applications. Hoboken: Wiley.
  3. Du, Jingfei, Hua Xu, and Xiaoqiu Huang. 2014. Box office prediction based on microblog. Expert Systems with Applications 41: 1680–89.
  4. Eliashberg, Jehoshua, Quintus Hegie, Jason Ho, Dennis Huisman, Steven J. Miller, Sanjeev Swami, Charles B. Weinberg, and Berend Wierenga. 2009. Demand-driven scheduling of movies in a multiplex. International Journal of Research in Marketing 26: 75–88.
  5. Friedman, Jerome H. 1991. Multivariate Adaptive Regression Splines. The Annals of Statistics 19: 1–67. Available online: https://projecteuclid.org/euclid.aos/1176347963 (accessed on 30 April 2020).
  6. Galvão, Marta, and Roberto Henriques. 2018. Forecasting Movie Box Office Profitability. Journal of Information Systems Engineering & Management 3: 22.
  7. Garcia-Donato, Gonzalo, and Anabel Forte. 2018. Bayesian Testing, Variable Selection and Model Averaging in Linear Models using R with BayesVarSel. The R Journal 10: 155–74.
  8. Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York: Springer. Available online: https://link.springer.com/book/10.1007/978-0-387-84858-7 (accessed on 30 April 2020).
  9. Ho, Jason Y. C., Robert E. Krider, and Jennifer Chang. 2017. Mere newness: Decline of movie preference over time. Canadian Journal of Administrative Science 34: 33–46.
  10. Holesh, Michael Thomas. 2019. Forecasting Consumer Preference of Film Genre. Capstone Project. Durham: Duke University. Available online: https://hdl.handle.net/10161/18944 (accessed on 30 April 2020).
  11. Hur, Minhoe, Pilsung Kang, and Sungzoon Cho. 2016. Box-office forecasting based on sentiments of movie reviews and Independent subspace method. Information Sciences 372: 608–24.
  12. Kaur, Arundeep, and A. P. Nidhi. 2013. Predicting Movie Success Using Neural Network. International Journal of Science and Research 2: 69–71. Available online: https://pdfs.semanticscholar.org/540f/933f3e5acbcd6874ccf38d513d5f04536b42.pdf (accessed on 30 April 2020).
  13. Kim, Taegu, Jungsik Hong, and Pilsung Kang. 2015. Box office forecasting using machine learning algorithms based on SNS data. International Journal of Forecasting 31: 364–90.
  14. Kim, Taegu, Jungsik Hong, and Pilsung Kang. 2017. Box Office Forecasting considering Competitive Environment and Word-of-Mouth in Social Networks: A Case Study of Korean Film Market. Computational Intelligence and Neuroscience 2017: 4315419.
  15. Lash, Michael T., and Kang Zhao. 2016. Early Predictions of Movie Success: The Who, What, and When of Profitability. Journal of Management Information Systems 33: 874–903.
  16. Lee, Kyung Jae, and Woojin Chang. 2009. Bayesian Belief Network for Box Office Performance: A Case Study of Korean Movies. Expert Systems with Applications 36: 280–91.
  17. Lee, Kyuhan, Jinsoo Park, Iljoo Kim, and Youngseok Choi. 2018. Predicting movie success with machine learning techniques: Ways to improve accuracy. Information Systems Frontiers 20: 577–88.
  18. Legoux, Renaud, Denis Larocque, Sandra Laporte, Soraya Belmati, and Thomas Boquet. 2016. The effect of critical reviews on exhibitors’ decisions: Do reviews affect the survival of a movie on screen? International Journal of Research in Marketing 33: 357–74.
  19. Liu, Yan, and Tian Xie. 2019. Machine learning versus econometrics: Prediction of box office. Applied Economics Letters 26: 124–30.
  20. Lu, Wei. 2019. Research on Movie Box Office Prediction Model with AHP Method. Paper presented at the 2019 2nd International Conference on Information Management and Management Sciences, Chengdu, China, August 23–25; pp. 177–81.
  21. Oh, Chong, Yaman Roumani, Joseph K. Nwankpa, and Han-Fen Hu. 2017. Beyond likes and tweets: Consumer engagement behavior and movie box office in social media. Information & Management 54: 25–37.
  22. Quader, Nahid, Md Osman Gani, Dipankar Chaki, and Md Haider Ali. 2017. A machine learning approach to predict movie box-office success. Paper presented at the 2017 20th International Conference of Computer and Information Technology (ICCIT), Dhaka, Bangladesh, December 22–24; pp. 1–7.
  23. Sharda, Ramesh, and Dursun Delen. 2006. Predicting box-office success of motion pictures with neural networks. Expert Systems with Applications 30: 243–54.
  24. Smola, Alex J., and Bernhard Schölkopf. 2004. A Tutorial on Support Vector Regression. Statistics and Computing 14: 199–222. Available online: https://link.springer.com/article/10.1023/B:STCO.0000035301.49549.88 (accessed on 30 April 2020).
  25. Vapnik, Vladimir, and Alexander Lerner. 1963. Pattern recognition using generalized portrait method. Automation and Remote Control 24: 774–80.
  26. Zhang, Wenbin, and Steven Skiena. 2009. Improving Movie Gross Prediction through News Analysis. Paper presented at the 2009 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Milan, Italy, September 15–18; vol. 1, pp. 301–4. Available online: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.225.154&rep=rep1&type=pdf (accessed on 30 April 2020).
  27. Zhang, Xin-Jie, Yong Tang, Jason Xiong, Wei-Jia Wang, and Yi-Cheng Zhang. 2019. How Network Topologies Impact Project Alliance Performance: Evidence from the Movie Industry. Entropy 21: 859.
Figure 1. Diagram of Data Analysis.
Figure 2. Histogram of ROI during Years 2010–2015. Unit of Horizontal Axis is Percent (%) and Vertical Axis is Frequency.
Figure 3. Bayesian variable selection of ROI with Movie Data (2010–2015).
Figure 4. Bayesian variable selection of ROI with Movie Data (2010–2014).
Figure 5. Boxplots of the Absolute Percentage Errors of ROI for Each Model for Year 2015 with 2010–2014 Years data (out of sample forecast).
Table 1. Summary Statistics during Years 2010–2015.

| Variable | # Obs. | Mean | Std. Dev. | 25th Perc. | 75th Perc. | IQR |
| --- | --- | --- | --- | --- | --- | --- |
| Audiences | 723 | 7,606,088 | 9,950,034 | 791,486 | 10,275,941 | 9,484,454 |
| Box Office | 723 | 61,318,760 | 80,950,277 | 6,323,297 | 82,806,144 | 76,482,847 |
| Budget | 723 | 47,548,911 | 51,981,049 | 11,000,000 | 64,500,000 | 53,500,000 |
| Metascore | 723 | 51.61 | 16.77 | 39.00 | 63.00 | 24.00 |
| Theaters | 723 | 2252.72 | 1368.85 | 781.00 | 3284.50 | 2503.50 |

Table 2. Rating Information during Years 2010–2015.

| Rating | G | PG | PG-13 | R | NC-17 |
| --- | --- | --- | --- | --- | --- |
| Frequency | 9 | 101 | 282 | 330 | 1 |
| Proportion | 0.0124 | 0.1397 | 0.39 | 0.4564 | 0.0014 |

Note: G = General Audiences, PG = Parental Guidance Suggested, PG-13 = Parents Strongly Cautioned, R = Restricted (under 17 requires accompanying parent or adult guardian), and NC-17 = Adults Only.

Table 3. Summary Statistics for ROI during Years 2010–2015. Unit is Percent (%).

| ROI | Statistics |
| --- | --- |
| Sample Size (n) | 719 |
| Mean | 755.8263 |
| Std. Dev. | 1559.7506 |
| Variance | 2,432,822.029 |
| Range | 1559.7506 |
| 25th percentile | −65.5669 |
| 75th percentile | 105.1934 |
| IQR | 170.7602 |

Table 4. Bayesian variable selection method for ROI during Years 2010–2015.

| Variable | Inclusion Probability | HPM | MPM |
| --- | --- | --- | --- |
| Audiences | 0.6733 | * | * |
| Metascore | 0.0646 |  |  |
| Theaters | 0.5878 | * | * |
| Weeks | 0.0451 |  |  |
| Action | 0.0752 |  |  |
| Adventure | 0.1473 |  |  |
| Animation | 0.0471 |  |  |
| Children | 0.0421 |  |  |
| Comedy | 0.0356 |  |  |
| Crime | 0.0347 |  |  |
| Documentary | 0.0398 |  |  |
| Drama | 0.0521 |  |  |
| Fantasy | 0.0647 |  |  |
| FilmNoir | 0.0386 |  |  |
| Horror | 0.5474 | * | * |
| Musical | 0.0381 |  |  |
| Mystery | 0.0454 |  |  |
| Romance | 0.0421 |  |  |
| SciFi | 0.0370 |  |  |
| Thriller | 0.0514 |  |  |
| War | 0.0407 |  |  |
| Western | 0.0394 |  |  |

Note: HPM stands for Highest posterior Probability Model and MPM for Median Probability Model. * means statistically significant at the 5% significance level.

Table 5. Quantile Regression of ROI with 2010–2015 Data.

| Quantile | Term | Estimate | Standard Error | t-Value |
| --- | --- | --- | --- | --- |
| 25th Percentile | (Intercept) | −98.69501 | 5.56383 | −17.73869 |
| 25th Percentile | Audiences | 0.00000 | 0.00000 | 6.43183 |
| 25th Percentile | Theaters | 0.01209 | 0.00293 | 4.12331 |
| 25th Percentile | Horror | 0.76201 | 14.30223 | 0.05328 |
| Median | (Intercept) | −83.03021 | 6.62744 | −12.52824 |
| Median | Audiences | 0.00000 | 0.00000 | 6.79277 |
| Median | Theaters | 0.01699 | 0.00357 | 4.76616 |
| Median | Horror | 16.51633 | 17.49050 | 0.94430 |
| 75th Percentile | (Intercept) | −1.06583 | 16.67615 | −0.06391 |
| 75th Percentile | Audiences | 0.00001 | 0.00000 | 3.72178 |
| 75th Percentile | Theaters | 0.00508 | 0.01016 | 0.49990 |
| 75th Percentile | Horror | 363.79949 | 109.55334 | 3.32075 |

Table 6. RMSE of ROI with 2010–2015 Data (in-sample forecast).

| Model | QR25 | QR50 | QR75 | MARS | SVM | NNet |
| --- | --- | --- | --- | --- | --- | --- |
| RMSE | 1581.893 | 1576.173 | 1555.238 | 1420.817 | 1308.136 | 1179.089 |

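The RMSE values in Table 6 summarize in-sample fit. As a minimal sketch, assuming RMSE here is the usual root mean squared error between observed and fitted ROI (the exact computation is not shown in this section), it could be computed as below. The function name `rmse` and the sample numbers are hypothetical illustrations, not data from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical ROI values in percent (not taken from the study's data).
observed_roi = [120.0, -40.0, 300.0, 15.0]
fitted_roi = [100.0, -10.0, 250.0, 35.0]

print(f"RMSE = {rmse(observed_roi, fitted_roi):.3f}")
```

Because the errors are squared before averaging, a few films with extreme ROI can dominate the RMSE, which is worth keeping in mind when comparing the models in Table 6.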
Table 7. Bayesian variable selection of ROI with Movie Data (2010–2014).

| Variable | Inclusion Probability | HPM | MPM |
| --- | --- | --- | --- |
| Audiences | 0.9985 | * | * |
| Metascore | 0.0816 |  |  |
| Theaters | 0.9970 | * | * |
| Weeks | 0.0400 |  |  |
| Action | 0.0424 |  |  |
| Adventure | 0.2340 |  |  |
| Animation | 0.0491 |  |  |
| Children | 0.0413 |  |  |
| Comedy | 0.0496 |  |  |
| Crime | 0.0450 |  |  |
| Documentary | 0.0429 |  |  |
| Drama | 0.0611 |  |  |
| Fantasy | 0.0802 |  |  |
| FilmNoir | 0.0346 |  |  |
| Horror | 0.0612 |  |  |
| Musical | 0.0428 |  |  |
| Mystery | 0.0506 |  |  |
| Romance | 0.0471 |  |  |
| SciFi | 0.0414 |  |  |
| Thriller | 0.0386 |  |  |
| War | 0.0498 |  |  |
| Western | 0.0400 |  |  |

Note: HPM stands for Highest posterior Probability Model and MPM for Median Probability Model. * means statistically significant at the 5% significance level.

Table 8. Quantile Regression of ROI with 2010–2014 Data.

| Quantile | Term | Estimate | Standard Error | t-Value |
| --- | --- | --- | --- | --- |
| 25th Percentile | (Intercept) | −98.59969 | 5.38054 | −18.34761 |
| 25th Percentile | Audiences | 0.00000 | 0.00000 | 5.73186 |
| 25th Percentile | Theaters | 0.01221 | 0.00301 | 4.06243 |
| Median | (Intercept) | −79.05235 | 6.38368 | −12.55561 |
| Median | Audiences | 0.00000 | 0.00000 | 7.12540 |
| Median | Theaters | 0.01571 | 0.00353 | 4.54326 |
| 75th Percentile | (Intercept) | 17.87768 | 21.16789 | 0.51878 |
| 75th Percentile | Audiences | 0.00001 | 0.00000 | 3.98656 |
| 75th Percentile | Theaters | −0.00484 | 0.01133 | 0.04941 |

Table 9. MAPE of ROI for Year 2015 with 2010–2014 Data (out-of-sample forecast).

| Model | QR25 | QR50 | QR75 | MARS | SVM | NNet |
| --- | --- | --- | --- | --- | --- | --- |
| MAPE | 2.866375 | 1.792536 | 6.611668 | 12.22441 | 3.507784 | 5.298819 |

Table 10. Summary Statistics of Absolute Percentage Errors of ROI for Each Model for Year 2015 with 2010–2014 Data (out-of-sample forecast).

| Statistic | QR25 | QR50 | QR75 | MARS | SVM | NNet |
| --- | --- | --- | --- | --- | --- | --- |
| Minimum | 0.0130 | 0.0588 | 0.0591 | 0.0704 | 0.0866 | 0.0559 |
| 1st Quartile | 0.4408 | 0.5330 | 0.7210 | 0.9912 | 0.5008 | 0.5131 |
| Median | 1.0080 | 0.8826 | 1.2839 | 2.1983 | 0.9808 | 1.1768 |
| Mean | 1.9197 | 1.5228 | 4.5124 | 7.0172 | 3.5078 | 5.2988 |
| 3rd Quartile | 1.2994 | 1.1014 | 3.7177 | 4.9271 | 2.0645 | 2.5618 |
| Maximum | 39.754 | 24.175 | 88.996 | 102.031 | 57.461 | 121.833 |
| IQR | 0.8586 | 0.5685 | 2.9967 | 3.9359 | 1.5638 | 2.0487 |

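The summary statistics above describe per-film absolute percentage errors, and their mean is the MAPE. A minimal sketch of that calculation follows, assuming the conventional definition |actual − predicted| / |actual| expressed as a ratio; this section does not state the scaling or the quartile method used, so both are assumptions here. The function names and the sample values are hypothetical.

```python
import statistics


def absolute_percentage_errors(actual, predicted):
    """Return |actual - predicted| / |actual| for each pair, as ratios.

    Assumes no actual value is zero (the ratio is undefined there).
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]


def summarize(errors):
    """Summary statistics of the errors; the 'mean' entry is the MAPE."""
    # method="inclusive" interpolates between observed values (assumed choice).
    q1, median, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    return {
        "min": min(errors),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(errors),
        "q3": q3,
        "max": max(errors),
        "iqr": q3 - q1,
    }


# Hypothetical observed and forecast ROI values in percent (not study data).
observed_roi = [100.0, -50.0, 200.0, 40.0, 10.0]
forecast_roi = [80.0, -25.0, 0.0, 100.0, 50.0]

errors = absolute_percentage_errors(observed_roi, forecast_roi)
summary = summarize(errors)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Dividing by the observed ROI means that films with ROI close to zero produce very large errors, which is one plausible reason the maxima in Table 10 sit far above the medians.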
Table 11. Statistical tests of the differences between QR50 and each other model in the absolute percentage errors of ROI.

| Model (compared with QR50) | Wilcoxon Rank Sum Test (p-Value) | Median Test (p-Value) |
| --- | --- | --- |
| QR25 | 0.4703 | 0.2772 |
| QR75 | 0.0032 | 0.0294 |
| MARS | 0.0000004 | 0.00001 |
| SVM | 0.2472 | 0.7174 |
| NNet | 0.0426 | 0.0294 |

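Table 11 compares the error distributions of two models with a Wilcoxon rank sum test. The sketch below shows one common form of that test, using the normal approximation with average ranks for ties and no continuity correction. This is an assumption about the variant: the software used for the article may apply an exact method or a continuity correction, so its p-values can differ. The function name `rank_sum_test` and the sample errors are hypothetical.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test via the normal approximation.

    Returns (w, p_value), where w is the sum of the ranks of x in the
    pooled sample. Tied values receive average ranks and the variance
    is tie-corrected; no continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    w = 0.0          # rank sum of sample x
    tie_term = 0.0   # sum of t^3 - t over groups of tied values
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                    # size of this tie group
        avg_rank = (i + 1 + j) / 2   # average of ranks i+1 .. j
        w += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    mean_w = n1 * (n + 1) / 2
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w == 0:
        return w, 1.0  # all pooled values identical: no evidence of a shift
    z = (w - mean_w) / math.sqrt(var_w)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return w, p_value


# Hypothetical absolute percentage errors for two models (not study data).
errors_model_a = [0.4, 0.6, 0.9, 1.1, 1.3, 0.7]
errors_model_b = [0.8, 1.5, 2.2, 3.9, 4.6, 1.2]

w, p = rank_sum_test(errors_model_a, errors_model_b)
print(f"W = {w}, p-value = {p:.4f}")
```

A rank-based test is a reasonable fit for this kind of data because absolute percentage errors are heavily right-skewed, and ranks are not affected by how extreme the largest errors are.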

Share and Cite

MDPI and ACS Style

Kim, J.-M.; Xia, L.; Kim, I.; Lee, S.; Lee, K.-H. Finding Nemo: Predicting Movie Performances by Machine Learning Methods. J. Risk Financial Manag. 2020, 13, 93. https://doi.org/10.3390/jrfm13050093
