1. Introduction
OpenAI succeeded in making artificial intelligence (AI) accessible to the world (n.b., only to the population with access to the internet) and has demonstrated how generative AI (Gen-AI), a subset of deep learning, can transform our lives [1]. As a result, since its launch in November 2022, the natural language model Chat Generative Pre-trained Transformer (ChatGPT) has continued to disrupt industries across the globe [2], with other models such as Microsoft Copilot and Google's Gemini emerging since.
Given its popularity as the first to market, researchers have already delved into many aspects of ChatGPT, from its potential impact on education [3] and research [4] to applications in fields ranging from marketing [5] to forensics [6], to name a few. However, the rapid adoption of Gen-AI is also highlighting its many shortcomings, which range from hallucinations [7] to bias and ethical issues [8] and its negative environmental impact [9,10]. Furthermore, concerns about AI making certain job functions obsolete are rapidly emerging [11]. It is therefore crucial to promote the use of AI for intelligence augmentation, i.e., enhancing human intelligence and improving the efficiency of human tasks rather than replacing them [12]. In this regard, recent experimental evidence points towards an opportunity for using Gen-AI to reduce productivity inequalities [13].
A few months after ChatGPT was launched, Hassani and Silva [14] discussed the potential impact of Gen-AI on data science and the related intelligence augmentation. Building on that work, we focus our attention here on "forecasting", a common data science task that supports capacity planning, goal setting, and anomaly detection [15]. Today, Gen-AI tools enable non-experts to generate forecasts and use them in their decision-making processes. Nvidia's CEO Jensen Huang recently predicted the death of coding in a world where "the programming language is human, [and] everybody in the world is now a programmer" [16].
In a world where humans can now generate forecasts without an in-depth knowledge of forecasting theory, practice, or coding, we are motivated to determine whether there is a need to rethink forecasting practice with respect to the benchmarks used to evaluate forecasting models. Benchmark forecasts are expected to be reasonably accurate yet simple to generate with minimal computational effort. Benchmarking is an important aspect of forecasting practice, as investment in a new forecasting model should only be entertained if there is sufficient evidence that the proposed model significantly outperforms popular benchmarks. As outlined in [17], when proposing a new forecasting model or undertaking forecast evaluations for univariate time series, it is important to consider the naïve, seasonal naïve, or ARIMA model as a benchmark for comparing forecast accuracy. The random walk (i.e., the naïve family of models) is known to be a tough benchmark to outperform [18]. Exponential smoothing, Holt–Winters, and Theta forecasts are also identified as benchmark methods in one of the most comprehensive reviews of forecasting theory and practice [18].
Recent research demonstrates the strong performance of AI models across various computational tasks, building on advances in deep learning, scalability, and efficiency [19,20]. As discussed and evidenced below, these computational tasks now include forecasting from historical data. Given that large language models can generate forecasts in response to prompts, this study is grounded in the following research question:
RQ: Should forecasts from Gen-AI models (for example, ChatGPT or Microsoft Copilot) be considered a new benchmark in forecasting practice?
To the best of our knowledge, there exists no published academic work that proposes or evaluates forecasts from Gen-AI models as a benchmark or contender in the field of forecasting. In contrast, machine learning models have been applied and compared with statistical models for time series forecasting [21], whilst deep learning models have also received much attention in the recent past [22]. Some studies propose hybrid forecasting models that combine machine learning, decomposition techniques, and statistical models and compare their performance against benchmarks such as ARIMA [23]. It is evident, therefore, that studies seeking to introduce benchmarks via comparative analyses of models are important. For example, in relation to machine learning, Gu et al. [24] introduced a new set of benchmarks for the predictive accuracy of machine learning methods via a comparative analysis, whilst Zhou et al. [25] presented a comparison of deep learning models for equity premium forecasting. Gen-AI models, given their reliance on deep learning, can extract and transform features from data and identify hidden nonlinear relations without relying on econometric assumptions or human expertise [25].
Therefore, our interest lies in comparing forecasts from Gen-AI models against forecasts generated by established, traditional benchmark models to determine whether there is sufficient evidence to promote a new benchmark for forecasting practice in the age of Gen-AI. In this paper, we initially consider ChatGPT as an example of a Gen-AI tool and use it to forecast three example time series: the U.S. accidental deaths series [26,27,28], the air passengers series [29], and UK tourist arrivals [30,31]. The forecasts from ChatGPT are compared with those of seven forecasting models, representing both parametric and non-parametric techniques, all generated via the forecast package in R [32]. These are the seasonal naïve (SNAIVE); Holt–Winters (HW); autoregressive integrated moving average (ARIMA); exponential smoothing (ETS); trigonometric seasonality, Box–Cox transformation, ARMA errors, trend and seasonal components (TBATS); seasonal–trend decomposition using LOESS (STL); and Theta methods. SNAIVE, ARIMA, ETS, Theta, and HW are identified as benchmark forecasting models in [17,18], whilst the rest share the properties of being automated, simple, and applicable with minimal computational effort and without an in-depth understanding of forecasting theory. However, unlike Gen-AI models, applying these benchmarks requires some basic coding skills and familiarity with the programming language R.
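For concreteness, a minimal sketch of how these seven forecasts can be generated with the forecast package follows; the built-in USAccDeaths series and the 12-month horizon are illustrative assumptions, and the exact code used in our evaluation is given in Section 2.

library(forecast)

y <- USAccDeaths   # illustrative: R's built-in monthly U.S. accidental deaths series
h <- 12            # illustrative one-year-ahead horizon

f_snaive <- snaive(y, h = h)                # seasonal naive
f_hw     <- hw(y, h = h)                    # Holt-Winters
f_arima  <- forecast(auto.arima(y), h = h)  # ARIMA with automatic order selection
f_ets    <- forecast(ets(y), h = h)         # exponential smoothing state space
f_tbats  <- forecast(tbats(y), h = h)       # TBATS
f_stl    <- stlf(y, h = h)                  # STL decomposition forecast
f_theta  <- thetaf(y, h = h)                # Theta method

Each call above runs with automated default settings, underscoring that these benchmarks demand little beyond basic R usage.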
Through the empirical analysis, we find that in some cases, forecasts from Gen-AI models can significantly outperform forecasts from popular benchmarks. We therefore find evidence for promoting the use of Gen-AI models as benchmarks in future forecast evaluations. However, our findings also indicate that the accuracy of these forecasts can vary depending on the underlying data structures, the user's level of forecasting knowledge and education (which invariably influences the quality of prompt engineering), and the training data underlying the Gen-AI model (e.g., paid vs. free versions). Reliability issues are also prevalent, and Gen-AI models are black boxes, which restricts the interpretability of the models being used.
Through our research, we make several contributions to forecasting practice and the literature. First, we present the most comprehensive evaluation of forecasts from Gen-AI models to date, comparing them against seven traditional benchmark methods. Second, based on our findings, we propose the use of Gen-AI models as benchmark forecasting models for forecast evaluations, thereby adding to the list of historical benchmark forecasting models in [18], which tend to require basic programming and coding skills. Third, our research also seeks to educate and improve the basic forecasting capabilities of the public by sharing the code used to generate the competing forecasts via the forecast package in R. Finally, through the discussion, we seek to improve the public's understanding of, and capability of engaging with, Gen-AI models for forecasting by sharing the prompts used on Microsoft Copilot that resulted in a forecast for one of the datasets.
The remainder of this paper is organized as follows: Section 2 briefly introduces the forecasting models along with the code used to generate the forecasts, Section 3 presents the forecasting results and analysis, Section 4 provides a discussion, and Section 5 concludes the paper.