1. Introduction
Statistical computing has developed rapidly in recent years. With the rise of data science as an academic field, the advancement of the “open” movement (encompassing source code, science, replication and reproducibility, technology, and development), and the increasingly distributed nature of both collaboration and computing, statistical computing finds itself in a brave new world.
The aim of this paper is twofold. First, we want to underscore the evolution of statistical computing from the perspective of development. How people develop software, how they work together, and how statistical computing is pushed forward as a field are all questions shaped by recent developments. The first section, then, is focused on the “how” of the development of statistical computing.
Second, we are interested in the question of “what” as it relates to advances in statistical computing. More practically, given that we have a sense of how people develop and deepen statistical computing and related techniques, the second section pivots to address precisely what is being developed in this modern landscape. In the second section, we cover broad realms of techniques, technologies, and applications relating to modern advancements in statistical computing.
In sum, we believe the current moment of statistical computing and all of its associated developments open a door to an exciting world of open, democratized development and work that has never before existed at such a large scale. To maximize this incredible potential, we must embrace these advances and adopt these new ways of developing into our workflows, team formations, and collaborative efforts.
2. The “How”: New Ways of Working
In the past few years, the process of developing statistical computing tools and techniques has drastically changed. This can be seen in many realms, including team-based development, decentralized collaborations, and, more broadly, the open science movement as it relates to data science as a maturing field in its own right. As such, the “how” of the development of statistical computing is multifaceted and constantly in flux. Only when we understand and adopt these new developments will we be able to maximize the great potential of modern applications and advancements of statistical computing.
Before continuing, it is important to begin this first section with a caveat. Though much of what follows in this paper, outlining our thoughts on the current state of the field, addresses concepts and topics relating to many adjacent fields (e.g., software engineering), we are focused primarily on statistical computing, broadly defined as writing programs of all shapes and sizes to solve statistical tasks. While the perspectives and points we raise throughout can and often do apply widely to fields beyond statistical computing, not all of them will. As a result, we limit the assumptions and implications of what we discuss to the world of statistical computing, and leave it to the reader to expand and adopt them elsewhere as they would like.
2.1. Open Development
The last decades have seen a shift from proprietary software created by for-profit companies to free software. Under the former model, mathematical advancements and statistical decisions used to be made fairly independently of code implementation, creating a relatively well-defined boundary between statisticians (belonging to the field of mathematics) and the programmers (belonging to computer science) tasked with software implementation. Nowadays, the boundaries are much more blurred, with code implementation becoming an integral part of statistical computing. This change from proprietary to free and open-source software brought strong benefits, such as financial savings for institutions and individuals. However, it raised the question of a sustainable development model: how can software developers be incentivized and rewarded when their product is free?
Interestingly, concomitant technological developments were able to provide a solution. The rise of open-access online software development platforms, such as GitHub and SourceForge, enabled developers to post their code or software publicly and let other users re-use and contribute to it. In such a setting, benefits are widely distributed. Developers can share and collaborate on code and projects in a way previously unimaginable. Hosts of repositories or pieces of software, who implicitly extend an offer to contribute by hosting the software openly, benefit from the pooled resources of experts and niche specialists who can contribute to aspects of the project. The result is a collaborative piece of software that has received the attention (and critical assessment) of numerous and diverse domain experts. The second-order benefit of this arrangement is that this development is accomplished in a free, transparent way. The contributor is rewarded with a proof of participation and a demonstration of skills. The host, in turn, is rewarded with a better result than would have been possible without this level of access and collaboration. Moreover, the development of crowdfunding/sponsorship initiatives allows monetary forms of contribution, opening, in principle, the door to full open-development careers. While this model naturally carries some risk of fueling precarious freelance developer positions, it is nonetheless a disruptive business model for professional software development whose full impact is yet to be seen.
Beyond the benefits to developers and software hosts/project leaders, open development contributes more broadly to science in the form of greater reproducibility and transparency [1]. This wave, which has taken on a life of its own in the form of “open science”, is addressed further in the following section. However, at present, it is useful to point out the link between the open and widespread collaboration that is native to open development, and the benefits flowing to all involved and beyond to science as a whole. This recent wave of open development, then, can be thought of as a tangible expression of an advancement that touches many fields, from software development, statistics, and data science, to medicine, engineering, and the social sciences.
2.2. Open Science
As elaborated in the previous section, recent years have witnessed the formation and expansion of the open science movement in virtually every corner of science, research, and development [2]. We can characterize the open science movement as a dedication to openly and ethically designing research studies, carrying them out accordingly, and making code and replication data freely available. Implicit is the desire to democratize the research enterprise, where any interested scholar is encouraged to test, critique, and even challenge the merits, claims, and inferences of a study. This brand of scientific advancement has its roots in the earliest days of scientific research, with the publishing of the first scientific journals as far back as the 17th century during the Scientific Revolution [3]. In the modern conception of open science, which builds on these earliest practices of sharing scientific information and research findings, there is a need to lay bare all aspects of a study, from the design, materials, and methods to the data and code. In so doing, a network of like-minded scholars building more directly on each other’s work takes shape [4].
Like open development, there are many benefits that emerge from open science. Most notably, open science represents a move away from closed and often isolated research practices, where findings and processes are closely guarded secrets. By shifting toward an open scientific approach, more widespread sharing of ideas is possible, which benefits the careers and reputations of the researchers who advance the ideas in the first place [5].
Beyond the career benefits of open science, the move away from closed science to open science represents a shift in how scientific ideas are shared, and as a result how the contours of the modern scientific landscape are evolving. When ideas and findings are more widely shared, the opportunities for more and diverse voices to enter the conversation are concurrently widened.
Importantly, though, with every development comes the potential for negative effects and downsides. Open science is not immune to this. For example, efforts to move toward open science have at times resulted in reinforced inequality, especially in STEM professions [6]. Further, through a process of “platform capitalism”, some have suggested that the flaws in the scientific process that open science seeks to remedy are instead “re-engineered”, or simply shifted and reinforced [7]. As a result, this line of logic suggests that open science simply covers the existing flaws without changing or fixing anything. Still, while biases, divisions, and inequalities have existed and do exist in the realm of open science, the broader push to make research processes more transparent and democratized is still, at its core, a very beneficial shift in how scientific research is accomplished.
2.3. Open Access
Closely related to open science is the open data sub-movement; that is, making the material from projects openly and freely available to the public. One of the prime values of openly sharing data is transparency and replication. There is an increased demand for and expectation of making study data available, which allows results to be replicated and high standards of research to be maintained. In fact, many journals are beginning to require data to be made publicly available if a paper is accepted for publication. Common outlets for open data storage and hosting include the Open Science Framework [8] and the Harvard Dataverse.
Beyond data warehousing, open access and its impacts on open science can be seen very practically in the launching of many new journals, such as the Journal of Open Source Software (JOSS), the R Journal, or SoftwareX. These journals are characterized by a renewed approach to traditional publishing, including ease of submission, transparency of the reviewing process, and accessibility.
JOSS is an example worth highlighting, as it acts as a template for all of the themes addressed thus far in our paper on openness and collaboration. JOSS fully leverages the features of GitHub, using it as the platform where storage, submission, reviewing, and publishing take place, reducing its maintenance costs and successfully enabling a diamond open access publishing model, with no cost to the author or the reader. Further, paper and software reviewers are welcomed in the same spirit as collaborators on a piece of software hosted on GitHub. This review and publication cycle is an excellent illustration of how multiple facets of open science integrate symbiotically, from open development to open access publishing. Of note, traditional and longstanding journals are also embracing openness by offering open access publication of articles, albeit often at a large cost to the researcher. While the ripples created by the open science wave are significant and notable, finding reasonable, widely agreed-upon, and fair solutions to old and new problems is still, to follow programmers’ vernacular, a WIP (work in progress). Nonetheless, the followers of the open movement(s) seem well equipped and eager to take these challenges on. Continual advances in this realm are expected, and positive outcomes can realistically be hoped for.
In conclusion, statistical computing’s future seems likely to be linked with broader ideological movements. Naturally, the most salient is open science, driven by an implicit demand for transparency and democracy that also manifests across other fields, notably politics, economics, and other social science subfields. That being said, we also expect other currents and issues to further shape the development of statistical computing; for instance, those of “slow science”, environmentalism, and social justice. It would not be surprising to witness the emergence of formalized trends, such as “slow computing” (influenced by economic ideas of “degrowth” and a focus on individual wellbeing), “green computing” (defined by sustainability and eco-friendliness), “inclusive computing” (with an emphasis and focus on social justice), and a deepening of “affective computing” and “social computing” (both with an emphasis on the impact to and role of the individual in computational endeavors). For example, the latter is increasingly becoming formalized with the advent of the new IEEE open journal, the Journal of Social Computing. As a result, as so often occurs, we expect technological innovations to fuse with new and reenergized mindsets to affect the “how” of statistical computing as much as the “what”, discussed in the following section.
3. The “What”: New Techniques and Approaches
Parallel to the wave of open science, another revolution directly related to statistical computing has taken the world by storm: data science. The field of data science, which has roots in multiple subjects, is now developing into a mature standalone discipline. This can be seen through the establishment of new journals, schools, and degree programs at all levels, from bachelor’s to doctoral. Further, many research institutes dedicated to advancing this burgeoning field are appearing, at times within a particular discipline (e.g., the Harvard Ophthalmology Clinical Data Science Institute), at times generically in service of data science as its own field (e.g., the New York University Center for Data Science), or even in the context of new aspects of the field, such as justice and ethics (e.g., the University of Virginia’s Center for Data Ethics and Justice).
Data science and open science, then, have exerted substantial influence on statistical computing through the very process of developing computational techniques. That is, performing data science requires statistics at virtually every step, encompassing both the development and application of statistical methods and the computing needed to implement the techniques and tools that serve a project’s ends. As techniques and tools are developed, they are increasingly developed in an open way, both to encourage wider engagement with the tools and to encourage wider contributions from the broader “open” community. This can be seen most clearly in collaborative software development, as previously discussed.
With the advancement and development of data science and the open science movement, statistical computing is simultaneously reaping the benefits of these wider communities and advancing at a fast rate and in new, larger-scaled ways. While we briefly mention some of the new areas in the following sections, this list is by no means exhaustive, and many exciting innovations and development paths are taking place in parallel.
3.1. Artificial Intelligence
Leveraging the ever-increasing amount and availability of data produced and recorded through online interactions, artificial intelligence (AI), and more specifically machine learning (ML), are areas where enormous advances have been made. Applications are wonderfully diverse, ranging from task-specific applications (e.g., [9,10,11,12]) to larger-scale ecosystems covering every part of a data modeling pipeline, from making sense of messy data to building predictive models, all within a unified software interface such as H2O [13,14,15], scikit-learn [16], or tidymodels [17]. Despite the ease of use of these new technologies, the latter trend underscores a current point of tension in statistical computing: the field is split between polyvalent, easy-to-use, fast-to-build tools and languages on the one hand, and low-level languages or dedicated, sometimes model-class-specific, ecosystems used for production or for particular applications on the other.
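To give a flavor of the former (the polyvalent, easy-to-use side), the following R sketch assembles a small regression pipeline with tidymodels, mentioned above; the dataset, model, and preprocessing steps are arbitrary choices made purely for illustration:

```r
# Illustrative tidymodels pipeline: data splitting, preprocessing, model
# specification, fitting, and prediction through one unified interface.
library(tidymodels)

set.seed(123)
split <- initial_split(mtcars, prop = 0.8)   # train/test split
train <- training(split)
test  <- testing(split)

rec <- recipe(mpg ~ ., data = train) %>%     # preprocessing recipe
  step_normalize(all_predictors())

mod <- linear_reg() %>%                      # model specification
  set_engine("lm")

wflow <- workflow() %>%                      # bundle recipe + model
  add_recipe(rec) %>%
  add_model(mod)

fitted_wflow <- fit(wflow, data = train)     # fit on training data
predict(fitted_wflow, new_data = test)       # predictions on held-out data
```

Swapping in a different model or adding tuning steps leaves the surrounding pipeline largely unchanged, which is precisely the appeal of such ecosystems.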
Given the tension which often accompanies any realm experiencing rapid development and advancement, attempts have been made to unify research, exploration, and accessibility with production, application, and efficiency. A recent and successful example is the development of the Julia language [18], which is framed as solving the so-called “two-language problem”: the fact that many scientific programs are prototyped in a slow but flexible language and then reimplemented in faster but less flexible languages for practical applications.
Another interesting aspect of AI-related developments is the direct impact on computing itself. Some ML and AI advances are even reciprocally benefitting programming capabilities in the form of development assistants (e.g., GitHub Copilot, automated code review tools, and more recently the impressive and somewhat unexpected coding abilities of ChatGPT), all of which carry the promise of increasing productivity and optimizing developers’ work. As a note, we are aware of the uncertainty, drawbacks, and fear at times relating to ChatGPT and similar technologies, especially in an academic setting [19]. However, for present purposes, we are focused instead on the advances of software and statistical computing, all of which carry both benefits and drawbacks. As a practical example, referring back to JOSS, as well as new software review outlets such as rOpenSci [20,21], software review processes are substantially eased by the inclusion of automated bots, one area where this reciprocal impact is clear.
3.2. Bayesian Estimation
While machine learning is leveraged to realize incredible payoffs when it comes to building predictive models and pipelines, another area of development worth mentioning involves reforming the process of inference and uncertainty quantification: the Bayesian approach. In the modern expression of the Bayesian world, development takes place against a backdrop of reconsidering the value of, and approach to, null-hypothesis significance testing (NHST). Not only does the Bayesian framework provide alternative methods for extracting meaning and making decisions about data (e.g., by providing alternative indices such as the Bayes Factor), it also changes the way we think about and quantify uncertainty as we estimate parameters while building complex models.
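To make these alternative indices concrete, the two central quantities can be written in their standard textbook form (general definitions, not drawn from the works cited here): the posterior distribution of the parameters follows from Bayes’ theorem, and the Bayes Factor compares how well two competing hypotheses predict the observed data,

$$
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}, \qquad
\mathrm{BF}_{10} = \frac{p(D \mid H_1)}{p(D \mid H_0)},
$$

where $\theta$ denotes the model parameters, $D$ the observed data, and $H_0$, $H_1$ the competing hypotheses.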
The development of Bayesian methods on the algorithmic side also parallels the growth of Bayesian-inspired models of how the brain works, which is revolutionizing neuroscience (see [22]). This line of research connects biological intelligence with AI, and efforts are thus being made to optimize Bayesian estimation processes (which are typically computationally expensive) to improve or extend AI capabilities. The bidirectional influence applies here too, as AI research, such as into convolutional neural networks (CNN), is helping scientists in many areas of research, from neuroscientists attempting to better understand the brain and test neurocognitive theories (see [23] for a recent example linking deep learning with psychological manifestations such as hallucinations), to political scientists uncovering election fraud (see [24] for a clever application of CNN to reveal systematic voting fraud in the 1988 Mexican presidential election), and social scientists with new applications of methods such as Bayesian kriging for geospatial modeling (see [25] for a computational exploration of Bayesian kriging in the big data era, published in this Special Issue of Mathematics).
Central to widespread Bayesian adoption is the (relatively) recent development of APIs to easily create and sample from Bayesian models within domain-general languages. Some prominent advances and examples include brms [26] and rstanarm [27] for R, pymc3 [28] for Python, and Turing [29] for Julia.
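As a small illustration of how compact such APIs have made Bayesian model fitting, the following R sketch fits a simple regression with brms; the formula, priors, and data are placeholder choices for illustration, not examples taken from the cited works:

```r
# Minimal Bayesian regression with brms: the model is declared with an
# lm-style formula, sampled with Stan under the hood, and summarized as
# a posterior distribution.
library(brms)

fit <- brm(
  mpg ~ wt + hp,                                # linear predictor
  data   = mtcars,
  family = gaussian(),
  prior  = prior(normal(0, 10), class = "b"),   # weakly informative slopes
  chains = 4, iter = 2000, seed = 123
)

summary(fit)   # posterior means, credible intervals, convergence diagnostics
fixef(fit)     # summary of the population-level (fixed) effects
```

A similarly compact specification is possible in rstanarm, pymc3, or Turing; the point is that the sampler and the posterior bookkeeping are handled by the library rather than written by hand.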
3.3. Results Communication
The aforementioned developments in algorithms, techniques, libraries, and approaches are complemented by concurrent and notable progress in the area of results communication. Central to this aspect of modern advances in statistical computing are clear reporting, wide accessibility, and the ease of translating technical concepts into aesthetically pleasing, well-formatted reports with minimal effort. This advancement can be clearly seen when comparing the former industry standard (LaTeX) with the modern industry standard for technical reporting (Markdown, regardless of flavor, e.g., GitHub, R, Quarto, etc.). In our opinion, this is the final piece of the puzzle to achieve open, reproducible, and high-quality statistical computing, with wide accessibility and easy consumption of research findings and technical output.
Advancements in technical reporting of this variety come in several forms: data visualization, advanced tables, and machine-generated human-readable technical reporting. First, data visualization is now a major area of focus in statistical computing, data science, and ML. Blurring the boundaries between scientific visualization and art, the advent of initiatives to promote beautiful and informative graphs (e.g., #TidyTuesday on Twitter) and generative art (e.g., the artworks of Thomas Lin Pedersen or Danielle Navarro), coupled with recent pushes from major journals to favor visual over tabular rendering of findings when possible, has pushed this formerly niche corner of statistical computing to be widely accepted and pursued, with higher quality now expected. The working implementation of the grammar of graphics in ggplot2 [30] has introduced a new API for plotting libraries and has inspired many counterparts in other languages (e.g., plotnine in Python or Gadfly in Julia). Recent developments, such as D3.js [31], plotly, and shiny [32], have further contributed to advancements in data visualization by introducing interactivity, offering users the ability to “experiment by themselves” and explore the data as they see fit.
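To illustrate the layered grammar-of-graphics API that ggplot2 popularized, here is a generic R sketch; the dataset and mappings are arbitrary examples:

```r
# Grammar of graphics: a plot is built by composing layers, i.e., data,
# aesthetic mappings, geometries, statistical transformations, and labels.
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +                    # raw observations
  geom_smooth(method = "lm", se = TRUE) +   # per-group linear trends with uncertainty bands
  labs(
    x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders",
    title = "Fuel efficiency by weight and cylinder count"
  ) +
  theme_minimal()
```

The same declarative structure is what plotnine and Gadfly mirror in Python and Julia.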
Besides figures, tools for advanced table creation allow numerical results to be presented in an appealing and accurate way. Software with this scope exists in all the major statistical computing languages, such as gt [33], reactablefmtr [34], and knitr [35,36] in R, and, in Python, PrettyTable, PrettyHTMLTable, and even pandas via DataFrame.to_html.
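As a brief sketch of this kind of tooling, the following R example formats a small summary with gt; the data, labels, and title are illustrative only:

```r
# Illustrative table with gt: summarize, then format, label, and title
# the resulting columns for presentation.
library(dplyr)
library(gt)

mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), mean_hp = mean(hp), n = n()) %>%
  gt() %>%
  fmt_number(columns = c(mean_mpg, mean_hp), decimals = 1) %>%
  cols_label(cyl = "Cylinders", mean_mpg = "Mean MPG",
             mean_hp = "Mean HP", n = "N") %>%
  tab_header(title = "Fuel efficiency by number of cylinders")
```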
Figures and tables are specific ways of communicating results, but they are individual parts of a broader process of creating technical reports and scientific papers. Report generation of this sort is facilitated by tools that allow for more transparency and reproducibility by automating parts of the standardized text (e.g., the values in parentheses that provide the details of a statistical test). Of note, recently developed software allows for automating effect size labeling [37], describing statistical models (e.g., [38]), or clarifying the approach used for outlier treatment (see Theriault et al., published in this Special Issue of Mathematics). Another tool for more accurate statistical reporting is “statcheck” [39], which checks existing documents for accurate reporting of statistical tests and is useful for reviewing others’ work or one’s own.
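As a concrete and deliberately generic example of automated model description, the R snippet below uses the report package; it is shown purely as an illustration of this class of tooling and is not necessarily the software cited above:

```r
# Automated, human-readable description of a fitted model.
# The `report` package is used here as one example of this class of tool;
# it is not necessarily the software referenced in the citations above.
library(report)

model <- lm(mpg ~ wt + hp, data = mtcars)
report(model)   # prints a textual paragraph describing the model,
                # its coefficients, and their interpretation
```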
A final but important refinement, which ties in with several points made throughout, aims at making results more readable, aesthetically pleasing, and ultimately understandable, all in a reproducible and easy-to-manage way. This is the fruit of recent document-generation frameworks that are able to combine code (potentially from multiple languages), figures, and text into well-formatted outputs. Recent examples of these software tools include Quarto and RMarkdown [40], which can be combined with Shiny for cloud-based reporting [35], and officer [41]. In Python, similar packages for reporting include pandas [42], Jinja2 [43], and WeasyPrint, to name a few. In Julia, similar libraries include Weave [44] and Pluto [45].
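To make this concrete, here is a minimal sketch of such a source document, written in Quarto syntax with an embedded R chunk; the contents are illustrative only, and rendering produces a formatted report containing the figure and the inline value:

````markdown
---
title: "Reproducible analysis report"
format: html          # Quarto output format (R Markdown would use `output:` instead)
---

## Results

```{r}
#| echo: false
# The code, figure, and numbers below are regenerated every time the
# document is rendered, keeping text and results in sync.
model <- lm(mpg ~ wt, data = mtcars)
plot(mtcars$wt, mtcars$mpg)
```

The estimated slope was `r round(coef(model)[2], 2)` miles per gallon
per 1000 lbs, recomputed automatically at each render.
````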
4. Concluding Remarks
Statistics was once exclusively seen as a component of mathematics, and its practitioners were required to be trained and familiar with mathematical concepts, formulas, and equations. Statisticians could develop entire theories and frameworks in isolation, and report their ideas in traditional scientific outlets. However, the path of statistics is now in the process of synchronizing with computer science, as good software development and efficient algorithm design become key to statistical advances.
To answer the question of where we are going with statistical computing, posed in the title of this paper, we suggest that decentralized collaboration (embracing and pushing forward open science in all its aspects), cross-pollination between experts in multiple fields working in multiple languages, and a focus on users will increasingly characterize this field. Regarding the latter, the term “users” is an ever-widening concept that includes many people with many purposes. For instance, users may include the lay user interested in writing better technical documents, the statistician–scientist interested in publishing reproducible, well-formatted statistical results, or the operational data scientist collaborating with internal stakeholders on developing new ways of computing and sharing findings across their team.
Whether contributing original ideas or enjoying the benefits emerging from the modern network of collaborative and open development, statistical computing is at the center of performing and sharing good science in reproducible ways. Additionally, most thrillingly, there is no end in sight to the development and evolution of statistical computing.