1. Introduction
Lightning has been an increasing cause of wildfires in recent decades. Although, historically, lightning has only caused 15–20% of wildfire occurrences in the USA, lightning-ignited wildfires account for about 60% of the annually burned acreage. In the extratropical portion of the Northern Hemisphere, lighting is responsible for 77% of the burned wildfire area every year [
1]. Hence, the capability to predict lightning 7–10 days in advance on a continental scale, at a high spatial resolution using operational, numerical weather forecasts as drivers, is a key prerequisite for any modeling efforts aimed at quantifying the chances of wildfire ignition over large regions.
Attempts to model lightning date back to 1980s [
2,
3]. These early studies recognized the importance of lightning in wildfire occurrences and that lightning’s frequency was correlated with the low-level moisture flux induced by strong upward vertical motions in the boundary layer during storms [
3]. With the advancement of numerical weather prediction (NWP) models, an effort was made to incorporate lightning into operational forecasts based on “first principles” using simulated thunderstorm parameters and the microphysical properties of clouds as drivers [
4,
5,
6]. These studies found promising relationships between lightning’s frequency and the depth of convective clouds as well as between a parameter called the Potential Electrical Energy (PEE) of clouds and noninductive charge separation, which involves collisions of graupel and ice particles in the presence of supercooled liquid water. PEE was assumed to dissipate when it exceeds preassigned threshold values, causing lightning to be generated. The Lightning Potential Index (LPI) was introduced as a new parameter in weather forecast models to measure the potential for the generation and separation of the electrical charges leading to lightning flashes in convective thunderstorms [
5]. The LPI is calculated within the charge separation region of clouds between 0 °C and −20 °C, where the noninductive mechanism involving the collisions of ice and graupel particles in the presence of supercooled water is most effective. Studies have shown that the LPI is well correlated with observed lightning. Wong et al. [
7] found that the Price and Rind lightning parameterization scheme [
4] did not produce correct distributions for lightning flashes in the context of convection-resolving forecasts produced by the Weather Research and Forecasting (WRF) model. Recently, satellite observations have revealed a strong correlation between the number of lightning strokes and the vertical distribution of ice particles, as depicted by the 94 GHz products of the Cloud Profiling Radar on board the
CloudSat (a polar satellite of NASA’s Earth System Science Pathfinder) [
8]. These data could help improve the existing mechanistic schemes of cloud electrification to predict lightning that are employed by modern NWP models.
When the goal is the generation of accurate lightning forecasts over a limited domain, the best approach is to derive region- and season/time-specific lightning activity models using multivariate statistical methods that relate the observed probability of lightning flashes/strikes to a large number of atmospheric parameters describing the vertical structure of thunderstorms. This approach typically involves machine learning algorithms such as
random forest or a combination of a stepwise regression analysis, Principal Component Analysis (PCA), and logistic regression [
9,
10,
11,
12,
13]. Our study elected to apply PCA and logistic regression to a large archive of historical lightning and meteorological data spanning 29 years over the USA (see the
Section 2 for details).
In the USA, the National Predictive Services (NPS) is an interagency group tasked with the operational delivery of 7-Day Wildfire Potential Outlooks for fire management operations. The NPS asked the Rocky Mountain Center (RMC), an applied research unit at the US Forest Service Rocky Mountain Research Station (RMRS), for Fire–Weather Intelligence to develop a system of statistical models for predicting wildfire ignition probabilities on a national grid using numerical weather forecasts by the National Weather Service as input. As a first step in the project, the RMC built a lightning forecast model. This paper describes the methodology employed by the RMC to derive a suite of regional, monthly, logistic equations capable of estimating the probability of cloud-to-ground (CG) lightning strikes, using forecast weather fields, on a 20 km grid over the continental USA (ConUS) and Alaska (AK), and the verification of these equations against independent data.
2. Materials and Methods
The overall approach of this study followed the original methodology proposed by Bothwell [
14], which was later also adopted by Buckey [
15] and Richardson [
16]. We used datasets of much longer durations to derive new and improved lightning prediction equations for the ConUS and AK.
2.1. Georeferenced Datasets
Table 1 lists the datasets utilized for the ConUS. The developmental database included two 29-year-long climatological records of weather reanalysis fields and observed lightning strikes over the ConUS. These were used to derive lightning forecast regression equations. The NOAA North American Regional Reanalysis (NARR) [
17], containing 3D fields from 00:00, 03:00, 06:00, 09:00, 12:00, 15:00, 18:00, and 21:00 UTC for the period from January 1990 to 2018, were downloaded and archived in GRIB2 format. The lightning data for the lower 48 states (ConUS) were provided by Vaisala [
18] through a Federal Contract with the Bureau of Land Management (BLM) at the US Department of the Interior. Vaisala’s historical lightning dataset, covering the period from January 1990 to December 2018, was delivered to our team by the Western Regional Climate Center at the Desert Research Institute in Reno, NV, USA. According to the signed contract between Vaisala and BLM, only US Government Agencies may use the lightning dataset free of charge for scientific research. Three-hour forecast-gridded fields extending over 10 days, produced by the NOAA Global Forecast System (GFS) [
19], were also downloaded and archived in GRIB2 format (0.25 × 0.25 degree) for the core fire season (April–September) of 2018 in order to verify our lightning forecast equations against independent data.
The development of the lightning forecast model for AK used similar datasets as the ConUS model but covered a shorter time period. NARR was used as a source of observed gridded weather fields and the Alaska Lightning Detection Network (ALDN) [
20], owned and maintained by the Alaska Fire Service (AFS) [
21], was utilized as a source of geo-referenced lightning-strike data. The training datasets for AK spanned a period of 10 years (from 2012 to 2021). The reason for this was that, during the summer of 2012, ALDN replaced Vaisala’s Impact sensors, which detect lighting flashes, with Time-Of-Arrival (TOA) sensors (provided by TOA Systems, Inc. Melbourne, FL, USA), which record lightning strikes. A comparison study conducted by the AFS, reported online (at
https://oasishub.co/dataset/alaska-historical-lightning-from-1986-to-2017-alaska-interagency-coordination-centre accessed on 25 March 2024), found that the new TOA ALDN detects about 2.25 times more lightning events than the older Vaisala sensors. About 60% of this increase is attributable to the difference in detecting strikes vs. flashes (a flash can contain multiple strikes), with the remaining 40% increase resulting from an improved detection efficiency, expanded spatial sensor coverage, and a longer detection range. Since the TOA sensors will continue to be used in the future, we decided to exclude from our analysis all lightning flash data collected by the Vaisala Impact sensors prior to 2012.
2.2. Statistical Methods and Procedures of Model Development
Lightning strikes are binomially distributed events, meaning that lightning’s occurrence is a variable with only two possible outcomes, 0 or 1 (i.e., absent or present). The probability of binary variables can be described by logistic equations or sigmoid functions. Our model employs logistic regression, which uses a supervised machine learning algorithm that performs binary classification tasks by predicting the probability of the occurrence of an event [
22]. The model development proceeded in four stages.
First, lightning data and NARR weather fields were aggregated and re-projected on a 20 km resolution grid. For NARR data, each month over the ConUS, and May through to September over AK, was represented by an average diurnal cycle with a 3 h resolution containing 8 temporal bins. Logistic equations for the probability of lightning strikes were derived for each temporal bin of every fire season month. Since AK has virtually no lightning events during the cold fall–winter season, we did not derive lightning forecast equations for the period from the 1 October to the 30 April for this area. In an effort to improve the overall skill of the lightning forecast model, given the large geographic extent of the lower 48 states, the ConUS domain was divided into 10 subregions by taking into account the differences in its latitude, topography, fuels, climate, and soil moisture regimes (
Figure 1). A single region was employed for AK to allow sufficient lightning events to be included in the analysis.
The second step of data processing involved the creation of weather and lightning climatologies for each temporal bin in every respective month, using 27-year-long data records for the ConUS and 10-year records for AK.
Thirdly, a PCA process known as “varimax/orthogonal rotation” [
23] was applied to the gridded NARR and lightning climatological datasets to analyze 611 potential meteorological drivers (predictors). The PCA reduced the large cohort of 3D meteorological drivers to a smaller subset of 13 statistically significant lightning predictors (principal components).
Finally, logistic regressions were performed using the first 13 PCs along with the lightning climatologies from every region and temporal bin to derive predictive equations for calculating the probabilities of “one or more” and “10 or more” strikes. The logistic equations have the following general form:
where
is the probability of a lightning strike,
is the
ith predictor (PC), and
is the estimated regression coefficient (weight) for that predictor. The “
R” statistical package [
24], an open source software, was utilized to perform both the PCA and the logistic regressions.
The above statistical procedure, combining PCA with logistic regression to predict lightning, was chosen instead of building a detailed process-oriented model using cloud microphysics for two reasons: (
a) the current mechanistic understanding of lightning’s generation is not detailed enough to allow for the development of accurate lightning-prediction models from “first principles”; and (
b) the historical, climatological datasets of coarse spatial resolution (20–32 km), such as NARR, available for model parameterizations do not contain information about cloud microphysical properties, which typically vary on spatial scales of tens of meters. Furthermore, our statistical procedure was found by previous research to be among the most successful techniques available for operational lightning forecasting [
12].
2.3. Model Verification Procedure
The model’s performance was evaluated using independent meteorological and lightning data from 2018 for the ConUS and from 2022 for AK. These were data not included in the statistical parametrization of the logistic lightning equations. The meteorological drivers for the verification came from two sources: NARR reanalysis fields and GFS grids. To avoid the biases introduced by forecast weather fields and allow a comparison between the NARR-driven 0–3 h forecast and GFS-driven lightning forecasts, we used GFS data from the same 0–3 h initialization period. Since GFS is an operational model run every 6 h that produces 7-day forecasts, it was ideal for incorporation into a 7-day forecast product. For this reason, the skill of the lightning model was also evaluated under forecast weather conditions, where projected GFS fields of up to 7 days were used.
The model verification procedure employed two primary statistical metrics: (a) reliability diagrams (RDs), comparing model-forecast probabilities to the observed relative frequencies of lightning strikes [
25]; and (b) Receiver Operating Characteristic (ROC) curves, quantifying the relationship between the False Positive Rates (on the X axis) and True Positive Rates (on the Y axis) at different classification thresholds [
26]. A key feature of the ROC diagrams is the Area Under the Curve (AUC), which provides an aggregate measure of the model’s performance across all possible classification thresholds. The closer the AUC is to 1.0, the higher the spatial accuracy of the model’s forecasts. RDs and ROC curves were calculated for the model predictions driven by observed weather fields from NARR and GFS (0–3 h initialization) and forecast GFS weather fields extending over 7 days.
As a means of qualitative assessment, we generated maps overlaying the forecast lightning probabilities with the observed number of lightning strikes in each 20 km grid cell. Specifically, we plotted the 24 h maximum lightning forecast probabilities against the observed lightning strikes in the same time period.
3. Results
The PCA yielded 13 principal components (PCs) that were employed as predictors for the 10 ConUS regions depicted in
Figure 1, for AK, and for all 3 h bins in every month. Each PC represents a linear combination of participating base variables. The components covered variables near the surface as well as at vertical pressure levels from 700 mb to 100 mb, in 50 mb increments (
Figure 2). Our PCA method with orthogonal rotation found similar terms and grouped them together into individual components. Note that CAPE (the Convective Available Potential Energy) was singled out as a separate PC despite the fact that it is derived from the variables in components 1, 3, 5, 8, and 11. This is because CAPE experiences a non-stationary relationship with these variables depending on the vertical temperature and moisture profiles. Hence, these components do not have a persistent, one-way correlation with CAPE and can either increase or decrease the value of CAPE.
The 13 PCs collectively explained nearly 61% of the total variance of CG strikes in all regions and time periods over the ConUS (
Figure 2). The temperatures and geopotential heights at every 50 mb vertical pressure level at a grid point emerged as the strongest predictor among the 13 PCs, accounting for 14.6% of the observed variance in CG strikes.
Logistic equations quantifying the probabilities of CG strikes were derived using the multi-year climatologies from each 3 h diurnal bin of every month and region, employing the 27-year-long records of NARR and lightning grids for the ConUS and 10-year long data records for AK. Each month of the year was described using eight logistic equations per region, i.e., one equation for every 3 h bin of the average 24 h diurnal cycle for that month. This produced 960 equations for the ConUS and 40 equations for AK.
The output files from the “R” software (Version 4.2.2) containing the lightning prediction equations were combined with the 3D meteorological fields generated by the GFS to produce operational lightning probability forecasts at 20 km resolution over 7 days for the ConUS and AK. The forecasts were delivered in a graphic format to field users through a customized website [
27].
Model Verification
Figure 3 compares side-by-side reliability diagrams of NARR- and GFS-driven 3 h forecasts for June, July, and August of 2018 over the ConUS, employing independent meteorological and lightning data. The histograms visible in the lower right corner of the reliability diagrams depict the areal fraction of the forecasts in each probability “bin”, represented by the open circles on the diagrams. These histograms reveal that the vast majority of forecasts refer to lower probabilities, with large areas having a near-zero chance of lightning.
Figure 4 shows the ROC curves and corresponding AUCs for the same data. The decimal values on the ROC curves represent the sample’s forecast probabilities of lightning and show where such probabilities fall with respect to the Hit Rate (the fraction of grid points with a correct prediction) and False-Alarm Rate (the fraction of grid points with an incorrect prediction). For example, in
Figure 4, 0.1 refers to a 10% forecast chance of lightning, which has a Hit Rate of ~48% and a False-Alarm Rate of ~7%. The lower the False-Alarm Rate is for any given Hit Rate, the larger the AUC and the better the forecast. AUC values that are close to or slightly higher than 0.9 indicate excellent-to-outstanding model accuracy in predicting the probabilities of one or more CG lightning strikes over the ConUS [
28]. Surprisingly, the GFS-driven forecasts (using the initial 3 h) performed slightly better than those driven by NARR data. This might be a result of recent code upgrades made to the GFS model by NOAA.
Figure 5 depicts the RDs of the 24 h maximum lightning probabilities for forecast days 1, 3, and 7 in August of 2018 over the ConUS. It is important to note that forecast probabilities greater than 0.7 only occur in less than 0.1% of grid points and that 99.9% of the ConUS area exhibits a chance of lightning strikes less than 0.7 (70%). Thus, the larger deviations of the red curves from the 1:1 line for probabilities greater than 0.7 which are visible in
Figure 3 and
Figure 5 are essentially immaterial with regard to the actual model’s performance.
Figure 5 illustrates that the model also yielded reliable predictions up to 7 days in advance when using the forecast GFS weather fields as drivers.
Figure 6 depicts a reliability diagram of the NARR-based lightning forecasts for Alaska, covering the entire fire season (1 May–31 August) of 2022. Once again, the red curve is very close to a perfect model data match (the 1:1 line) for all probabilities that are meaningfully represented over the AK domain.
Figure 7 compares the reliability diagrams of lightning predictions driven by either NARR data (left panel) or GFS initialization fields (right panel) during the most active portion of the 2022 fire season (1 June–31 August) in AK. Overall, the NARR-based lightning forecasts display better accuracy than those based on GFS initialization fields, considering that the ROC AUC values for the two drivers are almost identical.
Figure 8 shows the RDs and ROC curves for lightning predictions over AK, extending over 168 h and driven by the GFS forecast weather fields in June and July of 2022. Although more distant forecast periods might appear closer to the 1:1 “perfect” line in the RDs, the ROC curves show a gradual decrease in forecast accuracy (measured by the relationship between the Hit Rate and False-Alarm Rate), as each curve for later forecast periods has a smaller AUC value compared to earlier forecast periods. The AUC values decrease from 0.9328158 at forecast hours 0–3 to 0.8697127 at forecast hours 168–171. This indicates that the accuracy of lightning predictions deteriorates somewhat with the increasing uncertainty of numerical weather forecasts over time. Still, the AUC values indicate an outstanding model performance up to forecast hour 99 and excellent accuracy up to hour 147.
Figure 9 displays the observed lightning data overlaid on predicted 24 h maximum lightning probabilities for the ConUS for forecast days 1, 3, 5, and 7, beginning on 15 July 2018.
Figure 10 shows similar overlays for AK during two high-intensity lightning events that occurred on 4 June and 20 June in 2022. These overlay maps visually illustrate the ability of the lightning prediction model to successfully capture regions of intense lightning.
4. Conclusions
We presented here the first and only operational, gridded lightning probability forecast algorithm for the continental USA and Alaska, derived from 29-year-long data records, that runs on a uniform 20 km continental grid. The statistical procedure of model derivation revealed that the occurrence of lightning strikes primarily depends on three factors: the temperatures and geopotential heights across the pressure levels at a grid point, the amount of low-level atmospheric moisture, and the wind speed/wind direction. These physical variables apparently aid in controlling the vertical separation of electric charges in the lower troposphere during thunderstorms, at a 20 km resolution, increasing the voltage potential between the cloud deck and the ground to a level that triggers electrical discharges.
The results obtained from the model verification performed employing independent data indicate that probabilistic predictions of CG lightning strikes are feasible and can be statistically reliable up to 7 days in advance when driven by output from operational, numerical weather forecast models such as the GFS. Hence, this new lightning prediction algorithm can be incorporated into future forecasting systems of fire ignition probabilities to quantify the risk of naturally occurring wildfires at a national level.