A Novel Approach for Predicting Large Wildfires Using Machine Learning towards Environmental Justice via Environmental Remote Sensing and Atmospheric Reanalysis Data across the United States

Agrawal, Nikita; Nelson, Peder V.; Low, Russanne D.

doi:10.3390/rs15235501

by Nikita Agrawal ¹, Peder V. Nelson ² and Russanne D. Low ³,*

¹ Whitney M. Young High School, Chicago, IL 60607, USA
² College of Earth, Ocean, Atmospheric Sciences, Oregon State University, Corvallis, OR 97331, USA
³ Institute for Global Environmental Strategies, Arlington, VA 22201, USA
* Author to whom correspondence should be addressed.

Remote Sens. 2023, 15(23), 5501; https://doi.org/10.3390/rs15235501

Submission received: 10 October 2023 / Revised: 20 November 2023 / Accepted: 23 November 2023 / Published: 25 November 2023

(This article belongs to the Special Issue Remote Sensing of Wildfires under Climate Change)

Abstract

Large wildfires (>125 hectares) in the United States account for over 95% of the burned area each year. Predicting large wildfires is imperative; however, current wildfire predictive models are region-based and computationally intensive. Using a scalable model based on easily available environmental and atmospheric data, this research aims to accurately predict whether large wildfires will develop across the United States. The data used in this study include 2109 wildfires over 20 years, representing 14 million hectares burned. Remote sensing environmental data (Normalized Difference Vegetation Index, NDVI; Enhanced Vegetation Index, EVI; Leaf Area Index, LAI; Fraction of Photosynthetically Active Radiation, FPAR; Land Surface Temperature during the Day, LST Day; and Land Surface Temperature during the Night, LST Night) consisting of 1.3 billion satellite observations were used. Atmospheric reanalysis data (u component of wind, v component of wind, relative humidity, temperature, and geopotential) at four pressure levels (300, 500, 700, and 850 hPa) were also factored in. Six machine learning classification models (Logistic Regression, Decision Tree, Random Forest, eXtreme Gradient Boosting, K-Nearest Neighbors, and Support Vector Machine) were created and tested on the resulting dataset to determine their accuracy in predicting large wildfires. Model validation tests and variable importance analysis were performed. The eXtreme Gradient Boosting (XGBoost) classification model performed best in predicting large wildfires, with 90.44% accuracy, a true positive rate of 0.92, and a true negative rate of 0.88. Furthermore, towards environmental justice, an analysis was performed to identify disadvantaged communities that are also vulnerable to wildfires. This model can be used by wildfire safety organizations to predict large wildfires with high accuracy and prioritize resource allocation to employ protective safeguards for impacted socioeconomically disadvantaged communities.

Keywords:

MODIS; ERA5; burned area; climate change; Python

1. Introduction

Wildfires pose severe health and ecological consequences. In the United States, from 2011 to 2021, there were an average of 62,799 wildfires annually and an average of 3 million hectares impacted annually [1]. In 2021 alone, 58,985 wildfires burned 2.9 million hectares [1]—nearly a 4% increase in the average national number of acres burned from the previous 10 years [2].

The term “wildland fire” encompasses not only uncontrolled fires but also fires intentionally set as part of prescribed burns [3]. Uncontrolled fires, referred to as wildfires, contribute approximately 15% of the total United States particle emissions each year, which is more than emissions from power plants and transportation combined [4]. The chemical emissions released from wildfires further contribute to climate change [5]. Wildfire smoke releases fine particulate matter (PM2.5), which is more detrimental to respiratory health than fine particles from other sources [6].

On the other hand, controlled use of fires—known as prescribed burning—is common around the world for positive environmental effects and to minimize the risk of uncontrolled wildfires [7]. Nutrients released from the burned material, which includes dead plants and animals, return more quickly into the soil than if they had slowly decayed over time. In this way, fire increases soil fertility, a benefit that has been exploited by farmers for centuries [7].

Climate change has been a key driver in increasing the risk and extent of wildfires in the western United States over the last two decades. Temperatures have been increasing rapidly and scientists fear that climate change is occurring faster than anticipated [8]. Factors for wildfire spread include increased drought, warmer conditions, and dryness of forest fuels—organic matter that burns and contributes to wildfire spread [9].

A key metric that is widely used to describe wildfire severity is the burned area [10,11], the amount of surface covered within a given perimeter enclosing the wildfire. The number of fires and area burned are indicators of the annual level of wildfire activity. As shown in Figure 1, although over the last 35 years the number of annual wildfires has decreased, the burned area has increased (data from [12]).

Only a small fraction of wildfires become catastrophic, yet these account for the majority of the area burned. Large fires (>125 hectares) account for more than 95% of the area burned by wildfires in the United States each year [13]. Wildfire predictive models are used to evaluate the potential outcomes of such fires and to inform community readiness and mitigation planning.

Machine learning models are increasingly being applied in scientific research, including wildfire science. The prediction of wildfire occurrence is complex, and the nonlinear nature of machine learning models is being acknowledged as potentially beneficial in this regard [14]. Large datasets from satellites with millions of wildfire observations have improved the predictions of current machine learning models. However, current wildfire studies using machine learning are conducted on a regional basis. One review identified 19 studies in which machine learning was applied only to specific regional datasets [15].

Beyond studies on predicting wildfire occurrence, there is limited literature available on predicting the wildfire-burned area across multiple regions. For example, FARSITE is a two-dimensional model that depicts fire perimeter growth. The model shows promising results in basic conditions, as the prediction closely matches the actual fire boundary. However, it is computationally demanding, requiring integration of many variables, and the model’s accuracy varies widely across wildfires in different regions [16]. Another model, FIRECAST, is a convolutional neural network (CNN) used to predict the expected burned area of an active fire after 24 h [17]. However, this CNN model was trained on location-specific input, which was heavily restricted by the small size of the dataset [18]. Burned area predictive research should investigate more methodologies, especially at larger scales with more data and complex input variables [19].

Remote sensing is a useful technique for data collection wherein sensors aboard orbiting satellites, aircraft, or drones, or installed on the ground, provide a wealth of data that can be used to assess conditions before a burn and the environmental impact of a historic burn [20]. It can be used to improve warning and preparedness and is also useful in disaster risk management through its ability to collect information and data in dangerous (e.g., during fire events) or inaccessible areas (e.g., impervious areas). This technology enables the monitoring of the Earth’s surface, ocean, and atmosphere at several spatial and temporal scales, thus allowing climate system observations [21]. These techniques have become more widely accessible due to the lower cost of satellite imagery. NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS) is a key instrument aboard the Terra and Aqua satellites. Terra’s orbit around the Earth is timed so that it passes from north to south across the equator in the morning, while Aqua passes south to north over the equator in the afternoon. Terra MODIS and Aqua MODIS view the entirety of the Earth’s surface every 1 to 2 days, acquiring data in 36 spectral bands [22]. While there are other remote sensing tools, such as GOES-16, Landsat, and VIIRS, most research to date has used various iterations of the MODIS data from the Terra and Aqua satellites [23]. MODIS is a comprehensive sensor that collects environmental data on important wildfire factors, and its data are available under NASA’s open data policy.

Reanalysis datasets provide a more geographically and temporally uniform alternative to point-based observations. A reanalysis dataset is a retrospective analysis in which a numerical weather prediction model is used to construct an initial guess of the previous state of the climate, which is subsequently updated with observations [24,25]. Although the reanalysis process’s faults and uncertainties are only partially known, these datasets are frequently used as a proxy for observations [26]. Reanalysis data also span numerous decades, making them ideal resources.

The European Centre for Medium-Range Weather Forecasts (ECMWF) has released the ERA5 dataset, its most advanced reanalysis output. It was designed and generated using procedures that provided numerous enhancements over the previous release, the ERA-Interim reanalysis tool. It has a higher geographical resolution, a more sophisticated assimilation mechanism, and additional data sources [27].

The purpose of this research is to predict whether large wildfires will develop by creating a reliable classification model that is based on easily accessible data, is not as computationally intensive as current models, and achieves a high degree of accuracy for wildfires across the United States. It is important to understand the environmental drivers of fire, as weather, fuels, and topography are known to influence wildfire burned area and wildfire severity [28]. Predicting large wildfires is challenging and depends on a complex combination of factors such as temperature, vegetation, relative humidity, and wind speed [9]. Incorporating these factors, six machine learning models were developed and tested in this research.

2. Method

2.1. Materials

A spatial database of wildfires that occurred across the United States from 1992 to 2020 was retrieved from the United States Department of Agriculture (USDA) [29]. These wildfire records were acquired from the reporting systems of federal, state, and local fire organizations. The core data elements included discovery date, final fire size, and point of origination. The data were transformed to conform to the high quality data standards of the National Wildfire Coordinating Group [30].

The transformed database used, herein referred to as the Fire database, contains geo-referenced wildfire records during the 29-year period (1992 to 2020). This research used 2109 wildfire sites (1105 large wildfires and 1004 non-large wildfires) across the United States, representing 14 million hectares burned as shown in Figure 2, which were sampled per the National Interagency Coordination Center (NICC) annual report ratio [1].

The 1.3 billion NASA MODIS observations, from 2000 to 2020, were downloaded from the Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC) data collection, which includes the following six key variables:

Normalized Difference Vegetation Index (NDVI) from the MOD13Q1 dataset [31];
Enhanced Vegetation Index (EVI) from the MOD13Q1 dataset [31];
Leaf Area Index (LAI) from the MOD15A2H dataset [32];
Fraction of Photosynthetically Active Radiation (FPAR) from the MOD15A2H dataset [32];
Land Surface Temperature during the Day (LST Day) from the MYD11A2 dataset [33];
Land Surface Temperature during the Night (LST Night) from the MYD11A2 dataset [33].

MOD13Q1 distinguishes between NDVI and EVI by employing specific algorithms; NDVI relies on the normalized difference between red and near-infrared reflectance, while EVI incorporates additional corrections to enhance sensitivity in dense vegetation areas [34]. MOD15A2H employs a dual parameter approach, extracting LAI to quantify one-sided green leaf area and FPAR to represent the fraction of photosynthetically active radiation absorbed by green vegetation [35]. MYD11A2 uses a day/night algorithm where daytime and nighttime LSTs are retrieved from pairs of day and night MODIS observations [36]. For each of the six variables, annual averages spanning the three years prior to each wildfire occurrence were computed for a total of 18 environmental variables.
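The averaging step described above can be sketched as follows. This is a minimal illustration rather than the published code: the function name `prior_year_means`, the use of consecutive 365-day windows counted back from the fire start date, and the sample NDVI values are all assumptions made for the example.

```python
from datetime import date, timedelta
from statistics import mean


def prior_year_means(observations, fire_date, n_years=3):
    """Average one variable over each of the n_years 365-day windows before fire_date.

    observations: list of (date, value) pairs for one variable at one site.
    Returns [mean 1 year before, mean 2 years before, mean 3 years before];
    a window without observations yields None.
    """
    means = []
    for k in range(n_years):
        window_end = fire_date - timedelta(days=365 * k)          # exclusive
        window_start = fire_date - timedelta(days=365 * (k + 1))  # inclusive
        values = [v for d, v in observations if window_start <= d < window_end]
        means.append(mean(values) if values else None)
    return means


# Hypothetical NDVI composites (illustrative values only, not study data).
ndvi = [
    (date(2019, 6, 1), 0.60), (date(2019, 12, 1), 0.40),  # first year before the fire
    (date(2018, 6, 1), 0.70), (date(2018, 12, 1), 0.50),  # second year before
    (date(2017, 6, 1), 0.55), (date(2017, 12, 1), 0.35),  # third year before
]
fire_start = date(2020, 5, 1)
ndvi_means = prior_year_means(ndvi, fire_start)
print(ndvi_means)
```

Repeating this for the six MODIS variables gives the 6 × 3 = 18 environmental inputs per wildfire.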

The fifth generation ECMWF atmospheric reanalysis (ERA5) data [37] were obtained to help relate the final wildfire burned area to any spatial patterns in five atmospheric variables on the day the wildfire started at four pressure levels (300, 500, 700, and 850 hPa). These five variables for the four pressure levels thus accounted for a total of twenty atmospheric variables used in the research. The five atmospheric variables used were:

u component of wind (eastward wind);
v component of wind (northward wind);
relative humidity;
temperature;
geopotential.
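As a bookkeeping aid, the variable counts can be checked by enumerating the inputs: five atmospheric variables at four pressure levels, alongside three annual averages for each of the six MODIS variables. The column naming scheme below (for example `geopotential_300hPa`) is hypothetical and is not taken from the published dataset.

```python
from itertools import product

# Variable lists follow the text; the name format is an assumption for illustration.
modis_variables = ["NDVI", "EVI", "LAI", "FPAR", "LST_Day", "LST_Night"]
years_before_fire = [1, 2, 3]
era5_variables = ["u_wind", "v_wind", "relative_humidity", "temperature", "geopotential"]
pressure_levels_hpa = [300, 500, 700, 850]

environmental = [f"{v}_year{y}" for v, y in product(modis_variables, years_before_fire)]
atmospheric = [f"{v}_{p}hPa" for v, p in product(era5_variables, pressure_levels_hpa)]
feature_names = environmental + atmospheric

print(len(environmental), len(atmospheric), len(feature_names))
```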

Recent research on elevation-dependent forest fires in the western United States [38,39] found that, although higher elevations were historically wet enough to buffer fire ignition and slow or hinder fire propagation, forest fires have advanced upslope over the past few decades, scorching territories previously too wet to burn. Higher elevation has therefore become conducive to large fire activity in the western United States in recent decades, which makes it important to factor higher elevation into this research. For this reason, geopotential was considered not only at 850 hPa but at all levels up to 300 hPa.

Table 1 shows the variables used in this project. The code for this project was developed in Python version 3.9.13 using Jupyter Notebooks. The data and code, along with instructions for running the code, are publicly available on Zenodo [40,41].

2.2. Methodology

In this project, for each wildfire occurrence in the Fire database, the 18 environmental variable averages and 20 atmospheric variables (total 38 variables) were inputted into six selected machine learning models to analyze model accuracy for large wildfire classification and to identify variable importance for each model. The overall methodology is shown in Figure 3.

2.2.1. Processing

Since MODIS collects observations of NDVI, EVI, FPAR, LAI, LST Day, and LST Night from 2000, while the Fire database sourced from USDA contains wildfire occurrences up until 2020, wildfire occurrences from 2000 to 2020 were analyzed in this research.

Other than the location of origin of the wildfire, it is important to consider the geographic features of the surrounding vicinity to estimate how far the wildfire will spread. Furthermore, environmental and atmospheric variables are both drivers of wildfire activity. However, the impact of environmental variables on wildfire spread builds up over the long-term while instantaneous atmospheric variables influence wildfire behavior in the short-term [42].

Therefore, for each wildfire occurrence, MODIS data up to three years prior to the wildfire start date were processed and three annual averages leading up to the wildfire occurrence were computed. Annual averages were used, as opposed to monthly averages, in order to eliminate seasonal variations within each environmental variable. The 20 instantaneous ERA5 atmospheric reanalysis variables at the wildfire start date were obtained. Both environmental and atmospheric data were gathered from a 10 km by 10 km grid centered on the wildfire’s point of origin. Spatial autocorrelation is prevalent in the context of predicting wildfire burned area because areas close to each other have similar characteristics [43]. Taking an average over the 10 km by 10 km grid helps mitigate this issue.
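The spatial averaging can be illustrated with a small sketch. It assumes the data are already on a regular pixel grid with the ignition point at a known row and column, a 0.5 km pixel size (the resolution given for LAI and FPAR), and a fill value of -9999.0 for missing pixels. The function name `window_mean` and the fill value are hypothetical, not taken from the published code.

```python
def window_mean(grid, center_row, center_col,
                window_km=10.0, pixel_km=0.5, fill_value=-9999.0):
    """Mean of valid pixels in a square window centered on one pixel.

    With 0.5 km pixels, a 10 km window spans 20 x 20 pixels.
    Pixels equal to fill_value are skipped; the window is clipped at the grid edge.
    """
    half = int(round(window_km / pixel_km / 2))
    n_rows, n_cols = len(grid), len(grid[0])
    values = []
    for r in range(max(0, center_row - half), min(n_rows, center_row + half)):
        for c in range(max(0, center_col - half), min(n_cols, center_col + half)):
            if grid[r][c] != fill_value:
                values.append(grid[r][c])
    return sum(values) / len(values) if values else None


# Hypothetical 30 x 30 LAI tile: uniform value with one missing pixel.
lai_tile = [[2.0 for _ in range(30)] for _ in range(30)]
lai_tile[15][15] = -9999.0
lai_mean = window_mean(lai_tile, center_row=15, center_col=15)
print(lai_mean)
```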

Figure 4 shows an example of the 10 km by 10 km geographical grid for LAI and FPAR data, which were taken at a spatial resolution of 0.5 km. The wildfire location, the true classification for each of the wildfire occurrences, the 18 environmental variables, and the 20 atmospheric variables were stored into a Python data frame.

2.2.2. Modeling

Studies on predicting wildfire severity through machine learning have tested combinations of machine learning models [19,44]. Through these studies, we identified the six best-performing machine learning classification models. Therefore, the modeling process was executed using the following six machine learning classification models: (i) Logistic Regression, (ii) Decision Tree, (iii) Random Forest, (iv) Extreme Gradient Boosting (XGBoost), (v) K-Nearest-Neighbors (KNN) with a k value of 11 [45], and (vi) Support Vector Machine (SVM). The Python scikit-learn library was used to create these six machine learning models. The input to the modeling process was the data frame resulting from the data processing of the multiple wildfire occurrences across regions. To ensure randomness of the wildfire sites inputted to the machine learning models, the order of wildfire occurrence data within the data frame was shuffled using a constant seed. We used k-fold cross-validation [46] with a k-value of 10 to determine a more reliable accuracy score. Specifically, the input data were randomly split into 10 subsets (also known as folds). In each of 10 trials, the models were trained on all but one of the folds and tested on the remaining held-out fold. The shuffled data frame was thus repeatedly split into a 90% (9/10 folds) train to 10% (1/10 folds) test ratio, and the model’s generalized accuracy score was the average over the 10 trials. The training set was used to fit the machine learning models to predict large wildfires. The testing set was unknown to the model during the training period and was used to determine a generalized overall model accuracy.
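The cross-validation loop can be sketched in plain Python as below. This is an illustration of the procedure, not the study's implementation: the seed value 42, the helper names, the toy one-feature threshold classifier, and the toy data are all assumptions. In practice, scikit-learn's `KFold` and `cross_val_score` in `sklearn.model_selection` provide the same mechanics for real estimators.

```python
import random


def k_fold_indices(n_samples, k=10, seed=42):
    """Shuffle sample indices with a constant seed and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k=10, seed=42):
    """Train on k-1 folds, test on the held-out fold, and average the accuracies."""
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        correct = sum(predict(model, X[j]) == y[j] for j in test_idx)
        scores.append(correct / len(test_idx))
    return scores, sum(scores) / len(scores)


# Toy stand-in classifier: threshold halfway between the two class means.
def fit_threshold(X_train, y_train):
    zeros = [x[0] for x, t in zip(X_train, y_train) if t == 0]
    ones = [x[0] for x, t in zip(X_train, y_train) if t == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def predict_threshold(threshold, x):
    return 1 if x[0] > threshold else 0


# Hypothetical, well-separated data: 50 non-large (0) and 50 large (1) samples.
X = [[float(i)] for i in range(50)] + [[float(i)] for i in range(100, 150)]
y = [0] * 50 + [1] * 50
scores, mean_accuracy = cross_validate(X, y, fit_threshold, predict_threshold)
print(scores, mean_accuracy)
```

Because every sample appears in exactly one test fold, each observation contributes once to the averaged accuracy.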

2.2.3. Evaluation

Two commonly used evaluation metrics for binary classification are (i) accuracy, denoting the percentage of correctly classified observations, and (ii) the area under the curve (AUC), derived from the receiver operating characteristic (ROC) curve.

Accuracy for each of the six models was determined through the confusion matrix: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) values.

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \qquad (1)

Two additional metrics were used: the true positive rate (TPR), the proportion of positive observations that are correctly classified, and the true negative rate (TNR), the proportion of negative observations that are correctly classified.

\text{TPR} = \frac{TP}{TP + FN} \qquad (2)

\text{TNR} = \frac{TN}{TN + FP} \qquad (3)

For each of the six models, a second validation test was performed by comparing the model’s TPR to its false positive rate (FPR) through the model’s receiver operating characteristic (ROC) curve. The TPR is the proportion of occurrences that the model correctly predicted as large wildfires out of all large wildfire occurrences, as given by Equation (2). The FPR is the proportion of occurrences that the model incorrectly predicted as large wildfires out of all non-large wildfire occurrences.

\text{FPR} = \frac{FP}{TN + FP} \qquad (4)
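Equations (1) to (4) translate directly into code. The sketch below uses hypothetical labels (1 for a large wildfire, 0 for a non-large wildfire); the function names and example predictions are illustrative and are not study results.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = large wildfire)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Accuracy (in percent), TPR, TNR and FPR as in Equations (1)-(4)."""
    return {
        "accuracy_pct": 100.0 * (tp + tn) / (tp + tn + fp + fn),
        "tpr": tp / (tp + fn),
        "tnr": tn / (tn + fp),
        "fpr": fp / (tn + fp),
    }


# Hypothetical predictions for ten wildfire occurrences.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts, metrics)
```

Note that FPR and TNR share the same denominator, so FPR = 1 − TNR.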

The area under the curve (AUC) is a widely used measure for validating a model’s performance. An AUC score of 0.5 indicates random classification, an AUC score between 0.5 and 0.7 is considered poor, an AUC score between 0.7 and 0.9 is considered moderate, and an AUC score above 0.9 is considered excellent.
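For intuition, the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by direct pairwise comparison, which is adequate for small samples; the function names, the toy scores, and the handling of the exact band boundaries (0.7 and 0.9) are assumptions. scikit-learn's `roc_auc_score` offers an equivalent computation for real use.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


def auc_band(auc):
    """Qualitative bands from the text; boundary handling is an assumption."""
    if auc > 0.9:
        return "excellent"
    if auc >= 0.7:
        return "moderate"
    if auc > 0.5:
        return "poor"
    return "random or worse"


# Hypothetical predicted probabilities of a large wildfire.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc, auc_band(auc))
```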

Variable importance is key to understanding which factors are most significant in large wildfire classification. To determine the variables with the most predictive ability, permutation variable importance analysis was performed. The permutation variable importance is defined as the decrease in a model score when the values of a single variable are randomly shuffled. This procedure breaks the relationship between the variable and the target; thus, the drop in the model score indicates how much the model depends on the variable. This technique benefits from being model agnostic and can be calculated many times with different permutations of the variables.
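A minimal sketch of permutation importance follows, assuming accuracy as the model score and a fixed number of repeats. The function name, seed, repeat count, toy model, and toy data are illustrative assumptions. scikit-learn exposes a comparable routine as `sklearn.inspection.permutation_importance`.

```python
import random


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when one column at a time is shuffled."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(row) == t for row, t in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this column permuted.
            permuted = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(permuted))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy model that depends only on the first feature.
def toy_predict(row):
    return 1 if row[0] > 0.5 else 0


# Hypothetical data: feature 0 determines the label, feature 1 is unrelated.
X = [[1.0 if i < 10 else 0.0, (i * 7 % 5) / 5.0] for i in range(20)]
y = [1 if i < 10 else 0 for i in range(20)]
importances = permutation_importance(toy_predict, X, y)
print(importances)
```

Shuffling the unused second feature leaves every prediction unchanged, so its importance is exactly zero, while shuffling the first feature lowers accuracy.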

3. Results

The results from the modeling process, for the six machine learning models, were evaluated for (i) model accuracy analysis, (ii) model validation, and (iii) identification of important variables from the 38 variables used in this research, as per the methodology established earlier.

3.1. Model Accuracy Analysis

For each of the six models, the accuracy score was determined by how many classifications the model correctly predicted out of the total number of predictions through k-fold cross-validation. A one-sample t-test was performed on the 10 accuracy scores generated through the k-fold cross-validation process to test whether the mean accuracy is statistically significant at a significance level of α = 0.05. A p-value less than or equal to 0.05 suggests that the observed mean is unlikely to have occurred by random chance alone. Table 2 shows the accuracy score and corresponding p-value for each of the models, where the XGBoost Classification model has the highest accuracy score and the Random Forest Classification model is a close second.
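The test statistic can be sketched as below. The text does not state the null-hypothesis mean, so the example assumes a chance-level accuracy of 0.5 for balanced classes; that value, the fold accuracies, and the function name are hypothetical. Instead of computing a p-value, the sketch compares the statistic with 2.262, the two-sided critical value of the t distribution at α = 0.05 with 9 degrees of freedom. `scipy.stats.ttest_1samp` returns the p-value directly if SciPy is available.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(scores, null_mean):
    """t statistic for H0: the population mean of scores equals null_mean."""
    n = len(scores)
    standard_error = stdev(scores) / sqrt(n)  # stdev uses the n-1 denominator
    return (mean(scores) - null_mean) / standard_error


# Two-sided critical value for alpha = 0.05 and 10 - 1 = 9 degrees of freedom.
T_CRITICAL_DF9 = 2.262

# Hypothetical accuracies from 10 folds (not the study's results).
fold_accuracies = [0.90, 0.88, 0.92, 0.91, 0.89, 0.90, 0.93, 0.87, 0.91, 0.89]

# Assumed null hypothesis: accuracy no better than chance (0.5).
t_stat = one_sample_t(fold_accuracies, null_mean=0.5)
significant = abs(t_stat) > T_CRITICAL_DF9
print(round(t_stat, 2), significant)
```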

3.2. Model Validation

3.2.1. Confusion Matrix

For each of the six models, the first of the two validation tests compared the actual wildfire classification to the model’s predicted wildfire classification through its confusion matrix, which also confirms that the data input to the model were balanced. This is represented across the six models in Figure 5. The TP are represented in the bottom right quadrant, the TN in the top left quadrant, the FP in the top right quadrant, and the FN in the bottom left quadrant. The XGBoost Classification model was found to have performed the best due to its high TPR and TNR, with the Random Forest Classification model being a close second.

3.2.2. AUC Score

The AUC score for each of the six machine learning classification models was calculated based on the ROC curve. This is represented across the six models in Figure 6. The XGBoost Classification model performed the best because it had the highest AUC score, while the Random Forest Classification model had the second highest AUC score.

3.2.3. Identification of Important Variables

Finally, we identified the importance of each of the 38 variables (18 environmental and 20 atmospheric variables) used. The top two classification models, Random Forest and XGBoost, show similar trends. By concentrating on these two top-performing models, this analysis prioritizes understanding the key factors contributing to their success. This offers insight into the variables that have the most significant impact on accurate predictions and provides a more targeted and efficient way to interpret the model outcomes. Figure 7 shows each variable’s mean accuracy decrease for the Random Forest and XGBoost Classification models (refer to Table 3 for the color code used in Figure 7). The further to the right a bar extends, the more important that variable is to a model’s predictions. The environmental variables LST Night, LST Day, and LAI, as well as the atmospheric variables geopotential and relative humidity, were determined to be the most significant.

One possible explanation for the results shown in Figure 7 is the following scenario. Two years before a fire, a given region experiences favorable conditions, leading to a high LAI and dense vegetation cover. One year before the fire, the region sees elevated temperatures, which stress the vegetation that thrived during the high LAI period. This leads to increased vegetation dryness and the accumulation of excess forest fuel. At the time of the fire, anomalies in geopotential height indicate atmospheric circulation patterns that favor dry and stable conditions. This results in lower relative humidity levels, contributing to the overall fire risk. In this scenario, the combination of high LAI two years previously, high LST Day one year previously, and unfavorable atmospheric conditions creates circumstances conducive to wildfires. The excess dry fuel, coupled with low relative humidity and potentially other ignition sources, could lead to rapid spread, resulting in a large wildfire burning over 125 hectares.

4. Discussion

In this project, 2109 wildfire occurrences across the United States, from 2000 to 2020, were analyzed. Easily accessible data were retrieved from USDA, the NASA MODIS remote sensor, and ERA5 reanalysis data. Six machine learning models—Logistic Regression, Decision Tree Classification, Random Forest Classification, XGBoost Classification, KNN Classification, and SVM Classification—were developed to predict whether a large wildfire would develop by incorporating the data. Additionally, the most important variables for the two top-performing models were identified.

The XGBoost Classification model performed the best in predicting large wildfires, with an accuracy score of 90.44%. The Random Forest Classification model performed second best, with an accuracy score of 87.62% and TPR, TNR, and AUC metrics comparable to those of the XGBoost Classification model. Furthermore, both models showed similar trends, with the environmental variables LST Night, LST Day, and LAI, as well as the atmospheric variables geopotential and relative humidity, being the most significant.

This integration of diverse and refined datasets enables a more holistic approach to fire modeling. The XGBoost Classification model created here can assimilate real-world data with high accuracy and reliability, a feature that is not present in the existing FARSITE model [47]. Moreover, the geographic flexibility of the MODIS remote sensing data and the ERA5 reanalysis data allows the XGBoost Classification model to be adapted to different regions, thereby overcoming the regional limitations of the existing FIRECAST model, which was applied to the Rocky Mountains region only.

Recently, the Federal Government established the Justice40 Initiative. Through this initiative, 40% of the benefits of federal assistance will go to disadvantaged communities so that these overburdened communities can receive the vital resources they need [48]. The Justice40 Initiative takes into account several indicators that have been collected from a wide variety of sources, including the U.S. Census Bureau, Environmental Protection Agency, Centers for Disease Control and Prevention, Department of Transportation, Department of Energy, Federal Emergency Management Agency, and Department of Housing and Urban Development [49]. These indicators are then used to determine whether a community is disadvantaged.

One of the programs that the Justice40 Initiative covers is “Reducing Wildfire Risk to Tribes, Underserved, and Socially Vulnerable Communities.” The Fiscal Year 2024 Budget provides USD 323 million to the USDA and USD 314 million to the Department of the Interior to help reduce the risk and severity of wildfires [50]. With a limited budget and limited resources available, it is imperative to allocate resources judiciously and equitably. To that end, we performed a spatial analysis depicting where disadvantaged communities and wildfires predicted by the XGBoost Classification model overlap across the United States, as shown in Figure 8. This spatial analysis highlights vulnerable disadvantaged geographical areas that are impacted by large wildfires (circled in black: Oklahoma and Northern California) and non-large wildfires (circled in green: New Jersey, Kentucky, Arkansas, and Florida). Such areas should be treated with high priority for federal assistance and, per the Justice40 budget, receive nearly USD 255 million to safeguard against wildfires.
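The USD 255 million figure is consistent with applying the 40% Justice40 share to the combined USDA and Department of the Interior amounts. This derivation is our reading of the numbers above and is not stated explicitly in the budget citation:

```latex
0.40 \times (323 + 314)\ \text{million USD} = 0.40 \times 637\ \text{million USD} = 254.8\ \text{million USD} \approx 255\ \text{million USD}
```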

Additionally, this study highlights the 38 variables’ importance for each of the six machine learning models developed. However, the variables in this research are not all-inclusive. For instance, this study does not incorporate how human impacts and behavior such as those that cause wildfires through ignition, suppression, or altering fuel distribution affect wildfire burned area size. Future research is required to better understand how human activity contributes to climate change and what it means for wildfire prediction.

Wildfires in ecosystems are natural and crucial for certain plant species, such as redwoods in California, whose cones rely on the heat from fires to trigger seed germination. However, machine learning models, including the XGBoost model developed in this research, do not fully grasp these nuanced ecological cycles and adaptations. These models, built on historical data, may struggle to adapt to the intricacies of dynamic natural processes. Future research should not only draw from historical data but also be dynamic and adaptive, capable of responding to evolving ecological conditions.

5. Conclusions

This study aims to predict whether large wildfires will develop by leveraging machine learning classification models. Large wildfires are complex events influenced by various environmental and atmospheric factors, making them challenging to predict accurately. We used data from NASA MODIS and ERA5 to capture a comprehensive overview of wildfires across the United States, allowing for a more general application for large wildfire prediction compared to other existing fire models, namely FARSITE and FIRECAST.

In this study, we developed and compared the prediction performances of six different machine learning classification models in predicting whether large wildfires would develop. The models were trained on features extracted from the environmental and atmospheric variables present in the dataset. The results indicate that the XGBoost Classification model outperformed the other five models across all metrics presented, achieving an accuracy score of 90.44%.

Additionally, fire safety organizations can leverage the XGBoost Classification model developed in this research to predict large wildfires with greater accuracy. By forecasting the development of large wildfires, these organizations can implement protective safeguards early, potentially reducing the spread of wildfires and mitigating their impact.

Moreover, the study highlights the potential for improved resource allocation. By predicting large wildfires more accurately, fire safety organizations can allocate federal aid and resources more effectively and economically. This targeted allocation becomes especially crucial for supporting disadvantaged communities that are disproportionately burdened and impacted by large wildfires.

This study emphasizes the significance of using the XGBoost Classification model developed here as a tool for predicting large wildfires across the United States and applying it towards environmental justice. This suggests a broader societal impact, as accurate wildfire predictions can contribute to minimizing the disparate impact of wildfires on different communities, aligning with the broader goals of environmental justice and equitable resource distribution.

Author Contributions

Conceptualization, N.A.; methodology, N.A.; software, N.A.; validation, N.A.; formal analysis, N.A.; investigation, N.A.; resources, N.A.; data curation, N.A.; writing—original draft preparation, N.A.; writing—review and editing, N.A., P.V.N. and R.D.L.; visualization, N.A.; supervision, N.A. and R.D.L.; project administration, N.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used in this project can be found on Zenodo [40]. The code used in this project can also be found on Zenodo [41].

Acknowledgments

Nikita Agrawal would like to thank Russanne Low for her mentorship and the NASA STEM Enhancement in Earth Science program for their support and guidance. Nikita Agrawal would also like to thank Anna Gallardo, Andrew Mauer-Oats, and Patrick McGuire for their advice and support.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. National Interagency Coordination Center Wildland Fire Summary and Statistics Annual Report 2021. Available online: https://www.nifc.gov/sites/default/files/NICC/2-Predictive%20Services/Intelligence/Annual%20Reports/2021/annual_report_0.pdf (accessed on 1 August 2023).
2. Explainer: How Wildfires Start and Spread. Available online: https://cnr.ncsu.edu/news/2021/12/explainer-how-wildfires-start-and-spread (accessed on 1 August 2023).
3. Federal Wildland Fire Policy Terms and Definitions. Available online: https://www.nwcg.gov/sites/default/files/docs/eb-fmb-m-19-004a.pdf (accessed on 1 August 2023).
4. Air Pollutant Emissions Trends Data. Available online: https://www.epa.gov/air-emissions-inventories/air-pollutant-emissions-trends-data (accessed on 1 August 2023).
5. Buis, A. The Climate Connections of a Record Fire Year in the U.S. West—Climate Change: Vital Signs of the Planet. NASA. 2021. Available online: https://climate.nasa.gov/explore/ask-nasa-climate/3066/the-climate-connections-of-a-record-fire-year-in-the-us-west/ (accessed on 1 August 2023).
6. Aguilera, R.; Corringham, T.; Gershunov, A.; Benmarhnia, T. Wildfire smoke impacts respiratory health more than fine particles from other sources: Observational evidence from Southern California. Nat. Commun. 2021, 12, 1493.
7. Francos, M.; Úbeda, X. Prescribed fire management. Curr. Opin. Environ. Sci. Health 2021, 21, 100250.
8. Climate Change Widespread, Rapid, and Intensifying. Available online: https://www.ipcc.ch/2021/08/09/ar6-wg1-20210809-pr/ (accessed on 1 August 2023).
9. Wehner, M.F.; Arnold, J.R.; Knutson, T.; Kunkel, K.E.; LeGrande, A.N. Ch. 8: Droughts, floods, and wildfires. Clim. Sci. Spec. Rep. Fourth Natl. Clim. Assess. 2017, 1, 231–256.
10. Keeley, J.E. Fire intensity, fire severity and burn severity: A brief review and suggested usage. Int. J. Wildland Fire 2009, 18, 116.
11. Bowman, D.M.J.S.; Balch, J.K.; Artaxo, P.; Bond, W.J.; Carlson, J.M.; Cochrane, M.A.; D’Antonio, C.M.; DeFries, R.S.; Doyle, J.C.; Harrison, S.P.; et al. Fire in the Earth System. Science 2009, 324, 481–484.
12. National Interagency Coordination Center. Wildfires and Acres. National Interagency Fire Center. Available online: https://www.nifc.gov/fire-information/statistics/wildfires (accessed on 1 August 2023).
13. Wildland Fire and Climate Change. Available online: www.fs.usda.gov/ccrc/topics/wildfire (accessed on 1 August 2023).
14. Jain, P.; Coogan, S.C.P.; Subramanian, S.G.; Crowley, M.; Taylor, S.W.; Flannigan, M.D. A review of machine learning applications in wildfire science and management. Environ. Rev. 2020, 28, 478–505.
15. Shmuel, A.; Heifetz, E. Global Wildfire Susceptibility Mapping Based on Machine Learning Models. Forests 2022, 13, 1050.
16. Bolt, A.; Huston, C.; Kuhnert, P.; Dabrowski, J.J.; Hilton, J.; Sanderson, C. A spatio-temporal neural network forecasting approach for emulation of Firefront models. In Proceedings of the Signal Processing: Algorithms, Architectures, Arrangements, and Applications, Poznan, Poland, 21–22 September 2022; pp. 110–115.
17. Radke, D.; Hessler, A.; Ellsworth, D. FireCast: Leveraging deep learning to predict wildfire spread. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019.
18. Wildfire Burn Area Prediction. Available online: https://cs229.stanford.edu/proj2019aut/data/assignment_308832_raw/26582553.pdf (accessed on 1 August 2023).
19. Coffield, S.R.; Graff, C.A.; Chen, Y.; Smyth, P.; Foufoula-Georgiou, E.; Randerson, J.T. Machine learning to predict final fire size at the time of ignition. Int. J. Wildland Fire 2019, 28, 861–873.
20. Wildfires Data Pathfinder. Available online: https://www.earthdata.nasa.gov/learn/pathfinders/wildfires-data-pathfinder#:~:text=%20Data%20collected%20by%20sensors%20aboard,impact%20of%20an%20historic%20burn (accessed on 1 August 2023).
21. Yang, J.; Gong, P.; Fu, R.; Zhang, M.; Chen, J.; Liang, S.; Xu, B.; Shi, J.; Dickinson, R. The role of Satellite Remote Sensing in climate change studies. Nat. Clim. Change 2013, 3, 875–883.
22. MODIS. Available online: https://modis.gsfc.nasa.gov/about/ (accessed on 1 August 2023).
23. Botje, D.; Dewan, A.; Chakraborty, T. Comparing Coarse-Resolution Land Surface Temperature Products over Western Australia. Remote Sens. 2022, 14, 2296.
24. Rabier, F. Overview of global data assimilation developments in numerical weather-prediction centres. Q. J. R. Meteorol. Soc. 2005, 131, 3215–3233.
25. Dee, D.P.; Uppala, S.M.; Simmons, A.J.; Berrisford, P.; Poli, P.; Kobayashi, S.; Andrae, U.; Balmaseda, M.A.; Balsamo, G.; Bauer, P.; et al. The ERA-Interim reanalysis: Configuration and performance of the data assimilation system. Q. J. R. Meteorol. Soc. 2011, 137, 553–597.
26. Parker, W.S. Reanalyses and Observations: What’s the Difference? Bull. Am. Meteorol. Soc. 2016, 97, 1565–1572.
27. Tarek, M.; Brissette, F.P.; Arsenault, R. Evaluation of the ERA5 reanalysis as a potential reference dataset for hydrological modelling over North America. Hydrol. Earth Syst. Sci. 2020, 24, 2527–2544.
28. Holsinger, L.; Parks, S.A.; Miller, C. Weather, fuels, and topography impede wildland fire spread in western US Landscapes. For. Ecol. Manag. 2016, 380, 59–69.
29. Short, K.C. Spatial Wildfire Occurrence Data for the United States, 1992–2020 [FPA_FOD_20221014], 6th ed.; Forest Service Research Data Archive: Fort Collins, CO, USA, 2022.
30. NWCG. NWCG Data Standards, PMS 910. National Wildfire Coordinating Group. 2021. Available online: https://www.nwcg.gov/data-standards (accessed on 1 August 2023).
31. Didan, K. MOD13Q1 MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V006 [Data set]. NASA EOSDIS Land Processes Distributed Active Archive Center. 2015. Available online: https://lpdaac.usgs.gov/products/mod13q1v006/ (accessed on 1 August 2023).
32. Myneni, R.; Knyazikhin, Y.; Park, T. MOD15A2H MODIS/Terra Leaf Area Index/FPAR 8-Day L4 Global 500m SIN Grid V006 [Data set]. NASA EOSDIS Land Processes Distributed Active Archive Center. 2015. Available online: https://lpdaac.usgs.gov/products/mod15a2hv006/ (accessed on 1 August 2023).
33. Wan, Z.; Hook, S.; Hulley, G. MODIS/Aqua Land Surface Temperature/Emissivity 8-Day L3 Global 1km SIN Grid V061 [Data set]. NASA EOSDIS Land Processes Distributed Active Archive Center. 2021. Available online: https://lpdaac.usgs.gov/products/myd11a2v061/ (accessed on 1 August 2023).
34. MODIS Vegetation Index Products (NDVI and EVI). NASA. Available online: https://modis.gsfc.nasa.gov/data/dataprod/mod13.php (accessed on 1 August 2023).
35. MODIS Leaf Area Index/FPAR. NASA. Available online: https://modis.gsfc.nasa.gov/data/dataprod/mod15.php (accessed on 1 August 2023).
36. Wen, Z. Modis Land Surface Temperature Products—USGS. Collection-6 MODIS Land Surface Temperature Products Users’ Guide. 2013. Available online: https://lpdaac.usgs.gov/documents/118/MOD11_User_Guide_V6.pdf (accessed on 1 August 2023).
37. ECMWF. ERA5: Data Documentation. ECMWF Confluence. 2023. Available online: https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation (accessed on 1 August 2023).
38. Alizadeh, M.R.; Abatzoglou, J.T.; Luce, C.H.; Adamowski, J.F.; Farid, A.; Sadegh, M. Warming enabled upslope advance in western US forest fires. Proc. Natl. Acad. Sci. USA 2021, 118, e2009717118.
39. Alizadeh, M.R.; Abatzoglou, J.T.; Adamowski, J.; Modaresi Rad, A.; AghaKouchak, A.; Pausata, F.S.; Sadegh, M. Elevation-dependent intensification of fire danger in the Western United States. Nat. Commun. 2023, 14, 1773.
40. Agrawal, N. Wildfire Data, Zenodo [Data Set], 2023. Available online: https://zenodo.org/records/10042739 (accessed on 1 August 2023).
41. Agrawal, N. nagrawa6/Wildfire: v2.0.0 (v2.0.0). Zenodo. 2023. Available online: https://zenodo.org/records/6939189 (accessed on 1 August 2023).
42. Ruffault, J.; Curt, T.; Moron, V.; Trigo, R.M.; Mouillot, F.; Koutsias, N.; Pimont, F.; Martin-StPaul, N.; Barbero, R.; Dupuy, J.L.; et al. Increased likelihood of heat-induced large wildfires in the Mediterranean Basin. Sci. Rep. 2020, 10, 13790.
43. Schag, G.M.; Stow, D.A.; Riggan, P.J.; Nara, A. Spatial-Statistical Analysis of Landscape-Level Wildfire Rate of Spread. Remote Sens. 2022, 14, 3980.
44. Pahuja, N.K.; Rivero, M.H. Predicting the impact of wildfire using machine learning techniques to assist effective deployment of resources. In Proceedings of the 2022 International Conference on Computational Science and Computational Intelligence (CSCI), Hong Kong, China, 25 September 2022.
45. Paryudi, I. What Affects K Value Selection In K-Nearest Neighbor. Int. J. Sci. Technol. Res. 2019, 8, 86–92.
46. Scikit Learn. (n.d.). 3.1. Cross-Validation: Evaluating Estimator Performance. Available online: https://scikit-learn.org/stable/modules/cross_validation.html (accessed on 1 August 2023).
47. Srivas, T.; Artés, T.; de Callafon, R.A.; Altintas, I. Wildfire spread prediction and assimilation for FARSITE using ensemble Kalman filtering 1. Procedia Comput. Sci. 2016, 80, 897–908.
48. Justice40 Initiative. Available online: https://www.whitehouse.gov/environmentaljustice/justice40/ (accessed on 1 August 2023).
49. Climate and Economic Justice Screening Tool Downloads. Available online: https://screeningtool.geoplatform.gov/en/downloads#7.83/46.314/-88.988 (accessed on 1 August 2023).
50. Office of Management and Budget. Budget of the U.S. Government. The White House. Available online: https://www.whitehouse.gov/wp-content/uploads/2023/03/budget_fy2024.pdf (accessed on 9 March 2023).

Figure 1. Graph depicting that although the number of annual wildfires has decreased from 1985 to 2022, the burned area has increased over the past 35 years (data from [12]). The red line shows the number of wildfire occurrences (in thousands), and the gray shaded area shows the hectares burned (in millions).

Figure 2. Map depicting the 2109 wildfire sites across the United States used in this project per the NICC ratio. The red points represent large wildfire occurrences with a burned area of greater than or equal to 125 hectares. The purple points represent non-large wildfire occurrences with a burned area of less than 125 hectares: (a) wildfire sites sampled in Alaska; (b) wildfire sites sampled in the continental United States; (c) wildfire sites sampled in Hawaii.

Figure 3. Flowchart depicting the logical subsystems in the methodology of this research.

Figure 4. An example of the 10 km by 10 km geographical grid for the MODIS LAI and FPAR data retrieved for a sample wildfire site. The purple square represents the geographical coordinate of the wildfire’s location of origination, whereas the surrounding white squares represent the geographical pixels in the surrounding vicinity.

Figure 5. Validation of actual vs. predicted large wildfire classification through confusion matrix: (a) Logistic Regression model with a TPR of 0.75 and a TNR of 0.64; (b) Decision Tree Classification model with a TPR of 0.83 and a TNR of 0.67; (c) Random Forest Classification model with a TPR of 0.86 and a TNR of 0.88; (d) XGBoost Classification model with a TPR of 0.92 and a TNR of 0.88; (e) KNN Classification model with a TPR of 0.75 and a TNR of 0.57; (f) SVM Classification model with a TPR of 0.78 and a TNR of 0.60.
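The TPR and TNR values quoted in the Figure 5 caption follow directly from the four cells of a confusion matrix. The sketch below shows that arithmetic; the function name `confusion_rates` is hypothetical, and the counts are made up for illustration rather than taken from the study.

```python
def confusion_rates(tp, fn, tn, fp):
    """Return (TPR, TNR, accuracy) from confusion-matrix counts."""
    tpr = tp / (tp + fn)            # share of actual large wildfires caught
    tnr = tn / (tn + fp)            # share of non-large wildfires correctly rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return tpr, tnr, accuracy


# Illustrative counts only (assumed, not the paper's confusion matrix)
tpr, tnr, acc = confusion_rates(tp=92, fn=8, tn=88, fp=12)
print(f"TPR={tpr:.2f} TNR={tnr:.2f} accuracy={acc:.2%}")
```

Because accuracy weights TPR and TNR by the number of samples in each class, the same pair of rates can yield a slightly different accuracy when the classes are not equally sized.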

Figure 6. Validation of actual vs. predicted large wildfire classification through ROC Curve: (a) Logistic Regression model; (b) Decision Tree Classification model; (c) Random Forest Classification model; (d) XGBoost Classification model; (e) KNN Classification model; (f) SVM Classification model.
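An ROC curve is commonly summarized by the area under it (AUC), which equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise definition; the function name `auc_score` is hypothetical, and the scores and labels are invented for illustration.

```python
def auc_score(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented model scores; label 1 = large wildfire, 0 = non-large wildfire
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
example_auc = auc_score(scores, labels)
print(example_auc)
```

In this toy case three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75; a value of 0.5 corresponds to random ranking and 1.0 to perfect separation.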

Figure 7. Mean accuracy decrease, measuring variable importance: (a) Random Forest Classification model; (b) XGBoost Classification model.
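One common way to obtain a mean accuracy decrease is permutation importance: shuffle one variable at a time and record how much the model's accuracy drops. The caption does not state how the values in Figure 7 were computed, so the sketch below is an assumption about the general technique, not the study's code. The toy data, the stand-in `predict` rule, and the function names are all hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def mean_accuracy_decrease(predict, rows, labels, n_features, repeats=20, seed=0):
    """Average drop in accuracy after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    decreases = []
    for j in range(n_features):
        total = 0.0
        for _ in range(repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            total += baseline - accuracy(predict, shuffled, labels)
        decreases.append(total / repeats)
    return decreases


def predict(row):
    """Stand-in for a trained classifier: uses feature 0 only."""
    return int(row[0] > 0.5)


# Toy data (assumed): the label depends on feature 0 and ignores feature 1
data_rng = random.Random(42)
rows = [(data_rng.random(), data_rng.random()) for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]

importances = mean_accuracy_decrease(predict, rows, labels, n_features=2)
print(importances)
```

Shuffling the informative feature lowers accuracy substantially, while shuffling the ignored feature leaves it unchanged, which is the pattern a variable importance plot is meant to reveal.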

Figure 8. Map of the United States depicting disadvantaged geographical areas and the wildfires predicted by the XGBoost Classification model. The red points represent large wildfire occurrences from 2018 to 2020 with a burned area of greater than or equal to 125 hectares. The purple points represent non-large wildfire occurrences with a burned area of less than 125 hectares. The dark orange areas represent disadvantaged communities per the Justice40 Initiative. The black circles represent environmentally disadvantaged communities that are impacted by large wildfires. The green circles represent environmentally disadvantaged communities that are impacted by non-large wildfires: (a) Alaska; (b) continental United States; (c) Hawaii.

Table 1. Variables used in this research project and their source.

Variable Name	Source
Normalized Difference Vegetation Index (NDVI)	MODIS (Product: MOD13Q1)
Enhanced Vegetation Index (EVI)	MODIS (Product: MOD13Q1)
Leaf Area Index (LAI)	MODIS (Product: MOD15A2H)
Fraction of Photosynthetically Active Radiation (FPAR)	MODIS (Product: MOD15A2H)
Land Surface Temperature during the Day (LST Day)	MODIS (Product: MYD11A2)
Land Surface Temperature during the Night (LST Night)	MODIS (Product: MYD11A2)
u component of wind (eastward wind)	ERA5
v component of wind (northward wind)	ERA5
Relative humidity	ERA5
Temperature	ERA5
Geopotential	ERA5

Table 2. Accuracy score and significance level of the six machine learning models used in this project. A p-value less than or equal to 0.05 indicates statistical significance and is shown as bolded.

Model Type	Accuracy Score	Significance Level
Logistic Regression	69.81%	p-value = 0.4776
Decision Tree Classification	80.19%	p-value = 0.6029
Random Forest Classification	87.62%	p-value = 0.04664
XGBoost Classification	90.44%	p-value = 0.04727
KNN Classification	67.48%	p-value = 0.2949
SVM Classification	69.95%	p-value = 0.1454

Table 3. Legend of the colors used in Figure 7 (color swatches not recoverable from the source).

Variable Type
v component of wind
u component of wind
temperature
relative humidity
geopotential
LST Night
LST Day
LAI
FPAR
NDVI
EVI

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
