3.1. Optimization of ERA5 Data
For the ERA5 data within the range of 117.5°E to 119.5°E longitude and 24.5°N to 25.5°N latitude, this research utilized the data optimization method proposed in
Section 2.2 to increase the spatial resolution to 0.01° × 0.01° (approximately 1 km × 1 km) and enhance the temporal resolution to 15 min. For example, considering the relative humidity at 500 hPa pressure at 08:00 on 1 January 2020,
Figure 3a illustrates the original resolution, while
Figure 3b represents the optimized data resolution.
Similarly, we optimized several typical ERA5 parameters at 08:00 on 1 January 2020, and the results are shown in
Figure 4.
The original data have a temporal resolution of 1 h, which is long compared to the duration of thunderstorm activity. Such coarse sampling disperses the data and consequently leads to poor results in correlation analysis. In this research, for the entire year of 2020, with a total of 8784 h, we generated 35,136 time labels at 15 min intervals via the linear interpolation optimization method. To illustrate this, we present the relative humidity at 850 hPa from 02:00 to 05:00 on 1 January 2020, as shown in
Figure 5.
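The temporal step can be illustrated with a minimal sketch of linear interpolation from hourly values to 15 min labels at a single grid point. The function name and the humidity values are hypothetical and serve only as an illustration; the actual optimization method is the one described in Section 2.2, and a production pipeline would more likely operate on whole gridded arrays. Four labels per hour over 8784 h give the 35,136 time labels mentioned above.

```python
from datetime import datetime, timedelta

def interpolate_to_15min(times, values, step_minutes=15):
    """Linearly interpolate a series onto a finer, regular time axis.

    `times` must be sorted datetimes; `values` are the samples at those times.
    """
    step = timedelta(minutes=step_minutes)
    out_times, out_values = [], []
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        v0, v1 = values[i], values[i + 1]
        t = t0
        while t < t1:
            frac = (t - t0) / (t1 - t0)  # timedelta / timedelta gives a float
            out_times.append(t)
            out_values.append(v0 + frac * (v1 - v0))
            t += step
    # Close the series with the final original sample.
    out_times.append(times[-1])
    out_values.append(values[-1])
    return out_times, out_values

# Hypothetical hourly relative humidity (%) at one grid point, 02:00 to 05:00.
hours = [datetime(2020, 1, 1, h) for h in (2, 3, 4, 5)]
rh = [80.0, 84.0, 82.0, 90.0]
t15, rh15 = interpolate_to_15min(hours, rh)
```

In this example, three hourly steps yield 13 labels, and the value at 02:15 lies one quarter of the way between the 02:00 and 03:00 samples.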
3.2. Analysis of the Chi-Squared Test Results
In 2020, a total of 13,584 data points were recorded by the lightning location system in the target area. After partitioning the lightning strikes by time and space intervals, we merged lightning activities occurring at the same time within the same grid point, obtaining a total of 8978 data points. This data volume is far smaller than the grid of meteorological parameters, which comprises 20,301 × 35,136 points. This indicates that lightning and non-lightning conditions are highly imbalanced in the natural environment. The chi-squared test requires samples of both lightning occurrences and non-lightning occurrences, and the chosen ratio of positive to negative samples directly affects the results.
Figure 6 shows the chi-squared test results under different sampling ratios.
As can be seen from Figure 6, increasing the sampling proportion increases the sample size, and the chi-squared test results increase accordingly. This reflects the inherent contradiction of the chi-squared test under large sample sizes between theoretical results and practical applicability. According to the chi-squared expression, if the distribution proportions of the samples remain unchanged and the total sample size increases tenfold, the corresponding chi-squared value also increases tenfold. In this case, with the significance level and degrees of freedom unchanged, the judgment result may change. In addition to the sampling ratio, the method of dividing sample intervals directly affects the accuracy of the chi-squared test. The test requires each interval to contain a sufficient number of samples; otherwise, the calculated results may deviate from the actual values. When conducting chi-squared tests on our parameters, we found that empty intervals occasionally occurred when the number of divisions exceeded 10 and became more frequent when it exceeded 15. Too few divisions, on the other hand, can also lead to inaccurate analysis of parameter correlations. For example, in
Figure 7, the chi-squared test results fluctuate for 3–7 intervals, while the results for 7–15 intervals tend to stabilize. For these reasons, we chose 10 intervals as the division standard.
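The sample-size effect described above can be checked numerically. The following sketch computes the Pearson chi-squared statistic for a contingency table (rows: non-lightning and lightning samples; columns: parameter intervals) and builds such a table with equally spaced intervals. The function names and the counts are hypothetical and chosen only for illustration.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:  # an empty interval has zero expected count; skip it
                stat += (observed - expected) ** 2 / expected
    return stat

def contingency(values, labels, n_bins=10):
    """2 x n_bins table of counts using equally spaced intervals.

    `labels` are 0 (no lightning) or 1 (lightning).
    """
    lo, hi = min(values), max(values)
    table = [[0] * n_bins for _ in range(2)]
    for x, y in zip(values, labels):
        k = 0 if hi == lo else int((x - lo) / (hi - lo) * n_bins)
        table[y][min(k, n_bins - 1)] += 1  # the maximum value goes in the last bin
    return table

# Hypothetical counts: same proportions, ten times the sample size.
small = [[30, 20, 10], [10, 20, 30]]
large = [[10 * c for c in row] for row in small]
chi_small = chi_squared(small)  # 20.0
chi_large = chi_squared(large)  # 200.0
```

With identical proportions, the statistic grows from 20 to 200 when the counts are multiplied by ten, while the degrees of freedom stay the same. This is why the magnitude of the statistic, rather than its significance level, is used for comparison here.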
Based on the above results, dividing the data into 10 equally spaced intervals produced better results for the correlation analysis between LLS data and ERA5 data. Due to the inherent limitations of the chi-squared test, its significance test loses credibility when the sample size is large. The magnitude of the chi-squared test value can therefore only be used to judge the relative strength of the data correlation.
Figure 8 shows the chi-squared test results and differences between adjacent parameters for typical ERA5 parameters, with positive and negative samples extracted in a 1:1 ratio and divided into 10 equally spaced intervals.
Figure 8 shows that when the typical parameters are arranged in descending order of their chi-squared results, a clear gap is visible between parameter u10 and parameter u700. Using this as a dividing line (marked with a red line in
Figure 8), the target parameters can be classified into a set of strongly correlated parameters and a set of weakly correlated parameters.
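The dividing line in Figure 8 was identified visually. One way to reproduce such a split programmatically, offered here only as an assumed heuristic and not as the procedure used in this research, is to sort the parameters by chi-squared value and cut at the largest drop between adjacent parameters. The scores below are hypothetical, and the names r850 and v500 are illustrative placeholders.

```python
def split_by_largest_gap(scores):
    """Sort parameters by score (descending) and split at the largest drop
    between adjacent parameters. Returns (strong, weak) name lists."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = gaps.index(max(gaps))  # index of the last "strong" parameter
    strong = [name for name, _ in ranked[:cut + 1]]
    weak = [name for name, _ in ranked[cut + 1:]]
    return strong, weak

# Hypothetical chi-squared values, not the values plotted in Figure 8.
scores = {"t2m": 900.0, "r850": 750.0, "u10": 600.0, "u700": 150.0, "v500": 100.0}
strong, weak = split_by_largest_gap(scores)
```

Here the largest drop lies between u10 and u700, so u10 is the last parameter in the strongly correlated set.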
3.3. Apriori Association Rule Extraction
Based on the analysis in
Section 3.2, we selected the u10 parameter as the critical parameter and retained the parameters with correlations stronger than this parameter for subsequent analysis. We then extracted lightning occurrence samples and non-lightning samples in a 1:1 ratio and divided each parameter into 10 equally spaced intervals. Part of the resulting parameter matrix is shown in
Table 3 and
Appendix A, where A–M correspond to different parameters and numbers correspond to different intervals. Temperature-related parameters are given in K, humidity-related parameters as percentages, and wind-related parameters in m/s.
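The encoding of continuous parameters into labeled intervals can be sketched as follows. The letter mapping, the value ranges, and the 0-based interval numbering are assumptions made for this example; the actual ranges and labels are those of Table 3 and Appendix A.

```python
def to_item(letter, x, lo, hi, n_bins=10):
    """Label such as 'A7': parameter letter plus the index of the equally
    spaced interval of [lo, hi] that contains x (0-based, assumed)."""
    width = (hi - lo) / n_bins
    k = min(max(int((x - lo) / width), 0), n_bins - 1)
    return f"{letter}{k}"

def encode_row(row, ranges, letters):
    """Turn one sample {parameter: value} into a sorted list of item labels."""
    return sorted(to_item(letters[name], value, *ranges[name])
                  for name, value in row.items())

# Hypothetical sample, ranges, and letter assignment.
letters = {"t2m": "A", "r850": "E", "u10": "H"}
ranges = {"t2m": (270.0, 300.0), "r850": (0.0, 100.0), "u10": (-10.0, 10.0)}
row = {"t2m": 291.3, "r850": 85.0, "u10": -2.0}
items = encode_row(row, ranges, letters)
```

Each encoded row, together with a lightning or non-lightning label, forms one transaction for the association rule extraction.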
Using the Apriori algorithm, we extracted high-frequency data intervals corresponding to each parameter and lightning occurrences, as shown in
Figure 9 and
Table 4. The confidence of all extracted strong association rules was greater than 0.6. When the environmental conditions fall within the intervals in
Table 4, lightning phenomena are more likely to occur. The parameters in this research can be classified into three types: A–D are related to temperature, E–G are related to humidity, and H–M are related to wind speed.
With a support threshold of 0.05 and lightning occurrence as the target outcome, association rules were derived from the entire database. The ten strongest association rules are presented in
Table 5. The number of times each parameter interval occurs within these ten rules is shown in
Figure 10.
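For readers unfamiliar with the procedure, the following is a minimal pure-Python sketch of Apriori frequent itemset mining followed by rule extraction for a fixed target item. The transactions and the label "L" for lightning occurrence are hypothetical; only the support threshold of 0.05 and the confidence threshold of 0.6 are taken from the text. It is an illustration of the technique, not the implementation used in this research.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    sets = [frozenset(t) for t in transactions]
    n = len(sets)
    current = {frozenset([item]) for t in sets for item in t}
    frequent, k = {}, 1
    while current:
        counts = {c: sum(1 for t in sets if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

def rules_for_target(frequent, target, min_confidence):
    """Rules antecedent -> target as (antecedent, support, confidence) tuples."""
    rules = []
    for itemset, supp in frequent.items():
        if target in itemset and len(itemset) > 1:
            antecedent = itemset - {target}
            conf = supp / frequent[antecedent]  # supp(X and target) / supp(X)
            if conf >= min_confidence:
                rules.append((sorted(antecedent), supp, conf))
    return sorted(rules, key=lambda r: (-r[2], -r[1], r[0]))

# Hypothetical transactions; "L" marks a lightning occurrence.
transactions = [
    {"A7", "E8", "L"}, {"A7", "E8", "L"}, {"A7", "E2"},
    {"A3", "E8", "L"}, {"A3", "E2"}, {"A3", "E2"},
]
frequent = apriori(transactions, min_support=0.05)
rules = rules_for_target(frequent, target="L", min_confidence=0.6)
```

In this toy dataset, the rule with antecedent A7 has a confidence of 2/3 and is kept, whereas the rule with antecedent A3 has a confidence of 1/3 and is discarded.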
The data involved in this research can generally be categorized into three types: temperature, humidity, and wind speed. The previous analysis suggests that, for a particular area, the more parameters fall into strong association intervals, the higher the probability of lightning occurrence. However, when lightning occurs, the strong association intervals of different parameters of the same type often occur simultaneously. In such cases, it is essential to consider not only the number of parameters falling into strong association intervals but also the types of parameters involved. For example, the association rules in
Table 5 mainly involve temperature, humidity, and wind speed types. Taking temperature parameters as an example,
Figure 11 demonstrates the correlation between temperature-related parameters A–D. The support for each strong association parameter interval generally exceeds 0.3, and the confidence exceeds 0.6. This confirms that strong association intervals of the same type of parameter tend to occur simultaneously. In lightning warning, it is therefore advisable to reduce the weight of additional parameters of the same type once one parameter of that type has fallen into a strong association interval. To facilitate the display of the correlation between different parameters, we merged the strong association intervals of the same parameter. For example, intervals B8 and B9 were combined into the {B8, B9} interval.
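One possible way to express this down-weighting is sketched below. The scoring rule and the decay factor of 0.5 are purely hypothetical and are not values derived in this research; only the grouping of parameters A–D, E–G, and H–M into temperature, humidity, and wind types follows the text.

```python
# Parameter letters grouped by type, following the classification in the text.
PARAM_TYPE = {}
PARAM_TYPE.update({c: "temperature" for c in "ABCD"})
PARAM_TYPE.update({c: "humidity" for c in "EFG"})
PARAM_TYPE.update({c: "wind" for c in "HIJKLM"})

def warning_score(hit_items, decay=0.5):
    """Score a set of strong-association hits such as {'A7', 'B8'}.

    The first hit of each parameter type counts 1; every further hit of the
    same type is multiplied by `decay` once more (hypothetical weighting).
    """
    seen = {}
    score = 0.0
    for item in sorted(hit_items):
        kind = PARAM_TYPE[item[0]]
        score += decay ** seen.get(kind, 0)
        seen[kind] = seen.get(kind, 0) + 1
    return score

same_type = warning_score({"A7", "B8", "C6"})    # three temperature hits
mixed_type = warning_score({"A7", "E8", "H4"})   # one hit per type
```

Under this weighting, three hits of different types score higher than three hits of the same type, which reflects the idea that hits of the same type carry partly redundant information.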
On the basis of all lightning occurrences, we classified lightning strikes by amplitude and type. The amplitude was categorized into six levels (W1–W6), with 20 kA as the threshold. The number and proportion of each level are illustrated in
Figure 12a. Lightning strikes were divided into three types: intra-cloud flash (IC), positive cloud-to-ground (PCG), and negative cloud-to-ground (NCG), labeled as Z0–Z2. The number and proportion of each type are shown in
Figure 12b.
The distribution of lightning current amplitude and lightning current type exhibits certain similarities. Taking the lightning type as an example, the support and confidence of the obtained association rules are illustrated in
Figure 13. The parameters selected in this research are not sensitive to changes in lightning type and amplitude. The same parameter interval corresponds to all three lightning current types, yielding three distinct association rules. For example, the A7 interval simultaneously corresponds to the three types of lightning currents Z0, Z1, and Z2. The support and confidence of these association rules vary with the distribution proportion of the different types of lightning events. The antecedent parameter intervals appearing in the association rules in
Figure 13 overlap with the antecedent parameter intervals in
Table 4, further demonstrating that the parameters selected in this study are sensitive to whether lightning occurs but cannot discriminate between changes in lightning amplitude and type.
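The dependence of support and confidence on the class proportions can be made explicit with the standard definitions. The independence condition below is an idealized assumption introduced here for illustration and was not tested in this research; $A$ denotes a parameter interval and $Z_k$ a lightning type, both evaluated over the lightning samples.

```latex
\[
\mathrm{supp}(A \Rightarrow Z_k) = P(A \cap Z_k), \qquad
\mathrm{conf}(A \Rightarrow Z_k) = \frac{P(A \cap Z_k)}{P(A)} = P(Z_k \mid A).
\]
% Idealized assumption: the interval A carries no information about the type Z_k.
\[
P(A \cap Z_k) = P(A)\,P(Z_k)
\;\Longrightarrow\;
\mathrm{supp}(A \Rightarrow Z_k) = P(A)\,P(Z_k), \qquad
\mathrm{conf}(A \Rightarrow Z_k) = P(Z_k).
\]
```

Under this assumption, the confidence of each rule equals the share of the corresponding lightning type, and the support is proportional to it, so both simply track the class proportions rather than any parameter-specific effect.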
The conclusion that the parameters selected in this research are not sensitive to changes in lightning current amplitude and type can be illustrated by analyzing the correlations between different parameters and lightning amplitude. After reducing the requirements for support and confidence, the results shown in
Figure 14 can be obtained. From this figure, we can conclude that there is no one-to-one correspondence between different parameters and lightning current amplitude. Instead, a given temperature interval corresponds to multiple lightning current amplitude intervals simultaneously. The two graphs on the right in
Figure 15 show the support and confidence. The first concerns the antecedent A7 and all lightning amplitude intervals, while the second concerns the consequent W1 and all t2m intervals. It can be observed that as the proportion of intervals W1 to W6 gradually decreases, their corresponding support and confidence also decrease. Meanwhile, in the extracted association rules with W1 as the target result, the support and confidence for the corresponding A7 interval are higher, consistent with the strong association rules obtained in
Table 4.