1. Introduction
Landslide susceptibility models are used to predict the spatial occurrence of slope failure given a range of geo-environmental conditions, allowing identification of landslide-prone areas and supporting spatial planning to reduce landslide risk [1, 2, 3]. Various methods and approaches have been proposed to assess landslide susceptibility, such as heuristic or index-based zoning techniques, physically based models, and statistically based classification methods [4, 5, 6].
The likelihood of landslide occurrence can be determined by fitting a statistical model to historical landslides, taking into account explanatory factors affecting landslides, such as geology, topography, hydrometry, and land use. The main advantages of statistical models are their high efficiency and the insight they give into the relationships between the spatial factors used to identify areas prone to landslides [7, 8]. Lee [5] assessed the status of landslide susceptibility mapping based on 776 papers published over a 20-year period (1999–2018) and found that the commonly used statistical methods were logistic regression, the frequency ratio, and weights-of-evidence. A review of statistically based modeling of landslide susceptibility, including 565 peer-reviewed articles from 1983 to 2016, was presented by Reichenbach et al. [6], who noted that the most applied statistical methods for modeling landslide susceptibility were logistic regression, neural network analysis, and weights-of-evidence.
Weights-of-evidence is a very popular technique for landslide susceptibility mapping because it is easy to use and can easily be incorporated in geographic information systems [7, 8, 9]; recent examples are [10, 11, 12, 13, 14, 15, 16]. The purpose of weights-of-evidence is to weigh and combine the controlling factors to predict the probability of landslide occurrence. However, weights-of-evidence is hampered by the assumption of conditional independence of the controlling factors, which is often untrue in practice. Violation of conditional independence between factors has received much attention in the geosciences, particularly in mineral prospectivity modeling. When there is significant conditional dependency between factors, the probabilities derived from weights-of-evidence are biased and generally too large compared to the observations [9, 17]. Several approaches have been proposed to account for model bias or to relax the conditional independence assumption, such as modified weighting [18, 19], additive mixed terms [20, 21], semi-naïve Bayes approaches [22, 23], or machine learning algorithms such as decision trees, random forest, and artificial neural networks [24, 25, 26, 27]. However, to date, no generally accepted improvement has been found.
Logistic regression is one of the most common methods for modeling landslide susceptibility because it is easy to implement and very efficient for analyzing relationships between a binary response variable and numerical or categorical explanatory variables [28, 29, 30]. Budimir et al. [31] presented a review of landslide probability mapping using logistic regression, based on 75 peer-reviewed papers, and concluded that there is no consistent methodology for applying logistic regression analysis to landslide susceptibility. In particular, the method by which explanatory factors or factor classes are selected is often not well explained. Furthermore, the majority of the published papers apply a combination of frequency ratio and logistic regression, where factor classes are replaced by their landslide frequency ratio and logistic regression is only applied to the factors; recent examples are [32, 33, 34, 35, 36, 37].
Most studies using logistic regression to predict landslide susceptibility provide little or no information on the conditional independence of factors and model bias. In mineral prospectivity modeling, on the other hand, it is well known that logistic regression, unlike weights-of-evidence, always produces unbiased estimates, regardless of whether the controlling factors are conditionally independent with respect to the target variable [19, 38]. Moreover, it is well known that weights-of-evidence and logistic regression produce similar results if the predictor factors are categorical and conditionally independent [18, 19, 20, 21, 38, 39].
A disadvantage of logistic regression is that estimated regression coefficients can have large variances unless there is conditional independence of the controlling factors [38]. However, in the case where the factors are categorical, interaction terms in logistic regression models can compensate for violations of conditional independence [21, 39]. Therefore, combinations of factors or factor classes can be added to the model as additional terms to compensate for the lack of conditional independence of the factors. Additional interaction terms result in a hierarchy of models, where each model is a special case of its successor and is therefore more restrictive [39]. However, the practical application of this method has been questioned because the number of additive terms can increase rapidly, so that the estimation of the logistic regression coefficients becomes increasingly difficult, if not impossible, given the accuracy of the numerical solution procedure and the limited number of training data [20, 38].
In this study, we focus on statistical methods for landslide susceptibility mapping, which predict the conditional probability of landslides with categorical controlling factors, in particular, weights-of-evidence and logistic regression. We investigate how modeling techniques, conditional independence of the factors, and model bias are related. A unique conditions model is proposed that reproduces observed landslide probabilities for any combination of categorical controlling factors, without any bias. The feasibility, strengths, and weaknesses of the modeling approaches are illustrated and tested through application to a practical case study.
2. Materials and Methods
2.1. Preliminaries
We denote the observation of a landslide in the study area with a binary indicator $L$, such that $L = 1$ indicates the presence and $L = 0$ the absence of a landslide. Similarly, landslide controlling factors are denoted with a set of binary indicators $F_{ij}$, where $i = 1, \ldots, n$ refers to $n$ factor types and $j = 1, \ldots, m_i$ to $m_i$ subtypes of factor $i$, so that $F_{ij} = 1$ indicates the presence and $F_{ij} = 0$ the absence of factor class $F_{ij}$. We assume that all factors completely overlap the study area and that the classes of each factor do not overlap, so
$$\sum_{j=1}^{m_i} F_{ij} = 1, \qquad i = 1, \ldots, n. \tag{1}$$
We also define the unconditional landslide probability $P(L)$, generally denoted as the prior probability; the conditional landslide probability for a single factor class $P(L \mid F_{ij})$; and the conditional landslide probability for all factors combined $P(L \mid F)$, which is commonly referred to as the posterior probability. When factor class $F_{ij}$ promotes landslides, $P(L \mid F_{ij})$ exceeds $P(L)$, and vice versa. The same applies to the combined set of factors; when $P(L \mid F)$ is larger than $P(L)$, the environmental conditions are more favorable for landslides to occur.
Estimates of the prior probability and the conditional probability for a given factor class can be obtained directly from landslide observations as follows:
$$P(L) = \frac{1}{A} \sum_{A} L, \tag{2}$$
$$P(L \mid F_{ij}) = \frac{1}{A_{ij}} \sum_{A} L \, F_{ij}, \tag{3}$$
where the sums run over all unit areas of the study domain, $A$ is the total area of the study domain, and $A_{ij}$ is the area occupied by factor class $F_{ij}$.
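As an illustration of these estimates, the following minimal Python sketch computes the prior and the class-conditional landslide probabilities from gridded data. It assumes, hypothetically, that the landslide inventory and one factor map are stored as one-dimensional NumPy arrays with one entry per unit cell of equal area; the function name and the toy values are illustrative and not taken from the study.

```python
import numpy as np

def prior_and_class_probabilities(landslide, factor):
    """Estimate the prior and the per-class conditional landslide probabilities.

    landslide: 0/1 value per unit cell (assumed layout, equal-area cells).
    factor:    integer class label of one factor per unit cell.
    """
    landslide = np.asarray(landslide, dtype=float).ravel()
    factor = np.asarray(factor).ravel()
    # Prior: landslide cells divided by all cells.
    prior = landslide.mean()
    # Per class: landslide cells in the class divided by cells in the class.
    classes, inverse, cell_counts = np.unique(
        factor, return_inverse=True, return_counts=True
    )
    slide_counts = np.bincount(inverse, weights=landslide, minlength=len(classes))
    conditional = dict(zip(classes.tolist(), (slide_counts / cell_counts).tolist()))
    return prior, conditional

# Toy data: 8 unit cells, one factor with two classes (hypothetical values).
landslide = np.array([1, 0, 0, 0, 1, 1, 0, 0])
slope_class = np.array([0, 0, 0, 0, 1, 1, 1, 1])
prior, conditional = prior_and_class_probabilities(landslide, slope_class)
print(prior, conditional)
```

In this toy case, class 1 has a conditional probability above the prior and class 0 below it, so class 1 would be read as promoting landslides.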
The posterior probability $P(L \mid F)$ is not easy to derive from the data, and finding a suitable model to predict posterior probabilities is the main goal of a landslide susceptibility study. Various statistically based methods and approaches have been applied in practice, but little attention is paid to whether estimated probabilities are reliable. Model bias refers to systematic errors, which can result from inaccurate data or from bias in the algorithms used to validate the model. In geosciences, it is common to verify that the mean of the posterior probability is equal to the observed prior probability [9, 17], so
$$\frac{1}{A} \sum_{A} P(L \mid F) = P(L). \tag{4}$$
In addition, one can also verify whether the conditional landslide probability for a single factor class agrees with the observations, i.e.,
$$\frac{1}{A_{ij}} \sum_{A} P(L \mid F) \, F_{ij} = P(L \mid F_{ij}). \tag{5}$$
Note that if Equation (5) holds, then Equation (4) is also satisfied, because
$$\frac{1}{A} \sum_{A} P(L \mid F) = \frac{1}{A} \sum_{j=1}^{m_i} \sum_{A} P(L \mid F) \, F_{ij} = \frac{1}{A} \sum_{j=1}^{m_i} A_{ij} \, P(L \mid F_{ij}) = P(L), \tag{6}$$
which can be generalized as: any area partitioned into unbiased sub-areas is unbiased.
However, in practice, models for landslide susceptibility mapping are often biased, resulting in incorrect predictions, which are generally overlooked or ignored.
2.2. Weights-of-Evidence
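Weights-of-evidence is a very popular and widely used method for predicting the probability of landslides. In weights-of-evidence, the posterior probability $P(L \mid F)$ is derived from Bayes’ theorem as
$$P(L \mid F) = \frac{P(F \mid L) \, P(L)}{P(F)}. \tag{7}$$
Similarly, the posterior probability for absence of landslides $P(\bar{L} \mid F)$ is derived as
$$P(\bar{L} \mid F) = \frac{P(F \mid \bar{L}) \, P(\bar{L})}{P(F)}. \tag{8}$$
Combined, this leads to an expression for the odds of the presence versus the absence of a landslide:
$$\frac{P(L \mid F)}{P(\bar{L} \mid F)} = \frac{P(L)}{P(\bar{L})} \, \frac{P(F \mid L)}{P(F \mid \bar{L})}. \tag{9}$$
Using the logit function, this can be rewritten as
$$\operatorname{logit} P(L \mid F) = \operatorname{logit} P(L) + \ln \frac{P(F \mid L)}{P(F \mid \bar{L})}. \tag{10}$$
Furthermore, weights-of-evidence assumes a conditional independence of the controlling factors, so that the joint probabilities on the right-hand side of Equation (10) can be derived from the product of individual probabilities, so
$$\operatorname{logit} P(L \mid F) = \operatorname{logit} P(L) + \sum_{i=1}^{n} W_{ij}, \tag{11}$$
where $W_{ij}$ are factor class weights given by
$$W_{ij} = \ln \frac{P(F_{ij} \mid L)}{P(F_{ij} \mid \bar{L})}. \tag{12}$$
Equation (11) is a statistical equation. To use it as a model for the prediction of landslide probabilities in a domain, it must be reformulated in algebraic form as
$$\operatorname{logit} P(L \mid F) = \operatorname{logit} P(L) + \sum_{i=1}^{n} \sum_{j=1}^{m_i} W_{ij} \, F_{ij}, \tag{13}$$
clearly showing that the weights only apply if the corresponding factor class is present.
In practice, estimates of the probabilities in Equation (12) can be obtained from observed landslides, so the weights are derived as
$$W_{ij} = \operatorname{logit} P(L \mid F_{ij}) - \operatorname{logit} P(L), \tag{14}$$
showing that there is a one-to-one relationship between the weight $W_{ij}$ and the observed landslide probability $P(L \mid F_{ij})$ of a factor class.
Because landslides are rare, landslides may not be observed if the area of a factor class is small, so that $P(L \mid F_{ij}) = 0$, which poses a problem for the application of Equation (14) because the logarithm of zero is infinite. In such a case, one is accustomed to setting the weight equal to zero, although this violates Equation (14) and introduces a model bias because $W_{ij} = 0$ implies that $P(L \mid F_{ij}) = P(L)$, which contradicts what is observed.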
In the case of conditional independence of all factors, Equations (4) and (5) hold, showing that the posterior probabilities are unbiased; the proof is given in Section 2.3. However, in practice, the conditional independence of controlling factors is generally not guaranteed, so the posterior probabilities obtained by weights-of-evidence are biased. Usually, violation of conditional independence results in posterior probabilities that are too large, and conversely, if the model results are found to be biased, this may be due to a lack of conditional independence of the factors.
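The effect of conditional dependence on weights-of-evidence can be illustrated with a small numerical sketch. The Python code below is a minimal illustration, not the implementation used in the study: it assumes that each weight is the logit of the observed class probability minus the logit of the prior, that every class contains at least one landslide cell and one landslide-free cell (so all logits are finite), and that maps are one-dimensional arrays over equal-area unit cells. All names and values are hypothetical.

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def woe_posterior(landslide, factors):
    """Weights-of-evidence posterior per unit cell (finite weights assumed)."""
    landslide = np.asarray(landslide, dtype=float)
    prior = landslide.mean()
    log_odds = np.full(landslide.shape, logit(prior))
    for factor in factors:
        factor = np.asarray(factor)
        for cls in np.unique(factor):
            mask = factor == cls
            # Weight of this class: logit of observed class probability minus logit of prior.
            weight = logit(landslide[mask].mean()) - logit(prior)
            # The weight only applies where the class is present.
            log_odds[mask] += weight
    return 1.0 / (1.0 + np.exp(-log_odds))

# Toy data: factor_b duplicates factor_a, an extreme case of conditional dependence.
landslide = np.array([1, 0, 0, 0, 1, 1, 0, 0])
factor_a = np.array([0, 0, 0, 0, 1, 1, 1, 1])
factor_b = factor_a.copy()

prior = landslide.mean()
single = woe_posterior(landslide, [factor_a])
double = woe_posterior(landslide, [factor_a, factor_b])
print(prior, single.mean(), double.mean())
```

With a single factor, the mean posterior equals the prior of 0.375. When the same evidence is counted twice, the mean posterior rises to 0.390625, which is a small instance of the upward bias described above.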
2.3. Logistic Regression
Multiple logistic regression is probably the most commonly used technique to predict posterior landslide probabilities. Starting from Equation (13), the idea arises to derive the weights by logistic regression. However, there is a complication, namely that the factor classes are linearly dependent, as shown by Equation (1), which is not allowed in multiple regression. To get around this, one class in each factor must be removed: usually the first class, although any class will do. Therefore, the logistic regression model is formulated as follows:
$$\operatorname{logit} P(L \mid F) = \beta_0 + \sum_{i=1}^{n} \sum_{j=2}^{m_i} \beta_{ij} \, F_{ij}, \tag{15}$$
where $\beta_0$ and $\beta_{ij}$ are model parameters to be estimated by maximum likelihood, which is a measure of fit between predicted probabilities and the observed data. The log-likelihood $\ell$ is given by [40] as
$$\ell = \sum_{A} \left[ L \ln P(L \mid F) + (1 - L) \ln \bigl( 1 - P(L \mid F) \bigr) \right]. \tag{16}$$
Maximum likelihood is obtained by setting the derivatives of the log-likelihood for each parameter equal to zero, so
$$\frac{\partial \ell}{\partial \beta_0} = \sum_{A} \left[ L - P(L \mid F) \right] = 0, \tag{17}$$
$$\frac{\partial \ell}{\partial \beta_{ij}} = \sum_{A} \left[ L - P(L \mid F) \right] F_{ij} = 0, \tag{18}$$
which can be solved to determine $\beta_0$ and $\beta_{ij}$. In practice, this requires specialized software because the logistic regression model is non-linear. Note that these equations are equivalent to Equations (4) and (5), which express the model bias. Therefore, the maximum likelihood solution also ensures that the posterior probabilities predicted by the model are unbiased, which is an important advantage of using logistic regression.
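In the case of conditional independence of the factors, logistic regression and weights-of-evidence are equivalent [21, 37, 39]. To prove the correspondence between weights-of-evidence and logistic regression, eliminate the first class of each factor in Equation (13) by substituting $F_{i1} = 1 - \sum_{j=2}^{m_i} F_{ij}$, so
$$\operatorname{logit} P(L \mid F) = \operatorname{logit} P(L) + \sum_{i=1}^{n} W_{i1} + \sum_{i=1}^{n} \sum_{j=2}^{m_i} \left( W_{ij} - W_{i1} \right) F_{ij}. \tag{19}$$
Comparison with Equation (15) shows the correspondence between the weights and the logistic regression coefficients as
$$\beta_0 = \operatorname{logit} P(L) + \sum_{i=1}^{n} W_{i1}, \tag{20}$$
$$\beta_{ij} = W_{ij} - W_{i1}. \tag{21}$$
Similar expressions have been presented in the literature, for example [21, 37, 39].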
A final note about logistic regression is that it cannot handle missing data. Therefore, factor classes without observed landslides should be excluded from the model.
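The unbiasedness of the maximum likelihood solution can be checked numerically. The following Python sketch is a minimal illustration under stated assumptions: factor maps are one-dimensional arrays over equal-area unit cells, the first class of each factor is dropped as the reference class, and the coefficients are found with a plain Newton-Raphson iteration rather than with specialized software. Function names and toy data are hypothetical.

```python
import numpy as np

def design_matrix(factors):
    """Intercept plus one 0/1 column per factor class, dropping the first class of each factor."""
    columns = [np.ones(len(factors[0]))]
    for factor in factors:
        factor = np.asarray(factor)
        for cls in np.unique(factor)[1:]:  # first class acts as reference class
            columns.append((factor == cls).astype(float))
    return np.column_stack(columns)

def fit_logistic(X, y, n_iter=25):
    """Maximum likelihood coefficients by Newton-Raphson (no regularization)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)  # derivatives of the log-likelihood
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hessian, gradient)
    return beta

# Toy data: two dependent (but not identical) factors over 12 unit cells.
factor_a = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
factor_b = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1])
landslide = np.array([1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0], dtype=float)

X = design_matrix([factor_a, factor_b])
beta = fit_logistic(X, landslide)
posterior = 1.0 / (1.0 + np.exp(-X @ beta))

# Bias check: mean posterior versus observed frequency, overall and per factor class.
print(posterior.mean(), landslide.mean())
for name, factor in (("a", factor_a), ("b", factor_b)):
    for cls in np.unique(factor):
        mask = factor == cls
        print(name, cls, posterior[mask].mean(), landslide[mask].mean())
```

At convergence the printed pairs agree for the whole domain and for every factor class, including the dropped reference classes, even though the two factors are not conditionally independent. The sketch does not test agreement for combinations of classes, where logistic regression gives no such guarantee.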
2.4. Unique Conditions
One way to avoid conditional dependency is to overlay factors to create combined factor classes, which can improve conditional independence [20]. Ultimately, all factors can be overlaid and all factor classes combined to identify unique conditions, which can be indicated as
$$U_k = \prod_{i=1}^{n} F_{i j_i}, \tag{22}$$
where the subscript $j_i$ indicates any factor class of factor $i$ in the range 1 to $m_i$, and $U_k$ is a binary indicator such that $U_k = 1$ indicates the presence and $U_k = 0$ the absence of a unique combination of factor classes. There are $\prod_{i=1}^{n} m_i$ possibilities for $U_k$, but most of these will not occur because many factor classes do not overlap. Furthermore, all occurring combinations are categorical and non-overlapping, and fully cover the study area, so
$$\sum_{k} U_k = 1. \tag{23}$$
Therefore, the unique conditions form a conditionally independent set $U$ of controlling factors, so that a model for predicting the conditional landslide probability can be obtained as
$$\operatorname{logit} P(L \mid U) = \operatorname{logit} P(L) + \sum_{k} W_k \, U_k, \tag{24}$$
where $P(L \mid U)$ is the posterior landslide probability, given the set of unique conditions $U$, and $W_k$ are weights that can be obtained from landslide observations, similarly as in Equation (14), so that
$$W_k = \operatorname{logit} P(L \mid U_k) - \operatorname{logit} P(L), \tag{25}$$
where $P(L \mid U_k)$ is the frequency of landslides observed in area $A_k$ occupied by factor $U_k$, given by
$$P(L \mid U_k) = \frac{1}{A_k} \sum_{A} L \, U_k. \tag{26}$$
When Equation (25) is inserted into Equation (24) and Equation (23) is used, it follows that
$$\operatorname{logit} P(L \mid U) = \sum_{k} U_k \operatorname{logit} P(L \mid U_k), \quad \text{or equivalently} \quad P(L \mid U) = \sum_{k} P(L \mid U_k) \, U_k. \tag{27}$$
This shows that the predicted posterior probability is constant in the area occupied by $U_k$ and equal to the probability observed in that area. Since the study area is completely covered by the set $U$ and each factor class is covered by a subset of $U$, Equations (4) and (5) apply, which shows that there is no model bias. Furthermore, the model is unbiased in any subset that can be composed of $U_k$. Therefore, no other model based on the same categorical factors can reproduce the observed landslide probabilities more closely.
When all factors are conditionally independent, the weights-of-evidence model and the unique conditions model are equivalent, because both methods solve Equation (10) exactly and predict the same posterior probabilities: the latter by combining the observed probabilities of all possible combinations of the factor classes, and the former by combining the observed probabilities of the factor classes, which should lead to the same result if the factors are conditionally independent. Since the unique conditions model is unbiased, weights-of-evidence must also be unbiased because the results are the same if the factors are conditionally independent.
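A unique conditions model of this kind can be sketched in a few lines of Python. The code below is an illustration under assumptions, not the authors' implementation: factor maps are one-dimensional integer arrays over equal-area unit cells, each occurring combination of class labels is treated as one unique condition, and the posterior of a cell is the landslide frequency observed within its unique condition. Names and toy data are hypothetical.

```python
import numpy as np

def unique_conditions_posterior(landslide, factors):
    """Posterior per unit cell as the landslide frequency within its unique condition."""
    landslide = np.asarray(landslide, dtype=float)
    combos = np.column_stack(factors)  # one row of class labels per unit cell
    # Only combinations that actually occur are enumerated.
    _, inverse, cell_counts = np.unique(
        combos, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.ravel()
    slide_counts = np.bincount(inverse, weights=landslide, minlength=len(cell_counts))
    frequency = slide_counts / cell_counts
    return frequency[inverse], len(cell_counts)

factor_a = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
factor_b = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1])
landslide = np.array([1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0])

posterior, n_conditions = unique_conditions_posterior(landslide, [factor_a, factor_b])
print(n_conditions, posterior)
```

In this toy case four unique conditions occur, and averaging the posterior over the whole domain or over any single factor class returns the observed landslide frequency of that area, in line with the argument above.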
4. Discussion
The theoretical developments and illustrative examples show that conditional independence of the controlling factors and model bias are related, as also reported in the literature [17, 18, 19, 20, 21, 22, 38, 39]. In this study, it is shown that when the controlling factors are conditionally independent, the weights-of-evidence, logistic regression, and unique conditions models are equivalent, meaning they yield the same posterior probabilities. Therefore, in practice, one can choose any of these methods based on the simplicity of the technique or the skill and experience of the user. The equivalence between weights-of-evidence and logistic regression in the case of conditional independence of the factors has been demonstrated in other studies [18, 19, 20, 21, 38, 39], but the equivalence with the unique conditions method is a new contribution of this study.
Conditional independence of the controlling factors is, in practice, the exception rather than the rule. When there is no conditional independence, weights-of-evidence produces biased posterior probabilities, which is usually ignored in practice, especially in landslide susceptibility studies, where weights-of-evidence has proven to be a very popular technique [4, 5, 6]. On the other hand, logistic regression provides unbiased posterior probabilities for individual factor classes and the overall study area, but not for higher levels when factors are combined. Methods have been proposed in the literature to improve weights-of-evidence and logistic regression by including so-called mixed terms, that is, combinations of factors, based on trial and error or search algorithms that improve the likelihood of the predictions [4, 21, 33, 35]. Such methods may be justified, but it seems likely that their results will never match the results of the unique conditions model.
The above discussion is further illustrated by considering the ROC curves and AUC values obtained with the different models, as shown in Figure 4. When only landslide susceptibility classification is considered, weights-of-evidence and logistic regression perform almost equally. Because weights-of-evidence is easier to perform, it may be preferable in practice if only classification is pursued. However, neither of these methods can match the discriminatory power of the unique conditions model. Moreover, Figure 3 shows that the posterior probability map obtained with the unique conditions model is more detailed and covers the entire range of probabilities from zero to one, while for logistic regression, the predicted probabilities range only from zero to a maximum of 0.36, and for weights-of-evidence from zero to 0.81, the latter likely an overestimate caused by the model bias.
It is generally accepted that direct estimation of conditional probabilities for all combinations of controlling factors is infeasible due to the excessive computational requirements. When the controlling factors consist of $n$ binary patterns, $2^n$ unique combinations are possible, making it very difficult, if not impossible, to directly estimate the conditional probabilities of all combinations [17, 20]. For instance, in the present case, there are a total of 47 factor classes, which would imply more than $10^{14}$ possible combinations. In practice, however, this is not the case, because classes of the same factor do not overlap, so the number of possible unique combinations reduces to the product of the number of classes in each factor. In the present case, this would amount to 629,856 possible combinations, which is still a large number. In addition, not all classes of different factors overlap, so the actual number of unique combinations may be much smaller. Hence, the trick is to consider only the combinations that actually occur and ignore the rest. This can be achieved, as shown in Figure 2, by comparing the combination of factors in a unit area with all other unit areas in the study domain and, if the conditions match, counting the number of unit areas and the number of observed landslides for this combination, making it possible to estimate the landslide probability of that combination using Equation (26). Because the unique combinations of all factors do not overlap, they are conditionally independent, such that the posterior probabilities are equal to the observed probabilities as given in Equation (27). The numerical derivation can be tedious if the unit area is small relative to the total domain. In the present case, there are 310,649 unit cells, which is large but achievable with modern computing power.
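The counting argument can be checked with a few lines of Python. The class counts per factor are not listed in this section, so the tuple below is purely hypothetical: it was chosen only because it sums to 47 classes and multiplies to 629,856, and it should not be read as the actual factor configuration of the case study.

```python
import math

# Treating all 47 factor classes as independent binary patterns.
n_factor_classes = 47
binary_combinations = 2 ** n_factor_classes

# Hypothetical class counts per factor (assumption, for illustration only).
classes_per_factor = (2, 3, 6, 6, 6, 6, 9, 9)
product_combinations = math.prod(classes_per_factor)

# The number of occurring unique conditions can never exceed the number of unit cells.
n_unit_cells = 310_649
upper_bound = min(product_combinations, n_unit_cells)

print(binary_combinations)   # 140737488355328, i.e. more than 10**14
print(product_combinations)  # 629856
print(upper_bound)           # 310649
```

The last line makes explicit why only occurring combinations matter: with fewer unit cells than theoretical combinations, most combinations cannot be present in the domain at all.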
The number of unique conditions actually occurring is 28,605, which is much less than the theoretically possible number of combinations. The size of the unique conditions areas ranges from 1 to 467 grid cells, i.e., 400 m² to 187 ha. The average size of the unique conditions area is 4.34 ha. In general, the areas with unique conditions are quite small, so in many of these areas, no landslide has been observed. In fact, 66% of the total domain appears to be free of landslides. This could be interpreted as missing data, and the posterior probability could be set equal to the prior value. However, this would conflict with the unbiasedness of the model, so we chose to be consistent and set the posterior probability equal to zero. Furthermore, 76% of the domain is found to have a posterior probability lower than the prior probability, and thus can be assumed to have low landslide susceptibility. There are also 171 unit areas found with unique conditions and observed landslides, implying a landslide probability of one. These represent 3% of all observed landslides in the study area.
At first glance, one might conclude that the unique conditions model does not provide information about the importance of each factor class. However, this is not the case, because the model is unbiased and precisely predicts the average posterior landslide probability in each factor class area. Because these averages are equal to the observed landslide probabilities, these probabilities or the corresponding weights are measures of the importance and predictive power of each factor class. Such information could be used prior to modeling to discard factors with classes that exhibit little or negligible discriminatory power. For instance, in the present case, one might decide to remove the slope shape factor because it has low values. This reduces the number of unique conditions to 13,341, which can save computing time. The size of the unique conditions areas now ranges from 1 to 1090 grid cells, i.e., 400 m² to 436 ha, with an average of 11.4 ha. Now, 54% of the total area is free of landslides and 74% has a posterior probability that is lower than the prior probability. However, the resulting posterior landslide probabilities are not much different from the previous results and the AUC value becomes 0.90. Removing the slope shape factor therefore has little effect, apart from the gain in computing time.
Landslide probability estimates are not well suited for mapping landslide susceptibility, because the distribution over the study area is very skewed and there is no clear rule for classifying the probability values into landslide susceptibility categories. Therefore, we propose a landslide susceptibility index (LSI), similar to [23], defined as
which for the unique conditions model becomes
Such a classification has the advantage of being easy to interpret, since positive values indicate areas prone to landslides and conversely, negative values indicate areas less prone to landslides. Moreover, landslide susceptibility classes can be easily obtained without subjective judgment by dividing the LSI values into equal intervals.
The resulting LSI map for the present case is shown in Figure 5. Since the probability values can be zero or one, the weights can go to infinity. Therefore, the weights are limited to a range of −10 to +10. The LSI map shows much more detail than the posterior landslide probability maps. LSI values around zero are represented by the yellow color and correspond to areas that are no more or less prone to landslides than observed. Negative LSI values, represented by the green and blue colors, indicate areas not prone to landslides. These take up a large part of the basin. On the other hand, areas with orange and red colors are prone to landslides and cover only some small parts of the basin. Thus, transformation of posterior landslide probability into corresponding LSI values allows a simple and clear interpretation of landslide susceptibility.
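A susceptibility index with these properties can be sketched as follows. The defining equation is not reproduced in this section, so the Python code below rests on an explicit assumption: the LSI is taken to be the natural-log odds of the posterior probability minus the natural-log odds of the prior, which is zero where the posterior equals the prior, positive where it is larger, and infinite where the posterior is zero or one. The function name and the sample values are hypothetical; only the limits of −10 and +10 come from the text.

```python
import numpy as np

def landslide_susceptibility_index(posterior, prior, limit=10.0):
    """Assumed LSI: logit(posterior) - logit(prior), clipped to [-limit, +limit]."""
    posterior = np.asarray(posterior, dtype=float)
    # Posterior values of exactly 0 or 1 give -inf or +inf; silence the log warnings.
    with np.errstate(divide="ignore"):
        logit_posterior = np.log(posterior) - np.log1p(-posterior)
    logit_prior = np.log(prior) - np.log1p(-prior)
    return np.clip(logit_posterior - logit_prior, -limit, limit)

prior = 0.375
posterior = np.array([0.0, 0.375, 0.5, 1.0])
lsi = landslide_susceptibility_index(posterior, prior)
print(lsi)
```

Under this assumption, cells without observed landslides receive the lower limit, cells whose posterior equals the prior receive zero, and cells with a posterior of one receive the upper limit, so that equal-interval classes can be defined on a bounded scale.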
5. Conclusions
We examined three statistical methods for landslide susceptibility mapping that predict the conditional probability of landslides with categorical controlling factors: weights-of-evidence, logistic regression, and a unique conditions model that considers all possible combinations of the controlling factors. The strengths and weaknesses of the models were illustrated and tested through application to a practical case study.
It is shown that when all factors are conditionally independent, all models are equivalent and result in unbiased predictions of the posterior landslide probability. When there is a conditional dependency between factors, the posterior probabilities derived from weights-of-evidence are biased and generally too large compared to the observations. However, the bias of the weights-of-evidence has little effect on its discriminatory power. On the other hand, logistic regression produces unbiased estimates with respect to the factor classes and the overall study area, regardless of whether the controlling factors are conditionally independent.
The unique conditions model is always unbiased because the unique condition areas do not overlap and are therefore conditionally independent. Therefore, this model predicts the landslide probabilities without bias in the total domain, in all factor classes, and in any area that can be composed by combining factor classes. Moreover, the unique conditions model outperforms the other models because the discriminatory power is much higher, and the AUC value is close to one. The application of the unique conditions model can become computationally cumbersome if there are too many controlling factors and overlaps. This can also lead to unique conditions areas becoming too small to be meaningful. To avoid this, the most important factors can first be selected based on the observed landslide probabilities.
Because landslide probability estimates are not well suited for landslide susceptibility mapping, we propose a landslide susceptibility index that has the advantage of being easy to interpret without subjective judgment.
Although the landslide dataset used in this study does not include all possible landslide-causing factors, the results of this study are promising and show potential for broader practical use. However, quality and quantity of input data are important for achieving good results, so further research is necessary. We therefore recommend that future research consider other field cases with more complete thematic layers and provide a geomorphological evaluation and cross-validation of the predictions. In future research, we also recommend validating the findings of this work with other innovative data-driven methods such as machine learning and/or deep learning models.