1. Introduction
Pipelines, the backbone of the oil and gas industry, convey petroleum products in a variety of settings (i.e., onshore or offshore) [1,2]. The first oil pipeline, constructed in Pennsylvania in 1879, was 109 miles long and 6 inches in diameter [3]. Over 2 million miles of pipeline have been built in 120 countries around the world. The United States has 65% of the world's total pipeline length, followed by Russia at 8% and Canada at 3%. Together, the three countries account for about 75% of the overall pipeline length [4]. As of 2020, there are 491 functioning oil pipelines around the world [5]. Over 46% (19,122 miles) of worldwide oil and gas pipelines lie in Asia-Pacific, while Canada is only projected to contribute 6% of the pipeline construction [6].
Pipelines are the safest means to carry petroleum products when compared to rail and roadways. However, pipelines are prone to various failures under diverse circumstances, leading to catastrophic environmental consequences owing to oil spilling as well as substantial economic losses due to production stoppage [
7]. The social and economic prosperity of a country is associated with pipeline safety and security. Pipeline failures are caused by mechanical, corrosion, natural hazards, operational, and third-party sources, according to the Conservation of Clean Air and Water in Europe (CONCAWE), a European organization that investigates environmental, health, and safety issues for the oil industry. CONCAWE was launched in 1963 by a consortium of top oil firms to conduct environmental studies related to the oil sector [
8]. As a result, timely inspections and checks of the pipeline condition are required to avoid accidents and failures [
9].
Inspection techniques have been applied to discover pipeline anomalies and flaws without shutting down production. In order to overcome the significant cost and time required by these inspection techniques, numerous studies have been undertaken to examine the condition, diagnose failure causes, and anticipate the residual lives of pipelines. Some failure prediction models were founded on subjective assessment, making them susceptible to different opinions. For instance, Kabir et al. [
10] established a safety assessment model for oil and gas pipelines using a fuzzy Bayesian belief network. The model represented event dependencies, updated probabilities, and random, vague, and ambiguous knowledge. According to the results of the sensitivity analysis, the most significant causes of oil and gas pipeline failures included overload, construction fault, poor installation, mechanical damage, and worker quality. Li et al. [
11] examined the likelihood of third-party failure to an urban gas pipeline using the analytic hierarchy process (AHP) and fuzzy mathematics. To identify hazards of third-party damage, a fault tree that identified fundamental events was developed. The basic event probability was evaluated utilizing the expert judgment approach and the fuzzy membership function. Using the AHP, the weight of each expert was determined, the opinions were modified, and the third-party failure probability of the pipeline was computed. Some other condition assessment models were constrained by the limited number of historical records on which they were based (e.g., [
12,
13]). This might hinder the application of the developed models to other pipelines [
14].
The last category of models was concerned with examining specific failure causes of oil and gas pipelines using machine learning approaches. El-Abbasy et al. [
15] predicted the condition of oil and gas pipelines based on historical data from three offshore pipelines in Qatar. The model accounted for several factors such as age, diameter, metal loss, crossings, cathodic protection, operating pressure, free spans, anode wastage, and condition of coating, joint, and support. With regard to pipeline size and type of transported product, the artificial neural network (ANN) approach was employed to develop five condition prediction models. The developed models had coefficients of determination (R²) ranging from 0.9904 to 0.9959. Additionally, they were able to accurately forecast pipeline conditions with an average validity percent (AVP) of over 97%. Finally, a sensitivity analysis was performed to analyze the impact of each factor on pipeline condition. Cathodic protection and metal loss were associated with the highest positive and negative influence on pipeline condition, respectively. Diameter and crossings, on the other hand, were determined to have the least positive and negative effects on pipeline condition. Senouci et al. [
16] established regression analysis and ANN models that could forecast the cause of oil pipeline breakdown based on specific predictors, namely facility, diameter, age, service type, and land use. With an AVP of 90% for the regression model and 92% for the ANN model, the two models were able to forecast pipeline failures owing to mechanical, operational, corrosion, third-party, and natural hazards. The sensitivity analysis showed that facility and service predictors had the highest contribution to the pipeline failure cause. In this study, failure source was regarded as a prediction problem rather than a classification challenge, which may raise concerns about the reliability of results.
Shaik et al. [
9] proposed the application of the ANN approach to predict the condition of a crude oil pipeline based on particular criteria such as pressure flow, metal loss, weld anomalies, and wall thickness. With an R² value of 0.9998, the model with 16 hidden neurons accurately predicted the estimated repair factor. The deterioration profiles of the elements were constructed to determine the individual impact on pipeline condition. It was discovered that pressure had a significant negative impact on pipeline quality, whereas weld anomaly had a minor negative impact. Zakikhani et al. [
17] anticipated failure sources in oil pipelines based on physical, environmental, and operational factors. With an AVP of 73.7%, an ANN model was developed for predicting mechanical, corrosion, and third-party failures. Another ANN model with an AVP of 72.8% was constructed to forecast corrosion and third-party failures. In addition, a multinomial logit (MNL) regression model with an AVP of 73.7% was established for predicting mechanical, corrosion, and third-party failures. It is worth mentioning that the results obtained by ANN and MNL approaches were identical. However, the MNL model determined the likelihood of each failure source, assisting decision makers in identifying the most likely and critical failure sources. Concerning sensitivity analysis, product type and pipeline age had the greatest and least impact on the failure category, respectively. In another study, Zakikhani et al. [
18] developed failure prediction models for external corrosion in underground gas transmission pipelines, taking into account both environmental/geographical and traditional factors. Multiple regression analysis was applied to the available historical data for gas transmission pipelines. The constructed models had root mean square error (RMSE) values of 0.04 and 0.07, and R² values of 0.93 and 0.75, respectively, in the validation testing phase.
The limitations of the previous research studies could be listed as follows [
19]: (1) subjectivity and reliance on expert judgment that necessitated costly experiments/inspections, hindering generalized application to all pipelines; (2) simplicity and conservatism of the approaches used, highlighting the gap between research and practice in oil and gas pipeline failure prediction; (3) restriction to specific failure causes of oil and gas pipelines, so that the models could not anticipate the various pipeline failure types impartially; (4) consideration of failure source as a prediction rather than a classification problem, which may raise concerns about the reliability of results; and (5) utilization of limited records based on few in-line inspections, which limited model application to pipelines with different characteristics.
In an attempt to overcome these shortcomings, the primary objective of this research study is to develop objective prediction models for identifying different failure categories in oil and gas pipelines based on previous failure incidents. The models are established using a multilayer perceptron (MLP) neural network, radial basis function (RBF) neural network, and multinomial logistic (MNL) regression to classify corrosion and third-party failures. Findings show that these failure categories accounted for more than 70% of total oil pipeline accidents. The developed models take into account the significant factors that influence the condition of pipelines such as pipe diameter and age, service, facility type, and land use. The robustness of the proposed model has been compared to that of earlier approaches. This research assists pipeline operators in taking the required precautions and preventative actions to avoid catastrophic disasters in the oil and gas industry.
The major contributions of this research are identified as follows:
Introducing the application of RBF neural network to classify different failure types for oil and gas pipelines.
Conducting a thorough comparison of three different failure prediction models for oil and gas pipelines.
Enhancing the AVP value reported in the literature for the developed MLP, RBF, and MNL models by 15.4%, 16.8%, and 11.3%, respectively.
2. Failure Sources in Oil and Gas Pipelines
The CONCAWE database has classified oil pipeline failures into five categories [
20]: mechanical, corrosion, operational, third-party, and natural hazards.
Figure 1 illustrates the contribution of these failures based on data reports from CONCAWE [
21].
Mechanical failure is caused by design flaws, material faults (e.g., inappropriate or low-quality materials, and incorrect material specification), or construction problems (e.g., poor workmanship, inadequate support, and faulty weld) [
22]. These defects can be deformations in the pipe wall in the form of dents and gouges [
23]. Dents are radial deformations, whereas gouges are deformations along the pipe surface. This failure type can cause immediate or delayed failure, depending on its severity.
Corrosion is a slow process that causes metal loss in the pipe wall and eventually leads to pipeline failure [24]. Corrosion is the second most common cause of pipeline failure, according to the U.S. Department of Transportation. It is divided into internal corrosion, external corrosion, and stress corrosion cracking. Internal corrosion affects the inner surface of a pipeline and is usually caused by the material being conveyed. It is influenced by two key factors, namely product corrosivity and corrosion intervention. In contrast, external corrosion occurs as a result of subsurface or atmospheric factors in buried and above-ground pipelines, respectively [25]. Due to its intricate mechanism, subsurface corrosion is more destructive than atmospheric corrosion. Cathodic protection and pipeline coating can help to delay its occurrence [26]. Stress corrosion cracking is material cracking caused by the combined effects of corrosion and tensile stress [27].
Operational failure results from operator errors, operational upsets, and failures or inadequacies in safeguarding systems [
28]. This failure type is uncommon, despite having disastrous repercussions. In addition to pressure monitoring, the deployment of safety devices, supervisory control and data acquisition communications, and other methods may help to prevent operational failures [
29].
Third-party failure is caused by events unrelated to the pipeline [
30]. Intentional or accidental third-party operations are the most common failure source in oil pipelines, despite being the least studied factor in pipeline hazard assessment [
31]. Cover depth, coating, and public education are among the factors that influence third-party damage.
Natural hazards such as volcanic activity, lightning strikes, earthquakes, land displacement, and flooding are uncommon [
32]. To avoid this type of failure, geotechnical and hydrotechnical investigations are conducted before pipeline installation.
4. Data Collection
Failure records provided by CONCAWE in 2019 are used to develop the failure prediction models [
21]. The data included 49 years of spillage data dating back to 1971, as well as 36,000 km of pipelines transporting 620 million m³ of crude oil and petroleum products per year across Europe. A total of 73 agencies and organizations operating around 35,691 km of oil pipelines provide annual data for the CONCAWE study. In 2019, the total transported amount of crude oil and processed products was roughly 619 million m³, while the overall traffic volume was estimated at 119 × 10⁹ m³·km.
Six spillage incidents were recorded in 2019, equivalent to 0.18 spillages per 1000 km of line. This value is much lower than the long-term annual average of 0.44, which has been declining from a value of 1.1 in the mid-1970s. Of the six recorded incidents, two were caused by mechanical failures, one by operational issues, and three by corrosion; none were caused by natural hazards or by intentional or accidental third-party activity. No injuries, deaths, or fires were recorded as a result of these spills. The gross spillage volume was 961 m³ (28.3 m³ per 1000 km of pipeline), compared to the long-term average of 62 m³ per 1000 km. It was reported that 93% of the spillage volume was collected or disposed of securely.
The CONCAWE database includes 586 records for the five different failure causes (i.e., mechanical, operational, corrosion, natural hazards, and third-party). A total of 232 event records lack data for certain factors and are therefore removed from the database, leaving 354 incidents with complete data. The utilized dataset comprises 253 accidents owing to corrosion and third-party activities. Accidents caused by mechanical, operational, and natural hazards are not included in the dataset due to their low probabilities of occurrence.
Table 1 depicts a sample of the database used for building the model. Each spill incident represents a unique instance and is characterized by five distinct attributes in addition to the primary cause/type of failure. Pipeline diameter, service type, facility type, age, and land use are all considered explanatory variables. The model excludes the gross and net spillage volume, leak detection method, and facility part variables because they cannot be determined before a failure occurs, whereas the established model is designed to anticipate the failure cause before its occurrence.
5. Model Development
The failure prediction model development is illustrated in Figure 4. The data extracted from the CONCAWE study is utilized to estimate the condition of oil pipelines. MLP and RBF neural networks, as well as MNL regression models, are developed to forecast different failure types using SPSS 28 statistical software [52]. To build the models, the dataset is randomly divided into 70% and 30% for training and validation, respectively. As depicted in Table 2, the input factors (i.e., diameter, service, facility, age, and land use) are the key predictors of the developed models, whereas the main output is the failure type. The three qualitative factors (service, facility, and land use) are incorporated into the model after being converted into numeric values. Furthermore, the two quantitative parameters (age and diameter) have different units of measure, so the values of the input and output factors must be normalized. The resulting models forecast the failure type based on various combinations of input categories.
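The preprocessing steps described above (numeric coding of the qualitative factors, normalization, and the random 70%/30% split) can be sketched as follows. This is a minimal illustration rather than the SPSS procedure itself: the helper names, the integer coding scheme, the min-max scaling, and the sample values are all assumptions made for the example.

```python
import random

def encode_categories(values):
    """Map each distinct label to an integer code (1, 2, ...), in sorted order."""
    codes = {label: i for i, label in enumerate(sorted(set(values)), start=1)}
    return [codes[v] for v in values], codes

def min_max(values):
    """Rescale numeric values to the [0, 1] range (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def split_70_30(records, seed=0):
    """Shuffle the records and split them into 70% training / 30% validation."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical sample values, not taken from Table 1.
diameters = [8, 12, 24, 34]                       # inches
ages = [5, 20, 35, 50]                            # years
services = ["crude", "product", "crude", "hot"]   # qualitative factor

service_codes, mapping = encode_categories(services)
scaled_diameters = min_max(diameters)
scaled_ages = min_max(ages)

# Split the indices of the 253 corrosion and third-party records.
train_idx, valid_idx = split_70_30(list(range(253)))
print(mapping, len(train_idx), len(valid_idx))
```

Fixing the seed makes the split reproducible, which helps when the three model types are to be compared on the same partitions.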
5.1. MLP and RBF Models
The network architecture is influenced by the selected input and output variables. Each variable is represented by a single artificial neuron. As a result, the network architecture has five neurons in the input layer and two in the output layer. The appropriate number of hidden neurons is determined after conducting several iterations. For MLP, the activation functions in the hidden and output layers are hyperbolic tangent and softmax, respectively. Additionally, the stopping rule used is one consecutive step with no error decrease. By adjusting connection weights, a scaled conjugate gradient optimization algorithm is employed to minimize the objective function (error). Meanwhile, for RBF, softmax and identity are the activation functions in the hidden and output layers, respectively.
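To make the two architectures concrete, the sketch below implements a single forward pass for each network type: a hyperbolic tangent hidden layer with a softmax output for the MLP, and a normalized (softmax) Gaussian hidden layer with an identity output for the RBF network. The function names, the Gaussian form of the basis functions, and all weights, centers, and widths are illustrative assumptions; training (e.g., by scaled conjugate gradient) is not shown.

```python
import math

def softmax(z):
    """Numerically stable softmax."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """tanh hidden layer followed by a softmax output layer."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    scores = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w_out, b_out)]
    return softmax(scores)

def rbf_forward(x, centers, widths, w_out, b_out):
    """Normalized Gaussian hidden units followed by an identity output layer."""
    # Log of each Gaussian activation: -||x - c||^2 / (2 * width^2)
    log_act = [-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2 * s ** 2)
               for c, s in zip(centers, widths)]
    hidden = softmax(log_act)  # activations are normalized to sum to 1
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w_out, b_out)]

# Hypothetical normalized inputs: diameter, service, facility, age, land use.
x = [0.2, 0.5, 0.0, 1.0, 0.4]

# MLP with 5 inputs, 3 hidden neurons, 2 outputs (arbitrary weights).
w_hidden = [[0.1 * (i + j) for j in range(5)] for i in range(3)]
b_hidden = [0.0, 0.1, -0.1]
w_out = [[0.5, -0.2, 0.3], [-0.4, 0.6, 0.1]]
b_out = [0.0, 0.0]
mlp_probs = mlp_forward(x, w_hidden, b_hidden, w_out, b_out)

# RBF network with 2 hidden units for brevity (arbitrary centers and widths).
centers = [[0.0] * 5, [1.0] * 5]
rbf_out = rbf_forward(x, centers, [1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(mlp_probs, rbf_out)
```

Because the RBF hidden activations already sum to one, an identity output layer with these particular weights simply passes them through, which is why the output layer needs no further normalization in this toy setting.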
5.2. MNL Model
The likelihood of each output category (i.e., failure type) is computed using multinomial regression, which helps to identify the most likely and critical failure sources. The category with the highest probability is assigned as the predicted value. Each category is compared against a baseline in the logit model, and third-party failure is utilized as the baseline in this research. The model performance is measured using the maximum-likelihood estimate, which indicates the similarity between the observed and modeled parameter values. It is commonly reported as −2 log-likelihood (−2LL) [50,53].
Several pseudo-R-square measures are utilized to evaluate the goodness of fit for the MNL models, as per Equations (6)–(8) [54,55,56]. These metrics resemble R-square in that they range from 0 to 1, with higher values indicating better model fit.
where $M_{0}$ and $M_{1}$ denote models without and with predictors, respectively. Furthermore, $\hat{L}$ denotes the estimated likelihood, and $N$ denotes the number of data points.
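For orientation, the three pseudo-R-square measures most commonly reported alongside multinomial logistic regression are those of Cox and Snell, Nagelkerke, and McFadden. Their standard forms are given below; that these are the measures intended by Equations (6)–(8), and the notation ($M_{0}$, $M_{1}$, $\hat{L}$, $N$), are assumptions.

```latex
% Standard textbook forms; the numbering and notation of the original
% Equations (6)-(8) are assumed, not reproduced.
\begin{align}
R^{2}_{\mathrm{CS}}  &= 1 - \left( \frac{\hat{L}(M_{0})}{\hat{L}(M_{1})} \right)^{2/N}
  && \text{(Cox and Snell)} \\
R^{2}_{\mathrm{N}}   &= \frac{R^{2}_{\mathrm{CS}}}{1 - \hat{L}(M_{0})^{2/N}}
  && \text{(Nagelkerke)} \\
R^{2}_{\mathrm{McF}} &= 1 - \frac{\ln \hat{L}(M_{1})}{\ln \hat{L}(M_{0})}
  && \text{(McFadden)}
\end{align}
```

The Nagelkerke measure rescales the Cox and Snell value by its maximum attainable value so that it can reach 1.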
The initial likelihood for the reduced model, which omits the effect of the investigated variable, is estimated to determine each predictor’s importance. This probability is compared against the reported results when considering all predictors (full model). The chi-square for each predictor is then determined by subtracting the full model value from the reduced model value. The predictor is deemed significant when it is associated with high chi-square and low significance values.
7. Results and Discussion
MLP and RBF neural networks, as well as MNL regression models, are developed to forecast corrosion and third-party failures. For the MLP and RBF models, the optimum number of hidden neurons is determined to be three and ten, respectively. As depicted in Table 3, the importance factor analysis for MLP ranks the predictors in the following order: facility, service, land use, age, and diameter. For the RBF model, the order of importance is service, facility, land use, age, and diameter.
The receiver operating characteristic (ROC) curve is a diagnostic method for evaluating classification problems. To evaluate classifier performance in differentiating positive and negative data, it plots the true positive rate versus the false positive rate. The area under the ROC curve equals the probability that the classifier ranks a randomly chosen positive instance above a randomly chosen negative one, so a larger area indicates better discrimination.
Table 4 demonstrates that the area under the ROC curve for each dependent variable category in the MLP model is generally greater than 0.7, indicating good prediction accuracy. For the RBF model, the area under the ROC curve is 0.8, indicating very good prediction accuracy.
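The area under the ROC curve can be computed directly from its ranking interpretation, i.e., as the share of positive-negative pairs in which the positive case receives the higher score. The following is a minimal sketch with hypothetical labels and scores; the function name and data are illustrative and are not taken from Table 4.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the positive
    case receives the higher score; tied scores count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical example: 1 = corrosion (positive), 0 = third-party (negative).
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.2]
auc = roc_auc(labels, scores)
print(auc)
```

In this example five of the six positive-negative pairs are ordered correctly, so the area is 5/6. The pairwise form is quadratic in the number of cases, which is acceptable for a dataset of a few hundred records.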
As shown in Table 5, the likelihood function value for the MNL model without independent variables is 305.642, whereas the value with all independent variables is 293.395. The decrease in this value after including the independent variables reflects improved model fit. The chi-square (12.246) has a significance of 0.032, indicating a statistically significant association between the explanatory and response variables [53]. Moreover, the pseudo-R-square findings are summarized in Table 6.
Table 7 summarizes the likelihood ratio test analysis findings. It reveals that the most influential variable is facility type because it is associated with the highest chi-square (5.891) and lowest significance (0.015) values.
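The likelihood ratio tests reported above can be reproduced approximately from the published statistics. The sketch below computes the chi-square upper-tail probability with the Python standard library only; the degrees of freedom (five for the full model, one for facility type) are assumptions consistent with one coefficient per predictor, and the helper name is illustrative.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + 2*phi(chi) * sum_r chi^(2r-1) / (1*3*...*(2r-1))
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    phi = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return math.erfc(math.sqrt(half)) + 2.0 * phi * total

# Overall model: difference between the two likelihood function values.
lr_stat = 305.642 - 293.395
p_model = chi2_sf(lr_stat, df=5)      # df = 5 is an assumption

# Facility type: reduced-model minus full-model value, as reported.
p_facility = chi2_sf(5.891, df=1)     # df = 1 is an assumption
print(lr_stat, p_model, p_facility)
```

Under these assumed degrees of freedom, both probabilities fall close to the reported significance values, which suggests that the predictors entered the model as single numeric covariates.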
As previously explained, the MNL regression model calculates the likelihood of each failure type from the logit of each output category. In this context, the coefficients of the independent variables are given in Table 8. The logit for corrosion and third-party failures is calculated using Equations (12) and (13).
Finally, the likelihood of each failure source is computed using Equations (14) and (15).
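The logit-to-probability step can be sketched as follows, with third-party failure as the baseline category whose logit is fixed at zero. The coefficients and input values are hypothetical placeholders, not the values of Table 8, and the function name is illustrative.

```python
import math

def mnl_probabilities(x, coefficients, baseline="third-party"):
    """Category probabilities from a multinomial logit with a baseline category.

    `coefficients` maps each non-baseline category to (intercept, [slopes]).
    The baseline logit is fixed at zero.
    """
    logits = {baseline: 0.0}
    for category, (intercept, slopes) in coefficients.items():
        logits[category] = intercept + sum(b * xi for b, xi in zip(slopes, x))
    peak = max(logits.values())
    exp_logits = {c: math.exp(v - peak) for c, v in logits.items()}
    total = sum(exp_logits.values())
    return {c: v / total for c, v in exp_logits.items()}

# Hypothetical coefficients (NOT the values of Table 8).
coefficients = {"corrosion": (-0.5, [0.8, -0.3, 0.4, 1.1, -0.6])}

# Hypothetical inputs: diameter, service, facility, age, land use.
x = [0.25, 1.0, 0.0, 0.6, 0.5]

probs = mnl_probabilities(x, coefficients)
predicted = max(probs, key=probs.get)   # category with the highest probability
print(probs, predicted)
```

With only two categories, this reduces to binary logistic regression: the corrosion probability is exp(logit) / (1 + exp(logit)) and the third-party probability is its complement.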
Table 9 summarizes the results of the training and validation phases for the MLP, RBF, and MNL models. For the first approach, the AVP values for the MLP, RBF, and MNL models are 0.84, 0.86, and 0.80, respectively, in the training phase. Meanwhile, for the validation phase, the AVP values are 0.85, 0.83, and 0.82 for the MLP, RBF, and MNL models, respectively. Over the training and validation phases combined, the MLP, RBF, and MNL models predict failure causes with AVP values of 0.84, 0.85, and 0.81, respectively. The AVP of every model is above 0.80, indicating very good classification accuracy. The findings confirm the robustness of the developed RBF model and its ability to forecast pipeline failure based on a set of input variables. However, the prediction accuracy of the models could have been compromised by the unavailability of some important factors (e.g., thickness, operating pressure, and yield strength) that contribute to oil and gas pipeline failure. Furthermore, due to confidentiality concerns, access to a significant number of failure records in the oil and gas industry is often difficult.
For the training phase, the findings of the second approach indicate that the AVP values are 0.74, 0.77, and 0.70 for the MLP, RBF, and MNL models, respectively. Meanwhile, for the validation phase, the AVP values are 0.79, 0.70, and 0.72 for the MLP, RBF, and MNL models, respectively. The AVP values obtained using the second approach are lower than those reported using the first approach. The reason is that, in the second approach, an event is counted as entirely incorrect whenever the predicted failure type differs from the actual one, whereas the first approach measures the deviation between the actual and modeled failure types. Despite this, the second approach produces satisfactory AVP results for the models.
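The difference between the two validation approaches can be illustrated with a small sketch. The deviation-based form used here (one minus the mean of |1 − predicted/actual| over coded failure types) and the numeric coding of the classes are assumptions made for illustration; the exact definitions are those given in the model validation section of the paper.

```python
def avp_deviation(actual, predicted):
    """Approach 1 (assumed form): 1 - mean(|1 - predicted / actual|)."""
    aip = sum(abs(1 - p / a) for a, p in zip(actual, predicted)) / len(actual)
    return 1 - aip

def avp_exact(actual, predicted):
    """Approach 2: share of cases whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical coding: 1 = corrosion, 2 = third-party.
actual = [1, 2, 2, 1, 2, 2, 1, 2]
predicted = [1, 2, 1, 1, 2, 2, 2, 2]

avp_first = avp_deviation(actual, predicted)
avp_second = avp_exact(actual, predicted)
print(avp_first, avp_second)
```

Under this assumed coding, a misclassified third-party case contributes a deviation of only 0.5 in the first approach but a full error in the second, which is consistent with the second approach yielding the lower AVP.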
Figure 5 illustrates the residual plots for the actual and modeled failure types using classification models. The mean of errors is −0.37, −0.26, and −0.45 for the MLP, RBF, and MNL models, respectively. Meanwhile, the standard deviation of the measured errors ranges between 0.91 and 1.05 for the classification models. This figure shows that the predicted values of the three models are within acceptable bounds and are distributed around the actual values. The MLP and RBF models outperform the MNL regression model in terms of accuracy because they take into account the nonlinear relationship between the dependent and independent variables, as well as the correlation between the parameters that determine the pipeline failure cause.
The outcomes of the established models are compared to the results reported in the literature. Zakikhani et al. [
17] presented an ANN model that was associated with an AVP of 0.728 for forecasting corrosion and third-party failures in oil pipelines based on physical, environmental, and operational factors. In this research, the developed MLP, RBF, and MNL models are associated with AVP values of 0.84, 0.85, and 0.81, respectively. This implies that the proposed MLP, RBF, and MNL models enhanced the AVP value reported in the literature by 15.4%, 16.8%, and 11.3%, respectively. Therefore, the proposed RBF model outperforms previously published models, emphasizing its robustness and accuracy capabilities.
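The reported enhancements are relative improvements over the reference AVP of 0.728, which can be verified with a few lines of arithmetic. The function name is illustrative.

```python
def relative_gain(new, reference):
    """Percentage improvement of `new` over `reference`."""
    return (new - reference) / reference * 100

reference_avp = 0.728  # ANN model of Zakikhani et al. [17]
model_avp = {"MLP": 0.84, "RBF": 0.85, "MNL": 0.81}

gains = {name: round(relative_gain(avp, reference_avp), 1)
         for name, avp in model_avp.items()}
print(gains)  # {'MLP': 15.4, 'RBF': 16.8, 'MNL': 11.3}
```

In absolute terms, the corresponding gains are 0.112, 0.122, and 0.082 AVP units, respectively.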
8. Conclusions
This research study developed three models for predicting corrosion and third-party failures in oil pipelines, taking into account several predictors such as age, diameter, facility, service, and land use. Findings showed that these failure categories accounted for more than 70% of total oil pipeline accidents. The models were developed using multilayer perceptron (MLP) neural network, radial basis function (RBF) neural network, and multinomial logistic (MNL) regression. The importance factor analysis for MLP revealed that the order of importance of the predictors was as follows: facility, service, land use, age, and diameter. On the other hand, for the RBF model, the predictors were arranged in the following order of importance: service, facility, land use, age, and diameter. For MNL, the likelihood ratio test analysis revealed that the most influential variable was facility type because it was associated with the highest chi-square and lowest significance values. Moreover, the model calculated the likelihood of each failure source, assisting decision makers in determining the most likely and critical failure sources. For MLP and RBF neural networks, the area under the receiver operating characteristic (ROC) curve for each dependent variable category was about 0.7 and 0.8, respectively, indicating good prediction accuracy. The MLP, RBF, and MNL models predicted failure causes with AVP values of 0.84, 0.85, and 0.81, respectively. The developed models were tested for robustness against other previous models based on AVP value. It was found that the proposed MLP, RBF, and MNL models enhanced the AVP value reported in the literature by 15.4%, 16.8%, and 11.3%, respectively. Therefore, the proposed RBF model outperformed previously published models, emphasizing its robustness and accuracy capabilities. 
This can be attributed to the fact that the RBF model accounted for the nonlinear relationship between the dependent and independent variables, as well as the correlation between the parameters that determined the pipeline failure cause. The established models provide decision makers with a clear picture of the failure sources that endanger a pipeline, allowing them to mitigate risks and ensure pipe safety. These models can assist oil pipeline operators and decision makers in planning pipeline safety by forecasting how pipelines will break based on specific physical, operational, and environmental features. It is recommended in the future to examine the performance of the developed models for predicting other failure types (e.g., mechanical, operational, and natural hazards) in oil and gas pipelines.