Novel Adaptive Hidden Markov Model Utilizing Expectation–Maximization Algorithm for Advanced Pipeline Leak Detection

Zadehbagheri, Omid; Salehizadeh, Mohammad Reza; Naghavi, Seyed Vahid; Moattari, Mazda; Moshiri, Behzad

doi:10.3390/modelling5040069


by Omid Zadehbagheri ¹, Mohammad Reza Salehizadeh ¹,*, Seyed Vahid Naghavi ², Mazda Moattari ¹ and Behzad Moshiri ³,⁴

¹ Department of Electrical Engineering, Marvdasht Branch, Islamic Azad University, Marvdasht 1477893780, Iran

² Engineering Division, Research Institute of Petroleum Industry, Tehran 1485613111, Iran

³ School of ECE, College of Engineering, University of Tehran, Tehran 1439957131, Iran

⁴ Department of ECE, University of Waterloo, Waterloo, ON N2L 3G1, Canada

* Author to whom correspondence should be addressed.

Modelling 2024, 5(4), 1339-1364; https://doi.org/10.3390/modelling5040069

Submission received: 20 August 2024 / Revised: 14 September 2024 / Accepted: 20 September 2024 / Published: 24 September 2024

(This article belongs to the Topic Oil and Gas Pipeline Network for Industrial Applications)


Abstract:

In the oil industry, the leakage of pipelines containing hydrocarbon fluids causes significant environmental and economic damage. Recently, there has been a growing trend in employing data mining techniques for detecting leaks. Among these methods is the Hidden Markov Model, which, despite good results with stationary data, becomes inefficient when a leak causes a drop in the pressure or flow, reducing its accuracy. This paper presents an adaptive Hidden Markov method. Previous methods had low accuracy due to insufficient information for accurate leak detection. They often classified the size and location of leaks broadly. In contrast, the proposed model extracts hidden features to accurately identify the location and size of leaks, even in noisy conditions. Simulating a leak in a section of an oil pipeline in the Iranian Oil Export Corridor demonstrates the proposed method’s superiority over common methods like K-NN, SVM, Naive Bayes, and logistic regression.

Keywords:

leak detection; hidden Markov model; regression model; pipeline; oil and gas simulator (OLGA)

1. Introduction

1.1. Background and Literature Review

1.1.1. Importance of Leak Detection

Pipelines are one of the most pivotal channels for transporting oil and gas. Consequently, regular inspections of the pipelines are essential. Leaks in gas and oil pipeline networks result in significant economic losses and environmental pollution. For instance, from 2015 to 2020, 954 pipeline leaks were reported in the United States, causing financial damage of USD 1.11 billion. Consequently, proper measures for detecting leaks have gradually attracted researchers’ attention in recent years. Millions of kilometers of pipelines act as crucial conduits in the global oil and gas industry, representing 3% to 7% of the total transported oil or gas [1]. Oil leakage can occur due to several reasons, including unauthorized diversions/branches from pipelines that may be due to pipeline theft.

Ruptures and leaks might continue to happen, and operators must provide systems for leak detection to mitigate the ramifications. This, in turn, could reduce the risk for the pipeline operator. A steady decline in the number of spills has been evident over the last two decades as reported by the Department of Transportation Office of Pipeline Safety (DTOPS). Nevertheless, since 2001, the overall average number of spills seems to have leveled off at a rate of about 100 incidents annually. Even though the yearly reports may vary below and above the 2020 value, on average, a flattened-out curve is observed. The reason for the termination of the declining trend from 2001 to 2020 has not been specifically identified [2]. Therefore, the search for effective leak detection methods is crucial given the potential harm caused by pipeline damage.

1.1.2. Existing Methods and Research Gaps

In recent decades, numerous techniques for monitoring the structural health of pipelines have been developed to detect such damage. As a result, a wide range of validated and established methods now exist for locating pipeline damage. These methods can be categorized in various ways, such as hardware- vs. software-based, direct vs. indirect detection, and internal vs. external detection. Leak detection commonly employs both software- and hardware-based methods [3].

Leak detection methods based on hardware include magnetic flux, characteristic impedance, radioactive tracing, gas infrared imaging, and self-organizing wireless sensor detection [4]. While hardware-based methods have improved over the years and proven useful, they can still be complex and time-consuming. On the other hand, software-based methods have not improved as much and have been slower to develop, with current systems in the market still using design concepts from the 1980s [4]. Software-based methods cover wavelet analysis, support vector machine, flow balance, negative pressure wave, statistical, fuzzy neural network, generalized correlation analysis, sound wave, and real-time model detection methods [5].

The pressure profile of the pipeline plays a crucial role in leak detection. Leaks can be identified by a sudden drop in pressure, followed by a partial recovery. This pressure pulse travels both downstream and upstream in a wave-like pattern through the pipeline. Accurately addressing the velocity of hydraulic transient wave propagation, and detecting and locating leaks, is essential for the automated supervision of the pipeline. To monitor pressure profiles, operators deploy sensors at different locations along the pipeline. These readings are compared with pressure profiles generated from fluid flow modeling, which incorporates pipeline friction factors, the Darcy–Weisbach equation, and fluid flow parameters. By comparing these values, operators can detect anomalies that indicate leaks. However, this method does not pinpoint the leak location, necessitating further verification and assessment before making a shutdown decision. The complexity and length of pipeline networks, with multiple pumps and compressors, cause anomalies and delays in the pressure profile. Current methods for detecting pipeline leaks, in addition to location techniques, are generally categorized into two primary approaches under the internal method: real-time transient modeling and pipeline data modeling. These methods utilize steady-state pipeline models to predict unknown parameters and more accurately detect and locate leaks. The pressure distribution in the pipeline and leak location is predicted based on the flow and pressure rate signals at both ends of the pipelines using pipeline-model-based methods. In [6,7,8], the data-based approach has been implemented using neural networks and fuzzy logic to classify leak size and distance. The model-based approach is implemented using state estimation along with the calibration of the fitting loss coefficient as described in [9].

One of the important topics in the control literature is model-based fault detection and identification, which has attracted researchers’ attention recently [10]. If we consider a leak as a fault in the pipeline network, reviewing the methods used in this regard can be effective. On the other hand, data mining algorithms are widely used in many industries, including the energy industry. They have already been implemented as computer-aided leak detection methods. Various algorithms can be introduced to detect breaks to increase reliability and improve accuracy and sensitivity. Applying machine learning techniques to leak detection opens the opportunity to investigate the size and type of leaks that are undetectable by traditional methods [11]. With the advancement of machine learning, there has been a growing focus on methods based on statistical and probabilistic models that can effectively handle signal uncertainty. Support vector machines (SVMs) are a commonly used supervised learning algorithm for classification tasks. SVMs are robust and adaptable, suitable for linear and non-linear classifications, regression, and outlier detection. Qu et al. proposed a real-time monitoring and pipeline leak detection system using SVM [12]. The k-nearest neighbors (K-NN) algorithm utilizes feature similarity to predict values for new data points and is widely recognized as one of the most popular algorithms in machine learning and data mining, both in industry and academia. K-NN has also been effectively applied in leak detection [13]. Bayes is an approach for classifying phenomena based on their probability of occurrence or non-occurrence [14]. The Naive Bayes algorithm is applied to detect faults in different applications [15,16]. Additionally, deep learning methods have been employed in processing acoustic signals for leak detection, such as deep belief networks (DBNs) combined with genetic algorithms (GAs), which offer potential improvements in detection accuracy [17]. 
A Hidden Markov Model (HMM) is a statistical model that can be viewed as a simplified dynamic Bayesian network [18]. Amongst the statistical and probabilistic models, Markov and Hidden Markov Models (HMMs) are more capable of classifying and modeling patterns, particularly for signals that exhibit non-stationary behavior and low repeatability, which makes them effective for fault detection [19,20,21]. The HMM is a combination of two stochastic processes: one observed and the other, namely the hidden process, unknown. The hidden process can be observed only through the sequence of observations it produces. In [22], new connections between Hidden Markov Models and undirected graphical models are given. There are two reasons for using these models: (a) understanding and predicting an unknown process using another observable process and (b) explaining the variation of the unknown process based on observational changes [23]. The most well-known applications of HMMs include temporal pattern recognition (e.g., speech recognition), signal processing, novel gene clustering, handwriting recognition, gesture recognition, speech tagging, musical scores, prediction of water consumption, cyberattack detection in petroleum systems, acoustic leak detection, DNA sequencing, pipeline safety monitoring, and robotics [24]. Markov models are also used in the literature for leak identification [25]. Ai et al. created a pipeline monitoring and leak detection system that employs HMMs to identify acoustic signals indicative of damage [26]. Qiu et al. proposed an early-warning model of the compressor unit in the equipment chain of the gas pipeline based on a neural network and HMM, which exhibited superior generalization accuracy [27]. Ai et al. introduced a leak detection and pipeline monitoring system using the Linear Prediction Spectrum Coefficient and HMM approach for calculating damage acoustic signals [28]. Fagiani et al. proposed a framework for automatic leak detection in water and natural gas networks, utilizing three statistical methods: the Gaussian mixture model, the HMM, and a one-class support vector machine. The HMM exhibited the best performance among them [29]. Liu et al. developed a leak detection approach based on data mining, which incorporates Markov feature extraction along with a dual-phase decision strategy, integrating both short-term and long-term techniques. To detect leakages, they used a distance between raw pressure and linear pressure estimation [30]. Zhang et al. recently used the Gaussian mixture model and HMM to identify the negative pressure wave that causes the leakage [31].

One of the challenges faced by existing leak detection methods is the inadequate extraction of leak information. The features extracted from pressure and flow signals cannot fully represent all the useful information. To enhance the effectiveness of leak detection methods, more efficient feature extraction methods are needed. Feature extraction from the Markov chain can offer insights into dynamic changes and improve leak detection in the presence of noise [30]. Another problem with the above models is that while they can identify leak points, they cannot identify the size of the leak. Looking at the data as a chain, in addition to identifying the location of the leak, can help determine the leak size. The assumption is that future states depend solely on the current state and not on past events (the Markov property). This assumption typically allows for reasoning and computation with the model, which would otherwise be infeasible.

This paper introduces a new method for identifying leaks in oil and gas transmission pipelines. The developed method is based on HMM and utilizes regression analysis, called the Adaptive Hidden Markov Model (AHMM), to identify and estimate leak characteristics. It allows for the efficient modeling of raw pressure and flow data and can be used for both online and offline analysis. By simulating a leak in a section of an oil pipeline in the Iranian Oil Export Corridor, the accuracy of the method is evaluated using the pipeline pressure and flow data. Since actual leakage data were not available, OLGA v2022.1.0 software, which is widely regarded as the industrial standard for flow assurance and operational analysis in oil and gas pipelines, was adopted for simulating the pipeline in the event of a leak. Mechanical characteristics such as diameter, thickness, length of construction materials, and pipeline profiles, as well as dynamic parameters including pressure and operating flow of the line, are modeled in the OLGA software [32,33]. Many authors have conducted flow assurance studies using the OLGA code, which has been utilized to simulate system behavior and solve multiphase flow problems in various case studies. To make a comparative view, the proposed algorithm has been compared with Naive Bayes, K-NN, SVM, and logistic regression models. Moreover, the model can properly detect the size of the leakage. The capability and superiority of the model in identifying the location of leaks have been investigated as well.

1.2. Contribution Highlights and Paper Organization

The paper’s contributions are summarized as follows:

Introducing the Adaptive Hidden Markov Model (AHMM): This novel method identifies the size and location of oil leakages more accurately than existing techniques like K-NN, SVM, logistic regression, and Naive Bayes, and operates effectively in detecting small leaks. The AHMM extracts linear flow and pressure trends using the Hidden Markov concept, providing a robust analysis in both online and offline settings.
Practical Application and Flexibility: The proposed AHMM algorithm has been successfully tested using simulation data generated by OLGA, a widely used industrial standard for pipeline simulation. Although actual leak data were unavailable, the simulations were carefully calibrated with real pipeline parameters from a section of the Iranian Oil Export pipeline, ensuring that the results closely reflect real-life operational conditions. These promising results suggest that the model holds potential for practical implementation. The flexibility of the AHMM allows for effective analysis of pressure and flow data, making it adaptable to various real-world scenarios.

The paper is organized as follows: Section 2 provides a comprehensive overview of leak detection principles, emphasizing the fundamental models and equations involved. Section 3 explains the Hidden Markov Model (HMM) and its application in leak detection. Section 4 describes the proposed Adaptive Hidden Markov Model (AHMM) method for leak detection, including the regression model and the EM algorithm for parameter estimation. Section 5 presents the numerical analysis and results, demonstrating the performance of the AHMM in detecting leaks and comparing it with other common methods. Finally, Section 6 concludes the research by summarizing the findings, emphasizing the benefits of the proposed method, and suggesting directions for future research.

This structure aims to provide a comprehensive understanding of the proposed method, from theoretical foundations to practical applications, and offers a clear pathway for further advancements in pipeline leak detection.

2. Principle of Leak Detection

In this section, we review the fundamental principles of pipeline leak detection. Our proposed method is not based on traditional mathematical models but instead fully relies on simulated data and data-driven analysis. For this purpose, we have utilized the OLGA software as a digital twin model, which accurately simulates the behavior of pipelines under various conditions and provides reliable data for our analysis. These simulated data are then applied to methods such as the Adaptive Hidden Markov Model (AHMM) and other data-driven comparative techniques. In the following, we will provide further details on how the OLGA software operates, the simulation process, and the impact of leaks on the pressure and flow profiles within the pipeline.

2.1. Model and Equations

The mass transport equation, which governs the transport of a mass field represented by $m_i$, moving with velocity $U_i$, can be expressed as follows:

$$\delta_t m_i + \delta_z (m_i U_i) = \sum_j \psi_{ji} + G_i \tag{1}$$

In the equation, $\delta_t$ represents differentiation with respect to time, while $\delta_z$ represents spatial differentiation. The term $\psi_{ji}$ refers to the rate of mass transfer between the i-th and j-th mass fields. The variable $G_i$ represents any mass source or sink. If we consider a momentum field represented by $m_i U_i$, the equation governing the balance of momentum can be expressed as follows:

$$\delta_t (m_i U_i) + \delta_z (m_i U_i^2) = m_i \, g \cos(\varphi) + P_i + G_i U_i + \sum_j \left(\Psi_{ji}^{+} U_j - \Psi_{ji}^{-} U_i\right) + \sum_j F'_{ji} (U_j - U_i) - F_i^{w} U_i \tag{2}$$

The equation for momentum balance takes into account several factors, including the angle of the pipe ($\varphi$), the pressure force ($P_i$), the friction forces between the i-th and j-th mass fields ($F'_{ji}$), and wall friction ($F_i^{w}$). The momentum contribution related to the mass source/sink $G_i$ is denoted as $G_i U_i$. Momentum contributions from the mass transfer between the j-th and i-th mass fields are represented by $\Psi_{ji}$. In this context, $\Psi_{ji}^{+}$ denotes the net contribution from mass field j to i, which enters the balance with velocity $U_j$, while $\Psi_{ji}^{-}$ signifies the net contribution from mass field i to j, which leaves with velocity $U_i$.

The volume equation, which accounts for the relationship between pressure, temperature, and fluid volume, can be derived by applying the equation of state and the basic constraint that the volume of fluid is equal to the pipe volume. It can be expressed as

$$\sum_L \left(\frac{m_L}{\rho_L^2} \frac{d\rho_L}{dP}\right) \delta_t P + \sum_L \left(\frac{m_L}{\rho_L^2} \frac{d\rho_L}{dT}\right) \delta_t T + \sum_L \frac{1}{\rho_L} \left(\delta_z (m_L U_L) + G_L\right) = 0 \tag{3}$$

where L denotes the existing phases.

The energy balance equation for a specific mass field $m_i$ can be written as

$$\delta_t (m_i E_i) + \delta_z (m_i U_i H_i) = s_i + Q_i + \sum_j T_{ij} E_j \tag{4}$$

2.2. Impact of Leakage on Flow

Because the pipeline network is closed and its connections are specified, the continuity law can be used to determine the amount of possible leakage in pipelines [3]:

$$\frac{dm_1}{dt} = \frac{dm_2}{dt} \tag{5}$$

where $dm/dt$ is the change in mass over time, i.e., the mass flow. In this method, based on the fluid’s mass flow rate entering the line from the source station and taking into account the existing outlets along the pipeline, the amount of leakage in the line can be determined. If a leak occurs at any point along the pipeline, the flow drop measured between the two flowmeters indicates the amount of leakage.
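The mass-balance check described above can be sketched in a few lines of Python. This is only an illustration: the function name, the tolerance parameter, and the flowmeter readings are assumptions made for the example and do not come from the study.

```python
def estimate_leak_mass_flow(inlet_kg_s, outlet_kg_s, offtakes_kg_s=(), tolerance_kg_s=0.0):
    """Return the unaccounted-for mass flow in kg/s (0.0 if within tolerance)."""
    # Mass entering minus mass leaving through the outlet and known offtakes.
    imbalance = inlet_kg_s - outlet_kg_s - sum(offtakes_kg_s)
    # Imbalances no larger than the tolerance are treated as measurement noise.
    return imbalance if imbalance > tolerance_kg_s else 0.0


# Hypothetical flowmeter readings (kg/s), not data from the study.
leak = estimate_leak_mass_flow(100.0, 95.0, offtakes_kg_s=[2.0], tolerance_kg_s=0.5)
no_leak = estimate_leak_mass_flow(100.0, 99.8, tolerance_kg_s=0.5)
print(leak, no_leak)
```

In practice the tolerance would have to reflect flowmeter accuracy and line-pack transients, which this sketch ignores.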

For the pipeline under study, which is 87 km long and has a diameter of 42 inches, the inlet flow is approximately 60,000 barrels per day, and the inlet pressure is 18 bar. Based on this, the flow and pressure profiles have been plotted. It is worth mentioning that more detailed information about the pipeline is provided in Section 5 of the paper.

Based on Figure 1, the amount of volumetric flow drop after the occurrence of leakage is proportional to the amount of leakage. It should be noted that if the chart represented mass flow instead of volumetric flow, the mass flow drop after the leakage would be exactly equal to the mass flow of the leakage.

2.3. Impact of Leakage on Pressure

Frictional and similar losses affect the pressure of the fluid flowing through the pipeline, as do changes along the pipeline route such as altitude and weather conditions, so fluid equations should be used to calculate and investigate all of these changes. The most appropriate method for investigating the conditions of the fluid passing through the pipeline is to use the Bernoulli equation as follows:

$$\frac{P_1}{\rho_1} + \frac{1}{2} U_1^2 + g h_1 = \frac{P_2}{\rho_2} + \frac{1}{2} U_2^2 + g h_2 \tag{6}$$

In this equation, the pressure measured by the pressure transmitters along the pipeline will be used in the calculations. Additionally, the velocity of the fluid will be determined by measuring the volume flow at different points, and the pressure changes due to height changes will be evaluated in the height term in the equation. In the absence of leakage along the pipeline, the fluid pressure profile will align along the pipeline, ideally forming a straight line with a negative slope. However, in the event of a leak at any point on the line, the uniformity of the pattern at the site of the leak will be disrupted, causing a discontinuity in the pressure profile. Figure 2 illustrates the effect of leakage in the pipeline on the pressure profile. It is important to note that while the frictional effects are indeed considered in the pipeline profile, their influence is relatively small compared to the impact of elevation changes. Therefore, the pressure profile appears nearly linear, with the dominant factor being the change in elevation, and the effects of friction being less noticeable.
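As a rough illustration of how Equation (6) links two measurement points, the sketch below solves it for the downstream pressure. The function name, the fluid density of 850 kg/m³, the velocities, and the 10 m elevation gain are assumptions chosen for the example; only the 18 bar inlet pressure is taken from the text. Like Equation (6) itself, the sketch contains no friction term.

```python
G = 9.81  # gravitational acceleration, m/s^2


def downstream_pressure(p1_pa, rho1, u1, h1, rho2, u2, h2, g=G):
    """Solve the Bernoulli balance of Equation (6) for P2 (in Pa)."""
    # Left-hand side: energy per unit mass at point 1.
    energy1 = p1_pa / rho1 + 0.5 * u1 ** 2 + g * h1
    # Subtract the kinetic and potential terms at point 2, then scale by density.
    return rho2 * (energy1 - 0.5 * u2 ** 2 - g * h2)


# Assumed example: 18 bar inlet, constant density and velocity, 10 m climb.
p2 = downstream_pressure(18e5, 850.0, 1.0, 0.0, 850.0, 1.0, 10.0)
print(p2 / 1e5, "bar")
```

With equal densities and velocities, the pressure change reduces to the hydrostatic term, which matches the remark that elevation dominates the profile.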

3. Hidden Markov Model

The definition of HMM is based on a pair of stochastic processes. In the finite state, the underlying process is a discrete homogeneous Markov chain, which is not visible, and that is why it is called hidden. As a consequence of this layer, a sequence of stochastic observations will be generated. The structure of HMMs mainly consists of a Markov chain of latent variables, where each latent variable is linked to an observable outcome.

3.1. Markov Chain

Markov chains provide a relatively straightforward and versatile framework for modeling sequential patterns in time series data. These probabilistic models have found widespread application in various fields, including finance, biology, and natural language processing. They are especially effective for modeling systems where the next state depends only on the current state, rather than on the sequence of prior events. Let us consider a random process denoted as {$q_t$, t ∈ T}, where T = {0, 1, 2, …} represents the set of time points and S = {S₁, S₂, …, S_N} represents the set of possible states. At any given time t + 1, the state $q_{t+1}$ depends only on the state at the previous time, $q_t$: $P(q_{t+1} = S_j \mid q_1, q_2, \dots, q_t = S_i) = P(q_{t+1} = S_j \mid q_t = S_i) = P_{ij}$, where $P_{ij}$ represents the probability of transition from state $S_i$ to state $S_j$. This means that the past information is fully summarized by the knowledge of the current state, which is the essence of the Markov property.
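A minimal sketch of the Markov property in code: each new state is drawn using only the row of the transition matrix that belongs to the current state. The two-state matrix and the function name are hypothetical and serve only to illustrate the definition.

```python
import random


def simulate_chain(P, start, steps, rng):
    """Sample a state path; P[i][j] is the probability of moving from state i to state j."""
    path = [start]
    for _ in range(steps):
        current = path[-1]
        # Only the current state's row is used: the earlier path is irrelevant.
        path.append(rng.choices(range(len(P)), weights=P[current])[0])
    return path


# Hypothetical two-state chain with "sticky" states.
P = [[0.95, 0.05],
     [0.10, 0.90]]
path = simulate_chain(P, start=0, steps=20, rng=random.Random(0))
print(path)
```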

3.2. Typical HMM Method

Mathematically, a standard Hidden Markov Model (HMM) is defined by an underlying stochastic process that includes hidden states forming a Markov chain. The behavior of an HMM can be explained by the following factors:

N = Number of hidden states in the Markov chain. Even though the states are hidden, they could still correspond to physical concepts. Individual states are denoted by S = {S₁, S₂, …, S_N}, and the state at time t is indicated by $q_t$.

π = Initial state probability distribution, indicated by π = {π_i}, where $\pi_i = P[q_1 = S_i]$, 1 ≤ i ≤ N.

A = Matrix of state transition probabilities, indicated by A = {$P_{ij}$}, where $P_{ij} = P[q_{t+1} = S_j \mid q_t = S_i]$, 1 ≤ i, j ≤ N. Here, $P_{ij}$ is the probability of transitioning from state i at time t to state j at time t + 1, which has the Markov property.

B = Emission probability distribution at state S_j, indicated by B = {$b_j(y_t)$}, where $b_j(y_t) = P[y_t \mid q_t = S_j]$, 1 ≤ j ≤ N, describes the probability of emitting symbol y_t from state j.

The HMM assumes that the observations Y = (y₁, y₂, y₃, …, y_T) are generated from the finite set of hidden states S₁, S₂, …, S_N and have different properties in different states. A typical HMM is defined by a three-tuple: λ = (π, A, B).

In an HMM, each hidden state is governed by an initial probability π, and the transition between states at time t is represented by a transition matrix A. Within each hidden state q_t, an observation is emitted according to its distribution. This observable stochastic process set forms the foundation of the Hidden Markov Model. A graphical representation is provided in Figure 3.

The observed values of the random variables in the HMM can be generated from either a discrete distribution or a continuous one, such as Gaussian distribution. For an HMM to be effective in real-world applications, three fundamental problems must be addressed:

Evaluation Problem: Given a sequence of observations y₁, y₂, y₃, …, y_T and a model λ = (π, A, B), how can we efficiently calculate the probability of the sequence of observations, given the model P (Y|λ)?
Optimal State Sequence Problem: Given a sequence of observed values y₁, y₂, y₃, …, y_T and a model λ = (π, A, B), how can we determine the most likely sequence for hidden states q₁, q₂, …, q_T that best describes the sequence of observations?
Training Problem: How can we optimize the parameters of the model λ = (π, A, B) to maximize P(Y|λ), the probability of the sequence of observations y₁, y₂, y₃, …, y_T given the model?
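For the Evaluation Problem, the standard tool is the forward recursion, which avoids summing over every possible state path. The sketch below assumes a discrete-emission HMM for simplicity, whereas the model in this paper uses Gaussian regression emissions; the parameter values are invented for illustration.

```python
def forward_likelihood(pi, A, B, obs):
    """Compute P(Y | lambda) with the forward recursion.

    pi[i]   : initial probability of state i
    A[i][j] : transition probability from state i to state j
    B[j][k] : probability that state j emits symbol k
    """
    n = len(pi)
    # Initialisation: joint probability of the first symbol and each state.
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Recursion: propagate through the transitions, then weight by the emission.
    for y in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][y] for j in range(n)]
    # Termination: marginalise over the final state.
    return sum(alpha)


# Hypothetical two-state model and a two-symbol observation sequence.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
likelihood = forward_likelihood(pi, A, B, [0, 1])
print(likelihood)
```

For long sequences the unscaled recursion underflows, so a practical version would rescale alpha at each step or work with logarithms.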

In the proposed model, all three problems are examined for the AHMM, and the methods for addressing them are explained. Solving the Evaluation Problem paves the way for identifying the observation features that best capture the linear trend. Finding the exact location of the leak depends on solving the Optimal State Sequence Problem. By solving the Training Problem, the best parameters of the regression models can be estimated; combined with the information obtained from the Evaluation Problem, a regression model with optimal parameters can be fitted to the data and the leakage rate can be monitored. Solving the Training Problem thus gives the optimal model for the flow and pressure data inside pipelines and may help detect the intensity of leakage.
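To make the link between the Optimal State Sequence Problem and leak location concrete, the following sketch runs the Viterbi algorithm on a toy discrete-emission HMM. State 0 stands for "no leak" and state 1 for "leak", and symbol 1 stands for an observed pressure drop; these labels and all probabilities are assumptions made for the example, not the paper's fitted model.

```python
def viterbi(pi, A, B, obs):
    """Return the most likely hidden-state path for a discrete-emission HMM."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers = []
    for y in obs[1:]:
        step, new_delta = [], []
        for j in range(n):
            # Best predecessor of state j at this time step.
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][y])
        backpointers.append(step)
        delta = new_delta
    # Backtrack from the most probable final state.
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    return path[::-1]


# Hypothetical model: symbol 0 = normal reading, symbol 1 = pressure drop.
pi = [0.9, 0.1]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.9, 0.1], [0.2, 0.8]]
states = viterbi(pi, A, B, [0, 0, 1, 1, 1])
leak_start = states.index(1)  # first position decoded as "leak"
print(states, leak_start)
```

The index at which the decoded path first switches state plays the role of the estimated leak position along the observation sequence.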

This study introduces an Adaptive Hidden Markov Model (AHMM) to extend the linear trend characteristics when the dynamic data of oil pressure or flow is predictable.

4. The Adaptive Hidden Markov Model (AHMM)

The two fundamental assumptions in this paper are as follows:

P(y_t | q_t = i, q_{t−1}, y_{t−1}, …, q₁, y₁) = P(y_t | q_t = i) = b_i(y_t), which means that the observations only depend on their generating state.

P(q_t = j | q_{t−1} = i, y_{t−1}, …, q₁, y₁) = P(q_t = j | q_{t−1} = i) = P_ij, which means that the hidden states have the Markov property.

When a leak occurs, the fluid density at the leakage point quickly decreases due to the loss of the flow medium and a drop in pressure. Because of varying pressure levels in the pipeline transmission process, numerical-based methods often produce numerous false positives. However, both normal and leak samples can yield similar Markov chains despite their significant differences. To address this problem, we extract the linear trend in flow and pressure using the Hidden Markov Model concept.

According to the presented models, the pipeline pressure and flow are random variables that depend on the distance traveled from the origin. Therefore, the linear regression model can be used to show the pressure and flow inside the pipelines. Assume that the pipeline pressure/flow follows the following model:

$$y_t = \alpha + \beta X_t + e_t, \quad t = 1, \dots, T \tag{7}$$

where X is the design matrix, $\beta$ is the unknown slope of the regression line, $\alpha$ is the unknown intercept, y denotes the observations (here the pressure or flow of the material inside the pipeline), and $e$ is the error, which is assumed to follow the Gaussian distribution. When a leak happens in the pipeline, the pressure or flow at point t drops. Therefore, the linear model that the pressure/flow follows will change. We suppose that the pressure or flow data follow the AHMM model:

$$y_t = \alpha_i + \beta_i X_t + e_t, \quad t = 1, \dots, T, \; i = 1, \dots, N \tag{8}$$

where $y_t$ and $X_t$ can be regarded as previous values. Also, $\alpha_i$ and $\beta_i$ are unobservable fixed parameters, and the error term is shown by $e_t \sim N(0, \sigma_i^2)$. The hidden state space of this model is denoted by S = {1, 2, …, N}; for example, S = {1, 2} is a special case of the AHMM with only two hidden states. S = 1 is the state that generates leak-free observations and S = 2 is the state that generates observations with leakage.
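The generative reading of Equation (8) can be sketched as follows: a hidden state is drawn from the Markov chain, and that state selects which regression line produces the observation. The function name and every parameter value below are hypothetical; the states are indexed from 0 in the code, so index 0 corresponds to the leak-free state and index 1 to the leak state.

```python
import random


def simulate_ahmm(x, params, P, pi, rng):
    """Draw hidden states and observations from a regression-emission HMM.

    params[i] = (alpha_i, beta_i, sigma_i) for hidden state i.
    """
    states, y = [], []
    state = rng.choices(range(len(pi)), weights=pi)[0]
    for t, x_t in enumerate(x):
        if t > 0:
            # Markov transition from the previous hidden state.
            state = rng.choices(range(len(P)), weights=P[state])[0]
        alpha, beta, sigma = params[state]
        # Equation (8): state-specific line plus Gaussian noise.
        y.append(alpha + beta * x_t + rng.gauss(0.0, sigma))
        states.append(state)
    return states, y


# Assumed values: pressure in bar against distance in km.
x = [float(k) for k in range(10)]
params = [(18.0, -0.1, 0.05), (17.0, -0.1, 0.05)]  # leak-free line, leak line
P = [[0.9, 0.1], [0.0, 1.0]]  # once the leak state is reached, it persists
states, y = simulate_ahmm(x, params, P, pi=[1.0, 0.0], rng=random.Random(42))
print(states)
```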

The Expectation–Maximization algorithm will be used for estimating the parameters of the AHMM. It can be utilized in two scenarios:

(i): When there are missing values;
(ii): When estimating the maximum likelihood is challenging and consequently estimating the parameters of the complete data becomes difficult.

Since the number of observations is very large, estimating the parameters in real situations can be challenging. To overcome this problem, one can use the EM algorithm. This algorithm consists of two steps. A demonstration of this design is shown in Figure 4. First, in step E of the EM algorithm, given the value of $\lambda^{(k)}$ and observations y = {y₁, y₂, …, y_T}, the expected value of the logarithm of the likelihood function is determined. Then, in step M, this expected log-likelihood is maximized by setting its derivatives with respect to the parameters to zero. As the two steps are iterated, the logarithm of the likelihood function continues to increase until it converges to a local maximum or the maximum iteration number $max_{iter}$ is reached.

The details of the EM algorithm process for estimating the initial value, π_i, the transition probability, P_ij, and the expected linear part of the model, $\alpha_i$, $\beta_i$, and $\sigma_i^2$, are explained below.

4.1. Expectation–Maximization Algorithm

The Expectation–Maximization (EM) algorithm is an iterative computational approach that can be utilized to estimate maximum likelihood values, and it is widely applicable in a variety of incomplete data problems [30]. The parameters of the AHMM can be efficiently calculated using the EM algorithm. The likelihood function of the complete data is expressed as $L(\lambda \mid Y, Q) = \pi_{q_1} \prod_{t=1}^{T} \left(P_{q_{t-1} q_t} b_{q_t}(y_t)\right)$. Generally, the emission model ($b_{q_t}(y_t) = b_i(y_t)$) is assumed to be a linear regression. In this context, λ is the vector of the parameters in the model, $\lambda = (A, B, \pi) = (P_{ij}, \alpha_i, \beta_i, \sigma_i^2, \pi_i)$, which contains the parameters of the linear regression $(\alpha_i, \beta_i, \sigma_i^2)$, the transition probability between hidden states (P_ij), and the probability of starting from the i-th hidden state, π_i. The likelihood function for the incomplete data can be calculated as follows:

$$L_c(\lambda \mid y) = P(y \mid \lambda) = \sum_{q_1, \dots, q_T = 1}^{N} P(y, q_1, \dots, q_T \mid \lambda) = \sum_{q_1, \dots, q_T = 1}^{N} \left(\pi_{q_1} \prod_{t=1}^{T} P_{q_{t-1} q_t} b_{q_t}(y_t)\right) \tag{9}$$

In step M, the maximization of Equation (9) by using the Lagrange multiplier

γ

and assuming

\sum_{i = 1}^{N} π_{i} = 1

yields

{\hat{π}}_{i} = \frac{P (q_{1} = i | Y, λ^{(k)})}{\sum_{i = 1}^{N} P (q_{1} = i | Y, λ^{(k)})}

(10)

By computing the differentiation based on P_ij in a similar way as Equation (9), we will have

{\hat{P}}_{i j} = \frac{\sum_{t = 1}^{T} P (q_{t} = i, q_{t + 1} = j | λ^{(k)}, Y)}{\sum_{t = 1}^{T} P (q_{t} = i | λ^{(k)}, Y)}

(11)

Appendix A contains details of the derivation process for

π_{i}

, the initial value and

P_{i j}

, the transition probability.

Also, because the observations are generated based on the linear regression, the observations of the hidden i-th state have a Gaussian distribution with a mean of

X β_{i}

and variance of

σ_{i}^{2}

. On the other hand,

E (y_{i} | q_{i}, X) = X β_{i}

and

y_{i} \sim N (X β_{i}, σ_{i}^{2})

, so we have

\sum_{i = 1}^{N} (\sum_{t = 1}^{T} \log b_{i} (y_{t})) P (q_{t} = i | Y, λ) = \sum_{i = 1}^{N} (\sum_{t = 1}^{T} \log [b_{i} (y_{t})]) P (q_{t} = i | Y, λ^{(k)})

(12)

The partial derivatives of Equation (9) concerning

α_{i}, β_{i},

and

σ_{i}^{2}

are taken from the below Gaussian distribution:

b_{i} (y_{t}) = φ (y_{t} : α_{i} + X β_{i}, σ_{i}^{2}) = \frac{1}{\sqrt{2 π σ_{i}^{2}}} \exp \{- \frac{{(y_{t} - α_{i} - X β_{i})}^{2}}{2 σ_{i}^{2}}\}

(13)

Taking the partial derivative of the expected log-likelihood with respect to

α_{i}

, with the emission density given by Equation (13), and setting it to 0 gives the estimator of

α_{i}

:

{\hat{α}}_{i} = [\frac{\sum_{t = 1}^{T} P (q_{t} = i | Y, λ^{(k)}) (y_{t} - {β^{(k)}}_{i} X_{t})}{\sum_{t = 1}^{T} P (q_{t} = i | Y, λ^{(k)})}]

(14)

Similarly, taking the partial derivatives with respect to the slope

β_{i}

and the variance

σ_{i}^{2}

and setting them to 0 gives the estimators

{\hat{β}}_{i} = [\frac{\sum_{t = 1}^{T} P (q_{t} = i | Y, λ^{(k)}) X_{t} (y_{t} - {\hat{α}}_{i})}{\sum_{t = 1}^{T} X_{t}^{2} P (q_{t} = i | Y, λ^{(k)})}]

(15)

{\hat{σ}}_{i}^{2} = [\frac{\sum_{t = 1}^{T} P (q_{t} = i | Y, λ^{(k)}) {(y_{t} - {\hat{α}}_{i} - X_{t} {\hat{β}}_{i})}^{2}}{\sum_{t = 1}^{T} P (q_{t} = i | Y, λ^{(k)})}]

(16)
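The M-step for the regression parameters of one hidden state can be sketched as a weighted least-squares fit, with the posterior probabilities acting as weights. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `m_step_regression` and the argument `gamma_i` (standing for P(q_t = i | Y, λ^(k))) are hypothetical, and the sketch solves the two normal equations for the intercept and slope jointly instead of updating the intercept with the previous slope.

```python
def m_step_regression(x, y, gamma_i):
    """Weighted least-squares update of (alpha_i, beta_i, sigma2_i) for one state.

    x, y     : equal-length lists of regressor values X_t and observations y_t
    gamma_i  : weights, gamma_i[t] standing for P(q_t = i | Y, lambda^(k))
    """
    w = sum(gamma_i)                                   # total posterior weight
    x_bar = sum(g * xt for g, xt in zip(gamma_i, x)) / w
    y_bar = sum(g * yt for g, yt in zip(gamma_i, y)) / w
    # Weighted centered sums of squares and cross-products.
    sxx = sum(g * (xt - x_bar) ** 2 for g, xt in zip(gamma_i, x))
    sxy = sum(g * (xt - x_bar) * (yt - y_bar)
              for g, xt, yt in zip(gamma_i, x, y))
    beta = sxy / sxx                                   # weighted slope
    alpha = y_bar - beta * x_bar                       # weighted intercept
    # Weighted mean of squared residuals gives the variance estimate.
    sigma2 = sum(g * (yt - alpha - beta * xt) ** 2
                 for g, xt, yt in zip(gamma_i, x, y)) / w
    return alpha, beta, sigma2
```

Calling the function once per hidden state, each with its own weight vector, gives one fitted line per state; observations with a small posterior weight for a state barely influence that state's line.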

4.2. Efficient Calculation of the Desired Quantities

The forward–backward algorithm is used to evaluate the posterior probabilities required by these estimators. The algorithm uses the parameters of the k-th iteration (

λ^{(k)}

) to estimate the parameters of iteration k + 1 (

λ^{(k + 1)}

). The forward variable is defined as

θ_{t} (i) = P (y_{1}, \dots, y_{t}, q_{t} = i | λ)

. It can be calculated in the following three steps:

\begin{matrix} θ_{1} (i) = π_{i} b_{i} (y_{1}) \\ θ_{t + 1} (j) = [\sum_{i = 1}^{N} θ_{t} (i) P_{i j}] b_{j} (y_{t + 1}), \quad t = 1, 2, \dots, T - 1 \\ P (Y | λ) = \sum_{i = 1}^{N} P (y_{1}, y_{2}, \dots, y_{T}, q_{T} = i | λ) = \sum_{i = 1}^{N} θ_{T} (i) \end{matrix}

(17)

And, in a similar way for the backward variable, we have

κ_{t} (i) = P (y_{t + 1}, y_{t + 2}, \dots, y_{T} | q_{t} = i, λ)

(18)

The three backward steps are

\begin{matrix} κ_{T} (i) = 1 \\ κ_{t} (i) = \sum_{j = 1}^{N} P_{i j} b_{j} (y_{t + 1}) κ_{t + 1} (j), \quad t = T - 1, T - 2, \dots, 1 \\ P (Y | λ) = \sum_{i = 1}^{N} P (y_{1}, y_{2}, \dots, y_{T} | q_{1} = i) P (q_{1} = i) = \sum_{i = 1}^{N} κ_{1} (i) b_{i} (y_{1}) π_{i} \end{matrix}

(19)

Then, we can calculate

γ_{t} (i)

and

η_{t} (i, j)

based on

κ_{t} (i)

and

θ_{t} (i)

:

γ_{t} (i) = P (q_{t} = i | y_{1}, y_{2}, y_{3}, \dots, y_{T}, λ) = \frac{P (y_{1}, y_{2}, y_{3}, \dots, y_{T}, q_{t} = i | λ)}{P (y_{1}, y_{2}, y_{3}, \dots, y_{T} | λ)} = \frac{θ_{t} (i) κ_{t} (i)}{\sum_{i} θ_{t} (i) κ_{t} (i)}

(20)

η_{t} (i, j) = P (q_{t} = i, q_{t + 1} = j | Y, λ) = \frac{P (q_{t} = i, y_{1}, y_{2}, \dots, y_{t}) P_{ij} b_{j} (y_{t + 1}) P (y_{t + 2}, \dots, y_{T} | q_{t + 1} = j)}{P (y_{1}, y_{2}, \dots, y_{T})} = \frac{θ_{t} (i) P_{ij} b_{j} (y_{t + 1}) κ_{t + 1} (j)}{\sum_{i} \sum_{j} θ_{t} (i) P_{ij} b_{j} (y_{t + 1}) κ_{t + 1} (j)}

(21)
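The forward and backward recursions and the two posterior quantities can be sketched in a few lines. This is an illustrative, unscaled version under stated assumptions: the function name `forward_backward` and the input layout `b[t][i]` (the emission density b_i(y_t), precomputed by the caller) are hypothetical, states are indexed from 0, and for long sequences a scaled or log-space variant would be needed to avoid numerical underflow.

```python
def forward_backward(pi, P, b):
    """Return (gamma, eta, likelihood) for an HMM.

    pi : initial probabilities, pi[i]
    P  : transition matrix, P[i][j] = probability of moving from i to j
    b  : emission densities, b[t][i] = b_i(y_t), precomputed by the caller
    """
    T, N = len(b), len(pi)
    theta = [[0.0] * N for _ in range(T)]   # forward variable
    kappa = [[1.0] * N for _ in range(T)]   # backward variable, kappa_T = 1

    for i in range(N):
        theta[0][i] = pi[i] * b[0][i]
    for t in range(1, T):
        for j in range(N):
            theta[t][j] = sum(theta[t - 1][i] * P[i][j] for i in range(N)) * b[t][j]

    for t in range(T - 2, -1, -1):
        for i in range(N):
            kappa[t][i] = sum(P[i][j] * b[t + 1][j] * kappa[t + 1][j]
                              for j in range(N))

    lik = sum(theta[T - 1])                 # P(Y | lambda)
    # gamma[t][i] = P(q_t = i | Y, lambda)
    gamma = [[theta[t][i] * kappa[t][i] / lik for i in range(N)]
             for t in range(T)]
    # eta[t][i][j] = P(q_t = i, q_{t+1} = j | Y, lambda), for t = 0 .. T-2
    eta = [[[theta[t][i] * P[i][j] * b[t + 1][j] * kappa[t + 1][j] / lik
             for j in range(N)] for i in range(N)] for t in range(T - 1)]
    return gamma, eta, lik
```

A quick consistency check on any output is that each row of `gamma` sums to one and that summing `eta[t][i][j]` over `j` recovers `gamma[t][i]`.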

Using Equations (20) and (21), one can estimate the transition probability and initial state probability of the AHMM as follows:

Transition probability:

P_{i j} = \frac{\sum_{t = 1}^{T - 1} η_{t} (i, j)}{\sum_{t = 1}^{T - 1} γ_{t} (i)}, t = 1, 2, \dots, T - 1, i, j \in N

(22)

Initial state probability:

π_{i} = γ_{1} (i)

(23)
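Given the posterior quantities of Equations (20) and (21), the re-estimation in Equations (22) and (23) reduces to a ratio of sums. The sketch below assumes the hypothetical layouts `gamma[t][i]` and `eta[t][i][j]` with states indexed from 0; the function name `update_transitions` is illustrative only.

```python
def update_transitions(gamma, eta):
    """Re-estimate the initial and transition probabilities.

    gamma : gamma[t][i] = P(q_t = i | Y, lambda), t = 0 .. T-1
    eta   : eta[t][i][j] = P(q_t = i, q_{t+1} = j | Y, lambda), t = 0 .. T-2
    """
    T, N = len(gamma), len(gamma[0])
    pi_new = list(gamma[0])                 # Equation (23)
    P_new = []
    for i in range(N):
        # Expected number of visits to state i over t = 0 .. T-2.
        denom = sum(gamma[t][i] for t in range(T - 1))
        # Expected number of i -> j transitions divided by visits to i.
        P_new.append([sum(eta[t][i][j] for t in range(T - 1)) / denom
                      for j in range(N)])
    return pi_new, P_new
```

Because summing `eta[t][i][j]` over `j` gives `gamma[t][i]`, every row of the returned transition matrix sums to one whenever the inputs are consistent.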

The proposed regression models have three unknown parameters

(α_{i}, β_{i}, σ_{i}^{2})

. In this case, the parameter estimation using the EM algorithm will proceed as follows:

\begin{matrix} α_{i}^{(k + 1)} = [\frac{\sum_{t = 1}^{T} P (q_{t} = i | y, λ^{(k)}) (y_{t} - β_{i}^{(k)} X_{t})}{\sum_{t = 1}^{T} P (q_{t} = i | y, λ^{(k)})}] \\ β_{i}^{(k + 1)} = [\frac{\sum_{t = 1}^{T} P (q_{t} = i | y, λ^{(k)}) X_{t} (y_{t} - α_{i}^{(k + 1)})}{\sum_{t = 1}^{T} X_{t}^{2} P (q_{t} = i | y, λ^{(k)})}] \\ {(σ_{i}^{2})}^{(k + 1)} = [\frac{\sum_{t = 1}^{T} P (q_{t} = i | y, λ^{(k)}) {(y_{t} - α_{i}^{(k + 1)} - X_{t} β_{i}^{(k + 1)})}^{2}}{\sum_{t = 1}^{T} P (q_{t} = i | y, λ^{(k)})}] \end{matrix}

(24)

After the parameters are estimated using Equation (24), an optimized AHMM is obtained. Based on this EM-optimized model, the Viterbi algorithm gives the most probable hidden state for each observation.

4.3. Viterbi Algorithm

The Viterbi algorithm is a dynamic programming technique used to determine the most probable sequence of hidden states for a given sequence of observations. It is widely used in signal processing and machine learning, especially in the context of Hidden Markov Models. Rather than enumerating every possible state sequence, it recursively keeps, for each state and time step, the probability of the best path ending there, and thereby finds the most likely explanation for a given observation sequence. The algorithm is based on the variable

δ_{t} (i) = \max_{q_{1}, \dots, q_{t - 1}} P (q_{1}, \dots, q_{t - 1}, q_{t} = i, y_{1}, y_{2}, \dots, y_{t})

(25)

The variable

δ_{t} (i)

in Equation (25) is the highest probability of any single state path that ends in state i at time t and accounts for the first t observations. It can be computed using the following recursive relations:

\{\begin{array}{l} δ_{1} (i) = P (q_{1} = i, y_{1}) = π_{i} b_{i} (y_{1}) \\ ψ_{1} (i) = 0 \end{array} \quad i = 1, \dots, N
\{\begin{array}{l} δ_{t} (j) = \max_{i} [δ_{t - 1} (i) P_{i j} b_{j} (y_{t})] \\ ψ_{t} (j) = \arg \max_{i} [δ_{t - 1} (i) P_{i j}] \end{array} \quad t = 2, \dots, T, \; j = 1, \dots, N

(26)

\{\begin{array}{l} p_{T}^{*} = \underset{i = 1, \dots, N}{\max} [δ_{T} (i)] \\ q_{T}^{*} = \underset{i = 1, \dots, N}{\arg \max} [δ_{T} (i)] \end{array}

(27)

q_{t}^{*} = ψ_{t + 1} (q_{t + 1}^{*}), \quad t = T - 1, \dots, 1

(28)

Using the Viterbi algorithm, the hidden state of each observation can be determined by Equation (28).
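The recursion, termination, and backtracking steps can be sketched as follows. This is a minimal illustration under stated assumptions: it works with logarithms to avoid underflow (a common variant, not necessarily what the authors used), states are indexed from 0, and the names `viterbi`, `safe_log`, and the input layout `b[t][i]` for b_i(y_t) are hypothetical.

```python
import math


def safe_log(v):
    """Logarithm that maps zero probability to minus infinity."""
    return math.log(v) if v > 0 else float("-inf")


def viterbi(pi, P, b):
    """Return the most probable hidden state path as a list of 0-based indices.

    pi : initial probabilities, P : transition matrix P[i][j],
    b  : emission densities, b[t][i] = b_i(y_t)
    """
    T, N = len(b), len(pi)
    # Initialization: log delta_1(i) = log pi_i + log b_i(y_1).
    delta = [[safe_log(pi[i]) + safe_log(b[0][i]) for i in range(N)]]
    psi = [[0] * N]
    # Recursion: keep the best predecessor of every state at every step.
    for t in range(1, T):
        row, back = [], []
        for j in range(N):
            best_i = max(range(N),
                         key=lambda i: delta[t - 1][i] + safe_log(P[i][j]))
            back.append(best_i)
            row.append(delta[t - 1][best_i] + safe_log(P[best_i][j])
                       + safe_log(b[t][j]))
        delta.append(row)
        psi.append(back)
    # Termination and backtracking through the stored predecessors.
    path = [max(range(N), key=lambda i: delta[T - 1][i])]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path
```

In the two-state leak setting, the first index at which the returned path changes state marks the observation location where the model switches from the leak-free line to the leak line.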

By adapting to changes in hidden states, this algorithm aids in precisely locating pipeline leaks. The Viterbi algorithm works by maximizing the joint probability of the entire hidden state sequence and the observations. Graphically, the candidate sequences are paths within a fragment of an AHMM lattice. The primary objective of the Viterbi algorithm is to identify the most probable path, which can subsequently be used to accurately determine the leak’s location, as shown in Figure 5.

5. Numerical Analysis

In this section, the AHMM is simulated and the estimates of the model parameters are compared with the true values. The model is represented by Equation (8). To simulate the model, the parameters are first fixed and the resulting estimates are then evaluated. The settings include a sample size of T = 50, N = 2 hidden states, and the design matrix

X_{t} = [12 t] = {[12, \dots, 600]}^{T}

. We set α₁ = 2000 and α₂ such that the two lines intersect at point

X_{13} = 156

. In addition, the slopes are β₁ = −2 and β₂ ∈ {0, −1, −2, −3, −4} (one value per simulation setting), and the error term

e_{t} \sim N (0, σ_{i}^{2})

is assumed with variances

σ_{1}^{2} = 1

and

σ_{2}^{2} = 2

.
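The simulation design can be sketched as below. This is an illustration under stated assumptions, not the authors' simulation code: it assumes a single switch from state 1 to state 2 at the intersection point X₁₃ = 156 (the text does not say how the hidden path was generated), and the names `simulate_ahmm` and `x_cross` are hypothetical. The intercept α₂ follows from requiring both lines to meet at the intersection point.

```python
import random


def simulate_ahmm(beta2, T=50, alpha1=2000.0, beta1=-2.0, x_cross=156.0,
                  var1=1.0, var2=2.0, seed=0):
    """Simulate one two-state data set; returns (x, y, states, alpha2)."""
    rng = random.Random(seed)
    # Both lines pass through the same point at x_cross:
    # alpha1 + beta1 * x_cross = alpha2 + beta2 * x_cross
    alpha2 = alpha1 + (beta1 - beta2) * x_cross
    x = [12.0 * t for t in range(1, T + 1)]        # 12, 24, ..., 600
    states, y = [], []
    for xt in x:
        # Assumption: state 1 up to the intersection, state 2 afterwards.
        s = 1 if xt <= x_cross else 2
        if s == 1:
            mean, var = alpha1 + beta1 * xt, var1
        else:
            mean, var = alpha2 + beta2 * xt, var2
        states.append(s)
        y.append(mean + rng.gauss(0.0, var ** 0.5))  # gauss takes a std. dev.
    return x, y, states, alpha2
```

For β₂ = 0 the implied second intercept is 2000 − 2 · 156 = 1688, so the second regime is a flat line at the level the first line reaches at X₁₃.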

The initial values of the EM algorithm are π₁ = 0.5 and P₁₁ = P₂₂ = 0.5, the maximum iteration number is

max_{iter}

= 200, and the convergence criterion ε is set to

1 \times 10^{- 6}

. The parameters of the AHMM are re-estimated by the EM algorithm. The results are summarized in Table 1. The estimated parameters are nearly identical to the true values, with very small standard deviations, which supports the method. For instance, in setting number 1, the true values of β₁ and β₂ are −2 and 0, respectively, and after 200 iterations the final estimates are −1.9999 and 0.0009. The standard deviations of these estimates around their true values are 0.0293 and 0.0314, showing that the slope coefficients of the lines have been found with high accuracy.

Setting number 2 in Table 1 is taken as an arbitrary example, and its simulation results are shown in Figure 6 and Figure 7. To show the convergence trend, the EM algorithm is run for 200 iterations. The true values are denoted by the dotted red line and the estimated parameter values by blue points. Generally, after the first 20–30 iterations, the estimated values closely approach the true values, demonstrating that the EM algorithm can effectively approximate the parameters of the AHMM.

In Figure 6, the horizontal axis shows the values of the design matrix, which range from 12 to 600. The vertical axis shows the values simulated by the regression model. The green lines are the regression models whose parameters are estimated by the AHMM in each iteration, and the black lines are the final estimated models. As can be seen, the estimated values and hidden states correspond to the actual values and completely identify the real states. In Figure 7, the convergence trend of the parameters

β_{i}

,

α_{i}

,

σ_{i}^{2}

, and

P_{i i}

, for

i = 1, 2

, is shown over the iterations of the EM algorithm. These parameters are the slope, the intercept, and the variance of the observations, and the diagonal entries of the hidden state transition matrix between the two regression models. As depicted, the initial values differ from the true values. With the iterative process of the EM algorithm, however, the estimated values closely match the true values after 30 iterations, and the estimates remain almost constant after the 30th iteration.

5.1. Practical Results of the Model

Pipeline leak detection systems are usually evaluated in terms of speed of detection, ease of implementation, affordability, ability to detect faults, and determination of deterioration conditions. In this paper, efforts have been made to satisfy these conditions with the practical implementation of the system.

The pipeline in this study is part of the Iranian Oil Export Corridor, which connects two major pump stations. The pipeline is 87 km long and has a diameter of 42 inches. The inlet flow to the pipeline is 550 kg/s, and the inlet pressure is 18 bar. The pipeline profile is shown in Figure 8. Next, different leakage scenarios were applied, and the pressure and flow characteristics at different points were extracted from the OLGA software. To validate the simulation results, the simulated flow and pressure were matched against real data at the important checkpoints.

These scenarios include leaks with sizes of (0, 0.1, …, 2) inches, placed at distances of (5, 10, 20, …, 80) km from the origin. With 20 different leak sizes at nine different locations, 180 different scenarios were created. The flow and pressure information for each scenario was observed at N = 78 locations. For each scenario, a training sample of size 0.6 N and a test sample of size 0.4 N were chosen.
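The scenario grid and the sample split can be enumerated with a short sketch. Two points are assumptions rather than statements from the text: the size 0 is treated as the leak-free baseline and is not counted among the 20 leak sizes, and the 60/40 split of the 78 locations is rounded to whole observations. The variable names are illustrative.

```python
from itertools import product

# Leak locations in km from the origin: 5, 10, 20, ..., 80 (nine values).
locations_km = [5] + list(range(10, 81, 10))

# Leak sizes in inches: 0.1, 0.2, ..., 2.0 (20 values; size 0 is assumed
# to be the leak-free baseline and is left out of the count).
leak_sizes_in = [round(0.1 * k, 1) for k in range(1, 21)]

# One scenario per (location, size) pair.
scenarios = list(product(locations_km, leak_sizes_in))

# Split of the N = 78 observation locations (rounding is an assumption).
N = 78
n_train = round(0.6 * N)
n_test = N - n_train

print(len(scenarios), n_train, n_test)
```

Under these assumptions the grid reproduces the 180 scenarios quoted in the text.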

For each scenario, the model is fitted using the training sample, and the fitted model is then evaluated on the test sample.

The training sample is used to estimate the AHMM parameters. A change in the hidden state identifies the leak location, and the change in the slope of the fitted lines indicates the size of the leak.

The details of the training stage are specific to the classifier type. For comparison, results were also obtained with four well-known classifiers: the support vector machine (SVM), k-nearest neighbor (k-NN), Naive Bayes, and linear logistic regression. To measure the accuracy of the model in leakage detection, the F₁ score was used, which is examined in the following subsection. To extract the leak size and distance information for the test sample, the solutions of the training problem and the optimal state sequence problem were used. The Root Mean Squared Error (RMSE) was used to measure the accuracy of leakage estimation, which is also examined below.

5.2. Measures of Fault Detection and Performance

The main goal of leak detection is to trigger an alarm when anomalies are detected in the data. Leak detection can be framed as a binary classification problem, where the system distinguishes between normal and abnormal conditions. Alarm triggering is typically associated with the categorization of events into ‘negative’ and ‘positive’ classes, which include true negative (TN), false negative (FN), true positive (TP), and false positive (FP) outcomes.

P_{r}

and

R

, as the precision and recall rates, respectively, are described as follows:

P_{r} = \frac{\text{number of TP}}{\text{number of TP} + \text{number of FP}}

(29)

R = \frac{\text{number of TP}}{\text{number of TP} + \text{number of FN}}

(30)

Additionally, the F₁ score is another useful measure defined as

F_{1} = \frac{2 P_{r} R}{P_{r} + R}

(31)
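The three measures can be computed directly from predicted and true labels. The sketch below uses the standard definition of the F₁ score as the harmonic mean of precision and recall; the function name `precision_recall_f1` and the convention that a truthy label means "leak" are assumptions made for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (truthy = leak)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    # Guard against empty denominators (no predicted or no actual leaks).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Returning 0.0 when a denominator is empty is one common convention; other choices (such as raising an error) are equally defensible.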

These detection algorithms were applied to the data, and the findings were compared using the measures above [25].

For gauging the reconstruction error at each size, the RMSE is utilized, given by

\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{i = 1}^{T} {(y_{i} - {\hat{y}}_{i})}^{2}}

(32)

where AHMM components are computed for estimating the reconstructed leakage size in test samples.
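Equation (32) translates into a one-line computation. In this sketch, `y_true` holds the actual values at the test points and `y_pred` the values reconstructed by the fitted model; both names and the function name `rmse` are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))
```

The result is in the same units as the data, so an RMSE for pressure is in bar and an RMSE for flow is in kg/s when those are the units of the inputs.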

5.3. Numeric Results

As shown in Figure 9, the pressure/flow transmitters are installed along the pipeline at specified distances from the source. At each point, a Remote Transfer Unit (RTU) receives information from the local equipment and transmits it to the Master Terminal Unit (MTU).

The pressure and flow information can be used separately to identify the leak size and location. The leak detection process based on the data is as follows:

(i): Simulated leak pressure data with (0, 0.1, …, 2) inch sizes located at (5, 10, 20, …, 80) km distances from the source were used as inputs to the OLGA software.
(ii): For each scenario, pressure data were observed for N = 78 locations.
(iii): Each set of observed data was randomly split into a test sample and a training sample.
(iv): The AHMM classifier obtained from the training sample by solving the Optimal State Sequence Problem was applied, assigning each test observation to the leak or leak-free state.
(v): The test samples were also classified using the k-NN, Naive Bayes, SVM, and linear logistic regression algorithms.
(vi): After the different scenarios were classified with the presented algorithms, the precision, recall, and F₁ score were calculated and the accuracy of the classifiers was compared.

Moreover, for the accuracy of the model in measuring the amount of leakage, the following procedures are followed; parts (i–iv) are similar to the above process.

(i): For various scenarios of the leak, by EM algorithm, the amount of pressure was estimated by the fitted regression models.
(ii): For each model, the leak size was determined using Optimal State Sequence problem-solving.
(iii): Using the AHMM, sample point pressures were estimated.
(iv): Given the true pressure at the test sample points, the RMSE was used to assess size estimation with the fitted AHMM.
(v): The size of the leak in the selected model was compared with the actual values, and the RMSE of the method was calculated.

The above steps were followed to model the flows. The accuracy of identifying the amount and location of leakage for flow data was obtained using a similar algorithm. The steps of the AHMM-based leakage detection method are depicted in Figure 10.

As depicted in Figure 10, the proposed method in this paper consists of two main sections. The modeling of the pipeline status is performed in the first part. This part consists of a sample section, data transformation, feature extraction, and model construction. The second part introduces a decision method that employs a distance indicator as a classifier switch. This method allows for the selection of an appropriate detection model for any unlabeled samples.

Table 2 and Table 3 exhibit the accuracy of identification of location for different leaks. In the results, when adopting pressure data, the highest F₁ score of AHMM is 0.978.

This model has the best performance amongst the methods used for classification. The K-NN model has an F₁ score of 0.944, the SVM model has a statistic value of 0.938, and the corresponding statistic values are 0.914 and 0.946 for logistic regression and Naive Bayes, respectively. Evidently, based on the collected data, while the best performance is displayed by AHMM followed by Naive Bayes, the worst performance is exhibited by logistic regression.

When using flow data, the highest F₁ score of AHMM is 0.978. While the K-NN model has an F₁ score of 0.988, the SVM model has a statistic value of 0.984; the corresponding statistic values are 0.989 and 0.952 for logistic regression and Naive Bayes, respectively. It can be observed from the examined data that the best performance is for logistic regression and then for K-NN. Conversely, the worst performance is shown by Naive Bayes.

When using flow data, AHMM performance is almost identical compared to pressure data; however, the performance of K-NN, SVM, and logistic regression classification methods is better than that of this model.

The proposed model can successfully estimate the pressure and flow. To measure this capability, the RMSE is used for the pressure and flow estimated by the AHMM. Figure 11 shows the RMSE values for flow and pressure for different leak sizes (0, 0.1, …, 2) in inches. The AHMM is used to estimate the pressure at the test sample points, and this estimate is compared with the actual value for each leak size. A similar process is used to compare the estimated flow with the actual flow and compute the RMSE.

As observed in Figure 11 and also in Figure 1 and Figure 2, the increase in leak size has a significant impact on the pressure and flow profiles in the pipeline. With the increase in leak size, the slope of the regression lines in the pressure profile increases. This change in the flow profile is reflected as an intercept increase after the leak occurrence. The increase in the slope of the regression lines leads to significant changes in the AHMM parameters, which in turn improves the model’s performance in detecting leaks. Specifically, for leaks larger than 0.3 inches, the AHMM shows much better performance in pressure data. This improvement is less apparent in flow data due to structural differences and smaller changes in the intercept in the model parameters. However, the results from flow data are still satisfactory and can be used for accurately estimating the size of the leaks. Overall, the analysis of pressure and flow data shows that the AHMM is capable of effectively detecting both small and large leaks.

Considering the authors’ approach in practically implementing the results and taking into account the costs of installation, operation, maintenance, and calibration of pressure measurement sensors compared to flow measurement sensors, along with the high accuracy of pressure measurement equipment used in the oil and gas industry, the results are operationally valuable and useful.

6. Conclusions

This paper presented a novel approach, the AHMM (Adaptive Hidden Markov Model), which combined a regression model with a Hidden Markov Model to identify the size and location of leakages in oil pipelines using pressure or flow data. The EM algorithm was employed for parameter estimation with maximum likelihood, and numerical analysis demonstrated that the calculated parameter values closely matched the actual preset values. The results indicated that the proposed AHMM-based method achieved significant improvements in leak detection compared to conventional methods, with improvements ranging from at least 4% to a maximum of 7% in all indicators based on pressure classification.

Moreover, the AHMM-based method demonstrated several advantages, including higher performance in leak detection, the ability to detect small leaks, and accurately locating and determining the size of leaks. Additionally, this method could be easily applied, with minor adjustments, to other pipelines, such as those for water and gas, making it suitable for a broad range of applications. The proposed method also enabled the data-based monitoring of the pipeline status and benefited from boosting and updating the database of training samples to enhance the accuracy of diagnosis and other functions.

Given the complexities in correlating pressure and flow data due to non-linear factors, future research could explore a two-stage approach. In this method, the pressure and flow data would be processed separately to detect leaks, and then the results could be combined into a final model that leverages both data sets. This approach could further improve the accuracy and robustness of leak detection in pipelines.

Overall, the AHMM-based method presented in this paper offered a promising approach for effective leak detection in oil pipelines and held potential for further advancements in the field of pipeline monitoring and maintenance. The model was highly flexible and capable of expansion to multiple linear regression or non-linear models by altering the likelihood function. Additionally, incorporating the Multiple Classifier Fusion Algorithm (MCFA) could enhance the accuracy and robustness of leak detection. These topics suggest potential areas for further research.

Author Contributions

Conceptualization, O.Z. and M.R.S.; methodology, S.V.N.; software, O.Z.; validation, M.M., M.R.S. and B.M.; formal analysis, M.R.S.; investigation, S.V.N.; resources, M.R.S.; data curation, O.Z.; writing—original draft preparation, O.Z.; writing—review and editing, S.V.N.; visualization, M.R.S.; supervision, S.V.N.; project administration, M.R.S.; funding acquisition, M.R.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Nomenclature

Symbol	Description
Physical Quantities
$m$	Mass
$U$	Velocity
$G$	Mass Source/Sink
$g$	Acceleration Of Gravity
$h$	Height
$P$	Pressure Force
φ	Fluid Angle to Gravity
$ψ$	Rate of Mass Transfer
L	Existing Phases
E	Field Energy
H	Enthalpy of the Field
S	Source/Sink of Enthalpy
Q	Heat Flow in Pipe Wall
T	Energy Transfer Between Different Fields
$ρ$	Density
$F^{'}$	Friction Forces
$F^{w}$	Wall Friction
$Ψ$	Momentum Contributions Related to Mass Transfer
Mathematical Operations
$δ_{t}$	Differentiation In Time
$δ_{z}$	Spatial Differentiation
∑	Summation
cos	Cosine
log	Logarithm
max	Maximum
argmax	Argument of the Maximum
$δ_{t} m_{i}$	Time Differentiation of Mass
$δ_{z} m_{i}$	Spatial Differentiation of Mass
$δ_{t} m_{i} U_{i}$	Time Differentiation of Momentum
$δ_{z} m_{i} U_{i}$	Spatial Differentiation of Momentum
$d m / d t$	Mass Change Over Time
Statistical Measures
$β$	Unknown Parameter of the Regression Line Slope
$α$	Unknown Parameter Indicating the Intercept
$γ$	Lagrange Multiplier
$σ^{2}$	Variance
L(∙)	Likelihood Function
Performance Metrics
F₁	F₁ Score
DR	Detection Rate
FAR	False Alarm Rate
MAR	Missed Alarm Rate
RMSE	Root Mean Square Error
P_r	Precision
R	Recall Rates
FP	False Positive
TN	True Negative
FN	False Negative
HMM Parameters
$α_{i}$	Intercept Parameter in AHMM
$β_{i}$	Regression Line Slope Parameter in AHMM
$σ_{i}^{2}$	Variance of the Error Term for the i-th Hidden State
${\hat{α}}_{i}$	Estimator of the Parameter $α_{i}$ in AHMM
${\hat{β}}_{i}$	Estimator of the Parameter $β_{i}$ in AHMM
${\hat{σ}}_{i}^{2}$	Estimator of the Variance of the Error Term for the i-th Hidden State
p_ij	Transition Probability from State i to j
π_i	Initial Probability of State i
$λ$	Parameter Vector in the AHMM
N	Number of Hidden States Within the Markov Chain
A	Matrix of State Transition Probabilities
B	Emission Probability Distribution
$y_{t}$	Observed Symbol at Time t
$q_{t}$	Hidden State at Time t
$q_{t + 1}$	Hidden State at Time t + 1
$S$	Set of Possible States of AHMM
$s_{i}$	State of the System at the i-th Time Step in the Hidden Markov Model
$P_{i j}$	Probability of Transition from State $s_{i}$ to State $s_{j}$
$θ_{t} (i)$	Forward Variable
$κ_{t} (i)$	Backward Variable
Q	Set of Hidden States
X	Design Matrix
$m a x_{i t e r}$	Maximum Number of Iterations
Abbreviations
RTU	Remote Transfer Unit
MTU	Master Terminal Unit
SCADA	Supervisory Control and Data Acquisition
EM	Expectation–Maximization algorithm

Appendix A. Details of the Derivation Process

To apply the EM algorithm by assuming that

λ^{(k)}

is known, the logarithm of the likelihood function for the complete data in Equation (9) can be expressed as follows:

Q (λ, λ^{(k)}) = E (\log P (Y, q | λ) | Y, λ^{(k)}) = \sum_{q \in Q} \log π_{q_{1}} P (q | Y, λ^{(k)}) + \sum_{q \in Q} (\sum_{t = 1}^{T} \log P_{q_{t - 1} q_{t}}) P (q | Y, λ^{(k)}) + \sum_{i = 1}^{N} (\sum_{t = 1}^{T} \log (b_{i} (y_{t}))) P (q_{t} = i | Y, λ^{(k)})

(A1)

In step M, the maximization of Equation (A1) by using the Lagrange multiplier

γ

and assuming

\sum_{i = 1}^{N} π_{i} = 1

yields

\frac{1}{π_{i}} P (q_{1} = i | Y, λ^{(k)}) + γ = 0

(A2)

Rearranging Equation (A2) gives

π_{i} = \frac{P (q_{1} = i | Y, λ^{(k)})}{- γ}

Because

\sum_{i = 1}^{N} π_{i} = 1

, then

\sum_{i = 1}^{N} π_{i} = \sum_{i = 1}^{N} \frac{P (q_{1} = i | Y, λ^{(k)})}{- γ} = 1

and

- γ = \sum_{i = 1}^{N} P (q_{1} = i | Y, λ^{(k)})

Thus, the estimator of the initial value

π_{i}

can be obtained as

{\hat{π}}_{i} = \frac{P (q_{1} = i | Y, λ^{(k)})}{\sum_{i = 1}^{N} P (q_{1} = i | Y, λ^{(k)})}

(A3)
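As a brief check on Equation (A3), the normalizing denominator can be evaluated explicitly. Because the posterior probabilities of the initial state sum to one over i, the estimator reduces to the smoothed probability of state i at t = 1, which is the form given in Equation (23). Note that γ without arguments denotes the Lagrange multiplier here, while γ₁(i) is the posterior of Equation (20); the two share a letter but are different quantities.

```latex
\begin{aligned}
-\gamma &= \sum_{i=1}^{N} P\left(q_{1} = i \mid Y, \lambda^{(k)}\right) = 1, \\
\hat{\pi}_{i} &= \frac{P\left(q_{1} = i \mid Y, \lambda^{(k)}\right)}{1}
  = P\left(q_{1} = i \mid Y, \lambda^{(k)}\right) = \gamma_{1}(i).
\end{aligned}
```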

For the

\frac{\partial}{\partial P_{i j}} [\sum_{j = 1}^{N} \sum_{i = 1}^{N} (\sum_{t = 1}^{T} \log P_{i j}) P (q_{t} = i, q_{t + 1} = j | Y, λ^{(k)})]

in Equation (A1), we can use the Lagrange multiplier γ and let

\sum_{j = 1}^{N} P_{i j} = 1

to derive the transition probability

P_{i j}

,

\frac{\partial}{\partial P_{i j}} [\sum_{j = 1}^{N} \sum_{i = 1}^{N} (\sum_{t = 1}^{T} \log P_{i j}) P (q_{t} = i, q_{t + 1} = j | Y, λ^{(k)}) + γ (\sum_{j = 1}^{N} P_{i j} - 1)] = 0

(A4)

Taking the partial derivative in Equation (A4) with respect to the transition probability

P_{i j}

and setting it to 0, we obtain

\frac{1}{P_{i j}} \sum_{t = 1}^{T} P (q_{t} = i, q_{t + 1} = j | λ^{(k)}, Y) + γ = 0

Rearranging gives

P_{i j} = \frac{\sum_{t = 1}^{T} P (q_{t} = i, q_{t + 1} = j | λ^{(k)}, Y)}{- γ}

Because

\sum_{j = 1}^{N} P_{i j} = 1

, so

\sum_{j = 1}^{N} P_{i j} = \sum_{j = 1}^{N} \frac{\sum_{t = 1}^{T} P (q_{t} = i, q_{t + 1} = j | λ^{(k)}, Y)}{- γ} = 1

Thus, the estimator of the transition probability

P_{i j}

can be obtained as

{\hat{P}}_{i j} = \frac{\sum_{t = 1}^{T} P (q_{t} = i, q_{t + 1} = j | λ^{(k)}, Y)}{\sum_{t = 1}^{T} P (q_{t} = i | λ^{(k)}, Y)}

(A5)

References

Duru, C.; Ani, C. A statistical analysis on the leak detection performance of underground and overground pipelines with wireless sensor networks through the maximum likelihood ratio test. Sadhana 2017, 42, 1889–1899.
PHMSA. Available online: https://www.phmsa.dot.gov/data-and-statistics/pipeline/pipeline-incident-20-year-trends (accessed on 18 April 2024).
Adegboye, M.A.; Fung, W.-K.; Karnik, A. Recent Advances in Pipeline Monitoring and Oil Leakage Detection Technologies: Principles and Approaches. Sensors 2019, 19, 2548.
Henrie, M.; Carpenter, P.; Nicholas, R.E. Pipeline Leak Detection Handbook; Elsevier: Amsterdam, The Netherlands, 2016.
Wang, C.; Zhang, Y.; Song, J.; Liu, Q.; Dong, H. A novel optimized SVM algorithm based on PSO with saturation and mixed time-delays for classification of oil pipeline leak detection. Syst. Sci. Control Eng. 2019, 7, 75–88.
Rajasekaran, U.; Kothandaraman, M. A Survey and Study of Signal and Data-Driven Approaches for Pipeline Leak Detection and Localization. J. Pipeline Syst. Eng. Pract. 2024, 15, 03124001.
Zadehbagheri, O.; Salehizadeh, M.R.; Naghavi, V.; Moatari, M. Design of Pipeline leak Detection System using Neural Network on Scada Platform of National Iranian Oil Company. Pet. Res. 2020, 31, 39–50.
Valizadeh, S.; Moshiri, B.; Salahshoor, K.; Hakim, A.H.; Vasant, P.; Barsoum, N. Multiphase Pipeline Leak Detection Based on Fuzzy Classification. AIP Conf. Proc. 2009, 1159, 72–80.
Navarro, A.; Begovich, O.; Sánchez, J.; Besancon, G. Real-Time Leak Isolation Based on State Estimation with Fitting Loss Coefficient Calibration in a Plastic Pipeline. Asian J. Control 2017, 19, 255–265.
Song, J.; He, X. Model-based fault diagnosis of networked systems: A survey. Asian J. Control 2022, 24, 526–536.
Korlapati, N.V.S.; Khan, F.; Noor, Q.; Mirza, S.; Vaddiraju, S. Review and analysis of pipeline leak detection methods. J. Pipeline Sci. Eng. 2022, 2, 100074.
Qu, Z.; Feng, H.; Zeng, Z.; Zhuge, J.; Jin, S. A SVM-based pipeline leakage detection and pre-warning system. Measurement 2010, 43, 513–519.
Valizadeh, S.; Moshiri, B.; Salahshoor, K. Leak Detection in Transportation Pipelines Using Feature Extraction and KNN Classification. In Pipelines 2009; American Society of Civil Engineers: Reston, VA, USA, 2009; pp. 580–589.
Cascianelli, S.; Costante, G.; Crocetti, F.; Ricci, E.; Valigi, P.; Fravolini, M.L. Data-based design of robust fault detection and isolation residuals via LASSO optimization and Bayesian filtering. Asian J. Control 2021, 23, 57–71.
Habibi, H.; Howard, I.; Habibi, R. Bayesian Fault Probability Estimation: Application in Wind Turbine Drivetrain Sensor Fault Detection. Asian J. Control 2020, 22, 624–647.
Habibi, H.; Howard, I.; Habibi, R. Bayesian Sensor Fault Detection in a Markov Jump System. Asian J. Control 2017, 19, 1465–1481.
Siddique, M.F.; Ahmad, Z.; Ullah, N.; Ullah, S.; Kim, J.-M. Pipeline Leak Detection: A Comprehensive Deep Learning Model Using CWT Image Analysis and an Optimized DBN-GA-LSSVM Framework. Sensors 2024, 24, 4009.
Yang, D.; Hai, X.; Ren, Y.; Cui, J.; Li, K.; Zeng, S. A hybrid fault prediction method for control systems based on extended state observer and hidden Markov model. Asian J. Control 2023, 25, 418–432.
Jia, T.; Song, J.; Niu, Y.; Chen, B.; Cao, Z. Optimized hybrid design with stabilizing transition probability for stochastic Markovian jump systems under hidden Markov mode detector. Asian J. Control 2022, 24, 2787–2795.
Wang, A.; Fei, M.; Song, Y. Optimal fault-tolerant control for Markov jump power systems with asynchronous actuator faults. Asian J. Control 2023, 25, 4466–4480.
Lu, D.; Zeng, G.; Liu, J. Non-Fragile Simultaneous Actuator and Sensor Fault-Tolerant Control Design for Markovian Jump Systems Based on Adaptive Observer. Asian J. Control 2018, 20, 125–134.
Saize, S.; Yang, X. On the definitions of hidden Markov models. Appl. Math. Model. 2024, 125, 617–629.
Westhead, D.R.; Vijayabaskar, M.S. (Eds.) Hidden Markov Models; Springer: New York, NY, USA, 2017; Volume 1552.
Che, J.; Zhu, Y.; Zhou, D. Hidden Markov model-based robust H_∞ fault estimation for Markov switching systems with application to a single-link robot arm. Asian J. Control 2021, 23, 2227–2238.
Ozkan, H.; Ozkan, F.; Kozat, S.S. Online Anomaly Detection under Markov Statistics with Controllable Type-I Error. IEEE Trans. Signal Process. 2016, 64, 1435–1445.
Ai, C.; Sun, X.; Zhao, H.; Ma, R.; Dong, X. Pipeline damage and leak sound recognition based on HMM. In Proceedings of the 2008 7th World Congress on Intelligent Control and Automation, Chongqing, China, 25–27 June 2008; pp. 1940–1944.
Qiu, J.; Liang, W.; Zhang, L.; Yu, X.; Zhang, M. The early-warning model of equipment chain in gas pipeline based on DNN-HMM. J. Nat. Gas Sci. Eng. 2015, 27, 1710–1722.
Ai, C.; Zhao, H.; Ma, R.; Dong, X. Pipeline Damage and Leak Detection Based on Sound Spectrum LPCC and HMM. In Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications, Jinan, China, 16–18 October 2006; pp. 829–833. [Google Scholar] [CrossRef]
Fagiani, M.; Squartini, S.; Gabrielli, L.; Severini, M.; Piazza, F. A Statistical Framework for Automatic Leakage Detection in Smart Water and Gas Grids. Energies 2016, 9, 665. [Google Scholar] [CrossRef]
Liu, J.; Zang, D.; Liu, C.; Ma, Y.; Fu, M. A leak detection method for oil pipeline based on markov feature and two-stage decision scheme. Measurement 2019, 138, 433–445. [Google Scholar] [CrossRef]
Zhang, M.; Chen, X.; Li, W. Hidden Markov models for pipeline damage detection using piezoelectric transducers. J. Civ. Struct. Health Monit. 2021, 11, 745–755. [Google Scholar] [CrossRef]
SLB. OLGA Dynamic Multiphase Flow Simulator. Available online: https://www.slb.com/products-and-services/delivering-digital-at-scale/software/olga/olga-dynamic-multiphase-flow-simulator (accessed on 14 September 2024).
Ramezani, A.; Moshiri, B.; Abdulhai, B.; Kian, A. Estimation of free flow speed and critical density in a segmented freeway using missing data and Monte Carlo-based expectation maximisation algorithm. IET Control Theory Appl. 2011, 5, 123–130.

Figure 1. Flow changes after leak occurrence.

Figure 2. Pressure changes after leak occurrence.

Figure 3. Graphical representation of the dependence structure of a Hidden Markov Model, where y_t is the observable process and q_t is the hidden chain.

Figure 4. EM algorithm for AHMM.

Figure 5. Graphical representation of all possible paths through the AHMM states S₁ and S₂. The Viterbi algorithm identifies the most probable state sequence, shown by the solid line, and thereby solves the optimal state sequence problem.

Figure 6. Fitting process of linear models using the AHMM algorithm.

Figure 7. Convergence process of parameter estimation: (a) α₁, (b) α₂, (c) β₁, (d) β₂, (e) σ₁, (f) σ₂, (g) P₁₁, (h) P₂₂.

Figure 8. Diagram of the simulated pipeline.

Figure 9. A schematic of the designed leak detection system with k RTUs and diagram of the simulated pipeline profile.

Figure 10. Steps of the AHMM-based leakage detection method.

Figure 11. (a) RMSE for different leakage sizes using the AHMM pressure data; (b) RMSE for different leakage sizes using the AHMM flow data.

Table 1. Estimation of parameters of the AHMM for the simulated data.

Setting Number	Quantity	π₁	p₁₁	p₂₂	α₁	α₂	β₁	β₂	σ₁²	σ₂²
1	True value	1	0.9	1	2000	1688	−2	0	1	2
1	Estimation	1	0.92	1	1999.922	1687.721	−1.9999	0.0009	0.622	2.13
1	Standard deviation	0.0001	0.0011	0	1.8559	0.9604	0.0293	0.0314	0.938	0.1122
2	True value	1	0.9	1	2000	1844	−2	−1	1	2
2	Estimation	1	0.9185	1	1999.922	1843.721	−1.9999	−0.999	0.6225	2.1322
2	Standard deviation	0.0001	0.00022	0.0018	0.1806	0.0826	0.00183	0.00019	0.0294	0.0005
3	True value	1	0.9	1	2000	2000	−2	−2	1	2
3	Estimation	1	0.8719	0.993	2000	1999.76	−2.002	−1.999	0.4349	2.0732
3	Standard deviation	0.0002	0.168	0.3463	0.0445	0.2081	0.0009	0.0009	0.3365	0.2789
4	True value	1	0.9	1	2000	2156	−2	−3	1	2
4	Estimation	1	0.9185	1	1999.922	2155.72	−1.9999	−2.999	0.6225	2.132
4	Standard deviation	0.0001	0	0.0006	0.417	1.3443	0.0041	0.0032	0.1169	0.1715
5	True value	1	0.9	1	2000	2312	−1	−4	1	2
5	Estimation	1	0.91859	1	1999.922	2311.72	−1.999	−3.999	0.6225	2.132
5	Standard deviation	0	0.0007	0	0.5383	1.903	0.0052	0.0045	0.1647	0.2981
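
The parameters in Table 1 can be read as defining a two-state switching linear model. The following Python sketch shows how synthetic data could be drawn from the "True value" row of Setting 1. It assumes, and this form is not restated in the table itself, that the observation in state S_i is y_t = α_i + β_i·t + ε_t with ε_t ~ N(0, σ_i²), and that the time index starts at t = 0. Under this reading the two lines of Setting 1 intersect at t = 156 (2000 − 2t = 1688), which suggests but does not prove that the assumption matches the simulated leak onset. The function name `simulate` and the constant names are illustrative only.

```python
import random

# "True value" row of Setting 1 in Table 1; index 0 is state S1, index 1 is state S2.
PI_1 = 1.0                 # probability of starting in S1
P_STAY = (0.9, 1.0)        # p11, p22 (p22 = 1 makes S2 absorbing)
ALPHA = (2000.0, 1688.0)   # intercepts alpha_1, alpha_2
BETA = (-2.0, 0.0)         # slopes beta_1, beta_2
SIGMA2 = (1.0, 2.0)        # noise variances sigma_1^2, sigma_2^2


def simulate(n_steps, seed=0):
    """Draw hidden states and observations from the assumed switching linear model."""
    rng = random.Random(seed)
    state = 0 if rng.random() < PI_1 else 1
    states, observations = [], []
    for t in range(n_steps):
        # From the second step on, leave the current state with probability 1 - p_ii.
        if t > 0 and rng.random() >= P_STAY[state]:
            state = 1 - state  # only two states, so leaving means switching
        mean = ALPHA[state] + BETA[state] * t  # assumed linear-in-time emission mean
        observations.append(rng.gauss(mean, SIGMA2[state] ** 0.5))
        states.append(state)
    return states, observations


if __name__ == "__main__":
    states, observations = simulate(200, seed=1)
    switch_time = states.index(1) if 1 in states else None
    print("first time step in S2:", switch_time)
```

Because p₂₂ = 1, a simulated chain never returns to S₁ once it has switched, so the hidden state sequence is a single change point. An estimator such as the EM procedure summarized in Table 1 would then be judged by how closely it recovers these generating values.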

Table 2. Classification based on pressure.

Model	Precision	Recall	F₁ Score
AHMM	0.974783	0.980324	0.977545
K-NN	0.943281	0.945343	0.944311
SVM	0.934868	0.941408	0.938126
Logistic Regression	0.899915	0.927853	0.913671
Naive Bayes	0.945415	0.946655	0.946035

Table 3. Classification based on flow.

Model	Precision	Recall	F₁ Score
AHMM	0.982782	0.973328	0.978032
K-NN	0.981434	0.993878	0.987617
SVM	0.974711	0.994316	0.984416
Logistic Regression	0.982736	0.995628	0.98914
Naive Bayes	0.954286	0.488414	0.951776
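
The F₁ column in Tables 2 and 3 can be cross-checked against the precision and recall columns. The sketch below assumes the standard definition F₁ = 2·P·R / (P + R), the harmonic mean of precision and recall, and applies it to the two AHMM rows. The helper name `f1_score` is illustrative and not taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # convention when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# AHMM rows from Tables 2 and 3: (precision, recall, reported F1)
ahmm_rows = {
    "pressure": (0.974783, 0.980324, 0.977545),
    "flow": (0.982782, 0.973328, 0.978032),
}

if __name__ == "__main__":
    for signal, (p, r, reported) in ahmm_rows.items():
        computed = f1_score(p, r)
        print(f"{signal}: computed {computed:.6f}, reported {reported:.6f}")
```

Both AHMM rows agree with the reported F₁ values to within rounding. The same check can be run on any other row of the two tables.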

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zadehbagheri, O.; Salehizadeh, M.R.; Naghavi, S.V.; Moattari, M.; Moshiri, B. Novel Adaptive Hidden Markov Model Utilizing Expectation–Maximization Algorithm for Advanced Pipeline Leak Detection. Modelling 2024, 5, 1339-1364. https://doi.org/10.3390/modelling5040069
