1. Introduction
Turbulent flow, a prominent phenomenon in many natural water systems, is characterized by vigorous mixing of adjacent layers and continuous random fluctuations of the velocity at any given point. The typical metric used to measure the intensity of these fluctuations at a specific point of a flow is the turbulence intensity (TI) factor, which, in the case of 1D flow, is defined as the ratio $\mathrm{TI} = u'/\bar{U}$, where $u' = \sqrt{\overline{(U - \bar{U})^2}}$ is the root mean square of the velocity fluctuation and $\bar{U}$ is the mean flow velocity at the specific point [1,2]. Notably, TI coincides with the coefficient of variation, i.e., the ratio of the standard deviation to the mean of $U$. This connection highlights the inherent link between turbulent flow dynamics and probability theory. Since the velocity $U$ acts as a random variable, it theoretically has no upper or lower limits; instead, one estimates the probability that $U$ exceeds a specific threshold within a given period.
The importance of this perspective becomes more apparent when considering the implications of the TI value for the frequency of extreme velocity values. While a typical TI value for unobstructed flows is 0.1 [3], TI values up to 0.4 are encountered in hydraulic jumps [4], whereas in the case of axisymmetric wakes behind various bodies, TI ranges from 0.3 to 1.1 [1]. To obtain an idea of how TI influences the frequency of extreme values, suppose an intense turbulent flow where TI equals 0.4 and $U$ follows the normal distribution with mean $\bar{U} = 2$. Then, the standard deviation of $U$ will be $\sigma_U = \mathrm{TI} \cdot \bar{U} = 0.8$, and it can be easily verified (e.g., in the MATLAB 2021a numerical environment with norminv(0.999)*0.8 + 2, which gives 4.47) that the probability of $U$ exceeding 4.47 is 0.1%. This means that, with a probability of 50% (from the binomial distribution), the flow velocity will exceed 4.47, i.e., more than twice the average value, for at least 0.1% of the total duration of the turbulent flow. This example emphasizes the importance of understanding the statistical properties of the velocity in turbulent flows for thoroughly assessing their short- and long-term effects on nearby structures.
For this reason, various researchers have introduced statistical and stochastic models in the study of turbulent flows. For example, Flores-Vidriales et al. [
5] have employed an advanced statistical analysis of measurements of the discharge of a river with a joint probability distribution function of the annual maximum and the number of events in a one-year block, in order to calculate the scour depth around bridge piers following the approach suggested by FHWA as it is described in the HEC-18 report [
6]. They also examined a stochastic approach, where the scour is estimated by the SRICOS-EFA methodology [
7] that takes as input a synthetic time series of discharge generated with an autoregressive integrated moving average (ARIMA) stochastic model. Brandimarte et al. [8] have also suggested the use of a stochastic model, an autoregressive fractionally integrated moving average (ARFIMA) model, as a probabilistic approach to the analysis of contraction scour. In both studies, the stochastic and statistical models operated at daily time steps.
The previously mentioned studies have highlighted the effectiveness of statistical and stochastic methodologies in analyzing turbulent flows using coarse time steps. Note that, from here on, we will use the term “statistical” to refer to approaches based on fitting theoretical distributions to the data, and “stochastic” for methodologies based on numerical schemes that generate synthetic time series with statistical properties similar to those observed. The application of stochastic models with fine time steps has also been suggested by researchers, though in the sense of stochastic filtering, i.e., using “
partial observations and a stochastic model to make inferences about an evolving system” [
9]. For example, He and Liu employed a stochastic model in a data assimilation process that blends low-sampling-frequency data from image velocimetry with high-sampling-frequency data from an array of microphones [
10]. Lo et al. used alternative stochastic models to evaluate the effect of instantaneous fluctuations on entrained particles in numerical studies of particle dispersion and deposition in turbulent pipe flows [11]. Olson employed a similar approach, in which a stochastic model of a turbulent fluid feeds the convection–dispersion differential equation, to study the effect of fiber length on particle dispersion [
12]. Tissot and Cavalieri proposed a stochastic formulation of the Navier–Stokes equations to study the propagation of coherent structures within a turbulent channel flow [
13]. Vianna and Nichele proposed a stochastic model to represent the annular flow in a tubular reactor followed by a numerical simulation to generate sample paths fitting the residence time distributions [
14]. Luhur et al. suggested a method based on stochastic differential equations to replace traditional look-up table methods used in wind turbine studies by simulating effective response dynamics of lift and drag forces [
15].
Despite the availability of sophisticated numerical models, physical models remain essential for studying complex hydraulic processes such as turbulent flows and their impact on hydraulic structures like piers. Modern measuring equipment capable of high-sampling frequency velocity measurements is crucial for turbulent flow studies, providing detailed insights into turbulent behavior and enabling accurate analysis of turbulence statistics and dynamics. However, measurements at higher sampling frequencies are susceptible to increased noise influence. This influence is defined as the variance of the measurement one would obtain if a constant velocity were measured, which is proportional to the noise level and sampling frequency [
16]. This noise-induced variance adds to the genuine variance of the measured velocity, potentially biasing the statistical properties of the measured time series. Therefore, in this study, we advocate the use of a stochastic model to reduce the adverse influence of this noise, particularly the increased bias at high values, allowing for more reliable answers to critical questions, such as determining the velocity that corresponds to a given probability of exceedance. To achieve this, an innovative stochastic model was employed in ensemble simulations following the k-fold concept. A relatively new statistical analysis tool, the climacogram, was also employed; it is well suited to studying the scale invariance of turbulence processes across temporal scales [17]. The climacogram is the function of variance versus time scale: the time series is repeatedly aggregated to coarser scales, and for each aggregated time series the corresponding variance is calculated [18].
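As a rough illustration of the climacogram computation just described, a minimal sketch (our own, not the authors' implementation) could look as follows: the series is aggregated into non-overlapping blocks at each scale and the variance of the block means is recorded.

import numpy as np

def climacogram(x, scales):
    # Empirical climacogram: variance of the time-averaged series at each
    # aggregation scale k (k expressed in numbers of original time steps).
    x = np.asarray(x, dtype=float)
    gamma = []
    for k in scales:
        m = len(x) // k                       # number of complete blocks at scale k
        blocks = x[:m * k].reshape(m, k)      # drop the incomplete tail, form blocks
        gamma.append(np.var(blocks.mean(axis=1), ddof=1))   # variance of block means
    return np.array(gamma)

# Example: for white noise the climacogram decays as 1/k (slope -1 in log-log axes)
rng = np.random.default_rng(0)
scales = np.array([1, 2, 5, 10, 20, 50, 100])
print(climacogram(rng.normal(size=100_000), scales))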
To validate this approach, we conducted an experiment in the Laboratory of Hydromechanics and Hydraulic Engineering at the University of Siegen, capturing long-duration (36 min) flow measurements. We then employed a novel stochastic model specifically designed to handle data with significant skewness, which is a challenge in the analysis of turbulent flow stresses. The stochastic model is applied in ensemble simulations leveraging k-fold cross-validation, a technique commonly used in statistics and machine learning for evaluating predictive models [19].
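Although the stochastic model of Equation (1) is presented later, one plausible reading of the k-fold ensemble idea can be sketched generically as below; fit_model and simulate are hypothetical placeholders standing in for the actual parameter estimation and synthetic-series generation of the paper, and the toy "model" in the usage example is purely illustrative.

import numpy as np

def kfold_ensemble(series, k, fit_model, simulate, length):
    # Leave one fold out at a time, refit the model on the remaining data, and
    # generate one synthetic member per fold, yielding an ensemble of k series.
    folds = np.array_split(np.asarray(series, dtype=float), k)
    ensemble = []
    for i in range(k):
        training = np.concatenate([f for j, f in enumerate(folds) if j != i])
        params = fit_model(training)               # estimate parameters without fold i
        ensemble.append(simulate(params, length))  # synthetic series for this member
    return ensemble

# Toy usage: a skewed sample and a deliberately simple two-parameter "model"
rng = np.random.default_rng(1)
obs = rng.gamma(shape=2.0, scale=1.0, size=10_000)
fit = lambda x: (x.mean(), x.std(ddof=1))
sim = lambda p, n: rng.normal(p[0], p[1], size=n)
members = kfold_ensemble(obs, k=20, fit_model=fit, simulate=sim, length=10_000)
print(len(members))   # 20 ensemble members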
3. Results
Figure 7 displays the return period plots for
,
, and
. The horizontal red line indicates the average value of the observations. The black line represents the return period plot created using the filtered measurements, while the blue marks represent the return period plots of the 20 synthetic time series generated using the stochastic model outlined in Equation (
1). For both the black line and the blue marks, the (empirical) Hazen plotting position method was employed (see Table 5.4 of [
18]). The blue line represents the median of the 20 return period plots.
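For reference, the Hazen plotting position assigns the i-th largest value an exceedance probability of (i − 0.5)/n, from which an empirical return period follows by dividing the time step by that probability; the sketch below is our own illustration of this construction, not the code used for Figure 7.

import numpy as np

def hazen_return_periods(x, dt):
    # Hazen plotting position: exceedance probability (i - 0.5) / n for rank i,
    # with rank 1 assigned to the largest value; return period = dt / probability.
    x = np.sort(np.asarray(x, dtype=float))[::-1]
    n = len(x)
    p_exceed = (np.arange(1, n + 1) - 0.5) / n
    return x, dt / p_exceed

# Toy usage: 100,000 values sampled at 50 Hz (a 2000 s record)
rng = np.random.default_rng(2)
values, T = hazen_return_periods(rng.normal(size=100_000), dt=1 / 50)
print(values[0], T[0])   # the largest value plots at T = 2 * record length = 4000 s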
The value of the median of the return period plots from the synthetic
values at 2000 s is
. There are two values obtained from the filtered observations that are greater than the maximum value of the vertical axis of
Figure 7a, namely, 4.90, and
. The value of the median of the return period plots from the synthetic
values at 2000 s is
. The value of the median of the return period plots from the synthetic
at 2000 s is
. There are five values obtained from the filtered observations that are greater than the maximum value of the vertical axis of
Figure 7c, namely, 3.73, 8.38, 8.96, 9.11, and
.
Figure 8 displays the return period plots of
,
, and
. The median of the return period plots of synthetic
at 2000 s is
. There is one value obtained from the filtered observations that is greater than the maximum value of the vertical axis of
Figure 8a, which is equal to
. The median of the return period plots of synthetic
at 2000 s is
. There are two values obtained from the filtered observations that are greater than the maximum value of the vertical axis of
Figure 8b, namely, 3.11, and
. The median of the return period plots of synthetic
at 2000 s is
. There is one value obtained from the filtered observations that is greater than the maximum value of the vertical axis of
Figure 8c, which is equal to
.
4. Discussion
Figure 2,
Figure 3,
Figure 4,
Figure 5 and
Figure 6 highlight the significance of statistically analyzing the available data.
Figure 2 initially suggests potential measurement errors in the raw data. Consequently, the raw data underwent filtration following the guidelines outlined in the Vectrino user manual.
Comparing
Figure 3a with
Figure 3b (as well as
Figure 4a with
Figure 4b) makes it evident that this filtration process effectively reduces data noise. Particularly noteworthy is the steeper slope within the inertial range in
Figure 3b, indicating a noise reduction (similar to the findings in the study by Chowdhury et al. [
27]). It is important to note that the filtration process should be executed cautiously, following experimentally derived guidelines (such as those employed in this study, provided by the user guide of the Vectrino velocimeter) or based on evidence-based decisions. For instance, Romagnoli et al. [
28] implemented a more aggressive filtering approach by removing all spikes from the data, which were attributed to the presence of numerous bubbles in the flow at the measurement location. In our case study, we meticulously positioned the velocimeter probe in a location where the presence of bubbles in the flow was minimal.
Regarding the deviation of the slope in
Figure 3b within the inertial range from the theoretical value of $-5/3$, this could be attributed either to the distance from the hydraulic jump (or the source of disturbance), as observed by Mukha et al. [
4] and Buchhave and Velte [
29], or to the rapidly rotating type of flow [
30,
31].
The climacogram shape of the filtered data in
Figure 4 appears more plausible compared to that of the raw data (see Figure 9.5 of [
18]). Specifically, it is expected that the climacogram curve will be almost horizontal on the left side of the plot, indicating minimal variance reduction with aggregation at small time scales. The scale at which the curve starts bending significantly downwards marks the minimum sampling frequency required to capture the stochastic structure of the studied process. In
Figure 4b, this frequency is indicated to be approximately three times lower than the 50 Hz sampling frequency used in this study, i.e., approximately 17 Hz. Subsequently, the slope becomes steeper before transitioning to a segment with a gentler slope. The increased slope in the middle segment of the curve is likely due to periodicity in the data, possibly caused by a standing wave upstream of the sluice gate (compare with the middle segment of Figure 9.5 of [
18], which corresponds to the annual periodicity of river flow). The gentle slope of the final segment indicates significant persistence in the studied process. The Hurst coefficient
H, which is dimensionless and ranges from 0 to 1, is directly related to this slope s via the formula $H = 1 + s/2$, where $s$ is the slope of the climacogram in double logarithmic axes. The Hurst coefficient in
Figure 4 is approximately 0.84, which is within the range of values reported in the literature. For example, for turbulent flow around a jet, Hurst coefficient values between 0.7 and 0.9 have been reported [
32].
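Given the relation $H = 1 + s/2$ noted above, the Hurst coefficient can be estimated from the log-log slope of the climacogram; the snippet below is a self-contained illustration with synthetic uncorrelated data (for which H ≈ 0.5 is expected), not the analysis behind Figure 4.

import numpy as np

def hurst_from_climacogram(scales, gamma):
    # Fit the log-log slope s of the climacogram and convert it via H = 1 + s/2.
    s, _ = np.polyfit(np.log(scales), np.log(gamma), 1)
    return 1.0 + s / 2.0

# Toy check: white noise has a climacogram slope of about -1, hence H about 0.5
rng = np.random.default_rng(3)
x = rng.normal(size=200_000)
scales = np.array([10, 20, 50, 100, 200, 500])
gamma = np.array([np.var(x[:(len(x) // k) * k].reshape(-1, k).mean(axis=1), ddof=1)
                  for k in scales])
print(round(hurst_from_climacogram(scales, gamma), 2))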
Lastly,
Figure 5 and
Figure 6 offer valuable insights into the fundamental characteristics of the stress-related time series. Preserving these characteristics, including skewness, is crucial for the reliable analysis of the turbulent flow impacts.
The drag force and Reynolds stresses play a pivotal role in characterizing the stresses exerted on hydraulic structures. These are determined by the squared velocity
and the product of velocity fluctuation components
. In the design of hydraulic structures, obtaining accurate estimations of these quantities for a given probability of failure is crucial. This involves first computing the intensity and duration of the design discharge based on the probability of failure. Subsequently, the values of
and
—corresponding to the design discharge—need to be determined. Caution must be taken when deriving these values from a physical model to avoid potential inaccuracies resulting from simplistic methods. For instance, in the case of
Figure 7 and
Figure 8, while the average value is significantly lower than the higher values, some extreme values are unrealistic, most likely due to measurement errors. To address these challenges adequately, it is advisable to report, for the given probability of failure, the
or
value based on a return period plot, as demonstrated in
Figure 7 and
Figure 8.
While one might argue against the necessity of a stochastic model, return period plots derived directly from data can be prone to errors, as demonstrated in
Figure 7 and
Figure 8. Measurement errors, particularly in the range of high values, can distort the shapes of these plots, undermining their reliability. Moreover, even with minimal noise in measurements, a statistical or stochastic approach becomes essential when the return period corresponding to the probability of failure exceeds the duration of observations. To illustrate this point, let us consider an example. Suppose the data collected from the physical model corresponds to an extreme event with an expected duration of 2000 s, and we need to find the value of
corresponding to a 4% probability of failure. The return period of observations in
Figure 7a exhibits an unrealistic sharp rise in the high-value range, likely due to errors in the measurement. Therefore, to obtain a more reliable estimate, we should use the median value of the return period plots of the synthetic time series at 2000 s, which is
. To calculate the probability of failure for this value over a duration of 2000 s (note the intentional coincidence, in this example, of the maximum obtainable return period from the available data and the duration of the design event), we can use the following formula:
$P_f = 1 - \left(1 - \frac{1}{T}\right)^{n}$ (3)
where $P_f$ is the probability of failure, $T$ is the return period, and $n$ is the number of time steps (in the same time unit as $T$) corresponding to the duration of the design event.
Using Equation (3) with $T = 2000$ and $n = 2000$ yields $P_f \approx 63\%$, which is much higher than the target of 4%. To determine the value corresponding to $P_f = 4\%$, it is necessary to generate a synthetic time series of sufficiently large length. This length must be adequate for calculating the value corresponding to a return period of 50,000 s (i.e., setting $T = 50{,}000$ and $n = 2000$ in Equation (3) results in $P_f \approx 4\%$). For the specific time step in this study (1/50 s), this suggests that a synthetic time series with a minimum length of 2,500,000 values is required (if ensemble simulations are used, then multiple time series of this length). This implies that even if the return period plots were not distorted in the range of high return periods (see the black lines in
Figure 7a,c), observations would still need to be 25 times longer than what is currently available (i.e., instead of running the experiment for 36 min, to run it for 15 h). Running the stochastic model to generate a time series of sufficient length (more than 4,000,000 values), we estimated the value of
corresponding to a return period of 50,000 s to be 2.1.
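As a numerical check of Equation (3) and of the record-length reasoning above, the short sketch below reproduces the quoted figures; it assumes the form of Equation (3) as reconstructed here and the 1/50 s time step reported in the text.

# Probability of failure over a design event lasting n time steps, Equation (3)
def failure_probability(T, n):
    return 1.0 - (1.0 - 1.0 / T) ** n

print(failure_probability(T=2000, n=2000))     # ~0.632: return period equal to event duration
print(failure_probability(T=50_000, n=2000))   # ~0.039: roughly the 4% target

# Synthetic record length needed to read a 50,000 s return period at a 1/50 s time step
sampling_frequency = 50                        # Hz
print(50_000 * sampling_frequency)             # 2,500,000 values, ~25 times the 2000 s record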
Figure 7a,c illustrates how measurement errors can significantly impact return period plots derived from the observed data, leading to unreliable estimates. To address this issue, ensemble simulations of a stochastic model with the k-fold method offer a solution. By generating multiple synthetic datasets (in this case 20), we can mitigate the impact of outliers and obtain a set of return period plots instead of a single one. This set provides valuable insight into the uncertainty associated with the measurements, and its median values offer robust estimates that are less affected by measurement errors, supporting safer or more economical design decisions.
Regarding the choice of observation and simulation time step (sampling frequency), while larger time steps tend to filter out observational noise, they may also smooth out the flow bursts recorded in the observations. Bursts are a significant characteristic of turbulent flows, particularly at TI values in the medium range [
33,
34]. Various probabilistic approaches, such as Gessler’s study on entrainment, are based on the concept of bursts [
35]. Gessler proposed a probabilistic approach to partially offset the smoothing effect caused by temporal aggregation resulting from the finite frequency of sampling in any physical process. This is achieved by increasing the “confidence” for higher values when the standard deviation of the time series is lower. Similar approaches could be explored to address the smoothing effect on other types of stresses exerted by turbulent flows on hydraulic structures.
As noted by other researchers (e.g., Luhur et al. [
15]), gaining deeper insight into the underlying process requires proper representation of the conditional probability density function (PDF) of a stochastic process
,
, as it contains comprehensive information regarding the process evolution dynamics. However, it has been observed that the stochastic model represented by Equation (
1) may not adequately capture the decay of the autocorrelation function in stochastic processes that exhibit significant persistence (Koutsoyiannis et al. [
18]). In such cases, a generalized moving average scheme may be a more appropriate option for stochastic simulations.
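As a purely conceptual illustration of a moving average generating scheme for persistent processes (not the generalized scheme of Koutsoyiannis et al. [18] itself), the sketch below filters white noise with slowly decaying power-law weights, which asymptotically produces autocorrelations decaying as lag^(2H−2); the weight exponent H − 1.5 and the simple variance normalization are our own choices for illustration.

import numpy as np

def moving_average_persistent(n, H=0.84, q=2000, seed=0):
    # Symmetric moving average of white noise with power-law weights
    # a_j ~ (|j| + 1)**(H - 1.5); for H > 0.5 this yields long-range dependence.
    rng = np.random.default_rng(seed)
    j = np.arange(-q, q + 1)
    a = (np.abs(j) + 1.0) ** (H - 1.5)
    a /= np.sqrt(np.sum(a ** 2))                # approximate unit-variance normalization
    noise = rng.normal(size=n + 2 * q)          # white noise, padded for the convolution
    return np.convolve(noise, a, mode="valid")  # synthetic persistent series of length n

x = moving_average_persistent(50_000)
print(round(x.mean(), 3), round(x.std(), 3))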