1. Introduction
This paper proposes an extension of a generalized linear-regression-model-based approach to the identification of variance change points [
1] of functions that are observed in a fixed domain.
In particular, we describe multivariate signals by proposing a multivariate method that provides the variance change points of multivariate, heteroskedastic Gaussian variables. Although the proposed method can be applied to a wide class of contexts, e.g., the description of multivariate times series, we start from a specific research question that is related to the issue of finding the main phases of seismic curves.
Seismographs provide instrumental observations of earthquakes. A seismogram is a record of ground motion at a measuring station as a function of time. The analysis of seismograms is the basis of experimental seismology, as it is an efficient way to locate the hypocenter of an earthquake and determine its magnitude and source mechanism. A seismogram is constituted by the superposition of different wavefields that propagate up to the Earth’s surface from the hypocenter of the earthquake. The waves that radiate from the seismic source point can be divided into P- and S-waves. P-waves are the fastest of all elastic waves, especially in comparison with S-waves, which are surface and coda waves.
Therefore, seismograms show a characteristic structure: The starting oscillation corresponds to P-waves; in the central part, we generally observe an overlap with S-waves, followed by surface waves in the final part. However, this is an extreme simplification of the structure of a seismogram because, generally, the iteration of wave fields with seismic discontinuities present inside the Earth causes reflection and refraction phenomena, and the origin of new seismic phases can make the seismogram much more complex.
It is important to distinguish between the different seismic phases that appear in seismograms, especially for the location of the hypocenter. For an accurate phase detection and for the interpretation of seismic signals, defining automatic-time seismic phase picking is crucial.
For this purpose, the first issue is the recognition of the different characters of signal and noise in a variety of domains (e.g., amplitude, frequency, and angle features). There is not a single method that ensures consistent picking for any arrival. For instance, the energy approach that is most commonly used cannot be considered in the presence of a high noise level; on the other hand, bandpass filtering does not work when the noise and the signal have almost the same frequency content [
2].
The most commonly used automatic picking algorithms were reviewed in [
3]. Performance comparisons of different pickers were reported in [
4,
5,
6]. A characteristic function (CF) was introduced in [
7,
8] through one or several nonlinear transformations of a seismogram, with abrupt increases in the correspondence of the time of occurrence of a seismic wave. The authors of [
9] defined a new picking algorithm. In contrast to Allen’s squared envelope function, this CF was sensitive to changes in frequency, amplitude, and phase. Moreover, the distribution density function and some characteristics, such as the main moments (variance, skewness, and kurtosis, which are parameters of higher-order statistics (HOS) [
10]) of the seismogram, might characterize their statistical properties. Finally, the so-called autoregressive-Akaike-information-criterion picker (AR-AIC) proposed by [
4] was based on the work by [
11,
12,
13,
14].
Following [
1], in [
15], a univariate picking method based on the detection of a signal’s changes in variance was proposed, taking advantage of a formulation of the investigated problem with a generalized linear model, which is relevant for the analysis of changes in the variations of single waveforms.
In [
16], a new picking algorithm for the automatic determination of a seismic wave’s time of onset was tested on a simulated waveform dataset in order to capture the variations in performance due to the characteristics of the simulated waveform. Then, in [
17], the validity and robustness of the picking algorithm were tested on both synthetic seismograms and real data. In particular, seismic events of different magnitudes (2–5) recorded at different source distances (10–250 km) were simulated, and the results obtained were compared with those obtained by applying the conventional STA/LTA picking algorithm. Although both algorithms gave similar results in the simulations, the proposed algorithm resulted in higher flexibility and greater automation capabilities, as shown in the analysis of the real data.
In fact, the proposed picking algorithm did not require a testing or optimization phase and could be very effective for the routine analysis of earthquakes in new seismic networks or in areas in which the seismic characteristics are unknown.
In these papers, only the vertical component of the recorded seismic event was analyzed, and the main results stated that the algorithm performed best as the distance from the nearest seismic station that recorded the earthquake decreased and the magnitude of the seismic event increased. This is an expected result, as, of course, the shape of the waveform is strictly related to those characteristics of seismic events.
Therefore, motivated by the need for an algorithm that is able to pick arrival times in even more challenging settings, which turned out to be when waveforms were generated by seismic events with a low magnitude and were far from the nearest station that first recorded the event, we intended to use information from other detected components of seismic events to improve the fit of the model and, thus, the detection of seismic phases.
Hence, in this paper, we propose a multivariate automatic approach based on a generalized linear regression model for the detection of change points via a very efficient computational procedure.
We tested the performance of the proposed algorithm by simulating the locations of earthquakes with different magnitudes, which were assumed to be detected at variable distances from the nearest seismic station. This allowed us to highlight the most appropriate scenarios for the multivariate algorithm. To highlight its advantages, we further compared the results of the multivariate picking to those of its univariate counterpart. Finally, we also present an application with real data.
All of the codes were written in the R software language [
18] and are available from the authors.
The structure of this paper is as follows.
Section 2 proposes the model for the statistical picking of multivariate waveforms, and in
Section 3, its performance is assessed through a simulation study. Furthermore, an application to real data is presented in
Section 4. Finally, conclusions are drawn in
Section 5.
2. Moving from Changes in Mean to Changes in Variance
Given , where is the outcome and is the observed explanatory variable for the ith sampled unit with , let us assume that with is, for instance, a sinusoidal function representing the observed signal, and with is a variance function that is approximated by a piecewise constant regression function with segments.
Using the generalized linear model formulation, the test for the change in variance of a Gaussian random variable sequence can be equivalently transformed into a test for the change point of the rate parameter of a gamma sequence. Therefore, the changes in the variance of the original sequence are individuated in an attempt to look for changes in the mean of the residuals of the fitted linear model.
This represents a generalized reformulation of the approach of the
cumSeg model proposed in [
19], where a least-squares approach was used to detect multiple change points in genomic sequences. In fact, this paper shows a novel approach to obtaining estimates of the number and location of variations when the dispersion function is described by a piecewise constant function with unknown jump points. In particular, we also account for the multivariate nature of data while looking for change points in the variance of multivariate curves, extending the results of [
1].
Let
be the vector of the multivariate response for the
ith unit,
, with
. Assume that
, which is denoted as
for simplicity, has an exponential family distribution with the following density:
where
a,
b, and
c are known functions that characterize the distribution and
is a known dispersion parameter (refer to [
20] for the multivariate formalism). Given
g as the link function, the model for changes in
can be formulated as follows:
where
g is the link function,
is a broken-line covariate, and the
s represent
, or the locations of the changes (
change points).
represents the mean level for
, while
is the vector of the differences in the mean levels at the change points. We refer the reader to [
1,
19] for more details on the fitting procedure. In the multivariate case, we developed a model for modeling the variance change points in multivariate signals; this was still based on the estimation of the model (
2), where the response was the squared sum of Gaussian heteroscedastic variables. Following the suggestion of [
1,
21], once the mean signal
was estimated with a standard smoothing procedure, e.g., a cubic smoothing spline, we modeled the squared studentized residuals
with weights
, where
is the
ith diagonal element of the hat matrix
H, as the response of the model (
2). In the uni-dimensional case, the chosen distribution model was the Gamma one because of the strict positiveness and typically skewed dispersed nature of residuals.
Extending to a multivariate context, the Dirichlet multinomial distribution models multivariate data that exhibit over-dispersion [
22]. In particular, the Dirichlet multinomial is a multivariate extension of the beta–binomial distribution, as the multinomial and Dirichlet distributions are multivariate versions of the binomial distribution and beta distribution, respectively. In this case, the density of the
d-dimensional vector
with the parameter
is
where
. Here,
, often read as ‘
n choosing
k’, refers to the number of
k combinations from a set of
n elements.
The structure of the algorithm then follows that proposed in Adelfio [
1], with the only difference being that in the present approach, the squared residuals are modeled as responses of the Dirichlet multinomial regression model, and the tools of the MGLM package [
23] in the R software [
18] were used.
Starting from the model (
2), a user-fixed number of
change points were estimated, say
. Then, the two change points
and
, corresponding to the arrival of the first P-wave and the arrival of the first S-wave, were selected among the
points through the procedure denoted as
changepost, and this is reported in Algorithm 1.
Algorithm 1changepost |
Require:; ; ; ; |
Ensure: |
|
- 1:
fordo - 2:
if then - 3:
- 4:
end if - 5:
if then - 6:
- 7:
end if - 8:
if then - 9:
- 10:
end if - 11:
end for - 12:
- 13:
|
Basically, we computed the ratios between the variances of the subsequent phases identified by the
estimated change points. The two relevant change points
were selected in correspondence with the highest variance ratios. Of course, the larger the fixed number of change points
estimated by the model (
2) is, the larger the set of the two most potentially relevant change points selected by Algorithm 1 will be. Indeed, once the change points were selected by the proposed algorithm, we chose to distinguish those associated with the P- and S-waves based on their arrival times. Nevertheless, the first arrival time in seismograms is commonly defined as the arrival of the first P-wave.
For the problem of estimating the correct number of change points in segmented regression models, we refer the reader to [
24].
3. Simulation Study: Multivariate Sequence Waveforms
Waveforms were simulated with different characteristics, such as different magnitude values and different distances from the seismic station that first recorded the event. In particular, we resorted to the simulation approach from [
25], which involved summing the recorded seismograms from small earthquakes in the past with suitable time shifts and generating high-frequency seismograms. The dataset was made of 1000 waveforms of 300 s each (with a 250 Hz sampling step). Moreover, for each earthquake, the noise was simulated and added to the original sequence. Using a combination of the categories derived from the quantiles of the univariate distributions of the two variables (magnitude and distance from the station), we generated 16 alternative waveform scenarios. Consequently, a different number of waveforms was involved in each scenario, as shown in
Table 1.
The relevant change points were defined as the true times of arrival of the first P- and S-waves, which are denoted as
and
, respectively (i.e.,
and
).
Figure 1 displays an example of waveforms for each scenario. The vertical lines show the true arrival times. From this representation, the relationship between the shape of the waveform and the characteristics of the seismic event is evident.
We generated all three waveform components, namely, North–South, East–West, and vertical, for each earthquake. In this paper, we chose to report only the results for the two horizontal components, i.e., North–South and East–West. Furthermore, we analyzed the first 1/3 of the waveforms, as the remaining time period of the waveforms surely did not contain the true change points. Therefore, the rest of the waveforms would not contribute to the detection of the change points, but would only make the computation heavier.
Table 2 shows the empirical means and the mean squared errors in round brackets for the two horizontal components in each scenario. The reported NAs denote when no arrival times were found in the analyzed waveform.
A relevant advantage of the proposed multivariate picking method is the unique estimation of the change points over the two components, since the change points estimated along the two components were very close to each other.
The multivariate algorithm succeeded in picking the arrival times equally in each scenario. This represents a further advantage of the multivariate algorithm with respect to its univariate counterpart. Indeed, the preliminary experiments in [
16] showed that the univariate algorithm achieved the best performance with decreasing distances (from the nearest seismic station that recorded the event) and with increasing magnitude. NAs occurred often when the distance is large and the magnitude was small, indicating scenarios in which the P- and S-waves were basically indiscernible from the noise. In those cases, the change points were not easily estimable. To make the results comparable, we ran the simulations with the univariate picking algorithm with the same maximum number of estimated change points
, which was equal to 10, and we computed the mean squared errors of the multivariate estimates as the mean over the estimates for the two components, divided by their standard deviations. As shown in
Table 2, the NAs (no estimated arrival times) for the univariate case strongly depended on the characteristics of the waveforms, and they failed completely to identify any change points in some scenarios, which were those with the largest distances from the station and with the lowest magnitudes.
Furthermore, by looking at the mean squared errors, it appears evident that the univariate picking method was the best in detecting the first arrival time in some specific scenarios, while the second arrival time was detected better overall by the multivariate picking method. Indeed, we could not identify relevant differences among the different scenarios. On the contrary, the mean squared errors of the univariate picking method increased as the distance from the station and the magnitude increased.
4. Application
This section describes the application of the proposed methodology to a real dataset from the seismic sequence that occurred in L’Aquila (Italy) in 2009. The seismic events of the L’Aquila cluster were selected while considering all of the seismic events that fell in a circular area within a radius of 10 km, centered on the epicenter of the main shock in the city of L’Aquila with a magnitude of 5.9. This event, which took place on April 6 at 03:32, had the following hypocentral coordinates: latitude = 42.349 N, longitude = 13.38 E, hypocentral depth = 8.3 km. In order to collect the events with the highest signal-to-noise ratios that were simultaneously recorded by a large number of stations, a threshold magnitude of 3 was set. Events with a magnitude lower than this threshold were not taken into consideration. The only 3D broadband stations that were selected were those that recorded all of the seismic events identified with the above criteria. The final dataset was made up of 80 earthquakes recorded by 12 seismic stations. The three components of motion to which they referred (up–down, North–South, and East–West) were available for 2880 waveforms in total. Each waveform was sampled at 100 Hz, and the total length of each waveform was 100 s (10,000 samples). To increase the signal-to-noise ratio, a bandpass filter was preliminary applied, with a frequency (FIR 6) between 1 and 35 Hz. That operation was necessary in order to discard frequencies that were related to electronic and anthropic noise, which were clearly not parts of the seismic signals. Further, a normalization with respect to the waveform’s maximum amplitude was applied. This particular dataset represents a complex scenario in which the univariate algorithm failed to pick any arrival times.
We only considered events coming from the BSSO—Busso station (Italy); thus, we selected waveforms with different shapes. Having selected only events falling in a circular area with a radius of 10 km around the main shock, the distances from the seismic station were similar. The BSSO station had the following coordinates: latitude = 41.5461 N and longitude = 14.5938 E. This made it approximately 134.3 km distant from the L’Aquila main shock. Furthermore, all of the seismic events exhibited magnitude levels between 3 and 4.1, making this scenario quite challenging for the selection of the change points.
Figure 2 shows three waveforms belonging to the selected seismic sequence. The red vertical lines indicate the
change points estimated by the multivariate algorithm. The green lines indicate the two change points selected by Algorithm 1, which most likely represent the true P- and S-waves’ arrival times. Overall, the proposed multivariate procedure succeeded in picking the arrival times in 23 out of the 80 seismic events. The univariate counterpart only picked 3 times out of 80, though it started from
.
This represents an important result, as we showed that the multivariate algorithm could detect arrival times in quite complex waveforms that came from seismic events with considerably low magnitudes. Indeed, in seismic sequences, such as that which occurred in L’Aquila, the majority of the detected waveforms come from events with lower magnitudes (and, consequently, complex waveforms). The simultaneous information provided by the two components helps in the identification of the arrival times in scenarios in which only one component is not sufficient.
5. Conclusions
In this paper, we developed an efficient and simple method for detecting unknown change points in multivariate regression models with a piecewise constant variance function.
This approach can be seen as a generalization of the
cumSeg procedure from [
19] for any regression model and of the approach proposed in [
1], since we tested for changes in the variance of multivariate signals.
The multivariate version of the picking algorithm from [
17] was tested through simulations. With this aim, we simulated waveforms with different magnitudes and distances from the nearest seismic station.
The results showed an improvement in the picking of the seismic phases; therefore, the exploitation of other components’ information is crucial for the better detection of such phases. With particular reference to the analysis of real data presented in this paper, the proposed multivariate procedure succeeded in picking the arrival times correctly in 23 out of the 80 seismic events, that is, it had a success rate of about 30% of the cases, unlike the univariate method, which picked values for about 4% of the cases. This constitutes a great improvement. Nevertheless, we are aware that this percentage is quite low, and therefore, we plan to test the performance of the algorithm with additional studies of real data.
The most relevant result of the multivariate version of the picking algorithm was its ability to estimate the times of arrivals regardless of the waveforms’ shape. Apparently, the information provided by the two components that were simultaneously analyzed counterbalanced the need for a less complex shape of the waveforms, though this was necessary for the univariate picking method in order to correctly identify the arrival times.
A further advantage of the multivariate procedure in comparison with its univariate counterpart is its ability to provide a unique estimate of the change points over the two components used in the fitting procedure.
In the manual picking of P- and S-waves, it is a well-established practice to pick the P-waves’ arrival on the vertical component, while the S-waves arrive on one of the horizontal components. This is mainly because P-waves are better discriminated on the vertical component, while horizontally polarized S-waves are more easily recorded on the horizontal components. We decided not to use this observation at this stage of our proposal in order to make the algorithm generalizable to other application fields in which testing a change in variance could be crucial. Some fields of application include stock market records, the analysis of blood flows, etc. For this reason, we believe that one of the main advantages of the proposed methodology is its broad application field. This, of course, is related to its ability to detect multiple change points in a variance function; thus it is also applicable to jump-points via a computationally efficient procedure (see, for instance, Muggeo and Adelfio [
19], Huang et al. [
26], and Tibshirani and Wang [
27] for different applications using mean change point detection approaches).