A Lagrangian Backward Air Parcel Trajectories Clustering Framework

Rădulescu, Iulia-Maria; Boicea, Alexandru; Rădulescu, Florin; Popeangă, Daniel-Călin

doi:10.3390/w13243638

Computer Science and Engineering Department, University Politehnica of Bucharest, 060042 Bucharest, Romania

Water 2021, 13(24), 3638; https://doi.org/10.3390/w13243638

Submission received: 25 November 2021 / Revised: 4 December 2021 / Accepted: 13 December 2021 / Published: 17 December 2021

(This article belongs to the Special Issue Smart Water Solutions with Big Data)

Abstract:

Many studies concerning atmospheric moisture paths use Lagrangian backward air parcel trajectories to determine the humidity sources for specific locations. Automatically grouping trajectories according to their geographical position simplifies and speeds up their analysis. In this paper, we propose a framework for clustering Lagrangian backward air parcel trajectories, from trajectory generation to cluster accuracy evaluation. We employ a novel clustering algorithm, called DenLAC, to cluster tropospheric air current trajectories. Our main contribution is representing each trajectory as a one-dimensional array consisting of the directions of its points’ position vectors. We empirically test our pipeline by employing it on several Lagrangian backward trajectories initiated from Břeclav District, Czech Republic.

Keywords:

Lagrangian backward trajectories; clustering; HYSPLIT; position vectors

1. Introduction

Many researchers rely on Lagrangian backward trajectory clustering to extract important insights regarding water-related extreme weather phenomena, such as heavy rainfalls, violent storms, floods, or diffuse water pollution (Hao et al. [1], Juhlke et al. [2], Karaca et al. [3], and Borge et al. [4]).

To support their studies, we propose a complete, efficient, and accurate framework for Lagrangian backward trajectories visualization and cluster analysis.

Our method is especially suitable for investigating the causes of particular meteorological events: one can employ it directly on a trajectory file instead of combining the results of several complicated tools, thus simplifying and standardizing the process.

We improve the performance and correctness of the clustering operation by combining lightweight trajectory representation with a flexible and accurate clustering algorithm.

To validate our method, we use a real-world example with applications in flood risk management. However, we do not explain the results from the meteorological point of view, since this is outside the scope of this paper.

In the following paragraphs, we briefly describe the proposed pipeline.

First, we use a free online tool to compute several air parcel trajectories starting from a specific location, relying on archived gridded meteorological information, such as the wind vector at different time intervals. We choose the recent extreme meteorological events in the Czech Republic (June 2021) to exemplify our method, thus initiating the trajectories in the Břeclav District. Before employing the actual clustering operation, we apply several transformations on the trajectories: (i) we convert the geographical coordinates into Cartesian coordinates, (ii) we normalize the resulting values, (iii) we translate the trajectories’ points relative to the Břeclav District’s location, and (iv) we represent trajectories as one-dimensional arrays, consisting of the direction of each point’s position vector. To cluster the preprocessed data, we choose a recent clustering algorithm called DenLAC [5] due to its flexibility. The DenLAC algorithm accurately identifies various cluster types: spherical, elongated, and with different sizes and densities. Finally, we validate our results using three popular internal quality measures: (i) the Davies–Bouldin Index (DB) [6], (ii) the Calinski–Harabasz Index (CH) [7], and (iii) the Silhouette Coefficient [8].

The essence of our method lies in the way we preprocess the trajectories: we represent each trajectory as the set of its points’ position vectors directions. Hence, curve-specific dissimilarity measures such as Fréchet or Hausdorff distance are no longer needed, thus simplifying the clustering process.

The rest of this paper is structured as follows: in Section 2, we point out several research papers related to Lagrangian trajectory modeling; Section 3 defines the concepts that our method relies on; in Section 4, we describe our framework in detail, focusing on the way we express trajectories as one-dimensional arrays; we evaluate our results in Section 5 using three popular internal measures; and finally, we draw the conclusions in Section 6.

The abbreviations and acronyms used throughout this paper are displayed in Table 1 and are also defined upon their first appearance.

2. Related Work

Lagrangian trajectory modeling is used in numerous studies regarding atmosphere moisture paths. The research papers mentioned in the following paragraphs use the Hybrid Single-Particle Lagrangian Integrated Trajectory (HYSPLIT) model to simulate backward air parcel trajectories.

Hao et al. [1] rely on clustering and Lagrangian trajectory modeling to identify the sources of several atypical climatic events in southern China, focusing on the Yunnan Province and Guangxi Zhuang Autonomous Region. To delimit the considered areas based on the origin of the water vapors that affect them (either the Bay of Bengal or the South China Sea), the authors generate multiple Lagrangian backward air parcel trajectories initiated in certain observation stations. The meteorological data used to simulate the trajectories were gathered between 2013 and 2016, from April to October, by the Global Data Assimilation System (GDAS). Hao et al. then cluster the trajectories, employing: (i) a hierarchical approach, which yields individual water vapor channels, used to analyze water vapor transport passage during the entire rainy season and (ii) a software tool for trajectory clustering and visualization to analyze monthly water vapor transport passage.

Rapolaki et al. [9] research the moisture sources for the heavy rains in the Limpopo River Basin, South Africa, by applying the HYSPLIT model to generate backward trajectories. The trajectories are initiated from the center of the Limpopo River Basin, at 1500 m above sea level, four times a day, and tracked backward at an hourly interval for 10 days (the average residence time for moisture in the atmosphere). The HYSPLIT model was run for 36 years (1981–2016) using the National Center for Environmental Prediction (NCEP)/DOE NCEP II Reanalysis data.

Similarly, Juhlke et al. [2] use the HYSPLIT trajectory model to analyze the precipitation sources in the Pamir Mountains, Tajikistan, Central Asia, to identify and quantify the influence of the Mediterranean humidity based on the deuterium excess. The authors also investigate the isotopic composition of the Mediterranean water volumes.

3. Methodology

In this section, we present the theoretical notions, algorithms, and tools our framework uses: Lagrangian backward trajectory modeling of air parcels, trajectory clustering, the DenLAC algorithm, and the HYSPLIT (Hybrid Single-Particle Lagrangian Integrated Trajectory model) [10] trajectory modeling system.

3.1. Lagrangian Trajectory Model

Lagrangian trajectory modeling is especially helpful when investigating global humidity transport and regional humidity recycling [11], thus enabling the extraction of information about extreme precipitation events [12]. To identify the meteorological phenomena that caused floods in a specific area and time frame, the humidity source and its transport paths must be analyzed, starting from the affected area and going back in time [12].

This process relies on backward air parcel trajectories computation.

We define an air parcel as an imaginary volume of air possessing any or all of the basic dynamic and thermodynamic properties of atmospheric air, where the following constraints apply:

the atmospheric conditions (such as humidity or pressure) are the same anywhere inside the air parcel;
the air inside an air parcel is isolated from the exterior; thus, there is no heat exchange between the interior and exterior of an air parcel;
the air parcel does not have a specific dimension but must be large enough to contain a significant number of molecules.

An air parcel’s trajectory intuitively represents its path in space and time under the wind’s influence. Trajectories can be computed backward in time for diagnosis, using archived values of the wind vector (such as the ones provided by GDAS, part of the Global Forecast System (GFS)), or forward in time for forecasting, using wind forecast models [13].

The majority of the Lagrangian trajectory modeling tools rely on the same equation [14] (Equation (1)):

$$\frac{d x}{d t} = u(x) \qquad (1)$$

where $x = (\lambda, \phi, p)$ stands for the geographic location vector and $u = (u, v, \omega)$ represents the three-dimensional wind vector [14]. Equation (1) defines how air parcels evolve in time under the influence of the wind, and its solution yields the air parcels’ paths.

Starting from a geographic location defined by its latitude, longitude, and altitude, $x_0 = (lat_0, long_0, alt_0)$, at timestamp $t_0$, the next location of an air parcel is computed as follows, relying on Equation (1) [14]:

$$x_1 = x_0 + u(x_0, t_0) \cdot \Delta t \qquad (2)$$

where $u(x_0, t_0)$ represents the wind vector at location $x_0$ and timestamp $t_0$.

For the next trajectory points, the wind vector is computed as the mean of its values at the initial position and at the previously estimated position [14]:

$$u_1 = \frac{1}{2} \left[ u(x_0, t_0) + u(x_1, t_0 + \Delta t) \right] \qquad (3)$$
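To make Equations (2) and (3) concrete, the following minimal Python sketch advances an air parcel by one time step and chains such steps into a backward trajectory. It is an illustration under stated assumptions, not the HYSPLIT implementation: the names `advect_step`, `backward_trajectory`, and the `wind` callback are hypothetical, positions and wind components are assumed to share consistent units, and the final update of the position with the averaged wind of Equation (3) is our assumption about how the two equations are combined.

```python
from typing import Callable, List, Tuple

Vec = Tuple[float, float, float]
# Hypothetical wind lookup: (position, time) -> wind vector in matching units.
Wind = Callable[[Vec, float], Vec]


def advect_step(x0: Vec, t0: float, dt: float, wind: Wind) -> Vec:
    """Advance one step: first guess via Eq. (2), averaged wind via Eq. (3)."""
    u0 = wind(x0, t0)
    # Equation (2): first-guess position using the wind at the current point.
    x1_guess = tuple(x + u * dt for x, u in zip(x0, u0))
    # Equation (3): mean of the wind at the current and first-guess positions.
    u_guess = wind(x1_guess, t0 + dt)
    u_mean = tuple(0.5 * (a + b) for a, b in zip(u0, u_guess))
    # Assumption: the final position is obtained with the averaged wind.
    return tuple(x + u * dt for x, u in zip(x0, u_mean))


def backward_trajectory(x0: Vec, t0: float, dt: float,
                        n_steps: int, wind: Wind) -> List[Vec]:
    """Chain n_steps steps backward in time (negative time increment)."""
    step = -abs(dt)
    points = [x0]
    t = t0
    for _ in range(n_steps):
        points.append(advect_step(points[-1], t, step, wind))
        t += step
    return points
```

With a uniform wind, the averaged wind equals the local wind, so the parcel simply moves against the wind direction by one wind-vector length per step.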

3.2. HYSPLIT (Hybrid Single-Particle Lagrangian Integrated Trajectory Model)

HYSPLIT (Hybrid Single-Particle Lagrangian Integrated Trajectory model) [10] is a complex system offered by NOAA (National Oceanic and Atmospheric Administration), frequently employed for computing air parcel trajectories and implementing air transport and dispersion simulations.

The HYSPLIT model processes gridded meteorological data extracted at regular time intervals to output backward air parcel trajectories. A specific example is the generation of backward air parcel trajectories initiated in Istanbul using gridded meteorological data retrieved hourly during a time interval of five days. The Global Data Assimilation System (GDAS) provides such datasets free of charge; it gathers various observation types, such as surface observations, data from meteorological balloons, wind profiles, and satellite data and places them on a gridded model space.

The online version of the model (https://www.ready.noaa.gov/HYSPLIT.php, accessed on 10 December 2021) outputs the computed data in an ASCII file. The file contains information such as the trajectories’ direction (backward or forward in time), the number of meteorological grids used in the calculations, the vertical movement calculation method, and the total number of trajectories. However, for this paper, the most relevant information consists of the latitude and longitude of each trajectory’s points and the diagnosis variables (pressure and humidity).

3.3. Air Parcel Trajectory Clustering

The process of grouping similar objects together while keeping dissimilar objects separate is called clustering. Trajectory clustering is a popular research topic in the field of trajectory data mining and is used to discover common movement behaviors [15].

A trajectory $T_i$ is defined as a sequence of observations $(P_{i1}, P_{i2}, \dots, P_{in})$ consisting of $n$ tuples $P_{ij} = (x_{ij}, y_{ij}, t_{ij})$, where $x_{ij}$ and $y_{ij}$ are the spatial coordinates and $t_{ij}$ is the timestamp [16]. $P_{ij}$ may contain other relevant variables, for example, the pressure, altitude, and humidity in the case of air parcel trajectories.

In this paper, we consider only the spatial coordinates: latitude and longitude. This is because the timestamp is already considered when computing the spatial coordinates from the wind vector in Equation (2). Thus, we represent an observation $P_{ij}$ as $P_{ij} = (latitude_{ij}, longitude_{ij})$.

Measuring the dissimilarity between two trajectories is essential in the clustering and evaluation process. Given the previous representation of an air parcel trajectory, we require a metric that can handle curves, such as the Fréchet distance [17].

The Fréchet distance between two curves $f : [a, a'] \to V$ and $g : [b, b'] \to V$, where $a, a', b, b' \in \mathbb{R}$ and $V$ represents the Euclidean vector space, is computed as the infimum over all possible reparametrizations $\alpha : [0, 1] \to [a, a']$ and $\beta : [0, 1] \to [b, b']$ of the maximum over all $t \in [0, 1]$ of the distances between $f(\alpha(t))$ and $g(\beta(t))$, and is formally defined in Equation (4) [17].

$$d_F(f, g) = \inf_{\substack{\alpha : [0, 1] \to [a, a'] \\ \beta : [0, 1] \to [b, b']}} \max_{t \in [0, 1]} \| f(\alpha(t)) - g(\beta(t)) \| \qquad (4)$$

We note that, since we employ a custom transformation on the trajectories to increase accuracy and performance (which we detail in Section 4), we use the Fréchet distance only during result evaluation.
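For sampled trajectories, the continuous definition is commonly approximated by the discrete Fréchet distance, computed with dynamic programming over the two point sequences. The sketch below illustrates that standard approximation for bidimensional points; it is an assumption that a discrete variant is appropriate here, and the function name `discrete_frechet` is hypothetical rather than taken from the framework’s source code.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def discrete_frechet(p: Sequence[Point], q: Sequence[Point]) -> float:
    """Discrete Fréchet distance between two polylines given as point lists."""
    if not p or not q:
        raise ValueError("both trajectories must contain at least one point")
    n, m = len(p), len(q)
    # ca[i][j]: coupling distance between the prefixes p[:i+1] and q[:j+1].
    ca: List[List[float]] = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.hypot(p[i][0] - q[j][0], p[i][1] - q[j][1])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Best of: advance on p, advance on both, advance on q.
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]
```

The table has one cell per pair of points, so the cost grows with the product of the two trajectory lengths, which is the overhead the one-dimensional representation of Section 4 aims to avoid.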

3.4. The DenLAC Algorithm

DenLAC (Density Levels Aggregation Clustering) [5] is a hybrid clustering algorithm characterized by high flexibility, as its results are independent of the input dataset’s shape and distribution. It combines several popular notions from data mining and statistics, such as the probability density function, Kernel Density Estimation, density levels, and density-based and hierarchical clustering. We use DenLAC to perform the actual trajectory clustering due to its ability to correctly discover clusters of various shapes and sizes, especially for low-dimensional datasets. However, our framework supports any other clustering method. Thus, the following detailed information regarding DenLAC’s functionality is intended to provide deeper insight into the proposed method rather than being required for a basic understanding.

To detail this algorithm’s approach and pipeline, we briefly introduce some of the concepts it relies on: (i) the probability density function, (ii) Kernel Density Estimation, and (iii) density levels.

3.4.1. The Probability Density Function

The probability density function describes the probability distribution of a continuous random variable; more simply put, it yields the likelihood that a value sampled from that continuous random variable belongs to a specified interval. The probability that a continuous random variable $X$ with probability density function $f$ takes a value in the range $[a, b]$ is formally defined in Equation (5).

$$P(a < X < b) = \int_{a}^{b} f(x) \, dx \qquad (5)$$

The probability density function is high in the regions with numerous crowded objects and low in the areas where the objects are few and sparsely distributed. Thus, it is an appropriate measure of objects’ cohesiveness in the density-based clustering process.

3.4.2. Kernel Density Estimation

Kernel Density Estimation [18] (KDE for short) is a method for estimating the probability density function f of a dataset D using only the observed data (the objects in D). The KDE method adds up special functions centered at each data point in D to create a smooth curve that approximates the real probability density function.
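In its usual form, the estimate is $\hat{f}(x) = \frac{1}{n h} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$, where $K$ is the kernel and $h$ is the bandwidth. The following sketch illustrates this for one-dimensional data with a Gaussian kernel. The kernel choice, the explicit `bandwidth` parameter, and the name `gaussian_kde` are assumptions made for illustration; the source does not state which kernel or bandwidth DenLAC uses.

```python
import math
from typing import Callable, Sequence


def gaussian_kde(data: Sequence[float], bandwidth: float) -> Callable[[float], float]:
    """Return a function estimating the density of 1-D data at a point x."""
    if not data:
        raise ValueError("data must not be empty")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    # Normalization so that the estimate integrates to one.
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x: float) -> float:
        # Sum of Gaussian bumps, one centered at each observed data point.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density
```

A larger bandwidth produces a smoother curve that may merge nearby modes, while a smaller one follows the individual observations more closely.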

3.4.3. Density Levels

The density levels of a set $D$ of objects are the regions for which the density value is equal to or above a particular level $\lambda$ [19,20] and are formally defined as:

Definition 1 (Density Levels). Given a level $\lambda$, the $\lambda$-density level is $L_{\lambda} = \{x : f(x) \geq \lambda\}$, where $f(x)$ is the density estimate for the object $x$.

3.4.4. DenLAC Fundamentals and Pipeline

The key idea of DenLAC is expressing clusters as adjacent intervals of densely distributed objects organized into well-delimited connected components, called density bins. The authors formally define density bins as the set differences between neighboring density levels (Definition 1) as follows:

Definition 2 (Density Bins). For a given dataset $D$ with the probability density function $f(\bar{D})$ and $n$ density levels $L_{\lambda_i}$, where $i \in (0, n)$, we define density bins as $B_i = L_{\lambda_i} \setminus L_{\lambda_{i-1}}$.

DenLAC consists of five consecutive steps:

estimation of the probability density function of the input dataset by employing a non-parametric density estimation method, Kernel Density Estimation;
outlier identification and removal, applying the Interquartile Range method on the probability density function computed at the previous step;
assignment of each input dataset object to a density bin, after re-estimating the probability density function on the filtered dataset; the objects are allocated to their corresponding density bins according to their probability density value, using a histogram;
extraction of the connected components comprising each density bin, using the nearest neighbors approach;
merging of the previously computed connected components to yield the final clusters; the connected components are combined hierarchically, based on the minimum distance between them.
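The outlier filtering and bin assignment steps listed above can be sketched as follows. This is a simplified illustration operating on already estimated density values, not the DenLAC implementation: the two-sided fences with factor `k = 1.5`, the equal-width histogram, and the function names `iqr_inlier_indices` and `assign_density_bins` are all assumptions.

```python
import statistics
from typing import List, Sequence


def iqr_inlier_indices(densities: Sequence[float], k: float = 1.5) -> List[int]:
    """Indices of objects whose density lies within the IQR fences."""
    q1, _, q3 = statistics.quantiles(densities, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, d in enumerate(densities) if low <= d <= high]


def assign_density_bins(densities: Sequence[float], n_bins: int) -> List[int]:
    """Assign each object to one of n_bins equal-width density intervals."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    low, high = min(densities), max(densities)
    if high == low:
        # All objects share the same density: a single bin suffices.
        return [0] * len(densities)
    width = (high - low) / n_bins
    # The maximum density falls on the upper edge; clamp it into the last bin.
    return [min(int((d - low) / width), n_bins - 1) for d in densities]
```

In the full algorithm, the density would be re-estimated on the inliers before binning, and each bin would then be split into connected components.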

4. Method Pipeline

The proposed method’s pipeline (graphically displayed in Figure 1) consists of the following steps:

employing the HYSPLIT trajectory model to generate backward trajectories initiated in the region of interest;
preprocessing the previously computed trajectories to improve the accuracy and efficiency of the clustering operation; this phase contains the essence of our method: representing each trajectory as the set of the angles between its points’ position vectors and the $O x$ axis;
applying the DenLAC clustering algorithm on the preprocessed trajectory data;
evaluating our results using several internal measures. To ensure correctness, we assign the initial trajectories to the computed clusters and use the Fréchet distance to determine the dissimilarity between two trajectories.

Concisely, after generating the Lagrangian backward trajectories using HYSPLIT, we transform them into one-dimensional arrays (we detail this process in the following subsection, Section 4.1), and then feed them to a given clustering algorithm.

4.1. Expressing Trajectories as One-Dimensional Arrays

As we show in Section 3.3, a trajectory is a set of bidimensional observations. Consequently, the algorithm employed in the clustering process must be able to process either curves or distance matrices. Moreover, to determine the distance between two trajectories, curve-specific dissimilarity measures (such as the Fréchet distance) must be used.

Representing trajectories as multidimensional points instead of curves would significantly speed up and simplify the clustering process. In the following paragraphs, we propose a method to achieve this.

Given a trajectory $T_i$ consisting of a set $(P_{i1}, P_{i2}, \dots, P_{in})$ of $n$ bidimensional points, we denote by $V = (\vec{v_{i1}}, \vec{v_{i2}}, \dots, \vec{v_{in}})$ the set of the points’ position vectors.

Given a point $P_{ij} = (x_{ij}, y_{ij})$, its position vector’s direction is defined as $\theta_{ij} = \tan^{-1}\left(\frac{y_{ij}}{x_{ij}}\right)$. We represent a trajectory $T_i$ by the directions of its points’ position vectors as $T_i = (\theta_{i1}, \theta_{i2}, \dots, \theta_{in})$. To clarify our approach, we graphically display the position vector direction of one trajectory point in Figure 2: $\theta_{11}$ is point $P_{11}$’s position vector direction.

The above transformation is possible because all backward trajectories satisfy two constraints: (i) they start from the same point (as per definition), and (ii) they are approximately linear in the sense that they generally do not suffer frequent or sharp changes (since they are modeled using wind vectors at consecutive time intervals).
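The transformation can be sketched in Python as follows, assuming the points are already expressed in Cartesian coordinates. The function names, the convention for points lying on an axis, the handling of $x = 0$, and the choice to skip the shared starting point (whose position vector has no direction) are assumptions made for this illustration and are not taken from the framework’s source code.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def quadrant(x: float, y: float) -> int:
    """Geometric quadrant (1-4); points on an axis go to the lower-numbered side."""
    if x >= 0 and y >= 0:
        return 1
    if x < 0 and y >= 0:
        return 2
    if x < 0 and y < 0:
        return 3
    return 4


def direction_deg(x: float, y: float) -> float:
    """Signed direction atan(y / x) in degrees, with x == 0 mapped to +/-90."""
    if x == 0:
        if y == 0:
            raise ValueError("the origin has no direction")
        return math.copysign(90.0, y)
    return math.degrees(math.atan(y / x))


def to_direction_array(points: Sequence[Point], origin: Point) -> List[float]:
    """Translate points relative to the origin and return their directions."""
    ox, oy = origin
    directions = []
    for x, y in points:
        dx, dy = x - ox, y - oy
        if dx == 0 and dy == 0:
            continue  # the shared starting point carries no direction
        directions.append(direction_deg(dx, dy))
    return directions
```

Because the arctangent of the ratio cannot distinguish opposite quadrants, the quadrant has to be tracked alongside the signed angle, which is what the distance defined next accounts for.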

After the aforementioned transformation, we can define the distance between two trajectories $T_k$ and $T_l$ in terms of $\theta_{ki}$ and $\theta_{li}$ as in Equation (6):

$$d_{\theta}(T_k, T_l) = \begin{cases} \dfrac{\sum_{i=1}^{n_{\theta}} \left| |\theta_{ki}| - |\theta_{li}| \right|}{n_{\theta}}, & \text{if } quad(\theta_{ki}) = quad(\theta_{li}) \\[2ex] \dfrac{\sum_{i=1}^{n_{\theta}} \left| |\theta_{ki}| + |\theta_{li}| \right|}{n_{\theta}}, & \text{if } quad(\theta_{ki}) \bmod 2 = quad(\theta_{li}) \bmod 2 \\[2ex] \dfrac{\sum_{i=1}^{n_{\theta}} \left( 360 - \left| |\theta_{ki}| + |\theta_{li}| \right| \right)}{n_{\theta}}, & \text{if } quad(\theta_{ki}) \bmod 2 \neq quad(\theta_{li}) \bmod 2 \end{cases} \qquad (6)$$

where $quad(\theta)$ is $\theta$’s geometric quadrant, $|\theta|$ represents $\theta$’s absolute value, $|\theta_i - \theta_j|$ is the absolute value of the difference between angles $\theta_i$ and $\theta_j$, and $n_{\theta}$ is the number of angles in each trajectory.

In the following paragraphs, we explain the need for Equation (6) using graphical examples.

Intuitively, the distance between two sets $(\theta_{11}, \theta_{12}, \dots, \theta_{1n})$ and $(\theta_{21}, \theta_{22}, \dots, \theta_{2n})$ is the mean of the distances between corresponding coordinates: $\frac{\sum_{i=1}^{n} |\theta_{1i} - \theta_{2i}|}{n}$. However, since angles naturally belong to the interval $[0, 360]$, but we express $\theta_{ij}$ through the signed ratio between the $y_{ij}$ and $x_{ij}$ coordinates, we must adapt the angle distance function.

For example, in Figure 2b, the correct distance between the directions of the position vectors corresponding to $P_{41}$ and $P_{51}$ is angle $\beta$. Similarly, the correct distance between the directions of the position vectors corresponding to $P_{21}$ and $P_{31}$ is angle $\alpha$. We generalize this observation in Equation (6) by accounting for each angle $\theta$’s quadrant.
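To illustrate why a plain difference of angles is not sufficient, the sketch below uses a different but standard technique: it measures directions on the full circle with `math.atan2` and takes the smaller of the two arcs between them. This is explicitly not Equation (6), which works with signed arctangent values and quadrants; it is a swapped-in wrap-around distance shown only to make the underlying idea concrete, and all function names are hypothetical.

```python
import math
from typing import Sequence


def full_circle_angle(x: float, y: float) -> float:
    """Direction of the position vector (x, y) in degrees, in [0, 360)."""
    return math.degrees(math.atan2(y, x)) % 360.0


def angular_gap(a: float, b: float) -> float:
    """Smaller arc between two directions given in degrees, in [0, 180]."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def mean_angular_distance(t_k: Sequence[float], t_l: Sequence[float]) -> float:
    """Mean wrap-around distance between two equally long direction arrays."""
    if len(t_k) != len(t_l) or not t_k:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(angular_gap(a, b) for a, b in zip(t_k, t_l)) / len(t_k)
```

For instance, directions of 350 and 10 degrees are 20 degrees apart on the circle, whereas their plain difference would be 340.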

5. Experimental Results

In this section, we provide the experimental setup details, describe the dataset used to test the proposed framework and the employed evaluation measures, and discuss the results.

We focus on validating our method by appraising the quality of the resulting trajectory clusters rather than extracting meteorological insights from the test dataset. We are particularly interested in demonstrating that the trajectory representation described in Section 4.1 is correct and yields reliable results when combined with the DenLAC clustering algorithm (Section 3.4).

5.1. Experimental Setup

We run our experiments on an Oracle Linux Server 7.6 machine with 64 GB RAM and 40 CPUs. The source code is implemented in Python 3.7.7 and is publicly available online at the following link: https://github.com/IuliaRadulescu/WaterMgmt, accessed on 10 December 2021.

5.2. Dataset

To exemplify our method, we analyze the extreme phenomena that took place in the Czech Republic in June 2021: the tornado followed by “tennis ball-sized hailstone”. We choose the Břeclav District as the trajectories’ source (48.7548 latitude, 16.8860 longitude). We search for air currents at the following altitudes: 3 km, 4.5 km, and 6 km above sea level. We use GDAS1 (Global Data Assimilation System, https://www.ready.noaa.gov/gdas1.php, accessed on 10 December 2021) archived information for the 9 June 2021–22 June 2021 time interval.

To generate the backward trajectories we use HYSPLIT, one of the most popular trajectory modeling tools, which we describe in detail in Section 3.2.

5.3. Evaluation Measures

To validate our pipeline, we must evaluate the quality of the trajectory clusters. For this purpose, there are two approaches: (i) evaluate the results internally, based on specific properties (for example, cluster compactness and separation) and (ii) compare the results with ground truth (for example, a manual clustering of the input dataset’s trajectories provided by an expert).

Since there is no ground truth already available for our dataset, we must employ internal evaluation measures to assess our method’s accuracy. Similar to Cui et al. [21], who evaluated three distinct clustering methods on air parcel Lagrangian trajectory data, we use the Davies–Bouldin Index (DB) [6], the Calinski–Harabasz Index (CH) [7], and the Silhouette Coefficient [8].

The Davies–Bouldin Index (DB) [6] is a generic cluster separation measure that accounts for a group’s compactness and separation [22], relying on the similarity between each cluster and its most similar peer [6]. The index is computed as the ratio of the distance within the clusters to the distance between the clusters [21] and is formally defined in Equation (7):

$$DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j, \, i \neq j} \left( \frac{avg(C_i) + avg(C_j)}{d(\mu_i, \mu_j)} \right) \qquad (7)$$

where $K$ is the total number of clusters $(C_1, \dots, C_K)$, $\mu_i$ is the centroid of cluster $C_i$, $d(\mu_i, \mu_j)$ stands for the distance between centroids $\mu_i$ and $\mu_j$, and $avg(C_i)$ represents the average distance between $C_i$’s objects and its centroid $\mu_i$.

A small value for the DB Index indicates well-separated clusters.
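Equation (7) can be sketched as follows for objects represented as numeric tuples. The sketch assumes the Euclidean distance and the coordinate-wise mean as centroid; for trajectories compared with another dissimilarity, both would have to be replaced. The helper names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def centroid(cluster: Sequence[Vector]) -> Vector:
    """Coordinate-wise mean of the objects in a cluster."""
    dim = len(cluster[0])
    return tuple(sum(p[k] for p in cluster) / len(cluster) for k in range(dim))


def euclidean(p: Vector, q: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def davies_bouldin(clusters: Sequence[Sequence[Vector]]) -> float:
    """Davies-Bouldin Index as in Equation (7); lower is better."""
    k = len(clusters)
    if k < 2:
        raise ValueError("at least two clusters are required")
    centroids: List[Vector] = [centroid(c) for c in clusters]
    # avg(C_i): mean distance between a cluster's objects and its centroid.
    scatter = [
        sum(euclidean(p, mu) for p in c) / len(c)
        for c, mu in zip(clusters, centroids)
    ]
    total = 0.0
    for i in range(k):
        # Similarity to the most similar peer cluster.
        total += max(
            (scatter[i] + scatter[j]) / euclidean(centroids[i], centroids[j])
            for j in range(k) if j != i
        )
    return total / k
```

The sketch assumes that no two clusters share the same centroid, since the ratio in Equation (7) is undefined in that case.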

The Calinski–Harabasz Index (CH) [7] describes the average dispersion degree of a clustering [21] relying on the average inner and outer sum of squares [21]. The CH index is formally defined in Equation (8):

$$CH = \left( \frac{\sum_{k=1}^{K} n_k \cdot d^2(\mu_k, \mu)}{K - 1} \right) \Big/ \left( \frac{\sum_{k=1}^{K} \sum_{x_i \in C_k} d^2(x_i, \mu_k)}{n - K} \right) \qquad (8)$$

where $n$ is the total number of objects $(x_1, \dots, x_n)$ in the input dataset, $K$ represents the total number of clusters $(C_1, \dots, C_K)$, $n_k$ is the number of objects in cluster $C_k$, $\mu$ denotes the centroid of the entire dataset, $\mu_k$ represents cluster $C_k$’s centroid, and $d(x_i, x_j)$ stands for the distance between objects $x_i$ and $x_j$.

The higher the CH index value, the better the clustering, since the distance between clusters increases with the value of the CH index [21].
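A minimal sketch of Equation (8) follows, again assuming objects given as numeric tuples, the Euclidean distance, and coordinate-wise means as centroids; the function names are hypothetical.

```python
from typing import Sequence, Tuple

Vector = Tuple[float, ...]


def centroid(points: Sequence[Vector]) -> Vector:
    """Coordinate-wise mean of a collection of objects."""
    dim = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(dim))


def squared_distance(p: Vector, q: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def calinski_harabasz(clusters: Sequence[Sequence[Vector]]) -> float:
    """Calinski-Harabasz Index as in Equation (8); higher is better."""
    k = len(clusters)
    n = sum(len(c) for c in clusters)
    if k < 2 or n <= k:
        raise ValueError("need at least two clusters and more objects than clusters")
    overall = centroid([p for c in clusters for p in c])
    centroids = [centroid(c) for c in clusters]
    # Between-cluster dispersion, weighted by cluster size.
    between = sum(
        len(c) * squared_distance(mu, overall)
        for c, mu in zip(clusters, centroids)
    )
    # Within-cluster dispersion around each cluster's own centroid.
    within = sum(
        squared_distance(p, mu)
        for c, mu in zip(clusters, centroids) for p in c
    )
    return (between / (k - 1)) / (within / (n - k))
```

The index is undefined when every object coincides with its cluster centroid, since the within-cluster dispersion is then zero.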

The Silhouette Coefficient [8] is a popular quality function based on the pairwise difference of inter- and intra-cluster distances [22]. The Silhouette Coefficient is computed for each of the dataset’s objects and indicates the degree of membership of an object to its cluster [21]. Its formal definition is displayed in Equation (9):

$$SC(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i), b(x_i)\}} \qquad (9)$$

where $a(x_i)$ represents the average distance between an object $x_i$ and the other objects from its cluster and $b(x_i)$ represents the average distance between $x_i$ and the objects in the next nearest cluster.

The Silhouette Coefficient ranges from −1 to 1. Objects with high cluster membership are characterized by values close to 1. We average the Silhouette Coefficient values to evaluate our results’ accuracy.
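Since Equation (9) only requires pairwise distances, it can be sketched with a pluggable dissimilarity function, so that the same code could be used with either the angle-based distance or the Fréchet distance. The names `silhouette_values` and `dist` are hypothetical, and assigning a score of 0 to objects in single-object clusters is a common convention that we assume here.

```python
from typing import Callable, Dict, Hashable, List, Sequence, TypeVar

T = TypeVar("T")


def silhouette_values(points: Sequence[T], labels: Sequence[Hashable],
                      dist: Callable[[T, T], float]) -> List[float]:
    """Silhouette Coefficient of every object, following Equation (9)."""
    clusters: Dict[Hashable, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other objects of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the objects of the nearest other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def mean_silhouette(points: Sequence[T], labels: Sequence[Hashable],
                    dist: Callable[[T, T], float]) -> float:
    values = silhouette_values(points, labels, dist)
    return sum(values) / len(values)
```

For one-dimensional objects, `dist` can be as simple as `lambda p, q: abs(p - q)`.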

5.4. Results

We apply the quality functions described in the previous section on our clustering results using both the proposed trajectory transformation method and the classic approach. When computing the DB and CH Indices and the Silhouette Coefficient, (i) for the one-dimensional trajectory-oriented evaluation method, we use the transformed trajectories and the dissimilarity measure defined in Equation (6), and (ii) for the typical evaluation, we use the original Cartesian trajectories with the Fréchet distance as the measure of dissimilarity between two trajectories. We display the resulting values in Table 2 and Table 3, respectively. We also provide graphical representations in both Cartesian (Figure 3) and geographical coordinates (Figure 4).

Since the DB and CH Indices are not bounded to an interval, we compare their values with the ones associated with two edge case scenarios:

the unbalanced clustering: one or two trajectories each form their own cluster, while the rest of the trajectories are assigned to a single, large cluster;
the random clustering: trajectories are assigned randomly to two or three clusters.
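The two edge case labelings can be generated as in the following sketch. Which trajectories become single-object clusters and how the random assignment is seeded are not specified in the text, so choosing the first trajectories as singletons and exposing a `seed` parameter are assumptions, as are the function names.

```python
import random
from typing import List, Optional


def unbalanced_labels(n: int, n_singletons: int) -> List[int]:
    """First n_singletons trajectories get their own cluster; the rest share label 0."""
    if not 0 < n_singletons < n:
        raise ValueError("n_singletons must be between 1 and n - 1")
    return list(range(1, n_singletons + 1)) + [0] * (n - n_singletons)


def random_labels(n: int, n_clusters: int, seed: Optional[int] = None) -> List[int]:
    """Assign each of n trajectories to a uniformly random cluster."""
    rng = random.Random(seed)
    return [rng.randrange(n_clusters) for _ in range(n)]
```

Note that a purely random assignment may leave a cluster empty when the number of trajectories is small, in which case it can simply be redrawn.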

Additionally, we compute the Silhouette Coefficient for the edge case scenarios.

The most relevant quality function is the Silhouette Coefficient: in Table 2, we observe that the Silhouette Coefficient values for the one-dimensional trajectories-oriented evaluation method are significantly higher than the ones obtained for the edge case scenarios: 0.59 versus 0.166 and 0.051 for 2 clusters and 0.798 versus 0.180 and −0.087 for 3 clusters.

However, for the typical evaluation, the Silhouette Coefficients of the clustering results and the ones of the unbalanced edge case scenario are similar: 0.208 versus 0.216 for 2 clusters and 0.251 versus 0.181 for 3 clusters (Table 3).

This behavior is abnormal, since meaningful clusters should score higher than the unbalanced clusters.

Moreover, the Silhouette Coefficient values obtained by applying the one-dimensional trajectories-oriented evaluation method are noticeably higher than the ones computed using the typical evaluation: 0.59 versus 0.208 for 2 clusters and 0.798 versus 0.251 for 3 clusters.

These results indicate that representing trajectories as one-dimensional arrays yields more consistent evaluation results. The Silhouette Coefficient values also show that the optimal number of clusters is 3.

The DB Index is lower for the computed clusters than for the random ones in both evaluation approaches: 0.561 versus 10.08 (Table 2) and 0.47 versus 3.063 (Table 3) for two clusters, and 5.591 versus 18.79 (Table 2) and 0.55 versus 4.209 (Table 3) for three clusters. This is expected because the DB Index grows as cluster separation shrinks, and separation is small when objects are randomly assigned to clusters. Although the DB Index is always higher for the unbalanced dataset, its values are strongly influenced by which trajectories were assigned to their own clusters.

For both evaluation approaches, the CH Index is substantially higher for the computed clusters (21.68 and 28.311 for two clusters, and 28.211 and 22.560 for three clusters) than for the edge case clusters (which score below 9).

6. Conclusions

In this paper, we propose a Lagrangian backward trajectory clustering framework that aims to support meteorological research and focuses on easing the analysis of water vapor pathways. Some of its practical applications are: gathering essential information concerning floods and heavy storms, delimiting water vapors’ areas of influence in the context of the ongoing climate change, and finding the sources of diffuse water pollution.

The advantages of our method are:

providing a complete system, from trajectory generation and preprocessing to the visualization of the final clusters;
improving the performance of the clustering process by representing trajectories as one-dimensional arrays; for this purpose, we define a custom, easy-to-compute (thus efficient) dissimilarity measure;
improving the accuracy of the clustering process by employing a flexible clustering algorithm called DenLAC, which can handle various types of clusters: elongated, spherical, of different sizes and densities, with noise and outliers.

We validate our pipeline on a real trajectory dataset, generated using the meteorological data related to the tornado that struck several districts of the Czech Republic in June 2021. We display and discuss the results in Section 5, particularly in Section 5.4.

In the future, we plan to compare our framework with the classic approach in terms of performance (runtime in seconds). We expect our method to be faster since we use a lightweight distance function. We also plan to re-evaluate our framework’s accuracy on a dataset that provides ground-truth clusters, using external measures, which are more reliable. Another direction of future work is applying our trajectory clustering framework to air traffic monitoring given a specific source (an airport, for example).

Author Contributions

Conceptualization, I.-M.R.; Methodology, A.B.; Resources, F.R.; Software, I.-M.R.; Supervision, A.B.; Validation, A.B.; Visualization, F.R. and D.-C.P.; Writing—original draft, I.-M.R.; Writing—review & editing, F.R. and D.-C.P. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by University Politehnica of Bucharest through the PubArt program.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Python source code is publicly available online, along with the dataset used in the experiments, at the following link: https://github.com/IuliaRadulescu/WaterMgmt, accessed on 10 December 2021.

Conflicts of Interest

The authors declare no conflict of interest.

References

Hao, C.; Song, L.; Zhao, W. HYSPLIT-based demarcation of regions affected by water vapors from the South China Sea and the Bay of Bengal. Eur. J. Remote Sens. 2021, 54, 348–355.
Juhlke, T.R.; Meier, C.; van Geldern, R.; Vanselow, K.A.; Wernicke, J.; Baidulloeva, J.; Barth, J.A.; Weise, S.M. Assessing moisture sources of precipitation in the Western Pamir Mountains (Tajikistan, Central Asia) using deuterium excess. Tellus B Chem. Phys. Meteorol. 2019, 71, 1601987.
Karaca, F.; Camci, F. Distant source contributions to PM10 profile evaluated by SOM based cluster analysis of air mass trajectory sets. Atmos. Environ. 2010, 44, 892–899.
Borge, R.; Lumbreras, J.; Vardoulakis, S.; Kassomenos, P.; Rodríguez, E. Analysis of long-range transport influences on urban PM10 using two-stage atmospheric trajectory clusters. Atmos. Environ. 2007, 41, 4434–4450.
Rădulescu, I.M.; Boicea, A.; Truică, C.O.; Apostol, E.S.; Mocanu, M.; Rădulescu, F. DenLAC: Density Levels Aggregation Clustering - A Flexible Clustering Method. In International Conference on Computational Science; Springer: Berlin/Heidelberg, Germany, 2021; pp. 316–329.
Davies, D.L.; Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 224–227.
Caliński, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat.-Theory Methods 1974, 3, 1–27.
Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
Rapolaki, R.; Blamey, R.; Hermes, J.; Reason, C. Moisture sources associated with heavy rainfall over the Limpopo River Basin, southern Africa. Clim. Dyn. 2020, 55, 1473–1487.
Stein, A.; Draxler, R.R.; Rolph, G.D.; Stunder, B.J.; Cohen, M.; Ngan, F. NOAA’s HYSPLIT atmospheric transport and dispersion modeling system. Bull. Am. Meteorol. Soc. 2015, 96, 2059–2077.
Shi, Y.; Jiang, Z.; Liu, Z.; Li, L. A Lagrangian analysis of water vapor sources and pathways for precipitation in East China in different stages of the East Asian summer monsoon. J. Clim. 2020, 33, 977–992.
Gustafsson, M.; Rayner, D.; Chen, D. Extreme rainfall events in southern Sweden: Where does the moisture come from? Tellus A Dyn. Meteorol. Oceanogr. 2010, 62, 605–616.
Bowman, K.P.; Lin, J.C.; Stohl, A.; Draxler, R.; Konopka, P.; Andrews, A.; Brunner, D. Input data requirements for Lagrangian trajectory models. Bull. Am. Meteorol. Soc. 2013, 94, 1051–1058.
Sprenger, M.; Wernli, H. The LAGRANTO Lagrangian analysis tool–version 2.0. Geosci. Model Dev. 2015, 8, 2569–2586.
Yuan, G.; Sun, P.; Zhao, J.; Li, D.; Wang, C. A review of moving object trajectory clustering algorithms. Artif. Intell. Rev. 2017, 47, 123–144.
Bian, J.; Tian, D.; Tang, Y.; Tao, D. A survey on trajectory clustering analysis. arXiv 2018, arXiv:1802.06971.
Alt, H.; Godau, M. Measuring the resemblance of polygonal curves. In Proceedings of the Eighth Annual Symposium on Computational Geometry, Berlin, Germany, 10–12 June 1992; pp. 102–109.
Chen, Y.C. A tutorial on kernel density estimation and recent advances. Biostat. Epidemiol. 2017, 1, 161–187.
Chaudhuri, K.; Dasgupta, S. Rates of Convergence for the Cluster Tree. Advances in Neural Information Processing Systems. 2010; pp. 343–351. Available online: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.188.2410&rep=rep1&type=pdf (accessed on 10 December 2021).
Hartigan, J.A. Consistency of single linkage for high-density clusters. J. Am. Stat. Assoc. 1981, 76, 388–394.
Cui, L.; Song, X.; Zhong, G. Comparative Analysis of Three Methods for HYSPLIT Atmospheric Trajectories Clustering. Atmosphere 2021, 12, 698.
Liu, Y.; Li, Z.; Xiong, H.; Gao, X.; Wu, J. Understanding of internal clustering validation measures. In Proceedings of the 2010 IEEE International Conference on Data Mining, Sydney, Australia, 13 December 2010; pp. 911–916.

Figure 1. Algorithm pipeline.

Figure 2. Position vectors directions and the distances between them. (a) Position vectors directions. (b) Real angle difference.

Figure 3. DenLAC, Cartesian projection. (a) 2 clusters. (b) 3 clusters.

Figure 4. DenLAC, geographic coordinates clusters. (a) 2 clusters. (b) 3 clusters.

Table 1. Abbreviations and acronyms used throughout this paper. All abbreviations and acronyms are also defined upon their first appearance.

Abbreviation	Explanation
CH	Calinski–Harabasz Index
DBI	Davies–Bouldin Index
DenLAC	Density Levels Aggregation Clustering
GDAS	Global Data Assimilation System
GFS	Global Forecast System
HYSPLIT model	Hybrid Single-Particle Lagrangian Integrated Trajectory model
KDE	Kernel Density Estimation
NCEP	National Centers for Environmental Prediction
NOAA	National Oceanic and Atmospheric Administration

Table 2. Evaluation results employing our representation method (DB: Davies–Bouldin Index; CH: Calinski–Harabasz Index; S: Silhouette score).

Clustering	DB	CH	S
2 clusters	0.561	21.68	0.590
unbalanced (2 clusters)	2.604	0.369	0.166
random (2 clusters)	10.08	2.455	0.051
3 clusters	5.591	8.327	0.798
unbalanced (3 clusters)	3.835	0.330	0.180
random (3 clusters)	18.79	1.153	−0.087

Table 3. Evaluation results employing the classic method (DB: Davies–Bouldin Index; CH: Calinski–Harabasz Index; S: Silhouette score).

Clustering	DB	CH	S
2 clusters	0.470	28.211	0.208
unbalanced (2 clusters)	0.277	3.459	0.216
random (2 clusters)	3.063	8.186	0.008
3 clusters	0.550	22.560	0.251
unbalanced (3 clusters)	0.375	3.352	0.181
random (3 clusters)	4.209	6.617	−0.076

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

