1. Introduction
Water distribution systems are a crucial aspect of urban infrastructure, delivering water to consumers through a complex network of water treatment plants, pipes, pumps, tanks, and valves. Numerical models of these systems are used in a wide range of applications, including network design [
1,
2], real-time control [
3,
4], resilience analysis [
5,
6,
7], and risk assessment [
8,
9]. The development of accurate water distribution system models remains a challenge due to sparse and inaccurate utility asset data [
10,
11] and the lack of standard methods to build models from diverse data. Furthermore, while many large water utilities have invested in up-to-date and highly detailed water distribution system models, many small- to medium-sized water utilities do not build and maintain accurate models. Access to infrastructure models is further restricted due to data sharing limitations.
Advances in synthetic infrastructure model development provides a path toward developing accurate, site-specific infrastructure models from limited data. Applications include transmission and distribution power grids [
12,
13], drinking water distribution systems [
10,
14,
15], wastewater distribution systems [
16], and interdependent infrastructure [
11]. Roadways are commonly used as a proxy network structure for synthetic infrastructure models [
10,
17,
18]. When using road networks as a proxy for water systems, several key considerations need to be taken into account. For example, water utility right-of-ways do not always follow roads, road intersections define connectivity that may not be present in water system, and water systems can contain multiple pipes that run along a single road. These important differences between road and water systems could have significant impact on the hydraulics of a synthetic model.
Given the challenges of building water distribution system models from imperfect data, the use of synthetic network generation to create pipe networks for water distribution models is an active area of research. DynaViBe used structural urban data and graph theory to generate synthetic case studies [
14]. The software was extended to use additional geographic information system (GIS) data such as population and housing density [
19]. Möderl et al. [
20] built an ensemble of conceptual synthetic networks intended to represent a wide range of characteristics that could be found in real-life water distribution systems. Using water and sewer pipes from an alpine city, Mair et al. [
21] used network correlation/similarity analysis to determine that 50% percent of the road network length correlated with 80–85% of the water and sewer pipes; however, a correlation between street type and pipe diameter was not found. Using the same alpine city, Mair et al. [
22] illustrated that synthetic water distribution system models achieve high accuracy in pressure using only 30% of large-diameter pipes. Paez and Filion [
23] created water distribution system models of five real cities (one in Europe, two in North America, and one in Asia). The real and synthetic networks were compared using a wide range of global topographic and reliability related metrics, including link density, average node degree, meshedness coefficient, the density of articulation points and bridges, pressure, and a modified resilience index. Ahmad et al. [
10] created synthetic models of Tempe and the greater Phoenix area. Methods were validated using a benchmark model of North Marin, California (EPANET Net3), to compare the distribution of pipe diameters. Momeni et al. [
24] used a benchmark model (Anytown) to generate synthetic networks and apply a multiobjective genetic algorithm to optimize network attributes and correct unusual network structures that can arise with synthetic methods. Approaches to synthetic network generation using linear programming were taken by Rehm et al. [
25] in Cologne, Germany, and Zaucher et al. [
18] in Piedmont, California.
The literature for synthetic network generation has focused on the design and planning of water distribution; however, the use of synthetic networks to build proxy models for decision making during disruption scenarios (i.e., resilience analysis) is an underdeveloped area of research. While Nikolopoulos et al. [
26] quantified uncertainty in resilience by applying a stress testing framework to a synthetic network, the focus was on the demonstration of the framework rather than an evaluation of the suitability of synthetic networks as proxies for real systems. There are two research gaps that this paper seeks to address. The first gap is the underutilization of sparse datasets in synthetic network generation. In situations where only incomplete, low-quality pipe datasets are available, this would provide a way to build plausible water distribution system models despite data limitations. Leveraging an incomplete pipe dataset in conjunction with a synthetic network generation method could produce more realistic models. One notable exception is the work of Mair et al. [
22], who explored the combination of a road network with subsets of real pipe data for synthetic network generation. However, the methodology was agnostic to whether or not network edges were associated with real pipe data. The second gap is the need for the additional verification of synthetic models for decision making during disruptive scenarios (e.g., pipe breaks, water treatment plant outage). Previous work typically presented high-level statistics on pressure and pipe diameters [
19,
20] and global network statistics [
23]. However, in a water distribution system, even if the pipe diameter distribution, average pressure, and general network structure are similar, certain heterogeneous aspects of the real system (e.g., pressure distribution and response to disruption) may not necessarily be replicated. An example of synthetic model evaluation focused on spatial distributions of pressure can be found in Mair et al.’s study [
22]; they used performance indices to compare nodal pressures of synthetic models to those of a real model. Evaluating the reproduction of model pressure is particularly important when using a synthetic model for resilience analysis, where it is necessary to accurately capture the spatial relationship between damage and impact. Furthermore, it is valuable to understand how property-based (e.g., diameters, topology) measures of model similarity correlate with measures of model similarity based on hydraulic simulation and response to disruptive scenarios [
27].
This work focuses on the use of synthetic network generation methods to build water distribution system models in Puerto Rico, where up-to-date models are in short supply and water systems are subject to a wide range of disruptive events. The city of Mayagüez was chosen as the case study, where a model was built from publicly available infrastructure data for Puerto Rico as part of this work [
28]. This infrastructure dataset has also been used in other work for both water and power system analyses [
29,
30,
31]. The Puerto Rico database includes infrastructure data for drinking water, wastewater, and power across the island. While these data are incomplete and include inaccuracies, the data quality in Mayagüez is high enough to build a water distribution system model for the purpose of validation (referred to as the “validation” model in this paper). In many of the other cities and rural areas of Puerto Rico, available water infrastructure data are sparse. Pipe data can be fragmented, and entire areas can be missing data entirely. Even when a pipe location is known, the diameter is not always recorded in the dataset. Given the incomplete nature of the data in these regions, it is difficult to create models of water distribution systems exclusively using data. While fully synthetic models could be used in place of models built from data, they may fail to capture important features of the system. Taking a combined approach to model development using both synthetic methods and incomplete pipe data to generate the pipe network could result in more accurate models.
The goal of this work was to use incomplete data to generate models that more closely represent the real system during normal operations and disruptive events. We accomplish this by pursing the following objectives: (1) develop a synthetic network generation technique that can incorporate incomplete pipe datasets for the development of more realistic models and (2) provide a detailed comparison between synthetic models generated with varied amounts of pipe data and the validation model focused on topological metrics, pipe diameters, spatial differences in pressure, and pressure response to a disruptive scenario. This work is organized as follows:
Section 2 includes descriptions of the data and methods for the development of the validation model, the generation of synthetic models, and the suite of metrics and analyses used to compare the synthetic and validation models.
Section 3 presents the results of applying the metrics and analyses to the generated synthetic models and the validation model.
Section 4 includes a discussion of the results, the limitations of the experiment, and ideas for future work.
Section 5 is a brief summary of this paper and key takeaways.
2. Methods
The following section includes details on the development of a validation model for the Mayagüez water service area, the methodology used to generate data-informed synthetic models, and the suite of metrics that were used to compare the validation model to the synthetic models. The analysis was carried out using several open source Python packages, including the Water Network Tool for Resilience (WNTR) [
5] to build and run hydraulic simulations, GeoPandas [
32] to analyze geospatial data, and OSMNX [
33] to obtain the road network from OpenStreetMaps [
34].
Hydraulic analyses were ran using the EPANET simulator through the Python interface provided by WNTR. Steady-state simulations were used to compare pressure across the validation and synthetic models as described in
Section 2.3.3. Extended-period steady-state simulations were used to simulate pressure for the disruption scenario described in
Section 3.4. In both situations, pressure-dependent demand was used to ensure that demand at each junction is a function of pressure. Minimum pressure was set to 0 m, and required pressure was set to 20 m. Pressure in meters assumes a fluid density of 1000 kg/m
3 to convert pressure in Pa to meters of water pressure.
2.1. Study Area
The water distribution system in Mayagüez, Puerto Rico, was the central object of the analysis. The focus on Puerto Rico was motivated by the lack of water infrastructure models across the island, the complexity of the water systems, and the need for resilience analyses that can consider vulnerabilities like hurricanes, power outages, and component failure from aging infrastructure. In addition to advancing the available techniques for synthetic water infrastructure, focusing on a case study in Puerto Rico supports the modeling needs of the island.
Validation Model Development
The government of Puerto Rico hosts a publicly available database that includes infrastructure data that can be used to build models. For much of the island, the completeness and quality of these datasets are insufficient for the development of detailed water distribution system models. However, the data for Mayagüez are of high enough quality for detailed model development. It is important to keep in mind that the goal for this work was not to create a fully calibrated water utility model from the data. Rather, it was to create a model that includes site-specific heterogeneous details of the real system based on available data and that can be used to quantify how well synthetic network generation methods capture model behavior. The data and assumptions that were used to build the model are described in
Appendix A and summarized in the following paragraphs.
The Puerto Rico infrastructure database is available at
https://gis.pr.gov/ (accessed on 1 October 2022) [
28]. The data include water, road, and power infrastructure assets across the island and have been used in several publications [
29,
30,
31]. Data on water assets include the location, geometry, and attributes of water treatment plants, pipes, pumps, tanks, and valves stored in GIS-compatible files. While the data do not include operational data (e.g., valve and pump controls), the raw data can be used to build fairly complete models of select water service areas. Simple data quality control analysis was performed to identify regions that have adequate data for model development. This included the identification of missing assets (e.g., populated areas that have no pipes) and missing attributes (e.g., pipe data that had no diameter or roughness values). The quality control analysis identified the Mayagüez Water Service Area as a high-quality candidate for validation water distribution system model development. This water service area contained both highly populated areas with a dense grid pipe layout and less populated areas with branching pipelines.
The water distribution network model was built in WNTR using a combination of raw data and assumptions regarding connectivity and other model attributes. Pipe diameter, roughness, and length were obtained from the database. Pipe connectivity was determined based on the proximity between pipe end points. A junction was added at each pipe end point, and elevation was assigned using United States Geological Survey Digital Elevation Model (DEM) data [
35]. Junction demands were estimated using the total water treatment capacity and data on nearby building footprints. The elevation of each reservoir (water treatment plants) was assigned using the DEM data. Water pressure leaving the reservoir was set to 70 m. Tank height and diameter were obtained from the database. Some tank diameters were corrected using estimates from aerial photography. Tank elevations were assigned using the DEM data. Pumps were oriented such that they increase pressure as water travels away from the nearest reservoir. The flow capacity of the pumps was determined from the data, and headgain was set to 75 m. Since the model was intended to be used in steady-state simulations for this analysis, no demand patterns were added to the model. Given that the tank levels were unknown, the model was simulated for 48 h in an extended period steady-state simulation, and steady-state tank levels were assigned based on the time when the model system average pressure stabilized. While valve data existed, they were not added to the model due to uncertainty in their operations. The water distribution system in Mayagüez was known to have numerous pressure-reducing valves. Modeled pressures were generally higher than expected because those valves were not added to the model. Pertinent model attributes and the percentage of data that were available for each asset are also listed in
Appendix A. The resulting water distribution system model is shown in
Figure 1. The Mayagüez model includes 7496 pipes, 6287 junctions, 124 tanks, 89 pumps, and 6 reservoirs. The average system pressure from a steady-state simulation was 75 m.
2.2. Synthetic Network Generation
The method for generating synthetic water distribution systems was built on previous work. In particular, the use of the minimum cost flow algorithm for pipe diameters came from Ahmad et al. [
10], and the method of using a connectivity index for adding loops to a network came from Mair et al. [
17]. Similar to the approaches of both papers, the road network covering the water service area was used as a candidate network structure, a subset of which was selected to form the synthetic water network. However, the method presented here differs from these methods through the use of a minimum cost flow to determine both a minimal spanning tree as well as the pipe diameters of the edges. Additionally, a modification of the minimum cost flow optimization problem (MMCF) allowed for the incorporation of real pipe data (location and diameters) to drive network generation toward the real system. Once the minimal water network was generated using the MMCF, network redundancy was added using the connectivity index from Mair et al. [
17], and the rest of the water model was built by adding real reservoirs, tanks, and pumps from the utility dataset to the synthetic network via geospatial operations.
The following subsections further detail the methodology, including the data requirements and sources, steps involved in synthetic network generation, and a description of the synthetic networks generated to produce comparisons with the validation model.
2.2.1. Data Requirements and Sources
The minimum data requirements to produce a water distribution system model using this method were the service area boundary, road network, the location and magnitude of demand points, the location and capacity supplied from each reservoir, and elevation data within the service area. In situations where the location and magnitude of demand were unknown (i.e., when there were no utility data or validation model), demands could be approximated using the building footprint.
Data sources used in this experiment include OpenStreetMaps [
34] for road data, the Humanitarian Data Exchange [
36] for building data, USGS [
35] for elevation data, and the Puerto Rico infrastructure database [
28] for data related to the Mayagüez water distribution system (i.e., service area, demands, and reservoirs).
2.2.2. Initialization
The service area boundary was approximated by creating a polygon around the water distribution assets within the Mayagüez Water Service Area. Road data within the service area were obtained and converted to a topological network using the Python package OSMNX [
33].
Next, the water demand and supply were attached to the nodes of the road network. In this case, demand locations and magnitude were approximated from building footprints, using the same methods as those used to develop the validation model. The demand data were attached to the road network through a basic geospatial data processing technique known as “snapping”, which determines the nearest road node for each demand. If multiple demands attach to the same road node, their summed demand was assigned to the node.
Water treatment plants were added as new nodes and connected to the nearest node in the road network. To ensure adequate connectivity between reservoirs and surrounding roadways, additional roads were added to the road network in each cardinal direction (north, south, east, west) within a 1500 m radius surrounding the facility. These additional roadways were potential pathways that would only be assigned flow if needed by the minimum cost flow problem. This approach generalized previous reservoir attachment methods, which assume that reservoirs have a single connection to the network [
10,
17]. This assumption does not hold true in Mayagüez, and negatively impacts synthetic model performance. A supply value for the MMCF was assigned to each reservoir using the total capacity of the reservoir.
2.2.3. Incorporation of Pipe Data
In order to leverage the capabilities of the MMCF, there must be a source of pipe diameter assignments to road edges. There were multiple options for where the diameters can come from. For example, known pipe diameters and locations could come from a subject matter expert (e.g., utility engineer), numerical analysis, machine learning, or sparse datasets.
In this analysis, a sparse dataset of pipe diameters was sourced from the utility dataset. The known pipe diameters were transferred from the pipe network to the road segments via a geospatial data processing technique called spatial association. The objective of spatial association was to accurately assign pipe diameters from a dataset to road segments based on nearby pipes. If no pipes were associated with a road, the assigned diameter was 0, and if there was at least one pipe associated to a road the assigned diameter was equal to the diameter of the largest pipe. These values were input into the MMCF to inform how the network layout and diameters were chosen. The spatial association method expands on the infrastructure collocation work by Mair et al. [
21] and includes additional steps to account for potential road–pipe misalignment.
2.2.4. Modified Minimum Cost Flow
Minimum cost flow is an optimization problem designed to distribute a resource from sources to demands across a network. A candidate solution is one that finds flow values for each edge that balances the resource across source and demand nodes. In the standard formulation of minimum cost flow, a minimal spanning tree is a candidate solution that minimizes the sum of flows multiplied by corresponding weights.
Through solving the minimum cost flow on a road network, using sources and demands representing those of a water distribution system, a reasonable network for a water distribution system model can be obtained. Flow assignments to the roads can be converted to pipe diameters using the following formula:
where
d is diameter (m),
P is the peak demand multiplier (unitless),
f is flow (m
3/s), and
v is velocity (m/s), which was set to 1.52 m/s for all pipes following the assumptions made in [
10]. The peak demand multiplier was used to convert the average demand to peak demand to ensure the diameters were adjusted appropriately for high-flow periods. The value can be estimated empirically; however, this analysis used a value for
P that was calibrated to produce a network with am average pipe diameter similar to that of the validation network. A value of
was chosen so that the average length-weighted diameter of the synthetic network matches that of the validation network, which was approximately 0.152 m. After applying this conversion, diameters equal to zero can be dropped from the network to obtain a synthetic network with diameters assigned to each edge.
One of the central contributions of this work is a modification of the standard minimum cost flow that allows for the input of target diameters. Target diameters are diameter values for a subset of road edges representing desired diameters for the final pipe network. These diameters could reflect known diameters of the system (e.g., a system operator knows that there is a 0.2 m pipe running alongside a certain road) or could come from another source such as machine learning predictions of pipe diameters for the system. In this analysis, known pipe diameters were assigned to nearby roads using spatial association to provide target diameters. The assignments from spatial association provide probable predictions of what the pipe diameter should be on each edge.
The modification was a change to the optimization function to favor solutions with flows close to the provided target flow, which were obtained by applying the inverse of Equation (
1) to target diameters. For edges with a target flow, the objective function minimizes the difference between the target flow and the flow assigned to the edge. For edges without a target flow, the objective function minimizes flow along the edge, which was encoded by setting the target flow to zero.
Given a network with node set
and edge set
, the MMCF optimization problem is formally defined as follows:
where
is the set of edges in the network;
is the set of nodes in the network;
is the weight on edge ;
is the flow on edge ;
is the target flow on edge ;
is the net supply (positive) or demand (negative) at node ;
and are the lower and upper bounds on the flow for .
Equation (
2) defines the objective of the optimization problem, which minimized the difference between output flow and target flow if a target is provided and minimized the flow to zero if no target is provided. Equation (
3) conserves flow throughout the nodes of the network based on supply and demand values. Equation (
4) enforces lower and upper bounds for flow assignments on the edges. The lower bound was set to zero and the upper bound was set to the flow value corresponding to a 1.829 m pipe diameter to ensure that pipe diameters were not too large. Equation (
5) ensures that all assigned flows are positive.
The results produced flow values for each edge on a continuous scale that corresponded to continuous diameter values, as opposed to discrete values as is typically expected by engineering standards. In this analysis, the continuous values were not modified to match preset discrete values. Through dropping all edges that were assigned a flow value of zero, a minimal spanning tree that connects all demand nodes and reservoirs was obtained. The flow values were then converted to diameters using Equation (
1). The continuous nature of the MMCF means that diameters can be arbitrarily small, so a minimum diameter of 0.051 m was applied (after dropping flow values that were zero). The resulting network and diameter assignments were used as the foundation for the synthetic water distribution system model, which was completed by following the steps outlined in the next subsection.
2.2.5. Post-Processing
The solution from the MMCF was designed to be minimal. There were no loops and no redundancy in the connectivity of the network. Since water distribution systems typically have some level of redundancy, a technique using the connectivity index was used to increase the connectivity of the network [
22]. This process added additional paths along roadways if the existing path was much longer than the existing shortest path between its start and end node. This analysis added additional connections if the new path was at least 50% shorter.
Other major assets and data were added to the network to complete the model. Note that while model calibration was an important aspect of model development, calibration was not included in this analysis. The location and attributes of tanks and pumps were extracted from the validation models. Elevations were extracted from DEM data. The reservoir head was extracted from the validation model. Although the Puerto Rico datasets include valves, they were not used in this analysis due to a lack of certainty in their operation. Reservoirs and tanks were added by connecting them to the nearest junction in the network. Pumps were attached to the nearest pipe and oriented so that the pump faces away from the nearest reservoir. For more details, refer to
Appendix A, where similar methods were used to generate the validation model.
2.2.6. Generated Models
Three synthetic models for Mayagüez were generated using the framework established in the previous sections. Each synthetic network used a different amount of randomly selected pipe layout and diameter data provided by spatial association, which assigned a diameter to each road based on real pipe data, with a diameter of zero indicating no pipe. A network built using 0% of the pipe data was generated to represent a typical synthetic network that had no external knowledge of the real pipe layout. Two networks were built using 50% and 100% of the pipe data to represent cases with partial and full access to the real pipe data, respectively. It is important to note that while the network built with 100% utilizes all pipe data, the data had been transferred from the real water pipe network to the road network and therefore lost accuracy. These synthetic networks are referred to as “Synthetic X%”, where X is the percentage of pipe layout and diameter data that the network was built with. The validation network is referred to simply as “Validation”.
2.3. Performance Metrics
To understand how the synthetic models compare to the validation model, several performance criteria were selected. Global topological metrics were included to get a sense of the network structure and compare the results to the work performed by Paez et al. [
23]. The distribution of diameters and node pressure was included to identify differences in network attributes and hydraulics. Finally, a disruption scenario was used to track how synthetic networks mimic the behavior of the validation model under stress. The disruption scenario selected for this analysis approximates a failure scenario that has frequently occurred at one of the primary water treatment plants in Mayagüez [
37,
38]. In addition to evaluating the similarity of synthetic networks to the validation network, the goal of this diverse suite of metrics was to understand how model-related quantities changed through the incorporation of various amounts of known pipe data to the MMCF.
2.3.1. Topology
A collection of global topological metrics were used to compare the overall structure of the networks. While not an exhaustive list, the chosen metrics aimed to cover a breadth of network qualities. Metric choices were informed by the work of Paez et al. [
23]. To define these metrics, let
and
represent the nodes and edges of a network, and let the size of
be
n and the size of
be
m. Four topological metrics are given as follows [
39]:
Edge density
q is the fraction between the number of edges in a network and maximum amount of edges possible and measures the overall connectivity of the network. Larger values indicate higher connectivity. An average node degree
also captures connectivity but uses data from the nodes rather than edges. The meshedness coefficient
is the fraction between the number of independent loops in a network and the maximum number possible and captures the redundancy of the network. Larger values indicate higher redundancy. Spectral gap
is the difference between the first and second eigenvalues of the adjacency matrix for the network and measures the robustness of the network by determining whether a network has “good expansion” [
39]. Good-expansion networks are those that have a high minimum number of components that must be removed before creating a disconnected network. While spectral gap is an abstract, unitless measurement, it can be used to compare networks where a larger value indicates a more robust network. Authors Paez et al. [
23] called out spectral gap as uniquely useful for identifying differences in networks where other metrics indicate similarity. Together, these metrics summarize fundamental topological dimensions of water networks, facilitating a broad topological comparison across the synthetic and validation network structures.
2.3.2. Diameter
To evaluate the impact of including known pipe diameters in the MMCF, the diameter assignments and distributions of the synthetic networks were compared to those of the validation networks. A direct comparison of diameter assignments cannot be made between a synthetic network and the validation network since they exist on different topologies. Instead, diameter assignments of synthetic networks were evaluated against the spatial association results, which do share the same topological structure as the synthetic networks. The chosen metric was the length-weighted mean absolute diameter error (D-MAE) and is computed as the mean over the absolute value of each diameter error, which is the difference between the synthetically assigned diameter and the spatially associated diameter, and weighted by the edges’ proportion of the total network length. Because the discretization of pipe segments differs between the synthetic networks and the validation network, weighting by length was used to normalize the error rather than treating all errors with equal weight. Let
be the set of edges in the road network data for the service area, let
be the fraction of total network length for edge
e, let
be the spatially associated diameter assigned to edge
e, and let
be the synthetic diameter assigned to edge
e, and then the D-MAE is computed via
The D-MAE retained the units of the input variables and was useful for its straightforward interpretation; for example, a D-MAE of 1 indicates that, on average, a pipe in the synthetic network deviates 1 m from the spatially associated diameter. High values for the D-MAE indicate a large mismatch between the diameter assignments of the given synthetic network and those captured by spatial association, whereas low values indicate similarity between synthetic diameters and spatial association diameters.
The following set of diameter bin centroids were used to generate diameter distributions: 0.051, 0.102, 0.152, 0.203, 0.254, 0.305, 0.356, 0.406, and 0.457 m. This binning scheme was designed to separate the major diameter sizes seen in Mayagüez while also remaining compatible with the continuous nature of the MMCF outputs. Instead of reporting on total edge counts for each bin, the percentage of network length was computed to account for the varying discretization of pipes across networks.
2.3.3. Pressure
Steady-state pressures were simulated and compared to form an understanding of the hydraulic differences of the models during normal conditions. Initial tank levels were highly influential in these simulations; however, these levels were uncalibrated in the validation model for Mayagüez since it was built from basic data. To obtain reasonable initial values, a 96 h simulation starting with all tanks being 90% full was run for the validation model, and tank levels were extracted from a time point where the system pressure had stabilized. These individual tank levels were then applied to the validation and synthetic models.
The synthetic models have distinct network structures from that of the validation model, so there cannot be a direct comparison at the node level. Given this restraint, a comparison of pressure was generated by averaging nodal pressures within each census block group. While some resolution in the pressure results was lost when aggregating to census block groups, general spatial patterns of pressure were still able to be captured. It is important to note that pressure-dependent demand simulations using EPANET can still result in junctions with negative pressure. Since negative pressure has no physical meaning beyond the absence of pressure, the negative pressures were converted to zero before averaging the nodal pressures within each census block group.
To quantitatively compare pressure between models, the mean absolute error (MAE) was used to summarize differences in the census block group pressure. MAE is defined as follows, where
i indexes over the census block groups,
is the
census block group pressure in one of the synthetic networks, and
is the
census block group pressure for the validation network:
2.4. Response to Disruption
A disruption scenario was developed based on recent events where water service from the Miradero water treatment plant was temporarily disabled. The Miradero water treatment plant in Mayagüez, Puerto Rico, has experienced multiple breakdowns in its raw water supply line, causing significant water shortages in the area. In April 2023, raw water intake from the Río Grande de Añasco to the Miradero plant required emergency replacement [
37], and similarly, in late December 2023, the pumping system for the same plant experienced a breakdown over multiple days, preventing the plant from distributing drinking water to many residents and critical locations such as hospitals [
38]. After repairs began on the pumping system, additional leaks on a 0.1524 m pipe were identified, which extended the period of repair. The event was declared a state of emergency by the mayor to supply water to residents via trucks.
This disruptive event was modeled by setting the base head of the Miradero reservoir to zero meters in the model and simulating for 48 h using the same initial conditions as the steady-state simulations in
Section 2.3.3. While the models were not fully calibrated for long-term simulation (lacking demand patterns, valves, and controls), these details were consistent across the validation and synthetic models. Nodal pressure drops between the baseline and failure scenarios are plotted at the last time step (48th hour) to reveal spatial patterns that do or do not emerge across the synthetic and validation models.
3. Results
Throughout the following sections, the synthetic networks are compared to the validation network across multiple performance metrics. These results seek to answer two questions: (1) how are the synthetic models different or similar to the validation model and (2) how do the performance criteria change as different amounts of pipe data are provided to the MMCF algorithm. In
Section 3.1, four global topological metrics are evaluated across the networks. In
Section 3.2, diameters assigned to pipes are compared. In
Section 3.3, steady-state nodal pressures are compared. In
Section 3.4, the pressure response to the same disruptive event is compared across the networks.
3.1. Topology Comparison
The four topological metrics described in
Section 2.3.1 were evaluated for the three synthetic networks and the validation network. The results are presented in
Table 1. The metrics include edge density, average node degree, meshedness coefficient, and spectral gap. While there are countless metrics to choose from, the chosen metrics cover a range of network qualities, as laid out by Yazdani et al. [
39]. In addition to the computed topological metrics, the number of nodes and edges, as well as total network length for each network, are included.
The three synthetic networks generally have one to two thousand more edges and nodes than the validation network; however, the validation network has a longer total network length than the synthetic networks. The difference in edge and node count can be attributed to differences in the discretization of the water pipe data (longer line segments) versus the road data (shorter line segments). When pipe diameter data are included in synthetic network generation in the case of the Synthetic-50% and Synthetic-100% networks, additional target flows are added to the MMCF, which results in more edges being included in the network diameter since edges with a target flow are encouraged to have positive values, instead of being minimized to zero.
By design, the MMCF will minimize flow on edges where no pipe data are present, so the outputs with no or partial data (Synthetic 0% and Synthetic 50%) tend to have a shorter total length than the networks built with all of the data (Validation and Synthetic 100%).
Edge density measures the overall connectivity of a network, and its value can be directly interpreted as the percentage of edges in the network out of the total possible number of edges given the node set. The validation network has a very low edge density, which is matched closely by the synthetic networks. There is little difference among the synthetic networks, indicating that the inclusion of pipe data does not affect the overall connectivity of the networks. Average node degree gives an alternate view on the overall connectivity of the networks. All four networks have an average node degree in the range 2.3 to 2.4, and as the percentage of pipe data increases, there is a slight increase in average node degree. The meshedness coefficient measures network redundancy and is similar to edge density, but instead of measuring the percentage of total possible edges, it measures the percentage of total possible loops. The synthetic networks have similar meshedness coefficient percentages to the validation network; Synthetic 0% has a slightly lower value, Synthetic 50% closely matches the validation network, and Synthetic 100% has a slightly higher value. The overall similarity to the validation network indicated that the choice of 50% for the connectivity index threshold when adding loops to the networks was appropriate. Spectral gap is an abstract metric extracted from the eigendecomposition of the adjacency matrix and is known as a measurement of network robustness. The spectral gaps of the synthetic networks were all significantly higher than that of the validation network, indicating that the synthetic networks were more robust. One interpretation of spectral gap is that low values indicate the presence of edges or nodes that would disrupt network flow if removed, while higher values indicate that many nodes and edges could be removed before network flow is majorly effected [
39].
3.2. Diameter Comparison
All four models share a similar overall network layout, indicating the effectiveness of using the road network for synthetic pipe network generation (
Figure 2). Despite these large-scale similarities, there are details that indicate significant nuances. When focusing on large-diameter pipes (those plotted in red), there are two areas in particular that vary across the synthetic networks, which are captured in
Figure 2 with black rectangles. The large-diameter pipes in the western region and the northeast region are laid out very differently between the validation network and the 0% network. However, the 50% and 100% networks appear to replicate the structure of the large-diameter pipes found in the validation network much better.
The distribution of diameters in the synthetic models was distinct from that of the validation model; however, the addition of more pipe data influenced the synthetic pipe diameter distribution toward that of the validation model (
Figure 3). In the synthetic network, the majority of network length is covered by small pipes (between 0.0127 and 0.0635 m), and the network length covered by a diameter bin decreases as the diameter size increases. This pattern mostly holds true for the validation network; however, it is the second smallest diameter bin (between 0.0635 and 0.1143 m) that contains the majority of network length. The synthetic networks built with more pipe data have distributions that more closely match that of the validation network, particularly in the first three bins. Synthetic 0% clearly favored small diameters, which is the natural conclusion of an algorithm driven exclusively by minimizing flows. On the other hand, the Synthetic 50% and Synthetic 100% networks favor the smallest diameter bin less and assign larger diameters more often, reflecting the incorporation of information about the real network diameters. There is less of an improvement in the larger-diameter bins, which reported much noisier results as data were added.
The length-weighted mean absolute error (D-MAE) is computed between the diameter assignments in each synthetic network and the diameter assignments from spatial association. When pipe layout information was provided to the MMCF, there was a notable decrease in the D-MAE of the synthetic model diameters on roads against those assigned by spatial association, which confirmed that the MMCF was incorporating information about the network correctly. The D-MAE for Synthetic 0% was 0.0983 m, the D-MAE for Synthetic 50% was 0.0711 m, and the D-MAE for Synthetic 100% was 0.0575 m.
3.3. Pressure Comparison
Nodal pressures from steady-state simulation are shown in
Figure 4. In general, pressures are very high across all models, resulting from the lack of pressure-reducing valves. The models display a large amount of heterogeneity both internally and when compared to each other. In the validation network, the city center along the west coast tends to have large pressures with pressure decreasing in all directions. However, the synthetic networks maintain high pressure where the validation network does not to the north and south of the city center.
To describe pressure more quantitatively, nodal pressures in each network are averaged within census block groups (CBGs), and the CBG pressures of the synthetic networks are plotted against the CBG pressures of the validation network in
Figure 5. The CBG pressures for each network roughly follow a linear relationship with the CBG pressures of the validation network, indicating that CBGs with low pressure in the validation generally have a low pressure in the synthetic networks and likewise for high pressures. There is still a considerable amount of deviation from the validation pressures present in the synthetic models, which can be summarized by computing the MAE between average CBG pressures for the validation network and each synthetic network. The CBG pressure MAE for Synthetic 0% is 31.6 m, the MAE for Synthetic 50% is 23.5 m, and the MAE for Synthetic 100% is 23.2 m.
3.4. Response to Disruption Comparison
The pressure response to the disruptive scenario of each synthetic network matched the spatial patterns of the pressure response to the same scenario in the validation network,
Figure 6. The average nodal pressure drops across each model are as follows: 32.5 m for validation, 64.0 m for Synthetic 0%, 21.0 m for Synthetic 50%, and 15.5 m for Synthetic 100%. The pressure drop for the Synthetic 0% model is much more severe than the other models, and the synthetic models with additional pipe data experience pressure drops smaller than the pressure drop in the validation model. The validation network shows that the area affected by the disruption is primarily in the downtown area surrounding the water treatment plant and the region of the network directly south (
Figure 6a). This spatial pattern is broadly captured by the synthetic networks, with some notable differences occurring in the northern region of the network, where the synthetic networks model a drop in pressure that is not present in the validation network. In the affected region, Synthetic 0% (
Figure 6b) has extreme pressure drops of up to 100 m, while Synthetic 50% (
Figure 6c) and Synthetic 100% (
Figure 6d) have moderate pressure drops of around 40 m, matching what is seen in the validation network more closely.