1. Introduction
Regular bridge inspections are vital for assessing bridge structures. They provide detailed records of bridge conditions. Critical bridge monitoring offers additional regional insights [
1]. However, the current system uses a ‘one bridge, one file’ approach. This means data from each bridge are used only for its own assessment and repair. This method overlooks possible similar patterns of deterioration in different bridges. It also increases the complexity and cost of managing bridges. Bridges in a regional network are interconnected, not isolated [
2,
3]. Over the years, a vast amount of data has been collected on these bridges. It is important to integrate and standardize these data into a road network bridge database. Such integration would enhance the maintenance of the bridge network. It could also help predict future structural conditions within the network [
4,
5].
In recent years, global scholars have focused on regional bridge management and utilization. Precise evaluation is key to effective bridge management. To this end, governments worldwide have developed various codes for bridge assessment. Italy, for example, has guidelines for risk classification and bridge safety assessment [
6]. Similarly, other regulations include the “Specifications for Maintenance of Highway Bridges and Culverts” [
7] and China’s “Standards for Technical Condition Evaluation of Highway Bridges” [
8]. Beyond traditional methods, novel machine-learning algorithms are now being used. These algorithms detect defects and compute health indices, offering faster and more accurate evaluations [
9,
10,
11,
12,
13]. A well-structured Bridge Management System (BMS) is essential for storing and using these assessment outcomes effectively.
Bridges are managed at two levels: bridge level and network level. In this study, they are referred to as “single bridge level” and “network level.” Reliance on a single management model is inefficient, especially for small and medium span bridges. These models often treat bridges as isolated entities and use disconnected inspection methods [
14]. Regional bridges share attributes, making network-level evaluation and management essential.
Globally, Bridge Management Systems (BMS) are the main methods for managing multiple bridges. The National Bridge Inventory (NBI) database [
15] created by the Federal Highway Administration (FHWA) in 1968, was the precursor to modern BMS. Originally developed for data management, the NBI has evolved to include evaluation, prediction, and decision making due to increased maintenance needs and limited funding [
16]. The UK, Japan, and Denmark have similar systems like NATS [
17], J-BMS [
18], and DANBRO [
19]. China’s development in this area has been slower, mainly using the China Highway and Bridge Management System (CBMS) and Shanghai’s bridge management system.
The Bridge Management System (BMS) has greatly improved bridge management. Yet, it has imperfections and practical applications reveal deficiencies. A key BMS function is recognizing bridge deterioration patterns. This transforms raw data into domain-specific knowledge, crucial for network-level evaluation [
20]. As described in reference [
21], this involves linking structural condition to independent variables.
There are two main approaches to this link: historical health data and probabilistic reliability models (PRM) [
22]. The first predicts future conditions using past deterioration data. The second forecasts deterioration by analyzing time-varying factors like corrosion. PRMs have been successful in both stationary and non-stationary process analysis [
23,
24,
25,
26]. However, PRM is better suited to univariate cases. Multivariate scenarios pose challenges due to complexity and computational demands. In contrast, historical health data approaches handle multivariate situations well [
27,
28]. They use extensive historical data to track deterioration over a bridge’s life. The accuracy of these models depends on the amount of available data. Health monitoring systems are costly; therefore, inspection reports are often the primary data source. However, the quality of these reports can vary, making data cleaning and integration crucial for future analyses.
To this date, few studies have comprehensively addressed the form, acquisition methods, integration rules, and information data structure regarding the multi-source information pertaining to regional bridges. The demands of network-level evaluation underscore the need for significant improvements in many aspects.
This paper details integration rules and a data structure for a regional bridge information database, designed to maximize the use of extensive multi-source data. It is divided into three main sections:
Section 2 describes the framework for information integration, covering data sources, key features, expression formats, and rules.
Section 3 discusses methods for extracting and cleaning data from inspection reports, the primary data source for regional bridges. Finally,
Section 4 provides a practical example of this integration, using data from 812 highway bridges in Shandong Province.
2. Information Integration for Regional Bridges
2.1. Inherent Characteristics of Information Integration
The sources and integration of bridge group information for network-level assessments exhibit inherent characteristics:
Hierarchical decomposition: Both bridges and road networks are intricate systems composed of various substructures. The database should implement a multi-tiered decomposition and traceability of the road network, bridges, and components, ensuring the feasibility of retrieval between different levels, thus providing the data logic foundation for network-level assessments.
Interactions: The deterioration of individual bridge components mutually influences one another. For instance, cracks in the bridge deck can expedite the deterioration of primary beams, while damaged bearings can lead to deck impairments. Consequently, the database should encompass the interactions among these components to unveil the inherent connections between structural performance and condition changes.
Universality and scalability: The database should be designed to meet the current analytical requirements while ensuring its capacity for future expansion and updates, accommodating the inclusion of new data or integration of novel analysis methodologies.
Representation of time-varying data: Bridge data are commonly categorized into static data, such as span length and materials, and dynamic time-varying data, such as bridge age and traffic volume. Dynamic data are often associated with changes in structural characteristics. Therefore, the database should support dynamic representation to elucidate the spatiotemporal relationships within the data.
2.2. Data Source
The inspection data serve as the paramount repository of bridge-related information within a region. Thus, the inspection report stands as the most immediate and relatively comprehensive historical record of these data, typically archived in the format of a technical condition assessment sheet. A typical assessment sheet comprises three fundamental components:
The fundamental parameters of bridges encompass key fields, including the affiliated road section, bridge pier location, bridge length, primary span configuration, span length, date of construction, and inspection date, among others.
Component-level assessment data, including superstructure, substructure, and deck system, along with corresponding ratings.
Overall bridge rating and maintenance records, including the overall assessment of the bridge’s condition and its maintenance records.
The described data collectively characterize crucial information about the bridge, such as its structural features, service life, spatial distribution, environmental influences, and condition assessments. These insights are of paramount significance in understanding the deterioration process of the inspected structure.
Simultaneously, the transportation network serves the functions of passage and cargo transportation. From the perspective of structural loading during the service life of bridges, traffic loads constitute the most prominent external influence. Given that the inspection reports do not record traffic flow information for the specific road segment where the inspected bridge is situated, it becomes imperative to incorporate additional data sources. To address this, the collection of annual average daily traffic volume data from various traffic monitoring stations along the highway corridor is recommended. These data can be selected based on the equivalent passenger car unit, which effectively characterizes the average traffic flow impact on the inspected bridge. Alternatively, dynamic weighing systems or video monitoring can be employed to obtain information regarding the traffic flow of the network. Notably, within vehicle classification statistics, the quantity and proportion of heavy-duty vehicles have a more pronounced impact on the structure and thus, warrant special attention.
Furthermore, the design and construction blueprints of the bridge can supplement and validate the fundamental information within the database, including structural design parameters, lane quantities, and other pertinent details.
2.3. Multi-Source Data Logic Expression
In the case of the aforementioned integrated multi-source data, it is essential to store and represent this information in an “attribute” format. An attribute comprises an attribute name, referring to a specific characteristic, and an attribute value, which represents the data associated with that characteristic. Different attributes may have various data formats, such as numerical, ordinal, or nominal. Among these, numerical attributes offer the most intuitive and quantifiable means for comparison. Conversely, other formats involve qualitative expressions. Ordinal attributes, although meaningful for comparison, may pose challenges in precisely quantifying differences between successive values. Nominal attributes, linked to specific names or categories, are solely used for categorical representation. In the context of a road network database, typical attributes and their format definitions are as follows:
Region name, route code, bridge code, component code, and bridge pier number are all nominal attributes, signifying the spatial distribution of bridge entities. The divisions of the region should align with the specific road network characteristics, taking into consideration geographical locations and temperature–humidity patterns.
Year of construction, inspection date, and bridge age are all numerical attributes, representing temporal information about the bridge. Bridge age can be derived by subtracting the year of construction from the inspection date.
Bridge type, bridge length, maximum span, cross-sectional dimensions, design code, and design office: Bridge type, design code, and design office are nominal attributes, while the others are numerical attributes, indicating the structural characteristics of the bridge entity.
Annual Average Daily Traffic (ADT) and Annual Average Daily Truck Traffic (ADTT) are both numerical attributes, denoting the traffic flow the bridge accommodates. ADT is calculated as the total equivalent passenger car units, while ADTT is the sum of heavy-duty trucks and large passenger vehicles. Because ADTT primarily reflects the heavy loads that significantly contribute to the deterioration of bridges, it is chosen to represent the load information of bridges in the multi-source information database.
Highway classification, design load, roadbed width, and number of lanes: Highway classification and design load are ordinal attributes, while the rest are numerical attributes, illustrating the bridge’s traffic capacity.
Overall score, overall rating, superstructure rating, substructure rating, deck rating, and ratings for various component technical conditions, as well as the deflection history of the bridge: The scores and the deflection history are numerical attributes, while the ratings are ordinal, signifying the structural state of the bridge.
Among these attributes, the majority are static in nature. However, inspection date, bridge age, ADT, ADTT, structural ratings, and scores are dynamic data. These dynamic data elements are organized as key-value pairs, associated with a temporal dimension. Each key represents a corresponding time coordinate, and the value corresponds to the attribute’s value at that coordinate. It is essential to store them in accordance with their respective formats. For instance, for a bridge with a “year of construction = 2015”, the data could include “bridge age = {2016:1, 2017:2}” and “overall score = {2016:99, 2017:98}”.
2.4. Data Integration
The specific organization and data structure for information at different hierarchical levels within the region exhibit distinct characteristics. Drawing upon the principles of object-oriented programming, we introduce the concepts of “classes” and “instances.”
A class abstracts specific entities, expressing a particular concept. Within the regional bridge network, there are three levels of classes: component class, bridge class, and route class. These levels exhibit a hierarchical relationship, with, for example, the route class encompassing bridge classes and bridge classes containing numerous component classes. Each class can be distinguished by its name, such as the “primary beam” class and “bearing” class, each storing distinct information. The attributes defined earlier are systematically integrated into these classes based on their applicable objects and scope, creating templates for various components, bridges, and routes.
Correspondingly, instances serve as the carriers of entities within the database. They are created based on the templates specified by the classes, representing real-world objects, such as component instances, bridge instances, and route instances, and are populated with attribute data according to their characteristics and properties. For example, Bridge A and Bridge B are both created based on the “beam bridge” category, but they are independent bridge instances, with the former having a bridge length attribute value of 15 m and the latter having a length of 30 m.
Figure 1 summarizes the integration rules for the aforementioned dataset.
However, the integrated results cannot be directly employed for subsequent analysis and processing due to disparities in the original data quality from various sources. These disparities manifest as follows: (1) Physical records, with limited preservation periods, are susceptible to issues like loss and damage, affecting data continuity and accuracy. (2) Electronic records, influenced by varying management levels and execution practices, result in challenges in unifying formats and standards. (3) Errors during data entry are inevitable, leading to issues like missing or inconsistent information. Therefore, it is imperative to clean the integrated data to eliminate noise and facilitate subsequent pattern recognition.
3. Inspection Data Extraction and Cleaning Methods
3.1. Storage Format of Existing Inspection Reports
Over the past two decades, bridge inspection in China has been governed by two key standards: “Specifications for Maintenance of Highway Bridges and Culverts” (JTG H11-2004) [
7] and “Standards for Technical Condition Evaluation of Highway Bridges” (JTGT H21-2011) [
8]. Prior to the introduction of JTGT H21-2011 in 2011, JTG H11-2004 guided bridge inspection practices, with JTGT H21-2011 assuming this role after 2011.
JTG H11-2004 and JTGT H21-2011 employ differing approaches to bridge assessment. Both standards mandate defect inspections of bridge components. However, JTGT H21-2011 goes a step further by assigning structural condition scores to the inspected components, a requirement not present in JTG H11-2004. Disparities also emerge in calculating the overall bridge structural condition score. In the context of JTG H11-2004, this score is determined using Equation (1):
In Equation (1),
represents the overall bridge structural condition score,
signifies the extent of defect in a component, and
stands for the weight of the respective component. In contrast, within JTGT H21-2011, the overall bridge structural condition score is computed as the weighted average of component structural condition scores, as demonstrated in Equation (2):
In Equation (2),
denotes the overall bridge structural condition score,
represents the structural condition score of a component, and
signifies the weight of the corresponding component.
These differing approaches to bridge assessment engender variations in the storage format of inspection reports. Reports adhering to JTG H11-2004 and JTGT H21-2011 conventions contain inspection data stored within structural condition evaluation forms. However, as depicted in
Figure 2, the structure of these forms diverges significantly. In the figure, the slash “/” represents a place where information does not need to be filled in. This happens when a bridge does not have the listed component in the table.
The divergence observed in inspection reports guided by distinct standards necessitates tailored extraction methods. Furthermore, it underscores the imperative of conducting thorough data cleaning processes.
3.2. Data Extraction and Storage
Given that inspection information resides within structural condition evaluation forms within inspection reports, the fundamental approach to data extraction involves systematically traversing all forms within the inspection report document and extracting information from forms designated as structural condition evaluation forms.
The criteria for identifying these forms can vary based on the formatting of the documents. In this study, the number of columns and rows served as criteria for judgment, with content in specific table cells also considered, such as cells containing labels like “Disease name,” which often signifies the presence of disease-related information.
Multiple strategies were proposed for storing the extracted data. Employing a 2D Table Structure proves beneficial for efficient data storage. While an SQL database offers more advanced data processing and querying capabilities, for the sake of accessibility and ease of use, a 2D table structure was adopted for data storage.
The inspection data, originally structured as structural condition evaluation forms in inspection reports, assumes a table format. During data storage, a structural condition evaluation form for a specific bridge is compressed to create a corresponding row of data. Consequently, within the ultimate data storage table, each row signifies the inspection data for an individual bridge, while columns represent distinct inspection attributes for that bridge. Data pertaining to different years are segregated into separate tables.
The design of the final 2D table structure for data storage was crafted with ease of processing in mind, facilitating seamless reading and manipulation by Python’s DataFrame module. Nonetheless, even users without programming expertise can interact with the data storage table using common office software like Excel 2016.
3.3. Data Cleaning
There are mainly two types of errors: record errors and mismatched data.
3.3.1. Record Errors Cleaning
Record errors typically stem from clerical mistakes made by maintenance staff when composing inspection reports. One common type of record error involves typographical inconsistencies, such as the structural condition score “94.6” being mistakenly entered as “94..6,” which includes a duplication of decimal points and other similar errors.
To rectify such errors, we leverage the regular expression string matching method, which enables the identification and correction of these inaccuracies, ensuring the integrity of the actual information.
Another prevalent type of record error is known as the “Same thing, different names” error. This error arises when diverse expressions are utilized to denote identical elements. For instance, while previous inspection reports might refer to cracks as “Crack,” a particular inspection report might term some of these cracks as “Rift.” To address this issue, we establish dictionaries containing various expressions used to describe identical items across inspection reports. During the extraction process, we harmonize these expressions by standardizing them using the names stored within the dictionaries. This approach mitigates confusion and guarantees consistency across the data.
3.3.2. Mismatched Data Cleaning
The term “mismatched data” refers to instances where inspection data adhering to two distinct standards, namely JTG H11-2004 and JTGT H21-2011, do not align. In general, inspection reports following JTG H11-2004 lack component structural condition scores, which are indispensable for constructing the deterioration model of regional bridges. To address this discrepancy and ensure data uniformity, we establish conversion rules to derive structural condition scores for JTG H11-2004-guided inspection reports using available data.
When dealing with data under JTG H11-2004, the recorded information pertains to the degree of defect in different components, rather than structural condition scores. Thus, the initial task involves generating structural condition scores from the degree of defect under the JTG H11-2004 scoring system. Referring to Equation (1), the degree of defect corresponds to points deducted from the bridge’s structural condition score. As the sum of all component weights in Equation (1) equals 100 (as depicted in Equation (3)), the component weights must take the value of 100 to ensure consistency when calculating the component’s structural condition score (as illustrated in Equation (4)).
In Equation (4),
denotes the component’s structural condition score, while
stands for the degree of defect in the corresponding component. Consequently, one unit of the degree of defect translates to a deduction of 20 points from the component’s structural condition score.
Subsequently, converting the structural condition score from the JTG H11-2004 and JTGT H21-2011 scoring system becomes necessary. Since the criteria for categorizing structural conditions remain uniform across both systems, establishing conversion rules based on equivalent structural condition levels is reasonable. These levels are discrete functions of the structural condition score, as outlined in
Table 1.
Let
symbolize this function. However, the discreteness of
introduces inaccuracy in describing the actual structural condition. For example, components with scores of 94.9 and 95.1 under the JTGT H21-2011 scoring system would fall into structural condition levels of class 2 and class 1, respectively, even though their actual structural conditions exhibit minimal differences. In light of this, a new function
, was introduced, creating a continuous structural condition level.
was devised as a polyline function, interpolating linearly between the left points of each interval in
Table 1. The construction of
under both scoring systems is depicted in
Figure 3. In the figure, the red points stand for the interval endpoint as shown in
Table 1.
The creation of
lays the foundation for score conversion between the two systems via interpolation. The conversion process involves calculating the continuous structural condition level of a score under the JTGT H21-2011 scoring system using
. Subsequently, the score under the JTG H11-2004 scoring system is determined based on the calculated structural condition level using
. For instance, if a score of 70 under the JTGT H21-2011 scoring system corresponds to a structural condition level of 1.643, then
yields 85.357 as the converted value under the JTG H11-2004 scoring system. An example of this process is illustrated in
Figure 4. In the figure, the red points still represent the interval end points, and the green points stand for the conversion example introduced above.
The efficacy of these conversion rules is measured by the deviation before and after score conversion, quantified using the coefficient of deviation (
Err) as defined in Equation (5).
In Equation (5),
represents the structural condition level corresponding to the structural condition score of the bridge before conversion, and
represents the structural condition level corresponding to the structural condition score of the bridge after conversion. This coefficient evaluates the reasonability of the conversion rules, with the normalized absolute deviation of the structural condition level before and after conversion indicating their effectiveness.
is employed to assess the rationality of the conversion rule in the subsequent chapter.
4. Case Study
In this study, a dataset comprising inspection reports and design drawings of 812 main-highway bridges from three highway routes in Shandong Province from 2010 to 2023 was collected, as shown in
Table 2.
The data underwent extraction, cleaning, and integration processes, leading to the formation of a comprehensive regional bridge information database, as illustrated in
Table 3.
Initially, inspection data were extracted and cleaned from the inspection report documents stored in WORD format. The scores and ratings of structural conditions for both bridges and their individual components were extracted, cleaned, and subsequently stored, as detailed in
Table 4.
To address mismatched data errors present in inspection reports, the cleaning method described earlier was employed. Following cleaning, the coefficient of deviation was calculated, as depicted in
Figure 5.
Figure 5 showcases the distribution of deviation coefficients defined in Equation (5) after the conversion of different bridges. The distribution reveals that most bridges exhibit small deviation coefficients, with nearly 95% having coefficients within 0.05. This distribution further substantiates the rationality of the cleaning conversion rule proposed earlier.
This study also delves into exploring and analyzing various useful statistical features of attributes within the bridge network database. For instance,
Figure 6 visually represents the distribution of span lengths among typical bridges in the region. The observation deduced from this distribution is that the majority of bridges with a span length of 40 m or less fall under the category of small and medium span bridges, which often share similar structural properties.
Table 5 offers insight into the distribution of bridge types within the regional bridge dataset. Beam bridges emerge as the dominant structure type among the region’s bridges, particularly prevalent for small and medium span bridges. Furthermore,
Figure 7 illustrates the prevalent section forms for beam bridges. Among the investigated bridges in Shandong Province, box girder and slab girder section forms are the most frequently used, aligning with the span distribution data depicted in
Figure 6.
Given the extensive historical structural health data that have been integrated, it holds significance to delve into the evolving patterns of structural conditions for regional bridges over time.
Figure 8 serves to illustrate the distribution of structural condition scores for bridges across different years. As the age of bridges increases, the distribution of structural condition scores gradually shifts to the left, indicating an ongoing deterioration trend observed in regional bridges.
Figure 9, on the other hand, offers insight into the distribution of structural condition levels for bridges within a specific region. This distribution further underscores the consistent deterioration trend, evident as the proportion of bridges classified under level 1 decreases with increasing bridge age. It is notable that due to regular maintenance efforts, bridges with structural condition levels below level 2 are quite rare—a trend observed across most bridges in Shandong Province.
5. Discussion
This study presents a pioneering approach to the extraction, cleaning, and integration of data, culminating in the creation of a comprehensive regional bridge database. A key innovation is the processing of unstructured inspection data from reports, a historically challenging area in structural condition assessment. Unlike previous research, which has mostly focused on road pavements [
29,
30], this study targets bridge inspection data, implementing systematic procedures to address prevalent errors such as “record errors” and “mismatched data.” This enhances the reliability and precision of the inspection data.
Incorporating multi-source data, including inspection records, traffic statistics, and design and construction details, has proven effective in our case study. This integrated approach has the potential to transform bridge condition assessment, providing a solid base for informed decision making and strategic planning. The case study emphasizes the importance of meticulous data cleaning and integration for regional bridges, which is crucial for successful future data mining initiatives. This study addresses a research gap by detailing the form, acquisition methods, integration rules, and information data structure for multi-source regional bridge data.
This study can have worldwide benefits for the following reasons. First, although the conversion rules of rating scores were validated using data from Shandong province, its core approach—identifying intersections and invariants between different protocols—holds universal value. This principle is significant for an international audience, offering a systematic methodology adaptable to bridge assessment standards worldwide and aligning with evolving standards.
Beyond just score conversion, the overall vision of this paper encompasses establishing a comprehensive multi-source information database for regional bridges. The outlined methods for data extraction, cleaning, and integration are globally applicable. For instance, the selected attributes for the regional bridge database are relevant to most bridges worldwide. The identification of Record Errors in inspection reports and the proposed cleaning methods are universally applicable, providing a valuable toolkit for bridge assessment professionals globally.
In conclusion, while this study’s immediate application is within the Chinese context, the strategic approach and data management techniques presented have broad relevance. They offer valuable insights and reference points for bridge condition assessment protocols in other regions of China and countries around the world.
However, this study is not without its limitations. The current extraction method primarily focuses on tabular data within inspection reports and does not account for human-generated errors or data manipulation within these reports. Future research could enhance the robustness of the method by incorporating Natural Language Processing (NLP) technologies to interpret and comprehend inspection reports more accurately. Additionally, efforts could be directed toward assessing the authenticity of inspection data and exploring various avenues for utilizing the data stored in the established database, such as optimizing maintenance strategies for regional bridges.