3.1. Related Definitions
A key step in data fusion is record matching, which links records from different data sources that refer to the same entity. Reference [26] describes three main methods of record matching: content-based matching, structure-based matching, and mixed matching. We discuss content-based matching in the context of data fusion, in which similarity algorithms are used to match the values of one or more recorded patterns. A semantic similarity metric is a function that describes the degree to which two concepts are similar. To determine whether two entities are the same, we need to measure their similarity [27]. Likewise, for the structural data faultage generated in data fusion, we use similarity as the basis for data comparison.
The heterogeneity between data is an obstacle to successful data fusion. Determining the relationship between two data items is an effective way to overcome it [28]. The feature vector of a data column describes various aspects of the column's attributes and is used to determine the degree of difference between data columns. Following the formal description of the data column feature vector, the degree of difference of each component is first defined and examined, and the components are then combined to give the difference between data columns. Each data column has an attribute, so we summarize the difference between data columns as the difference between their attributes.
From the perspective of attribute similarity, this paper presents a solution for structural data faultages in data resources: the Structural Data Faultage Processing Algorithm based on Attribute Similarity (SDFPAAS). Because the data faultage addressed by this algorithm is analyzed from the perspective of structure, the discussion necessarily involves attributes. We first introduce several concepts needed to describe the algorithm.
Definition 5. Attribute Similarity. The similarity between two attributes in one dataset, or between an attribute in one dataset and an attribute in another dataset, is called Attribute Similarity.
The data sets waiting to be processed may differ in individual attribute definitions, attribute ranges, attribute value types, and so on, because of the variety of storage systems and media. In such situations, extra analysis and processing are required. The degree to which attribute information from the source data sets is retained after data resource integration directly affects the quality of the integrated data resource [29].
Attribute Similarity needs to take into account the attribute name, the attribute description, the value type, the value distribution, the value range, and so on. Accurate calculation of attribute similarity is the basis of structural data faultage processing and the premise for the subsequent analysis and decisions.
(a) Similarity between the names of attribute columns
Attribute columns with the same name describe the same semantics. Attribute column name similarity analysis is a form of string similarity analysis, so string similarity comparison is usually used to calculate the similarity between attribute names. There are currently three main ways to calculate string similarity: literal similarity matching, semantic similarity matching, and correlation statistical matching [30]. Representative literal similarity methods are those based on edit distance [31] and those based on shared characters and words [32]. We calculate the similarity between attribute names using the edit distance method.
The reconfiguration cost in Definition 3 is the edit distance: the minimum number of insertions, deletions, and substitutions required to convert one string into another. Attribute name similarity is expressed by Formula (2), which is defined in terms of the edit distance and the length of the attribute name:
(b) Similarity between the descriptions of attribute columns
Description similarity refers to the comparison of the descriptive data that belong to different attributes. It is calculated in the same way as the similarity between attribute names, by comparing each pair of descriptions. The formula is (3).
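The edit-distance comparison used for attribute names and descriptions can be sketched as follows. This is a minimal illustration rather than the paper's Formula (2) or (3): the normalization `1 - distance / max(len(s), len(t))` and the function names `edit_distance` and `string_similarity` are assumptions introduced here.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def string_similarity(s: str, t: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(s, t) / longest
```

With this normalization, identical strings score 1 and strings with no characters in common at any aligned position score 0, so the result can be combined with the other similarity components on the same scale.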
(c) Similarity between the value types of attribute columns
Type similarity indicates whether the value types of two attributes match. In the real world, different data resources may use different data types for similar attributes. To cover these situations in a general way, this paper groups data types into basic data types, such as numeric, character, and date types. If the types of two attributes belong to the same basic data type, the type similarity is 1; otherwise it is 0.
(d) Similarity between the value distributions of attribute columns
Distribution similarity refers to how the values of the attributes are distributed. The calculation falls into two cases [33]. The first case is when the value type is numeric or numeric-like. KL divergence and JS divergence can then be used to calculate the similarity between distributions [34]. KL divergence is an asymmetric measure of the difference between two probability distributions A and B; asymmetry means that $D_{KL}(A \| B) \neq D_{KL}(B \| A)$. The formula for KL divergence is as follows.
$$D_{KL}(A \| B) = \sum_{x} A(x) \log \frac{A(x)}{B(x)} \qquad (4)$$
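This asymmetry easily causes deviation and low accuracy when comparing attribute similarity in our study. JS divergence measures the similarity of two attribute distributions and removes the asymmetry of KL divergence. With base-2 logarithms, its value ranges from 0 to 1. The formula for JS divergence is Formula (5):
$$D_{JS}(A \| B) = \frac{1}{2} D_{KL}\left(A \,\Big\|\, \frac{A+B}{2}\right) + \frac{1}{2} D_{KL}\left(B \,\Big\|\, \frac{A+B}{2}\right) \qquad (5)$$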
We use JS divergence to calculate the similarity between the distributions of attribute values. The higher the JS divergence, the greater the difference. To keep the similarity measures consistent, so that a higher value always indicates higher similarity, we modify the JS divergence formula. The modified formula is Formula (6).
The second case is when the value type is character type, for which there is little existing research on similarity. In this paper, the frequency of each character value under the attribute is first counted, and JS divergence is then used to calculate the similarity between the frequency distributions.
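The frequency-counting and divergence steps can be sketched as follows. The definitions of KL and JS divergence are the standard ones with base-2 logarithms; the conversion `1 - JS` is only an assumed form of the modified similarity, since Formula (6) is not reproduced here, and the helper names are hypothetical. Continuous numeric values would need to be binned before counting, which this sketch leaves out, and both columns are assumed to be non-empty.

```python
import math
from collections import Counter


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p[i] == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Symmetric JS divergence, bounded by [0, 1] with base-2 logarithms."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # m[i] > 0 wherever p[i] or q[i] > 0
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def value_frequencies(values_a, values_b):
    """Relative frequencies of both columns over the union of their distinct values."""
    count_a, count_b = Counter(values_a), Counter(values_b)
    support = list(set(count_a) | set(count_b))
    p = [count_a[v] / len(values_a) for v in support]
    q = [count_b[v] / len(values_b) for v in support]
    return p, q


def distribution_similarity(values_a, values_b):
    """Assumed conversion from divergence to similarity: 1 - JS."""
    p, q = value_frequencies(values_a, values_b)
    return 1.0 - js_divergence(p, q)
```

Under these assumptions, two columns with identical value frequencies score 1 and two columns with no values in common score 0.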
(e) Similarity between the value ranges of attribute columns
Range similarity determines whether attributes with numeric values are similar in domain. Generally, if the value ranges of two attributes match, the attributes are considered to be related. Value ranges can be compared directly through their maximum and minimum values. If the value ranges of two attributes match, the range similarity is 1; otherwise it is 0. For character data, the value range can be defined using the frequency of values within each attribute: the values with the highest and the lowest occurrence frequency are taken as the upper and lower limits, respectively, and are then compared using the name similarity method.
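The two indicator-style components, value type (c) and value range (e), can be sketched as follows. Several points are assumptions made for illustration: that the indicators take the values 1 and 0, that the basic types are detected from Python value types, and that two numeric ranges "match" when their [min, max] intervals overlap. The source does not spell out the matching criterion, and all function names are hypothetical.

```python
from datetime import date
from numbers import Number


def basic_type(values):
    """Map a column to an assumed basic type: 'numeric', 'date', or 'character'."""
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return "numeric"
    if all(isinstance(v, date) for v in values):  # datetime is a subclass of date
        return "date"
    return "character"


def type_similarity(col_a, col_b):
    """1 if both columns fall into the same basic type, otherwise 0."""
    return 1 if basic_type(col_a) == basic_type(col_b) else 0


def range_similarity(col_a, col_b):
    """1 if the [min, max] intervals overlap (assumed criterion), otherwise 0."""
    overlap = min(col_a) <= max(col_b) and min(col_b) <= max(col_a)
    return 1 if overlap else 0
```

A stricter criterion, such as requiring one interval to contain the other, could be substituted in `range_similarity` without changing the rest of the calculation.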
Based on the above components of attribute similarity, we use the Euclidean distance to calculate the joint attribute similarity, as shown in Formula (7).
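One plausible way to combine the five components with a Euclidean distance is sketched below. It is not a reproduction of Formula (7): the choice of measuring the distance from the ideal point (1, ..., 1), rescaling by the square root of the number of components, and subtracting from 1 is an assumption made here, as is the function name `joint_similarity`.

```python
import math


def joint_similarity(components):
    """Combine component similarities in [0, 1] into one score in [0, 1].

    Assumed form: 1 - (Euclidean distance to the all-ones vector) / sqrt(n).
    """
    n = len(components)
    distance = math.sqrt(sum((1.0 - s) ** 2 for s in components))
    return 1.0 - distance / math.sqrt(n)


# Hypothetical component values: name, description, type, distribution, range
score = joint_similarity([0.8, 0.6, 1, 0.7, 1])
```

With this form, a pair that is perfect on every component scores 1, a pair that fails every component scores 0, and a single poor component lowers the score more than the same total shortfall spread evenly across components.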
Definition 6. Attribute Cementation. During data resource fusion, the operation that combines identical or similar attributes in the data sets awaiting fusion into a single attribute is called attribute cementation.
When two attributes in the data sets have different types, for example one numeric and the other character, there is little probability that the two attributes are similar. Especially for a database with a standardized design, there is no need to consider attribute cementation for fields that are so unlikely to be similar. The criteria for judging the degree of similarity of related attributes are given in Table 1.
Unfused attributes that do not undergo cementation need to be added to the attribute results to preserve the integrity of the structure. This involves attribute addition.
Definition 7. Attribute Addition. During data resource fusion, to preserve the integrity of information, attributes whose similarity to the other attributes is lower than the threshold are added to the attribute results. This operation is called attribute addition.
After each source data set has undergone data faultage processing, the data sets are merged with each other. After fusion, information such as the source of each record needs to be recorded to preserve the integrity of information.
Definition 8. Incursion. During data resource fusion, new attributes need to be added both to mark the source of every data record after fusion and to hold the attributes that fail to match any attribute with a similarity above the threshold. This operation is called incursion.
Definition 9. Mapping. The operation that matches entity records in the data resource with heterogeneous data is called mapping.
In essence, mapping equivalently transforms heterogeneous information into another pattern. It is both the basic step and the final goal of structural data faultage processing.
Definition 10. Isomorphous Homonuclear. When two or more records in the data resource represent the same real-world entity, they are combined into one record after data faultage processing. This operation is called isomorphous homonuclear.
In the fusion of different data tables, the main issue is the fusion of attributes. When the attribute definitions are the same, data can be fused directly based on the attribute values. For instance, when one of two attribute names is written in lowercase letters and the other in capital letters, or when two attributes have the same name and differ only in the units of measurement of their values, attribute cementation can be applied. When the attribute definitions differ, attribute addition can be used. To reduce data redundancy, isomorphous homonuclear unifies attributes that use different names to denote the same content. In the end, the data tables produced by structural data faultage processing should satisfy the integrity and consistency of the data structure.
3.2. The Main Idea
The definition and detection of structural data faultage describe and measure the differences among several data tables from the perspective of structure. Therefore, when processing data tables with structural data faultage, the primary consideration is the impact of the different table structures on particular data, since in most cases the entities in the fused table differ. If this is not handled well, key data will be lost, which makes subsequent data analysis difficult. Structural data faultage should therefore be handled with both the completeness and the consistency of the data tables in mind.
After data fusion, the biggest problems are missing and repeated data. This algorithm is mainly used to eliminate structural data faultage and to ensure the integrity of the data structure and efficient resource utilization. Its three main operations are attribute cementation, attribute addition, and isomorphous homonuclear, which are explained in Definition 6, Definition 7, and Definition 10, respectively.
The premise of this algorithm is that subsequent processing is carried out according to attribute similarity. Compared with other semantic similarity computing methods, a method that combines multiple relationships can improve the accuracy of semantic computing [35]. Following the definitions in Section 3.1, we calculate the joint similarity over attribute names, attribute descriptions, attribute types, attribute distributions, and attribute value ranges, which makes the calculated attribute similarity more accurate.
The algorithm first computes the similarity between unfused and fused attributes after data fusion, and then decides how to handle structural data faultage based on attribute similarity. It can determine whether data are partially or completely missing, as well as the relevance among attributes. The flowchart of the algorithm is shown in Figure 3.
Attribute cementation primarily addresses partially missing data, that is, data fields that are partly missing from some fused attributes after data fusion. When some attributes and all of their field contents are missing after data fusion, the data are completely missing. Attribute addition adds the missing attributes and their data fields to the data fusion table in full, ensuring that no data fields are missing. Once the completeness of the data structure has been restored, the data fusion table inevitably contains problems such as repeated content and data redundancy. Isomorphous homonuclear processing unifies the content of similar attributes and records the relationships among the data in the data dictionary. These operations ensure the consistency of the data structure, reduce data redundancy, and make the data convenient to access.
The steps of the algorithm are as follows:
- (a) First: Data fusion. The data tables to be processed are fused according to the actual situation.
- (b) Second: After data fusion is complete, structural data faultage processing begins. First, extract the data content in the data resource that was not fused.
- (c) Third: Calculate the attribute similarity between the unfused and the fused data.
- (d) Fourth: Set the threshold T. The attribute similarities are sorted from high to low; each similarity corresponds to a pair of attributes. Going down the sorted list, check whether the two attribute names of each pair have the same semantics. When an inconsistency occurs, the comparison is terminated, and the similarity immediately preceding the inconsistent one is taken as the threshold.
- (e) Fifth: Mark the unfused attributes whose similarity is greater than or equal to the threshold T as partly missing data, and those whose similarity is smaller than T as completely missing data.
- (f) Sixth: Process the attributes and field contents whose data are partly missing through attribute cementation. Using the fused attributes as units, map the corresponding partly missing data fields to the matching attribute in the data fusion table.
- (g) Seventh: Process the attributes and field contents whose data are completely missing through attribute addition. Add the completely missing part of the data to the data fusion table by incursion.
- (h) Eighth: Perform Isomorphous Homonuclear. At this point, the data fusion table has the highest content integrity, but it contains much redundant and duplicate data. Many data items express the same semantic meaning under different names, which takes up a large amount of space and causes high data redundancy. After the seventh step, we calculate the attribute similarity between the data attribute columns, select the highly correlated attributes for Isomorphous Homonuclear, and label the relationships among attribute values in the data dictionary.
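The split between attribute cementation and attribute addition in the fifth to seventh steps can be sketched as follows. The sketch takes the threshold T as given, because the semantic check in the fourth step is a judgement this code does not model. The nested dictionary layout, the function name `plan_faultage_processing`, and the example attribute names and scores are all hypothetical, and every unfused attribute is assumed to have at least one similarity score.

```python
def plan_faultage_processing(similarities, threshold):
    """Split unfused attributes into cementation and addition candidates.

    similarities: {unfused_attr: {fused_attr: similarity}} (assumed layout)
    threshold:    the threshold T chosen in the fourth step
    """
    cementation = {}  # unfused attribute -> fused attribute it is mapped onto
    addition = []     # unfused attributes to be appended as new columns
    for unfused, scores in similarities.items():
        best_fused, best_score = max(scores.items(), key=lambda kv: kv[1])
        if best_score >= threshold:
            cementation[unfused] = best_fused  # partly missing data
        else:
            addition.append(unfused)           # completely missing data
    return cementation, addition


# Hypothetical similarity scores between unfused and fused attributes
scores = {
    "Tel": {"phone": 0.82, "name": 0.10},
    "hobby": {"phone": 0.05, "name": 0.12},
}
cementation, addition = plan_faultage_processing(scores, threshold=0.6)
```

Here `Tel` would be mapped onto the fused attribute `phone`, while `hobby` would be added to the data fusion table as a new attribute.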
Isomorphous Homonuclear is also based on attribute similarity. We build the attribute connection diagram according to the similarity, with attributes as nodes and similarities as edges. Sorting the similarities between each attribute column and the other attribute columns in descending order shows that the gap between two of these similarity values is much larger than the gaps between the others. The attribute columns before this gap have a distribution similar to that of the column in question, while those after it have a completely different data distribution. We select the critical value at this gap as the threshold.
For a given attribute column, we classify every other attribute column whose similarity to it reaches the threshold as an approximate matching attribute of that column, and connect the two columns to form the attribute connection diagram. If a second attribute column is an approximate matching attribute in the attribute connection diagram of a first attribute column, and the first column in turn appears in the attribute connection diagram of the second, the two attribute columns match each other. The two attributes then express the same semantic meaning and can be combined. If no other attribute column matches a given attribute column, there is no attribute that can be combined with it. The attribute connection diagram is illustrated in Figure 4. Finally, the attributes that match each other are represented by one attribute according to the attribute connection diagram.
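The gap-based threshold and the mutual-matching rule can be sketched as follows. The sketch assumes that the threshold for a column is the similarity value just before the largest drop in its descending-sorted similarities, and that a column with fewer than two similarity scores has no approximate matches. The function names and the example scores are hypothetical.

```python
def gap_threshold(scores):
    """Similarity value just before the largest drop in the sorted scores."""
    ordered = sorted(scores, reverse=True)
    gaps = [ordered[i] - ordered[i + 1] for i in range(len(ordered) - 1)]
    return ordered[gaps.index(max(gaps))]


def approximate_matches(sim, attr):
    """Columns whose similarity to attr is at or above attr's own threshold."""
    others = sim[attr]
    if len(others) < 2:
        return set()  # assumption: no gap can be located with fewer than two scores
    threshold = gap_threshold(others.values())
    return {b for b, s in others.items() if s >= threshold}


def mutual_matches(sim):
    """Pairs of columns that appear in each other's connection diagram."""
    neighbours = {a: approximate_matches(sim, a) for a in sim}
    return {frozenset((a, b)) for a in sim for b in neighbours[a]
            if a in neighbours.get(b, set())}


# Hypothetical pairwise attribute similarities
sim = {
    "tel":   {"phone": 0.9, "name": 0.2, "age": 0.1},
    "phone": {"tel": 0.9, "name": 0.25, "age": 0.15},
    "name":  {"tel": 0.2, "phone": 0.25, "age": 0.1},
    "age":   {"tel": 0.1, "phone": 0.15, "name": 0.1},
}
pairs = mutual_matches(sim)
```

Because a largest gap exists even for a column that resembles nothing else, a column such as `name` still receives approximate matches in this sketch; the requirement that the match be mutual is what filters these out, leaving only `tel` and `phone` to be combined.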