1. Introduction
Wireless sensor networks (WSN) have become popular in various areas [1,2,3,4]. For example, in agricultural production, WSN are used to monitor the growth environment and status of crops; in industrial safety, WSN are applied to monitor dangerous working environments such as coal mines, oil drilling, and nuclear power plants to protect workers. However, due to external factors such as sensor aging, instrument failure, measurement methods, and human interference, the data collected by sensors may not be precise or reliable [5]. Such unreliable data lead to inefficient data utilization, wasted economic costs, and even serious decision-making errors; a monitoring system built on unreliable data produces false alarms and missed alarms, which greatly reduce its performance. If the credibility of the original data can be evaluated effectively, unreliable and low-quality data can be detected and removed during data processing, and the remaining high-reliability data will improve the early warning capability of the supervision and inspection system, thereby protecting people's lives and property to the maximum extent. For real event data, if the data collected by a data source reflect the real environment, the proposed method considers the data credible and suitable for further analysis. However, when the data of multiple sensors are interfered with simultaneously, the method cannot evaluate the data effectively, so event data and unreliable data cannot be handled.
In order to evaluate the credibility of data, scholars have conducted many methodological studies. For example, Marsh [6] first proposed a mathematical model for calculating credibility and, at the same time, explained the definition of data credibility. He considered credibility to be a way to understand a complex environment, a means to provide additional robustness to independent subjects, and an effective indicator for making judgments based on the experience of others. Building on this definition, Marsh suggested several important directions for subsequent research. Galland [7] presented three fixed-point algorithms, based on the complexity of the data source, to evaluate the credibility of the data provided by the data source. Furthermore, based on the experimental results, he designed two schemes to enhance the robustness of the method: establishing weights for different data sources and using prior knowledge. Aiming at the uncertainty and conflict of multi-source data, Guo et al. [8] used a reliability evaluation framework based on evidence theory, which combines prior information and context information; the algorithm can be improved by correlating with more information. Jikui Wang et al. [9] suggested an evaluation method based on Bayesian estimation theory, which relies too heavily on the accuracy of the prior data of the data source. Fernando et al. [10] addressed the problem of evaluating the credibility of sensors for autonomous driving systems, using machine learning to build a data credibility framework and Lidar to verify the system. However, in the above algorithms, models are established directly from the data without constructing the spatio-temporal relationship between the data sources, so the correlation between data sources cannot be exploited to handle the uncertainty of the data. In [11], Xinjian Wan et al. designed a credibility measurement model that considers the relationship between data sources, in light of the problem that the credibility of air quality monitoring data cannot otherwise be evaluated. Hence, this was our main comparison method in this article.
Multi-source data fusion is a processing technique that derives estimates and judgments from the original data sources through knowledge-based reasoning and recognition, in order to enhance the confidence of data, improve reliability, and decrease uncertainty, with the advantages of comprehensive information, high fault tolerance, reduced cost, and broader coverage. It is a complex estimation process. In order to make full use of the information in the image fusion process, Xue J et al. [12] developed a Bayesian data fusion method in which the fused image is obtained by maximum a posteriori probability (MAP) estimation. However, the Bayesian criterion is greatly limited in engineering applications because it requires the prior probability of each data source, which is difficult to know in advance. In [13], three multi-sensor data fusion algorithms based on the Kalman filter, namely state vector fusion (SVF), measurement fusion (MF), and gain fusion (GF), were implemented in a tracking system. The results showed that the Kalman filter-based state vector fusion algorithm performed comparatively well for this system. Although the Kalman filter performs well in target tracking applications, it suffers from poor real-time performance. L.Z. et al. [14] presented weather recognition methods based on the weighted fusion of weather-specific features for classifying five types of weather; the weighted average method is simple to implement, but its real-time performance is also poor. Moreover, some algorithms do not take into account the fluctuation of data collected by sensors over the time series and do not reflect the accuracy of data fusion. The BP neural network has the advantages of simple design, fast training speed, and strong self-adaptive ability, and it has good fault tolerance because damage to some of its neurons does not greatly affect the global training results [15]. This paper therefore proposes a new comprehensive credibility of data measurement (CDM) method, which combines the correlation between data sources with a fusion algorithm. The method makes full use of the relationship between data sources to further enhance robustness, and it utilizes both the historical data of the target data source and multi-source heterogeneous data to improve the accuracy of the evaluation. Based on the above characteristics, the BP neural network was selected as the fusion algorithm. Moreover, the ARIMA model [16] offers stable prediction and is simple and easy to implement. Therefore, a model based on ARIMA and the BP neural network can provide a stable, real-time, and accurate assessment of data credibility.
The rest of this article is organized as follows. Section 2 introduces the definition of credibility and the components of data credibility, and describes the basic concepts of the ARIMA model and the BP neural network in detail. Section 3 verifies the proposed method through examples and presents the experimental results. Section 4 presents our conclusions and future work.
2. Materials and Methods
Currently, there is no clear and accepted definition of the credibility of data. Pinyol [17] discussed the objective aspect of credibility and argued that credibility can be regarded as a decision in the process of interacting with another party. Sabater [18] analyzed the user credibility evaluation mechanisms of eBay, Amazon Auctions, and OnSale Exchange, and noted that the credibility models of these online markets all use single data that do not depend on contextual information. In [19], the authors considered data credibility to be determined by a combination of data integrity, consistency, and accuracy: consistency indicates whether the logical relationships between related data are correct, integrity describes the severity of data loss in the dataset, and accuracy describes how close the measured data are to the real data. In this article, the credibility of the data provided by a data source was considered to be the degree to which the data reflect the true state of the measured object in the expected environment. That is, the discussion focuses on accuracy as the key index for measuring the performance of the data credibility measurement model.
First, an overview of the methodology is given. Then, the symbols are introduced, comprehensive data credibility is defined, and the meaning and metrics of the parameters in the formulas are described in detail. Finally, we briefly review the related algorithms.
2.1. Overview
In order to clarify the relationship between the different factors, we present the framework of the method; its specific process is shown in Figure 1. Figure 1a shows the framework of the method, which illustrates the relationship between the credibility of the data source and the combined credibility of the data, and shows the components of the different types of credibility. Figure 1b demonstrates the process of predicting data directly through the time correlation of the data. Figure 1c shows the process of predicting the heterogeneous data through the spatio-temporal correlation.
2.2. Symbols
In this section, we describe the symbols used in this paper. The symbols and their descriptions are shown in Table 1.
2.3. Definition of Comprehensive Data Credibility
The evaluation method of comprehensive data credibility combines multiple sensors and multiple types of sensors in both temporal and spatial dimensions. Comprehensive data credibility is composed of data source credibility and combined data credibility. The comprehensive data credibility is calculated as follows:

$$C_i = \begin{cases} R_i \times B_i, & R_i \ge \lambda \\ 0, & R_i < \lambda \end{cases}$$

where $C_i$ is the comprehensive data credibility of data source $i$; $R_i$ represents the credibility of data source $i$; $B_i$ is the combined data credibility of data source $i$; and $\lambda$ is the credibility threshold of the data source. When the data source is credible, the comprehensive data credibility is the product of the credibility of the data source and the combined credibility of the data. If the credibility of the data exceeds 0.5, the data are credible; otherwise, the data are not credible. In this paper, data credibility is represented by a continuous value in the interval [0,1]. Next, the credibility of the data source and the combined credibility of the data are elaborated.
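To make the combination rule concrete, the following minimal Python sketch evaluates the comprehensive credibility under the notation reconstructed above; the function and variable names (e.g., comprehensive_credibility, threshold) are illustrative and not taken from the paper.

```python
def comprehensive_credibility(source_cred: float,
                              combined_cred: float,
                              threshold: float = 0.5) -> float:
    """Comprehensive data credibility C_i (sketch).

    If the data source itself is judged credible (R_i >= threshold),
    C_i is the product of the source credibility and the combined
    data credibility; otherwise the data are rejected (C_i = 0).
    All values lie in [0, 1].
    """
    if source_cred >= threshold:
        return source_cred * combined_cred
    return 0.0


# Example: a healthy source (R_i = 0.9) with combined credibility 0.8
print(comprehensive_credibility(0.9, 0.8))  # 0.72 -> credible (> 0.5)
print(comprehensive_credibility(0.4, 0.9))  # 0.0  -> source not credible
```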
2.4. Definition of the Credibility of the Data Source
Data source credibility is calculated from the trend correlation and the mean aggregation, and describes whether the data source is credible and working properly. If the data source is not credible, the credibility of the data is set to zero. The formula for the credibility of the data source is as follows:

$$R_i = \omega_1 T_i + \omega_2 M_i$$

where $R_i$ is the credibility of the data source; $T_i$ is the trend correlation of data source $i$; $M_i$ is the mean aggregation of data source $i$; and $\omega_1$ and $\omega_2$ are the weighting factors of trend correlation and mean aggregation, respectively, where $\omega_1 + \omega_2 = 1$. The weights are determined based on the importance of trend correlation and mean aggregation: through multiple iterations, the weights yielding the highest detection probability and the lowest false alarm rate are selected. The definitions of detection probability and false alarm rate are given in the experimental section. The weighting factors appearing in the following formulas were determined in the same way.
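As a hedged illustration of how such weights could be tuned, the sketch below grid-searches $\omega_1$ (with $\omega_2 = 1 - \omega_1$) against labelled historical samples, scoring each candidate by detection probability and false alarm rate; the labels, the 0.5 decision threshold, and all names are assumptions for the example, not the paper's exact procedure.

```python
import numpy as np

def select_weights(trend_corr, mean_agg, is_faulty, step=0.05):
    """Grid-search the weight w1 (w2 = 1 - w1) for source credibility.

    trend_corr, mean_agg : arrays of T_i and M_i for labelled samples
    is_faulty            : boolean array, True where the source was not credible
    Returns the w1 with the highest detection probability and, as a
    tie-breaker, the lowest false alarm rate.
    """
    best = (-1.0, 2.0, None)  # (detection prob, false alarm rate, w1)
    for w1 in np.arange(0.0, 1.0 + 1e-9, step):
        cred = w1 * trend_corr + (1.0 - w1) * mean_agg
        flagged = cred < 0.5                        # flagged as not credible
        detection = np.mean(flagged[is_faulty])     # faulty sources caught
        false_alarm = np.mean(flagged[~is_faulty])  # healthy sources flagged
        if (detection, -false_alarm) > (best[0], -best[1]):
            best = (detection, false_alarm, w1)
    return best

# Example with synthetic labels
rng = np.random.default_rng(0)
T = rng.uniform(0.2, 1.0, 200)
M = rng.uniform(0.2, 1.0, 200)
faulty = (0.6 * T + 0.4 * M) < 0.5
print(select_weights(T, M, faulty))
```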
- (1)
Calculation of trend correlation
A distance threshold $d_0$ (a Euclidean distance) is set: when the distance between the target data source and another data source is less than this threshold, that data source is regarded as an adjacent data source. Trend correlation refers to the average degree of correlation between the historical data trend of the target data source and those of its adjacent data sources. If the data source is damaged or does not work correctly, $T_i$ will be smaller. The mathematical expression of the trend correlation is as follows (see the sketch following point (2)):

$$T_i = \frac{1}{m}\sum_{j \in N(i)} \left| \rho(\mathbf{x}_i, \mathbf{x}_j) \right|, \qquad \rho(\mathbf{x}_i, \mathbf{x}_j) = \frac{\sum_{k=1}^{n}\left(x_{i,k}-\bar{x}_i\right)\left(x_{j,k}-\bar{x}_j\right)}{\sqrt{\sum_{k=1}^{n}\left(x_{i,k}-\bar{x}_i\right)^2}\,\sqrt{\sum_{k=1}^{n}\left(x_{j,k}-\bar{x}_j\right)^2}}$$

where $\mathbf{x}_i$ is the trend of a piece of the historical data of data source $i$; $\rho(\cdot,\cdot)$ is a correlation function; data source $j$ is adjacent to data source $i$ (the set of adjacent sources is $N(i)$); $m$ is the number of adjacent data sources of the target data source; the vertical lines indicate the absolute value; $n$ is the number of historical data points; and $\bar{x}_i$ represents the mean value of the historical data.
- (2)
Calculation of mean aggregation
Mean aggregation is the degree of aggregation between the average value of the target data source and those of its neighboring data sources of the same type over the same time period. It also describes the degree of dispersion of the average data between data sources with spatial correlation. For example, when a data source is manually interfered with, the data trend of the target source may still be consistent with that of the neighboring data sources, but its value will decrease and its mean will deviate from the means of the other neighboring data sources, so $M_i$ will be reduced. The mean aggregation is calculated as follows:

$$M_i = 1 - \frac{\left|\bar{x}_i - \dfrac{1}{m}\sum_{j \in N(i)} \eta_{ij}\,\bar{x}_j\right|}{\dfrac{1}{m}\sum_{j \in N(i)} \eta_{ij}\,\bar{x}_j}$$

where $\bar{x}_i$ is the average value of the historical data of data source $i$; data source $j$ is an adjacent sensor of data source $i$; $m$ is the number of adjacent sensors of data source $i$; and $\eta_{ij}$ is the normalized distance factor between data sources $i$ and $j$.
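The sketch below illustrates how the trend correlation and mean aggregation of points (1) and (2) could be computed with NumPy; it is a minimal interpretation under the reconstructed formulas above (Pearson correlation for $\rho$, a relative-deviation form for $M_i$), and all function names and example weights are illustrative.

```python
import numpy as np

def trend_correlation(target_hist, neighbor_hists):
    """T_i: mean absolute Pearson correlation between the target source's
    historical data and each adjacent source's historical data."""
    corrs = [abs(np.corrcoef(target_hist, nb)[0, 1]) for nb in neighbor_hists]
    return float(np.mean(corrs))

def mean_aggregation(target_hist, neighbor_hists, dist_factors):
    """M_i: closeness of the target source's historical mean to the
    distance-weighted mean of its neighbors (one plausible reading)."""
    neighbor_means = np.array([np.mean(nb) for nb in neighbor_hists])
    weighted_ref = np.mean(np.asarray(dist_factors) * neighbor_means)
    return 1.0 - abs(np.mean(target_hist) - weighted_ref) / weighted_ref

# Example: one target source and two adjacent sources of the same type
t = np.array([20.1, 20.4, 20.8, 21.0, 21.3])
n1 = np.array([20.0, 20.3, 20.9, 21.1, 21.2])
n2 = np.array([19.8, 20.2, 20.6, 20.9, 21.1])
eta = [0.9, 0.8]  # normalized distance factors (closer neighbor -> larger)

T_i = trend_correlation(t, [n1, n2])
M_i = mean_aggregation(t, [n1, n2], eta)
R_i = 0.6 * T_i + 0.4 * M_i  # weighted source credibility (example weights)
print(T_i, M_i, R_i)
```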
2.5. Definition of the Combined Credibility of Data
Combined credibility of the data consists of the direct credibility of the data and the heterogeneous credibility of the data. The formula for the combined credibility of the data is as follows:

$$B_i = \mu_1 D_i + \mu_2 H_i$$

where $B_i$ is the combined credibility of the data provided by data source $i$; $D_i$ is the direct credibility of the data provided by data source $i$; $H_i$ is the heterogeneous credibility of the data; and $\mu_1$ and $\mu_2$ are the weighting factors of the direct credibility and the heterogeneous credibility of the data, respectively, where $\mu_1 + \mu_2 = 1$.
- (1)
Calculation of direct credibility of the data
The direct credibility of the data is the degree of membership of the measured value to the theoretical real value. Namely, the historical data of the data source are used for model fitting based on the ARIMA model, and the theoretical real value at the current moment is then predicted from the fitted model according to the time correlation. The direct credibility is obtained from the measured value and the theoretical real value (a code sketch follows point (2) below). The formula for the direct credibility of the data is as follows:

$$D_i(t) = \max\!\left(0,\; 1 - \frac{\left|x_i(t) - \hat{x}_i(t)\right|}{\varepsilon}\right)$$

where $x_i(t)$ is the measured data of data source $i$ at time $t$; $\hat{x}_i(t)$ is the theoretical real value (direct prediction data) based on the historical data; and $\varepsilon$ is the credibility threshold of the data.
- (2)
Calculation of heterogeneous credibility of the data
The heterogeneous credibility of the data is likewise the degree of membership of the measured value to the theoretical real value. Multi-source heterogeneous data from the target data source and its neighboring heterogeneous data sources within the same time period are used to predict the data at the current time based on the BP neural network. The heterogeneous credibility is then calculated from the theoretical real value and the measured value. The formula for the heterogeneous credibility of the data is as follows:

$$H_i(t) = \max\!\left(0,\; 1 - \frac{\left|x_i(t) - \tilde{x}_i(t)\right|}{\varepsilon}\right)$$

where $\tilde{x}_i(t)$ is the theoretical real value (heterogeneous prediction data) of data source $i$ at time $t$, obtained by fusing the data of the adjacent data sources.
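The following sketch shows how the direct credibility of point (1) could be computed in practice: an ARIMA model from statsmodels is fitted to the historical data of the target source, its one-step forecast serves as the theoretical real value, and a membership value is derived from the forecast and the measurement. The ARIMA order (5, 1, 3), the threshold value, and the membership form are assumptions used for illustration.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def direct_credibility(history, measured, order=(5, 1, 3), eps=1.0):
    """D_i: membership of the measured value to the one-step ARIMA forecast.

    history  : 1-D array of past observations of the target source
    measured : the newly measured value whose credibility is evaluated
    eps      : credibility threshold (scale of acceptable deviation)
    """
    fitted = ARIMA(history, order=order).fit()
    predicted = float(fitted.forecast(steps=1)[0])   # theoretical real value
    return max(0.0, 1.0 - abs(measured - predicted) / eps), predicted

# Example with a slowly rising synthetic series
rng = np.random.default_rng(1)
hist = 20.0 + 0.05 * np.arange(200) + rng.normal(0.0, 0.1, 200)
cred, pred = direct_credibility(hist, measured=35.0)
print(pred, cred)   # large deviations from the forecast give low credibility
```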
2.5.1. Autoregressive Integrated Moving Average Model
In this part, we introduce the time series forecasting model used in the proposed method. The model can extract the historical data trend of the data source to predict future data. First, the parameters used in this section are explained; the parameters and their descriptions are shown in Table 2.
The full name of ARIMA is the autoregressive integrated moving average model, which is also written as ARIMA(p, d, q). ARIMA is the most common statistical model used for time series forecasting [20,21] and is an extension of the ARMA model [22,23,24]. The relationship between the two models is shown in Figure 2.
The ARMA model consists of an autoregressive (AR) model and a moving average (MA) model, where $p$ and $q$ are the two parameters of the ARMA model; the meanings of these parameters are shown in Table 2. The mathematical representation of the ARMA model is as follows:

$$x_t = c + \sum_{i=1}^{p} \varphi_i x_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j} + \epsilon_t$$

where $x_t$ is the prediction value at time $t$; $c$ is a constant; $\varphi_i$ are the coefficients of the AR model; $\theta_j$ are the coefficients of the MA model; and $\epsilon_t$ is the error between the predicted value and the measured value.
Compared with the ARMA model, the ARIMA model adds a differencing step, the purpose of which is to convert a non-stationary sequence into a stationary sequence to meet the stationarity requirement of ARMA. Here, $p$, $d$, and $q$ are the three parameters of the ARIMA model; the meaning of each parameter is shown in Table 2. The mathematical representation of the ARIMA model is as follows:

$$\nabla^d x_t = c + \sum_{i=1}^{p} \varphi_i \nabla^d x_{t-i} + \sum_{j=1}^{q} \theta_j \epsilon_{t-j} + \epsilon_t$$

where $\nabla^d x_t$ is the predicted value of the $d$-order difference of the data at time $t$.
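To make the equation concrete, the sketch below applies a $d$-order difference with NumPy and computes a one-step ARMA prediction on the differenced series, given assumed coefficients and past residuals; the coefficient values and names are illustrative only.

```python
import numpy as np

def arma_one_step(x, resid, phi, theta, c=0.0):
    """One-step ARMA(p, q) prediction on an (already differenced) series.

    x     : past values of the series (most recent last)
    resid : past one-step prediction errors epsilon (most recent last)
    phi   : AR coefficients phi_1..phi_p
    theta : MA coefficients theta_1..theta_q
    """
    ar_part = sum(phi[i] * x[-1 - i] for i in range(len(phi)))
    ma_part = sum(theta[j] * resid[-1 - j] for j in range(len(theta)))
    return c + ar_part + ma_part

# d-order differencing turns the raw series into the series the ARMA part models
raw = np.array([20.0, 20.4, 20.9, 21.5, 22.2, 23.0, 23.9])
d = 1
x_diff = np.diff(raw, n=d)

phi = [0.5, 0.2]                    # assumed AR(2) coefficients
theta = [0.3]                       # assumed MA(1) coefficient
resid = [0.05, -0.02, 0.01]         # assumed past prediction errors

next_diff = arma_one_step(x_diff, resid, phi, theta)
next_value = raw[-1] + next_diff    # undo the first-order difference (d = 1)
print(next_diff, next_value)
```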
Establishing an appropriate model is important for the accuracy of data prediction. There are three main steps in building a time series prediction model: stationarity testing, model ordering, and order verification. The model flow chart is shown in Figure 3.
First, it is necessary to judge whether the time series data are stationary; if they are not, corresponding processing such as differencing or logarithmic operations is required;
Second, if the data series is stationary, the next step is to determine the order $(p, q)$, which can be done manually by calculating the autocorrelation function (ACF) and the partial autocorrelation function (PACF);
Third, since there is a certain subjective component in determining the order in the previous step, it is necessary to verify whether the parameters are reasonable through an index function, chosen here as the Akaike Information Criterion (AIC)/Bayesian Information Criterion (BIC); and
Finally, after the model is established, future data can be predicted through the model. A sketch of the first two steps is given below.
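As a rough illustration of the first two steps, the sketch below runs an augmented Dickey-Fuller stationarity test and computes the ACF/PACF with statsmodels on a synthetic series; the differencing decision rule and all variable names are assumptions for the example.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, acf, pacf

rng = np.random.default_rng(2)
# Synthetic non-stationary series: trend + noise
series = 20.0 + 0.05 * np.arange(300) + rng.normal(0.0, 0.2, 300)

# Step 1: stationarity test; difference once if the ADF test does not reject
p_value = adfuller(series)[1]
work = series if p_value < 0.05 else np.diff(series)
print("ADF p-value:", round(p_value, 4), "-> differenced:", p_value >= 0.05)

# Step 2: ACF/PACF values used to pick candidate (p, q) orders manually
print("ACF :", np.round(acf(work, nlags=6), 3))
print("PACF:", np.round(pacf(work, nlags=6), 3))
```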
In the ARIMA model, the parameters $(p, q)$ can be determined manually through the ACF and PACF, whose visualizations are shown in Figure 4; in this way, $(p, q)$ was determined as (3,3). Because order determination through the ACF and PACF involves strong subjective factors, the order can also be determined with the information criterion method. In this experiment, the order selected by the information criterion index function was $(p, q) = (5,3)$. The data credibility results obtained by the model with the two different parameter sets are shown in Figure 5. Observing Figure 5a,b, the difference between the results was not very great, so both sets of parameters can be used. In this paper, the parameters $(p, q)$ were finally confirmed as (5,3). A sketch of the order comparison is shown below.
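For reference, the sketch below compares two candidate orders by their information criterion values using statsmodels; the differencing order d = 1 and the use of both AIC and BIC are assumptions for illustration, since the text does not fix them here.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
series = 20.0 + 0.05 * np.arange(300) + rng.normal(0.0, 0.2, 300)

# Candidate (p, q) orders from the ACF/PACF reading and the criterion method
for p, q in [(3, 3), (5, 3)]:
    fitted = ARIMA(series, order=(p, 1, q)).fit()   # d = 1 assumed
    print(f"ARIMA({p},1,{q}): AIC={fitted.aic:.1f}  BIC={fitted.bic:.1f}")
# The order with the lower criterion value is preferred.
```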
2.5.2. Back Propagation (BP) Neural Network
The BP neural network is a multi-layer feedforward neural network [25,26,27] consisting of an input layer, an output layer, and one or more hidden layers. Its main characteristics are forward propagation of signals and back propagation of errors. Specifically, the training process of the network can be divided into two stages. The first stage is forward propagation, in which the input passes from the input layer through the hidden layer(s) and the result is finally obtained from the output layer. In the second stage, error back propagation, the error and gradient are propagated from the output layer to the hidden layer(s) and finally back to the input layer. The basic structure is shown in Figure 6.
Suppose the network has $n$ inputs, $m$ outputs, and one hidden layer containing $l$ neurons. Let the output of the hidden layer be $h_j$, its bias $b_j$, its weights $w_{ij}$, and its activation function $f_1$; let the output of the output layer be $y_k$, its bias $c_k$, its weights $v_{jk}$, and its activation function $f_2$. The output of the hidden layer is calculated as:

$$h_j = f_1\!\left(\sum_{i=1}^{n} w_{ij}\, x_i + b_j\right), \quad j = 1, 2, \ldots, l$$

The output of the output layer is calculated as:

$$y_k = f_2\!\left(\sum_{j=1}^{l} v_{jk}\, h_j + c_k\right), \quad k = 1, 2, \ldots, m$$

The error function of the network is defined as:

$$E = \frac{1}{2}\sum_{k=1}^{m}\left(d_k - y_k\right)^2$$

where $d_k$ is the true value (expected value) of the output.
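The following NumPy sketch implements the forward pass and one gradient-descent update for a single-hidden-layer network matching the equations above, with a sigmoid hidden activation and a linear output; the activation choices, symbol names, and learning rate are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b, V, c):
    """Forward pass: h = f1(W x + b), y = f2(V h + c) with linear f2."""
    h = sigmoid(W @ x + b)
    y = V @ h + c
    return h, y

def train_step(x, d, W, b, V, c, lr=0.01):
    """One back-propagation update minimizing E = 0.5 * sum((d - y)^2)."""
    h, y = forward(x, W, b, V, c)
    err_out = y - d                            # dE/dy
    err_hid = (V.T @ err_out) * h * (1.0 - h)  # dE/d(hidden pre-activation)
    V -= lr * np.outer(err_out, h)
    c -= lr * err_out
    W -= lr * np.outer(err_hid, x)
    b -= lr * err_hid
    return 0.5 * float(np.sum((d - y) ** 2))

# Tiny example: 3 inputs, 6 hidden neurons, 1 output
rng = np.random.default_rng(4)
W = rng.normal(0, 0.5, (6, 3)); b = np.zeros(6)
V = rng.normal(0, 0.5, (1, 6)); c = np.zeros(1)
x, d = np.array([0.2, 0.5, 0.1]), np.array([0.7])
for _ in range(200):
    loss = train_step(x, d, W, b, V, c)
print("final error:", loss)
```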
The objective of the training process is to reduce the error as much as possible: the forward pass computes the outputs $y_k$ and the error $E$, the error is then back-propagated to modify the weight coefficients of each layer, and this is repeated iteratively until training ends.
This neural network also takes advantage of the spatio-temporal relationship between data sources and uses data from neighboring data sources as input for learning. During the learning process, the number of neurons in each layer has a large impact on the training results, as it affects not only the training speed but also the training accuracy. Here, with the other parameters unchanged, we set the number of hidden layers of the network to 2; the influence of the number of neurons in each layer on the training accuracy is shown in Figure 7.
In Figure 7, it can be seen that when the number of neurons in each hidden layer is six, the error is the smallest. Therefore, the number of neurons in each hidden layer was set to six, and among the other training parameters, the learning rate was set to β = 0.01.
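A compact way to reproduce such a configuration is with scikit-learn's MLPRegressor, a feedforward network trained by back propagation; the sketch below uses two hidden layers of six neurons and a 0.01 learning rate as stated above, while the synthetic fusion data and feature layout are assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)
# Assumed layout: each row holds readings from heterogeneous neighboring
# sources; the target is the value of the monitored source at the same time.
X = rng.normal(size=(500, 4))
y = 0.6 * X[:, 0] - 0.3 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(0, 0.05, 500)

net = MLPRegressor(hidden_layer_sizes=(6, 6),    # two hidden layers, 6 neurons each
                   activation="logistic",        # sigmoid-style activation
                   learning_rate_init=0.01,
                   max_iter=2000,
                   random_state=0)
net.fit(X, y)

fused_prediction = net.predict(X[:1])[0]   # theoretical real value (fused)
print(fused_prediction, y[0])
```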
4. Discussion
This paper used two kinds of data credibility: direct credibility and heterogeneous credibility. Their advantages over using either one alone are as follows. Direct credibility exploits the time relationship contained in the historical data of the target source; compared with the historical data of other sources, the historical and current data of the same target source have the greatest correlation under normal conditions, which makes direct credibility well suited to assessing the data under normal conditions. Heterogeneous credibility exploits the spatio-temporal relationship between data sources, adding another dimension to improve the evaluation accuracy; this is vital when the historical data of the target source are not reliable, for example, when the source has been damaged. The combination of direct and heterogeneous credibility therefore makes the method more accurate and more robust than using either one alone.
Multi-sensor data fusion is an efficient tool for improving performance because it exploits the spatio-temporal relationship between data sources. However, several challenges and directions for future work remain.
- (1)
Improvements of the application of the method
In this work, we proposed a method that effectively utilizes the spatio-temporal relationship between data sources. To simplify the problem, data integrity and time lag between data were not considered in our experimental setup. However, in field measurements, missing data is one of the most common problems, and when sources lose data, the performance of this method will degrade. Given the volume and redundancy of wireless sensor network data, proper data preprocessing methods should be applied to deal with missing data and other data problems. In general, data interpolation is an effective approach, including linear, quadratic, and cubic interpolation (a brief sketch is given at the end of this section).
- (2)
Development of the fusion algorithm
Each fusion method has its own set of advantages and limitations. Combining several different fusion schemes has proven to be a useful strategy that may achieve better results [29]. However, there are no established criteria for the selection and arrangement of fusion schemes [30]. Further investigations are needed to design a general framework for combining fusion algorithms.
- (3)
Enhancement of the robustness of the model
The proposed method uses many neighboring sources to improve performance, but it also has limitations: when many data sources are interfered with simultaneously, the model may fail. We expect that future research will continue to optimize the model.
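Relating to point (1) above, the sketch below fills gaps in a sensor series with the interpolation methods mentioned there, using pandas; the missing-value pattern and variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# A sensor series with two missing readings (assumed example data)
readings = pd.Series([20.1, 20.4, np.nan, 21.0, np.nan, 21.7, 22.0])

linear = readings.interpolate(method="linear")
quadratic = readings.interpolate(method="quadratic")   # requires SciPy
cubic = readings.interpolate(method="cubic")           # requires SciPy

print(pd.DataFrame({"raw": readings, "linear": linear,
                    "quadratic": quadratic, "cubic": cubic}))
```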