1. Introduction
Indoor localization schemes, also termed positioning schemes, have received significant attention recently because of their potential for use in areas such as smart factories, mixed reality, indoor navigation, security, and advertising services [1]. In general, indoor localization can provide the following benefits: a better user experience when navigating indoor spaces where GPS is not practical; support for smart building operations and enhancements; improved efficiency of robots or unmanned aerial vehicles (UAVs) in smart factories; and easier location of equipment for users [2,3]. However, most studies so far have focused on estimating the two-dimensional (2D) position of the target. It is also essential to determine the three-dimensional (3D) position of the target, that is, its height as well (for instance, the height of a robot’s arm or UAV in a smart factory, the height of equipment in a building, and the height of security features).
Most current indoor localization technologies are based on time, angle, or electromagnetic wave data; these are known as the time of arrival, angle of arrival, and received signal strength (RSS) schemes, respectively [4,5]. Among the various indoor localization schemes available, the RSS-based scheme is used widely owing to its simplicity and low hardware requirements. RSS-based methods can be divided into two categories: trilateration, which is based on the range estimated from the RSS values, and fingerprinting, which is based on RSS fingerprint matching. Considering the variations in the RSS values in indoor spaces, however, it is difficult to accurately determine the distance information from the RSS data.
Previous studies on indoor localization suggest that several methods use the fingerprint matching scheme as the basic scheme for the localization of the target. The main idea is to first build a fingerprint database that collects the surrounding signatures at every predefined location in the areas of interest. Subsequently, the position of the target is estimated by matching the measured fingerprint with the database. Many researchers have striven to exploit the RSS value signatures for RSS-based fingerprinting techniques owing to the simplicity and low hardware requirements of the process. The first fingerprinting method based on a Wi-Fi device was introduced in [6]. The authors determined the fingerprints of the RSS value and then used a deterministic method, namely, the k-nearest neighbors technique, for position estimation. Subsequently, RSS measurements from other transmitter devices, such as RFIDs [7], Zigbee [8], and Bluetooth devices [9], have also been used for localization based on the fingerprinting technique. Moreover, classical machine learning methods such as the support vector machine model have been employed with the RSS fingerprinting technique [10]. In [11], a probabilistic Bayesian method was introduced to determine the difference between the test and saved RSS values. The main challenge in the case of RSS-based fingerprinting localization methods is that their positioning accuracy is readily affected by the random fluctuations in the RSS values caused by fading and the multipath effect. In addition, the complexity of the matching algorithm as well as that of classical machine learning algorithms increases significantly as the number of positions to be estimated increases. Thus, more storage space and additional computing resources are required.
With the emergence of graphics processing units (GPUs), the convolutional neural network (CNN) became a focus of research interest again in 2012 [12]. Significant advances were made in image processing through the development of CNNs such as AlexNet, ZFNet, and GoogLeNet [13,14]. Since CNNs exhibit improved performance by extracting more features from raw data during image classification, many researchers have tried to use them with one-dimensional (1D) signals such as temporal signals. Moreover, a 1D CNN was developed recently to reduce the computational complexity for 1D signals [15,16].
In this paper, we propose an indoor 3D localization scheme based on both fingerprinting and a 1D CNN. Instead of using the conventional fingerprint matching method, in the proposed scheme, the 3D positioning problem is transformed into a classification problem, and a 1D CNN model that uses the RSS time-series data from Bluetooth low-energy (BLE) beacons is used for classification. The contributions of this study can be summarized as follows:
We propose an indoor 3D localization scheme based on both fingerprinting and a 1D CNN. While most of the studies so far have focused on estimating the 2D positional information of the target, we propose a 3D localization scheme based on the fingerprinting technique. We convert the 3D positioning problem into a classification problem by dividing the 3D space into a set of unit cubic grids and process the RSS time-series BLE signal as a 1D signal in order to solve the localization problem using a 1D CNN.
We develop a 3D positioning system, which consists of BLE beacons and a Raspberry Pi receiver, for evaluating the performance of the proposed scheme. Using the developed system, time-series RSS data are collected from the beacons at each location, and the collected data are divided into training and testing datasets for the 1D CNN model.
We evaluate the performance of the proposed scheme through comprehensive tests. First, the convergence and accuracy of the 1D CNN scheme are evaluated, and the effects of data preprocessing and the kernel size on the proposed scheme are investigated. Next, we compare the classification accuracy of the proposed scheme with that of the conventional common spatial pattern (CSP) algorithm.
The remainder of this paper is organized as follows. In Section 2, we briefly review the relevant literature and compare the strengths and weaknesses of existing approaches. In Section 3, we introduce the characteristics of the BLE signal and describe the developed 3D localization system. In Section 4, the indoor 3D localization scheme based on both fingerprinting and a 1D CNN is proposed. The performance of the proposed scheme is evaluated through tests, whose results are described in Section 5. Finally, the conclusions of this study are summarized in Section 6. Note that the abbreviations and descriptions used in this manuscript are summarized in Appendix A (see Table A1).
2. Related Works
In indoor fingerprint positioning, Wi-Fi signals are the most commonly used measurements. Baoqi Huang et al. introduced an eight-layer DNN model for indoor Wi-Fi fingerprinting localization in [17]. For signal processing, they utilized stacked autoencoders to extract representative features from the collected data and defined a special loss function to train the DNN. However, the positioning errors exceeded 2 m in an indoor environment. In [18], Yifan Wang et al. developed a Wi-Fi fingerprint location recognition DNN method based on the geometric distribution of fingerprint points to estimate the user’s position, and then exploited the constrained Kalman filter algorithm and the hidden Markov model to optimize the final results. However, this fused method increased the complexity of the algorithm and produced large errors at specific locations. Although Wi-Fi signals have physical characteristics such as high transmit power, more bandwidth, and large coverage, cheaper and lower-power BLE sensors are also used in related fields. In [19], Charu Jain et al. compared a variety of machine learning methods based on BLE signals for solving the floor classification problem. However, they did not cover more advanced neural networks. Since Geoffrey E. Hinton’s research group at the University of Toronto won the 2012 ImageNet Large-Scale Visual Recognition Challenge with their CNN, AlexNet [13], CNNs have attracted the attention of many researchers not only in the image field but also in other fields. In [20], Danshi Sun et al. utilized a CNN to achieve positioning in a multi-floor, large-area indoor environment. They converted the BLE RSS values into a “fingerprint image” to train a 2D CNN and predict floor categories, and then combined the result with magnetic field data to locate the transmitter’s position. In [21], a PSO-aided 2D CNN architecture was developed for an indoor positioning system, in which PSO is used for optimal parameter selection. However, both works restructured the time-series BLE signal vector into an image-like matrix, which leads to high time complexity and unnecessary execution time for the localization method. In [22], Kodai Tasaki et al. developed a 3D CNN based on BLE RSS values to counter the statistical signal fluctuations caused by random wireless channels. By taking the spatiotemporal structure of the RSSI dataset into consideration, the method achieved better fingerprinting results than a 2D CNN. However, it requires spatial correlation among all the obtained RSS values as well as special and complex operations for constructing the input dataset.
For indoor localization, height information is also an important factor in industrial and commercial scenarios. The above-mentioned works perform well in floor-level and 2D positioning, but they did not focus much on 3D position information. In addition, because BLE signal data are collected as a time series, it is worthwhile to develop a 1D CNN that takes full advantage of the temporal signal. In this paper, we use a 1D CNN to solve the 3D spatial position classification problem, with the aim of achieving more accurate and efficient positioning. In addition, this method shows excellent positioning accuracy in small spatial environments.
4. Proposed Scheme
4.1. System Framework
Without loss of generality, a 3D space can be divided into a set of unit cubic grids. The considered space is divided into M unit cubic grids, which can be considered as M distinct spatial locations. When N BLE beacons are used, a time series of the RSS values from the N beacons can be collected at each location, and the data collected at each location are assigned a label. Note that each label indicates the premeasured 3D coordinates of the corresponding cubic grid. The RSS measurement result at a given location is combined with its label to form a training sample for the proposed scheme. The time series of the RSS values at a given location from the n-th beacon is an ordered sequence of individual RSS values of fixed length.
In this study, we used N (=8) BLE beacons and divided the space into M (=16) grids with a unit size of 1 m × 1 m × 1 m, as shown in Figure 4. Therefore, the RSS value vector at a given position stacks the time series from the 8 beacons, and the time series of the RSS values for all the positions are collected in the same way. Note that separate datasets serve as the input for the training phase and as the input for the prediction used to evaluate the performance of the proposed scheme. In the prediction phase, the location of the target is estimated based on its measured RSS values. The layout of the proposed scheme is shown in Figure 5. As can be seen from the figure, the scheme consists of two phases: the training phase and the positioning phase. The 1D CNN model is trained using the training dataset in the training phase. Next, the trained model is used to predict the location of the target from the input data.
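As an illustration of how the 3D positioning problem becomes a classification problem, the following minimal sketch maps each class index to the coordinates of one unit cube. The 4 × 2 × 2 arrangement and the use of cube centres as labels are assumptions made only for this example; the actual layout is the one shown in Figure 4, and `build_label_table` is a hypothetical helper name.

```python
from itertools import product

# Assumed layout: 16 unit cubes arranged as 4 x 2 x 2 (hypothetical, see Figure 4
# for the real arrangement). Each cube measures 1 m x 1 m x 1 m.
GRID_SHAPE = (4, 2, 2)
CELL_SIZE_M = 1.0


def build_label_table(shape=GRID_SHAPE, cell=CELL_SIZE_M):
    """Map each class index to the centre coordinates (x, y, z) of its unit cube."""
    table = {}
    cells = product(*(range(n) for n in shape))
    for label, (ix, iy, iz) in enumerate(cells):
        table[label] = ((ix + 0.5) * cell, (iy + 0.5) * cell, (iz + 0.5) * cell)
    return table


LABELS = build_label_table()
```

With such a table, a classifier only needs to output one of 16 class indices, and the 3D coordinates follow from a lookup.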
4.2. Data Preprocessing
Generally, it is essential to perform data preprocessing for efficient model convergence when using a neural network. The following data preprocessing methods are employed in the proposed scheme.
4.2.1. Homogenization of RSS Values
Theoretically, a signal scanning process should be enough to obtain the RSS values of all the available BLE sensors in the surrounding environment. In actual implementations, however, no signal scanning process can obtain all the signals because of differences in the signal strength through the different propagation channels and the resulting packet loss. In addition, the receiver may not be able to obtain the same number of temporally consecutive data values from all the beacons owing to the differences in the sampling time. If the length of the samples for each label is not the same, a bias can occur in the training phase. Therefore, it is essential to construct a homogeneous dataset from the heterogeneous dataset. Hence, we constructed RSS value vectors of the same length from consecutive samples to ensure that the input data requirement for the 1D CNN was met. In the proposed scheme, the minimum principle is adopted to prepare the valid signal frame for training the 1D CNN. This means that we chose the sample with the minimum length as the benchmark for all the samples at all 16 sampling positions.
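The minimum principle described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the collected RSS sequences are held in a dictionary keyed by a hypothetical (location, beacon) identifier, and it truncates every sequence to the length of the shortest one.

```python
def homogenize(series_by_key):
    """Truncate every RSS sequence to the length of the shortest sequence.

    series_by_key: dict mapping a (hypothetical) identifier to a list of RSS values.
    Returns a new dict in which all lists have the same length.
    """
    min_len = min(len(seq) for seq in series_by_key.values())
    return {key: list(seq[:min_len]) for key, seq in series_by_key.items()}


# Example: two beacons received a different number of packets at one location.
raw = {
    ("loc0", "beacon1"): [-50, -51, -52, -50],
    ("loc0", "beacon2"): [-60, -61, -59],
}
uniform = homogenize(raw)
```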
4.2.2. Elimination of Outlier Values
Outliers are generated when the sensor is switched on or off or when there is significant interference, such as that from human activity. To reduce the effect of outliers, the interquartile range (IQR) method has been introduced [27]. The idea of this method is to first rank the data and then choose the quartile points, denoted as Q1, Q2, and Q3, in ascending order. Then, using the first quartile point, Q1, and the third quartile point, Q3, the reliable interquartile range is obtained as their difference:
$$\mathrm{IQR} = Q_3 - Q_1$$
After obtaining the IQR, the first and second inner limitation (IL) values are calculated from Q1, Q3, and the IQR. If the RSS value is larger than the second IL value or smaller than the first IL value, it is regarded as an outlier, while the data values that lie within the confidence interval, that is, between the first and second IL values, can be trusted, as shown in Figure 6. Therefore, we can construct a training dataset from the raw observations by discarding the outliers. This process ensures that the trained model is not polluted by unstable BLE RSS outlier values.
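The IQR-based filtering can be sketched as below. Two points are assumptions of this example rather than details confirmed by the text: the inner limits are taken as the commonly used fences Q1 − 1.5·IQR and Q3 + 1.5·IQR, and the quartiles are computed with the inclusive method of Python's `statistics.quantiles`.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Keep only values between the lower and upper inner limits.

    k = 1.5 is the conventional inner-fence multiplier (an assumption here).
    """
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Preserve the original (chronological) order of the retained samples.
    return [v for v in values if lower <= v <= upper]


# Example: one reading (-90) is far from the rest of the sequence.
rss = [-50, -51, -49, -50, -52, -48, -90]
cleaned = remove_outliers_iqr(rss)
```

Because filtering shortens a sequence, it would normally be applied before the sequences are cut to a common length.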
4.2.3. Data Normalization
The BLE RSS values typically ranged from −70 dB (lowest) to −30 dB (highest). However, we normalized the scale of the RSS values because the input values should be limited to the range (0, 1) to ensure high CNN training efficiency and convergence speed [28]. The min–max normalization method was adopted for this [29]:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where $x_{\min}$ and $x_{\max}$ are the minimum and maximum RSS values of the data collected from a beacon. Note that the measured values of all the beacons were normalized independently for every location, instead of normalizing the measured values together.
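A minimal sketch of per-beacon min–max normalization is given below, assuming the standard form (x − x_min) / (x_max − x_min). The handling of a constant sequence (all values mapped to 0.0) is a choice made for this example only.

```python
def min_max_normalize(values):
    """Scale one beacon's RSS sequence at one location to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant sequence: avoid division by zero (example-specific choice).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Each beacon is normalized independently, as described in the text.
beacon_a = min_max_normalize([-70, -50, -30])
beacon_b = min_max_normalize([-65, -60, -55, -60])
```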
4.3. 1D CNN Model
In the case of conventional schemes, the theoretical relationship between the RSS value and the distance is used to estimate the location of the target. It is known that theoretically the time-series of the RSS values for a specific location does not change significantly over time. Based on this characteristic, many researchers have introduced RSS-based fingerprinting schemes based on statistical features such as the entropy, mean, and variance of the time-series of the RSS values to estimate the location. However, this requires designing and extracting features related to the temporal characteristics of RSS values based on the specific situation and thus is not a universal approach [30,31,32].
Since CNNs show excellent performance with respect to the extraction of additional features from raw data during image classification, many researchers have attempted to use them with 1D signals such as temporal signals. Moreover, a 1D CNN was recently developed to reduce the computational complexity for 1D signals. Since the 3D positioning problem was transformed into a classification problem and the RSS time-series of the BLE signals was considered a 1D signal in the proposed scheme, a 1D CNN was adopted for solving the problem.
Figure 7 shows the general process of 1D convolution. Randomly initialized filters, also termed kernels, scan the entire input data along a certain stride and perform convolution extraction at each position. The extracted outputs make up the feature map.
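The sliding-window operation can be illustrated with a single-channel, single-filter sketch in plain Python. It is a simplification: no padding or bias is used, the kernel values are fixed rather than learned, and, as in most CNN libraries, the kernel is applied without flipping (cross-correlation).

```python
def conv1d_valid(signal, kernel, stride=1):
    """Slide `kernel` over `signal` without padding and return the feature map."""
    k = len(kernel)
    return [
        sum(signal[start + j] * kernel[j] for j in range(k))
        for start in range(0, len(signal) - k + 1, stride)
    ]


signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]  # illustrative values; real filters are learned
feature_map = conv1d_valid(signal, kernel)            # stride 1
feature_map_s2 = conv1d_valid(signal, kernel, stride=2)
```

A window of size 3 with stride 1 shortens a sequence by 2, so a 32-value RSS sequence yields 30 outputs per filter.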
In the next section, the 1D CNN model of the proposed scheme is described in detail. We used five different layers, which are the convolutional layer, pooling layer, dropout layer, fully connected (FC) layer, and output layer.
4.3.1. Convolutional Layer and Pooling Layer
The function of the convolutional layer is to extract the feature map. In the convolutional layer, filters that are randomly generated using different initialization values traverse every sample of the input training dataset along a specific stride and extract features from it. In this manner, the feature map is obtained as the output of the convolution layer, as shown in Figure 7. The number of filters used affects the resolution of the feature output. Generally, the higher the number of filters used, the higher the number of features extracted from the original signal and thus the higher the resolution. The hyperparameters of the convolution layer include the convolution filter size and the stride size, and these determine the size of the feature map. For example, a convolution layer without padding that uses 12 filters with a window size of 3 and a stride of 2 maps an input of length L to an output of length ⌊(L − 3)/2⌋ + 1 with 12 channels.
The output of the convolutional layer exhibits information redundancy and thus a high computing cost. The function of the pooling layer is to down-sample and resample the input data to extract additional features and compress the data to improve the computational efficiency. The pooling function abstracts the input data within the window interval and regards the output as a representative value for the pooled RSS features, as shown in
Figure 8. The main parameter of this layer is the stride size, which determines the width of the information extracted.
Both max pooling and average pooling are commonly used pooling functions. The max pooling function calculates the maximum value of the RSS value vector within the window, while the average pooling function calculates the mean value of the window. According to the relevant theory, during feature extraction, errors arise primarily because of two factors: (1) the increase in the variance of the estimates caused by the restricted neighborhood size and (2) the offset of the estimated mean value caused by the parameter error of the convolution layer. In general, the average pooling function can reduce the first error, while the max pooling function can reduce the second error. Hence, the max pooling function is used in the proposed scheme. For the entire network, this merely meant the down-sampling of the results obtained from the upper layer and reducing the number of training parameters to avoid overfitting.
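The difference between the two pooling functions can be seen in a short sketch. The window size and stride of 2 match the values reported for the proposed scheme; the input numbers are arbitrary, and `pool1d` is a hypothetical helper name.

```python
def pool1d(values, size=2, stride=2, mode="max"):
    """Down-sample a 1D feature sequence with max or average pooling (no padding)."""
    if mode == "max":
        reduce_window = max
    else:
        reduce_window = lambda window: sum(window) / len(window)
    return [
        reduce_window(values[start:start + size])
        for start in range(0, len(values) - size + 1, stride)
    ]


features = [1, 3, 2, 8, 5, 4]
max_pooled = pool1d(features, mode="max")
avg_pooled = pool1d(features, mode="avg")
```

Both variants halve the sequence length here; they differ only in which representative value is kept for each window.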
4.3.2. Dropout Layer, FC Layer, and Output Layer
The dropout layer is used to solve the problem of overfitting in deep learning [33]. The underlying idea of the dropout layer is to randomly disconnect nodes at a given rate. In the proposed scheme, this layer is placed after the convolutional layer to improve the network diversity. After cross-validation, the best results were obtained when the implicit node dropout rate was set to 0.5.
The output of the max pooling layer is stored in a long vector after passing through the flatten layer; the data become one-dimensional and are used as the input of the FC layer. The FC layer is usually at the tail of the CNN, and it is similar to the layer of a common artificial neural network, known as the Dense layer. In the FC layer, all the neurons are fully connected by weights, and each neuron has a class score. The cumulative sum of the neurons is the input to the subsequent output layer.
The SoftMax function, which is used widely for multiclass classification, is employed as the output activation function in the output layer. In the case of the considered system, the class with the highest probability amongst the 16 categories is taken as the estimated label for the corresponding input. Since the sum of the probabilities for all the classes is equal to 1, we can estimate the location of the target in terms of the spatial coordinates by performing the regression, as follows:
$$(\hat{x}, \hat{y}, \hat{z}) = \sum_{i=1}^{16} P(i)\, C_i \qquad (7)$$
where C is the set of all the predefined coordinates for all the 16 reference locations, with $C_i$ denoting the coordinates of the i-th location, and P(i) is the estimated probability of each class. ReLU was selected as the activation function for all the applicable layers except the output layer, where the SoftMax function was used. Adam was used as the optimization algorithm instead of the classical stochastic gradient descent method to update the network weights iteratively based on the training data [34]. The categorical cross-entropy loss function was adopted because it is well suited for multiclass classification tasks. The parameters for the 1D CNN model are listed in Table 1.
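The step from class probabilities to spatial coordinates can be sketched as follows, assuming that the regression is the probability-weighted sum of the reference coordinates. The two-class example and its coordinates are hypothetical and serve only to keep the arithmetic easy to follow.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expected_position(probs, coords):
    """Probability-weighted sum of reference coordinates (assumed regression form)."""
    return tuple(
        sum(p * c[axis] for p, c in zip(probs, coords)) for axis in range(3)
    )


# Hypothetical two-class example with equal scores.
coords = [(0.0, 0.0, 0.0), (1.0, 2.0, 4.0)]
probs = softmax([0.0, 0.0])
position = expected_position(probs, coords)
```

When one class dominates, the estimate approaches that class's reference coordinates; when the probabilities are spread, the estimate falls between the candidate locations.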
4.3.3. Summary of 1D CNN Model of Proposed Scheme
In a typical CNN, the data type is generally a single-channel grayscale image or a three-channel color image. Analogously, the data used in this study can be considered multichannel monoscale (gray) images. However, in contrast to the case for actual images, the total number of beacons was taken to be the number of channels for training the 1D CNN model. In other words, the number of channels was set to 8 because 8 BLE beacons were used. The RSS value sequences from all the BLEs were divided into individual samples, with each consisting of 32 RSS values, in a sequential and nonoverlapping manner based on the chronological order of reception.
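The sequential, nonoverlapping segmentation into samples of 32 RSS values with 8 channels can be sketched as below. It assumes the per-beacon sequences have already been aligned to a common length; `segment` is a hypothetical helper name, and leftover values that do not fill a full window are discarded in this example.

```python
def segment(beacon_series, window=32):
    """Cut aligned per-beacon RSS sequences into nonoverlapping samples.

    beacon_series: list with one RSS list per beacon (the channels).
    Returns a list of samples, each shaped [window][n_beacons].
    """
    n_windows = min(len(seq) for seq in beacon_series) // window
    samples = []
    for w in range(n_windows):
        start = w * window
        samples.append(
            [[seq[start + t] for seq in beacon_series] for t in range(window)]
        )
    return samples


# Synthetic data: 8 beacons with 70 values each give 2 full samples.
series = [[b * 1000 + t for t in range(70)] for b in range(8)]
samples = segment(series)
```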
To ensure a wider feature extraction range, which is expected to yield more features from the input data, we did not go directly to the pooling layer after performing one convolution. Instead, we performed the convolutional extraction twice; that is, after the first convolution layer, we added a second convolution layer to execute feature extraction again. In this manner, we not only limited the number of parameters but also improved feature extraction.
To begin with, each input batch has the shape (number of samples, 32, 8), where 32 is the RSS value length of one sample and 8 is the number of BLE beacons. The input is processed by two convolution operations with 32 filters, each with a window size of 3, the default stride of 1, and no padding, which yields an output map of size 28 × 32 per sample. After that, the dropout layer is added to mitigate the effects of overfitting. By passing through the max pooling layer with size 2, stride 2, and no padding, the local features of size 28 × 32 are down-sampled to local features of size 14 × 32. Next, the data pass through the flatten layer, are converted into 1D vector data of length 448, and are then fed to the FC layer. After the data have passed through the FC layer, the output layer with the SoftMax function is used to determine the label corresponding to the predicted result. In summary, the convolution filter size and max pooling size are 3 and 2, respectively; the stride sizes are 1 and 2, respectively; neither layer uses padding; the number of data points in the training batch was set to 32; and the number of parameters for the entire network was 12,698 according to the model's own statistics.
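The per-sample shapes through the network follow from the stated window sizes and strides and can be checked with a short calculation. The sketch assumes that both convolution layers use 32 filters and that dropout leaves the shape unchanged; the FC layer width is not specified here, so the trace stops at the flatten layer.

```python
def out_len(length, window, stride=1):
    """Output length of a 1D convolution or pooling layer without padding."""
    return (length - window) // stride + 1


def trace_shapes(length=32, channels=8, filters=32):
    """Return (layer name, per-sample shape) pairs up to the flatten layer."""
    shapes = [("input", (length, channels))]
    for name in ("conv1", "conv2"):
        length = out_len(length, window=3, stride=1)
        shapes.append((name, (length, filters)))
    length = out_len(length, window=2, stride=2)  # max pooling
    shapes.append(("max_pool", (length, filters)))
    shapes.append(("flatten", (length * filters,)))
    return shapes


for layer, shape in trace_shapes():
    print(f"{layer:>8}: {shape}")
```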
5. Performance Evaluation
The performance of the proposed scheme was evaluated through comprehensive tests. First, the convergence and accuracy of the 1D CNN scheme were evaluated, and the effects of data preprocessing and the kernel size on the proposed scheme were investigated. Next, the location estimation performance of the proposed scheme was evaluated. We compared the classification accuracy of the proposed scheme with that of the conventional CSP algorithm. All the tests were performed using Python 3.8 on a desktop equipped with an Nvidia GeForce GTX 1650 GPU and an AMD FX(tm)-6300 3.50 GHz six-core central processing unit. FeasyBeacon 5Mart FSC-BP104, which is a Bluetooth 5.0 BLE smart beacon with a TI CC2640R2F chipset and works at an ISM frequency of 2.4 GHz, was used [
35]. We set the data transmission interval to 100 ms and the transmission power to +5 dBm. In addition, we set the broadcast mode as the transmission format, and the broadcast packets followed the specifications designed by the company.
5.1. Loss and Accuracy Performance of Proposed Scheme
The effectiveness of the proposed scheme was evaluated based on the loss function and accuracy of the training process. The validation process was performed for up to 20 epochs. During the tests, approximately 200 samples were collected at each predefined location, and the length of the RSS value for each sample was set to 32. The complete dataset of 3200 samples was randomly divided as follows: 70% training data, 10% validation data, and 20% test data. The order of the samples in the training dataset was randomly shuffled.
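The 70/10/20 split can be sketched as follows. The fixed seed and the helper name `split_dataset` are assumptions for reproducibility of this example; the text does not state how the random division was implemented.

```python
import random


def split_dataset(samples, train=0.7, val=0.1, seed=0):
    """Shuffle the samples and split them into training, validation, and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seed chosen only for this example
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    train_set = items[:n_train]
    val_set = items[n_train:n_train + n_val]
    test_set = items[n_train + n_val:]
    return train_set, val_set, test_set


# 3200 placeholder sample indices stand in for the real RSS samples.
train_set, val_set, test_set = split_dataset(range(3200))
```

For 3200 samples this gives 2240 training, 320 validation, and 640 test samples.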
Figure 9 shows the loss and accuracy performance of the proposed scheme. As can be seen from the figure, both curves changed significantly and converged at approximately Epoch 3. Since the convergence rate was high, it can be concluded that the dataset was suitable for the proposed scheme.
5.2. Effect of Data Preprocessing on Proposed Scheme
A major benefit of neural networks is that prior knowledge of the noise distribution is not required. Noisy RSS value measurements can be used directly to train the network, and the neural network is capable of characterizing the noise and compensating for it to determine the target position with accuracy. To estimate the effect of outlier preprocessing on the training of the 1D CNN model, comparative tests were performed; the results are shown in Figure 10. As can be seen from the figure, the dataset subjected to outlier preprocessing resulted in better loss and accuracy performance than that not subjected to it. In addition, it can be seen from the loss function curve that the convergence point for the dataset not subjected to outlier preprocessing is at approximately Epoch 10. Thus, convergence in this case took approximately three times longer than that for the dataset subjected to outlier preprocessing (approximately Epoch 3). This means that noisy interference or outlier values can add to the complexity of network learning and that data preprocessing is necessary for efficient model training.
5.3. Effect of Kernel Size on Proposed Scheme
The effect of the kernel size used for convolution was also evaluated to optimize the performance of the 1D CNN. We used kernel sizes of 3, 6 and 12 to test the 1D CNN model in terms of loss and accuracy. The results are shown in
Figure 11.
As can be seen from the figure, the performance deteriorated as the kernel size was increased. Specifically, both the loss and the accuracy were the worst for the kernel size of 12. On the other hand, for the kernel sizes of 3 and 6, the performances were similar. This means that a large convolution window is not preferable for extracting more reliable features from a large set of widely fluctuating RSS values. In addition, a large window also increases the computational burden. Thus, the size of the convolution kernel was set to 3.
5.4. Position Estimation Performance
After the completion of the training and validation processes, we evaluated the performance of the proposed scheme in 3D position estimation using the test dataset. Since the 1D CNN model provides the probability of each possible category of the target’s location, the coordinates of the target can be calculated using Equation (7). To allow for a visual comparison of the estimated and actual positions, we tested 340 samples from all 16 categories. The results are plotted in Figure 12. In the figure, the green stars are the estimated positions of the target while the red crosses represent the actual positions. As can be seen from the figure, the proposed scheme could estimate the 3D position accurately.
Next, we compared the classification accuracy of the proposed scheme with that of the conventional CSP algorithm.
Figure 13 shows the comparison of the results obtained using the proposed scheme and the CSP algorithm for all the classifications. As shown in the figure, the 1D CNN model outperformed the CSP algorithm in the case of every position category.
Figure 14 shows the cumulative distribution functions (CDFs) of the estimated coordinate errors for the 1D CNN and CSP schemes. As can be seen from the figure, the 1D CNN scheme significantly outperforms the CSP scheme. The localization errors of the two schemes are also compared in
Table 2. As per the test results, the mean error of the proposed 3D localization scheme based on the 1D CNN is 0.25 m, while that for the CSP scheme is approximately 2 m.
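For readers who wish to reproduce this kind of comparison, the localization error and its empirical CDF can be computed as sketched below. The error is assumed to be the Euclidean distance between the estimated and actual 3D coordinates, and the sample values are made up for illustration; they are not the measured results.

```python
import math


def euclidean_error(estimated, actual):
    """Distance in metres between estimated and actual 3D positions."""
    return math.dist(estimated, actual)  # available in Python 3.8+


def empirical_cdf(errors):
    """Return (error, cumulative fraction) pairs sorted by error."""
    ordered = sorted(errors)
    n = len(ordered)
    return [(err, (i + 1) / n) for i, err in enumerate(ordered)]


# Illustrative errors only (not taken from the experiments).
errors = [0.5, 0.1, 0.3, 0.2]
cdf = empirical_cdf(errors)
mean_error = sum(errors) / len(errors)
```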
The reason the proposed scheme exhibits higher accuracy may be the independence of the data obtained from the beacons. During the experiments, we deployed 8 BLE beacons at different locations, and the Raspberry Pi receiver collected 32 distinct RSS values from each BLE transmitter beacon for one location. This means that the proposed scheme exploits the time-series data of each beacon using the 1D CNN while ensuring that the data from the multiple beacons remain independent.
6. Conclusions
In this paper, we proposed an indoor 3D localization scheme based on both fingerprinting and a 1D CNN. In the proposed scheme, instead of using the conventional fingerprint matching method, the 3D positioning problem is transformed into a classification problem, and a 1D CNN is used with the RSS time-series data from the BLE beacons to determine the target locations. By using a 1D CNN with the time-series data from multiple beacons, the inherent drawback of RSS-based fingerprinting, namely, its susceptibility to noise and randomness, could be overcome, resulting in enhanced positioning accuracy. To evaluate the proposed scheme, we developed a 3D positioning system, including the BLE signal reception and uploading process, introduced multiple signal preprocessing methods, and performed comprehensive tests in real scenarios. On the one hand, we evaluated the proposed 1D CNN model itself. On the other hand, in terms of the positioning accuracy, the results showed that the proposed scheme significantly outperforms the conventional CSP classification algorithm. The accuracy of the proposed scheme in 3D location classification was almost 100%, while that of the conventional CSP scheme was only 70%. Moreover, the mean error of the proposed 3D localization scheme based on the 1D CNN was 0.25 m, while that of the CSP scheme was approximately 2 m. The proposed scheme can be used in small-area indoor environments and improves the practicality of 3D positioning. In future work, we plan to investigate the coverage problem of Bluetooth signals to find the optimal coverage solution, and we will examine how the computational complexity and stability of the model change as the number of nodes varies.