1. Introduction
Vacant space detection systems can help drivers quickly and easily find available parking spaces, reducing the time and frustration associated with circling a lot looking for a space. This can also help to reduce congestion in the parking lot, improve traffic flow, and reduce emissions from idling vehicles. Additionally, parking lot managers can use the data collected by these systems to optimize parking lot layouts and improve overall efficiency.
Recently, many sensors have become available for vacant space detection. For instance, ultrasonic sensors [1,2] use sound waves to detect the presence or absence of a vehicle in a parking space. They are typically installed above each parking space and can accurately measure the distance to the nearest object, such as a vehicle. Magnetic sensors [3,4] are typically installed beneath the surface of the parking space and detect the presence of a vehicle through changes in a magnetic field. Infrared sensors [5] are typically installed above each parking space and detect changes in infrared light caused by the presence of a vehicle. Video cameras [6,7,8,9,10,11] can also be used: they are typically installed above the parking spaces, and image analysis software detects changes in the video feed caused by the presence of a vehicle.
Among these possible sensors, many research works focus on vision-based systems because a single camera can monitor multiple parking slots. Several challenges must be addressed to obtain a workable vacant space detector, such as mutual occlusion patterns, shadows, and variances in layout and size. From a literature review, supervised learning methods (Vu 2019 [12] and Huang 2017 [13]) require massive labeled data to reach a high accuracy of 99.74%. However, the domain gap can significantly reduce accuracy when the model is deployed in a new parking lot. As shown in Figure 1, conventional methods may make wrong predictions in a new testing domain, while our method learns generalized features that work robustly in the new domain.
Transfer learning [14] and unsupervised domain adaptation [15] are popular solutions to the above-mentioned problem. Transfer learning [14] uses a pre-trained model to initialize the weights of the target network. As the pre-trained model is trained on a vast dataset, its parameters can extract semantic information; hence, the target model converges faster with fewer training data. In the vacant space application, the pre-trained model can be trained on data from a source parking lot, and the target network is a detector deployed in a new parking lot. Even with a powerful pre-trained model, transfer learning still requires enough newly labeled samples to fine-tune the target network successfully.
On the other hand, unsupervised domain adaptation [15] requires labeled data on a source domain and unlabeled data on a target domain. An adversarial loss [16] is used to train features shared between the two domains, and the labeled source data are used to train a classifier on top of these shared features. For domain adaptation to succeed, the learned shared features must be good enough for the specific task.
Both transfer learning and unsupervised domain adaptation rely on a pre-trained feature extractor trained on the source dataset. These features should be highly relevant to vacant states and invariant to environmental changes. Since different camera views create different occlusion patterns, learning invariant features for an unseen (target) domain is challenging. To address this challenge, this paper learns features that represent only the objects (parked cars) but not the background, even when the background includes parts of neighboring cars. We use mutual information [17] to measure the amount of information shared between features and objects, as well as between features and backgrounds; the network is trained with an additional variational information bottleneck (VIB) loss.
The idea behind the proposed method is similar in spirit to the feature learning in TCL2021 [18]. TCL2021 [18] uses trajectories of sequential frames to train the detector. The frames are selected before and after a car moves in or out of a slot; hence, they form pairwise vacant and occupied samples that share the same background but differ in vacant state. As shown in Figure 2, the pairwise samples in a trajectory have similar appearances; the only difference is at the pixels where the car is parked. These paired data function as a contrastive loss in which the two samples in a trajectory are pushed far apart; hence, the model focuses on the necessary pixels and ignores the background pixels. While many promising results have been reported in TCL2021 [18], this method relies on a motion classifier to prepare training trajectories. Moreover, the training process needs a task-consistency reward to train the vacant-space detector in a reinforcement manner. This makes training more complex and requires powerful hardware to store all frames of a trajectory for every iteration update.
Unlike the complex training process in TCL2021 [18], our work learns crucial features using a simple process. Here, the model includes a feature extractor and a classifier. Given an input image x, the extractor produces the feature mean μ(x) and the corresponding variance σ(x)². As the classifier must work well with varying features, we use a reparameterization trick to add some uncertainty to the training process. In detail, a latent feature z is sampled from the distribution N(μ(x), σ(x)²); then, the vacant state y is predicted from the sampled feature z by the classifier p(y|z).
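As a rough illustration, the reparameterization step can be sketched as follows (a minimal NumPy sketch; the 4-dimensional latent and the specific values are illustrative, not the dimensions used in our model):

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    # z = mu + sigma * eps with eps ~ N(0, I): the randomness is moved
    # into eps, so mu and sigma remain differentiable in an autodiff
    # framework while z is still a sample from N(mu, sigma^2).
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.array([0.2, -0.5, 1.0, 0.0])    # feature mean from the extractor
sigma = np.array([0.1, 0.1, 0.1, 0.1])  # feature standard deviation
z = reparameterize(mu, sigma, rng)      # latent feature fed to the classifier
```

During training, a fresh z is drawn at every forward pass, which injects the uncertainty the classifier must tolerate; at test time, one can simply use z = μ(x).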
A good feature z must represent the occlusion patterns that predict vacant states well. Hence, the mutual information I(z; y) should be maximized. Additionally, the background information from the input x should be eliminated from z. Hence, the mutual information I(z; x) should be small. The two constraints are optimized together in a variational information bottleneck (VIB) loss [19]. Since this loss can be integrated into any supervised learning framework, the training process is simpler than the task consistency learning in TCL2021 [18].
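Under these definitions, the VIB objective reduces to a cross-entropy term (a variational lower bound related to I(z; y)) plus a weighted KL term (an upper bound related to I(z; x)). The following NumPy sketch shows the standard closed-form version for a Gaussian latent and a standard-normal prior; the weight beta and all shapes are illustrative, not the values used in the paper:

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dims.
    return 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

def vib_loss(logits, label, mu, sigma, beta=0.01):
    # Cross-entropy encourages z to keep label information (maximize I(z; y));
    # the KL term compresses z toward the prior (limit I(z; x)).
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    cross_entropy = -log_probs[label]
    return cross_entropy + beta * kl_to_standard_normal(mu, sigma)
```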
In short, the contributions of this paper are the following:
We propose a learning method to learn better features for vacant parking space detection. The trained model can work better on the unseen domains, while no data from the unseen domain are available in the training phase;
A reparameterization trick is used to learn a classifier that is well adapted to environmental changes;
A variational information bottleneck loss is used to learn features focused on occlusion patterns and eliminate the background.
2. Related Works
Vacant space detection for vehicle parking has been an interesting research topic for many years. Early research, such as Lin 1997 [6], Raja 1999 [20], Raja 2000 [7], and Raja 2001 [21], relied on feature engineering to detect a car in a specific position. Given a selected location, Lin 1997 [6] introduces geometrical models with a spatial contour matching approach and a careful tuning process to fit the specific scenario well. Raja 1999 [20] and Raja 2001 [21] proposed a learning-based approach that models the unknown distribution of images containing vehicles by using higher-order statistics (HOS) as the feature. Raja 2000 [7] advanced to detecting and tracking vehicles in videos by combining statistical knowledge about the visual appearance of vehicles with their motion information. However, most of these works focus on a few slots and cannot use a single camera to manage all slots in a wide parking area, which limits practical applications.
With the development of CCTV systems, applying vehicle detection algorithms for vacant space detection in parking lots has become possible in the recent decade. Therefore, much research has been conducted to pursue desirable performance and overcome practical challenges (Paulo 2013 [22], Lixia 2012 [23], Wu 2007 [24], Huang 2010 [25], and Huang 2013 [26]). Here, images captured by CCTV cameras may include many slots, so a detector must address both localization and classification. Since a car should be parked in a specific 3D slot, prior 3D information can be used for localization. In Huang 2010 [25], geometry and image information are fused to infer the vacant states of all slots, and neighboring slots help to correct the prediction at a query slot via Bayesian inference. Here, six 3D surfaces represent a 3D cube, and each surface is projected onto the 2D image to find a corresponding region. Each region is processed independently, and a hierarchical Bayesian inference fuses the predictions from these planes into a final vacant state. The Bayesian hierarchical structure can model the occlusion relationship among all neighboring slots to improve accuracy. However, the inference time of Huang 2010 [25] is too long, and the computing cost must be lowered to fit real-world applications. Later, in Huang 2013 [26], the Bayesian hierarchy was replaced by a multi-layer inference framework that learns the correction process. While Huang 2013 [26] presented multiple processing steps, Huang 2015 [27] models each processing step as a layer of a single model, which makes the solution of Huang 2015 [27] simpler than the Bayesian hierarchy of Huang 2013 [26]. Although considerable results are achieved, these methods require experts to manually tune hyper-parameters for each inference layer.
With the rapid development of deep learning techniques, many applications are now solved using deep models. The multi-layer framework of Huang 2015 [27] was replaced by a single deep learning model in Huang 2017 [13]: each layer of Huang 2015 [27] is modeled by a specific block, and the network is trained end-to-end. According to the experimental results in Huang 2017 [13], a detector may make a wrong prediction if a car is too big or is not parked at the center of a slot. Therefore, Huang 2017 [13] uses a spatial transformer network to select a suitable region of the input image for classification. To ensure the input covers all necessary information so that the spatial transformer can select a suitable ROI, the authors of Huang 2017 [13] used a normalized three-parking-space image as input and gained higher accuracy in real time. Next, Vu 2019 [12] introduced a contrastive loss to learn better feature representations. Using a deep learning framework, the methods in Huang 2017 [13] and Vu 2019 [12] perform better than traditional approaches without elaborate feature engineering or heavy hyper-parameter adjustment. To deploy a vacant space detector in a new parking lot, Zeng 2022 [28] further generalized the deep learning-based approaches to significantly different lighting scenarios with an adversarial domain adaptation technique. Both Vu 2019 [12] and Zeng 2022 [28] reused the normalized three-parking-space image setting of Huang 2017 [13] and obtained high accuracy.
Although promising results were reported in Vu 2019 [12] and Zeng 2022 [28], the success of these deep learning approaches comes from vast datasets. Unfortunately, due to the domain gap issue, these approaches require labeling once for each parking lot, which causes enormous manual effort. To solve this issue, TCL2021 [18] proposes using an optical-flow-based motion classifier as guidance to train the vacant space detector. Specifically, TCL2021 [18] found that RGB-image-based models are seriously affected by camera poses and varying lighting conditions. By contrast, the motion classifier, which takes a sequence of optical flow maps to predict the motion state of a trajectory, is robust to such factors. Therefore, the authors train a vacant space detector via consistency with the flow-based motion classifier. The motion classifier can be trained in one parking lot and later helps to train vacant space detectors in another; using consistency between the two models, the vacant space detector for a new parking lot can be trained without human labor. However, the motion classifier may make wrong predictions with high confidence, which seriously affects the trained detector. Moreover, estimating the consistency reward is complicated, and this method requires high-capacity hardware to store all frames of a trajectory.
4. Results
4.1. Datasets and Implementation Detail
Two training datasets are collected on the source domain in Figure 1. The first training dataset, hereafter the standard dataset, is collected over 30 days and labeled for supervised learning. We collect one image every 30 min, and only in the daytime (from 6 a.m. to 6 p.m.). This dataset includes 14,667 vacant slots and 20,021 occupied slots. The second training dataset, hereafter the pairwise dataset, includes samples collected from trajectories as in TCL2021 [18]. In detail, TCL2021 [18] uses the magnitude of optical flow to detect a time slot when a car is moving. Given the optical flow in that time slot, a motion classifier estimates the motion state of the car. On both sides of the time slot (before and after), there are no-motion segments in which no car moves, as in Figure 4. We take one sample from each no-motion segment, and the two samples are then added to the pairwise dataset. As a result of this sampling method, the paired samples have similar backgrounds and differ only in the middle slot, as shown in Figure 2 and Figure 4. This dataset includes 2000 samples from 1000 trajectories. Hence, the standard dataset is vast and diverse, which makes it easy to train a vacant space detector; in contrast, the pairwise dataset is small but contains many similar images. This property allows the pairwise dataset to work as a contrastive loss where paired samples are trained together.
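The pair-sampling step can be sketched as follows (a hypothetical sketch; the index-based trajectory representation and the function name are ours for illustration, not TCL2021's code):

```python
import random

def sample_pair(frames, motion_start, motion_end, rng=random):
    # A trajectory is a list of frames; [motion_start, motion_end] is the
    # time slot in which the car moves (detected via optical-flow magnitude).
    # One frame is taken from each surrounding no-motion segment, giving a
    # vacant/occupied pair that shares the same background.
    before = rng.randrange(0, motion_start)             # before the car moves
    after = rng.randrange(motion_end + 1, len(frames))  # after the car stops
    return frames[before], frames[after]
```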
In the testing phase, two datasets are prepared. The first, the source test set, includes 11,944 vacant samples and 7832 occupied slots collected on another 15 days in the source parking lot, from 6 a.m. to 6 p.m. The second, the target test set, is of the same scale as the first but is collected from the target domain. We use the two testing datasets to evaluate the effect of the domain gap and the generalization of models trained by the proposed method.
We follow the suggestion in [32] to implement the VIB method. However, instead of using a sum operator to estimate the loss over training patches, we use a mean operator. Given an input x and the corresponding d-dimensional latent z, the sum operator accumulates the KL loss d times, while the classification loss is estimated only once per sample. In this situation, gradients from the KL loss may destroy the original backbone. Using the mean operator, the losses are averaged over both the feature dimension and the patch dimension, so the model converges smoothly.
In addition, the training hyper-parameters are listed below:
Optimizer: SGD with learning rate = 0.01.
Scheduler: step_size = 30 and gamma = 0.1.
Data normalization: mean = [0.485, 0.456, 0.406]; std = [0.229, 0.224, 0.225].
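For reference, the step scheduler above corresponds to the following learning-rate rule, and the data normalization applies the listed per-channel statistics (a sketch; only the listed hyper-parameter values come from our setup, the function names are ours):

```python
import numpy as np

def learning_rate(epoch, base_lr=0.01, step_size=30, gamma=0.1):
    # Step schedule: multiply the learning rate by gamma every step_size epochs.
    return base_lr * gamma ** (epoch // step_size)

MEAN = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
STD = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)

def normalize(image):
    # image: RGB float array in [0, 1] with shape (3, H, W).
    return (image - MEAN) / STD
```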
4.2. Ablation Study
In this section, we discuss the contribution of the KL loss to feature learning. The quality of features depends on the quality of the training dataset. Therefore, we randomly select subsets of the standard training dataset to train the detector. We denote the subsets by their sizes: they contain 2000, 5000, 10,000, 20,000, and 34,688 samples, respectively. As the pairwise (trajectory) dataset has 2000 samples, it is equivalent in scale to the 2000-sample subset. However, the difference lies in the content of the pairwise dataset, where paired vacant/occupied images share the same background.
We first use subsets of the standard training dataset to evaluate the generalization of a model. In this scenario, the training dataset is diverse. The accuracy values on the source and target test sets are in Table 2, and the corresponding F1 scores are in Table 3. During the training process, we evaluate the model on the source-domain test data every epoch, and this accuracy is used to select the best model in the training phase.
Regularization methods (e.g., batch normalization, dropout) are possible solutions to avoid overfitting and learn generalized features. Hence, we apply batch normalization and dropout to the conventional supervised learning method to evaluate regularization in this application. Our experiment is based on the VGG model for vacant space detection; the VGG model includes a feature extractor, a neck, and a classifier. We add batch normalization after the feature extractor and denote this setting as “No VIB (BN)” in Table 2 and Table 3. In the classifier module of the VGG network, a dropout layer is applied with p = 0.5 by default; this default setting is denoted as “No VIB (p = 0.5)”. To attain better generalized features, we also test p = 0.7 and p = 0.9, denoted as “No VIB (p = 0.7)” and “No VIB (p = 0.9)”, respectively.
Following the results in Table 2 and Table 3, several conclusions can be drawn:
The domain gap is a critical challenge when deploying to a new parking lot. In Table 2, using the 5000-sample subset, the conventional method (No VIB) achieves 99.46% accuracy on the source test set but only 89.68% on the target test set;
To learn generalized features, a vast dataset is needed. In the “No VIB” scenario in Table 2, using a larger dataset (30,540 samples) for training does not help to increase accuracy on the same domain (the source test set). However, the performance increases significantly on the target domain. In detail, when the 5000-sample subset is used for training, the accuracy on the target test set is only 89.68%; when the 30,540-sample dataset is used, the corresponding accuracy is 92.02%. A similar observation can be found in Table 3;
The proposed method helps to increase the performance on the source domain when the training dataset is small. When the training set is the 2000-sample subset, the accuracy and F1-score on the source test set are 97.95% and 97.54%, respectively. When the KL loss is introduced, the performance improves to 98.44% accuracy and 98.14% F1-score on the source domain;
When the number of training samples increases, the proposed method learns better features that work well on the target domain. The conventional model (No VIB) reaches at most 88.95% on the target test set, and there is no apparent difference between the accuracies given by the 10,000-sample subset (88.84% on the target domain) and the larger subsets (at most 88.95%). In contrast, the proposed method achieves 97.91% accuracy and a 97.31% F1 score with the largest subset. Moreover, the differences in accuracy become apparent when more training samples are available: the performance improves as the number of training samples is increased. In further detail, when the 2000-, 5000-, 10,000-, 20,000-, and 34,688-sample subsets are used for training, the accuracies are 92.93%, 94.73%, 95.28%, 96.51%, and 97.91%, respectively;
The weight of the KL loss should not be too large. With a very large weight, the performance on the source test set is reduced. This is reasonable because the KL loss forces a feature to be closer to zero, meaning that fewer spatial regions are used to predict the vacant state. Following our experiments, a moderate KL weight is the optimal selection in most datasets;
Batch normalization may not help in the vacant space detection application. Even if the performance on the source domain is high, the model does not work better on the target domain; the maximum accuracy on the target domain is only 83.53% when BN is applied;
Increasing the dropout ratio may help with small training sets. When the 2000- or 5000-sample subsets serve as training sets, the performance improves on both the source and target domains. In particular, when the smallest subset is used for training and the dropout ratio is 0.9, the conventional method performs better than our proposed method. However, when the training dataset is larger, the performance does not improve further: the accuracies on the larger subsets are quite similar whether p = 0.5 or p = 0.7. In contrast, our VIB-based method can continue to learn and accept new features from larger training datasets.
We also provide experiments to evaluate the proposed method when the training dataset includes pairwise samples. Here, the pairwise (trajectory) dataset serves as the training set, and the testing sets are the source and target test sets, as in the previous experiments. Note that the paired samples act as a contrastive loss during the training phase. The result in Table 4 shows that a pairwise dataset may help to learn better generalized features. Compared with the result given by the 2000-sample subset of the standard dataset, the result given by the pairwise dataset is slightly better: the accuracy scores are 98.04% and 97.95% for the pairwise dataset and the 2000-sample subset, respectively. We compare these two datasets because they are of the same scale. The pairwise dataset [18] helps to slightly increase performance; this phenomenon can also be observed when the target test set is used for testing. Hence, the pairwise property can force the model to learn generalized features.
When applying VIB, the performance on the source domain improves only slightly: the accuracy is 98.04% without VIB and 98.97% with VIB. This suggests that the pairwise dataset does not have enough data to learn a much better detector, which is consistent with the result in Table 2 when the 2000-sample subset is the training set. However, the performance on the target domain improves significantly. Without VIB, the accuracy on the target test set is 85.88%, but it increases to 94.82% with VIB; the increment is 8.94%. If the 2000-sample subset of the standard dataset is used as the training set, the increment is 8.75%. This means that the VIB loss helps with both types of datasets.
4.3. Comparison with State-of-the-Art Methods
4.3.1. Comparison with Supervised Learning Methods
We also compare our method with other conventional vacant space detectors, including [12,13,18,22,23,24,33]. The parking lot in our paper is similar to the parking lots in Vu 2019 [12] and TCL2021 [18]. However, Vu 2019 [12] prepared a vast dataset of 587,667 samples, and TCL2021 [18] used a motion classifier to select pairwise samples. Those two papers and ours all use a normalized three-parking-slot image as the detector input. The authors of Vu 2019 [12] also used their dataset to evaluate the Paulo 2013 [22], Lixia 2012 [23], Wu 2007 [24], Huang 2017 [13], and Faster R-CNN [33] methods in a comprehensive comparison. From a dataset viewpoint, our dataset is quite similar to that of Vu 2019 [12] but at a smaller scale. From a training viewpoint, Vu 2019 [12] does not use any pre-trained model, whereas the proposed method uses a pre-trained VGG model. We therefore also prepared a simple version of the proposed method that does not use the VGG pre-trained model.
The comparison results in Table 5 show that our method outperforms vision-based methods such as [22,23,24]. Moreover, we also compare with deep-learning-based methods [12,13,33]. Faster R-CNN [33] works for vehicle detection, but it cannot reliably detect occluded parked cars with small image sizes. In addition, Faster R-CNN must address both the localization and classification tasks, while Huang 2017 [13] and Vu 2019 [12] use 3D information to solve the localization task; hence, their performances are better. Huang 2017 [13] and Vu 2019 [12] provide higher accuracy for vacant space detection, but they require a vast training dataset with supervised labels. In comparison, our method achieves equivalent accuracy with only 5000 samples. However, Huang 2017 [13] and Vu 2019 [12] do not use any pre-trained model, whereas our method uses a VGG pre-trained model. For a fairer comparison, we also train our method from scratch with only 5000 training samples; without the help of the VGG pre-trained model, the accuracy is reduced by 1%. This degradation is relatively small and could be compensated for with a larger training dataset. TCL [18] uses 1000 motion trajectories to achieve good performance, but its training process is complex because of the task consistency reward, and it requires more GPU RAM to store all frames of a trajectory. Compared with TCL [18], our proposed method is easier to train.
4.3.2. Comparison with an Upper Bound
In our work, we train and evaluate the vacant space detector on the source domain and additionally test it on the target domain. During the training process, no information from the target domain is used. To evaluate our learning method's performance, we compare the proposed method with unsupervised domain adaptation [28]. In unsupervised domain adaptation [28], the target model uses both a source dataset and a target dataset during the training phase. However, the source dataset includes labels and images, whereas the target dataset has only images. Therefore, unsupervised domain adaptation [28] can be treated as an upper bound for our proposed method.
In this experiment, two cameras at two viewing angles are set up to collect two datasets: one serves as the source dataset and the other as the target dataset. Each dataset is split into a training set and a validation set within its domain. The results in Table 6 show that when the first view is used as the source dataset, the performance on the target dataset (the second view) reaches the upper bound. However, when the second view serves as the source dataset, the performance on the first view is far from the upper bound. This means that the generalization of features relies greatly on the source domain.
4.4. Feature Analysis
In this section, we analyze the feature maps extracted by the proposed method (with VIB) and the conventional method (without VIB). Our model is based on the VGG model that includes a feature extractor, a neck, and a classifier. Given an image, we use the feature extractor to extract feature maps. There are 512 feature maps given one input image.
Figure 5 shows one example image and several corresponding feature maps in both cases (with and without VIB). The result shows that the VIB-based method extracts sparse feature maps. This means that the features learned by VIB are precise and are not found everywhere on the example image, even though cars occupy every neighboring slot. Only some spatial locations respond on the feature maps yielded by the VIB-based method. This phenomenon is reasonable because the KL loss forces the feature maps toward zero. Similar to the dropout technique, a sparse feature map may avoid overfitting so that the model generalizes better. This observation also demonstrates the benefit of the proposed method, where the model serves as a feature selector.
Figure 5.
Feature maps extracted by the detector with and without VIB loss.
In addition, the proposed method not only provides sparse feature maps but also yields more empty feature maps in which all values are zero. Given one feature map, we extract its minimum, maximum, and mean values; this process is applied to all 512 channels. The statistics of these variables are in Table 7. Each column represents one variable (the maximum, minimum, or mean value of a feature map), and each row represents one statistical indicator.
By comparing the statistics of the maximum variable between the two methods (with and without VIB), we can see that the number of empty feature maps in the VIB-based method is larger: with VIB, more than 50% of the feature maps are empty, while without VIB the proportion is below 50%. In addition, the VIB-based features are not spread over a huge range. With VIB, the maximum values of the minimum and maximum variables are 7.3204 and 12.6495; without VIB, they are 1.2250 and 30.8568. Considering the maximum value, the gap between the maximum and minimum variables is 5.3291 with VIB and 29.6318 without VIB. This means that VIB serves as a normalization process that reduces the feature variance in the feature space.
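The per-channel statistics behind Table 7 can be reproduced with a routine like the following (a sketch; the toy tensor in the test is illustrative, while our model produces 512 channels):

```python
import numpy as np

def feature_map_stats(features):
    # features: activation tensor of shape (C, H, W).
    # For each channel, record its min, max, and mean; a channel whose
    # maximum is zero is an "empty" feature map (ReLU outputs are >= 0).
    per_map_min = features.min(axis=(1, 2))
    per_map_max = features.max(axis=(1, 2))
    per_map_mean = features.mean(axis=(1, 2))
    n_empty = int(np.sum(per_map_max == 0.0))
    return per_map_min, per_map_max, per_map_mean, n_empty
```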
Following the above discussions, VIB has feature normalization and sparse representation properties. Normalization and sparse representation are well-known ingredients of better-generalizing models; hence, the proposed method can learn generalized features. In addition, during the training process, some uncertainty is added to the feature by the reparameterization step. This uncertainty may model some domain shift factors (orientation, camera field of view, camera height) in feature space. Therefore, the model has the ability to adapt to environmental changes.