1. Introduction
Parking has become an increasingly serious problem as the number of cars on the roads has grown, particularly in urban areas. Therefore, there is a strong demand for effective parking lot management systems that can address this problem in real time. With limited parking availability and the ever-growing number of vehicles, traditional parking management approaches are proving inadequate in ensuring optimal space utilization and reducing congestion. Deep learning techniques, particularly convolutional neural networks (CNNs), have gained attention for their potential to transform parking management. These methods offer the promise of accurate occupancy detection, which is fundamental for making informed decisions regarding space allocation, traffic flow optimization, and overall urban planning.
Several studies have proposed deep learning techniques for parking lot management, with a specific focus on three types of problems: automatic parking space position detection, individual parking space classification, and vehicle detection and counting [1]. The motivation behind developing a model for parking lot occupancy detection is to address the need for efficient management of parking spaces. By accurately determining the occupancy status of parking lots, it becomes possible to optimize parking resource utilization, enhance traffic management, and improve the overall parking experience for users.
However, the successful integration of deep learning in parking management necessitates a profound understanding of the unique challenges posed by this domain. Parking scenarios introduce complexities such as varying lighting conditions, diverse vehicle types, occlusions, and the requirement for real-time response. These challenges demand tailored solutions that can reliably function across a spectrum of conditions, providing accurate occupancy detection while accommodating the dynamic nature of parking environments. Existing methods for parking lot occupancy detection often rely on conventional computer vision techniques or shallow machine learning models, which struggle to achieve high accuracy in complex parking scenarios. These methods lack the ability to handle some of the aforementioned problems.
In this paper, we address these challenges by proposing an enhanced MobileNetV3 architecture customized for the nuanced demands of parking lot occupancy detection. By leveraging the architectural efficiency of MobileNetV3 [2] and introducing domain-specific modifications, we aim to mitigate the complexities inherent to parking management scenarios. Although the MobileNetV3 architecture has demonstrated significant efficiency gains and high accuracy in various computer vision tasks, its application to parking lot occupancy detection poses unique challenges. In the context of parking lot occupancy detection, the original MobileNetV3 encounters limitations related to handling varying lighting conditions, dealing with occlusions, and distinguishing between different vehicle types. These challenges stem from the specific characteristics of parking lot images, including complex backgrounds, varying perspectives, and the need to accurately identify small, partially occluded objects. Our research addresses these limitations by introducing key modifications to the MobileNetV3 architecture tailored for parking lot occupancy detection. This modified version incorporates several architectural improvements, including the use of a Leaky-ReLU6 [3] activation function for the shallow part of the MobileNetV3 model, the replacement of the squeeze-and-excitation module [4] with the convolution block attention module [5], and the replacement of the depth-wise separable convolutions with blueprint separable convolutions [6]. We treat the automatic detection of vacant spaces as a binary classification problem and train and test the improved model on widely used parking management datasets such as CNRPark-EXT [7] and PKLot [8]. The proposed model processes individual parking spaces and classifies them as vacant or occupied. The incoming real-time video feed frame is processed to obtain individual parking spaces. The proposed model exhibits superior performance compared to previous state-of-the-art models in terms of accuracy and precision and demonstrates its capability to function in real time.
The industrial significance of our approach lies in its practical applications within the rapidly growing field of smart cities and intelligent transportation systems. Our modified MobileNetV3 model addresses key challenges in parking management, contributing to reduced congestion, improved user experiences, and optimized parking resource utilization. With real-time and accurate parking occupancy detection, cities can implement responsive parking guidance systems, enabling drivers to quickly locate available parking spots.
The main contributions of this study are as follows:
Novel model outperforming state-of-the-art models: We propose and develop a novel model that achieves a substantial advancement over existing state-of-the-art models in terms of both accuracy and AUC score. Importantly, this superior performance is achieved while ensuring real-time functionality, making our model highly suitable for practical applications.
Enhancements to MobileNetV3 architecture: We enhance the performance of the MobileNetV3 architecture through a series of strategic modifications. Firstly, we introduce a novel activation function that contributes to improved accuracy and precision. Additionally, we replace the traditional squeeze-and-excitation (SE) module with a Convolution Block Attention Module (CBAM), a change that refines the model’s ability to focus on salient features. Moreover, we optimize the depth-wise convolution block by adopting blueprint separable convolutions, resulting in a model architecture that is more efficient and effective for parking management tasks.
Improved generalization and small object detection: Our enhanced MobileNetV3 model exhibits notable improvements in its architecture. These modifications empower the model to better identify essential aspects of images, pay attention to small objects within the image, and achieve increased generalization capability. These enhancements collectively contribute to superior performance in parking lot occupancy detection tasks.
Practical significance: The contributions outlined above hold significant implications for real-world parking management scenarios. Our model’s high accuracy, coupled with its capacity for real-time operation, has the potential to substantially improve parking lot occupancy detection. By homing in on crucial image components and effectively detecting small objects, our model can be a valuable asset for optimizing parking resource utilization, alleviating traffic congestion, and ultimately enhancing the efficiency of parking management systems.
The remainder of this paper is organized as follows. Section 2 reviews the literature on the MobileNet family of models and parking space classification. Section 3 describes the datasets used in the experiments. Section 4 and Section 5 discuss the proposed parking management approach and present the experimental results and analyses, respectively. Section 6 provides an overview of the research findings and suggests potential areas for future investigation.
4. Proposed Method
This section examines the development process of the deep learning-based parking lot occupancy detection system and its constituent components. We use the LeakyReLU6 activation function for the shallow part of the model, replace the SE block with a convolution block attention module, and replace the depth-wise convolution layers with blueprint separable convolutions. The logical architecture of the occupancy detection process with an already trained model is presented in Algorithm 1.
Algorithm 1. Pseudocode for parking lot occupancy detection process.
Input: images of streaming camera
Input: manually entered parking space locations
Set classification threshold → T
While the streaming video does not stop, for each frame of the video:
End while
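The detection loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: frames are plain nested lists, `classify_patch` is a hypothetical stand-in for the trained model that returns an occupancy probability, `mean_brightness` is a toy classifier used only to make the sketch runnable, and the threshold value of 0.5 is assumed.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Frame = List[List[float]]            # grayscale frame as rows of pixel values
Box = Tuple[int, int, int, int]      # (x, y, width, height) of one parking space


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the patch for one manually annotated parking space out of a frame."""
    x, y, w, h = box
    return [row[x:x + w] for row in frame[y:y + h]]


def detect_occupancy(
    frames: Iterable[Frame],
    spaces: Dict[str, Box],
    classify_patch: Callable[[Frame], float],  # hypothetical: returns P(occupied)
    threshold: float = 0.5,                    # assumed value of T
) -> Iterator[Dict[str, str]]:
    """Yield, for every frame, a mapping from space id to 'occupied' or 'vacant'."""
    for frame in frames:  # runs until the stream ends
        status = {}
        for space_id, box in spaces.items():
            score = classify_patch(crop(frame, box))
            status[space_id] = "occupied" if score >= threshold else "vacant"
        yield status


def mean_brightness(patch: Frame) -> float:
    """Toy classifier (assumption): treats bright patches as occupied."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values)


# One 4x4 frame: the left half is bright, the right half is dark.
frame = [[1.0, 1.0, 0.0, 0.0] for _ in range(4)]
spaces = {"A1": (0, 0, 2, 4), "A2": (2, 0, 2, 4)}
results = list(detect_occupancy([frame], spaces, mean_brightness))
print(results)
```

Writing the loop as a generator keeps the per-frame results available to a downstream step, such as drawing bounding boxes, without buffering the whole stream.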
4.1. LeakyReLU6 Activation Function for the Shallow Part of the Network
The use of activation functions is an important aspect of deep learning models. Activation functions introduce non-linearity into the network, thereby allowing it to learn more complex and abstract features from the input data. The authors of MobileNetV3 used the ReLU6 activation function as part of the h-swish activation function. ReLU6 is a popular activation function that is frequently deployed in neural networks because it is computationally efficient and can prevent the vanishing gradient problem.
However, ReLU6 has the limitation that it remains inactive for negative input values, which can result in inaccurate feature extraction. To address this limitation, the Leaky-ReLU6 activation function is used in this study. The Leaky-ReLU6 function combines the leaky-ReLU concept with the ReLU6 function to form a new activation function that is divided into three segments.
When x is less than zero, the input is multiplied by a small parameter, ‘a’, so that the neuron still produces a small non-zero output and does not die. This allows for more effective feature extraction in the low-level network. When 0 ≤ x < 6, the function grows linearly; when x reaches 6, the output remains at 6 and does not increase further.
The use of Leaky-ReLU6 in the shallow part of the MobileNetV3 model can help improve the accuracy of image feature extraction, particularly for negative input values. The parameter ‘a’ can be manually adjusted during the training process to find the optimal value for the best performance; this value can be used in subsequent test executions.
During our experiments, we tested values of a in the range [0.0001, 0.1]. The best performance was observed when a was equal to 0.001.
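The three segments described above can be written as a small scalar function. The sketch below is illustrative only: the function names are our own, and a real model would apply the same rule element-wise to tensors in the deep learning framework. Plain ReLU6 is included for comparison.

```python
def leaky_relu6(x: float, a: float = 0.001) -> float:
    """Leaky-ReLU6: a * x for x < 0, x for 0 <= x < 6, and 6 for x >= 6."""
    if x < 0:
        return a * x
    return min(x, 6.0)


def relu6(x: float) -> float:
    """Plain ReLU6 for comparison: negative inputs are clipped to zero."""
    return min(max(x, 0.0), 6.0)


# Compare both activations on a negative, a zero, a mid-range, and a large input.
for x in (-2.0, 0.0, 3.0, 8.0):
    print(x, relu6(x), leaky_relu6(x))
```

The only difference from ReLU6 is the negative branch, where a small slope keeps a non-zero output and gradient.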
4.2. CBAM Attention Mechanism
In computer vision, the attention mechanism is a technique that focuses on specific regions of an image that are most relevant to a given task or objective. It is inspired by the manner in which human attention works, where we tend to focus on the most informative or interesting parts of an image. In an attention mechanism, a model learns to assign importance weights to different parts of an image and then selectively combines these features to make a prediction or decision. This can improve the accuracy and efficiency of a model because it allows it to pay attention to the most important details while avoiding unimportant or distracting details in a picture. Attention mechanisms have been demonstrated to enhance the performance of these models in several computer-vision tasks, including image classification, object identification, and image captioning.
The attention module in MobileNetV3 is called the squeeze-and-excitation (SE) module. It comprises two main operations: squeeze and excitation. In the squeeze operation, the feature maps from the previous convolutional layer are globally averaged and pooled to produce a 1D feature vector that represents the channel-wise statistics of the feature maps. During the excitation operation, this 1D feature vector is passed through two fully connected layers using a gating mechanism, producing a channel-wise importance score vector. This vector is then multiplied with the original feature maps to produce the attended feature maps, which emphasize the informative channels and suppress the less informative ones.
The SE module is designed to adaptively adjust the channel-wise importance of feature maps, which enhances the discriminability of features and boosts the performance of object-detection tasks. It has been shown to perform well in a range of computer vision tasks such as semantic segmentation, object detection, and image classification. However, the SE module concentrates solely on the channel dimension of the feature map while overlooking the spatial dimension of the target data. In contrast, the convolution block attention module (CBAM) creates an attention map in both the channel and spatial dimensions and conducts element-wise multiplication operations between the attention map and input feature map in the corresponding dimensions. This results in a more comprehensive and accurate extraction of the target features.
Unlike the SE module, which relies on global average pooling only, the CBAM channel attention mechanism adds a parallel global max pooling branch. The use of both pooling operations enables the extraction of more comprehensive, high-level features. Within the bottleneck structure of the parking space classification model, the input channels undergo channel expansion followed by depth-wise convolution, which produces feature F; this feature is input into the channel attention module of the CBAM to derive the channel attention map. The channel attention map is then multiplied with F to obtain the feature F’, which is fed into the spatial attention module to produce the spatial attention map. The final feature F’’ is obtained by multiplying F’ by the spatial attention map, followed by a linear point-wise convolution.
Figure 9 shows a schematic diagram of MobileNetV3’s bottleneck structure with an integrated CBAM module.
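The two-step CBAM computation (channel attention followed by spatial attention) can be sketched with NumPy as below. This is a simplified illustration, not the trained module: the weights are random rather than learned, the reduction ratio of 4 and the 7 × 7 spatial kernel are assumed values, biases are omitted, and the surrounding bottleneck convolutions are left out.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(f: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Shared two-layer MLP applied to average- and max-pooled channel descriptors."""
    avg = f.mean(axis=(1, 2))                      # (C,)
    mx = f.max(axis=(1, 2))                        # (C,)

    def mlp(v: np.ndarray) -> np.ndarray:
        return w2 @ np.maximum(w1 @ v, 0.0)        # ReLU between the two layers

    return sigmoid(mlp(avg) + mlp(mx))             # (C,)


def spatial_attention(f: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Convolve the channel-wise average and max maps with one k x k kernel."""
    pooled = np.stack([f.mean(axis=0), f.max(axis=0)])            # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))     # zero padding
    windows = sliding_window_view(padded, (k, k), axis=(1, 2))    # (2, H, W, k, k)
    return sigmoid(np.einsum("cab,chwab->hw", kernel, windows))   # (H, W)


def cbam(f: np.ndarray, w1: np.ndarray, w2: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    f1 = channel_attention(f, w1, w2)[:, None, None] * f   # F'  = Mc(F)  * F
    f2 = spatial_attention(f1, kernel)[None, :, :] * f1    # F'' = Ms(F') * F'
    return f2


rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 5, 4                 # illustrative sizes and reduction ratio
feature = rng.standard_normal((C, H, W))
out = cbam(
    feature,
    w1=rng.standard_normal((C // r, C)),
    w2=rng.standard_normal((C, C // r)),
    kernel=rng.standard_normal((2, 7, 7)),
)
print(out.shape)
```

Because both attention maps pass through a sigmoid, every output value is a down-weighted copy of the corresponding input value; the module rescales features rather than creating new ones.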
4.3. Blueprint Separable Convolutions to Replace Depth-Wise Separable Convolutions
As discussed in Section 3.2, depth-wise separable convolutions are used in MobileNetV3 to reduce the number of parameters and computational complexity while maintaining accuracy. Traditional convolutional layers have a large number of parameters, which can lead to slow inference times and high memory usage. In MobileNetV3, the use of depth-wise separable convolutions, along with other optimizations, such as SE blocks and hard-swish activation functions, results in a highly efficient and accurate neural network architecture for mobile and embedded devices. However, Haase and Amthor [6] quantitatively analyzed the properties of kernel weights obtained from trained models and found that depth-wise separable convolutions implicitly rely on cross-kernel correlations. Their proposed approach, blueprint separable convolutions, instead builds on intra-kernel correlations, which enables a more effective separation of standard convolutions. This results in a more efficient and effective convolution method.
Blueprint separable convolutions are a type of convolutional neural network layer introduced by Haase and Amthor [6] that aims to improve the efficiency of depth-wise separable convolutions by exploiting the correlations within CNN kernels along their depth dimension. A filter of size M × K × K can be represented by a single K × K template and M parameters that distribute the template along the depth dimension; this observation motivated the creation of blueprint separable convolutions. Every filter kernel F^(n) can be represented using a blueprint B^(n) and the weights w_{n,1}, …, w_{n,M} via
$$F^{(n)}_{m,:,:} = w_{n,m} \cdot B^{(n)}$$
with m ∈ {1, …, M}, where M is the number of kernels in one filter, and n ∈ {1, …, N}, where N is the number of filters in one layer.
Figure 10 illustrates the blueprint separable convolutions and their differences from standard convolutions. Blueprint separable convolutions exploit the CNN kernel correlations along their depth axes. Consequently, each filter kernel is represented as a single two-dimensional blueprint kernel in blueprint separable convolutions, which are then distributed along the depth axis using a weight vector. Although filter kernels are subject to strict limitations under this formulation, the authors experimentally showed that, when compared to their vanilla equivalents, CNNs trained using blueprint separable convolutions can achieve the same or even higher quality.
Compared to standard convolution layers that have M × N × K² free parameters, a blueprint separable convolution only has N × K² parameters for the blueprints and M × N parameters for the weights. The authors proposed two versions of blueprint separable convolutions: unconstrained blueprint separable convolutions (BSConv-U) and subspace blueprint separable convolutions (BSConv-S).
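As a quick arithmetic check of these parameter counts, the snippet below evaluates both formulas for one layer. The layer sizes M = 64, N = 128, and K = 3 are hypothetical values chosen only for illustration, and biases are ignored.

```python
def standard_conv_params(m: int, n: int, k: int) -> int:
    """Free parameters of a standard convolution layer: M * N * K^2."""
    return m * n * k * k


def bsconv_params(m: int, n: int, k: int) -> int:
    """Blueprint separable convolution: N * K^2 blueprint values plus M * N weights."""
    return n * k * k + m * n


m, n, k = 64, 128, 3                      # hypothetical layer sizes
std = standard_conv_params(m, n, k)
bs = bsconv_params(m, n, k)
print(std, bs, round(std / bs, 2))
```

For these sizes, the blueprint formulation needs roughly one eighth of the parameters of the standard layer.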
When compared to DSConv, BSConv-U has depth-wise and point-wise convolution layers in opposite order, in which intra-kernel correlations are promoted more than cross-kernel correlations. BSConv-U is less complex in terms of the mathematical equations and calculations, making it more suitable for practical implementation.
Reversing the order of the layers is not expected to significantly affect the middle flow of the network because it already includes point-wise and depth-wise convolutions in an alternating pattern. However, the entry flow is affected because the feature maps from the initial regular convolution can be more fully utilized by the depth-wise convolution via the preceding point-wise distribution. The authors experimentally demonstrated that CNNs trained using the BSConv method can achieve comparable or even superior quality compared to their conventional counterparts.
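The relationship between BSConv-U and the blueprint formulation can be checked numerically. The NumPy sketch below is an illustration under simplifying assumptions (stride 1, no padding, no bias, random values, small hypothetical sizes): it applies a point-wise convolution followed by a depth-wise convolution and compares the result with a standard convolution whose kernels are built from one blueprint per filter scaled by the point-wise weights.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
M, N, K, H, W = 3, 4, 3, 6, 6            # illustrative sizes (assumptions)
x = rng.standard_normal((M, H, W))       # input feature map with M channels
w = rng.standard_normal((N, M))          # point-wise weights w[n, m]
B = rng.standard_normal((N, K, K))       # one K x K blueprint per filter

# BSConv-U: 1 x 1 point-wise convolution first, then K x K depth-wise convolution.
y = np.einsum("nm,mhw->nhw", w, x)
bsconv_out = np.einsum(
    "nab,nijab->nij", B, sliding_window_view(y, (K, K), axis=(1, 2))
)

# Standard convolution whose kernels are F[n, m] = w[n, m] * B[n].
F = w[:, :, None, None] * B[:, None, :, :]
standard_out = np.einsum(
    "nmab,mijab->nij", F, sliding_window_view(x, (K, K), axis=(1, 2))
)

print(np.allclose(bsconv_out, standard_out))
```

The two outputs agree because both operations are linear, so summing over input channels before or after applying the blueprint gives the same result.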
Overall, the improvements in the architecture of the proposed model helped prevent the model from overfitting, decreased the inference time, and improved accuracy.
4.4. Implementation Details
The proposed classification model was trained using a personal computer with an 8-core 3.70 GHz CPU, 32 GB of memory, and an Nvidia GeForce RTX 3060 GPU. The training and testing processes utilized two commonly used parking lot datasets: PKLot and CNRPark-EXT. During our experiments, we used predefined training, validation, and testing subsets of the CNRPark-EXT dataset: the training subset contains 104,493 patches from both the CNRPark and CNRPark-EXT dataset training subsets; the validation subset contains 21,231 patches from both the CNRPark and CNRPark-EXT datasets; and the testing subset contains 31,825 patches from the CNRPark-EXT dataset testing subset. From the PKLot dataset, we used the PUCPR (424,269 patches), UFPR04 (105,845 patches), and UFPR05 (165,785 patches) subsets alternately as our training and testing subsets. The crucial parameters for the training experiments are as follows: 500 epochs, a batch size of 64 images, and a 224 × 224 input image size. Using a starting learning rate of 0.0001, weight decay of 0.0005, and momentum of 0.99, we employed the Adam optimizer, which combines the benefits of two other optimizers: the adaptive gradient algorithm (AdaGrad) and root mean square propagation (RMSProp).
Using five-fold cross-validation, we separated the dataset into five folds; in each round, 80% of the data was used for training and the remaining 20% for validation. Shuffling was performed at every epoch. Our trained model performed well when tested on a sample of previously unseen images.
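A five-fold split of this kind can be sketched with the Python standard library. The helper name `k_fold_indices`, the fixed seed, and the sample count of 100 are assumptions made for illustration; they are not taken from the experiments.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) index lists; each fold is the validation set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)           # shuffle once before splitting
    folds = [indices[i::k] for i in range(k)]      # k folds of near-equal size
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


splits = list(k_fold_indices(100))
for train, val in splits:
    print(len(train), len(val))
```

With five folds, each round trains on four folds (80% of the samples) and validates on the remaining fold (20%).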
We used accuracy and AUC scores as our main metrics in this work. Below, we present the formulas used to calculate the accuracy and precision:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$
where TP, FN, FP, and TN represent the number of true positives, false negatives, false positives, and true negatives, respectively.
AUC score: The AUC is a metric commonly used to evaluate the performance of binary classification models, such as those used in machine learning and deep learning. The receiver operating characteristic (ROC) curve is a graphical representation that illustrates the trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity) at different probability thresholds. The AUC represents the area under the ROC curve, which is a single value ranging from 0 to 1. The AUC score in our work measures the model’s ability to distinguish between occupied and unoccupied parking spaces.
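The metrics can be computed directly from labels and scores, as in the sketch below. The labels, scores, and the 0.5 decision threshold are made-up illustrative values, with 1 denoting an occupied space. The AUC is computed through its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
from typing import List, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for binary labels where 1 means occupied."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if (tp + fp) else 0.0


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos: List[float] = [s for t, s in zip(y_true, scores) if t == 1]
    neg: List[float] = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0]                       # illustrative ground truth
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]           # illustrative model outputs
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # assumed threshold of 0.5

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
prec = precision(tp, fp)
area = auc(y_true, scores)
print(acc, prec, area)
```

Unlike accuracy and precision, the AUC does not depend on the chosen threshold, which is why it complements them as a measure of how well the scores separate the two classes.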
5. Experimental Results and Analysis
In this section, we analyze and compare the results of our proposed model with those of other classification models developed for parking lot classification, such as mAlexNet, CarNet, and others, in terms of classification accuracy and AUC score. The experiments show that our proposed modified MobileNetV3 model has a higher classification accuracy than other models and that our proposed model correctly classifies and categorizes more empty and busy parking spaces than other models.
We visualized what our model learned during training using the GradCAM [22] and feature visualization [23] methods to check whether the model was learning the right features and paying attention to the right part of the image. Samples of this process are given in Figure 11. GradCAM helps by visualizing which parts of the image the model is paying the most attention to.
In Figure 12, we show a sample parking lot classification result produced by our proposed model. As is visible in the figure, all the parking spaces are correctly classified as busy or vacant, which illustrates the accuracy of our model.
As an ablation study, we trained the original MobileNetV3 model from scratch on PKLot and CNRPark-EXT datasets and tested the model on both datasets, and the same process was applied to four different models: MobileNetV3 with the proposed LeakyReLU6 activation function, MobileNetV3 with its SE mechanism replaced by the CBAM attention mechanism, MobileNetV3 with its depth-wise separable convolutions replaced by blueprint separable convolutions, and MobileNetV3 with all the above modifications applied. The goal of these experiments was to detect which modification made to the original model brought the greatest increase in accuracy and made the model more generalized and scalable to different parking areas. The results are summarized in Table 3.
From Table 3, it is evident that the original MobileNetV3 model achieved nearly 100% accuracy when trained and tested on the same subset of the PKLot dataset. However, when trained on one subset and tested on another, the accuracy of this model dropped, which indicates that it overfit the training subset. When the model was trained on the UFPR05 subset and tested on the two other subsets, its performance was weaker, with accuracy rates of 87.80% for PUCPR testing and 88.25% for UFPR04 testing. However, changing its shallow part activation function, changing its attention mechanism, and replacing depth-wise separable convolutions with blueprint separable convolutions helped the model avoid overfitting and achieve high accuracy on all training and testing parts.
Substituting the ReLU6 activation function with LeakyReLU6 resulted in a reduction in overfitting of approximately 2% within identical training and testing dataset scenarios. Introducing the CBAM module in place of the SE module raised the accuracy from 87.80% to 92.64% for the UFPR05/PUCPR case and from 88.25% to 91.78% for the UFPR05/UFPR04 scenario. Replacing DSConv with BSConv yielded the most significant improvement in accuracy among the three architectural modifications. In the case of training and testing on the same subset, the accuracy was close to that of the original MobileNetV3, while overfitting was successfully mitigated. Moreover, for the UFPR05/PUCPR and UFPR05/UFPR04 cases, the model’s accuracy improved by 6% and 5%, respectively. The best classification results were achieved when all modifications were applied to the model, as expected given the effect of each individual modification on the model’s performance.
Figure 13 presents the learning curves of the five models in Table 3 for training on the PUCPR subset of the PKLot dataset.
In Figure 13, it can be seen that after the final epoch, the training accuracies for the original MobileNetV3 and our proposed approach (MobileNetV3 with all modifications) were 99.95% and 99.9%, respectively. This comparison also shows that, of the three architectural changes, replacing DSConv with BSConv had the greatest effect on the model’s classification improvement. However, as noted above, the original MobileNetV3 overfit the dataset, which explains why its training accuracy is slightly higher than that of the proposed model.
We then compared the results of our best model with those of other models developed or fine-tuned with transfer learning, such as AlexNet, mAlexNet, CarNet, VGG16 [24], VGG19 [24], and others, on the PKLot dataset. A comparison of the results is presented in Table 4.
Table 4. Classification results comparison of our best model with mAlexNet, CarNet, VGG16, and other models on the PUCPR, UFPR04, and UFPR05 subsets of the PKLot [8] dataset. Bold data shows the highest score for that experiment.
Model | Train | Test: PUCPR | Test: UFPR04 | Test: UFPR05 |
---|---|---|---|---|
Our solution: modified MobileNetV3 | PUCPR | 99.90% | 98.20% | 95.15% |
Our solution: modified MobileNetV3 | UFPR04 | 98.85% | 99.68% | 98.38% |
Our solution: modified MobileNetV3 | UFPR05 | 95.06% | 96.34% | 99.20% |
CarNet [16] | PUCPR | 98.80% | 94.40% | 97.70% |
CarNet [16] | UFPR04 | 98.30% | 95.60% | 97.60% |
CarNet [16] | UFPR05 | 98.40% | 95.20% | 97.50% |
mAlexNet [7] | PUCPR | 99.90% | 98.03% | 96% |
mAlexNet [7] | UFPR04 | 98.27% | 99.54% | 93.29% |
mAlexNet [7] | UFPR05 | 92.72% | 93.69% | 99.49% |
AlexNet [14] | PUCPR | 98.60% | 88.80% | 83.40% |
AlexNet [14] | UFPR04 | 89.50% | 98.20% | 87.60% |
AlexNet [14] | UFPR05 | 88.20% | 87.30% | 98% |
VGG16 [24] | PUCPR | 88.20% | 94.20% | 90.80% |
VGG16 [24] | UFPR04 | 89.70% | 95.30% | 90% |
VGG16 [24] | UFPR05 | 90.50% | 94.90% | 91.80% |
VGG19 [24] | PUCPR | 81.50% | 93.80% | 94.60% |
VGG19 [24] | UFPR04 | 80.40% | 92.30% | 91.90% |
VGG19 [24] | UFPR05 | 88.80% | 95.10% | 95.90% |
Xception [25] | PUCPR | 96.30% | 92.50% | 93.30% |
Xception [25] | UFPR04 | 94% | 94.60% | 93.40% |
Xception [25] | UFPR05 | 95.70% | 90.90% | 91.20% |
Inception V3 [26] | PUCPR | 90.80% | 91.10% | 94.20% |
Inception V3 [26] | UFPR04 | 91.70% | 95.20% | 92.40% |
Inception V3 [26] | UFPR05 | 94.30% | 92.90% | 93.70% |
ResNet50 [27] | PUCPR | 88.20% | 94.20% | 94.10% |
ResNet50 [27] | UFPR04 | 89.70% | 95.30% | 93.30% |
ResNet50 [27] | UFPR05 | 90.50% | 94.90% | 95.50% |
The results presented in Table 4 indicate that our approach matched or outperformed the alternative classification methods in six out of nine experimental scenarios. Our method achieved the highest accuracy in the following train/test scenarios: PUCPR/PUCPR (99.9%, tied with mAlexNet), PUCPR/UFPR04 (98.2%), UFPR04/PUCPR (98.85%), UFPR04/UFPR04 (99.68%), UFPR04/UFPR05 (98.38%), and UFPR05/UFPR04 (96.34%). CarNet [16] exhibited better performance than our proposed model in the UFPR05/PUCPR and PUCPR/UFPR05 scenarios, recording accuracy rates of 98.4% compared to 95.06% and 97.7% compared to 95.15%, respectively. Additionally, in the UFPR05/UFPR05 scenario, mAlexNet [7] achieved the highest accuracy of 99.49%, whereas our model attained an accuracy of 99.2%. These results show that the modifications to the original MobileNetV3 model are as useful and efficient as expected.
We subsequently repeated the experiments using the CNRPark-EXT dataset. First, we trained five models on the training subset of the CNRPark-EXT and tested them on the testing subset of the dataset: original MobileNetV3, MobileNetV3 with the LeakyReLU6 activation function, MobileNetV3 with the CBAM module, MobileNetV3 with BSConv, and MobileNetV3 with all architecture modifications. The results of these experiments are presented in Table 5.
The initial MobileNetV3 architecture yielded accuracies of 94.95%, 90.13%, and 93.53% on the training, validation, and testing subsets of the dataset, respectively. The introduction of an alternative activation function resulted in a modest enhancement of approximately 0.5% in accuracy. Meanwhile, the adoption of an alternative attention module led to a notable improvement of 2% in accuracy. Substitution of depth-wise separable convolutions (DSConv) with blueprint separable convolutions (BSConv) yielded a substantial increase of about 2.5% in accuracy.
Figure 14 shows the training process for the five different models in Table 5 on the training subset of the CNRPark-EXT dataset.
From Figure 14, it is visible that, as expected, the architectural changes helped the model increase its accuracy. In this dataset, the changes with the biggest accuracy increase were replacing the SE module with the CBAM module and replacing DSConv with BSConv.
After finishing the experiments with the different modifications, we compared our best model results with those of the CarNet, AlexNet, and ResNet models on the CNRPark-EXT dataset. A comparison of the results is presented in Table 6. From Table 6, we can observe that our model performed best on two of the three subsets of the CNRPark-EXT dataset, namely the training and testing subsets. Our model’s validation result was also good but slightly lower than that of AlexNet. Our model achieved 97.73% accuracy for the validation subset; AlexNet achieved 97.91% accuracy. The previous state-of-the-art model, CarNet, achieved 97.91% accuracy in the training subset of the dataset, while achieving 90.05% and 97.24% accuracies in the validation and test sets of the dataset.
Finally, we compared our best model with mAlexNet and AlexNet in combination with the CNRPark-EXT and PKLot datasets. The test results are provided in Table 7.
As CarNet was specifically designed for this task, it achieved 97.03% accuracy on average for all three different experiments. AlexNet obtained 94.07% accuracy as it is a good general deep learning architecture. However, mAlexNet achieved only 88.69% accuracy on average for all three different experiments, which shows that mAlexNet achieves very poor results when trained on one full dataset and tested on another, or in the reverse case. The testing scores for the three combinations provided reveal that our model is much more robust, as it can generalize well and learn general features from the datasets.
In Table 8, the AUC scores for our proposed model and other state-of-the-art models are given and compared. In this table, we include one different model proposed in [8], which we call PKLot for convenience. Out of nine experiments with different subsets of the PKLot dataset, our proposed model achieved the highest AUC scores in five cases, while the PKLot approach had the highest AUC scores in three experiments, and CarNet achieved the highest AUC score in one experiment when trained on the PUCPR subset and tested on the UFPR05 subset of the PKLot dataset.
Our trained models took around 10 MB of memory, which is small compared to large models such as VGG16 and AlexNet. A modified version of mAlexNet proposed in [15] needs about 10 KB of memory, but its accuracy is lower than that of mAlexNet. mAlexNet, proposed by Amato et al. [7], needs about 129 KB. Thus, while our model is larger than mAlexNet and the modified mAlexNet, it has better accuracy and AUC scores, as shown in the above experiments.
We also compared the average runtimes of our proposed model with those of other models. We randomly selected 1000 images of size 224 × 224 from each of the CNRPark-EXT and PKLot datasets and ran each model on the same machine used for training without GPU acceleration in the PyTorch framework. Table 9 shows our runtime analysis.
While our model is 6.7 times slower than both mAlexNet and custom mAlexNet models, it is still 3 times faster than the AlexNet model, which makes it applicable in real-world applications.
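An average per-image runtime of this kind can be measured with a simple timing loop. The sketch below is only an illustration of the measurement procedure: `dummy_model` is a hypothetical stand-in for a trained network, the ten constant 224 × 224 inputs replace the sampled dataset images, and the two warm-up calls are an assumption rather than part of the reported setup.

```python
import statistics
import time
from typing import Callable, List, Sequence

Image = List[List[float]]


def average_runtime_ms(model: Callable[[Image], float], inputs: Sequence[Image], warmup: int = 2) -> float:
    """Return the mean time per call in milliseconds, after a few untimed warm-up calls."""
    for sample in inputs[:warmup]:
        model(sample)
    timings = []
    for sample in inputs:
        start = time.perf_counter()
        model(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)


def dummy_model(image: Image) -> float:
    """Hypothetical stand-in for a classifier: sums all pixel values."""
    return sum(sum(row) for row in image)


images = [[[0.5] * 224 for _ in range(224)] for _ in range(10)]
mean_ms = average_runtime_ms(dummy_model, images)
print(f"{mean_ms:.3f} ms per image")
```

Timing every model on the same inputs and the same machine, as described above, is what makes the resulting ratios between models comparable.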
The overall conclusion is that the improved MobileNetV3 is a fairly robust model when trained on one dataset and tested on another. We are certain that this approach can be applied to real-life scenarios.
6. Conclusions and Future Work
A parking lot occupancy detection approach was developed in this study using a deep CNN classification model, MobileNetV3, with several modifications to its architecture that increased its robustness and accuracy. The developed model was trained on two well-known parking lot datasets: PKLot and CNRPark-EXT. The incoming video stream is processed frame-by-frame, and each frame is split into patches; the modified MobileNetV3 model classifies each patch as being occupied by a car or as an empty parking space. The classification results are integrated into frames with bounding boxes drawn around each parking space. The qualitative and quantitative performances of the proposed system were experimentally compared with those of other established classification models. The evaluation and experimental results revealed that the enhanced MobileNetV3 model achieved high accuracy and outperformed the other classification models in most experiments while remaining fast enough for real-time use. The developed parking-space classification model is efficient and can be applied to real-world scenarios using mobile devices, resource-constrained edge devices, and cameras.
The main contributions of this work are provided below:
An optimal deep learning model was developed to classify parking lot spaces as empty or busy. In the proposed model, the activation function in the shallow part of the model, which requires significant calculations, is replaced by a new activation function that requires less computation. The squeeze-and-excitation attention mechanism applied in the original MobileNetV3 is replaced by another, more effective attention mechanism: the convolutional block attention mechanism. Moreover, because of the hidden cross-kernel correlations in depth-wise separable convolutions, blueprint separable convolutions are used, as they require less computation because they have fewer parameters.
Using the improved MobileNetV3 model, the parking lot occupancy detection approach can precisely detect the number of free and busy parking spaces, despite different weather conditions, lighting, and shadows.
Despite being robust and sufficiently quick for real-world applications, our model still has some shortcomings: the inability to correctly classify images under diverse weather conditions, images that contain a portion of cars, images with unusual parking configurations, images with partial occlusion, and images with unseen objects.
In the future, we plan to continue exploring new methods and changes to improve the accuracy of the classification model, reduce its runtime on mobile and edge devices, and make it applicable in the above-mentioned cases where the model may currently fail. Furthermore, we plan to work on a smart camera containing the proposed system to detect parking lot occupancy, improve its efficiency, and reduce its resource consumption.
Our research can be extended by being integrated into a decentralized smart camera system [7]. Incorporating the improved MobileNetV3 into a decentralized smart camera system has the potential to significantly enhance the efficiency, responsiveness, and intelligence of the system. We are also working on automatic parking space detection to replace the manual labeling of parking spaces.