Article

Enhanced Vehicle Logo Detection Method Based on Self-Attention Mechanism for Electric Vehicle Application

1 College of Computer Science, Inner Mongolia University, Hohhot 010031, China
2 College of Life Science and Chemistry, Hunan University of Technology, Zhuzhou 412007, China
3 Company of IVIS, Osaka 530-0001, Japan
4 Company of Raying, Hangzhou 310020, China
5 Company of Arcadia, Osaka 530-0001, Japan
* Author to whom correspondence should be addressed.
World Electr. Veh. J. 2024, 15(10), 467; https://doi.org/10.3390/wevj15100467
Submission received: 3 September 2024 / Revised: 25 September 2024 / Accepted: 1 October 2024 / Published: 14 October 2024
(This article belongs to the Special Issue Deep Learning Applications for Electric Vehicles)

Abstract

Vehicle logo detection plays a crucial role in computer vision applications such as vehicle classification and detection. In this research, we propose an improved vehicle logo detection method that leverages the self-attention mechanism. Our feature-sampling structure integrates multiple attention mechanisms and bidirectional feature aggregation to enhance the discriminative power of the detection model. Specifically, we introduce a multi-head attention module for multi-scale feature fusion to capture multi-scale contextual information effectively, and we incorporate a bidirectional aggregation mechanism to facilitate information exchange between different layers of the detection network. Experimental results on the benchmark VLD-45 dataset demonstrate that our proposed method outperforms baseline models in both detection accuracy and efficiency, achieving a state-of-the-art result of 90.3% mAP. Our method also improves AP by 10% on difficult samples, such as HAVAL and LAND ROVER. It provides a new detection framework for small-size objects, with potential applications in various fields.

1. Introduction

Recently, the detection of small objects has become an important research topic in computer vision and intelligent perception, particularly in intelligent transportation systems (ITSs), where it is required for tasks such as pedestrian detection, vehicle identification, and abnormal event monitoring. Among these, vehicle logo detection has emerged as a crucial task for identifying vehicles, calculating brand exposure, and advancing small object detection research; Figure 1 shows a real-world scene for logo detection. However, previous research has overlooked the importance of extracting detailed features from small objects, which has severely limited the accuracy and generalization of vehicle logo detection. Consequently, effectively extracting and constructing features from small-size objects is crucial to solving the vehicle logo detection task.
In recent years, deep learning-based detection methods have encountered challenges in feature extraction and representation tasks [1,2,3,4]. Deep residual networks [5,6,7] are not effective at extracting detailed texture features, which can negatively impact detection accuracy. Moreover, the absence of a feature monitoring mechanism in the network often makes it difficult to achieve the desired detection results. Therefore, we focus on constructing robust feature extraction models based on self-attention networks, which enable correlation feature learning over pixel areas and capture better object texture information. Our work specifically aims to develop a self-attention network-based method for detecting small-size objects, for three primary reasons.
Firstly, our proposed model addresses the challenges in small-size object detection, including the impact of complex background noise on text signs and the sensitivity of vehicle logos to lighting and weather conditions. Text signs, such as HAVAL and Jeep, are particularly vulnerable to the external environment. Feature descriptors from standard neural networks cannot effectively learn content with small differences in the relevant regional characteristics. Furthermore, vehicle logos are not fixed in a particular position; they may appear, for example, on the radiator grille or the hood.
Secondly, the proposed model tackles the challenges in vehicle logo detection caused by different lighting and weather conditions. These challenges confuse the logo with the characteristics of other objects and cause color deviation due to the sensitivity of the logo’s material to light. To address these challenges, our feature extraction network is designed with generalization in mind during the training process.
Thirdly, the balance between accuracy and speed complicates the practical application of deep learning-based detectors. Deepening the network and applying transformer-based visual methods significantly reduce detection speed. Therefore, our model aims to reduce the memory consumption of the network and improve computing efficiency, improving the overall effectiveness of object detection methods.
In this paper, we focus on developing a feature extraction network based on self-attention and a detection head for small-size objects in vehicle logos. For the feature extraction network, we designed a multi-feature-fusion residual convolution with a pixel attention layer, which can effectively learn the relationships surrounding the vehicle logo, considering the challenges posed by complex background noise and varying lighting and weather conditions. To achieve smooth sampling of the object and reduce feature loss during down-sampling, we cascade multiple residual convolutions. For the detection head, we utilize cross-layer fusion to support the multi-scale prediction layer, which improves the localization and classification accuracy for small-size objects. The contributions of our model can be summarized as follows:
(A)
We propose a balanced object detection method based on self-attention networks, which achieves real-time performance and higher detection precision for vehicle logos.
(B)
We construct a related-feature learning model based on the theory of visual transformers and convolution. It utilizes cross-layer fusion and related pixel learning to improve the representational model for small-size objects.
(C)
We build a multi-scale prediction detector by fusing shallow layers with deep layers, which takes shallow texture features as important information for locating objects. Experimental evaluation on the VLD-45 dataset proves that our detector is robust and superior in detecting small-size objects.
The remainder of our research is organized as follows: Section 2 introduces related work on object detection based on deep learning. Section 3 describes our VLD self-attention method in detail. Section 4 presents the experimental results with comparable methods and an ablation analysis. Finally, Section 5 concludes the paper and discusses future work.

2. Related Work

With the standardization of datasets, vehicle logo detection has become a hot topic in computer vision research. Vehicle logo detection methods have evolved from traditional handcrafted features to deep features, achieving improved detection performance. In this section, we briefly review these methods from three aspects: datasets, traditional detection methods, and deep learning-based methods.

2.1. VLD Dataset

Although the vehicle logo detection task has been studied for many years, there are few public datasets available to the computer vision community. The XMU [8] and HFUT-VL [9] datasets contain image data obtained from real-time road cameras. However, these datasets lack a division into training, validation, and test sets and do not have a uniform image size. The VLD-30 [10] and VLR-40 [11] datasets contributed to establishing classification standards and reconstructing dataset divisions for vehicle logos. Nevertheless, issues remain, such as low image resolution and a lack of real-world scenarios. The VLD-45 [12] dataset provides a large amount of data for vehicle logo detection, consisting of 45,000 images and 45 classes acquired from the real world and the Internet. In this paper, we use VLD-45 to evaluate the precision of our method.

2.2. Traditional Detection Method

Previous research on vehicle logo detection focused on predicting bounding boxes and classification using manually designed feature extraction models. Commonly used feature representation methods include the Scale-Invariant Feature Transform (SIFT) and Histograms of Oriented Gradients (HOG) [13,14]. A Support Vector Machine (SVM) was combined with HOG to predict candidate regions from images [15,16]. Psyllos et al. [17] proposed a SIFT-based feature matching method for vehicle logos, which achieves 94% recognition accuracy over 10 categories. Peng et al. [18] used a Statistical Random Sparse Distribution (SRSD) for vehicle logo recognition, improving feature extraction from low-resolution images. Sun et al. [19] combined HOG features with SIFT features and used an SVM classifier to predict vehicle logo classes. Most of these methods use manually designed feature extractors for vehicle logo representation, combined with trained strong and weak classifiers to achieve object localization and classification [20]. However, such methods have a limited ability to handle large amounts of data, resulting in lower generalization of the detector.

2.3. Deep Learning-Based Detection Methods

Deep learning-based methods have become the mainstream algorithms for detecting small-size objects such as vehicle logos. Deep features obtained through Convolutional Neural Network (CNN) training have better target representation ability, and adaptive learning outperforms manually designed feature matching templates. Pan et al. [21] proposed a CNN-based vehicle logo recognition method and compared the performance of the CNN against SIFT; the experiments showed the CNN to be more accurate than the SIFT model. Li et al. [22] combined the Hough transform with a deep neural network to detect vehicle logos, using Deep Belief Networks (DBNs) for logo classification. Soon et al. [23] designed a CNN model based on an automatic search method, aiming to construct an optimal target feature extractor. Liu et al. [24] used the ResNeXt network to improve restricted-region extraction matching. Nguyen et al. [25] proposed a multi-scale feature fusion framework for efficient feature extraction. Nevertheless, extracting the detailed texture features of vehicle logos remains one of the important problems in improving detection accuracy.
Recently, visual transformers have been applied to deep learning-based feature extraction, using self-attention networks to learn regional feature relationships. Such backbones have better feature extraction capability for local context information in images. However, memory consumption and computation increase rapidly as the network deepens. Thus, the focus of this paper is how to integrate transformer mechanisms into feature extraction networks efficiently.

3. Method

In this section, we introduce the detailed pipeline of our proposed method, VLD self-attention. Our method consists of three sub-modules built on a deep learning object detector: an attention feature extraction network, a detection head, and a training policy. By constructing network blocks with attention and residual blocks for the backbone, we create a robust representational model for extracting texture features of small-size objects.

3.1. Overview

As shown in Figure 2, our method takes RGB images as input and resizes them to 640 × 640 pixels. The backbone consists of 5 convolutional blocks and 1 Spatial Pyramid Pooling (SPP) network. We incorporate 2 attention blocks in the shallow layers to learn relational features and use the SPP network to fuse features at different scales, improving the utilization of the shallow layers. Through supervision and self-attention mechanisms, our model enhances the extraction of texture information and reduces feature loss during the detection and localization of small-size objects.
For the detection head, we employ a feature-sharing learning method to perform multi-scale object prediction. We use a concat layer to merge the feature map from the deep layer with the shallow layer. In this way, we can complete the target positioning task at different scales, which provides an important reference for small-size objects. The prediction layers are refined from blocks 1 to 4, providing many detailed features to assist object localization. Furthermore, we balance the classification and localization losses during training to ensure the prediction accuracy of the bounding boxes. The rest of this section describes the structure of our vehicle logo detection method in detail.

3.2. Attention Feature Extraction Network

In our research, we analyzed and identified the limitations of traditional convolution networks, which prioritize global feature representation while lacking the ability to extract local features. For vehicle logos in particular, the ratio of the object to the whole image is usually around 0.2%, which causes most of the object's features to be lost during feature extraction. At the same time, we need to extract more detailed local texture features to maintain detection precision.
Visual self-attention can learn global representations and construct attention between local pixels or regions. Our feature extraction network is designed to gather information around the object based on self-attention. In addition, pixel-level feature extraction allows us to perform feature fusion at different scales, which compensates for feature loss during down-sampling. We therefore reconstructed the feature extraction network from self-attention and convolution. Our backbone includes 5 blocks: 2 layers of attention convolution and 3 layers of residual convolution. The SPP layer fuses features at 3 scales (16×, 8×, and 4×), which provides a better receptive field for small-target feature extraction.
Figure 3 shows the attention residual block. It uses 3 × 3 kernels for down-sampling the images. Meanwhile, we designed a local attention model that learns related features from pixel-correlated regions. The local attention model consists of a one-dimensional convolutional network, which computes pixel-related information and passes it to the following convolution layer. Depending on the input pixels, we define them as 1 × 1 × n dimensional matrices. To smooth gradient descent, we use the Mish function as the activation of the residual feature extraction block. The Mish function is as follows:
$$\mathrm{Mish}(x) = x \cdot \tanh\left(\ln\left(1 + e^{x}\right)\right)$$
where x is the output of the convolution and tanh is the hyperbolic tangent function. This function ensures that activations in the range [−4, 0] are not truncated, which reduces the gradient saturation problem. However, we only use this function in the local attention models, where it helps update the weights of multiple residual networks during back-propagation.
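As a concrete illustration of the attention residual block in Figure 3, the following PyTorch sketch pairs a one-dimensional convolution over per-channel descriptors (our reading of the local attention model) with Mish-activated residual convolutions. The module names, channel counts, and 1-D kernel size are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalAttention(nn.Module):
    """Local attention via a 1-D convolution (illustrative; kernel size k is an assumption)."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                       # x: (B, C, H, W)
        w = F.adaptive_avg_pool2d(x, 1)         # (B, C, 1, 1) per-channel descriptor
        w = w.squeeze(-1).transpose(1, 2)       # (B, 1, C): a 1-D sequence over channels
        w = self.conv1d(w)                      # learn correlations between nearby channels
        w = torch.sigmoid(w).transpose(1, 2).unsqueeze(-1)  # back to (B, C, 1, 1)
        return x * w                            # reweight the input features

class AttentionResidualBlock(nn.Module):
    """Sketch of the attention residual block: Mish activations plus local attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.attn = LocalAttention()
        self.mish = nn.Mish()                   # Mish(x) = x * tanh(ln(1 + e^x))

    def forward(self, x):
        out = self.mish(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.attn(out)
        return self.mish(out + x)               # residual connection keeps shallow details
```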
For the convolution block, however, the Mish function is not suitable for our method. In our view, and as Figure 4 suggests, values on the negative half-axis are still helpful to weight updates during training. For smaller or less complex objects, the Leaky ReLU activation function retains important information more effectively than Mish. Furthermore, a single-stage detection method needs to acquire more feature information, and the unbounded Leaky ReLU function is less affected by gradient saturation. Therefore, our convolution block uses Leaky ReLU as the activation function, along with a Batch Normalization (BN) layer to avoid over-fitting.
In addition, we built the feature smoothing part using two of the 3 × 3 convolutional kernels in Figure 4, which fuses the original input information with the residual-processed information. The smoothing process keeps the size of the input features consistent with the processed features, enabling an effective mapping between input feature details and down-sampled features and leading to better feature extraction for small-size objects. This step only performs feature fusion and smoothing, without activation functions. A concat layer then fuses same-size feature maps along the channels. We believe that fusing features with the same receptive field along the channel dimension is more conducive to keeping features invariant in scale space, while also preserving the edges, textures, and other details of small-size objects for subsequent detection. Finally, we use a 3 × 3 convolutional kernel for feature down-sampling after the information fusion. In the attention feature extraction network, the feature fusion convolution block is used for blocks 1, 4, and 5, and the attention residual block is used for blocks 2 and 3.
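A minimal sketch of this fusion block follows, under the same assumptions about channel widths; the two smoothing convolutions carry no activation, and the down-sampling convolution is followed by BN and Leaky ReLU as described.

```python
import torch
import torch.nn as nn

class FeatureFusionConvBlock(nn.Module):
    """Sketch of the Figure 4 block: two 3x3 smoothing convolutions (no activation),
    channel-wise concat of input and smoothed features, then 3x3 stride-2 down-sampling.
    Channel widths are illustrative assumptions."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Feature smoothing: fuse original and processed information, no activation.
        self.smooth = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1, bias=False),
            nn.Conv2d(in_ch, in_ch, 3, padding=1, bias=False),
        )
        # Down-sample after concat; BN guards against over-fitting, and Leaky ReLU
        # keeps the negative half-axis contributing to weight updates.
        self.down = nn.Sequential(
            nn.Conv2d(2 * in_ch, out_ch, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        fused = torch.cat([x, self.smooth(x)], dim=1)  # same receptive field, fused on channels
        return self.down(fused)
```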
Our network uses the SPP module for feature scale fusion. It aims to solve the classification errors caused by scale changes in small-size objects, so it primarily performs fusion calculations on the output feature maps of the deep layers. The number of attention and convolution layers in each block is 4, 8, 16, 8, and 4, and the overall down-sampling rate of the attention feature extraction network is 32.
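The SPP module itself can be sketched in the familiar pyramid-pooling style: the deepest feature map is max-pooled at several kernel sizes and concatenated with itself. The pooling kernel sizes below are a common choice and an assumption on our part, not values given in the paper.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial Pyramid Pooling sketch: parallel max-pooling at several scales, concatenated.
    The kernel sizes (5, 9, 13) are a conventional choice and an assumption here."""
    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels
        )

    def forward(self, x):
        # Concatenating the input with its pooled variants fuses several receptive fields.
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```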

3.3. Detection Head

Our detection head is a typical single-stage detection framework based on anchor box generation, with three prediction layers at different scales. Additionally, we propose a feature fusion method that combines the shallow and deep layers. This approach enables the model to incorporate more detailed texture information, leading to improved accuracy in object classification and localization.
Figure 5 shows the prediction pipeline for object detection. The detection head takes the output feature map of the backbone as input. By dividing the input image, we obtain grid cells in the image space, and for each grid cell our model generates candidate object regions using anchor box generation. Through pre-set anchor boxes, the detection network can perceptually learn the scale of objects; for small-size objects in particular, the anchor boxes help reduce the deviation range of bounding box regression during training. However, too many anchor box settings reduce the subsequent detection speed. Meanwhile, to avoid the influence of manually set anchor box sizes on the final prediction accuracy, we use K-means clustering to calculate the initial anchor box sizes.
For the anchor boxes, we built an optimization merit function for selecting box sizes based on the K-means method. The Intersection over Union (IoU) represents the overlap ratio between targets. The distance function D(·) is as follows:
$$D(\mathrm{box},\ \mathrm{centroid}) = 1 - \mathrm{IoU}(\mathrm{box},\ \mathrm{centroid})$$
$$\mathrm{box} = w \times h$$
where centroid is the target center point and box represents the width and height of the bounding box. For each bounding box ( w × h ), we use a loss function to find the optimal cluster number k and bounding box size ( w × h ). The function E(D, k) is as follows:
$$E(D, k) = \frac{\sum_{i=1}^{n} D\left((w_i, h_i),\, k_j\right)}{n} \times \frac{1}{k}, \qquad j \in \{1, \dots, k\}$$
This formula yields different loss results for different values of k. The k with the best loss result is then selected as the number of anchor boxes, and the method gives the corresponding anchor box sizes ( w × h ).
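A minimal NumPy sketch of this anchor selection follows; it implements the 1 − IoU distance above, with the random initialization and the stand-in box data as our assumptions.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) boxes and centroids, all anchored at the origin."""
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs with the D = 1 - IoU distance; returns k anchor sizes."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        dist = 1.0 - iou_wh(boxes, centroids)          # D(box, centroid)
        assign = dist.argmin(axis=1)
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# Usage sketch: widths/heights of ground-truth logo boxes, 9 anchors over 3 scales.
boxes = np.abs(np.random.default_rng(1).normal(40, 10, size=(500, 2)))  # stand-in data
anchors = kmeans_anchors(boxes, k=9)
```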
For the prediction layer, we handle object scale changes by setting prediction layers at three scales. At the same time, to compensate for feature loss during down-sampling, we integrate shallow features into the prediction layers, improving positioning and classification accuracy in a single prediction.
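A rough sketch of such a head is given below: deep features are up-sampled and concatenated with shallower maps before each prediction convolution. The channel widths and the per-anchor output layout (3 anchors × (4 box + 1 objectness + 45 classes), borrowed from common single-stage detectors) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScalePredictionHead(nn.Module):
    """Three-scale prediction head sketch with shallow/deep feature fusion via concat."""
    def __init__(self, chs=(256, 512, 1024), num_classes=45, anchors_per_scale=3):
        super().__init__()
        out = anchors_per_scale * (5 + num_classes)
        self.pred_deep = nn.Conv2d(chs[2], out, 1)
        self.pred_mid = nn.Conv2d(chs[1] + chs[2], out, 1)
        self.pred_shallow = nn.Conv2d(chs[0] + chs[1] + chs[2], out, 1)

    def forward(self, c3, c4, c5):        # shallow (8x), middle (16x), deep (32x) maps
        p5 = self.pred_deep(c5)
        u5 = F.interpolate(c5, scale_factor=2, mode="nearest")
        m4 = torch.cat([c4, u5], dim=1)   # fuse deep context into the 16x map
        p4 = self.pred_mid(m4)
        u4 = F.interpolate(m4, scale_factor=2, mode="nearest")
        p3 = self.pred_shallow(torch.cat([c3, u4], dim=1))  # shallow texture aids localization
        return p3, p4, p5
```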

3.4. Training Policy-Freezing

Typically, training such models is time-consuming, and the mini-batch gradient descent optimization is affected by variation in the data themselves. We therefore propose a freeze training policy for the backbone to improve training efficiency and effectiveness. The method is implemented by adjusting the iteration steps and freezing weight updates: we divide the training iterations into three equal parts and freeze the weight parameters in the second part. By setting a threshold on the loss function, we resume weight updates in the third part. This ensures fast convergence in the first part and an optimal solution in the third, while the second part is gated by the loss threshold to improve training efficiency. In addition, effective control of loss changes improves the robustness and accuracy of our model.
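A hedged sketch of this schedule follows, assuming the model exposes a `backbone` sub-module, a `train_one_iter` step function, and a loss threshold value; all three names and the threshold are illustrative, not the paper's exact implementation.

```python
def freeze_training(model, optimizer, train_one_iter,
                    total_iters=80_000, loss_threshold=0.5):
    """Three-part freeze policy sketch: free training in the first third, frozen
    backbone in the second, full updates in the third; the loss threshold gates
    an early unfreeze during the middle part."""
    third = total_iters // 3
    for it in range(total_iters):
        if it == third:                      # part 2: freeze backbone weight updates
            for p in model.backbone.parameters():
                p.requires_grad = False
        if it == 2 * third:                  # part 3: resume full-network updates
            for p in model.backbone.parameters():
                p.requires_grad = True
        loss = train_one_iter(model, optimizer)
        # Within the frozen part, a sufficiently low loss re-enables updates early.
        if third <= it < 2 * third and loss < loss_threshold:
            for p in model.backbone.parameters():
                p.requires_grad = True
```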

4. Experiments

In this section, we present an experimental evaluation of our method on the VLD-45 dataset. We carried out a multi-detector comparison experiment, an ablation experiment, and a qualitative experiment. In addition, we focus on comparing the running speed and accuracy of the detection model, achieving optimal detection performance through effective parameter tuning.

4.1. Datasets

For the experiments, we use the VLD-45 object detection dataset [12]. It includes 45,000 images and 50,359 objects from 45 vehicle logo classes; Figure 6 shows the logo brands. In this dataset, the target occupies about 0.2% of the whole image, and the average object size is 40 × 32 pixels, making the dataset well suited to research on small-size object detection. Figure 7 shows samples from VLD-45. The dataset comprises a training set (20,025 images), a validation set (14,985 images), and a test set (9990 images). We completed the method evaluation directly on the original dataset. For the evaluation metrics, we use average precision (AP) for single-class accuracy and mean average precision (mAP) for multi-class evaluation.

4.2. Parameters

As shown in Figure 2 and Figure 5, our input data are resized to 416 × 416 pixels (a down-sampling rate of 32), which balances memory usage against feature requirements. Meanwhile, we use a pre-trained model from logo classification to bootstrap the detection training. There are 9 anchor box sizes across 3 scales. For non-maximum suppression (NMS), we uniformly set the Intersection over Union (IoU) threshold to 0.5.
All experiments are trained and tested on an NVIDIA Tesla A8000 GPU. For the training optimizer, we use the AMSGrad method to perform weight updates. Our models are trained for 80,000 iterations with a batch size of 32.
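For reference, this configuration maps onto a few lines of PyTorch/torchvision. The learning rate, the stand-in model, and the example detections are our assumptions; the batch size, iteration count, AMSGrad choice, and the 0.5 NMS IoU threshold come from the text.

```python
import torch
import torch.nn as nn
from torchvision.ops import nms

model = nn.Conv2d(3, 16, 3)   # stand-in for the detector; the real model is assumed

# AMSGrad variant of Adam for weight updates (the learning rate is an assumption).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)

BATCH_SIZE = 32          # from the text
TOTAL_ITERS = 80_000     # from the text

# Post-processing: unified NMS with an IoU threshold of 0.5.
boxes = torch.tensor([[0., 0., 40., 32.], [2., 2., 42., 34.]])  # stand-in detections (x1,y1,x2,y2)
scores = torch.tensor([0.9, 0.8])
keep = nms(boxes, scores, iou_threshold=0.5)   # indices of boxes that survive suppression
```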

4.3. Comparison Experiments

To evaluate detection performance, we chose mainstream detection methods for comparison: Faster R-CNN [26], RefineDet [27], YOLOv3 [28], YOLOv4 [29], and our method. To facilitate analysis, the experiments report detection accuracy as average precision (AP) for the 45 categories, along with overlap ratios and running times on the VLD-45 test data. The results are presented in Table 1. Analyzing these results further emphasizes the effectiveness of our approach: the obtained AP and mean average precision (mAP) show a substantial improvement in detection accuracy over the selected benchmark methods, particularly across diverse detection classes.
Compared to the previous best mAP of 84.7%, our method achieves 88.0% mAP, indicating a noteworthy advance in the model's ability to accurately identify and classify objects in the dataset. Importantly, this improvement is achieved with a processing time of only 0.07 s per image, highlighting the practical viability of our method in real-time applications. Furthermore, the overlap ratio of 89.3% underscores the robustness of our method in providing precise regional accuracy.
The high overlap ratio signifies the model’s capability to deliver consistent and reliable results, crucial for applications where precise object delineation is paramount. In addition to its accuracy, our method also demonstrates significant advantages in terms of speed. By optimizing the feature extraction and processing pipelines, we have achieved a faster inference time compared to traditional object detection approaches, making our model highly suitable for real-time applications where rapid decision making is critical.
Notably, for challenging classes such as letter patterns (e.g., HAVAL, LAND ROVER, and Jeep), our method enhances detection precision by 3% to 5% in terms of AP. The experimental evaluation indicates that our method exhibits effective localization and classification for all categories. However, it is acknowledged that our method has not uniformly improved detection results across all categories, suggesting potential for enhancement in multi-category prediction capabilities. Hence, addressing the differentiation in features across multiple categories remains a key area for future research.
In summary, the results affirm the robustness, speed, and efficiency of our proposed method, positioning it as a promising solution for accurate and real-time object detection tasks. The high mAP, rapid processing time, and strong overlap ratio collectively contribute to the method’s practical utility and underline its potential for various applications in computer vision and object recognition, particularly where both speed and precision are critical.

4.4. Ablation Experiment for Our Method

Our exploration of the detection performance of the three improvements reveals significant insights. As outlined in Section 3, we introduced three methods aimed at enhancing detection results: the attention feature extraction network (AFEN), the detection head (DH), and the freeze training policy (FTP). Rigorous validation experiments were conducted for each method under controlled conditions.
The ablation results in Table 2 underscore the critical need for a robust backbone, such as the feature extraction network of our VLD self-attention model. The evaluation shows that, compared to YOLOv4, our AFEN module achieves a baseline detection mAP of 0.855, indicating its efficacy in extracting the discriminative features essential for accurate detection.
Furthermore, the improvement from the detection head, coupled with the training policy, should not be overlooked: combining the detection head with the freeze training policy yields an mAP of 0.880 on the dataset. This highlights the synergistic effect of refining both the network architecture and the training strategy.
While the training policy exhibits limited performance in model improvement, it emphasizes the importance of a holistic optimization approach. To achieve further advancements in detection accuracy, emphasis should be placed on meticulous design considerations within the detection framework. The experimental results underscore the complexity of optimizing the feature extraction model and the overall detection framework for superior performance in real-world scenarios.

4.5. Qualitative Results

Figure 8 shows the detection results of our method on the VLD-45 dataset. As the examples depict, our approach yields favorable qualitative results, and the qualitative analysis further supports the efficacy of our method in enhancing the accuracy of vehicle logo detection.
One notable observation is the precision exhibited in the detection of vehicle logos. Our method demonstrates a robust capability to accurately identify and delineate logos, even in complex scenarios or varied lighting conditions. This is indicative of the model’s adaptability and resilience in real-world applications. Moreover, the qualitative results suggest that our method excels in maintaining the integrity and clarity of detected logos. This is crucial for applications where precise logo recognition is essential, such as in autonomous driving systems or traffic monitoring.
The analysis of Figure 8 underscores the potential practical significance of our method in real-world scenarios, showcasing its ability to contribute to advancements in vehicle logo detection accuracy and reliability. Further quantitative assessments and comparisons with existing methods would provide a comprehensive evaluation of its performance against diverse benchmarks.

5. Conclusions

In this work, we propose an end-to-end framework for vehicle logo detection. Our method focuses on solving the detection of small-size objects. We design an attention feature extraction network based on the self-attention mechanism, which combines multi-scale feature fusion with attention blocks to achieve robust feature representation. We then construct a detection head with multi-scale prediction to improve localization precision. For the prediction layer, we design an up-sampling network to learn the detection parameters; the multi-scale prediction layer fuses feature maps from the shallow layers to obtain the bounding box regression results. The whole model can be learned end to end. In addition, we use a multi-stage freeze training policy to improve training efficiency. In the evaluation on the VLD-45 dataset, our method obtains the best detection performance across the 45 vehicle logo classes, and the ablation results prove the effectiveness of each component. However, our model still lacks an ideal balance between detection accuracy and running speed. In the future, we will reconstruct the detection framework itself to achieve real-time detection performance.

Author Contributions

Conceptualization, S.Y.; Methodology, Y.L. and Z.L.; Software, C.X. and X.D.; Writing, S.Y.; Revise, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Reform and Development of Local Universities (Disciplinary Construction) and the special research project of First-class Discipline of Inner Mongolia A. R. of China, grant number YLXKZX-ND-036.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author. GitHub: https://github.com/YS-KIT/VLD-45-B-DATASET-Detection (accessed on 2 September 2024), maintained by Shuo Yang.

Acknowledgments

This work is supported by “Junma Plan” research topic from Inner Mongolia University with Shuo Yang.

Conflicts of Interest

Ziyue Liu is an employee of IVIS. Changhua Xu is an employee of Raying. Xueting Du is an employee of Arcadia. The paper reflects the views of the scientists and not the companies.

References

  1. Zhu, L.; Yu, F.R.; Wang, Y.; Ning, B.; Tang, T. Big Data Analytics in Intelligent Transportation Systems: A Survey. IEEE Trans. Intell. Transp. Syst. 2019, 20, 383–398. [Google Scholar] [CrossRef]
  2. Sahel, S.; Alsahafi, M.; Alghamdi, M.; Alsubait, T. Logo Detection Using Deep Learning with Pretrained CNN Models. Eng. Technol. Appl. Sci. Res. 2021, 11, 6724–6729. [Google Scholar] [CrossRef]
  3. Jiang, X.; Sun, K.; Ma, L.; Qu, Z.; Ren, C. Vehicle logo detection method based on improved YOLOv4. Electronics 2022, 11, 3400. [Google Scholar] [CrossRef]
  4. Moshayedi, A.J.; Uddin, N.M.I.; Khan, A.S.; Zhu, J.; Andani, M.E. Designing and Developing a Vision-Based System to Investigate the Emotional Effects of News on Short Sleep at Noon: An Experimental Case Study. Sensors 2023, 23, 8422. [Google Scholar] [CrossRef]
  5. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  6. Moshayedi, A.J.; Roy, A.S.; Liao, L.; Khan, A.S.; Kolahdooz, A.; Eftekhari, A. Design and Development of FOODIEBOT Robot: From Simulation to Design. IEEE Access 2024, 12, 36148–36172. [Google Scholar] [CrossRef]
  7. Martyushev, N.V.; Malozyomov, B.V.; Kukartsev, V.V.; Gozbenko, V.E.; Konyukhov, V.Y.; Mikhalev, A.S.; Kukartsev, V.A.; Tynchenko, Y.A. Determination of the Reliability of Urban Electric Transport Running Autonomously through Diagnostic Parameters. World Electr. Veh. J. 2023, 14, 334. [Google Scholar] [CrossRef]
  8. Huang, Y.; Wu, R.; Sun, Y.; Wang, W.; Ding, X. Vehicle Logo Recognition System Based on Convolutional Neural Networks With a Pretraining Strategy. IEEE Trans. Intell. Transp. Syst. 2015, 16, 1951–1960. [Google Scholar]
  9. Yu, Y.; Wang, J.; Lu, J.; Xie, Y.; Nie, Z. Vehicle logo recognition based on overlapping enhanced patterns of oriented edge magnitudes. Comput. Electr. Eng. 2018, 71, 273–283. [Google Scholar] [CrossRef]
  10. Yang, S.; Zhang, J.; Bo, C.; Wang, M.; Chen, L. Fast vehicle logo detection in complex scenes. Opt. Laser Technol. 2018, 110, 196–201. [Google Scholar] [CrossRef]
  11. Meethongjan, K.; Surinwarangkoon, T.; Hoang, V.T. Vehicle logo recognition using histograms of oriented gradient descriptor and sparsity score. Telkomnika (Telecommun. Comput. Electron. Control) 2020, 18, 3019–3025. [Google Scholar] [CrossRef]
  12. Yang, S.; Bo, C.; Zhang, J.; Gao, P.; Li, Y.; Serikawa, S. VLD-45: A Big Dataset for Vehicle Logo Recognition and Detection. IEEE Trans. Intell. Transp. Syst. 2021, 23, 25567–25573. [Google Scholar] [CrossRef]
  13. Llorca, D.F.; Arroyo, R.; Sotelo, M.A. Vehicle logo recognition in traffic images using HOG features and SVM. In Proceedings of the 2013 16th International IEEE Conference on Intelligent Transportation Systems, The Hague, The Netherlands, 6–9 October 2013; pp. 2229–2234. [Google Scholar]
  14. Satpathy, A.; Jiang, X.; Eng, H.L. LBP-Based Edge-Texture Features for Object Recognition. IEEE Trans. Image Process. 2014, 23, 1953–1964. [Google Scholar] [CrossRef] [PubMed]
  15. Gu, Q.; Yang, J.; Cui, G.; Kong, L.; Zheng, H.; Klette, R. Multi-scale vehicle logo recognition by directional dense SIFT flow parsing. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3827–3831. [Google Scholar]
  16. Sotheeswaran, S.; Ramanan, A. A Coarse-to-Fine Strategy for Vehicle Logo Recognition from Frontal-View Car Images. Pattern Recognit. Image Anal. 2018, 28, 142–154. [Google Scholar] [CrossRef]
  17. Psyllos, A.P.; Anagnostopoulos, C.N.; Kayafas, E. Vehicle Logo Recognition Using a SIFT-Based Enhanced Matching Scheme. IEEE Trans. Intell. Transp. Syst. 2010, 11, 322–328. [Google Scholar] [CrossRef]
  18. Peng, H.; Wang, X.; Wang, H.; Yang, W. Recognition of Low-Resolution Logos in Vehicle Images Based on Statistical Random Sparse Distribution. IEEE Trans. Intell. Transp. Syst. 2014, 16, 681–691. [Google Scholar] [CrossRef]
  19. Sun, Q.; Lu, X.; Chen, L.; Hu, H. An Improved Vehicle Logo Recognition Method for Road Surveillance Images. In Proceedings of the 2014 7th International Symposium on Computational Intelligence and Design (ISCID), Hangzhou, China, 13–14 December 2014; pp. 373–376. [Google Scholar]
  20. Liao, Y.; Lu, X.; Zhang, C.; Wang, Y.; Tang, Z. Mutual enhancement for detection of multiple logos in sports videos. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4856–4865. [Google Scholar]
  21. Pan, C.; Yan, Z.; Xu, X.; Sun, M.; Shao, J.; Wu, D. Vehicle Logo Recognition Based on Deep Learning Architecture in Video Surveillance for Intelligent Traffic System. In Proceedings of the IET International Conference on Smart and Sustainable City, Shanghai, China, 19–20 August 2013; pp. 123–126. [Google Scholar]
  22. Huan, L.; Li, W.; Yujian, Q. Vehicle Logo Retrieval Based on Hough Transform and Deep Learning. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshop (ICCVW), Venice, Italy, 22–29 October 2017; pp. 967–973. [Google Scholar]
  23. Soon, F.C.; Khaw, H.Y.; Chuah, J.H.; Kanesan, J. Hyper-parameters optimisation of deep CNN architecture for vehicle logo recognition. IET Intell. Transp. Syst. 2018, 12, 939–946. [Google Scholar] [CrossRef]
  24. Liu, R.; Han, Q.; Min, W.; Zhou, L.; Xu, J. Vehicle Logo Recognition Based on Enhanced Matching for Small Objects, Constrained Region and SSFPD Network. Sensors 2019, 19, 4528. [Google Scholar] [CrossRef]
  25. Nguyen, H.O. Vehicle Logo Recognition Based on Vehicle Region and Multi-scale Feature Fusion. J. Theor. Appl. Inf. Technol. 2020, 98, 3327–3337. [Google Scholar]
  26. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  27. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-Shot Refinement Neural Network for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4203–4212. [Google Scholar]
  28. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  29. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
Figure 1. Example of the proportion of a vehicle logo. The proportion of small objects in this paper is approximately 0.2%.
Figure 2. The pipeline of our method. It includes the down-sampling feature extraction network with attention and a multi-scale feature fusion detection head.
Figure 3. Structure of the attention residual block. It helps the feature extraction network establish a local texture feature monitoring mechanism.
Figure 4. Structure of the feature fusion convolution block.
Figure 5. Structure of our detection head. We propose a multi-scale prediction method based on anchor box generation, with anchor boxes at three scales to address the scale change problem of small-size objects.
Figure 6. Examples from the VLD-45 dataset for the 45 categories.
Figure 7. Detailed samples from the VLD-45 dataset.
Figure 8. Examples of qualitative results of our method on the VLD-45 dataset.
Table 1. The results of the detection task.

| Number | Classes | Faster R-CNN [26] | RefineDet [27] | YOLOv3 [28] | YOLOv4 [29] | Our Method |
|---|---|---|---|---|---|---|
| 0001 | BAIC GROUP | 0.863 | 0.956 | 0.882 | 0.915 | 0.962 |
| 0002 | Ford | 0.724 | 0.817 | 0.732 | 0.802 | 0.862 |
| 0003 | SKODA | 0.723 | 0.794 | 0.692 | 0.831 | 0.825 |
| 0004 | Venucia | 0.914 | 0.914 | 0.893 | 0.929 | 0.948 |
| 0005 | HONDA | 0.874 | 0.837 | 0.847 | 0.853 | 0.871 |
| 0006 | NISSAN | 0.973 | 0.854 | 0.853 | 0.871 | 0.903 |
| 0007 | Cadillac | 0.925 | 0.715 | 0.741 | 0.852 | 0.885 |
| 0008 | SUZUKI | 0.945 | 0.783 | 0.842 | 0.834 | 0.934 |
| 0009 | GEELY | 0.785 | 0.746 | 0.712 | 0.784 | 0.806 |
| 0010 | Porsche | 0.734 | 0.604 | 0.694 | 0.736 | 0.745 |
| 0011 | Jeep | 0.726 | 0.693 | 0.652 | 0.810 | 0.833 |
| 0012 | BAOJUN | 0.912 | 0.827 | 0.835 | 0.883 | 0.875 |
| 0013 | ROEWE | 0.873 | 0.814 | 0.742 | 0.825 | 0.882 |
| 0014 | LINCOLN | 0.747 | 0.796 | 0.804 | 0.748 | 0.829 |
| 0015 | TOYOTA | 0.764 | 0.867 | 0.867 | 0.857 | 0.895 |
| 0016 | Buick | 0.837 | 0.794 | 0.839 | 0.768 | 0.815 |
| 0017 | CHERY | 0.719 | 0.813 | 0.796 | 0.821 | 0.858 |
| 0018 | KIA | 0.734 | 0.828 | 0.763 | 0.792 | 0.860 |
| 0019 | HAVAL | 0.572 | 0.574 | 0.525 | 0.622 | 0.734 |
| 0020 | Audi | 0.862 | 0.864 | 0.843 | 0.823 | 0.893 |
| 0021 | LAND ROVER | 0.432 | 0.405 | 0.354 | 0.514 | 0.606 |
| 0022 | Volkswagen | 0.932 | 0.912 | 0.935 | 0.897 | 0.947 |
| 0023 | Trumpchi | 0.836 | 0.852 | 0.895 | 0.846 | 0.903 |
| 0024 | CHANGAN | 0.859 | 0.807 | 0.828 | 0.931 | 0.866 |
| 0025 | Morris Garages | 0.875 | 0.916 | 0.879 | 0.938 | 0.948 |
| 0026 | Renault | 0.792 | 0.894 | 0.905 | 0.869 | 0.913 |
| 0027 | LEXUS | 0.868 | 0.853 | 0.879 | 0.847 | 0.897 |
| 0028 | BMW | 0.782 | 0.795 | 0.798 | 0.915 | 0.882 |
| 0029 | MAZDA | 0.879 | 0.841 | 0.864 | 0.849 | 0.895 |
| 0030 | Mercedes-Benz | 0.905 | 0.894 | 0.915 | 0.895 | 0.928 |
| 0031 | HYUNDAI | 0.873 | 0.885 | 0.873 | 0.873 | 0.904 |
| 0032 | Chevrolet | 0.713 | 0.672 | 0.654 | 0.714 | 0.788 |
| 0033 | BYD | 0.934 | 0.855 | 0.817 | 0.925 | 0.916 |
| 0034 | PEUGEOT | 0.783 | 0.742 | 0.695 | 0.857 | 0.895 |
| 0035 | Citroen | 0.828 | 0.756 | 0.712 | 0.851 | 0.904 |
| 0036 | Brilliance Auto | 0.897 | 0.915 | 0.902 | 0.900 | 0.927 |
| 0037 | Volvo | 0.921 | 0.873 | 0.853 | 0.910 | 0.935 |
| 0038 | Mitsubishi | 0.837 | 0.899 | 0.784 | 0.948 | 0.936 |
| 0039 | Subaru | 0.846 | 0.847 | 0.762 | 0.876 | 0.897 |
| 0040 | GMC | 0.884 | 0.865 | 0.783 | 0.933 | 0.914 |
| 0041 | Infiniti | 0.879 | 0.833 | 0.865 | 0.915 | 0.875 |
| 0042 | FAW Haima | 0.924 | 0.832 | 0.857 | 0.943 | 0.951 |
| 0043 | SGMW | 0.886 | 0.886 | 0.874 | 0.937 | 0.927 |
| 0044 | Soueast Motor | 0.802 | 0.793 | 0.775 | 0.784 | 0.932 |
| 0045 | QOROS | 0.873 | 0.847 | 0.821 | 0.908 | 0.914 |
| | mAP | 0.828 | 0.812 | 0.812 | 0.847 | 0.880 |
| | Average Overlap (%) | 87.6% | 80.5% | 80.5% | 86.4% | 89.3% |
| | Times (s) | 1.7 | 0.05 | 0.05 | 0.09 | 0.07 |
Table 2. Ablation results on the VLD-45 dataset. ✓ represents whether we use the sub-module.

| | AFEN | DH | FTP | mAP | Improved |
|---|---|---|---|---|---|
| (a) | ✓ | | | 0.855 | |
| (b) | ✓ | ✓ | | 0.865 | +0.12 |
| (c) | ✓ | | ✓ | 0.873 | +0.08 |
| (d) | ✓ | ✓ | ✓ | 0.880 | +0.05 |