1. Introduction
In recent years, with the advancement of autonomous driving, advanced battery, and artificial intelligence technologies, robots have developed into autonomous mobile robots (AMRs) that can autonomously perform assigned missions while moving through various spaces. Autonomous driving, obstacle avoidance, and object detection are essential functions that enable AMRs to perform their tasks efficiently. In particular, object detection is a key technology that underpins the other functions. In recent years, various deep learning networks based on convolutional neural networks (CNNs) have been proposed for image-based object detection [1,2,3,4]. However, these CNN-based object detection models are trained with large-scale image datasets that are collected without considering the environment. When such models are applied to AMRs without considering the mission environment, object detection accuracy for robots carrying out missions in a specific environment decreases, resulting in failed missions, as shown in Figure 1. Such object detection errors may cause a security surveillance robot at an airport or a patient care robot in a hospital to fail in its mission.
Cui et al. identified the class imbalance problem as a major cause of such object detection errors [5]. The class imbalance problem occurs when the occurrence frequencies of the classes in the training dataset differ significantly. Real-world datasets exhibit a long-tailed distribution, as shown in Figure 2a, where a small number of classes constitute most of the data and the remaining classes make up the minority. Hence, the class imbalance problem exists in real-world datasets, and it critically degrades the performance of object detection models. For example, as shown in Figure 1a, an advertising panel promoting a car at an airport terminal is recognized as a car, a class that occurs with high frequency in the training data. In Figure 1b, the special light hanging on the wall of a hospital room is incorrectly recognized as a wall-mounted air conditioner, another class that occurs with high frequency in the training data. This type of problem may become even more serious for AMRs that must perform missions targeting special objects in specific environments.
Studies have been conducted to solve this class imbalance problem, mainly using two strategies: re-sampling and re-weighting. The re-sampling technique adds (oversampling) or removes (undersampling) data according to the occurrence frequencies of classes [6,7,8]. In the oversampling method, overfitting can occur because of overlapping training data, which can harm the generalization performance of the trained model. Similarly, in the undersampling method, the accuracy of the trained model can decrease owing to insufficient training data. Because of these performance degradation problems, recent studies have focused on the re-weighting approach, which uses a new loss function that works well with long-tailed, class-imbalanced datasets. The re-weighting technique addresses class imbalance during training by adjusting the loss function to assign high weights to samples of low-frequency classes and low weights to samples of frequently occurring classes [9,10]. Representative examples include the focal loss used in RetinaNet [11], the class-balanced loss proposed by Cui et al. [5], and the LDAM loss proposed by Cao et al. [9].
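To make the re-weighting idea concrete, the class-balanced loss of Cui et al. [5] scales each class's loss contribution by the inverse of the effective number of samples. The sketch below (NumPy only, with hypothetical class counts) shows how such per-class weights could be computed; it illustrates the weighting factor, not the authors' full training pipeline.

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class weights inversely proportional to the effective number
    of samples E_n = (1 - beta^n) / (1 - beta), following Cui et al."""
    n = np.asarray(samples_per_class, dtype=np.float64)
    effective_num = (1.0 - np.power(beta, n)) / (1.0 - beta)
    weights = 1.0 / effective_num
    # Normalize so the weights sum to the number of classes.
    return weights / weights.sum() * len(n)

# Hypothetical long-tailed class counts, from head class to tail class.
counts = [10000, 1000, 100, 10]
w = class_balanced_weights(counts)
print(w)  # tail classes receive the largest weights
```

Multiplying each sample's loss by the weight of its class shifts the training signal toward rare classes without duplicating or discarding any data.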
Such class imbalance shows different characteristics depending on the environment. As shown in Figure 2, the datasets for different specific environments have different class imbalance distributions. In addition, the order of the occurrence frequencies of classes changes with the environment. For example, the class "Pillow" occurs more frequently than "Luggage" in the overall environment, as shown in Figure 2a. However, in the airport terminal environment, "Luggage" occurs much more frequently than "Pillow", and "Toilet" occurs very rarely, as shown in Figure 2b. Moreover, "Pillow" has the highest frequency in the hospital room environment, as shown in Figure 2c. These differing distributions indicate that solving the class imbalance problem in the entire environment does not solve the problem in a specific environment. In other words, a model trained on the overall environment with a re-weighting loss function may still have low object detection accuracy in a specific environment. Hereinafter, the problem of class occurrence frequencies varying with the environment will be referred to as environment class imbalance.

To overcome the environment class imbalance for an AMR performing its mission while moving through various environments, we propose the multi-model-based object detection framework with environment-context awareness (M-ODF). M-ODF provides a systematic solution that effectively resolves the environment class imbalance problem by using a separate object detection model for each environment. To this end, we first present a training process that creates multiple object detection models, one per environmental context. Next, we use a lightweight scene classification method for environmental context recognition so that the AMR can select the object detection model matching its current environmental context. Finally, we propose a model caching algorithm for efficient use of the multiple object detection models. Through experimental results, we demonstrate that M-ODF can effectively overcome the environment class imbalance problem.
The rest of this paper is organized as follows. Section 2 introduces previous research on CNN-based object detection and studies conducted to solve the class imbalance problem. Section 3 describes the mechanisms for creating multiple models and effectively using the models generated for each environment to mitigate the environment class imbalance problem. Section 4 analyzes the proposed method and existing state-of-the-art deep learning models through experiments. Finally, Section 5 concludes the paper and discusses directions for future research.
2. Related Work
In recent years, various deep learning models based on CNNs have been proposed for object detection. CNN-based object detection methods are divided into two-stage models, which sequentially process region proposal (also known as bounding box generation) and region classification, and one-stage models, which process region proposal and region classification in parallel [12].
The two-stage model first extracts candidate regions in which objects are likely to exist using region proposal. It then performs object classification using a CNN-based classifier to recognize objects in each region. The two-stage model takes more time because it processes region proposal and region classification sequentially. R-CNN [13], SPP-Net [14], Fast R-CNN [1], and Faster R-CNN [2] are representative two-stage object detection models. The one-stage method, which processes region proposal and object recognition simultaneously, was proposed to overcome the limitation that two-stage object detection models are difficult to apply in real-time applications. YOLO [15], SSD [16], YOLOv2 [3], DSSD [17], and YOLOv3 [4] are representative one-stage object detection networks. Although the one-stage model improved object detection speed, it suffered from lower accuracy. Recently, the latest YOLO networks, such as YOLOv4 [18] and YOLOv5 [19], have greatly improved accuracy. However, they still do not mitigate the environment class imbalance problem, as shown in Figure 1. In this study, we use YOLOv5 as the base object detection network for our proposed M-ODF. Because the class imbalance problem is the main reason for the degradation of object recognition accuracy, studies have been conducted using re-sampling and re-weighting approaches.
In the initially proposed re-sampling approach, training data are oversampled or undersampled according to the occurrence frequency of each class in the raw dataset [6,7,8]. More recently, re-weighting methods have been proposed that adjust the loss based on the class distribution of the dataset. Li et al. proposed a gradient harmonized single-stage detector to solve this imbalance problem [10]. They reported that the class imbalance problem can be summarized as an imbalance of the gradient norm, and their detector weights samples according to the gradient density. Notably, Cao et al. proposed a label-distribution-aware margin (LDAM) loss function [9]. LDAM resolves the class imbalance problem using a per-class margin boundary. Cui et al. also proposed a class-balanced loss based on the effective number of samples, defined as the volume of samples [5]. Their class-balanced loss addresses training on imbalanced data by introducing a weighting factor that is inversely proportional to the effective number of samples. However, these methods still fail to mitigate the environment class imbalance caused by the difference in class distributions across environments.
3. Multi-Model-Based Object Detection Framework with Environment-Context Awareness
To overcome the environment class imbalance problem, we propose the multi-model-based object detection framework with environment-context awareness (M-ODF), which efficiently uses multiple object detection models, each trained for a particular environmental context. Figure 3 shows the object recognition process of M-ODF while an AMR moving through various mission spaces (e.g., spaces in a shopping mall) performs tasks. Figure 3a (object detection step) shows the object detection process of the AMR. The object detection function is triggered for every input video frame to support real-time applications. In this process, object detection is performed using the model corresponding to the environmental context of the space in which the AMR is currently performing its mission. Figure 3b (environmental context recognition step) shows the process of recognizing the environmental context of the space in which the AMR is performing its mission. A scene classification technique is used to recognize the environmental context, and we propose a lightweight scene classification scheme to reduce computing resource usage. Finally, Figure 3c (model caching step) shows the model caching process used to quickly switch the object detection model when the environmental context of the AMR changes. We propose a transition-probability-based model caching mechanism in which the AMR preferentially caches the models corresponding to the environmental contexts of neighboring spaces with high movement probability. In this section, we describe the main algorithms of M-ODF that are used to efficiently create and utilize multiple models: multi-model-based object detection, lightweight scene classification, and transition-probability-based model caching.
3.1. Multi-Model-Based Object Detection
Previous studies used re-sampling and re-weighting methods to overcome the class imbalance problem, but they did so with only a single object detection model. Therefore, we refer to these previous methods collectively as single-model-based object detection frameworks (S-ODF). S-ODF cannot solve the environment class imbalance problem, in which the occurrence frequencies of classes in a dataset change according to the environment. To tackle this issue, we propose M-ODF, which mitigates the environment class imbalance by effectively utilizing multiple object detection models, each trained for a particular environment.
Environment-specific datasets are required to train object detection models specialized for each environment. Figure 4 shows the procedure for training environment-specific object detection models. To obtain an object detection model for each environment, we classify the training dataset by environmental context and train each model on the corresponding subset. First, as shown in Figure 4a, we apply a scene classifier to the training dataset, which is not divided by environmental context, to create environment-specific datasets. The classes (e.g., airport terminal or bookstore) used in the scene classifier should be carefully selected so that they sufficiently represent the environments in which the AMR carries out its missions. The next step is the training process, in which we train the object detection models on the environment-specific datasets. However, an environment-specific dataset created by partitioning the original dataset may be too small to train an object detection model. To handle this small-dataset problem, we use transfer learning, a machine learning method that reuses a pretrained model as the starting point for a new model. Additionally, we use K-fold cross-validation to achieve data augmentation effects (Figure 4b).
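The dataset-partitioning step of Figure 4a can be sketched as follows; `scene_classifier` is a hypothetical function that returns an environment label (e.g., "airport_terminal") for each training image, standing in for the actual scene classifier used in our pipeline.

```python
from collections import defaultdict

def split_by_environment(dataset, scene_classifier):
    """Group (image, annotations) pairs into environment-specific
    datasets using a scene classifier, as in Figure 4a."""
    env_datasets = defaultdict(list)
    for image, annotations in dataset:
        env = scene_classifier(image)
        env_datasets[env].append((image, annotations))
    return dict(env_datasets)

# Hypothetical usage: strings stand in for real frames and label files.
dataset = [("img1", "boxes1"), ("img2", "boxes2"), ("img3", "boxes3")]
classify = lambda img: "hospital_room" if img == "img3" else "airport_terminal"
envs = split_by_environment(dataset, classify)
print(sorted(envs))  # ['airport_terminal', 'hospital_room']
```

Each resulting subset then fine-tunes a pretrained detector (transfer learning), with K-fold cross-validation applied per subset as in Figure 4b.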
3.2. Lightweight Scene Classification
To select the appropriate model for the environment, it is necessary to accurately recognize the environmental context of the space in which the AMR is currently performing its task. Scene classification is used to determine the current environmental context. However, performing scene classification frequently causes the AMR to consume substantial computing resources. Therefore, we propose a lightweight scene classification method that alleviates this resource consumption. The proposed scene classifier performs scene classification to recognize the environmental context only when a scene change is detected by a lightweight scene change detector; as a result, it uses fewer resources than a periodic scene classifier. The proposed lightweight scene classifier consists of a scene change detection process and a scene classification process.

The scene change detection process is divided into an image preprocessing module and a scene change detector module. The image preprocessing module reduces the noise in the raw camera image and reduces the image size to support lightweight scene change detection. It first applies a Gaussian blur filter, because noise in the images collected from the camera may affect scene change detection. Next, max pooling is applied to the filtered images to reduce the load of scene change detection; max pooling reduces the image size while maintaining its characteristics. The scene change detector then detects changes in color and brightness between the previous and the current scene. We convert the preprocessed image into grayscale and HSV (hue, saturation, value) images to capture the brightness and color features: the grayscale image represents the brightness, and the HSV image expresses the hue and saturation of the RGB image. The two feature images are transformed into histogram vectors, which are compared with those of the previous scene; a scene change is detected based on the amount of change in the histograms.
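A minimal sketch of the scene change detector is given below, assuming frames arrive as NumPy RGB arrays. It is a simplification of our pipeline: the Gaussian blur step is omitted for brevity, and a crude dominant-channel index stands in for the full HSV hue/saturation features; the threshold value is illustrative.

```python
import numpy as np

def max_pool(img, k=2):
    """Downsample with k x k max pooling (crop to a multiple of k)."""
    h, w = (img.shape[0] // k) * k, (img.shape[1] // k) * k
    img = img[:h, :w]
    return img.reshape(h // k, k, w // k, k).max(axis=(1, 3))

def frame_histograms(rgb, bins=32):
    """Normalized brightness and (crude) color histograms of a frame."""
    gray = rgb[..., :3] @ np.array([0.299, 0.587, 0.114])  # brightness
    hist_v, _ = np.histogram(max_pool(gray), bins=bins, range=(0, 255))
    # Color proxy: index of the dominant channel per pixel (not true hue).
    dom = max_pool(rgb.argmax(axis=-1).astype(float))
    hist_c, _ = np.histogram(dom, bins=3, range=(0, 3))
    return hist_v / hist_v.sum(), hist_c / hist_c.sum()

def scene_changed(prev, curr, threshold=0.5):
    """Flag a change when the L1 histogram distance exceeds a threshold."""
    pv, pc = frame_histograms(prev)
    cv, cc = frame_histograms(curr)
    return np.abs(pv - cv).sum() + np.abs(pc - cc).sum() > threshold

# Toy frames: an all-dark frame vs. an all-bright frame.
dark = np.zeros((64, 64, 3))
bright = np.full((64, 64, 3), 250.0)
print(scene_changed(dark, bright))  # True: the brightness histogram shifts
```

Because only small pooled histograms are compared, this check is far cheaper than running the full scene classifier on every frame.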
Figure 5 illustrates the process of recognizing the environmental context through lightweight scene classification and of performing object detection using models stored in the model store. The lightweight scene classifier performs scene change detection at a preset interval, as shown in Figure 3b. A scene change is determined based on the change in the color and brightness feature values between the previous and the current scene: if the measured change is larger than a predefined reference value, the scene is considered to have changed. When a scene change is detected, scene classification is performed to recognize the new environmental context of the AMR's current space. If the environmental context has changed, object detection is performed using the model corresponding to the new environmental context [20].
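Putting these pieces together, the per-frame flow of Figure 5 could look like the following sketch; `scene_changed`, `classify_scene`, and the model store contents are hypothetical stand-ins for the actual detector, classifier, and environment-specific models.

```python
class MODF:
    """Per-frame loop of M-ODF: run the cheap scene change detector on
    every frame, the heavy scene classifier only when a change is
    detected, and the environment-specific detector from the store."""

    def __init__(self, model_store, classify_scene, scene_changed):
        self.model_store = model_store      # env context -> detector
        self.classify_scene = classify_scene
        self.scene_changed = scene_changed
        self.prev, self.context = None, None

    def process(self, frame):
        if self.prev is None or self.scene_changed(self.prev, frame):
            self.context = self.classify_scene(frame)  # heavy, run rarely
        self.prev = frame
        return self.model_store[self.context](frame)   # env-specific model

# Hypothetical usage: contexts and detectors are toy stand-ins.
store = {"airport": lambda f: "airport-detections",
         "hospital": lambda f: "hospital-detections"}
modf = MODF(store, classify_scene=lambda f: f,
            scene_changed=lambda a, b: a != b)
print(modf.process("airport"))   # airport-detections
print(modf.process("hospital"))  # hospital-detections
```

In the common case of consecutive frames from the same space, only the change detector runs, so the classifier cost is amortized over many frames.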
3.3. Transition Probability-Based Model Caching
The appropriate model for the environment should be cached in the GPU in advance to perform object detection in each environment. If the object detection model is not cached, delays may occur owing to the time required to load the model. To address this problem, we propose transition-probability-based model caching. The basic idea is to precache the models for the environments corresponding to the neighboring spaces with the highest transition probabilities, where the transition probability is the probability of moving from the current state to a next state. Each transition probability is an element of the converged state transition matrix calculated from the continuous movement of the AMR in the mission space. The transition probability p_ij used in the state transition matrix is the probability of moving from state s_i to state s_j, as shown in Equation (1); if it is not possible to move from s_i to s_j, the transition probability is zero. When the AMR moves from s_i to s_j, p_ij is updated through Equation (2). Here, n_i denotes the total number of environment change detections performed in state s_i, and n_ij represents the number of transitions from state s_i to state s_j.
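The transition probability update of Equation (2) amounts to maintaining empirical counts per state and taking their ratio. A minimal sketch (state names and variable names are illustrative) might be:

```python
from collections import defaultdict

class TransitionModel:
    """Maintains empirical transition probabilities p_ij = n_ij / n_i,
    updated at every environment change detection."""

    def __init__(self):
        self.n = defaultdict(int)     # n_i: detections performed in state i
        self.n_ij = defaultdict(int)  # n_ij: observed transitions i -> j

    def observe(self, current, nxt):
        """Record one detection in `current` that ended in `nxt`
        (nxt == current when no environment change was detected)."""
        self.n[current] += 1
        self.n_ij[(current, nxt)] += 1

    def p(self, i, j):
        """Empirical transition probability; zero if state i is unseen."""
        return self.n_ij[(i, j)] / self.n[i] if self.n[i] else 0.0

# Example: five detections in state s1, four with no change and one
# transition to s2, matching the update rule described above.
tm = TransitionModel()
for _ in range(4):
    tm.observe("s1", "s1")
tm.observe("s1", "s2")
print(tm.p("s1", "s1"), tm.p("s1", "s2"))  # 0.8 0.2
```

As the AMR keeps moving through the mission space, these ratios converge to the state transition matrix used for caching decisions.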
From the converged state transition matrix, we can determine the neighboring state to which the AMR is most likely to move from its current state. Therefore, precaching models in descending order of transition probability increases the probability of using a cached model. The hit rate also increases with the cache size; if the cache size is larger than the maximum degree (the number of edges of a state) of the state transition diagram, the hit rate will be 100%. The state transition matrix learned by an AMR performing tasks in the mission environment (Figure 6a) is shown in Figure 6b, and Figure 6c illustrates the corresponding state transition diagram.
Figure 7 shows how the state transition matrix is calculated when the cache size is three in the environment shown in Figure 6a, and how the matrix is utilized for model caching. As shown in Figure 7, in scenario (a), the AMR moves from s_1 to s_2 and then to s_3 (s_1 → s_2 → s_3) along its path. The AMR first performs scene classification to recognize the environmental context of its current space (s_1). Thereafter, it loads the object detection model suitable for that environmental context into system memory and sets the transition probability p_11 to one. In scenario (a), environment change detection occurs five times in state s_1. Four out of the five times, no environment change is detected, and the state changes from s_1 to s_2 at the fifth detection. Hence, p_11 is updated to 0.8 and p_12 to 0.2. Moreover, in state s_2, environment change detection is performed ten times. Nine out of the ten times, the environment does not change, but the state changes from s_2 to s_3 at the tenth detection. Hence, p_22 is updated to 0.9 and p_23 to 0.1. In this manner, the state transition matrix is continuously updated at each detection interval. When the number of models cached in the AMR reaches the cache size, models are evicted from the cache in ascending order of transition probability.
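The caching behavior described above can be sketched as follows; the model loader and the transition probabilities are hypothetical stand-ins, and eviction removes the cached model whose state has the lowest transition probability from the current state.

```python
def update_cache(cache, cache_size, current_state, trans_prob, load_model):
    """Precache models for neighboring states in descending order of
    transition probability; evict lowest-probability entries first.

    cache:       dict mapping state -> loaded detection model
    trans_prob:  dict mapping (i, j) -> p_ij from the transition matrix
    load_model:  function loading the detection model for a state
    """
    # Neighbors of the current state, most likely transitions first.
    neighbors = sorted(
        (j for (i, j) in trans_prob if i == current_state),
        key=lambda j: trans_prob[(current_state, j)],
        reverse=True,
    )
    for j in neighbors[:cache_size]:
        if j not in cache:
            if len(cache) >= cache_size:
                # Evict the cached state with the smallest probability.
                worst = min(cache,
                            key=lambda s: trans_prob.get((current_state, s), 0.0))
                del cache[worst]
            cache[j] = load_model(j)
    return cache

# Hypothetical usage with cache size 2: the two likeliest states are kept.
probs = {("s1", "s1"): 0.8, ("s1", "s2"): 0.15, ("s1", "s3"): 0.05}
cache = update_cache({}, 2, "s1", probs, load_model=lambda s: f"model_{s}")
print(sorted(cache))  # ['s1', 's2']
```

With a cache at least as large as the maximum degree of the state transition diagram, every neighbor's model fits and the hit rate reaches 100%, matching the observation above.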