1. Introduction
Object detection plays a pivotal role in the field of computer vision and underpins advanced visual tasks like behavior comprehension, scene interpretation, and visual question answering. This technology is extensively utilized in diverse applications ranging from everyday safety, robot navigation, and video surveillance to scene detection and aerospace. Despite its widespread use, object detection continues to pose difficulties due to factors including object occlusion, substantial scale differences, and background clutter. The pursuit of robust and efficient object detection methods carries substantial practical significance.
Indoor scene object detection is an important field within object detection [1], with broad application prospects in smart home devices, indoor service robots, and security monitoring [2]. Indoor complex scenes refer to indoor environments with complex background interference, object occlusion, and large-scale variations [3]. Compared to traditional outdoor scene object detection, indoor complex scene object detection is more difficult, making it a cutting-edge research topic.
Research on indoor object detection has proceeded in two primary phases: the first relied on hand-crafted feature-based algorithms, and the second on deep learning-based methods. Techniques that rely on hand-crafted features typically exploit low-level cues such as texture, color, and shape to build shallow features that are then used for object detection and localization, and they have the advantage of strong interpretability. To address these challenges, researchers conducted extensive work on feature design: traditional object detection algorithms focused primarily on this aspect, with researchers worldwide proposing a series of effective feature operators such as SIFT [4], HOG [5], and LBP [6]. These manually designed features achieved good results in certain specific scenarios but lacked universality and had limited representational power, making it difficult to describe objects with high-level semantic information and thus unsuitable for all complex scenes.
Deep learning-based techniques have recently improved detection outcomes by using convolutional neural networks to extract features, which express semantic information more effectively. These methods can be roughly categorized into two types, depending on whether the detection pipeline includes a candidate region proposal stage: two-stage algorithms based on candidate regions and one-stage algorithms based on regression. Typical two-stage methods include R-CNN, Fast R-CNN, SPP-Net, Faster R-CNN, R-FCN, Mask R-CNN, and Cascade R-CNN [7,8,9,10,11,12,13]. One-stage algorithms directly produce class probabilities and position coordinates without generating candidate regions; typical methods include YOLOv1, YOLOv2, YOLOv3, YOLOv4, YOLOX, YOLOv7, YOLOv9, SSD, DSSD, and RetinaNet [14,15,16,17,18,19,20,21,22,23].
Convolutional neural network-based object detection models have strong fitting and generalization capabilities and have been applied extensively to indoor object recognition tasks. Gupta et al. [24] proposed an algorithm based on the gPb-ucm method, which improves object boundary detection and hierarchical segmentation using depth information and uses RGB-D images for indoor scene understanding. Mattausch et al. [25] proposed an automatic segmentation method for indoor scenes that uses spectral clustering to identify object categories, enabling the automatic detection and classification of objects in large-scale indoor scan point clouds. Georgakis et al. [26] proposed an automated method for generating synthetic training data that considers both geometric and semantic information, aiming to address the object detection problem for service robots in indoor environments. Yao [27] proposed an optimization algorithm based on the regularity of indoor scene object attributes, combining prior information with detection results and exploiting the ability of recurrent convolutional neural networks (RCNNs) to process sequential data, thereby extracting global image information and improving the accuracy of indoor object detection. Chen [28] proposed an indoor complex scene object detection model based on cross-attention, together with a sample-adaptive mixed data augmentation method based on classification confidence and a few-shot object detection algorithm for indoor complex scenes, improving detection performance in such environments. Ni et al. [29] introduced an SSD-based object detection method that employs a dual-threshold non-maximum suppression (DT-NMS) algorithm to address occlusion problems in indoor environments, with an MCIE module used to gather contextually significant information from indoor settings, enhancing indoor object recognition at the regional level. Gladis et al. [30] proposed the Indoor-Outdoor YOLO Glass network (In-Out YOLO), a video-based object detection system whose main innovation is an adaptive spatial pyramid pooling-based Squeeze-and-Attention block integrated with YOLO; it assists visually impaired individuals in navigating and identifying objects both indoors and outdoors, addressing the constraints of low-power wearable devices and improving their independence and quality of life.
In recent years, the detection transformer (DETR) architecture based on the Transformer structure [31] has become a new paradigm for object detection. These models abandon performance-limiting handcrafted modules and rely on attention mechanisms to match queries with features, treating object detection as an end-to-end set prediction task. By adopting a hybrid structure of Transformer and convolutional networks and viewing detection as a set prediction problem, they decouple position from prediction, simplify the entire detection pipeline, eliminate operations such as NMS that limit detection performance, and demonstrate strong potential in tasks involving complex scales and dense objects thanks to their better global modeling capabilities. He et al. [32] separated cross-attention into two independent branches, initialized the decoder's content queries with a small detector, improved the content queries in the early stages of training by predicting classification and regression embeddings, and introduced a pairwise self-attention mechanism in the decoder to consider spatially adjacent object query pairs and exploit their spatial context. Zhang et al. [33] proposed the DINO model based on Deformable DETR. DINO employs dynamic anchor boxes (DABs) and denoising training (DN), using deformable attention to improve computational efficiency. By simultaneously presenting the model with positive samples (small noise) and negative samples (large noise) of the same ground-truth target, it helps the model distinguish targets and avoid duplicate detections. Combining the localization queries from the encoder output with learnable, content-rich queries improves the initialization of anchor boxes, and refinement box information from later layers is used to optimize the parameters of adjacent earlier layers, improving the accuracy of box prediction.
However, there are still some difficulties in practical applications:
(1) Indoor environments often have more complex backgrounds and more diverse targets. Due to factors such as lighting and occlusion, the shapes and sizes of objects of the same class frequently vary greatly, reducing detection accuracy.
(2) Obtaining annotated data for indoor object detection is inherently challenging because of the complexity of indoor scenes. Annotation cost is exacerbated by the diverse and cluttered nature of indoor environments, where objects can be partially or fully occluded and backgrounds vary significantly. Furthermore, the frequency of occurrence of certain objects is often skewed, leading to class imbalance; models then favor frequently occurring objects during training, neglecting less common objects and yielding lower detection accuracy for rare instances. The need for precise localization and the high variability in object arrangements make annotation more difficult and costly than in scenarios where objects are more uniformly distributed and less occluded.
(3) Indoor environments exhibit a wide range of object scales. This variation stems from the presence of both small everyday items, such as keys and phones, and large pieces of furniture, such as sofas and bookshelves; the size difference between these objects can span several orders of magnitude. Small objects require high-resolution detection capabilities, while large objects demand that the model maintain accuracy over a broader spatial context. This variability in scale is more pronounced indoors than in outdoor scenes, where objects tend to be more uniform in size and perspective.
Such scale variation makes it difficult for a single detection model to handle objects of all scales simultaneously. Achieving robust and efficient object detection in indoor complex scenes is therefore of paramount importance.
To tackle these challenges, this study introduces a novel object detection methodology tailored for indoor complex environments. Unlike existing methods, our approach leverages an enhanced DINO model, which employs an improved Res2Net as the backbone for feature extraction. This backbone is integrated with deformable attention mechanisms to capture salient feature information more effectively, aiding in the differentiation of similar-class objects from their backgrounds. Additionally, our method incorporates an improved feature pyramid network, GBi-FPN, in the neck to facilitate more accurate recognition after fusing deeper features. To expedite convergence, we utilize the SIoU loss, which enables precise identification of indoor objects amidst complex backgrounds. This research offers a practical solution for indoor complex scene object detection applications, including smart homes, indoor service robots, and security surveillance systems, demonstrating significant improvements over current methods in terms of accuracy, efficiency, and robustness.
2. Materials and Methods
2.1. Datasets
The currently available indoor scene object detection datasets, such as NYU-Depth V2 and SUNRGB-D, primarily include depth information and are designed for three-dimensional reconstruction tasks, making them unsuitable for our research focus on two-dimensional objects. NYU-Depth V2, with its limited 1449 images and only six distinct object categories, severely restricts the generalization capabilities of object detection models across diverse indoor settings. SUNRGB-D, while offering a more extensive range of 32 object categories, suffers from a significant imbalance in the distribution of these categories across its 10,345 images, leading to model biases towards frequently occurring objects and compromising detection accuracy for less common ones. Even datasets like GMU-Kitchen Scenes, which can be used for two-dimensional object detection, are captured via depth cameras and annotated by selecting frames from video clips; since most of these annotated images originate from the same video segment, they are highly similar to one another. Additionally, these datasets suffer from problems such as limited scenes, single shooting angles, and insufficient categories and quantities of objects, all of which fail to meet the requirements of our indoor complex scene object detection task. Therefore, it is necessary to construct a high-quality dataset with rich categories and ample sample sizes.
The COCO object detection dataset contains 80 different classes of objects and over 200,000 labeled images, making it the most widely used dataset for object detection. It includes common indoor objects such as furniture, appliances, and people, and encompasses complex scenes. We therefore constructed a dataset suited to our application scenario on top of COCO. As shown in Figure 1, we built an Indoor Complex Scene Object Detection Dataset (referred to as Indoor-COCO hereafter) based on the COCO object detection dataset. The dataset was constructed as follows: we programmatically and manually selected indoor scene images from the COCO object detection dataset that met conditions such as complex backgrounds, significant object occlusion, and large scale variations. This resulted in a dataset containing 11,302 images across eight categories. The table displays the statistical distribution of the dataset's categories. The training and test sets were split at an 8:2 ratio by random allocation, yielding 9041 images for the training set and 2261 images for the test set. The resulting dataset meets the criteria for indoor complex scene object recognition tasks, as it encompasses a wide range of categories, boasts a substantial number of samples, features intricate scenes, and includes diverse viewing angles.
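As a hedged illustration of the splitting step only (the file names, the fixed seed, and the assumption that a pre-filtered indoor subset already exists in COCO format are placeholders, not the authors' actual tooling), an 8:2 random split of a COCO-style annotation file could be performed as follows:

```python
import json
import random

random.seed(0)  # fixed seed so the split is reproducible

# Assumed input: a COCO-format annotation file already filtered to the
# indoor images and eight categories described above.
with open("indoor_coco.json") as f:
    coco = json.load(f)

images = list(coco["images"])
random.shuffle(images)
n_train = int(0.8 * len(images))
splits = {"indoor_coco_train.json": images[:n_train],
          "indoor_coco_test.json": images[n_train:]}

for path, imgs in splits.items():
    ids = {im["id"] for im in imgs}
    subset = {
        "images": imgs,
        "annotations": [a for a in coco["annotations"] if a["image_id"] in ids],
        "categories": coco["categories"],
    }
    with open(path, "w") as f:
        json.dump(subset, f)
```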
2.2. Indoor Object Detection Model Based on DINO
The I-DINO (Indoor-DINO) indoor object detection model, illustrated in Figure 2, consists of a backbone network, a neck, multiple Transformer encoder and decoder layers, and several prediction heads. The input image undergoes feature extraction through an enhanced Res2Net backbone, which incorporates multi-scale receptive fields that are crucial for detecting objects of varying sizes within cluttered indoor scenes. This backbone, with increased depth, modified convolution and normalization layers, and integrated pre-trained weights, extracts four feature maps at different scales. To combat occlusion and capture detailed object features in complex indoor backgrounds, the backbone incorporates a deformable attention module, which adaptively adjusts the attention window to align with object contours, improving the model's ability to handle occlusion and background clutter. The extracted feature maps are then fed into the neck feature fusion module, GBi-FPN (Grid Bi-directional Feature Pyramid Network), which integrates features from different scales to provide a comprehensive representation that mitigates the impact of scale variation in indoor objects; the semantic information enriched through this fusion is crucial for accurate detection in diverse indoor environments. The fused feature maps and positional information are then input to the deformable Transformer's encoder and decoder: the encoder processes the feature maps to capture contextual information, while the decoder refines the features to generate precise object proposals. Finally, image features extracted near the reference points are sent to multiple prediction heads to complete end-to-end recognition. To further improve the accuracy of bounding box predictions, especially when objects overlap significantly, the model employs the SIoU loss function, which considers not only the overlap area but also the shape and orientation of the boxes, ensuring more accurate localization and detection of objects in complex indoor scenes.
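To summarize the data flow just described, a schematic sketch is given below; the class and attribute names are illustrative placeholders rather than the authors' actual implementation, and each component stands for the module described above:

```python
import torch.nn as nn

class IDINO(nn.Module):
    """Schematic of the I-DINO pipeline (component implementations are
    assumed to be provided elsewhere)."""
    def __init__(self, backbone, neck, transformer, heads):
        super().__init__()
        self.backbone = backbone        # Res2Net101 with deformable attention
        self.neck = neck                # GBi-FPN multi-scale feature fusion
        self.transformer = transformer  # deformable encoder/decoder (DINO-style queries)
        self.heads = heads              # classification + box regression heads

    def forward(self, images):
        feats = self.backbone(images)          # four feature maps at different scales
        fused = self.neck(feats)               # fused multi-scale representation
        hs, refs = self.transformer(fused)     # decoder states and reference points
        # each prediction head outputs class logits and box offsets w.r.t. refs
        return [head(hs, refs) for head in self.heads]
```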
2.3. Res2Net Network
Res2Net is a multi-scale backbone architecture first proposed by Gao et al. [34] as an improvement on the ResNet structure. Its multi-scale nature stems from multiple receptive fields at a finer granularity level, distinguishing it from methods that obtain multiple scales by using different resolutions. As illustrated in Figure 3, Res2Net divides the 3 × 3 filters in the original ResNet Bottleneck block into n smaller filter groups, forming the Bottle2neck block. It connects the filter groups with a residual-like structure so that features extracted by each group are passed on to the next, and finally concatenates all groups' feature maps and sends them through a 1 × 1 filter for full feature fusion. The combinatorial explosion effect generated by this structure gives the Res2Net architecture multi-scale receptive fields. Res2Net was chosen over alternatives such as EfficientNet and DenseNet because of its superior ability to capture multi-scale features: its residual blocks with hierarchical, parallel branches allow it to capture a richer set of features at various scales, which is particularly beneficial for indoor scenes where objects vary significantly in size and complexity. EfficientNet, while efficient, focuses on a balanced scaling of depth, width, and resolution, which may be less effective for capturing the fine-grained details needed for indoor object detection; DenseNet, with its dense connectivity pattern, can be computationally expensive and may suffer from diminishing returns in feature propagation, making it less suitable for our resource-constrained application. Depending on the number of layers, Res2Net has multiple configurations; to ensure model depth without significantly increasing the computational load, this paper selects Res2Net101 as the backbone. Because of the small batch size, Group Normalization [35] is used to replace the original Batch Normalization, and the existing convolution layers (Conv) are replaced with ConvAWS [36], thereby enhancing the network's accuracy in indoor object recognition and reducing model computational overhead.
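A minimal sketch of the Bottle2neck idea with the Group Normalization substitution is given below, assuming a simplified layout (fixed channel width, four splits, and no ConvAWS or stride handling), not the exact configuration used in the paper:

```python
import torch
import torch.nn as nn

class Bottle2neck(nn.Module):
    """Simplified Res2Net Bottle2neck: the single 3x3 conv of a ResNet
    bottleneck is split into hierarchical groups of smaller 3x3 convs."""
    def __init__(self, channels=256, scales=4, gn_groups=32):
        super().__init__()
        width = channels // scales
        self.scales = scales
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.gn1 = nn.GroupNorm(gn_groups, channels)
        # one 3x3 conv per split except the first, which is passed through
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1, bias=False) for _ in range(scales - 1))
        self.gns = nn.ModuleList(
            nn.GroupNorm(min(gn_groups, width), width) for _ in range(scales - 1))
        self.conv3 = nn.Conv2d(channels, channels, 1, bias=False)
        self.gn3 = nn.GroupNorm(gn_groups, channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.relu(self.gn1(self.conv1(x)))
        splits = torch.chunk(out, self.scales, dim=1)
        feats, prev = [splits[0]], None
        for i, (conv, gn) in enumerate(zip(self.convs, self.gns)):
            # hierarchical connection: each group also receives the previous output
            s = splits[i + 1] if prev is None else splits[i + 1] + prev
            prev = self.relu(gn(conv(s)))
            feats.append(prev)
        out = torch.cat(feats, dim=1)            # concatenate all scale groups
        out = self.gn3(self.conv3(out))          # 1x1 conv fuses the groups
        return self.relu(out + identity)
```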
2.4. Deformable Attention Module
In the task of indoor object recognition against complex backgrounds, objects within indoor environments may be partially or completely occluded by other objects, making it difficult to extract features from the occluded parts. Moreover, indoor backgrounds can be highly complex, containing varied textures and colors that may be confused with object features, making it harder to distinguish target objects from the background and thereby increasing the difficulty of detection. The Res2Net network demonstrates excellent performance in recognition and classification, but for indoor object recognition in complex scenes, subtle differences in features hinder the network's ability to retrieve more detailed pixel information, which can degrade the performance of the backbone network. Additionally, DINO utilizes learnable vectors in four dimensions (x, y, w, h), emphasizing spatial information, whereas the convolution operations in the Res2Net network are confined to the plane, thereby losing spatial feature information.
To address these two issues, this study incorporates a deformable attention mechanism [37] into the Bottle2neck block of the network. The deformable attention mechanism can adaptively adjust the shape and size of the attention window to better match the actual shape of the target object, particularly for irregular or tilted objects. By focusing on local features around the target, deformable attention provides more precise feature representations, thereby enhancing detection accuracy. Given its ability to adapt to geometric variations, it is also more robust to occlusion, rotation, and scaling. Compared with traditional global attention, deformable attention computes only at key sampling points, reducing unnecessary computation and improving runtime speed. Considering the model's computational capacity, the deformable attention module is added after the ConvAWS layer in the Bottle2neck block, allowing it to dynamically adjust the positions of attention points based on input features, enlarging the model's receptive field and enabling it to better capture complex structures and details in the image. The deformable attention operation is given in Equation (1):
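For reference, the single-scale deformable attention defined in [37], which Equation (1) is assumed to follow, can be written as

$$\mathrm{DeformAttn}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \left[\sum_{k=1}^{K} A_{mqk} \cdot W'_m\, x\!\left(p_q + \Delta p_{mqk}\right)\right],$$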
where $z_q$ denotes the query feature, $x$ the input feature map, and $p_q$ the two-dimensional reference point of the query element; $p_q + \Delta p_{mqk}$ is the sampling location, and $A_{mqk}$ and $\Delta p_{mqk}$ are, respectively, the attention weight and sampling offset of the $k$-th of $K$ sampling points in the $m$-th of $M$ attention heads, with $W_m$ and $W'_m$ learnable projection matrices. The learned offsets and attention weights improve the local perception capability of the network. Figure 4 depicts the updated backbone network architecture.
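To make the sampling mechanism concrete, the sketch below implements a simplified single-scale variant of Equation (1) using bilinear sampling; the module layout, the head/point counts, and the placement details are illustrative assumptions rather than the exact implementation used in this work:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-scale deformable attention sketch: each query predicts K
    offsets and weights per head and aggregates bilinearly sampled values."""
    def __init__(self, dim=256, heads=8, points=4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.points, self.head_dim = heads, points, dim // heads
        self.offset = nn.Linear(dim, heads * points * 2)  # (dx, dy) per sampling point
        self.weight = nn.Linear(dim, heads * points)      # attention weights A_mqk
        self.value = nn.Conv2d(dim, dim, 1)               # value projection (W'_m)
        self.out = nn.Linear(dim, dim)                    # output projection (W_m)

    def forward(self, query, ref_points, feat):
        # query: (B, Nq, C); ref_points: (B, Nq, 2) in [0, 1], (x, y); feat: (B, C, H, W)
        B, Nq, C = query.shape
        H, W = feat.shape[-2:]
        offs = self.offset(query).view(B, Nq, self.heads, self.points, 2)
        attn = self.weight(query).view(B, Nq, self.heads, self.points).softmax(-1)
        value = self.value(feat).view(B, self.heads, self.head_dim, H, W)

        # normalized sampling locations p_q + offset, mapped to [-1, 1] for grid_sample
        scale = torch.tensor([W, H], dtype=feat.dtype, device=feat.device)
        loc = ref_points[:, :, None, None, :] + offs / scale
        grid = 2.0 * loc - 1.0                            # (B, Nq, heads, points, 2)

        per_head = []
        for m in range(self.heads):
            v = value[:, m]                               # (B, head_dim, H, W)
            g = grid[:, :, m]                             # (B, Nq, points, 2)
            sampled = F.grid_sample(v, g, align_corners=False)           # (B, head_dim, Nq, points)
            per_head.append((sampled * attn[:, :, m].unsqueeze(1)).sum(-1))  # (B, head_dim, Nq)
        out = torch.cat(per_head, dim=1).transpose(1, 2)  # (B, Nq, C)
        return self.out(out)
```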
2.5. GBi-FPN
In indoor object detection tasks, the size and viewing angle of indoor objects can vary significantly, ranging from small items to large furniture, and the appearance of the same object can change markedly when observed from different angles. This scale variation makes it challenging for a single detection model to handle objects of all scales simultaneously. The encoder in DINO utilizes a deformable Transformer which, for the four feature maps of different scales produced by the backbone network, applies a 1 × 1 convolution with stride 1 to uniformly reduce the channels to 256. Feature maps at different scales carry complementary information: shallow features contain finer-grained details and pixel-level localization accuracy, while deep features carry more accurate contextual and semantic information. Fully integrating low-level and high-level features can therefore yield more useful feature representations and reduce the interference of irrelevant information.
Feature Pyramid Networks (FPNs) address the issue of multi-scale object feature representation. FPN upsamples higher-level feature maps and uses 1 × 1 convolutions to adjust the number of channels so that they match the lower-level maps, then fuses them with the lower-level features. Because feature maps at lower levels, closer to the input, have smaller receptive fields and extract more specific, local features such as textures and shapes, while higher-level feature maps have larger receptive fields and extract more abstract semantic information, combining the two yields richer feature information.
The Bidirectional Feature Pyramid Network (Bi-FPN), a network topology for computer vision tasks such as object detection and semantic segmentation, has two essential components: weighted feature map fusion and efficient bidirectional cross-scale connections [38]. Bi-FPN can therefore effectively handle information exchange between features of different scales and fuse different feature maps efficiently, improving both model performance and efficiency. The bidirectional cross-scale connections are implemented through top-down and bottom-up pathways, ensuring that features across multiple scales are fully integrated and propagated. During this process, Bi-FPN maintains the same feature resolution and adds lateral connections during upsampling and downsampling to combine features of different scales effectively without significantly increasing computational cost.
Additionally, Bi-FPN can serve as a basic unit: a pair of pathways is treated as one feature layer, and this unit can be iterated to obtain a higher degree of high-level feature fusion. This modular design allows Bi-FPN to be easily integrated into different neural network architectures and flexibly applied to various computer vision tasks, improving the model's capability and generalization. Its network structure is shown in Figure 5. Due to the small batch size, a Group Normalization layer with 32 groups is used instead of the original Batch Normalization to improve the network's accuracy in indoor object detection.
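A minimal sketch of the weighted fusion performed at one GBi-FPN node is shown below, assuming the BiFPN-style fast normalized fusion and the Group Normalization substitution described above; the class name and the SiLU activation are assumptions, and resizing of inputs to a common resolution is presumed to happen beforehand:

```python
import torch
import torch.nn as nn

class WeightedFusionNode(nn.Module):
    """One BiFPN-style fusion node: fast normalized weighted sum of feature
    maps that already share the same resolution and 256 channels."""
    def __init__(self, n_inputs, channels=256, gn_groups=32, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))   # one learnable weight per input
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.norm = nn.GroupNorm(gn_groups, channels)  # GN instead of BN (small batch size)
        self.act = nn.SiLU()

    def forward(self, feats):
        w = torch.relu(self.w)
        w = w / (w.sum() + self.eps)                  # fast normalized fusion weights
        fused = sum(wi * fi for wi, fi in zip(w, feats))
        return self.act(self.norm(self.conv(fused)))
```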
2.6. SIoU Loss
In the original DINO model, the bounding box regression loss uses GIoU Loss to constrain the predicted bounding boxes. GIoU Loss accounts for the overlap, symmetry, and scale discrepancies between the ground-truth and predicted boxes, improving the stability and efficacy of training. Nonetheless, GIoU struggles to differentiate the relative placement of the two boxes when the ground-truth box fully contains the predicted box or vice versa. To mitigate these limitations, SIoU Loss is adopted as the loss function for the model in this research [39]. SIoU redefines the penalty metrics and accounts for the angle of the vector between the desired regressions when measuring the discrepancy between bounding boxes. SIoU Loss comprises four cost functions: angle loss, distance loss, shape loss, and IoU loss, with the goal of substantially increasing detection accuracy. The angle loss and distance loss penalize, respectively, the angular and positional differences between the two boxes; the shape loss penalizes the shape disparity between the predicted and ground-truth boxes; and the IoU loss quantifies their overlap. By jointly optimizing these four terms, SIoU Loss improves the model's stability and robustness while ensuring accuracy. When calculating the angle loss, an LF component is introduced; taking this angle loss into account, the distance loss is redefined, the shape loss and IoU loss are defined accordingly, and the final SIoU Loss combines the four terms.
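Restated following the original SIoU formulation [39] (the notation below is taken from that work and is assumed to correspond to the displayed equations referenced above), the components are:

$$\Lambda = 1 - 2\sin^2\!\left(\arcsin\!\left(\frac{c_h}{\sigma}\right) - \frac{\pi}{4}\right),$$

where $\sigma$ is the distance between the centers of the ground-truth and predicted boxes and $c_h$ is their vertical offset;

$$\Delta = \sum_{t \in \{x, y\}} \left(1 - e^{-\gamma \rho_t}\right), \qquad \gamma = 2 - \Lambda,$$

where $\rho_x$ and $\rho_y$ are the squared center offsets along each axis normalized by the enclosing box dimensions;

$$\Omega = \sum_{t \in \{w, h\}} \left(1 - e^{-\omega_t}\right)^{\theta}, \qquad \omega_w = \frac{|w - w^{gt}|}{\max(w, w^{gt})}, \quad \omega_h = \frac{|h - h^{gt}|}{\max(h, h^{gt})};$$

$$\mathrm{IoU} = \frac{\left|B \cap B^{gt}\right|}{\left|B \cup B^{gt}\right|};$$

$$L_{SIoU} = 1 - \mathrm{IoU} + \frac{\Delta + \Omega}{2}.$$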
Compared to GIoU Loss, SIoU Loss not only considers the overlap degree between detection boxes but also takes into account the distance between them, thereby providing a more accurate evaluation of the overlap degree between object detection boxes. Moreover, for smaller targets or situations where targets overlap significantly, SIoU Loss can better distinguish between different detection boxes, thereby enhancing the model’s stability and robustness while ensuring accuracy.
3. Results
The experimental hardware platform used an Intel(R) Xeon(R) W-2245 CPU @ 3.90 GHz, 64 GB of memory, and an NVIDIA T5000 GPU with 24 GB of VRAM, running a 64-bit Windows 10 environment. The model was implemented with Python 3.7, PyTorch 1.9.0, and CUDA 11.1, with the environment configured in the PyCharm IDE. Both the original and improved versions of the model used the same hyperparameters. The learning rate (lr) was set to 0.0001, the batch size to 2, and an lr scheduler was used, balancing training efficacy and device performance. With a hidden feature dimension (hidden dim) of 256, the model employed six Transformer encoder layers and six Transformer decoder layers. The AdamW optimizer was chosen, with a weight decay of 0.0001 and a training iteration count (epochs) of 15. Additionally, to transfer the pre-trained parameters to the enhanced network and speed up convergence, transfer learning was used to load the pre-trained weights of the Res2Net backbone before training.
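A sketch of the corresponding optimizer setup follows; the stand-in model, the scheduler milestone, and the training helper are illustrative assumptions, with only the stated values (lr 1e-4, weight decay 1e-4, batch size 2, 15 epochs, hidden dim 256) taken from the text:

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 256)   # stand-in for the I-DINO network built elsewhere
num_epochs, batch_size, hidden_dim = 15, 2, 256

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
# an lr scheduler is stated to be used; the milestone below is a placeholder
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[11], gamma=0.1)

for epoch in range(num_epochs):
    # train_one_epoch(model, optimizer, train_loader)  # assumed helper
    scheduler.step()
```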
3.1. Evaluation Metrics
In object detection, common evaluation metrics for assessing model generalization capabilities include precision, recall, average precision (AP), and mean average precision (mAP). The formulas for calculating precision and recall are as follows:
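Restating the standard definitions,

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}.$$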
TP denotes positive samples that are correctly detected; FP denotes negative samples that are mistakenly predicted as positive; and FN denotes positive samples that are missed. The average precision (AP) of the model jointly accounts for detection precision and recall in a comprehensive way.
mAP is used to quantify the average accuracy across all classes, and the formula to compute it is as follows:
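Under the standard definitions (restated here), AP integrates precision over recall for a single class, and mAP averages AP over the $N$ classes:

$$AP = \int_0^1 P(R)\, dR, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i.$$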
3.2. Model Performance Analysis
First, we investigate the convergence behavior of the DINO-based indoor object detection model on the Indoor-COCO dataset during training. Figure 6 illustrates the mAP values of different models on the training set over the course of training. As shown, the performance of our model exhibits an upward trend throughout training. Due to the use of pre-trained weights, both the increase in mAP and the decrease in the loss value slow down after the 13th epoch. The model reaches its best performance at the 15th epoch, at which point it has converged. Compared with the other models, our model demonstrates the best performance.
3.3. Comparative Experiment
To analyze the actual performance of the DINO-based indoor object detection model on the Indoor-COCO dataset, a comparative experiment was conducted against classic object detectors: the anchor-based two-stage networks Faster R-CNN [10], Grid R-CNN [40], Sparse R-CNN [41], Libra R-CNN [42], and Cascade R-CNN [13]; the one-stage networks YOLOv3 [16] and YOLOv8s; the anchor-free YOLOX [18]; and the DETR-based Deformable DETR [43]. All models in this experiment used pre-trained weights, with Faster R-CNN, Grid R-CNN, Sparse R-CNN, Libra R-CNN, and Cascade R-CNN incorporating the FPN module, and Cascade R-CNN additionally replacing its backbone network and adding the DCNv2 deformable convolution module. The specific results of the comparative experiment are shown in Table 1.
The results indicate that the improved DINO achieved an mAP of 62.3%, outperforming the other object detection models. Faster R-CNN performed well at an IoU threshold of 0.5, with an mAP of 73.4% after 12 epochs. Cascade R-CNN, after replacing its backbone with Res2Net101 and adding the DCNv2 deformable convolution module, achieved an mAP of 56.3% after nine epochs, showing a faster convergence rate. During training, YOLOv8s reported the loss value every 10 iterations over a total of 100 iterations and ultimately attained an mAP of 60.7%. The improved DINO's Res2Net101 backbone outperformed the original DINO's ResNet50 by 2.8 percentage points in mAP, demonstrating that the choice of backbone network and its depth have a significant impact on accuracy. The comparative experiment shows that the algorithm improved in this study performs well in mAP, mAP_50, and mAP_75, enhancing the recognition accuracy of indoor objects against complex backgrounds.
3.4. Ablation Experiment
Ablation experiments were carried out to confirm the efficacy of each module in the model.
Table 2 displays the results. Group 1 is the original DINO model, with an mAP of 57.1%, mAP_50 of 76.1%, and mAP_75 of 62.8%. Group 2 replaces the original backbone with the improved Res2Net, yielding gains of 2.8, 3.6, and 2.6 percentage points in mAP, mAP_50, and mAP_75, respectively. Group 3 builds on Group 2 by incorporating the GBi-FPN module to enhance the original feature fusion module, integrating deeper features with the existing shallow features; this also brought a slight improvement in accuracy at lower IoU thresholds, with mAP increasing by 0.8 percentage points. Group 4, building on Group 3, substituted the SIoU Loss for the original loss function, which takes into account the vector angle between the required regressions and redefines the penalty criteria to assess box overlap more precisely, yielding a 0.5 percentage point increase in mAP. Group 5, building on Group 4, incorporated a deformable attention module into the backbone network; by mitigating the influence of non-essential features, it improved the extraction of salient features, resulting in a 1.1 percentage point increase in mAP relative to Group 4, despite limited gains at lower IoU thresholds. In summary, all four enhanced modules of DINO improved the performance of indoor object detection in practical scenarios, confirming the efficacy of the feature extraction network, feature fusion module, loss function, and attention mechanism employed in this study for indoor object recognition amidst complex backgrounds.
3.5. Visualization Results
The comparison of object detection performance between the original DINO and the improved DINO is illustrated in Figure 7. The original DINO model achieved an accuracy of only 57.1% for bounding box predictions. By contrasting the bounding boxes drawn by the different models in the figure, it is evident that the original DINO was prone to false positives and false negatives when detecting objects with high similarity, whereas the improved DINO used in this study demonstrated superior detection performance. The visualization results clearly show that the proposed method significantly enhances the model's performance under complex lighting and occlusion.
In addition to the successful detection cases mentioned above, it is also crucial to examine the limitations of our proposed method. One of the primary issues observed in our model is the misdetection or missed detection of objects due to severe occlusion.
Figure 8 illustrates a scenario in which a sofa is not detected because it is heavily obstructed by other furniture. The occlusion reduces the visibility of the sofa, causing the model to miss it. This failure can be attributed to the occlusion preventing the model from extracting effective features of the sofa, so the feature representation of the occluded object cannot be learned. This highlights the need for further research into occlusion handling mechanisms, such as improved attention modules or occlusion-aware data augmentation techniques, to enhance the model's robustness to occlusion.
Additionally, Figure 9 presents the normalized confusion matrix of the classification rates for all indoor object categories obtained by I-DINO. As anticipated, I-DINO achieved the highest classification rates for the majority categories such as Toilet, TV, Refrigerator, and Microwave. Nevertheless, it also obtained commendable classification rates for the minority categories such as Sink, Chair, Couch, and Bed. These results demonstrate the robustness of the method in addressing class imbalance.
4. Conclusions
This study addresses the challenges of indoor object detection against complex backgrounds, including diverse targets, significant lighting variations, occlusions, and extreme scale changes. The research builds upon the DINO model, incorporating a deeper multi-scale backbone network, Res2Net, and introducing the GBi-FPN feature fusion module to strengthen the connection between low-level and high-level feature information, effectively utilizing multi-scale features and mitigating the issue of large scale variations in indoor objects. To counteract the impact of occlusion on detection accuracy, a deformable attention mechanism was employed to reduce the influence of secondary feature information and enhance the capture of important features. Ablation experiments and comparative experiments with other classical algorithms on the Indoor-COCO dataset demonstrated that the improved model outperformed the original model in all aspects, achieving a favorable recognition effect with an mAP of 62.3%.
The field of computer vision research on indoor item identification is growing, with important implications for applications such as security monitoring, smart homes, and assistive robotics. The mAP of 62.3% indicates robust detection capabilities, but it also highlights areas for further improvement, especially in scenarios with high occlusion or small object detection. Moreover, the practicality of our model is not only defined by its accuracy but also by its computational efficiency. We have optimized our model to balance accuracy and speed, achieving a reasonable inference time that is crucial for real-time applications. However, the computational cost is a factor that may limit the model’s applicability in resource-constrained environments. Future work will focus on enhancing the model’s efficiency without compromising its accuracy, potentially through lightweight model variants or optimized inference engines. Additionally, the introduction of a masking mechanism into the model to generate mask matrices for filtering out irrelevant features, while extending the model to instance segmentation tasks, will aim to improve both model efficiency and accuracy without compromising object recognition precision. Furthermore, enriching the model’s recognition capabilities and delving into the precise identification of indoor objects in various complex scenarios will further enhance the model’s practical application value and performance.