Article

I-DINO: High-Quality Object Detection for Indoor Scenes

1 School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
2 Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin 150028, China
3 Postdoctoral Research Workstation of Northeast Asia Service Outsourcing Research Center, Harbin University of Commerce, Harbin 150028, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(22), 4419; https://doi.org/10.3390/electronics13224419
Submission received: 14 October 2024 / Revised: 6 November 2024 / Accepted: 8 November 2024 / Published: 11 November 2024

Abstract

Object Detection in Complex Indoor Scenes is designed to identify and categorize objects in indoor settings, with applications in areas such as smart homes, security surveillance, and home service robots. It forms the basis for advanced visual tasks including visual question answering, video description generation, and instance segmentation. Nonetheless, the task faces substantial hurdles due to background clutter, overlapping objects, and significant size differences. To tackle these challenges, this study introduces an indoor object detection approach utilizing an enhanced DINO framework. To cater to the needs of indoor object detection, an Indoor-COCO dataset was developed from the COCO object detection dataset. The model incorporates an advanced Res2Net as the backbone feature extraction network, complemented by a deformable attention mechanism to better capture detailed object features. An upgraded Bi-FPN module is employed to replace the conventional feature fusion module, and SIoU loss is utilized to expedite convergence. The experimental outcomes indicate that the refined model attains an mAP of 62.3%, marking a 5.2% improvement over the baseline model. These findings illustrate that the DINO-based indoor object detection model exhibits robust generalization abilities and practical utility for multi-scale object detection in complex environments.

1. Introduction

Object detection plays a pivotal role in the field of computer vision and underpins advanced visual tasks like behavior comprehension, scene interpretation, and visual question answering. This technology is extensively utilized in diverse applications ranging from everyday safety, robot navigation, and video surveillance to scene detection and aerospace. Despite its widespread use, object detection continues to pose difficulties due to factors including object occlusion, substantial scale differences, and background clutter. The pursuit of robust and efficient object detection methods carries substantial practical significance.
Indoor scene object detection is an important field within object detection [1], with broad application prospects in smart home devices, indoor service robots, and security monitors [2]. Indoor complex scenes refer to indoor environments with complex background interference, object occlusion, and large-scale variations [3]. Compared to traditional outdoor scene object detection, indoor complex scene object detection is more difficult, making it a cutting-edge research topic.
Two primary phases of research have been conducted on indoor object detection: the first used hand-crafted feature-based algorithms, and the second involved deep learning-based methods. Techniques that rely on hand-crafted features frequently make use of background data, such as texture, color, and shape, to create shallow features that are then used for object detection and localization. These methods have the advantage of strong interpretability. To address these challenges, researchers have conducted extensive work in object detection. Traditional object detection algorithms focused primarily on feature design, with researchers both domestically and internationally proposing a series of excellent feature operators, such as SIFT [4], HOG [5], and LBP [6]. These manually designed features achieved good results in certain specific scenarios but lacked universality and offered limited representational power, making it difficult to describe objects with high-level semantic information; they are therefore not applicable to all complex scenes.
Deep learning-based techniques have improved detection outcomes recently by using convolutional neural networks to extract features, which are better at expressing semantic information. These systems may be roughly categorized into two types: one-stage algorithms based on the regression concept and two-stage algorithms based on the candidate region idea, depending on whether the detection procedure includes the candidate region proposal phase. Two-stage algorithms include typical methods such as R-CNN, Fast-RCNN, SPP-Net, Faster-RCNN, R-FCN, Mask-RCNN, and Cascade-RCNN [7,8,9,10,11,12,13]. One-stage algorithms directly produce class probabilities and position coordinates of objects without generating candidate regions, with typical methods including YOLOv1, YOLOv2, YOLOv3, YOLOv4, YOLOX, YOLOv7, YOLOv9, SSD, DSSD, and RetinaNet [14,15,16,17,18,19,20,21,22,23]. Convolutional neural network-based object detection models have strong fitting and generalization capabilities and have been deeply applied in indoor object recognition tasks. Saurabh Gupta et al. [24] proposed an algorithm based on the gPb-ucm method, which improved object boundary detection and hierarchical segmentation using depth information and used RGB-D images for indoor scene understanding. Oliver Mattausch et al. [25] proposed an automatic segmentation method for indoor scenes using spectral clustering to identify object categories for the automatic detection and classification of objects in large-scale indoor scan point clouds. Georgios Georgakis et al. [26] proposed an automated method to generate synthetic training data, considering both geometric and semantic information, aiming to address the object detection problem for service robots in indoor environments. XiaoYu Yao [27] proposed an optimization algorithm based on the regularity of indoor scene object attributes, combining prior information with detection results, and utilizing the ability of recurrent convolutional neural networks (RCNNs) to process sequential data to extract global image information and improve the accuracy of indoor object detection. JunJie Chen [28] proposed an indoor complex scene object detection model based on cross-attention mechanisms, together with a sample-adaptive mixed data augmentation method based on classification confidence and a few-shot indoor complex scene object detection algorithm, improving the performance of object detection in indoor complex scenes. Employing an advanced dual-threshold non-maximum suppression (DT-NMS) algorithm to address occlusion problems in indoor environments, Ni et al. [29] introduced an SSD-based object detection method that enhances indoor object recognition at the regional concept stage; the MCIE module is utilized to gather contextually significant information from indoor settings. Gladis et al. [30] proposed the Indoor-Outdoor YOLO Glass network (In-Out YOLO), a video-based object detection system with an adaptive spatial pyramid pooling-based Squeeze and Attention Block in YOLO, to assist visually impaired individuals in navigating and identifying objects both indoors and outdoors, addressing the challenges of low-power wearable devices and improving their independence and quality of life.
In recent years, the detection transformer (DETR) architecture based on the Transformer structure [31] has become a new paradigm for object detection. These models abandon performance-limiting handcrafted modules and rely on attention mechanisms to match queries with features, treating object detection as an end-to-end set prediction task. They adopt a hybrid structure of Transformer and convolutional networks, viewing object detection as a set prediction problem, decoupling the relationship between position and prediction, simplifying the entire detection procedure, eliminating operations such as NMS that affect detection performance, and demonstrating strong potential in tasks involving complex scales and dense objects due to their better global modeling capabilities. He et al. [32] separated cross-attention into two separate branches, initializing content queries in the decoder with a mini-detector, improving content queries in the early stages of training by predicting classification and regression embeddings, and introducing a pairwise self-attention mechanism in the decoder to consider spatially adjacent object query pairs and exploit their spatial context information. Zhang et al. [33] proposed the DINO model based on Deformable DETR. DINO employs dynamic anchor boxes (DABs) and denoising training (DN) techniques, using deformable attention mechanisms to improve computational efficiency. By simultaneously presenting the model with positive samples (small noise) and negative samples (large noise) of the same real target, the model is helped to distinguish targets and avoid repeated detection of the same target. DINO combines the localization queries from the encoder output with learnable content-rich queries to improve the initialization of anchor boxes, and uses refined box information from later layers to optimize the parameters of adjacent earlier layers, improving the accuracy of box prediction.
However, there are still some difficulties in practical applications:
  • Indoor environments often have more complex backgrounds and diverse targets. Frequently, due to factors like lighting and occlusion, the shapes and sizes of the same type of objects vary greatly, leading to a reduction in detection accuracy.
  • Obtaining annotated data for indoor object detection is inherently challenging due to the complexity of indoor scenes. The annotation cost is exacerbated by the diverse and cluttered nature of indoor environments, where objects can be partially or fully occluded, and backgrounds can vary significantly. Furthermore, the frequency of occurrence of certain objects within these environments often skews, leading to class imbalance. This imbalance can cause models to favor the detection of frequently occurring objects during training, thereby neglecting less common objects and resulting in lower detection accuracy for rare instances. The need for precise localization and the high variability in object arrangements contribute to the difficulty and cost of annotation, making it more challenging compared to other scenarios where objects are more uniformly distributed and less occluded.
  • Indoor environments exhibit a wide range of object scales, posing a significant challenge for object detection models. This scale variation is due to the presence of both small, everyday items such as keys and phones, and large pieces of furniture like sofas and bookshelves. The size difference between these objects can span several orders of magnitude, which complicates the detection process. Small objects require high-resolution detection capabilities, while large objects demand the model to maintain accuracy across a broader spatial context. This variability in scale is more pronounced in indoor settings compared to outdoor scenes, where objects tend to be more uniformly distributed in size and perspective.
Such scale variation makes it difficult for a single detection model to handle objects of all scales effectively at the same time. Therefore, achieving robust and efficient object detection in complex indoor scenes is of paramount importance.
To tackle these challenges, this study introduces a novel object detection methodology tailored for indoor complex environments. Unlike existing methods, our approach leverages an enhanced DINO model, which employs an advanced Res2Net as the backbone for feature extraction. This backbone is integrated with deformable attention mechanisms to capture salient feature information more effectively, aiding in the differentiation of similar class objects from their backgrounds. Additionally, our method incorporates an improved feature pyramid network, GBi-FPN, in the neck to facilitate more accurate recognition after the fusion of deeper features. To expedite convergence, we utilize SIoU loss, which enables precise identification of indoor objects amidst complex backgrounds. This research offers a practical solution for indoor complex scene object detection applications, including smart homes, indoor service robots, and security surveillance systems, demonstrating significant improvements over current methods in terms of accuracy, efficiency, and robustness.

2. Materials and Methods

2.1. Datasets

The currently available indoor scene object detection datasets, such as NYU-Depth V2 and SUN RGB-D, primarily include depth information and are designed for three-dimensional reconstruction tasks, making them unsuitable for our research focus on two-dimensional objects. NYU-Depth V2, with its limited 1449 images and only six distinct object categories, severely restricts the generalization capabilities of object detection models across diverse indoor settings. SUN RGB-D, while offering a more extensive range of 32 object categories, suffers from a significant imbalance in the distribution of these categories across its 10,345 images, leading to model biases towards frequently occurring objects and compromising detection accuracy for less common ones. Even datasets like GMU Kitchen Scenes, which can be used for two-dimensional object detection, are captured with depth cameras and annotated by selecting frames from video clips; most of the annotated images originate from the same video segment, making them highly similar to one another. Additionally, these datasets suffer from problems such as limited scenes, single shooting angles, and insufficient categories and quantities of objects, all of which fail to meet the requirements of our indoor complex scene object detection task. Therefore, it is necessary to construct a high-quality dataset with rich categories and ample sample sizes.
The COCO object detection dataset contains 80 different classes of objects and over 200,000 labeled images, making it the most widely used dataset for object detection. The COCO dataset includes common indoor objects such as furniture, appliances, and people, and encompasses complex scenes. Therefore, we constructed a dataset suitable for our research application scenarios based on the COCO dataset. As shown in Figure 1, we built an Indoor Complex Scene Object Detection Dataset (referred to as Indoor-COCO hereafter) based on the COCO object detection dataset. The dataset construction process is as follows: we programmatically and manually selected complex indoor scene images from the COCO object detection dataset that met conditions such as complex backgrounds, significant object occlusion, and large-scale variations. This resulted in a dataset containing 11,302 images and eight categories. The table displays the statistical distribution of the dataset's categories. The training and test sets were divided using an 8:2 ratio through random allocation, resulting in 9041 images for the training set and 2261 images for the test set. The dataset we developed meets the criteria for indoor complex scene object recognition tasks, as it encompasses a wide range of categories, boasts a substantial number of samples, features intricate scenes, and includes diverse viewing angles.
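To make the programmatic part of this selection concrete, the sketch below shows how candidate indoor images could be filtered from the COCO annotations with pycocotools. The eight category names follow those reported for Indoor-COCO later in this paper, while the `min_objects` clutter heuristic and the function name are illustrative assumptions, not the exact selection rules used to build the dataset; manual screening for background complexity and occlusion would still be applied afterwards.

```python
# Minimal sketch: selecting indoor-scene candidates from COCO annotations with pycocotools.
# The clutter heuristic (min_objects) is an illustrative assumption, not the paper's rule.
from pycocotools.coco import COCO

INDOOR_CATS = ["chair", "couch", "bed", "toilet", "tv", "sink", "refrigerator", "microwave"]

def select_indoor_images(ann_file, min_objects=3):
    coco = COCO(ann_file)
    cat_ids = coco.getCatIds(catNms=INDOOR_CATS)

    # collect every image that contains at least one of the indoor categories
    img_ids = set()
    for cat_id in cat_ids:
        img_ids.update(coco.getImgIds(catIds=[cat_id]))

    selected = []
    for img_id in img_ids:
        ann_ids = coco.getAnnIds(imgIds=[img_id], catIds=cat_ids, iscrowd=None)
        anns = coco.loadAnns(ann_ids)
        # keep images with several indoor objects as a rough proxy for cluttered scenes
        if len(anns) >= min_objects:
            selected.append(img_id)
    return coco, selected
```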

2.2. Indoor Object Detection Model Based on DINO

The design of the I-DINO (Indoor-DINO) indoor object detection model, as illustrated in Figure 2, consists of a backbone network, a neck, multiple Transformer encoder and decoder layers, and several prediction heads. The input image undergoes feature extraction through an advanced Res2Net backbone network, which enhances feature extraction by incorporating multi-scale receptive fields, crucial for detecting objects of varying sizes within cluttered indoor scenes. This backbone network, with increased depth, altered convolution and normalization layers, and integrated pre-trained weights, extracts four feature maps at different scales. To combat occlusions and capture detailed features of objects, especially in complex indoor backgrounds, the backbone network incorporates a deformable attention module. This mechanism adaptively adjusts the attention window to align with object contours, thereby improving the model's ability to handle occlusions and complex backgrounds. The extracted feature maps are then fed into the neck feature fusion module, GBi-FPN (Grid Bi-directional Feature Pyramid Network). This improved feature fusion module integrates features from different scales, providing a comprehensive representation that mitigates the impact of scale variations in indoor objects. The semantic information enhanced through feature fusion is crucial for accurate object detection in diverse indoor environments. Subsequently, the fused feature maps and positional information are input into the deformable Transformer's encoder and decoder. The encoder processes the feature maps to capture contextual information, while the decoder refines the features to generate precise object proposals. Finally, image features extracted near the reference points are sent to multiple prediction heads to complete the end-to-end recognition process. To further improve the accuracy of bounding box predictions, especially in scenarios with significant object overlap, the model employs the SIoU loss function. This loss function considers not only the overlap area but also the shape and orientation of the boxes, ensuring more accurate localization and detection of objects in complex indoor scenes.
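As a structural summary of the pipeline described above, the following sketch outlines the I-DINO forward pass at a high level; the module classes, their names, and their interfaces are placeholders for illustration and do not represent the authors' implementation.

```python
# High-level sketch of the I-DINO forward pass; all submodules are assumed to exist
# elsewhere and are treated here as opaque callables for illustration only.
import torch.nn as nn

class IDINO(nn.Module):
    def __init__(self, backbone, neck, transformer, heads):
        super().__init__()
        self.backbone = backbone        # improved Res2Net101 with deformable attention
        self.neck = neck                # GBi-FPN feature fusion
        self.transformer = transformer  # deformable Transformer encoder/decoder
        self.heads = nn.ModuleList(heads)  # classification and box prediction heads

    def forward(self, images):
        feats = self.backbone(images)       # four multi-scale feature maps
        fused = self.neck(feats)            # cross-scale fused features
        queries = self.transformer(fused)   # refined object queries / reference points
        return [head(queries) for head in self.heads]
```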

2.3. Res2Net Network

Res2Net is a multi-scale backbone network architecture first proposed by Gao et al. [34], originating from an improvement on the ResNet structure. Its multi-scale nature stems from multiple receptive fields at a finer granularity level, distinguishing it from methods that increase multi-scale capacity by utilizing different resolutions. As illustrated in Figure 3, Res2Net divides the 3 × 3 filters in the original ResNet Bottleneck Block structure into n smaller filter groups, forming the Bottle2neck Block structure. It connects the different filter groups using a residual block-like structure, allowing features extracted by each group to be passed on to the next, and finally concatenates all groups' feature maps and sends them through a 1 × 1 filter for complete feature fusion. The combinatorial explosion effect generated by this structure enables the Res2Net architecture to include multi-scale receptive fields. Res2Net was chosen over other architectures like EfficientNet or DenseNet due to its superior ability to capture multi-scale features. Res2Net's architecture, which employs a series of residual blocks with increasing depths and parallel branches, allows it to capture a richer set of features at various scales. This is particularly beneficial for indoor scenes where objects can vary significantly in size and complexity. EfficientNet, while efficient, focuses more on a balanced scaling of depth, width, and resolution, which may not be as effective for capturing the fine-grained details necessary for indoor object detection. DenseNet, with its dense connectivity pattern, can be computationally expensive and may suffer from diminishing returns in feature propagation, making it less suitable for our resource-constrained application. Depending on the number of layers, Res2Net has multiple structures. To ensure model depth without significantly increasing computational load, this paper selects Res2Net101 as the foundational model for the backbone network. Due to the small batch size, Group Normalization [35] is used to replace the original Batch Normalization, and the existing convolution layer Conv is replaced with ConvAWS [36], thereby enhancing the network's accuracy in indoor object recognition and reducing model computational overhead.
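A minimal PyTorch sketch of a Bottle2neck-style block with Group Normalization is given below, assuming a scale of four filter groups; the weight-standardized ConvAWS convolution used in the paper is replaced here by a plain convolution for brevity, and channel counts are assumed divisible by both the scale and the group-norm group count.

```python
# Sketch of a Res2Net-style Bottle2neck block with Group Normalization (scale = 4).
# A plain nn.Conv2d stands in for ConvAWS; this is an illustrative sketch, not the
# exact block used in I-DINO.
import torch
import torch.nn as nn

class Bottle2neck(nn.Module):
    def __init__(self, in_ch, mid_ch=256, scale=4, gn_groups=32):
        super().__init__()
        self.scale = scale
        width = mid_ch // scale
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)
        self.gn1 = nn.GroupNorm(gn_groups, mid_ch)
        # one 3x3 conv per split except the first, chained hierarchically
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1, bias=False) for _ in range(scale - 1))
        self.gns = nn.ModuleList(nn.GroupNorm(gn_groups, width) for _ in range(scale - 1))
        self.conv3 = nn.Conv2d(mid_ch, in_ch, 1, bias=False)
        self.gn3 = nn.GroupNorm(gn_groups, in_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.gn1(self.conv1(x)))
        splits = torch.chunk(out, self.scale, dim=1)   # split channels into filter groups
        ys, prev = [splits[0]], None
        for i, (conv, gn) in enumerate(zip(self.convs, self.gns)):
            # each group receives the previous group's output, widening the receptive field
            s = splits[i + 1] if prev is None else splits[i + 1] + prev
            prev = self.relu(gn(conv(s)))
            ys.append(prev)
        out = torch.cat(ys, dim=1)                         # multi-scale concatenation
        return self.relu(self.gn3(self.conv3(out)) + x)    # 1x1 fusion + residual connection
```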

2.4. Deformable Attention Module

In the task of indoor object target recognition against complex backgrounds, objects within indoor environments may be partially or completely occluded by other objects, making it difficult to extract features from the occluded parts. Moreover, indoor backgrounds can be highly complex, containing various textures and colors, which may confuse object features and make distinguishing target objects from the background more challenging, thereby increasing the difficulty of detection. The Res2Net network demonstrates excellent performance in recognition and classification domains. However, for indoor object target recognition in complex scenes, subtle differences in features hinder the network’s ability to retrieve more detailed pixel information, which can degrade the performance of the backbone network. Additionally, DINO utilizes learnable vectors in four dimensions (x, y, w, h), emphasizing spatial information, whereas the convolution operations in the Res2Net network are confined to the plane, thereby losing spatial feature information.
To address these two issues, this study incorporates a deformable attention mechanism [37] into the Bottle2neck block of the network. The deformable attention mechanism can adaptively adjust the shape and size of the attention window to better match the actual shape of the target object, particularly for irregular or tilted objects. By focusing on local features around the target, deformable attention provides more precise feature representations, thereby enhancing detection accuracy. Given its ability to adapt to geometric variations, the deformable attention mechanism is more robust in handling changes such as occlusion, rotation, and scaling. Compared to traditional global attention mechanisms, deformable attention only computes at key sampling points, reducing unnecessary computational load and improving model runtime speed. Considering model computational capacity, the Deformable Attention is added after the ConvAWS layer in the Bottle2neck block, allowing the deformable attention to dynamically adjust the positions of attention points based on input features, thereby enhancing the model’s receptive field and enabling it to better capture complex structures and details in the image. The deformable attention algorithm is shown in Equation (1):
$$\mathrm{DeformableAttention}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mqk} \, W'_m \, x\left(p_q + \Delta p_{mqk}\right) \right] \tag{1}$$
where $z_q$ is the query feature, $p_q$ is the query element's two-dimensional reference point, $x \in \mathbb{R}^{C \times H \times W}$ is the input feature map from which values are sampled, $m$ indexes the attention head, and $A_{mqk}$ and $\Delta p_{mqk}$ are the attention weight and sampling offset, respectively, of the $k$-th sampling point in the $m$-th attention head. The learned offsets and attention weights improve the local perception capability of the network. Figure 4 depicts the updated backbone network architecture.
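The sketch below illustrates how Equation (1) can be computed with bilinear sampling for a single feature level; the tensor layout, the projection layers, and the treatment of offsets in normalized coordinates are illustrative assumptions rather than the exact module integrated into the Bottle2neck block.

```python
# Sketch of deformable attention (Equation (1)) using F.grid_sample for bilinear
# sampling at offset reference points; shapes and projections are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttention(nn.Module):
    def __init__(self, dim=256, n_heads=8, n_points=4):
        super().__init__()
        self.n_heads, self.n_points, self.head_dim = n_heads, n_points, dim // n_heads
        self.offsets = nn.Linear(dim, n_heads * n_points * 2)  # Delta p_mqk
        self.weights = nn.Linear(dim, n_heads * n_points)      # A_mqk (pre-softmax)
        self.value_proj = nn.Linear(dim, dim)                  # W'_m
        self.out_proj = nn.Linear(dim, dim)                    # W_m

    def forward(self, query, ref_points, feat):
        # query: (B, Q, C); ref_points: (B, Q, 2) in [0, 1]; feat: (B, C, H, W)
        B, Q, _ = query.shape
        value = self.value_proj(feat.flatten(2).transpose(1, 2))   # (B, HW, C)
        value = value.transpose(1, 2).reshape(B * self.n_heads, self.head_dim, *feat.shape[-2:])
        offsets = self.offsets(query).view(B, Q, self.n_heads, self.n_points, 2)
        attn = self.weights(query).view(B, Q, self.n_heads, self.n_points).softmax(-1)
        # sampling locations p_q + Delta p_mqk, mapped to grid_sample's [-1, 1] range
        loc = (ref_points[:, :, None, None, :] + offsets).clamp(0, 1) * 2 - 1
        loc = loc.permute(0, 2, 1, 3, 4).reshape(B * self.n_heads, Q, self.n_points, 2)
        sampled = F.grid_sample(value, loc, align_corners=False)   # (B*M, head_dim, Q, K)
        attn = attn.permute(0, 2, 1, 3).reshape(B * self.n_heads, 1, Q, self.n_points)
        out = (sampled * attn).sum(-1)                             # weighted sum over K points
        out = out.reshape(B, self.n_heads * self.head_dim, Q).transpose(1, 2)
        return self.out_proj(out)                                  # (B, Q, C)
```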

2.5. GBi-FPN

In the context of indoor object detection tasks, the size and viewing angle of indoor objects can vary significantly, ranging from small items to large furniture, and observing the same object from different angles. This scale variation makes it challenging for a single detection model to effectively handle objects of all scales simultaneously. The encoder in DINO utilizes a Deformable Transformer, which, for the four feature maps of different scales produced by the backbone network, employs a 1 × 1 convolution operation with a stride of 1 to uniformly reduce the channels to 256. These feature maps of different scales often contain more information, with shallow features including finer-grained details and pixel-level localization accuracy, while deep features carry more accurate contextual and semantic information. Therefore, fully integrating low-level and high-level features can yield more useful feature representations and reduce the interference of irrelevant information.
Feature Pyramid Networks (FPNs) can address the issue of multi-scale object feature representation. FPN achieves this by downsampling and using 1 × 1 convolutions to adjust the size and number of channels of higher-level feature maps to match those of lower-level ones, followed by fusion with the lower-level features. Since the feature maps at the lower levels, closer to the input, have smaller receptive fields and extract more specific and local features, such as textures and shapes, while the higher-level feature maps have larger receptive fields and extract more abstract semantic information, combining the two can yield richer feature information.
Two essential components of Bidirectional Feature Pyramid Network (Bi-FPN), a new network topology for computer vision tasks including object identification and semantic segmentation, are weighted feature map fusion and effective bidirectional cross-scale connections [38]. This means that Bi-FPN can effectively handle information interaction between features of different scales and perform effective fusion between different feature maps, thereby improving the model’s performance and efficiency. The bidirectional cross-scale connections are implemented through a top-down and bottom-up bidirectional pathway, ensuring that features across multiple scales can be fully integrated and propagated. During this process, Bi-FPN maintains the same feature resolution and adds lateral connections during upsampling and downsampling to effectively combine features of different scales without significantly increasing computational costs.
Additionally, Bi-FPN can serve as a basic unit, with a pair of pathways considered as a feature layer, iterated again to obtain a higher degree of high-level feature fusion. This modular design allows Bi-FPN to be easily integrated into different neural network architectures and flexibly applied to various computer vision tasks, therefore improving the model’s functionality and capacity for generalization. Its network structure is shown in Figure 5. Due to the small batch size, a Group Normalization layer with 32 groups is used instead of the original Batch Normalization to improve the network’s accuracy in indoor object detection.
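The following sketch shows the fast normalized weighted fusion used at a single Bi-FPN node, together with the 32-group Group Normalization mentioned above; the surrounding top-down/bottom-up wiring and feature resizing are omitted, and the class name is hypothetical.

```python
# Minimal sketch of BiFPN-style weighted fusion at one node, with Group Normalization
# in place of Batch Normalization (32 groups), as described for GBi-FPN.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    def __init__(self, n_inputs, channels=256, gn_groups=32, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))  # learnable per-input fusion weights
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.gn = nn.GroupNorm(gn_groups, channels)

    def forward(self, inputs):
        # inputs: list of feature maps already resized to a common resolution and channel count
        w = F.relu(self.w)                   # keep weights non-negative
        w = w / (w.sum() + self.eps)         # fast normalized fusion
        fused = sum(wi * x for wi, x in zip(w, inputs))
        return F.relu(self.gn(self.conv(fused)))
```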

2.6. SIoU Loss

In the initial DINO model, the bounding box regression loss function utilizes GIoU Loss to impose constraints on the predicted bounding boxes. GIoU Loss, in computing the loss for object detection models, can more effectively account for the overlap, symmetry, and scale discrepancies between the actual bounding boxes and the predicted ones, thus enhancing the stability and efficacy of the training process. Nonetheless, GIoU struggles to differentiate the relative placements of the two boxes when the actual box fully contains the predicted box or the other way around. To mitigate these limitations, SIoU loss is adopted as the loss function for the model in this research [39]. The SIoU technique redefines the penalty metrics and accounts for the angle of the vector between the desired regressions when computing the intersection between bounding boxes. SIoU Loss combines four cost functions: angle loss, distance loss, shape loss, and IoU loss, with the goal of greatly increasing object identification accuracy. The angle loss and distance loss penalize the angular and distance differences, respectively, between the two bounding boxes; the shape loss penalizes the shape disparities between the predicted and ground truth bounding boxes; and the IoU loss quantifies the overlap between the predicted bounding boxes and the ground truth. By optimizing these four loss functions, SIoU Loss can improve the model's stability and robustness while ensuring accuracy. When calculating the angle loss, an LF component Λ is introduced and defined as follows:
$$\Lambda = 1 - 2\sin^2\left(\arcsin(x) - \frac{\pi}{4}\right)$$
$$x = \frac{c_h}{\sigma} = \sin\alpha$$
$$\sigma = \sqrt{\left(b_{c_x}^{gt} - b_{c_x}\right)^2 + \left(b_{c_y}^{gt} - b_{c_y}\right)^2}$$
$$c_h = \max\left(b_{c_y}^{gt}, b_{c_y}\right) - \min\left(b_{c_y}^{gt}, b_{c_y}\right)$$
Taking into account the aforementioned angle loss, the distance loss is redefined. The formula for the distance loss is as follows:
$$\Delta = \sum_{t=x,y}\left(1 - e^{-\gamma\rho_t}\right)$$
$$\rho_x = \left(\frac{b_{c_x}^{gt} - b_{c_x}}{c_w}\right)^2,\quad \rho_y = \left(\frac{b_{c_y}^{gt} - b_{c_y}}{c_h}\right)^2,\quad \gamma = 2 - \Lambda$$
The definition of the shape loss is as follows:
$$\Omega = \sum_{t=w,h}\left(1 - e^{-\omega_t}\right)^{\theta}$$
$$\omega_w = \frac{\left|w - w^{gt}\right|}{\max\left(w, w^{gt}\right)},\quad \omega_h = \frac{\left|h - h^{gt}\right|}{\max\left(h, h^{gt}\right)}$$
The definition of the IoU loss is as follows:
$$IoU = \frac{\left|B \cap B^{GT}\right|}{\left|B \cup B^{GT}\right|}$$
Finally, the SIoU Loss formula is as follows:
$$L_{box} = 1 - IoU + \frac{\Delta + \Omega}{2}$$
Compared to GIoU Loss, SIoU Loss not only considers the overlap degree between detection boxes but also takes into account the distance between them, thereby providing a more accurate evaluation of the overlap degree between object detection boxes. Moreover, for smaller targets or situations where targets overlap significantly, SIoU Loss can better distinguish between different detection boxes, thereby enhancing the model’s stability and robustness while ensuring accuracy.
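For reference, a sketch of the SIoU loss assembled from the angle, distance, shape, and IoU terms above is given below, assuming boxes in (x1, y1, x2, y2) format and taking c_w and c_h in the distance term as the dimensions of the smallest enclosing box, as in the SIoU paper; the epsilon terms and the default θ = 4 are illustrative choices.

```python
# Sketch of the SIoU bounding-box loss; pred and target are (N, 4) tensors in
# (x1, y1, x2, y2) format. Epsilon values and theta=4 are illustrative defaults.
import math
import torch

def siou_loss(pred, target, theta=4, eps=1e-7):
    # widths, heights, and centers
    pw, ph = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    tw, th = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    tcx, tcy = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2

    # IoU term
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter + eps
    iou = inter / union

    # smallest enclosing box (used by the distance term)
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])

    # angle cost: Lambda = 1 - 2*sin^2(arcsin(x) - pi/4), x = vertical center gap / center distance
    sigma = torch.sqrt((tcx - pcx) ** 2 + (tcy - pcy) ** 2) + eps
    sin_alpha = (torch.abs(tcy - pcy) / sigma).clamp(0, 1 - eps)
    angle = 1 - 2 * torch.sin(torch.arcsin(sin_alpha) - math.pi / 4) ** 2

    # distance cost with gamma = 2 - Lambda
    gamma = 2 - angle
    rho_x = ((tcx - pcx) / (cw + eps)) ** 2
    rho_y = ((tcy - pcy) / (ch + eps)) ** 2
    dist = (1 - torch.exp(-gamma * rho_x)) + (1 - torch.exp(-gamma * rho_y))

    # shape cost
    omega_w = torch.abs(pw - tw) / (torch.max(pw, tw) + eps)
    omega_h = torch.abs(ph - th) / (torch.max(ph, th) + eps)
    shape = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    return (1 - iou + (dist + shape) / 2).mean()
```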

3. Results

The experimental hardware platform used an Intel(R) Xeon(R) W-2245 CPU @ 3.90 GHz processor with 64 GB of memory and an NVIDIA T5000 GPU with 24 GB of VRAM, operating under a Windows 10 64-bit system environment. The model in this study relies on Python 3.7 and PyTorch 1.9.0, with CUDA version 11.1, and the PyCharm IDE was used for environment configuration. Both the original and improved versions of the model in the experiments used the same hyperparameters. The learning rate (lr) was set to 0.0001, the batch size was set to 2, and an lr scheduler was used, taking into account both training efficacy and device performance. With 256 as the hidden feature dimension (hidden dim), the model employed six layers of Transformer encoder and six layers of Transformer decoder. The AdamW optimizer was chosen, with the weight decay rate set to 0.0001 and the training iteration count (epoch) set to 15. Additionally, in order to transfer the pre-trained parameters to the enhanced network and speed up the model's convergence, transfer learning was used in the experiments to load the pre-trained weights of the backbone network Res2Net before training.
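A minimal sketch of the training configuration stated above (AdamW, lr = 0.0001, weight decay = 0.0001, batch size 2, 15 epochs) is shown below; the step-based learning-rate schedule and its milestone epoch are illustrative assumptions, since the exact scheduler is not specified in the paper.

```python
# Sketch of the optimizer/scheduler setup matching the stated hyperparameters.
# The MultiStepLR milestone is an assumption; `model` is assumed to be defined elsewhere.
import torch

EPOCHS, BATCH_SIZE = 15, 2

def build_training(model):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
    # drop the learning rate late in training, a common schedule for DETR-style models
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[11], gamma=0.1)
    return optimizer, scheduler
```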

3.1. Evaluation Metrics

In object detection, common evaluation metrics for assessing model generalization capabilities include precision, recall, average precision (AP), and mean average precision (mAP). The formulas for calculating precision and recall are as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
where TP denotes positive samples correctly classified as positive, FP denotes negative samples incorrectly classified as positive, and FN denotes positive samples incorrectly classified as negative. The average precision (AP) of a model comprehensively considers both detection accuracy and recall. mAP is used to quantify the average precision across all classes, and the formulas are as follows:
$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R$$
$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i$$
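The sketch below shows how AP can be computed by integrating the precision-recall curve for a single class and then averaged into mAP; matching detections to ground truth at a chosen IoU threshold (to obtain the TP flags) is assumed to be done beforehand, and the all-point integration here is a plain Riemann sum of P(R) dR.

```python
# Sketch of per-class AP from confidence-sorted detections, and mAP over classes.
# tp_flags[i] is True if detection i matched a ground-truth box at the chosen IoU threshold.
import numpy as np

def average_precision(scores, tp_flags, num_gt):
    order = np.argsort(-scores)                    # sort detections by confidence
    tp_sorted = np.asarray(tp_flags, dtype=bool)[order]
    tp = np.cumsum(tp_sorted)
    fp = np.cumsum(~tp_sorted)
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)                     # Riemann sum of P(R) dR
        prev_r = r
    return ap

def mean_average_precision(per_class_aps):
    return float(np.mean(per_class_aps))
```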

3.2. Model Performance Analysis

First, we investigate the iterative and convergence effects of an indoor object detection model based on DINO on the Indoor-COCO dataset. Figure 6 illustrates the mAP values of different models during training using the training set. As shown, the performance of our model exhibits an upward trend throughout the training process. Due to the utilization of pre-trained weights, both the increase in mAP and the decrease in loss values slow down after the 13th epoch. The model achieves its optimal performance at the 15th epoch, after which it completes its convergence. Through comparisons with different models, our model demonstrates the best performance.

3.3. Comparative Experiment

To analyze the actual performance of the indoor object detection model based on DINO on the Indoor-COCO dataset, a comparative experiment was conducted with classic object detectors: the anchor-based two-stage networks Faster R-CNN [10], Grid RCNN [40], Sparse RCNN [41], Libra RCNN [42], Cascade RCNN [13], the one-stage networks YOLOv3 [16], YOLOv8s, the anchor-free YOLOX [18], and the DETR-based Deformable DETR [43]. All models in this experiment used pre-trained weights, with Faster R-CNN, Grid RCNN, Sparse RCNN, Libra RCNN, and Cascade RCNN incorporating the FPN module, and Cascade RCNN replacing the backbone network and adding the DCNv2 deformable convolution module. The specific results of the comparative experiment are shown in Table 1.
The results indicate that the improved DINO achieved an mAP of 62.3%, outperforming the other object detection models. Faster R-CNN performed well at an IoU threshold of 0.5, with an mAP_50 of 73.4% after 12 epochs. Cascade RCNN, after replacing the backbone network with Res2Net101 and adding the DCNv2 deformable convolution module, achieved an mAP of 56.3% after nine epochs, showing a faster convergence rate. Throughout the training phase, YOLOv8s reported the loss value every 10 iterations, spanning a total of 100 iterations, and ultimately attained an mAP of 60.7%. The improved DINO model's backbone network Res2Net101 outperformed the original DINO's ResNet50 by 2.8 percentage points in mAP, demonstrating that different backbone networks and varying depths of the same backbone network have a significant impact on accuracy. The comparative experiment shows that the algorithm improved in this study performs well in mAP, mAP_50, and mAP_75, achieving an enhancement in the recognition accuracy of indoor objects against complex backgrounds.

3.4. Ablation Experiment

Ablation tests were carried out in order to confirm the efficacy of every module in the model, with the results displayed in Table 2. Group 1 corresponds to the original DINO model, with an mAP of 57.1%, mAP_50 of 76.1%, and mAP_75 of 62.8%. Group 2 indicates the replacement of the original backbone network with the improved Res2Net, resulting in increases of 2.8, 3.6, and 2.6 percentage points in mAP, mAP_50, and mAP_75, respectively. Group 3 builds upon Group 2 by incorporating the GBi-FPN module to enhance the original feature fusion module, integrating deeper features with existing shallow features, which also led to a slight improvement in accuracy at lower IoU thresholds, with the mAP metric increasing by 0.8 percentage points. Group 4, building upon Group 3, substituted the original loss function with the SIoU Loss function, which takes into account the vector angle between necessary regressions and redefines the penalty criteria to more precisely assess the extent of overlap between object detection boxes, yielding a 0.5 percentage point boost in mAP. Group 5, building on Group 4, incorporated a deformable attention module into the backbone network, which, by mitigating the influence of non-essential features, improved the capability of extracting salient features, resulting in a 1.1 percentage point increase in mAP relative to Group 4, despite limited enhancement at lower IoU thresholds. In summary, all four enhanced modules of DINO successfully improved the performance of indoor object detection in practical scenarios, confirming the efficacy of the feature extraction network, feature fusion module, loss function, and attention mechanism employed in this study for indoor object recognition amidst complex backgrounds.

3.5. Visualization Results

The comparison of object detection performance between the original DINO and the improved DINO is illustrated in Figure 7. The original DINO model had an accuracy of only 57.1% for bounding box predictions. By contrasting the bounding boxes drawn by different models in the figure, it is evident that the original DINO model was prone to false positives and false negatives when detecting objects with high similarity. In contrast, the improved DINO model used in this study demonstrated superior detection performance. The visualization results clearly demonstrate that the proposed method significantly enhances the model's performance in complex lighting and occlusion scenarios.
In addition to the successful detection cases mentioned above, it is also crucial to examine the limitations of our proposed method. One of the primary issues observed in our model is the misdetection or missed detection of objects due to severe occlusion. Figure 8 illustrates a scenario where a sofa is not detected due to significant obstruction by other furniture. The occlusion reduces the visibility of the sofa, causing the model to fail in recognizing its presence. The failure can be attributed to the occlusion preventing the model from extracting features of the couch effectively, leading to an inability to learn the feature representation of the occluded object. This highlights the necessity for further research into occlusion handling mechanisms, such as improved attention modules or occlusion-aware data augmentation techniques, to enhance the model’s robustness against occlusion.
Additionally, Figure 9 presents the normalized confusion matrix for the classification rates of all indoor object categories by I-DINO. It can be seen from the confusion matrix that, as anticipated, I-DINO achieved the highest classification rates for the majority categories such as Toilet, TV, Refrigerator, and Microwave. Nevertheless, I-DINO also obtained commendable classification rates for the minority categories like Sink, Chair, Couch, and Bed. These results directly demonstrate the robust capability of the method in addressing the issue of class imbalance.

4. Conclusions

This study addresses the challenges of indoor object detection against complex backgrounds, including diverse targets, significant lighting variations, occlusions, and extreme scale changes. The research builds upon the DINO model, incorporating a multi-scale backbone network, Res2Net, with increased depth, and introducing the GBi-FPN feature fusion module to bolster the connection between low-level and high-level feature information, effectively utilizing multi-scale features and mitigating the issue of large-scale variations in indoor objects. To counteract the impact of occlusions on detection accuracy, a deformable attention mechanism was employed to reduce the influence of secondary feature information and enhance the grasp of important features. Ablation experiments and comparative experiments with other classical algorithms on the Indoor-COCO dataset demonstrated that the improved model outperformed the original model in all aspects, achieving a favorable recognition effect with the mAP of 62.3%.
The field of computer vision research on indoor item identification is growing, with important implications for applications such as security monitoring, smart homes, and assistive robotics. The mAP of 62.3% indicates robust detection capabilities, but it also highlights areas for further improvement, especially in scenarios with high occlusion or small object detection. Moreover, the practicality of our model is not only defined by its accuracy but also by its computational efficiency. We have optimized our model to balance accuracy and speed, achieving a reasonable inference time that is crucial for real-time applications. However, the computational cost is a factor that may limit the model’s applicability in resource-constrained environments. Future work will focus on enhancing the model’s efficiency without compromising its accuracy, potentially through lightweight model variants or optimized inference engines. Additionally, the introduction of a masking mechanism into the model to generate mask matrices for filtering out irrelevant features, while extending the model to instance segmentation tasks, will aim to improve both model efficiency and accuracy without compromising object recognition precision. Furthermore, enriching the model’s recognition capabilities and delving into the precise identification of indoor objects in various complex scenarios will further enhance the model’s practical application value and performance.

Author Contributions

Conceptualization, W.M., Z.F. and W.L.; methodology, W.M., Z.F. and W.L.; validation, W.M. and M.C.; investigation, W.M., Z.F. and M.C.; resources, W.M.; data curation, W.M. and Z.F.; writing—original draft preparation, W.M., Z.F. and W.L.; writing—review and editing, W.M., Z.F. and W.L.; visualization, W.M. and Z.F.; supervision, Z.F. and Z.Q.; project administration, Z.F.; funding acquisition, Z.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Heilongjiang Postdoctoral Fund to pursue scientific research (grant number LBH-Z23025), Heilongjiang Province Colleges and Universities Basic Scientific Research Business Expenses Project (grant number 2023-KYYWF-1052), Harbin University of Commerce Industrialization Project (grant number 22CZ04), and Collaborative Innovation Achievement Program of Double First-class Disciplines in Heilongjiang Province (grant number LJGXCG2022-085).

Data Availability Statement

The data provided in this study is available upon request by contacting the corresponding author. This study utilized a self-built dataset, which can be downloaded from the following links: https://pan.baidu.com/s/1ZzSpsTQlVvST4H0WRodEIQ (accessed on 13 October 2024).

Acknowledgments

The Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing provided academic support.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Fang, Q.; Xu, X.; Wang, X.; Zeng, Y. Target-driven visual navigation in indoor scenes using reinforcement learning and imitation learning. CAAI Trans. Intell. Technol. 2022, 7, 167–176. [Google Scholar] [CrossRef]
  2. Alabachi, S.; Sukthankar, G.; Sukthankar, R. Customizing object detectors for indoor robots. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 8318–8324. [Google Scholar]
  3. Afif, M.; Ayachi, R.; Said, Y.; Atri, M. Deep learning based application for indoor scene recognition. Neural Process. Lett. 2020, 51, 2827–2837. [Google Scholar] [CrossRef]
  4. Lowe, D.G. Method and Apparatus for Identifying Scale Invariant Features in an Image and Use of Same for Locating an Object in an Image. U.S. Patent 6711293, 23 March 2004. [Google Scholar]
  5. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the International Conference on Computer Vision & Pattern Recognition, San Diego, CA, USA, 20–26 June 2005; Volume 2, pp. 886–893. [Google Scholar]
  6. Pietikainen, M.; Harwood, D. A comparative study of texture measures with classification based on feature distributions. Pattern Recognit. 1996, 29, 51–59. [Google Scholar]
  7. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 28 June 2014; pp. 580–587. [Google Scholar]
  8. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  9. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  11. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst. 2016, 29, 379–387. [Google Scholar]
  12. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  13. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  14. Redmon, J.; Divvala, S.; Girshick, R. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  15. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  16. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  17. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  18. Ge, Z.; Liu, S.; Wang, F. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  19. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 24 June 2023; pp. 7464–7475. [Google Scholar]
  20. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  21. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  22. Fu, C.Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. Dssd: Deconvolutional single shot detector. arXiv 2017, arXiv:1701.06659. [Google Scholar]
  23. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  24. Gupta, S.; Arbeláez, P.; Girshick, R.; Malik, J. Indoor scene understanding with rgb-d images: Bottom-up segmentation, object detection and semantic segmentation. Int. J. Comput. Vis. 2015, 112, 133–149. [Google Scholar] [CrossRef]
  25. Mattausch, O.; Panozzo, D.; Mura, C.; Sorkine-Hornung, O.; Pajarola, R. Object detection and classification from large-scale cluttered indoor scans. Comput. Graph. Forum 2014, 33, 11–21. [Google Scholar] [CrossRef]
  26. Georgakis, G.; Mousavian, A.; Berg, A.C.; Kosecka, J. Synthesizing training data for object detection in indoor scenes. arXiv 2017, arXiv:1702.07836. [Google Scholar]
  27. Yao, X.; Yang, Y.; Fang, Q.; Chen, Y. Context embedded deep neural network for indoor object detection. In Proceedings of the 2017 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Xiamen, China, 6–9 November 2017; pp. 332–336. [Google Scholar]
  28. Liu, Y.; Jiang, D.; Xu, C.; Sun, Y.; Jiang, G.; Tao, B.; Tong, X.; Xu, M.; Li, G.; Yun, J. Deep learning based 3D target detection for indoor scenes. Appl. Intell. 2023, 53, 10218–10231. [Google Scholar] [CrossRef]
  29. Ni, J.; Shen, K.; Chen, Y.; Yang, S.X. An improved ssd-like deep network-based object detection method for indoor scenes. IEEE Trans. Instrum. Meas. 2023, 72, 5006915. [Google Scholar] [CrossRef]
  30. Gladis, K.A.; Madavarapu, J.B.; Kumar, R.R.; Sugashini, T. In-out YOLO glass: Indoor-outdoor object detection using adaptive spatial pooling squeeze and attention YOLO network. Biomed. Signal Process. Control 2024, 91, 105925. [Google Scholar]
  31. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
  32. He, L.; Todorovic, S. Destr: Object detection with split transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 24 June 2022; pp. 9377–9386. [Google Scholar]
  33. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.-Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  34. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar] [CrossRef]
  35. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  36. Qiao, S.; Wang, H.; Liu, C.; Shen, W.; Yuille, A. Micro-batch training with batch-channel normalization and weight standardization. arXiv 2019, arXiv:1903.10520. [Google Scholar]
  37. Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision transformer with deformable attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 24 June 2022; pp. 4794–4803. [Google Scholar]
  38. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 19 June 2020; pp. 10781–10790. [Google Scholar]
  39. Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740. [Google Scholar]
  40. Lu, X.; Li, B.; Yue, Y.; Li, Q.; Yan, J. Grid r-cnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7363–7372. [Google Scholar]
  41. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463. [Google Scholar]
  42. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 821–830. [Google Scholar]
  43. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
Figure 1. Example of the indoor complex scene object detection dataset.
Figure 2. Diagram of the I-DINO Network Architecture.
Figure 3. Diagram of the Bottleneck Block Structure.
Figure 4. Diagram of the Enhanced Res2Net Network Architecture.
Figure 5. Bi-FPN Network Structure Diagram.
Figure 6. mAP Convergence Curve Graph.
Figure 7. Visualization Results.
Figure 8. Example of Missed Detection Caused by Heavy Occlusion.
Figure 9. Normalized Confusion Matrix for I-DINO’s Classification Rates of Indoor Object Categories.
Table 1. Experimental Results of Different Models (%).
Model | mAP | mAP_50 | mAP_75
Faster RCNN [10] | 47.7 | 73.4 | 53.0
Grid RCNN [40] | 47.3 | 69.0 | 52.1
Sparse RCNN [41] | 45.2 | 66.4 | 49.4
Libra RCNN [42] | 47.1 | 72.2 | 53.1
Cascade RCNN [13] | 50.4 | 72.9 | 56.2
Cascade RCNN + Res2Net + DCNv2 | 56.3 | 77.4 | 62.8
YOLOv3 [16] | 45.4 | 70.4 | 50.5
YOLOXs [18] | 55.5 | 76.9 | 62.1
YOLOv8s | 60.7 | 80.9 | 66.1
The method proposed in this paper | 62.3 | 81.7 | 67.8
Table 2. Ablation Experiment Results of Different Model Algorithms on the Dataset (%).
Group | Res2Net101+ | GBi-FPN | SIoU Loss | Deformable Attention | mAP | mAP_50 | mAP_75
1 | - | - | - | - | 57.1 | 76.1 | 62.8
2 | ✓ | - | - | - | 59.9 | 79.7 | 65.4
3 | ✓ | ✓ | - | - | 60.7 | 80.5 | 66.7
4 | ✓ | ✓ | ✓ | - | 61.2 | 80.9 | 66.9
5 | ✓ | ✓ | ✓ | ✓ | 62.3 | 81.7 | 67.8
✓ indicates the presence of the corresponding module in the experiment group.
