4.3. Main Results
This section presents a detailed comparative analysis of TGOOD against various OWOD algorithms. To ensure a fair comparison, we follow prior work by removing the energy evaluation module from ORE, denoting the modified version as ORE-EBUI. For the OWOD SPLIT dataset, we categorize the compared models into two groups: those that utilize pseudo-labels and those that do not.
Table 1 and Table 2 summarize the comparison results for these two categories.
Table 3 presents the performance comparison between TGOOD and existing models on the MS-COCO SPLIT dataset.
Table 4 shows the performance comparison of TGOOD with classic OWOD methods that rely on objectness scores to select pseudo-labels on the COCO-O SPLIT. A downward arrow (↓) indicates that lower values signify better performance, while an upward arrow (↑) indicates that higher values are preferred. Best results are marked in red, and second-best results in blue.
Table 1 compares TGOOD with existing pseudo-label-based OWOD methods under the OWOD SPLIT.
Among the comparative methods, ORE-EBUI, OW-DETR, SA, UC-OWOD, and RandBox all employ pseudo-label strategies based on objectness scores. A common deficiency of these methods is their weak detection capability for unknown-class objects, as indicated by their low UR values. Notably, RandBox, despite the known-class detection advantage afforded by its diffusion model-based detection framework, achieved unknown-class recall rates of only 10.6%, 6.3%, and 7.8% in tasks 1, 2, and 3, respectively. We attribute this insufficiency primarily to the pseudo-label generation mechanism: because objectness scores are learned from labeled known-class objects, selecting pseudo-labels by objectness tends to mislabel candidates containing parts of known-class objects as unknown-class objects. This diminishes the model's ability to distinguish between known and unknown classes and prevents it from efficiently identifying true unknown-class objects. In contrast, TGOOD uses its SRS module to generate unbiased pseudo-labels for unknown classes under textual guidance, significantly enhancing unknown-class detection and achieving recall rates of 23.1%, 18.7%, and 22.1% in tasks 1, 2, and 3, respectively.
Additionally, RE-OWOD and CAT, which employ additional pseudo-label generation strategies such as selective search [20], also demonstrate decent detection performance. Compared to these methods, TGOOD shows comparable or superior performance. Although RE-OWOD's known-class detection accuracy is slightly higher than TGOOD's in tasks 2, 3, and 4, its recall rate for unknown-class objects is almost half that of TGOOD. Overall, TGOOD achieves the best balance between known- and unknown-class detection performance among all comparative methods, excelling in particular in unknown-class recall. However, there is still room for improvement on the WI and A-OSE metrics, indicating further research potential in distinguishing between known- and unknown-class objects; this will be one of the focal points of future research.
Table 2 presents a performance comparison between TGOOD and existing OWOD methods that do not use pseudo-labels under the OWOD SPLIT. Among these methods, OCPL aims to reduce the overlap between known- and unknown-class distributions in the feature space to construct more discriminative feature representations; 2B-OCD employs an object-centered calibrator to identify candidate boxes with scores above a certain threshold as unknown classes; PROB utilizes class-agnostic Gaussian distributions to model object features; and Ann adopts a label transfer learning paradigm to decouple the features of known- and unknown-class objects. The essence of these methods is to enhance the separability of known- and unknown-class objects in the feature space. Compared to classic objectness score-based pseudo-label methods, approaches that do not rely on pseudo-labels indeed achieve higher recall rates for unknown classes, which indirectly confirms the known-class bias of objectness score-based pseudo-label methods. Nevertheless, through the synergy of its modules, the proposed TGOOD achieves optimal or near-optimal results in both known-class detection accuracy and unknown-class recall across all task phases.
Table 3 compares TGOOD with existing OWOD methods under the MS-COCO SPLIT. The experimental results demonstrate that TGOOD also significantly enhances detection performance on the more challenging MS-COCO SPLIT dataset. Specifically, TGOOD excels in detecting unknown classes, achieving recall rates of 29.4%, 29.0%, and 35.1% in tasks 1, 2, and 3, respectively, surpassing all comparative methods. However, we observed that in task 1, although TGOOD showed substantial improvement in known-class detection accuracy compared to ORE-EBUI, it still lagged behind methods employing the Deformable DETR framework [43], such as OW-DETR, PROB, and CAT. We attribute this gap to the underlying detection framework: the FQR-CNN framework adopted by TGOOD is more lightweight than Deformable DETR. However, as the tasks progress, TGOOD demonstrates superior performance in both known-class detection accuracy and unknown-class recall in tasks 2, 3, and 4. Notably, a significant issue common to the comparative methods is that their detection capability for currently known classes is markedly lower than for previously known classes, with cur-mAP significantly lower than pre-mAP. TGOOD effectively reduces this discrepancy and achieves the optimal overall detection accuracy (both mAP). We believe this is due to the proposed RRM module, which enables the model to learn more discriminative foreground object features, thereby enhancing generalization and robustness across tasks.
Table 4 compares the detection performance of TGOOD with ORE-EBUI and OW-DETR, traditional methods that use objectness scores for pseudo-labeling, on the COCO-O SPLIT dataset. Averaged across the six domain datasets, TGOOD's known-class detection is comparable to or slightly better than that of ORE-EBUI and OW-DETR. Notably, TGOOD maintains superior unknown-class recognition across unseen domains, outperforming both comparison methods. We attribute this to the effective guidance provided by text containing abstract semantics: the general nature of textual semantic information helps TGOOD sustain high generalization performance on detection data from diverse domains.
4.4. Ablation Study
Extensive ablation experiments were conducted to validate the effectiveness of TGOOD. Unless stated otherwise, these experiments were performed on task 1 using the OWOD SPLIT dataset setting.
4.4.1. Components of TGOOD
To evaluate the effectiveness of each module in TGOOD, we conducted ablation experiments for each module individually. The results are presented in Table 5. The baseline, shown in the first row, is the FQR-CNN model without any enhancements; it highlights that the basic FQR-CNN detection model lacks the capability to detect unknown-class objects.
“Obj1” refers to selecting the candidate query with the highest objectness score as an unknown-class object, similar to the pseudo-labeling method used in ORE-EBUI. The results in the second row demonstrate that “Obj1” gives our basic detection model some capability to detect unknown-class objects. However, it is insufficient for effectively distinguishing between known and unknown categories, as indicated by the relatively high A-OSE score. “Obj5” selects the top five candidate queries with the highest objectness scores as unknown-class objects, akin to the pseudo-labeling method used in OW-DETR. The third row shows that increasing the number of pseudo-labeled candidates enhances the model’s ability to identify unknown-class objects, with UR improving from 10.0% to 13.7%. Nevertheless, this comes at the cost of known-class detection accuracy, with mAP decreasing from 56.42% to 55.87%.
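As a concrete illustration, “Obj1” and “Obj5” both reduce to a top-k selection over per-query objectness scores. The sketch below uses hypothetical scores and is not the papers' actual implementation:

```python
import numpy as np

def objectness_pseudo_labels(objectness, k=1):
    """Pick the k candidate queries with the highest objectness scores
    as pseudo-labels for the unknown class (k=1 ~ "Obj1", k=5 ~ "Obj5")."""
    # argsort ascending, reverse for descending order, keep the top k
    return np.argsort(objectness)[::-1][:k].tolist()

scores = np.array([0.12, 0.91, 0.33, 0.78, 0.05, 0.64])
print(objectness_pseudo_labels(scores, k=1))  # [1]
print(objectness_pseudo_labels(scores, k=5))  # [1, 3, 5, 2, 0]
```

Because the scores themselves are learned only from known-class annotations, every variant of this selection inherits the known-class bias discussed above; raising k only trades known-class accuracy for unknown-class recall.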
“SRS” refers to the pseudo-label generation module proposed in this paper. The results in the fourth row demonstrate that this module significantly enhances the recall rate for unknown-class objects, with a UR 2.28 times that of “Obj1”. Additionally, “SRS” effectively mitigates the misidentification of unknown-class objects as known classes, reducing the A-OSE score from 77,927 to 15,169 compared to “Obj1”. “OTM” denotes replacing the one-to-one label matching algorithm with a one-to-many label matching algorithm. As illustrated in the fifth row, assigning multiple candidate queries to each ground-truth object during training strengthens the supervisory signal, thereby improving known-class detection performance. Furthermore, the integrated “RRM” module enhances the prominent features of foreground objects; as shown in the sixth row, it further improves the model’s ability to detect all foreground objects, both known and unknown. Figure 6 visualizes the influence of TGOOD’s various components on the mAP and UR metrics.
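The intuition behind one-to-many matching can be sketched as follows; this is a generic illustration over a hypothetical cost matrix, not TGOOD's exact matcher:

```python
import numpy as np

def one_to_many_match(cost, m=2):
    """cost[g, q]: matching cost between ground truth g and query q.
    One-to-many: give each ground truth its m lowest-cost queries,
    instead of the single query a one-to-one (Hungarian) matcher yields."""
    return {g: np.argsort(cost[g])[:m].tolist() for g in range(cost.shape[0])}

cost = np.array([[0.2, 0.9, 0.1, 0.5],
                 [0.8, 0.3, 0.7, 0.4]])
print(one_to_many_match(cost, m=2))  # {0: [2, 0], 1: [1, 3]}
```

Each ground-truth object now supervises several queries per iteration, which is the stronger supervisory signal credited above for the known-class mAP gain.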
4.4.2. The Versatility of SRS
For the pseudo-label generation module SRS proposed in this paper, we evaluated the effectiveness of its various substructures through comparative experiments.
Figure 7 provides an overview of these results. Item A demonstrates the standard version of the text-guided pseudo-label generation strategy; it achieves a UR of 19.4%, significantly outperforming existing pseudo-label methods that rely on objectness scores. Item B shows that incorporating the random de-biasing scheme improves the UR by an additional 3.5%, indicating that random selection effectively reduces the model’s bias towards known-class objects. Item C illustrates the enhanced version of the text-guided strategy: applying secondary filtering with a higher similarity threshold yields higher-quality pseudo-labels, improving detection performance for both known and unknown classes.
Additionally, since the basic object detection framework used in this paper differs from that of traditional objectness score-based OWOD methods (e.g., ORE-EBUI), we tested the cross-framework effectiveness of the proposed SRS module by applying it to ORE-EBUI. During training, among candidate boxes that did not match any ground-truth box, those with extremely low objectness scores were excluded, and the fifty candidate boxes with the highest objectness scores were retained; this step reduces noise in the pseudo-labeling process. The SRS module was then applied to the retained candidates.
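The pre-filtering step might look like the following sketch; the low-score cutoff value is our illustrative assumption, as the text only states that extremely low scores are excluded:

```python
import numpy as np

def prefilter_unmatched(objectness, matched, keep=50, min_score=0.05):
    """Among candidate boxes unmatched to any ground truth, drop very
    low-objectness boxes (min_score is an assumed cutoff), then keep the
    `keep` highest-scoring ones before handing them to the SRS module."""
    cand = np.flatnonzero(~matched & (objectness > min_score))
    # sort the surviving candidates by descending objectness
    order = cand[np.argsort(objectness[cand])[::-1]]
    return order[:keep]

obj = np.array([0.90, 0.01, 0.70, 0.80, 0.60])
matched = np.array([True, False, False, False, False])
print(prefilter_unmatched(obj, matched, keep=2))  # [3 2]
```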
Table 6 presents the comparison results. “ORE-EBUI+SRS” denotes the integration of SRS with ORE-EBUI, while “ORE-EBUI+Obj5” represents the replacement of “Obj1” with “Obj5” in the original ORE-EBUI. The results indicate that simply increasing the number of pseudo-labels for unknown classes negatively impacts the detection performance for known classes, with “Obj5” causing a drop in known-class detection accuracy from 56.00% to 18.31%. In contrast, the SRS module not only maintains but also enhances the detection performance for known-class objects, and improves the recall rate for unknown-class objects from 4.9% to 7.6%. Furthermore, the SRS module significantly improves the WI and A-OSE indicators, which assess the model’s comprehensive detection capabilities in open environments.
4.4.3. Different Methods for ROI Refinement
To assess the effectiveness of the proposed RRM module, we compared it with the “RoIAttn” module from [44] and two other commonly used feature enhancement strategies. The comparison results are presented in Table 7, where “TGOOD-RRM” refers to the TGOOD model without any enhancement module. The first comparison method is an LSTM network: all ROI features are treated as a sequence, and the enhanced ROI features are obtained by averaging the outputs of a bidirectional LSTM encoder; the results are shown in the “BiLstm” row. The second is a graph convolutional network (GCN): the ROI features are treated as nodes in a graph, with edges weighted by the cosine similarity between ROI features. To reduce computational complexity, the enhanced features are obtained after a single graph convolution update; the results are shown in the “GCN” row.
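For reference, the GCN baseline can be sketched as a single propagation step over a cosine-similarity graph; the feature dimensions, normalization, and ReLU choice below are our assumptions rather than the exact configuration used in the experiments:

```python
import numpy as np

def gcn_enhance(roi_feats, W):
    """One graph-convolution update: ROI features are graph nodes,
    edge weights are pairwise cosine similarities (row-normalized)."""
    X = roi_feats / np.linalg.norm(roi_feats, axis=1, keepdims=True)
    A = X @ X.T                                    # cosine-similarity adjacency
    A = A / np.abs(A).sum(axis=1, keepdims=True)   # row-normalize edge weights
    return np.maximum((A @ roi_feats) @ W, 0.0)    # ReLU(A_hat X W)

rng = np.random.default_rng(0)
rois = rng.normal(size=(4, 8))   # 4 ROIs with 8-dim features
W = rng.normal(size=(8, 8))      # learnable projection (random here)
enhanced = gcn_enhance(rois, W)
```

Stopping after one update keeps the cost at a single dense matrix product per image, which is the complexity argument made above.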
Using an enhancement module positively impacts performance compared to using no ROI feature enhancement at all, and our RRM module consistently demonstrates the best overall performance. The BiLSTM and GCN methods increase mAP by 0.26% and 0.1%, and UR by 0.1% and 0.2%, respectively, whereas the proposed RRM module increases mAP by 0.35% and UR by 0.2%. Notably, the “RoIAttn” module negatively affects known-class detection. This issue likely stems from an inherent flaw of the “RoIAttn” approach, which introduces two additional memory units for clustering operations; in an open environment, where unknown-class objects lack labels, this process is prone to noise interference. In contrast, the RRM module enhances foreground object features by leveraging the similarity between ROI features, avoiding this drawback.
4.4.4. Hyperparameter Analysis
In the SRS module proposed in this study, two key hyperparameters were evaluated through ablation experiments, with the results shown in Figure 8. To balance the model’s ability to detect both known- and unknown-class objects, their final values were set to 0.5 and 0.8, respectively, as these values provided the best performance.
4.5. Visualization
To provide a more intuitive demonstration of TGOOD’s effectiveness, we present several test results in Figure 9. After training on task 1, the test results of ORE-EBUI, OW-DETR, and TGOOD are shown in the first, second, and third rows, respectively.
Superior Performance: TGOOD demonstrates superior performance in detecting both known and unknown objects within images. First, TGOOD accurately localizes and classifies known objects with high confidence, even under severe occlusion. For example, it successfully identifies and classifies a chair leg in the top-right corner of the images in the third column and a small, heavily occluded car in the top-left corner of the fourth column, both of which ORE-EBUI and OW-DETR fail to detect. Second, TGOOD shows a strong ability to detect unknown-class objects. In the first column, ORE-EBUI detects no unknown-class objects, while OW-DETR misidentifies an object as unknown, generating two nearly identical bounding boxes; TGOOD correctly identifies unknown objects such as a skateboard, shoes, and a board. Although OW-DETR detects more unknown objects in the third column (e.g., broccoli, carrot), it fails to delineate their boundaries accurately, misses known-class objects, and incorrectly labels two objects that do not belong to any known class. In contrast, TGOOD accurately locates unknown objects (such as a cup and rice) and correctly identifies all known-class objects (dining table and chair).
Limitations: While the proposed method demonstrates superior detection of both known and unknown categories compared to existing methods relying on objectness scores, its confidence when identifying unknown-category objects is generally low, around 20%. This limitation arises from the single pseudo-label mechanism, which assigns the same label to all unknown-category objects, so the purity of the pseudo-labels may be compromised. Generating high-purity pseudo-labels for unlabeled unknown objects remains a key challenge for future research in open-environment object detection.