MR-GDINO: Efficient Open-World Continual Object Detection
Overview
Open-world (OW) models show strong zero- and few-shot adaptation abilities, inspiring their use as initializations in continual learning methods to improve accuracy. Despite promising results on seen classes, their open-world abilities on unseen classes degrade sharply due to catastrophic forgetting. To formulate and tackle this challenge, we propose open-world continual object detection, which requires detectors to generalize to old, new, and unseen categories in continual learning scenarios. Based on this task, we present OW-COD, a challenging yet practical benchmark for assessing detection abilities. Its goal is to motivate OW detectors to simultaneously preserve learned classes, adapt to new classes, and maintain open-world capabilities under few-shot adaptation. To mitigate forgetting of unseen categories, we propose MR-GDINO, a strong, efficient, and scalable baseline built on memory and retrieval mechanisms within a highly scalable memory pool. Experimental results show that existing continual open-world detectors suffer severe forgetting on both seen and unseen categories. In contrast, MR-GDINO mitigates forgetting with only 0.1% activated extra parameters, achieving state-of-the-art performance on old, new, and unseen categories.
OW-COD Benchmark
Open-world continual object detection aims to optimize a pretrained open-world detector f through sequential learning on training sets {D1, ..., DT}. The optimized detector should:
- Preserve: Accurately detect previously learned old classes
- Adapt: Effectively learn newly introduced classes
- Maintain: Keep strong generalization capabilities on unseen open-world classes
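The sequential protocol above can be sketched as a simple loop. This is an illustrative abstraction: the detector and the actual fine-tuning on each training set D_t are elided, and only the evolving old/new class split used for evaluation is tracked.

```python
def continual_steps(datasets):
    """Sketch of the OW-COD protocol: tasks arrive sequentially, and after
    each step t the detector is evaluated on old classes (from steps < t),
    the new classes of step t, and a held-out unseen label space.
    `datasets` is a list of per-step class lists; fine-tuning is elided."""
    old_classes = []
    schedule = []
    for t, new_classes in enumerate(datasets, start=1):
        # A real run would fine-tune lightweight parameters on D_t here.
        schedule.append({"step": t,
                        "old": list(old_classes),
                        "new": list(new_classes)})
        old_classes.extend(new_classes)
    return schedule
```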
The benchmark consists of two data components and two metrics:
- Seen Categories: uses the ODinW-13 datasets for few-shot training and evaluation
- Unseen Categories: leverages the LVIS dataset with its large-scale, diverse label space
- Average Precision (AP): mean AP reported separately for previously learned (old) classes, newly introduced classes, and unseen categories
- Average Rank (Ravg): a summary metric that rewards balanced performance across all category types
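The average-rank metric can be computed by ranking all methods on each AP column independently and averaging those ranks per method. The sketch below is a generic implementation of this kind of metric (the exact set of columns the benchmark ranks over is an assumption here):

```python
def average_rank(scores):
    """Rank methods on each metric (higher AP = better, rank 1 = best),
    then average the per-metric ranks for each method.
    `scores` maps a method name to its list of metric values,
    with every method using the same metric order."""
    methods = list(scores)
    n_metrics = len(next(iter(scores.values())))
    ranks = {m: [] for m in methods}
    for j in range(n_metrics):
        # Sort descending so the best value on this metric gets rank 1.
        ordered = sorted(methods, key=lambda m: scores[m][j], reverse=True)
        for r, m in enumerate(ordered, start=1):
            ranks[m].append(r)
    return {m: sum(rs) / n_metrics for m, rs in ranks.items()}
```

A method that is best everywhere gets Ravg = 1.0, while a method that trades a high rank on one column for a poor one elsewhere is penalized, which is why the metric emphasizes balance.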
MR-GDINO Framework
MR-GDINO builds on a frozen pretrained open-world object detector with explicit visual-language interaction modules (e.g., Grounding DINO). At each training step t, MR-GDINO initializes the concept memory θ^t_con and the visual-language interaction memory θ^t_inc from the corresponding parameters of step t-1, and optimizes both on the t-th training set. After training, θ^t_con and θ^t_inc are stored in the memory pool B. At open-world inference time, MR-GDINO uses the global embedding of the input image I to retrieve the optimal parameters (ψ_opt, θ^opt_con, θ^opt_inc) from the pool and uses them for accurate predictions.
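The memory-and-retrieval mechanism can be sketched as a pool of (key embedding, parameters) pairs, where retrieval picks the stored parameters whose key is closest to the query image's global embedding. This is a minimal illustration, not the authors' implementation: the class and method names are hypothetical, and cosine similarity is an assumed retrieval criterion.

```python
import numpy as np

class MemoryPool:
    """Toy memory-and-retrieval pool. Assumes each continual step stores a
    key embedding (e.g. a summary embedding of its training images) together
    with its step-specific parameter sets."""

    def __init__(self):
        self.keys = []      # one key embedding per memorized step
        self.params = []    # matching (concept, interaction) parameters

    def memorize(self, key, theta_con, theta_inc):
        self.keys.append(np.asarray(key, dtype=float))
        self.params.append((theta_con, theta_inc))

    def retrieve(self, image_embedding):
        """Return the index and parameters whose key is most
        cosine-similar to the query image embedding."""
        q = np.asarray(image_embedding, dtype=float)
        q = q / np.linalg.norm(q)
        keys = np.stack(self.keys)
        keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
        best = int(np.argmax(keys @ q))
        return best, self.params[best]
```

Because only the retrieved parameter set is activated at inference, the pool can grow with the number of tasks while keeping the per-image cost of activated extra parameters small.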
Leaderboards
1-shot setting
Method | Shots | APold | APnew | APseen | APunseen | avg rank |
---|---|---|---|---|---|---|
ZS GDINO | 0 | 35.5 | 58.8 | 37.3 | 20.7 | - |
CoOp | 1 | 37.9 | 59.6 | 39.6 | 20.5 | 2.1 |
Adapter | 1 | 36.3 | 55.5 | 37.7 | 19.7 | 3.3 |
L2P | 1 | 31.4 | 59.4 | 34.0 | 18.7 | 3.9 |
ZiRa | 1 | 30.9 | 50.4 | 32.5 | 6.9 | 4.4 |
MR-GDINO | 1 | 45.6 | 58.9 | 46.7 | 20.6 | 1.3 |
3-shot setting
Method | Shots | APold | APnew | APseen | APunseen | avg rank |
---|---|---|---|---|---|---|
ZS GDINO | 0 | 35.5 | 58.8 | 37.3 | 20.7 | - |
CoOp | 3 | 17.8 | 47.1 | 20.1 | 19.4 | 3.6 |
Adapter | 3 | 37.4 | 58.0 | 39.0 | 19.4 | 2.4 |
L2P | 3 | 32.6 | 63.1 | 35.0 | 18.8 | 3.6 |
ZiRa | 3 | 32.4 | 42.4 | 33.2 | 7.3 | 4.2 |
MR-GDINO | 3 | 49.9 | 59.5 | 50.7 | 20.6 | 1.1 |
5-shot setting
Method | Shots | APold | APnew | APseen | APunseen | avg rank |
---|---|---|---|---|---|---|
ZS GDINO | 0 | 35.5 | 58.8 | 37.3 | 20.7 | - |
CoOp | 5 | 26.2 | 61.3 | 28.9 | 19.1 | 3.5 |
Adapter | 5 | 37.5 | 57.6 | 39.1 | 20.2 | 2.5 |
L2P | 5 | 29.7 | 64.4 | 32.4 | 17.4 | 3.7 |
ZiRa | 5 | 32.6 | 62.9 | 35.1 | 5.8 | 4.1 |
MR-GDINO | 5 | 49.9 | 62.0 | 50.8 | 20.6 | 1.3 |
10-shot setting
Method | Shots | APold | APnew | APseen | APunseen | avg rank |
---|---|---|---|---|---|---|
ZS GDINO | 0 | 35.5 | 58.8 | 37.3 | 20.7 | - |
CoOp | 10 | 31.7 | 62.7 | 34.5 | 17.0 | 3.7 |
Adapter | 10 | 35.6 | 58.2 | 37.3 | 20.4 | 2.7 |
L2P | 10 | 23.8 | 61.8 | 26.8 | 17.8 | 3.9 |
ZiRa | 10 | 38.5 | 63.2 | 40.4 | 6.9 | 4.0 |
MR-GDINO | 10 | 51.3 | 59.7 | 51.9 | 20.7 | 1.3 |
Visualization
Demo on Custom Data
When applied to custom data, MR-GDINO should enable two things. First, it should accurately detect newly seen categories, which usually represent "preferred concepts" (for example, when detecting Dva, MR-GDINO should assign the Dva tag rather than person). Second, it should detect unseen objects defined in the label space. MR-GDINO performs well on both.
Citation
@article{owcod,
  author = {Dong, Bowen and Huang, Zitong and Yang, Guanglei and Zhang, Lei and Zuo, Wangmeng},
  title  = {MR-GDINO: Efficient Open-World Continual Object Detection},
  year   = {2024},
}