MR-GDINO: Efficient Open-World Continual Object Detection
Overview
Open-world (OW) models show strong zero- and few-shot adaptation abilities, inspiring their use as initializations in continual learning methods to improve accuracy. Despite promising results on seen classes, their open-world abilities on unseen classes degrade sharply due to catastrophic forgetting. To formulate and tackle this challenge, we propose open-world continual object detection, which requires detectors to generalize to old, new, and unseen categories in continual learning scenarios. Based on this task, we present OW-COD, a challenging yet practical benchmark for assessing detection abilities. Its goal is to motivate OW detectors to simultaneously preserve learned classes, adapt to new classes, and maintain open-world capabilities under few-shot adaptation. To mitigate forgetting of unseen categories, we propose MR-GDINO, a strong, efficient, and scalable baseline built on memory and retrieval mechanisms within a highly scalable memory pool. Experimental results show that existing continual open-world detectors suffer severe forgetting on both seen and unseen categories. In contrast, MR-GDINO mitigates forgetting with only 0.1% activated extra parameters, achieving state-of-the-art performance on old, new, and unseen categories.
OW-COD Benchmark
Open-world continual object detection aims to optimize a pretrained open-world detector f through sequential learning on training sets {D1, ..., DT}. The optimized detector should:
- Preserve: Accurately detect previously learned old classes
- Adapt: Effectively learn newly introduced classes
- Maintain: Keep strong generalization capabilities on unseen open-world classes
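The sequential protocol above can be sketched as a simple loop. This is an illustrative abstraction: the detector and the actual fine-tuning on each training set D_t are elided, and only the evolving old/new class split used for evaluation is tracked.

```python
def continual_steps(datasets):
    """Sketch of the OW-COD protocol: tasks arrive sequentially, and after
    each step t the detector is evaluated on old classes (from steps < t),
    the new classes of step t, and a held-out unseen label space.
    `datasets` is a list of per-step class lists; fine-tuning is elided."""
    old_classes = []
    schedule = []
    for t, new_classes in enumerate(datasets, start=1):
        # A real run would fine-tune lightweight parameters on D_t here.
        schedule.append({"step": t,
                        "old": list(old_classes),
                        "new": list(new_classes)})
        old_classes.extend(new_classes)
    return schedule
```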
The benchmark consists of two data components and two metrics:
- Seen Categories: uses the ODinW-13 datasets for few-shot training and evaluation
- Unseen Categories: leverages the LVIS dataset with its large-scale, diverse label space
- Average Precision (AP): mean AP reported separately for previously learned (old) classes, newly introduced classes, and unseen categories
- Average Rank (Ravg): a summary metric that rewards balanced performance across all category types
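The average-rank metric can be computed by ranking all methods on each AP column independently and averaging those ranks per method. The sketch below is a generic implementation of this kind of metric (the exact set of columns the benchmark ranks over is an assumption here):

```python
def average_rank(scores):
    """Rank methods on each metric (higher AP = better, rank 1 = best),
    then average the per-metric ranks for each method.
    `scores` maps a method name to its list of metric values,
    with every method using the same metric order."""
    methods = list(scores)
    n_metrics = len(next(iter(scores.values())))
    ranks = {m: [] for m in methods}
    for j in range(n_metrics):
        # Sort descending so the best value on this metric gets rank 1.
        ordered = sorted(methods, key=lambda m: scores[m][j], reverse=True)
        for r, m in enumerate(ordered, start=1):
            ranks[m].append(r)
    return {m: sum(rs) / n_metrics for m, rs in ranks.items()}
```

A method that is best everywhere gets Ravg = 1.0, while a method that trades a high rank on one column for a poor one elsewhere is penalized, which is why the metric emphasizes balance.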
MR-GDINO Framework
MR-GDINO builds on a frozen pretrained open-world object detector with explicit visual-language interaction modules (e.g., Grounding DINO). At each training step t, MR-GDINO initializes the concept memory θ^t_con and the visual-language interaction memory θ^t_inc from the corresponding parameters of step t-1, and optimizes both on the t-th training set. After training, θ^t_con and θ^t_inc are stored in the memory pool B. At open-world inference time, MR-GDINO uses the global embedding of the input image I to retrieve the optimal parameters (ψ_opt, θ^opt_con, θ^opt_inc) from the pool and uses them for accurate predictions.
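The memory-and-retrieval mechanism can be sketched as a pool of (key embedding, parameters) pairs, where retrieval picks the stored parameters whose key is closest to the query image's global embedding. This is a minimal illustration, not the authors' implementation: the class and method names are hypothetical, and cosine similarity is an assumed retrieval criterion.

```python
import numpy as np

class MemoryPool:
    """Toy memory-and-retrieval pool. Assumes each continual step stores a
    key embedding (e.g. a summary embedding of its training images) together
    with its step-specific parameter sets."""

    def __init__(self):
        self.keys = []      # one key embedding per memorized step
        self.params = []    # matching (concept, interaction) parameters

    def memorize(self, key, theta_con, theta_inc):
        self.keys.append(np.asarray(key, dtype=float))
        self.params.append((theta_con, theta_inc))

    def retrieve(self, image_embedding):
        """Return the index and parameters whose key is most
        cosine-similar to the query image embedding."""
        q = np.asarray(image_embedding, dtype=float)
        q = q / np.linalg.norm(q)
        keys = np.stack(self.keys)
        keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
        best = int(np.argmax(keys @ q))
        return best, self.params[best]
```

Because only the retrieved parameter set is activated at inference, the pool can grow with the number of tasks while keeping the per-image cost of activated extra parameters small.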
Leaderboards
1-shot setting
Method | Shots | APold | APnew | APseen | APunseen | avg rank |
---|---|---|---|---|---|---|
ZS GDINO | 0 | 35.5 | 58.8 | 37.3 | 20.7 | - |
CoOp | 1 | 37.9 | 59.6 | 39.6 | 20.5 | 2.1 |
Adapter | 1 | 36.3 | 55.5 | 37.7 | 19.7 | 3.3 |
L2P | 1 | 31.4 | 59.4 | 34.0 | 18.7 | 3.9 |
ZiRa | 1 | 30.9 | 50.4 | 32.5 | 6.9 | 4.4 |
MR-GDINO | 1 | 45.6 | 58.9 | 46.7 | 20.6 | 1.3 |
3-shot setting
Method | Shots | APold | APnew | APseen | APunseen | avg rank |
---|---|---|---|---|---|---|
ZS GDINO | 0 | 35.5 | 58.8 | 37.3 | 20.7 | - |
CoOp | 3 | 17.8 | 47.1 | 20.1 | 19.4 | 3.6 |
Adapter | 3 | 37.4 | 58.0 | 39.0 | 19.4 | 2.4 |
L2P | 3 | 32.6 | 63.1 | 35.0 | 18.8 | 3.6 |
ZiRa | 3 | 32.4 | 42.4 | 33.2 | 7.3 | 4.2 |
MR-GDINO | 3 | 49.9 | 59.5 | 50.7 | 20.6 | 1.1 |
5-shot setting
Method | Shots | APold | APnew | APseen | APunseen | avg rank |
---|---|---|---|---|---|---|
ZS GDINO | 0 | 35.5 | 58.8 | 37.3 | 20.7 | - |
CoOp | 5 | 26.2 | 61.3 | 28.9 | 19.1 | 3.5 |
Adapter | 5 | 37.5 | 57.6 | 39.1 | 20.2 | 2.5 |
L2P | 5 | 29.7 | 64.4 | 32.4 | 17.4 | 3.7 |
ZiRa | 5 | 32.6 | 62.9 | 35.1 | 5.8 | 4.1 |
MR-GDINO | 5 | 49.9 | 62.0 | 50.8 | 20.6 | 1.3 |
10-shot setting
Method | Shots | APold | APnew | APseen | APunseen | avg rank |
---|---|---|---|---|---|---|
ZS GDINO | 0 | 35.5 | 58.8 | 37.3 | 20.7 | - |
CoOp | 10 | 31.7 | 62.7 | 34.5 | 17.0 | 3.7 |
Adapter | 10 | 35.6 | 58.2 | 37.3 | 20.4 | 2.7 |
L2P | 10 | 23.8 | 61.8 | 26.8 | 17.8 | 3.9 |
ZiRa | 10 | 38.5 | 63.2 | 40.4 | 6.9 | 4.0 |
MR-GDINO | 10 | 51.3 | 59.7 | 51.9 | 20.7 | 1.3 |
Visualization
Demo on Custom Data
When applied to custom data, MR-GDINO should enable two things. First, it should accurately detect newly seen categories, which usually represent "preferred concepts" (for example, when detecting Dva, MR-GDINO should assign the Dva tag rather than person). Second, it should detect unseen objects defined in the label space. MR-GDINO performs well on both.
Citation
@article{owcod,
  author = {Dong, Bowen and Huang, Zitong and Yang, Guanglei and Zhang, Lei and Zuo, Wangmeng},
  title  = {MR-GDINO: Efficient Open-World Continual Object Detection},
  year   = {2024},
}