MR-GDINO: Efficient Open-World Continual Object Detection

¹Harbin Institute of Technology    ²Hong Kong Polytechnic University
Paper · Code · Checkpoints

Overview

Open-world (OW) models show strong zero- and few-shot adaptation abilities, inspiring their use as initializations in continual learning methods to improve accuracy. Despite promising results on seen classes, their open-world abilities on unseen classes degrade severely due to catastrophic forgetting. To formulate and tackle this challenge, we propose open-world continual object detection, which requires detectors to generalize to old, new, and unseen categories in continual learning scenarios. Based on this task, we present OW-COD, a challenging yet practical benchmark to assess these detection abilities. The goal is to motivate OW detectors to simultaneously preserve learned classes, adapt to new classes, and maintain open-world capabilities under few-shot adaptation. To mitigate forgetting on unseen categories, we propose MR-GDINO, a strong, efficient, and scalable baseline built on memory and retrieval mechanisms within a highly scalable memory pool. Experimental results show that existing continual open-world detectors suffer from severe forgetting on both seen and unseen categories. In contrast, MR-GDINO mitigates forgetting with only 0.1% extra activated parameters, achieving state-of-the-art performance on old, new, and unseen categories.

[Figure: teaser]

OW-COD Benchmark

Task Definition

Open-world continual object detection aims to optimize a pretrained open-world detector f through sequential learning on training sets {D_1, ..., D_T}; a minimal sketch of this protocol follows the list below. The optimized detector should:

  • Preserve: Accurately detect previously learned old classes
  • Adapt: Effectively learn newly introduced classes
  • Maintain: Keep strong generalization capabilities on unseen open-world classes
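
A minimal sketch of this sequential protocol, assuming a pretrained open-world detector f, few-shot training sets D_1, ..., D_T, and caller-supplied `adapt` and `evaluate` helpers (placeholders, not part of the official OW-COD code):

```python
from typing import Callable, List, Sequence

def continual_adaptation(
    f,                       # pretrained open-world detector
    train_sets: Sequence,    # [D_1, ..., D_T]
    adapt: Callable,         # adapt(f, D_t) -> detector updated on step-t classes
    evaluate: Callable,      # evaluate(f, eval_set) -> AP on that set
    eval_seen: Sequence,     # per-step evaluation sets for the seen classes
    eval_unseen,             # open-world evaluation set (classes never trained on)
) -> List[dict]:
    results = []
    for t, D_t in enumerate(train_sets, start=1):
        f = adapt(f, D_t)                                     # learn step-t classes few-shot
        results.append({
            "step": t,
            # Preserve: classes learned in steps 1..t-1
            "AP_old": [evaluate(f, s) for s in eval_seen[: t - 1]],
            # Adapt: classes introduced at step t
            "AP_new": evaluate(f, eval_seen[t - 1]),
            # Maintain: open-world classes never seen during adaptation
            "AP_unseen": evaluate(f, eval_unseen),
        })
    return results
```
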
Dataset Construction

The benchmark consists of two main components, summarized in the sketch after this list:

  • Seen Categories: Uses the ODinW-13 dataset for training and evaluation under few-shot settings
  • Unseen Categories: Leverages the LVIS dataset with its large-scale and diverse label space
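
The composition described above can be summarized roughly as follows; the dataset names and shot settings come from this page, while the dictionary layout itself is only an illustrative assumption:

```python
OW_COD_BENCHMARK = {
    "seen": {
        "source": "ODinW-13",       # 13 detection datasets, learned sequentially few-shot
        "shots": [1, 3, 5, 10],     # few-shot settings used in the leaderboards
        "reports": ["AP_old", "AP_new", "AP_seen"],
    },
    "unseen": {
        "source": "LVIS",           # large, diverse label space for open-world evaluation
        "reports": ["AP_unseen"],
    },
}
```
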
Evaluation Metrics
  • Average Precision (AP): Reports mean AP separately for previously learned classes, newly seen classes, and unseen categories
  • Average Rank (R_avg): A comprehensive metric that emphasizes balanced performance across all category types (a generic sketch follows this list)
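
A generic sketch of an average-rank computation is given below. The exact set of results the benchmark ranks over (e.g., per-dataset APs versus the four AP columns) is an assumption here, as is the tie handling:

```python
import numpy as np

def average_rank(ap_table: np.ndarray) -> np.ndarray:
    """ap_table: (num_methods, num_results) array of AP scores, one row per method.
    Returns the mean rank of each method across the result columns (rank 1 = best)."""
    order = np.argsort(-ap_table, axis=0)          # per column: best method first
    ranks = np.empty_like(order)
    for col in range(ap_table.shape[1]):
        ranks[order[:, col], col] = np.arange(1, ap_table.shape[0] + 1)
    return ranks.mean(axis=1)                      # R_avg per method
```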

MR-GDINO Framework

[Figure: MR-GDINO framework]

MR-GDINO is built on a frozen pretrained open-world object detector with explicit visual-language interaction modules (e.g., Grounding DINO). At each training step t, MR-GDINO initializes the concept memory θ_t^con and the visual-language interaction memory θ_t^inc from the corresponding parameters of step t-1, and optimizes both on the t-th training set. After training, θ_t^con and θ_t^inc are memorized into the memory pool B. During open-world inference, MR-GDINO uses the global embedding of the input image I to retrieve the optimal parameters (ψ_opt, θ_opt^con, θ_opt^inc) and uses them for accurate predictions.
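
Below is a minimal sketch of this memory-and-retrieval mechanism, built around a frozen detector. The class and method names (MemoryPool, global_embedding, detect_with_memories) are illustrative assumptions rather than the official MR-GDINO implementation, and the retrieval weight ψ is omitted for brevity:

```python
import torch
import torch.nn.functional as F

class MemoryPool:
    """Scalable pool B of per-step memories, keyed by image embeddings."""
    def __init__(self):
        self.keys = []      # one key embedding per step (e.g., mean training-image embedding)
        self.memories = []  # per-step (concept memory theta_con, interaction memory theta_inc)

    def memorize(self, key: torch.Tensor, theta_con, theta_inc) -> None:
        # Called after training step t to store the optimized memories.
        self.keys.append(F.normalize(key, dim=-1))
        self.memories.append((theta_con, theta_inc))

    def retrieve(self, query: torch.Tensor):
        # Return the memories whose key is most similar to the query embedding.
        query = F.normalize(query, dim=-1)
        sims = torch.stack([key @ query for key in self.keys])
        return self.memories[int(sims.argmax())]

def open_world_inference(detector, pool: MemoryPool, image, text_prompts):
    query = detector.global_embedding(image)        # global embedding of input image I
    theta_con, theta_inc = pool.retrieve(query)      # retrieve the best-matching memories
    return detector.detect_with_memories(image, text_prompts,
                                         concept_memory=theta_con,
                                         interaction_memory=theta_inc)
```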

Leaderboards

1-shot Continual Adaptations
Method Shots AP_old AP_new AP_seen AP_unseen Avg. Rank
ZS GDINO 0 35.5 58.8 37.3 20.7 -
CoOp 1 37.9 59.6 39.6 20.5 2.1
Adapter 1 36.3 55.5 37.7 19.7 3.3
L2P 1 31.4 59.4 34.0 18.7 3.9
ZiRa 1 30.9 50.4 32.5 6.9 4.4
MR-GDINO 1 45.6 58.9 46.7 20.6 1.3
3-shot Continual Adaptations
Method Shots AP_old AP_new AP_seen AP_unseen Avg. Rank
ZS GDINO 0 35.5 58.8 37.3 20.7 -
CoOp 3 17.8 47.1 20.1 19.4 3.6
Adapter 3 37.4 58.0 39.0 19.4 2.4
L2P 3 32.6 63.1 35.0 18.8 3.6
ZiRa 3 32.4 42.4 33.2 7.3 4.2
MR-GDINO 3 49.9 59.5 50.7 20.6 1.1
5-shot Continual Adaptations
Method Shots AP_old AP_new AP_seen AP_unseen Avg. Rank
ZS GDINO 0 35.5 58.8 37.3 20.7 -
CoOp 5 26.2 61.3 28.9 19.1 3.5
Adapter 5 37.5 57.6 39.1 20.2 2.5
L2P 5 29.7 64.4 32.4 17.4 3.7
ZiRa 5 32.6 62.9 35.1 5.8 4.1
MR-GDINO 5 49.9 62.0 50.8 20.6 1.3
10-shot Continual Adaptations
Method Shots AP_old AP_new AP_seen AP_unseen Avg. Rank
ZS GDINO 0 35.5 58.8 37.3 20.7 -
CoOp 10 31.7 62.7 34.5 17.0 3.7
Adapter 10 35.6 58.2 37.3 20.4 2.7
L2P 10 23.8 61.8 26.8 17.8 3.9
ZiRa 10 38.5 63.2 40.4 6.9 4.0
MR-GDINO 10 51.3 59.7 51.9 20.7 1.3

Visualization

[Figure: visualization]

Demo on Custom Data

When applied to custom data, MR-GDINO should enable two things. First, it should accurately detect newly learned categories, which usually represent "preferred concepts" (for example, when detecting Dva, MR-GDINO should assign the Dva tag rather than person). Second, it should detect unseen objects defined in the label space. MR-GDINO performs well on both.
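
A small, hypothetical illustration of the label space in this demo: newly learned "preferred concepts" are mixed with ordinary open-world classes, and the detector is expected to prefer the specific tag when both apply. The class lists and the `detect` callable are placeholders, not a real API:

```python
def demo_label_space(detect, image_path: str):
    learned_concepts = ["Dva"]                              # preferred concepts from adaptation
    open_world_classes = ["person", "keyboard", "monitor"]  # unseen classes in the label space
    label_space = learned_concepts + open_world_classes
    # Expected: the character is tagged "Dva" rather than "person", while the
    # remaining classes are still detected in an open-world manner.
    return detect(image_path, text_prompts=label_space)
```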

[Figure: demo on custom data]

Citation

@article{owcod,
    author = {Dong, Bowen and Huang, Zitong and Yang, Guanglei and Zhang, Lei and Zuo, Wangmeng},
    title = {MR-GDINO: Efficient Open-World Continual Object Detection},
    year = {2024},
}