AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models

1University of Hong Kong    2The Hong Kong Polytechnic University
3The Chinese University of Hong Kong    4Huawei Research
*Equal contribution, in random order    †Corresponding author

Abstract


Figure 1. Illustration of AEGIS benchmark covering visual understanding, generation, editing, and interleaved generation tasks.

The capability of Unified Multimodal Models (UMMs) to apply world knowledge across diverse tasks remains a critical, unresolved challenge. Existing benchmarks fall short, offering only siloed, single-task evaluations with limited diagnostic power. To bridge this gap, we propose AEGIS (Assessing Editing, Generation, Interpretation-Understanding for Super-intelligence), a comprehensive multi-task benchmark covering visual understanding, generation, editing, and interleaved generation. AEGIS comprises 1,050 challenging, manually annotated questions spanning 21 topics (including STEM, humanities, and daily life) and 6 reasoning types. To evaluate UMM performance concretely, without ambiguous metrics, we propose Deterministic Checklist-based Evaluation (DCE), a protocol built on atomic "Yes/No" judgments. Our extensive experiments reveal that most UMMs exhibit severe world knowledge deficits and that performance degrades significantly as reasoning complexity increases.

Motivation

Real-world applications require AI to handle diverse requests demanding sophisticated reasoning and world knowledge. However, current benchmarks suffer from three fundamental limitations:

  • Siloed Evaluation: Confined to single-task assessments, failing to measure inter-task gaps in UMMs.
  • Lack of Diagnostics: Hard to discern whether failures stem from the understanding (LLM) module or the generation module.
  • Limited Scope: Inadequate coverage of "corner knowledge" and complex reasoning capabilities.

Highlights

  • Comprehensive Multi-Task Benchmark: Assesses Visual Understanding, Generation, Editing, and Interleaved Generation simultaneously.
  • Extensive Knowledge Coverage: 1,050 questions across 21 topics (STEM, Humanities, Daily Life) and 6 reasoning types.
  • Deterministic Evaluation (DCE): A novel checklist-based protocol that replaces ambiguous scores with atomic "Yes/No" judgments for reliability.
  • In-depth Diagnosis: Reveals severe world knowledge deficits in SOTA UMMs and the impact of reasoning complexity.

Methodology: AEGIS & DCE

1. AEGIS Dataset Construction

We constructed AEGIS using a rigorous human-in-the-loop pipeline. The dataset is structured around:

  • 3 domains: STEM, Humanities, Daily Life
  • 21 topics
  • 6 reasoning types

Reasoning Types: Spatial, Temporal, Causal, Comparative, Analogical, and Logical reasoning. Each prompt is designed to be "Reasoning-Enhanced" to mimic real-world complexity.
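
For concreteness, a single AEGIS item can be pictured as a record carrying its task, domain, topic, reasoning type, prompt, reference answer, and DCE checklist. The schema below is a minimal sketch with illustrative field names; it is an assumption for exposition, not the released data format.

```python
# Hypothetical representation of one AEGIS item; field names are illustrative
# and do not necessarily match the released data schema.
from dataclasses import dataclass, field

DOMAINS = ("STEM", "Humanities", "Daily Life")
REASONING_TYPES = ("Spatial", "Temporal", "Causal", "Comparative", "Analogical", "Logical")
TASKS = ("understanding", "generation", "editing", "interleaved_generation")

@dataclass
class AegisItem:
    task: str                 # one of TASKS
    domain: str               # one of DOMAINS
    topic: str                # one of the 21 topics, e.g. "chemistry"
    reasoning_type: str       # one of REASONING_TYPES
    prompt: str               # reasoning-enhanced question or instruction
    reference: str            # reference answer / target description
    checklist: list[str] = field(default_factory=list)  # atomic Yes/No questions for DCE
```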

2. Deterministic Checklist-based Evaluation (DCE)

To overcome the ambiguity of standard "LLM-as-a-Judge" scoring, DCE decomposes evaluation into verifiable steps:

  1. Checklist Generation: An MLLM (e.g., Gemini) generates a set of atomic "Yes/No" questions based on reference answers and keywords.
  2. Deterministic Judgment: A judge model evaluates the model's response against each item in the checklist.
  3. Scoring: The final score is the percentage of "Yes" judgments, yielding a concrete and reliable metric (see the scoring sketch after Figure 2).

Figure 2. The pipeline of AEGIS data construction and Deterministic Checklist-based Evaluation (DCE).
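
The scoring step of DCE reduces to counting "Yes" judgments over the checklist. The snippet below is a minimal, self-contained sketch of that step; the keyword-matching judge is only a stand-in for the MLLM judge used in the actual protocol, and all names are illustrative.

```python
# Minimal sketch of DCE scoring. The stand-in judge below does trivial keyword
# matching; in the real protocol an MLLM judge answers each Yes/No question
# against the model's textual response and/or generated image.
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    question: str          # atomic "Yes/No" question
    keywords: list[str]    # evidence the stand-in judge looks for

def judge_yes_no(item: ChecklistItem, response: str) -> bool:
    # Stand-in judge: "Yes" if every keyword appears in the response text.
    return all(k.lower() in response.lower() for k in item.keywords)

def dce_score(checklist: list[ChecklistItem], response: str) -> float:
    # Final DCE score = percentage of checklist items judged "Yes".
    if not checklist:
        return 0.0
    yes = sum(judge_yes_no(item, response) for item in checklist)
    return 100.0 * yes / len(checklist)

if __name__ == "__main__":
    checklist = [
        ChecklistItem("Does the answer name the correct element?", ["helium"]),
        ChecklistItem("Does it state the atomic number?", ["2"]),
    ]
    print(dce_score(checklist, "The element is helium, atomic number 2."))  # 100.0
```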

Main Experiment Results

We evaluated various Unified Multimodal Models (UMMs) and Single-Task Generative Models. Below is the overall performance comparison.

| Model | Understanding (STEM / Humanity / Life) | Generation (STEM / Humanity / Life) | Editing (STEM / Humanity / Life) | Interleaved Generation (STEM / Humanity / Life) | Overall |
|---|---|---|---|---|---|
| Unified Multimodal Models | | | | | |
| Nano Banana Pro | 77.7 / 79.3 / 70.4 | 62.2 / 64.4 / 58.2 | 64.1 / 67.8 / 58.5 | 42.6 / 40.2 / 38.5 | 64.3 |
| Gemini Nano Banana | 64.5 / 65.7 / 55.0 | 42.6 / 49.5 / 45.5 | 44.4 / 62.4 / 54.2 | 50.2 / 41.6 / 43.4 | 52.9 |
| GPT-4o + GPT-Image-1 | 52.9 / 50.9 / 46.9 | 38.2 / 51.6 / 42.8 | 39.4 / 53.2 / 45.2 | 38.9 / 34.7 / 33.0 | 45.7 |
| Bagel-7B w/o CoT | 25.5 / 26.9 / 19.3 | 12.1 / 20.6 / 15.3 | 15.0 / 17.6 / 21.2 | 13.0 / 11.9 / 8.2 | 18.5 |
| Bagel-7B w. CoT | 31.8 / 31.7 / 22.0 | 14.9 / 31.2 / 21.3 | 11.6 / 23.5 / 23.1 | 11.8 / 11.9 / 9.9 | 22.3 |
| Ovis-U1* | 26.3 / 31.0 / 17.1 | 12.7 / 25.0 / 16.3 | 19.3 / 27.2 / 25.9 | 12.2 / 12.5 / 8.4 | 21.2 |
| BLIP3o* | 30.8 / 43.3 / 21.7 | 3.7 / 6.3 / 2.7 | 2.6 / 4.4 / 4.5 | 9.4 / 8.3 / 4.0 | 13.7 |
| Qwen-Image | 31.4 / 41.2 / 22.7 | 17.9 / 31.9 / 25.4 | 20.7 / 35.4 / 33.4 | 22.0 / 18.7 / 17.9 | 28.0 |
| Janus-Pro 7B# | 7.9 / 18.0 / 9.7 | 13.7 / 17.0 / 18.2 | - | 2.2 / 6.5 / 2.7 | - |
| Show-o2# | 15.4 / 26.6 / 11.1 | 16.7 / 24.7 / 22.5 | - | 5.3 / 8.7 / 3.4 | - |
| Emu-3# | 3.1 / 8.0 / 2.0 | 8.9 / 19.7 / 14.5 | - | - | - |
| Understanding MLLMs | | | | | |
| Qwen-3-VL 8B | 42.6 / 48.1 / 34.0 | - | - | - | - |
| Kimi-VL-A3B | 30.6 / 36.4 / 23.5 | - | - | - | - |
| GPT-5 | 67.0 / 57.4 / 60.6 | - | - | - | - |
| Gemini-2.5-Pro | 72.1 / 77.3 / 63.3 | - | - | - | - |
| Image Generation or Editing Models | | | | | |
| FLUX.1-Dev* | - | 15.4 / 29.2 / 16.8 | - | - | - |
| Step1X-Edit* | - | - | 19.8 / 31.9 / 37.1 | - | - |
| Instruct-Pix2Pix* | - | - | 17.3 / 17.6 / 23.5 | - | - |
| Seedream* | - | 33.9 / 43.7 / 38.6 | 32.8 / 53.0 / 43.0 | - | - |

Table 3 (complete): Performance comparison across tasks and domains. Gemini Nano Banana and GPT-4o achieve significantly better results than their open-source counterparts. Models marked with * cannot handle interleaved inputs; # denotes models that do not support the corresponding tasks.


Figure 3. Visualization of UMMs on Understanding, Generation, and Editing tasks. Nano Banana shows promising quality and consistency.

Core Findings

Severe World Knowledge Deficits

Most UMMs, with the exception of Gemini Nano Banana, exhibit significant deficits in world knowledge. Open-source models lag substantially behind closed-source counterparts due to limited parameter scale and data quality.

Reasoning Complexity Impact

Performance degrades considerably across all models when complex reasoning (e.g., Temporal, Causal) is introduced. "Interleaved Generation," which demands the most complex reasoning, shows a precipitous drop in scores.

Understanding Restricts Generation

There is a clear performance hierarchy: Understanding > Generation > Editing > Interleaved Generation. If a model fails to understand the knowledge in the prompt, it cannot generate or edit correctly.

Mitigation via Reasoning Modules

Integrating simple plug-in reasoning modules, such as chain-of-thought (CoT) prompting or externally rewritten, explicit prompts, can partially mitigate these deficits by activating the model's inherent world knowledge, as illustrated in the sketch below.
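
As a rough illustration of such a plug-in, the following sketch inserts an external reasoning step that rewrites the user's prompt into an explicit, knowledge-grounded description before it reaches the image generator. Both `call_llm` and `generate_image` are placeholders for whichever models are actually used; this is an assumption about how such a module could be wired, not the paper's implementation.

```python
# Sketch of a plug-in reasoning module: an external LLM expands the user's
# prompt into an explicit, knowledge-grounded description before generation.
# `call_llm` and `generate_image` are placeholders, not a specific API.

def call_llm(prompt: str) -> str:
    """Placeholder for any chat LLM endpoint used as the reasoning module."""
    raise NotImplementedError

def generate_image(prompt: str) -> bytes:
    """Placeholder for any text-to-image model (the visual decoder)."""
    raise NotImplementedError

def generate_with_reasoning(user_prompt: str) -> bytes:
    # Step 1: let the LLM surface the relevant world knowledge explicitly.
    rewrite_request = (
        "Think step by step about the world knowledge this request relies on, "
        "then rewrite it as a self-contained, explicit image description:\n"
        + user_prompt
    )
    detailed_prompt = call_llm(rewrite_request)
    # Step 2: the visual decoder only sees the already-resolved description.
    return generate_image(detailed_prompt)
```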

Visual Decoder Bottleneck

Appendix analysis identifies the visual decoder as a primary bottleneck. Even when the LLM component provides precise and detailed descriptions (e.g., via rewritten prompts), the visual decoder frequently fails to render specific world knowledge concepts accurately.

Demo

Explore a sample of AEGIS benchmark tasks across four multimodal capabilities: Understanding, Generation, Editing, and Interleaved Generation.


Citation

@article{lin2025aegis,
    title={AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models},
    author={Lin, Jintao and Dong, Bowen and Shi, Weikang and Lei, Chenyang and Zhang, Suiyun and Liu, Rui and Liu, Xihui},
    journal={arXiv preprint},
    year={2025}
}