AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models
Abstract
Figure 1. Illustration of the AEGIS benchmark, covering visual understanding, generation, editing, and interleaved generation tasks.
The capability of Unified Multimodal Models (UMMs) to apply world knowledge across diverse tasks remains a critical, unresolved challenge. Existing benchmarks fall short, offering only siloed, single-task evaluations with limited diagnostic power. To bridge this gap, we propose AEGIS (Assessing Editing, Generation, Interpretation-Understanding for Super-intelligence), a comprehensive multi-task benchmark covering visual understanding, generation, editing, and interleaved generation. AEGIS comprises 1,050 challenging, manually annotated questions spanning 21 topics (across STEM, humanities, and daily life) and 6 reasoning types. To evaluate UMM performance concretely, without ambiguous metrics, we propose Deterministic Checklist-based Evaluation (DCE), a protocol built on atomic "Yes/No" judgments. Our extensive experiments reveal that most UMMs exhibit severe world knowledge deficits and that performance degrades significantly as reasoning complexity increases.
Motivation
Real-world applications require AI to handle diverse requests demanding sophisticated reasoning and world knowledge. However, current benchmarks suffer from three fundamental limitations:
- Siloed Evaluation: Confined to single-task assessments, failing to measure inter-task gaps in UMMs.
- Lack of Diagnostics: Difficult to discern whether failures stem from understanding (LLM) or generation modules.
- Limited Scope: Inadequate coverage of "corner knowledge" and complex reasoning capabilities.
Highlights
- Comprehensive Multi-Task Benchmark: Assesses Visual Understanding, Generation, Editing, and Interleaved Generation simultaneously.
- Extensive Knowledge Coverage: 1,050 questions across 21 topics (STEM, Humanities, Daily Life) and 6 reasoning types.
- Deterministic Evaluation (DCE): A novel checklist-based protocol that replaces ambiguous scores with atomic "Yes/No" judgments for reliability.
- In-depth Diagnosis: Reveals severe world knowledge deficits in SOTA UMMs and the impact of reasoning complexity.
Methodology: AEGIS & DCE
We constructed AEGIS using a rigorous human-in-the-loop pipeline. Each question is organized along two axes:
- Knowledge Topics: 21 topics grouped into the STEM, Humanities, and Daily Life domains.
- Reasoning Types: Spatial, Temporal, Causal, Comparative, Analogical, and Logical reasoning. Each prompt is designed to be "Reasoning-Enhanced" to mimic real-world complexity; an illustrative item layout is sketched below.
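For concreteness, the dataclass below sketches how a single AEGIS item might be organized. The class and field names are our own illustration, not the released data format, but each field mirrors an element described above and in the DCE protocol that follows (task, domain and topic, reasoning type, prompt, reference answer, keywords, and the derived checklist).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AegisItem:
    """Illustrative layout of one AEGIS question (field names are hypothetical)."""
    task: str              # "understanding" | "generation" | "editing" | "interleaved"
    domain: str            # "STEM" | "Humanities" | "Daily Life"
    topic: str             # one of the 21 fine-grained topics
    reasoning_type: str    # Spatial | Temporal | Causal | Comparative | Analogical | Logical
    prompt: str            # the reasoning-enhanced question or instruction
    reference_answer: str  # human-annotated reference used to derive the checklist
    keywords: List[str] = field(default_factory=list)   # key concepts for checklist generation
    checklist: List[str] = field(default_factory=list)  # atomic "Yes/No" questions for DCE
```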
To overcome the ambiguity of standard "LLM-as-a-Judge" scoring, DCE decomposes evaluation into verifiable steps:
- Checklist Generation: An MLLM (e.g., Gemini) generates a set of atomic "Yes/No" questions based on reference answers and keywords.
- Deterministic Judgment: A judge model evaluates the model's response against each item in the checklist.
- Scoring: The final score is the percentage of "Yes" judgments, ensuring a concrete and reliable metric (see the sketch after this list).
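To make the scoring step concrete, here is a minimal Python sketch of DCE scoring. The names `dce_score` and `toy_judge` are our own illustrative choices; the keyword-matching stub merely stands in for the MLLM judge used in the actual protocol.

```python
from typing import Callable, List

def dce_score(response: str,
              checklist: List[str],
              judge: Callable[[str, str], bool]) -> float:
    """Score a response as the percentage of checklist items judged "Yes".

    `judge(response, question)` returns True for "Yes" and False for "No";
    in AEGIS this role is played by an MLLM judge.
    """
    if not checklist:
        return 0.0
    yes = sum(1 for question in checklist if judge(response, question))
    return 100.0 * yes / len(checklist)

if __name__ == "__main__":
    # Toy judge for illustration only: checks whether the quoted key phrase
    # of a checklist question appears in the response.
    def toy_judge(response: str, question: str) -> bool:
        key = question.split('"')[1] if '"' in question else question
        return key.lower() in response.lower()

    checklist = [
        'Does the answer mention "photosynthesis"?',
        'Does the answer mention "chlorophyll"?',
    ]
    print(dce_score("Plants use photosynthesis to make sugar.",
                    checklist, toy_judge))  # 50.0
```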
Figure 2. The pipeline of AEGIS data construction and Deterministic Checklist-based Evaluation (DCE).
Main Experiment Results
We evaluated various Unified Multimodal Models (UMMs) and Single-Task Generative Models. Below is the overall performance comparison.
| Model | Und. (STEM) | Und. (Humanity) | Und. (Life) | Gen. (STEM) | Gen. (Humanity) | Gen. (Life) | Edit (STEM) | Edit (Humanity) | Edit (Life) | Interleaved (STEM) | Interleaved (Humanity) | Interleaved (Life) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Unified Multimodal Models** | | | | | | | | | | | | | |
| Nano Banana Pro | 77.7 | 79.3 | 70.4 | 62.2 | 64.4 | 58.2 | 64.1 | 67.8 | 58.5 | 42.6 | 40.2 | 38.5 | 64.3 |
| Gemini Nano Banana | 64.5 | 65.7 | 55.0 | 42.6 | 49.5 | 45.5 | 44.4 | 62.4 | 54.2 | 50.2 | 41.6 | 43.4 | 52.9 |
| GPT-4o + GPT-Image-1 | 52.9 | 50.9 | 46.9 | 38.2 | 51.6 | 42.8 | 39.4 | 53.2 | 45.2 | 38.9 | 34.7 | 33.0 | 45.7 |
| Bagel-7B w/o CoT | 25.5 | 26.9 | 19.3 | 12.1 | 20.6 | 15.3 | 15.0 | 17.6 | 21.2 | 13.0 | 11.9 | 8.2 | 18.5 |
| Bagel-7B w/ CoT | 31.8 | 31.7 | 22.0 | 14.9 | 31.2 | 21.3 | 11.6 | 23.5 | 23.1 | 11.8 | 11.9 | 9.9 | 22.3 |
| Ovis-U1* | 26.3 | 31.0 | 17.1 | 12.7 | 25.0 | 16.3 | 19.3 | 27.2 | 25.9 | 12.2 | 12.5 | 8.4 | 21.2 |
| BLIP3o* | 30.8 | 43.3 | 21.7 | 3.7 | 6.3 | 2.7 | 2.6 | 4.4 | 4.5 | 9.4 | 8.3 | 4.0 | 13.7 |
| Qwen-Image | 31.4 | 41.2 | 22.7 | 17.9 | 31.9 | 25.4 | 20.7 | 35.4 | 33.4 | 22.0 | 18.7 | 17.9 | 28.0 |
| Janus-Pro 7B# | 7.9 | 18.0 | 9.7 | 13.7 | 17.0 | 18.2 | - | - | - | 2.2 | 6.5 | 2.7 | - |
| Show-o2# | 15.4 | 26.6 | 11.1 | 16.7 | 24.7 | 22.5 | - | - | - | 5.3 | 8.7 | 3.4 | - |
| Emu-3# | 3.1 | 8.0 | 2.0 | 8.9 | 19.7 | 14.5 | - | - | - | - | - | - | - |
| **Understanding MLLMs** | | | | | | | | | | | | | |
| Qwen-3-VL 8B | 42.6 | 48.1 | 34.0 | - | - | - | - | - | - | - | - | - | - |
| Kimi-VL-A3B | 30.6 | 36.4 | 23.5 | - | - | - | - | - | - | - | - | - | - |
| GPT-5 | 67.0 | 57.4 | 60.6 | - | - | - | - | - | - | - | - | - | - |
| Gemini-2.5-Pro | 72.1 | 77.3 | 63.3 | - | - | - | - | - | - | - | - | - | - |
| **Image Generation or Editing Models** | | | | | | | | | | | | | |
| FLUX.1-Dev* | - | - | - | 15.4 | 29.2 | 16.8 | - | - | - | - | - | - | - |
| Step1X-Edit* | - | - | - | - | - | - | 19.8 | 31.9 | 37.1 | - | - | - | - |
| Instruct-Pix2Pix* | - | - | - | - | - | - | 17.3 | 17.6 | 23.5 | - | - | - | - |
| Seedream* | - | - | - | 33.9 | 43.7 | 38.6 | 32.8 | 53.0 | 43.0 | - | - | - | - |
Table 3 (Complete): Performance comparison across tasks and domains. Gemini Nano Banana and GPT-4o achieve significantly better results than their open-source counterparts. Models marked with * cannot handle interleaved inputs; # denotes models that do not support the corresponding task.
Figure 3. Visualization of UMMs on Understanding, Generation, and Editing tasks. Nano Banana shows promising quality and consistency.
Core Findings
Severe World Knowledge Deficits
Most UMMs, with the exception of Gemini Nano Banana, exhibit significant deficits in world knowledge. Open-source models lag substantially behind their closed-source counterparts, owing to smaller parameter scales and lower-quality training data.
Reasoning Complexity Impact
Performance degrades considerably across all models when complex reasoning (e.g., Temporal, Causal) is introduced. "Interleaved Generation," which demands the most complex reasoning, shows a precipitous drop in scores.
Understanding Restricts Generation
There is a clear performance hierarchy: Understanding > Generation > Editing > Interleaved. If a model fails to understand the knowledge in the prompt, it cannot generate or edit correctly.
Mitigation via Reasoning Modules
Integrating simple plug-in reasoning modules, such as chain-of-thought (CoT) prompting or externally rewritten, explicit prompts, can partially mitigate these deficits by activating the model's inherent world knowledge (see the sketch below).
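As an illustration of such a plug-in, the sketch below places a prompt-rewriting step in front of an arbitrary image generator. The function names, prompt template, and callable interfaces are our own assumptions, not the paper's implementation.

```python
from typing import Any, Callable

def rewrite_with_cot(prompt: str, llm: Callable[[str], str]) -> str:
    """Ask an LLM to spell out the world knowledge implied by a prompt,
    producing an explicit description before image generation."""
    instruction = (
        "Think step by step about the world knowledge needed to fulfill the "
        "following request, then output one explicit, visually detailed "
        "description suitable for an image generator.\n\nRequest: " + prompt
    )
    return llm(instruction)

def generate_with_reasoning(prompt: str,
                            llm: Callable[[str], str],
                            generator: Callable[[str], Any]) -> Any:
    """Two-stage pipeline: rewrite the knowledge-heavy prompt, then generate."""
    explicit_prompt = rewrite_with_cot(prompt, llm)
    return generator(explicit_prompt)
```

As noted under the visual decoder bottleneck below, such rewriting only helps to the extent that the decoder can actually render the concepts the LLM spells out.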
Visual Decoder Bottleneck
Appendix analysis identifies the visual decoder as a primary bottleneck. Even when the LLM component provides precise and detailed descriptions (e.g., via rewritten prompts), the visual decoder frequently fails to render specific world knowledge concepts accurately.
Demo
Explore a sample of AEGIS benchmark tasks across four multimodal capabilities: Understanding, Generation, Editing, and Interleaved Generation.
Citation
@article{lin2025aegis,
title={AEGIS: Exploring the Limit of World Knowledge Capabilities for Unified Multimodal Models},
author={Lin, Jintao and Dong, Bowen and Shi, Weikang and Lei, Chenyang and Zhang, Suiyun and Liu, Rui and Liu, Xihui},
journal={arXiv preprint},
year={2025}
}