Embodied agents require a range of perceptual and reasoning skills, ranging from low-level sensing to high-level planning. Recent work highlights the potential of Multimodal Large Language Models (MLLMs) as embodied agents, yet a holistic benchmark that evaluates step-wise embodied skills remains absent.
To bridge this gap, we introduce BEAR, a benchmark of 4,469 interleaved image–video–text VQA samples spanning perception to planning. Our systematic evaluation of 20 representative MLLMs and failure analysis reveal that current MLLMs are limited by a lack of omni-visual and 3D spatial abilities.
Motivated by this analysis, we introduce BEAR-Agent, a multimodal conversable agent that leverages visual tools to enhance MLLMs' embodied capabilities. BEAR-Agent significantly boosts the performance of InternVL3-14B and GPT-5 on BEAR. Moreover, our tabletop manipulation experiments demonstrate its potential as a step toward general embodied agents.
What are the basic capabilities an embodied agent needs? To answer this question, we inductively derive a skill taxonomy from large-scale embodied household activity datasets such as BEHAVIOR-1K and from everyday human activities. BEAR comprises five basic categories: Pointing, Bounding Box, Trajectory Reasoning, Spatial Reasoning, and Task Planning. A sixth category, Long-horizon, features episodes collected from simulation. In total, BEAR covers 14 fine-grained, step-wise embodied skills across these six categories.
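Concretely, the taxonomy can be read as a mapping from categories to fine-grained skills. The sketch below is a minimal illustration using the skill abbreviations defined alongside the evaluation tables further down; the dictionary itself is a hypothetical representation, not code released with the benchmark.

```python
# Illustrative sketch of the BEAR skill taxonomy (hypothetical data structure,
# not the benchmark's actual code). Skill abbreviations follow the legend used
# with the evaluation tables below.
BEAR_TAXONOMY = {
    "Pointing":             ["GEN", "SPA", "PRT"],  # general object, spatial object, semantic part
    "Bounding Box":         ["GEN", "SPA", "PRT"],
    "Task Planning":        ["PRG", "PRD"],          # process reasoning, next-action prediction
    "Trajectory Reasoning": ["GPR", "HND", "OBJ"],   # gripper, human hand, object trajectories
    "Spatial Reasoning":    ["LOC", "PTH", "DIR"],   # localization, path planning, relative direction
    "Long-horizon":         [],                      # simulation episodes composed of the skills above
}

# 14 fine-grained skills across the five basic categories.
assert sum(len(skills) for skills in BEAR_TAXONOMY.values()) == 14
```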
The Long-horizon category decomposes embodied episodes into skill-oriented steps for offline evaluation. This category includes 35 episodes from AI2-THOR, each annotated with structured skill steps. As shown in the image, "put the apple in the sink" is broken down into planning, object search, navigation, spatial reasoning, perception, and placement. Each step maps to an atomic skill in BEAR, demonstrating that our taxonomy is practically grounded in embodied tasks.
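To make the annotation structure concrete, here is a minimal sketch of one such episode; the field names and the step-to-skill mapping shown are illustrative assumptions, not BEAR's actual schema.

```python
# Illustrative sketch of a Long-horizon episode annotation (field names and the
# step-to-skill mapping are hypothetical, not BEAR's actual schema).
episode = {
    "task": "put the apple in the sink",
    "source": "AI2-THOR",
    "steps": [
        {"step": "planning",          "skill": "Task Planning"},
        {"step": "object search",     "skill": "Pointing / Bounding Box"},
        {"step": "navigation",        "skill": "Spatial Reasoning"},
        {"step": "spatial reasoning", "skill": "Spatial Reasoning"},
        {"step": "perception",        "skill": "Pointing / Bounding Box"},
        {"step": "placement",         "skill": "Trajectory Reasoning"},
    ],
}
```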
We evaluate 20 representative MLLMs, including 12 open-source models and 8 proprietary models. Abbreviations: GEN = General Object (Pointing/Box); SPA = Spatial Object (Pointing/Box); PRT = Semantic Part (Pointing/Box); PRG = Task Process Reasoning; PRD = Next Action Prediction; GPR = Gripper Trajectory Reasoning; HND = Human Hand Trajectory Reasoning; OBJ = Object Trajectory Reasoning; LOC = Object Localization; PTH = Path Planning; DIR = Relative Direction. Bounding-box (BBox) scores are reported on a 0-1 scale and are scaled by 100 when computing the overall average (see the sketch after the second table). We refer readers to the original papers for model details.
Model | Format | Pointing GEN | Pointing SPA | Pointing PRT | BBox GEN | BBox SPA | BBox PRT | Planning PRG | Planning PRD |
---|---|---|---|---|---|---|---|---|---|
Random Choice | - | - | - | - | - | - | - | 25 | 25 |
Human Performance | - | 95.50 | 92.00 | 93.50 | 0.830 | 0.770 | 0.820 | 87.50 | 92.00 |
Proprietary Models | | | | | | | | | |
GPT-o3 | sequential | 59.12 | 44.44 | 55.41 | 0.348 | 0.278 | 0.313 | 57.67 | 55.33 |
GPT-5 | sequential | 70.00 | 63.69 | 54.90 | 0.411 | 0.326 | 0.352 | 59.67 | 61.00 |
Gemini-2.5-Pro | sequential | 55.00 | 42.48 | 55.41 | 0.144 | 0.103 | 0.177 | 52.00 | 49.00 |
Gemini-2.5-Flash | sequential | 46.76 | 33.33 | 39.49 | 0.183 | 0.145 | 0.156 | 48.33 | 43.67 |
Gemini-2.0-Flash | sequential | 51.76 | 34.97 | 40.13 | 0.270 | 0.167 | 0.224 | 38.67 | 40.00 |
Open-source Models | | | | | | | | | |
InternVL3-14B | merged | 37.94 | 27.78 | 32.80 | 0.304 | 0.258 | 0.276 | 41.00 | 33.00 |
InternVL3-8B | merged | 52.65 | 42.48 | 43.95 | 0.369 | 0.275 | 0.297 | 43.00 | 33.67 |
InternVL2-40B | merged | 23.24 | 21.24 | 22.29 | 0.329 | 0.269 | 0.268 | 40.00 | 33.67 |
Qwen2.5-VL-32B | merged | 27.35 | 27.78 | 42.68 | 0.020 | 0.018 | 0.017 | 42.67 | 42.33 |
InternVL2-26B | merged | 21.18 | 15.36 | 18.79 | 0.201 | 0.202 | 0.147 | 41.33 | 34.33 |
Model | Trajectory GPR | Trajectory HND | Trajectory OBJ | Spatial LOC | Spatial PTH | Spatial DIR | Long-horizon | Avg |
---|---|---|---|---|---|---|---|---|
Random Choice | 25.0 | 25.0 | 25.0 | 25.0 | 28.0 | 25.0 | 25.0 | - |
Human Performance | 96.5 | 94.0 | 89.0 | 94.5 | 83.5 | 88.5 | 92.5 | 89.4 |
Proprietary Models | | | | | | | | |
GPT-o3 | 67.0 | 68.4 | 53.7 | 70.4 | 49.3 | 49.7 | 34.3 | 47.6 |
GPT-5 | 67.0 | 67.3 | 49.7 | 72.3 | 50.2 | 47.0 | 40.0 | 52.2 |
Gemini-2.5-Pro | 66.7 | 66.0 | 48.3 | 64.5 | 40.1 | 44.0 | 31.4 | 41.5 |
Gemini-2.5-Flash | 64.4 | 64.0 | 45.0 | 61.2 | 43.0 | 44.7 | 31.4 | 38.2 |
Gemini-2.0-Flash | 61.5 | 59.6 | 31.3 | 54.1 | 33.8 | 39.7 | 25.7 | 36.0 |
Open-source Models | | | | | | | | |
InternVL3-14B | 51.3 | 49.5 | 31.4 | 43.0 | 28.0 | 21.3 | 28.6 | 33.9 |
InternVL3-8B | 51.3 | 46.8 | 27.7 | 50.2 | 32.4 | 20.0 | 8.6 | 33.3 |
InternVL2-40B | 57.7 | 41.8 | 28.0 | 40.4 | 29.5 | 18.7 | 11.4 | 28.4 |
Qwen2.5-VL-32B | 55.5 | 52.2 | 26.7 | 47.2 | 26.6 | 22.7 | 20.0 | 28.3 |
InternVL2-26B | 53.2 | 43.8 | 30.3 | 26.1 | 26.6 | 22.0 | 11.3 | 25.7 |
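Because bounding-box metrics are reported on a 0-1 scale while the other skills are reported as percentages, the BBox columns are multiplied by 100 before averaging. Below is a minimal sketch of that aggregation, assuming a uniform per-skill mean; the benchmark's exact weighting (for example, by sample count per skill) may differ.

```python
def overall_average(skill_scores: dict[str, float], bbox_skills: set[str]) -> float:
    """Average per-skill scores, scaling 0-1 bounding-box metrics by 100.

    Assumes equal weight per skill; the benchmark's exact aggregation may differ.
    """
    rescaled = [
        score * 100 if skill in bbox_skills else score
        for skill, score in skill_scores.items()
    ]
    return sum(rescaled) / len(rescaled)
```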
We propose BEAR-Agent, a multimodal conversable agent that leverages visual tools to enhance embodied capabilities. Please refer to our paper for more details.
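At a high level, BEAR-Agent lets the backbone MLLM converse with visual tools before committing to an answer. The sketch below is a hypothetical illustration of such a loop; the tool-call protocol, tool names, and helper functions are placeholders rather than the actual BEAR-Agent implementation.

```python
# Hypothetical sketch of a conversable agent that augments an MLLM with visual
# tools. Tool names and the text-based tool-call protocol are illustrative
# placeholders; see the paper for BEAR-Agent's actual design.
from typing import Callable

class VisualToolAgent:
    def __init__(self, mllm: Callable[[str, list], str], tools: dict[str, Callable]):
        self.mllm = mllm    # backbone MLLM, e.g. InternVL3-14B or GPT-5
        self.tools = tools  # e.g. {"detect_objects": ..., "estimate_depth": ...}

    def answer(self, question: str, images: list, max_turns: int = 4) -> str:
        context = [question]
        reply = ""
        for _ in range(max_turns):
            reply = self.mllm("\n".join(context), images)
            if reply.startswith("TOOL:"):           # model requests a visual tool
                tool_name, _, arg = reply[5:].partition(" ")
                result = self.tools[tool_name](images, arg)
                context.append(f"[{tool_name} output] {result}")
            else:                                   # model commits to a final answer
                return reply
        return reply
```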