BEAR: Benchmarking and Enhancing Multimodal Language Models for
Atomic Embodied Capabilities

1 Northeastern University, 2 The Chinese University of Hong Kong, 3 Peking University, 4 Westlake University, 5 Harvard University, 6 Purdue University, 7 University of Oxford
* Equal contribution † Project lead
Please contact qi.yu2@northeastern.edu if you would like to discuss this work.
Paper | Code | Podcast | HuggingFace

Introduction

Embodied agents require a range of perceptual and reasoning skills, from low-level sensing to high-level planning. Recent work highlights the potential of Multimodal Large Language Models (MLLMs) as embodied agents. Yet a holistic benchmark that evaluates step-wise embodied skills remains absent.

To bridge this gap, we introduce BEAR, a benchmark of 4,469 interleaved image-video-text VQA samples spanning skills from perception to planning. Our systematic evaluation of 20 representative MLLMs, together with a detailed failure analysis, reveals that current MLLMs are limited by a lack of omni-visual and 3D spatial abilities.
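To make the data format concrete, the sketch below shows what one interleaved multiple-choice VQA record in BEAR might look like. The field names and the four-option layout are our assumptions (consistent with the 25-point Random Choice baseline in the leaderboard), not the released schema.

```python
# Hypothetical BEAR-style sample record: an interleaved image/video-text
# multiple-choice VQA item. Field names are illustrative, not the official schema.
sample = {
    "id": "bear_000123",
    "category": "Spatial Reasoning",   # one of the 6 BEAR categories
    "skill": "DIR",                    # fine-grained skill, e.g. Relative Direction
    "visual_inputs": [                 # interleaved image/video evidence
        {"type": "image", "path": "images/scene_042.jpg"},
        {"type": "video", "path": "videos/clip_042.mp4"},
    ],
    "question": "From the camera's viewpoint, where is the mug relative to the laptop?",
    "options": {"A": "Left", "B": "Right", "C": "Behind", "D": "In front"},
    "answer": "B",
}
```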

Motivated by this failure analysis, we introduce BEAR-Agent, a multimodal conversable agent that leverages visual tools to enhance MLLMs' embodied capabilities. BEAR-Agent significantly boosts the performance of InternVL3-14B and GPT-5 on BEAR. Moreover, our tabletop manipulation experiments demonstrate its potential as a step toward general embodied agents.

BEAR Benchmark Overview

What are the basic embodied capabilities for an embodied agent? To answer this question, we inductively summarize them from large-scale embodied household activity datasets such as BEHAVIOR-1K and from everyday human activities. BEAR comprises five basic categories: Pointing, Bounding Box, Trajectory Reasoning, Spatial Reasoning, and Task Planning. A sixth category, Long-horizon, features episodes collected from simulation. In total, BEAR covers 14 fine-grained, step-wise embodied skills across these six categories.

Long-horizon Category

The long-horizon category decomposes embodied episodes into skill-oriented steps for offline evaluation. It includes 35 episodes from AI2-THOR, each annotated with structured skill steps. As shown in the figure, "put the apple in the sink" is broken down into planning, object search, navigation, spatial reasoning, perception, and placement. Each step maps to an atomic skill in BEAR, demonstrating that our taxonomy is practically grounded in embodied tasks.
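For concreteness, the sketch below shows one way such an annotated episode could be represented as data. The field names and the BEAR category attached to each step are our assumptions; only the task and the ordering of the steps follow the example above.

```python
# Illustrative representation of one annotated long-horizon episode
# ("put the apple in the sink"), decomposed into skill-oriented steps.
# Field names are hypothetical; the step-to-category mapping is our guess.
episode = {
    "source": "AI2-THOR",
    "task": "put the apple in the sink",
    "steps": [
        {"step": "plan the sub-goals",              "category": "Task Planning"},
        {"step": "search for the apple",            "category": "Pointing / Bounding Box"},
        {"step": "navigate toward the apple",       "category": "Trajectory Reasoning"},
        {"step": "reason about the sink's location", "category": "Spatial Reasoning"},
        {"step": "perceive and grasp the apple",    "category": "Pointing / Bounding Box"},
        {"step": "place the apple in the sink",     "category": "Trajectory Reasoning"},
    ],
}
```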

Statistics

BEAR Leaderboard

We evaluate 20 representative MLLMs, including 12 open-source models and 8 proprietary models. GEN = General Object (Pointing/Box); SPA = Spatial Object (Pointing/Box); PRT = Semantic Part (Pointing/Box); PRG = Task Process Reasoning; PRD = Next Action Prediction; GPR = Gripper Trajectory Reasoning; HND = Human Hand Trajectory Reasoning; OBJ = Object Trajectory Reasoning; LOC = Object Localization; PTH = Path Planning; DIR = Relative Direction. BBox scores are scaled by 100 when computing the overall average. We refer readers to the original papers for model details.
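As a reading aid, the sketch below encodes the legend above as a category-to-skill mapping and shows one way to aggregate per-skill scores into an overall average with the stated BBox-times-100 scaling. The weighting scheme (here, by per-skill sample count) is our assumption, not necessarily the paper's exact formula.

```python
# Category -> fine-grained skill abbreviations, following the legend above.
# The 6th category, Long-horizon, is scored per-episode and is omitted here.
BEAR_SKILLS = {
    "Pointing":             ["GEN", "SPA", "PRT"],
    "Bounding Box":         ["GEN", "SPA", "PRT"],
    "Task Planning":        ["PRG", "PRD"],
    "Trajectory Reasoning": ["GPR", "HND", "OBJ"],
    "Spatial Reasoning":    ["LOC", "PTH", "DIR"],
}

def overall_average(scores, counts):
    """Combine per-skill scores into one number (illustrative only).

    scores: {(category, skill): score}; bounding-box scores lie in [0, 1],
            all other skills are accuracies in [0, 100].
    counts: {(category, skill): number of samples}; weighting by sample
            count is an assumption, not necessarily the paper's scheme.
    """
    total, n = 0.0, 0
    for key, score in scores.items():
        category, _ = key
        if category == "Bounding Box":
            score *= 100  # scale BBox scores so every skill shares a 0-100 range
        total += score * counts[key]
        n += counts[key]
    return total / n
```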

| Model | Format | Pointing GEN | Pointing SPA | Pointing PRT | BBox GEN | BBox SPA | BBox PRT | Planning PRG | Planning PRD |
|---|---|---|---|---|---|---|---|---|---|
| Random Choice | - | - | - | - | - | - | - | 25 | 25 |
| Human Performance | - | 95.50 | 92.00 | 93.50 | 0.830 | 0.770 | 0.820 | 87.50 | 92.00 |
| Proprietary Models | | | | | | | | | |
| GPT-o3 | sequential | 59.12 | 44.44 | 55.41 | 0.348 | 0.278 | 0.313 | 57.67 | 55.33 |
| GPT-5 | sequential | 70.00 | 63.69 | 54.90 | 0.411 | 0.326 | 0.352 | 59.67 | 61.00 |
| Gemini-2.5-Pro | sequential | 55.00 | 42.48 | 55.41 | 0.144 | 0.103 | 0.177 | 52.00 | 49.00 |
| Gemini-2.5-Flash | sequential | 46.76 | 33.33 | 39.49 | 0.183 | 0.145 | 0.156 | 48.33 | 43.67 |
| Gemini-2.0-Flash | sequential | 51.76 | 34.97 | 40.13 | 0.270 | 0.167 | 0.224 | 38.67 | 40.00 |
| Open-source Models | | | | | | | | | |
| InternVL3-14B | merged | 37.94 | 27.78 | 32.80 | 0.304 | 0.258 | 0.276 | 41.00 | 33.00 |
| InternVL3-8B | merged | 52.65 | 42.48 | 43.95 | 0.369 | 0.275 | 0.297 | 43.00 | 33.67 |
| InternVL2-40B | merged | 23.24 | 21.24 | 22.29 | 0.329 | 0.269 | 0.268 | 40.00 | 33.67 |
| Qwen2.5-VL-32B | merged | 27.35 | 27.78 | 42.68 | 0.020 | 0.018 | 0.017 | 42.67 | 42.33 |
| InternVL2-26B | merged | 21.18 | 15.36 | 18.79 | 0.201 | 0.202 | 0.147 | 41.33 | 34.33 |
| Model | Trajectory GPR | Trajectory HND | Trajectory OBJ | Spatial LOC | Spatial PTH | Spatial DIR | Long-horizon | Avg |
|---|---|---|---|---|---|---|---|---|
| Random Choice | 25.0 | 25.0 | 25.0 | 25.0 | 28.0 | 25.0 | 25.0 | - |
| Human Performance | 96.5 | 94.0 | 89.0 | 94.5 | 83.5 | 88.5 | 92.5 | 89.4 |
| Proprietary Models | | | | | | | | |
| GPT-o3 | 67.0 | 68.4 | 53.7 | 70.4 | 49.3 | 49.7 | 34.3 | 47.6 |
| GPT-5 | 67.0 | 67.3 | 49.7 | 72.3 | 50.2 | 47.0 | 40.0 | 52.2 |
| Gemini-2.5-Pro | 66.7 | 66.0 | 48.3 | 64.5 | 40.1 | 44.0 | 31.4 | 41.5 |
| Gemini-2.5-Flash | 64.4 | 64.0 | 45.0 | 61.2 | 43.0 | 44.7 | 31.4 | 38.2 |
| Gemini-2.0-Flash | 61.5 | 59.6 | 31.3 | 54.1 | 33.8 | 39.7 | 25.7 | 36.0 |
| Open-source Models | | | | | | | | |
| InternVL3-14B | 51.3 | 49.5 | 31.4 | 43.0 | 28.0 | 21.3 | 28.6 | 33.9 |
| InternVL3-8B | 51.3 | 46.8 | 27.7 | 50.2 | 32.4 | 20.0 | 8.6 | 33.3 |
| InternVL2-40B | 57.7 | 41.8 | 28.0 | 40.4 | 29.5 | 18.7 | 11.4 | 28.4 |
| Qwen2.5-VL-32B | 55.5 | 52.2 | 26.7 | 47.2 | 26.6 | 22.7 | 20.0 | 28.3 |
| InternVL2-26B | 53.2 | 43.8 | 30.3 | 26.1 | 26.6 | 22.0 | 11.3 | 25.7 |

BEAR Agent Overview

We propose BEAR-Agent, a multimodal conversable agent that leverages visual tools to enhance MLLMs' embodied capabilities. Please refer to our paper for more details.
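To give a rough sense of what a conversable agent that calls visual tools can look like in code, below is a minimal, generic tool-calling loop. The message protocol, tool names, and dispatch logic are hypothetical illustrations; they are not BEAR-Agent's actual interface, which is described in the paper.

```python
# Minimal, hypothetical sketch of a conversable agent loop in which an MLLM
# may request visual tools (e.g., crop/zoom, draw a grid, mark points) before
# answering. This is NOT BEAR-Agent's implementation; names are illustrative.
import json

def run_agent(mllm, tools, images, question, max_turns=4):
    """mllm(messages) -> str; tools: {name: callable(images, **kwargs)}."""
    messages = [
        {"role": "system",
         "content": "You may call a tool by replying with JSON "
                    '{"tool": name, "args": {...}}, or answer directly.'},
        {"role": "user", "content": {"images": images, "text": question}},
    ]
    for _ in range(max_turns):
        reply = mllm(messages)
        try:
            call = json.loads(reply)           # is this a tool request?
        except json.JSONDecodeError:
            return reply                       # plain text: treat as final answer
        if not isinstance(call, dict) or "tool" not in call:
            return reply
        messages.append({"role": "assistant", "content": reply})
        name, args = call["tool"], call.get("args", {})
        if name not in tools:
            messages.append({"role": "user", "content": f"Unknown tool: {name}"})
            continue
        result = tools[name](images, **args)   # run the visual tool
        messages.append({"role": "user", "content": {"tool_result": result}})
    return mllm(messages)                      # force a final answer
```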