Embodied agents require a range of perceptual and reasoning skills, ranging from low-level sensing to high-level planning. Recent work highlights the potential of Multimodal Large Language Models (MLLMs) as embodied agents, yet a holistic benchmark that evaluates step-wise embodied skills remains absent.
To bridge this gap, we introduce BEAR, a benchmark of 4,469 interleaved image–video–text VQA samples spanning perception to planning. Our systematic evaluation of 20 representative MLLMs and failure analysis reveal that current MLLMs are limited by a lack of omni-visual and 3D spatial abilities.
Motivated by this analysis, we introduce BEAR-Agent, a multimodal conversable agent that leverages visual tools to enhance MLLMs' embodied capabilities. BEAR-Agent significantly boosts the performance of InternVL3-14B and GPT-5 on BEAR. Moreover, our tabletop manipulation experiments demonstrate its potential as a step toward general embodied agents.
What are the basic capabilities an embodied agent needs? To answer this question, we inductively derive a skill taxonomy from large-scale embodied household activity datasets such as BEHAVIOR-1K and from everyday human activities. BEAR comprises five basic categories: Pointing, Bounding Box, Trajectory Reasoning, Spatial Reasoning, and Task Planning. A sixth category, Long-horizon, features episodes collected from simulation. In total, BEAR covers 14 fine-grained, step-wise embodied skills across these six categories.
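Concretely, the taxonomy can be read as a mapping from categories to fine-grained skills. The sketch below is a minimal illustration using the skill abbreviations defined alongside the evaluation tables further down; the dictionary itself is a hypothetical representation, not code released with the benchmark.

```python
# Illustrative sketch of the BEAR skill taxonomy (hypothetical data structure,
# not the benchmark's actual code). Skill abbreviations follow the legend used
# with the evaluation tables below.
BEAR_TAXONOMY = {
    "Pointing":             ["GEN", "SPA", "PRT"],  # general object, spatial object, semantic part
    "Bounding Box":         ["GEN", "SPA", "PRT"],
    "Task Planning":        ["PRG", "PRD"],          # process reasoning, next-action prediction
    "Trajectory Reasoning": ["GPR", "HND", "OBJ"],   # gripper, human hand, object trajectories
    "Spatial Reasoning":    ["LOC", "PTH", "DIR"],   # localization, path planning, relative direction
    "Long-horizon":         [],                      # simulation episodes composed of the skills above
}

# 14 fine-grained skills across the five basic categories.
assert sum(len(skills) for skills in BEAR_TAXONOMY.values()) == 14
```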
The Long-horizon category decomposes embodied episodes into skill-oriented steps for offline evaluation. This category includes 35 episodes from AI2-THOR, each annotated with structured skill steps. As shown in the image, "put the apple in the sink" is broken down into planning, object search, navigation, spatial reasoning, perception, and placement. Each step maps to an atomic skill in BEAR, demonstrating that our taxonomy is practically grounded in embodied tasks.
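To make the annotation structure concrete, here is a minimal sketch of one such episode; the field names and the step-to-skill mapping shown are illustrative assumptions, not BEAR's actual schema.

```python
# Illustrative sketch of a Long-horizon episode annotation (field names and the
# step-to-skill mapping are hypothetical, not BEAR's actual schema).
episode = {
    "task": "put the apple in the sink",
    "source": "AI2-THOR",
    "steps": [
        {"step": "planning",          "skill": "Task Planning"},
        {"step": "object search",     "skill": "Pointing / Bounding Box"},
        {"step": "navigation",        "skill": "Spatial Reasoning"},
        {"step": "spatial reasoning", "skill": "Spatial Reasoning"},
        {"step": "perception",        "skill": "Pointing / Bounding Box"},
        {"step": "placement",         "skill": "Trajectory Reasoning"},
    ],
}
```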
We evaluate 20 representative MLLMs, including 12 open-source models and 8 proprietary models. Abbreviations: GEN = General Object (Pointing/Box); SPA = Spatial Object (Pointing/Box); PRT = Semantic Part (Pointing/Box); PRG = Task Process Reasoning; PRD = Next Action Prediction; GPR = Gripper Trajectory Reasoning; HND = Human Hand Trajectory Reasoning; OBJ = Object Trajectory Reasoning; LOC = Object Localization; PTH = Path Planning; DIR = Relative Direction. Bounding-box (BBox) scores are reported on a 0-1 scale and are scaled by 100 when computing the overall average (see the sketch after the second table). We refer readers to the original papers for model details.
Model | Format | Pointing GEN | Pointing SPA | Pointing PRT | BBox GEN | BBox SPA | BBox PRT | Planning PRG | Planning PRD |
---|---|---|---|---|---|---|---|---|---|
Random Choice | - | - | - | - | - | - | - | 25 | 25 |
Human Performance | - | 95.50 | 92.00 | 93.50 | 0.830 | 0.770 | 0.820 | 87.50 | 92.00 |
Proprietary Models | | | | | | | | | |
GPT-o3 | sequential | 59.12 | 44.44 | 55.41 | 0.348 | 0.278 | 0.313 | 57.67 | 55.33 |
GPT-5 | sequential | 70.00 | 63.69 | 54.90 | 0.411 | 0.326 | 0.352 | 59.67 | 61.00 |
Gemini-2.5-Pro | sequential | 55.00 | 42.48 | 55.41 | 0.144 | 0.103 | 0.177 | 52.00 | 49.00 |
Gemini-2.5-Flash | sequential | 46.76 | 33.33 | 39.49 | 0.183 | 0.145 | 0.156 | 48.33 | 43.67 |
Gemini-2.0-Flash | sequential | 51.76 | 34.97 | 40.13 | 0.270 | 0.167 | 0.224 | 38.67 | 40.00 |
Open-source Models | | | | | | | | | |
InternVL3-14B | merged | 37.94 | 27.78 | 32.80 | 0.304 | 0.258 | 0.276 | 41.00 | 33.00 |
InternVL3-8B | merged | 52.65 | 42.48 | 43.95 | 0.369 | 0.275 | 0.297 | 43.00 | 33.67 |
InternVL2-40B | merged | 23.24 | 21.24 | 22.29 | 0.329 | 0.269 | 0.268 | 40.00 | 33.67 |
Qwen2.5-VL-32B | merged | 27.35 | 27.78 | 42.68 | 0.020 | 0.018 | 0.017 | 42.67 | 42.33 |
InternVL2-26B | merged | 21.18 | 15.36 | 18.79 | 0.201 | 0.202 | 0.147 | 41.33 | 34.33 |
Model | Trajectory GPR | Trajectory HND | Trajectory OBJ | Spatial LOC | Spatial PTH | Spatial DIR | Long-horizon | Avg |
---|---|---|---|---|---|---|---|---|
Random Choice | 25.0 | 25.0 | 25.0 | 25.0 | 28.0 | 25.0 | 25.0 | - |
Human Performance | 96.5 | 94.0 | 89.0 | 94.5 | 83.5 | 88.5 | 92.5 | 89.4 |
Proprietary Models | | | | | | | | |
GPT-o3 | 67.0 | 68.4 | 53.7 | 70.4 | 49.3 | 49.7 | 34.3 | 47.6 |
GPT-5 | 67.0 | 67.3 | 49.7 | 72.3 | 50.2 | 47.0 | 40.0 | 52.2 |
Gemini-2.5-Pro | 66.7 | 66.0 | 48.3 | 64.5 | 40.1 | 44.0 | 31.4 | 41.5 |
Gemini-2.5-Flash | 64.4 | 64.0 | 45.0 | 61.2 | 43.0 | 44.7 | 31.4 | 38.2 |
Gemini-2.0-Flash | 61.5 | 59.6 | 31.3 | 54.1 | 33.8 | 39.7 | 25.7 | 36.0 |
Open-source Models | | | | | | | | |
InternVL3-14B | 51.3 | 49.5 | 31.4 | 43.0 | 28.0 | 21.3 | 28.6 | 33.9 |
InternVL3-8B | 51.3 | 46.8 | 27.7 | 50.2 | 32.4 | 20.0 | 8.6 | 33.3 |
InternVL2-40B | 57.7 | 41.8 | 28.0 | 40.4 | 29.5 | 18.7 | 11.4 | 28.4 |
Qwen2.5-VL-32B | 55.5 | 52.2 | 26.7 | 47.2 | 26.6 | 22.7 | 20.0 | 28.3 |
InternVL2-26B | 53.2 | 43.8 | 30.3 | 26.1 | 26.6 | 22.0 | 11.3 | 25.7 |
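Because bounding-box metrics are reported on a 0-1 scale while the other skills are reported as percentages, the BBox columns are multiplied by 100 before averaging. Below is a minimal sketch of that aggregation, assuming a uniform per-skill mean; the benchmark's exact weighting (for example, by sample count per skill) may differ.

```python
def overall_average(skill_scores: dict[str, float], bbox_skills: set[str]) -> float:
    """Average per-skill scores, scaling 0-1 bounding-box metrics by 100.

    Assumes equal weight per skill; the benchmark's exact aggregation may differ.
    """
    rescaled = [
        score * 100 if skill in bbox_skills else score
        for skill, score in skill_scores.items()
    ]
    return sum(rescaled) / len(rescaled)
```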
We propose BEAR-Agent, a multimodal conversable agent that leverages visual tools to enhance embodied capabilities. Please refer to our paper for more details.
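At a high level, BEAR-Agent lets the backbone MLLM converse with visual tools before committing to an answer. The sketch below is a hypothetical illustration of such a loop; the tool-call protocol, tool names, and helper functions are placeholders rather than the actual BEAR-Agent implementation.

```python
# Hypothetical sketch of a conversable agent that augments an MLLM with visual
# tools. Tool names and the text-based tool-call protocol are illustrative
# placeholders; see the paper for BEAR-Agent's actual design.
from typing import Callable

class VisualToolAgent:
    def __init__(self, mllm: Callable[[str, list], str], tools: dict[str, Callable]):
        self.mllm = mllm    # backbone MLLM, e.g. InternVL3-14B or GPT-5
        self.tools = tools  # e.g. {"detect_objects": ..., "estimate_depth": ...}

    def answer(self, question: str, images: list, max_turns: int = 4) -> str:
        context = [question]
        reply = ""
        for _ in range(max_turns):
            reply = self.mllm("\n".join(context), images)
            if reply.startswith("TOOL:"):           # model requests a visual tool
                tool_name, _, arg = reply[5:].partition(" ")
                result = self.tools[tool_name](images, arg)
                context.append(f"[{tool_name} output] {result}")
            else:                                   # model commits to a final answer
                return reply
        return reply
```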