Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

Junjin Xiao1* Dongyang Li1* Yandan Yang2 Shuang Zeng3 Tong Lin3 Xinyuan Chang2 Feng Xiong2 Mu Xu2 Xing Wei3 Zhiheng Ma4† Qing Zhang1† Wei-Shi Zheng1
1Sun Yat-sen University   2Amap, Alibaba Group
3Xi'an Jiaotong University   4Shenzhen University of Advanced Technology
*Equal Contribution, †Corresponding Authors

Abstract

This paper addresses the challenges of spatial perception and manipulation for Vision-Language-Action (VLA) models in complex environments. Although recent works attempt to enhance the perception ability of VLA models by injecting or distilling features from 3D foundation models, they achieve only limited improvements in spatial understanding, because the inherent depth ambiguity of monocular inputs leads to biased geometric predictions. Furthermore, existing action generation methods rely on indirect paradigms that predict high-dimensional noise or velocity. Regressing these unstructured targets imposes a significant optimization burden, which grows with action dimensionality and hinders the efficient learning of complex robotic policies.

To address these issues, we present a VLA framework with two novel designs. First, to tackle the depth ambiguity of monocular inputs, we leverage pre-trained multi-view diffusion models to synthesize novel views in latent space, obtaining enriched scene context with greatly reduced geometric uncertainty. To integrate these multi-view latent priors effectively, we present the Geometry-Guided Gated Transformer (G3T), which aligns multi-view latent features under the guidance of monocular 3D geometric priors and, through an adaptive gating mechanism, selectively aggregates informative views while suppressing noise from occluded regions. Second, to overcome these optimization inefficiencies, we introduce Action Manifold Learning (AML). Unlike traditional methods that decode abstract noise or velocity, AML shifts the prediction target to direct action estimation, allowing the policy to focus on learning the intrinsic structure of valid actions for more efficient and robust execution. Extensive evaluations on LIBERO, LIBERO-Plus, RoboTwin 2.0, and real-world robot experiments demonstrate that our method outperforms state-of-the-art baselines in both success rate and robustness.

Teaser Comparison
Figure 1. Methodology comparison. Existing VLA models either rely on expensive RGB-D sensors for explicit 3D input (a) or suffer from severe depth ambiguity under monocular settings (b). In contrast, our method leverages a multi-view diffusion prior and the Geometry-Guided Gated Transformer ($\text{G}^3\text{T}$) to synthesize robust geometric features from a single RGB image, resolving the depth ambiguity without extra hardware (c). As shown in (d), our method demonstrates stability against disturbances on LIBERO-Spatial tasks. Dark bars: success rate under perturbation; light bars: original results. Our approach exhibits minimal degradation (8.0 points) compared to baselines.

Key Insights

Spatial Perception Bottleneck

Current VLA models struggle with monocular depth ambiguity. Injecting features from 3D foundation models often introduces noise due to the ill-posed nature of single-view geometry recovery. We argue that synthesizing multi-view latent priors provides complementary geometric cues that resolve this ambiguity without extra hardware.

Action Manifold Hypothesis

Traditional diffusion policies predict abstract noise ($\epsilon$-prediction), which is high-dimensional and unstructured. We posit that meaningful robot actions reside on a low-dimensional Action Manifold. By directly predicting clean actions on this manifold (AML), we eliminate unnecessary stochasticity and ensure more efficient model optimization.
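As a concrete illustration, the sketch below contrasts the usual $\epsilon$- and velocity-prediction targets with a clean-action target under a simple linear noising path. This is a minimal PyTorch example with assumed tensor shapes, a generic `policy` callable, and a `prediction_mode` switch that are all illustrative, not the paper's exact training code.

```python
import torch

def diffusion_step_loss(actions, policy, prediction_mode="clean"):
    """One hypothetical training step under a linear noising path.

    actions: (B, H, D) ground-truth action chunk (H steps, D dims).
    policy:  any network mapping (noisy_actions, t) -> (B, H, D).
    Shapes, the noising path, and `prediction_mode` are illustrative
    assumptions, not the paper's exact formulation.
    """
    B = actions.shape[0]
    t = torch.rand(B, 1, 1)                    # diffusion/flow time in [0, 1]
    noise = torch.randn_like(actions)          # off-manifold Gaussian sample
    noisy = (1 - t) * noise + t * actions      # point on the interpolation path

    pred = policy(noisy, t)

    if prediction_mode == "epsilon":           # conventional: regress raw noise
        target = noise
    elif prediction_mode == "velocity":        # flow matching: regress velocity
        target = actions - noise
    else:                                      # AML-style: regress clean actions,
        target = actions                       # the only target lying on the
                                               # low-dimensional action manifold
    return ((pred - target) ** 2).mean()
```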

Methodology

Overall Framework
Figure 2. Overview of our method. The framework decouples semantic perception (Qwen3-VL) from geometric awareness. The Geometry Module combines VGGT monocular priors with synthesized multi-view latents via $\text{G}^3\text{T}$. The fused features guide the Action Manifold Learning (AML) expert to predict stable action chunks.

Geometry-Guided Gated Transformer ($\text{G}^3\text{T}$)

To robustly integrate spatial cues, $\text{G}^3\text{T}$ aligns monocular geometric priors (from VGGT) with synthesized multi-view latent features. It employs an adaptive gating mechanism to dynamically weigh the reliability of each view, effectively filtering out occlusions and resolving depth ambiguities.
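To make the gating idea concrete, here is a minimal PyTorch sketch of geometry-guided gated fusion. It is an illustrative approximation rather than the released $\text{G}^3\text{T}$ implementation: the dimensions, the per-token sigmoid gate, and the sequential per-view loop are all simplifying assumptions.

```python
import torch
import torch.nn as nn

class GatedViewFusion(nn.Module):
    """Illustrative gated fusion of monocular priors with multi-view latents.

    Not the actual G3T implementation: dimensions, the per-token sigmoid
    gate, and the sequential per-view loop are simplifying assumptions.
    """

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, geo_tokens, view_tokens):
        # geo_tokens:  (B, N, C) monocular geometric priors (e.g. VGGT tokens)
        # view_tokens: (B, V, N, C) latents from V synthesized novel views
        V = view_tokens.shape[1]
        fused = geo_tokens
        for v in range(V):
            # geometry tokens query the v-th synthesized view
            ctx, _ = self.attn(fused, view_tokens[:, v], view_tokens[:, v])
            # gate estimates per-token reliability of this view's contribution
            g = self.gate(torch.cat([fused, ctx], dim=-1))
            fused = self.norm(fused + g * ctx)  # gated residual update
        return fused

# usage: fuse 3 synthesized views into 196 geometry tokens
# fusion = GatedViewFusion(dim=512)
# out = fusion(torch.randn(2, 196, 512), torch.randn(2, 3, 196, 512))
```

The gate lets unreliable views (e.g. heavily occluded synthesized viewpoints) contribute little to the fused representation while informative views pass through, which is the behavior described above.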

G3T Module
Figure 3. $\text{G}^3\text{T}$ Architecture. Fusing monocular spatial tokens and synthesized multi-view tokens to produce robust, occlusion-aware spatial representations.

Action Manifold Learning (AML)

Unlike traditional methods that predict noise, AML uses a Diffusion Transformer (DiT) to directly estimate clean action chunks $\hat{A}_t$ on a low-dimensional manifold. We optimize using a velocity-consistent loss, which preserves the stability of flow matching while focusing the model's capacity on learning intrinsic action semantics.
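The sketch below shows one way a clean-action prediction can still be trained with a velocity-consistent objective, assuming a linear flow-matching path and an illustrative `dit(noisy, t, cond)` interface; the paper's exact loss weighting and schedule may differ.

```python
import torch

def velocity_consistent_loss(dit, actions, cond, eps=1e-3):
    """Hypothetical AML-style step: the DiT predicts the clean chunk directly,
    but the loss is taken in velocity space, assuming a linear flow path
    x_t = (1 - t) * noise + t * actions. The interface and weighting are
    illustrative assumptions, not the paper's exact objective."""
    B = actions.shape[0]
    t = torch.rand(B, 1, 1) * (1 - eps)        # keep 1 - t away from zero
    noise = torch.randn_like(actions)
    x_t = (1 - t) * noise + t * actions        # noisy action chunk

    a_hat = dit(x_t, t, cond)                  # direct clean-action prediction
    v_hat = (a_hat - x_t) / (1 - t)            # velocity implied by a_hat
    v_target = actions - noise                 # ground-truth flow velocity
    return ((v_hat - v_target) ** 2).mean()
```

Under this linear path the objective is algebraically a time-weighted regression on the predicted clean chunk, which is one way a direct action target can retain flow-matching-style stability.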

Action Manifold Hypothesis
Figure 4. Action Manifold Hypothesis. Meaningful action sequences reside on a low-dimensional manifold. Conventional noise/velocity targets are off-manifold and high-dimensional.

Experimental Results

Table 1. Evaluation results on LIBERO benchmark.
Method Spatial Object Goal Long Average
Diffusion Policy 78.5 87.5 73.5 64.8 76.1
OpenVLA 84.7 88.4 79.2 53.7 76.5
SpatialVLA 88.2 89.9 78.6 55.5 78.1
CoT-VLA 87.5 91.6 87.6 69.0 83.9
$\pi_0$-Fast 96.4 96.8 88.6 60.2 85.5
GR00T-N1 94.4 97.6 93.0 90.6 93.9
$\pi_0$ 98.0 96.8 94.4 88.4 94.4
F1 98.2 97.8 95.4 91.3 95.7
InternVLA-M1 98.0 99.0 93.8 92.6 95.9
Discrete Diffusion VLA 97.2 98.6 97.4 92.0 96.3
$\pi_{0.5}$ 98.8 98.2 98.0 92.4 96.9
GR00T-N1.6 97.7 98.5 97.5 94.4 97.0
OpenVLA-OFT 97.6 98.4 97.9 94.5 97.1
UniVLA 96.5 96.8 95.6 92.0 95.2
X-VLA 98.2 98.6 97.8 97.6 98.1
GeoVLA 98.4 99.0 96.6 96.6 97.7
3D-CAVLA 98.2 99.8 98.2 96.1 98.1
Spatial Forcing 99.4 99.6 98.8 96.0 98.5
Ours 98.8 99.8 99.0 96.6 98.6
Table 2. Zero-shot performance on LIBERO-Plus.
Method Camera Robot Language Light Background Noise Layout Total
OpenVLA 0.8 3.5 23.0 8.1 34.8 15.2 28.5 15.6
OpenVLA-OFT 56.4 31.9 79.5 88.7 93.3 75.8 74.2 69.6
OpenVLA-OFT_w 10.4 38.7 70.5 76.8 93.6 49.9 69.9 55.8
OpenVLA-OFT_m 55.6 21.7 81.0 92.7 91.0 78.6 68.7 67.9
NORA 2.2 37.0 65.1 45.7 58.6 12.8 62.1 39.0
WorldVLA 0.1 27.9 41.6 43.7 17.1 10.9 38.0 25.0
UniVLA 1.8 46.2 69.6 69.0 81.0 21.2 31.9 42.9
$\pi_0$ 13.8 6.0 58.8 85.0 81.4 79.0 68.9 53.6
$\pi_0$-Fast 65.1 21.6 61.0 73.2 73.2 74.4 68.8 61.6
RIPT-VLA 55.2 31.2 77.6 88.4 91.6 73.5 74.2 68.4
MergeVLA 50.7 30.3 66.0 84.2 85.7 66.0 68.1 62.5
UnifoLM-VLA-0 56.7 69.5 91.2 93.6 95.3 77.9 77.9 78.9
Ours 89.6 60.1 86.9 98.0 95.7 97.2 78.2 85.7
Table 3. Evaluation results on RoboTwin 2.0 Benchmark.
Simulation Task  $\pi_{0.5}$ (Clean / Rand.)  X-VLA (Clean / Rand.)  Ours (Clean / Rand.)
Place Dual Shoes  12 / 7  79 / 88  88 / 86
Move Stapler Pad  16 / 18  78 / 73  61 / 60
Stack Blocks Two  48 / 56  92 / 87  98 / 99
Scan Object  42 / 38  14 / 36  91 / 87
Place Object Stand  74 / 65  86 / 88  90 / 87
Place Fan  25 / 36  80 / 75  90 / 94
Move Pillbottle Pad  33 / 29  73 / 71  96 / 96
Pick Dual Bottles  10 / 6  47 / 36  88 / 76
Blocks Ranking Rgb  43 / 35  83 / 83  95 / 98
... (50 tasks in total) ...
Turn Switch  5 / 6  40 / 61  55 / 72
Pick Diverse Bottles  5 / 3  58 / 36  80 / 74
Place Bread Basket  48 / 56  81 / 71  86 / 83
Stack Blocks Three  15 / 16  6 / 10  91 / 84
Put Bottles Dustbin  12 / 9  74 / 77  62 / 87
Place Can Basket  19 / 25  49 / 52  82 / 73
Stamp Seal  36 / 23  76 / 82  81 / 88
Handover Block  18 / 19  73 / 37  87 / 89
Stack Bowls Three  33 / 35  76 / 86  81 / 92
Place Object Basket  43 / 36  44 / 39  85 / 93
Open Microwave  35 / 37  79 / 71  93 / 90
Average  42.98 / 43.84  72.80 / 72.84  85.18 / 86.06

Real-World Experiments

Demonstrating robust manipulation capabilities in diverse physical scenarios with smooth trajectory execution.

Insert pink cube into red cup.
Place cylinder on green block.
Place yellow cup on red block.
Stack blue block on red block.

Zero-shot Generalization

Testing on unseen tasks and objects to validate generalization ability.

Insert pink cube into blue cup.
Place cylinder on red block.
Place orange cup on red block.
Stack red block on blue block.
Stack red block on green block.

Citation

@article{xiao2026learning,
  title={Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation},
  author={Junjin Xiao and Dongyang Li and Yandan Yang and Shuang Zeng and Tong Lin and Xinyuan Chang and Feng Xiong and Mu Xu and Xing Wei and Zhiheng Ma and Qing Zhang and Wei-Shi Zheng},
  journal={arXiv preprint arXiv:2605.11832},
  year={2026}
}