Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation

Junjin Xiao1* Dongyang Li1* Yandan Yang2 Shuang Zeng3 Tong Lin3 Xinyuan Chang2 Feng Xiong2 Mu Xu2 Xing Wei3 Zhiheng Ma4† Qing Zhang1† Wei-Shi Zheng1
1Sun Yat-sen University   2Amap, Alibaba Group
3Xi'an Jiaotong University   4Shenzhen University of Advanced Technology
*Equal Contribution, †Corresponding Authors

Abstract

This paper addresses the challenges of spatial perception and manipulation for Vision-Language-Action (VLA) models in complex environments. Although recent works attempt to enhance the perception ability of VLA models by injecting or distilling features from 3D foundation models, they achieve only limited improvements in spatial understanding, because the inherent depth ambiguity of monocular inputs leads to biased geometric predictions. Furthermore, existing action generation methods rely on indirect paradigms that predict high-dimensional noise or velocity. Regressing these unstructured targets imposes a significant optimization burden, which grows with action dimensionality and hinders the efficient learning of complex robotic policies.

To address these issues, we present a VLA framework with two novel designs. First, to tackle the depth ambiguity of monocular inputs, we leverage pre-trained multi-view diffusion models to synthesize novel views in latent space, obtaining enriched scene context with greatly reduced geometric uncertainty. To integrate these multi-view latent priors effectively, we present the Geometry-Guided Gated Transformer (G3T), which aligns multi-view latent features under the guidance of monocular 3D geometric priors and, through an adaptive gating mechanism, selectively aggregates informative views while suppressing noise from occluded regions. Second, to overcome these optimization inefficiencies, we introduce Action Manifold Learning (AML). Unlike traditional methods that decode abstract noise or velocity, AML shifts the prediction target to direct action estimation, allowing the policy to focus on learning the intrinsic structure of valid actions for more efficient and robust execution. Extensive evaluations on LIBERO, LIBERO-Plus, RoboTwin 2.0, and real-world robot experiments demonstrate that our method outperforms state-of-the-art baselines in both success rate and robustness.

Teaser Comparison
Figure 1. Methodology comparison. Existing VLA models either rely on expensive RGB-D sensors for explicit 3D input (a) or suffer from severe depth ambiguity under monocular settings (b). In contrast, our method leverages a multi-view diffusion prior and the Geometry-Guided Gated Transformer ($\text{G}^3\text{T}$) to synthesize robust geometric features from a single RGB image, resolving the depth ambiguity without extra hardware (c). As shown in (d), our method demonstrates stability against disturbances on LIBERO-Spatial tasks. Dark bars: success rate under perturbation; light bars: original results. Our approach exhibits minimal degradation (8.0 points) compared to baselines.

Key Insights

Spatial Perception Bottleneck

Current VLA models struggle with monocular depth ambiguity. Injecting features from 3D foundation models often introduces noise due to the ill-posed nature of single-view geometry recovery. We argue that synthesizing multi-view latent priors provides complementary geometric cues that resolve this ambiguity without extra hardware.

Action Manifold Hypothesis

Traditional diffusion policies predict abstract noise ($\epsilon$-prediction), which is high-dimensional and unstructured. We posit that meaningful robot actions reside on a low-dimensional Action Manifold. By directly predicting clean actions on this manifold (AML), we eliminate unnecessary stochasticity and ensure more efficient model optimization.
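As a concrete illustration, the sketch below contrasts the usual $\epsilon$- and velocity-prediction targets with a clean-action target under a simple linear noising path. This is a minimal PyTorch example with assumed tensor shapes, a generic `policy` callable, and a `prediction_mode` switch that are all illustrative, not the paper's exact training code.

```python
import torch

def diffusion_step_loss(actions, policy, prediction_mode="clean"):
    """One hypothetical training step under a linear noising path.

    actions: (B, H, D) ground-truth action chunk (H steps, D dims).
    policy:  any network mapping (noisy_actions, t) -> (B, H, D).
    Shapes, the noising path, and `prediction_mode` are illustrative
    assumptions, not the paper's exact formulation.
    """
    B = actions.shape[0]
    t = torch.rand(B, 1, 1)                    # diffusion/flow time in [0, 1]
    noise = torch.randn_like(actions)          # off-manifold Gaussian sample
    noisy = (1 - t) * noise + t * actions      # point on the interpolation path

    pred = policy(noisy, t)

    if prediction_mode == "epsilon":           # conventional: regress raw noise
        target = noise
    elif prediction_mode == "velocity":        # flow matching: regress velocity
        target = actions - noise
    else:                                      # AML-style: regress clean actions,
        target = actions                       # the only target lying on the
                                               # low-dimensional action manifold
    return ((pred - target) ** 2).mean()
```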

Methodology

Overall Framework
Figure 2. Overview of our method. The framework decouples semantic perception (Qwen3-VL) from geometric awareness. The Geometry Module combines VGGT monocular priors with synthesized multi-view latents via $\text{G}^3\text{T}$. The fused features guide the Action Manifold Learning (AML) expert to predict stable action chunks.

Geometry-Guided Gated Transformer ($\text{G}^3\text{T}$)

To robustly integrate spatial cues, $\text{G}^3\text{T}$ aligns monocular geometric priors (from VGGT) with synthesized multi-view latent features. It employs an adaptive gating mechanism to dynamically weigh the reliability of each view, effectively filtering out occlusions and resolving depth ambiguities.
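To make the gating idea concrete, here is a minimal PyTorch sketch of geometry-guided gated fusion. It is an illustrative approximation rather than the released $\text{G}^3\text{T}$ implementation: the dimensions, the per-token sigmoid gate, and the sequential per-view loop are all simplifying assumptions.

```python
import torch
import torch.nn as nn

class GatedViewFusion(nn.Module):
    """Illustrative gated fusion of monocular priors with multi-view latents.

    Not the actual G3T implementation: dimensions, the per-token sigmoid
    gate, and the sequential per-view loop are simplifying assumptions.
    """

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, geo_tokens, view_tokens):
        # geo_tokens:  (B, N, C) monocular geometric priors (e.g. VGGT tokens)
        # view_tokens: (B, V, N, C) latents from V synthesized novel views
        V = view_tokens.shape[1]
        fused = geo_tokens
        for v in range(V):
            # geometry tokens query the v-th synthesized view
            ctx, _ = self.attn(fused, view_tokens[:, v], view_tokens[:, v])
            # gate estimates per-token reliability of this view's contribution
            g = self.gate(torch.cat([fused, ctx], dim=-1))
            fused = self.norm(fused + g * ctx)  # gated residual update
        return fused

# usage: fuse 3 synthesized views into 196 geometry tokens
# fusion = GatedViewFusion(dim=512)
# out = fusion(torch.randn(2, 196, 512), torch.randn(2, 3, 196, 512))
```

The gate lets unreliable views (e.g. heavily occluded synthesized viewpoints) contribute little to the fused representation while informative views pass through, which is the behavior described above.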

G3T Module
Figure 3. $\text{G}^3\text{T}$ Architecture. Fusing monocular spatial tokens and synthesized multi-view tokens to produce robust, occlusion-aware spatial representations.

Action Manifold Learning (AML)

Unlike traditional methods that predict noise, AML uses a Diffusion Transformer (DiT) to directly estimate clean action chunks $\hat{A}_t$ on a low-dimensional manifold. We optimize using a velocity-consistent loss, which preserves the stability of flow matching while focusing the model's capacity on learning intrinsic action semantics.
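The sketch below shows one way a clean-action prediction can still be trained with a velocity-consistent objective, assuming a linear flow-matching path and an illustrative `dit(noisy, t, cond)` interface; the paper's exact loss weighting and schedule may differ.

```python
import torch

def velocity_consistent_loss(dit, actions, cond, eps=1e-3):
    """Hypothetical AML-style step: the DiT predicts the clean chunk directly,
    but the loss is taken in velocity space, assuming a linear flow path
    x_t = (1 - t) * noise + t * actions. The interface and weighting are
    illustrative assumptions, not the paper's exact objective."""
    B = actions.shape[0]
    t = torch.rand(B, 1, 1) * (1 - eps)        # keep 1 - t away from zero
    noise = torch.randn_like(actions)
    x_t = (1 - t) * noise + t * actions        # noisy action chunk

    a_hat = dit(x_t, t, cond)                  # direct clean-action prediction
    v_hat = (a_hat - x_t) / (1 - t)            # velocity implied by a_hat
    v_target = actions - noise                 # ground-truth flow velocity
    return ((v_hat - v_target) ** 2).mean()
```

Under this linear path the objective is algebraically a time-weighted regression on the predicted clean chunk, which is one way a direct action target can retain flow-matching-style stability.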

Action Manifold Hypothesis
Figure 4. Action Manifold Hypothesis. Meaningful action sequences reside on a low-dimensional manifold. Conventional noise/velocity targets are off-manifold and high-dimensional.

Experimental Results

Table 1. Evaluation results on LIBERO benchmark.
Method Spatial Object Goal Long Average
Diffusion Policy 78.5 87.5 73.5 64.8 76.1
OpenVLA 84.7 88.4 79.2 53.7 76.5
SpatialVLA 88.2 89.9 78.6 55.5 78.1
CoT-VLA 87.5 91.6 87.6 69.0 83.9
$\pi_0$-Fast 96.4 96.8 88.6 60.2 85.5
GR00T-N1 94.4 97.6 93.0 90.6 93.9
$\pi_0$ 98.0 96.8 94.4 88.4 94.4
F1 98.2 97.8 95.4 91.3 95.7
InternVLA-M1 98.0 99.0 93.8 92.6 95.9
Discrete Diffusion VLA 97.2 98.6 97.4 92.0 96.3
$\pi_{0.5}$ 98.8 98.2 98.0 92.4 96.9
GR00T-N1.6 97.7 98.5 97.5 94.4 97.0
OpenVLA-OFT 97.6 98.4 97.9 94.5 97.1
UniVLA 96.5 96.8 95.6 92.0 95.2
X-VLA 98.2 98.6 97.8 97.6 98.1
GeoVLA 98.4 99.0 96.6 96.6 97.7
3D-CAVLA 98.2 99.8 98.2 96.1 98.1
Spatial Forcing 99.4 99.6 98.8 96.0 98.5
Ours 98.8 99.8 99.0 96.6 98.6
Table 2. Zero-shot performance on LIBERO-Plus.
Method Camera Robot Language Light Background Noise Layout Total
OpenVLA 0.8 3.5 23.0 8.1 34.8 15.2 28.5 15.6
OpenVLA-OFT 56.4 31.9 79.5 88.7 93.3 75.8 74.2 69.6
OpenVLA-OFT_w 10.4 38.7 70.5 76.8 93.6 49.9 69.9 55.8
OpenVLA-OFT_m 55.6 21.7 81.0 92.7 91.0 78.6 68.7 67.9
NORA 2.2 37.0 65.1 45.7 58.6 12.8 62.1 39.0
WorldVLA 0.1 27.9 41.6 43.7 17.1 10.9 38.0 25.0
UniVLA 1.8 46.2 69.6 69.0 81.0 21.2 31.9 42.9
$\pi_0$ 13.8 6.0 58.8 85.0 81.4 79.0 68.9 53.6
$\pi_0$-Fast 65.1 21.6 61.0 73.2 73.2 74.4 68.8 61.6
RIPT-VLA 55.2 31.2 77.6 88.4 91.6 73.5 74.2 68.4
MergeVLA 50.7 30.3 66.0 84.2 85.7 66.0 68.1 62.5
UnifoLM-VLA-0 56.7 69.5 91.2 93.6 95.3 77.9 77.9 78.9
Ours 89.6 60.1 86.9 98.0 95.7 97.2 78.2 85.7
Table 3. Evaluation results on RoboTwin 2.0 Benchmark.
Simulation Task  $\pi_{0.5}$ (Clean / Rand.)  X-VLA (Clean / Rand.)  Ours (Clean / Rand.)
Place Dual Shoes  12 / 7  79 / 88  88 / 86
Move Stapler Pad  16 / 18  78 / 73  61 / 60
Stack Blocks Two  48 / 56  92 / 87  98 / 99
Scan Object  42 / 38  14 / 36  91 / 87
Place Object Stand  74 / 65  86 / 88  90 / 87
Place Fan  25 / 36  80 / 75  90 / 94
Move Pillbottle Pad  33 / 29  73 / 71  96 / 96
Pick Dual Bottles  10 / 6  47 / 36  88 / 76
Blocks Ranking Rgb  43 / 35  83 / 83  95 / 98
... (50 tasks in total) ...
Turn Switch  5 / 6  40 / 61  55 / 72
Pick Diverse Bottles  5 / 3  58 / 36  80 / 74
Place Bread Basket  48 / 56  81 / 71  86 / 83
Stack Blocks Three  15 / 16  6 / 10  91 / 84
Put Bottles Dustbin  12 / 9  74 / 77  62 / 87
Place Can Basket  19 / 25  49 / 52  82 / 73
Stamp Seal  36 / 23  76 / 82  81 / 88
Handover Block  18 / 19  73 / 37  87 / 89
Stack Bowls Three  33 / 35  76 / 86  81 / 92
Place Object Basket  43 / 36  44 / 39  85 / 93
Open Microwave  35 / 37  79 / 71  93 / 90
Average  42.98 / 43.84  72.80 / 72.84  85.18 / 86.06

Real-World Experiments

Demonstrating robust manipulation capabilities in diverse physical scenarios with smooth trajectory execution.

Insert pink cube into red cup.
Place cylinder on green block.
Place yellow cup on red block.
Stack blue block on red block.

Zero-shot Generalization

Testing on unseen tasks and objects to validate generalization ability.

Insert pink cube into blue cup.
Place cylinder on red block.
Place orange cup on red block.
Stack red block on blue block.
Stack red block on green block.

Citation

@article{xiao2026learning,
  title={Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation},
  author={Junjin Xiao and Dongyang Li and Yandan Yang and Shuang Zeng and Tong Lin and Xinyuan Chang and Feng Xiong and Mu Xu and Xing Wei and Zhiheng Ma and Qing Zhang and Wei-Shi Zheng},
  journal={arXiv preprint arXiv:2605.11832},
  year={2026}
}