This paper addresses the challenges of spatial perception and manipulation faced by Vision-Language-Action (VLA) models in complex environments. Although recent works attempt to enhance VLA perception by injecting or distilling features from 3D foundation models, they achieve only limited improvements in spatial understanding, because the inherent depth ambiguity of monocular inputs biases their geometric predictions. Furthermore, existing action generation methods rely on indirect paradigms that predict high-dimensional noise or velocity. Regressing these unstructured targets imposes a significant optimization burden that grows with action dimensionality, hindering the efficient learning of complex robotic policies.
To address these issues, we present a VLA framework with two novel designs. First, to tackle the depth ambiguity of monocular inputs, we leverage pre-trained multi-view diffusion models to synthesize novel views in latent space, obtaining enriched scene context with largely reduced geometric uncertainty. To integrate these multi-view latent priors effectively, we present the Geometry-Guided Gated Transformer ($\text{G}^3\text{T}$), which aligns multi-view latent features under the guidance of monocular 3D geometric priors and, through an adaptive gating mechanism, selectively aggregates informative views while suppressing noise from occluded regions. Second, to overcome the optimization inefficiencies, we introduce Action Manifold Learning (AML). Unlike traditional methods that decode abstract noise or velocity, AML shifts the prediction target to direct action estimation, enabling the policy to focus on learning the intrinsic structure of valid actions for more efficient and robust execution. Extensive evaluations on LIBERO, LIBERO-Plus, RoboTwin 2.0, and real-world robot experiments demonstrate that our method outperforms state-of-the-art baselines in both success rate and robustness.
Current VLA models struggle with monocular depth ambiguity. Injecting features from 3D foundation models often introduces noise due to the ill-posed nature of single-view geometry recovery. We argue that synthesizing multi-view latent priors provides complementary geometric cues that resolve this ambiguity without extra hardware.
Traditional diffusion policies predict abstract noise ($\epsilon$-prediction), which is high-dimensional and unstructured. We posit that meaningful robot actions reside on a low-dimensional Action Manifold. By directly predicting clean actions on this manifold (AML), we eliminate unnecessary stochasticity and ensure more efficient model optimization.
To robustly integrate spatial cues, $\text{G}^3\text{T}$ aligns monocular geometric priors (from VGGT) with synthesized multi-view latent features. It employs an adaptive gating mechanism to dynamically weigh the reliability of each view, effectively filtering out occlusions and resolving depth ambiguities.
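The adaptive gating step can be illustrated with a minimal sketch. The function below is a simplified stand-in for one $\text{G}^3\text{T}$ aggregation step, not the paper's implementation: it scores each synthesized-view latent against the monocular geometric prior through a single learned gate projection (`w_gate`, a hypothetical parameter here), squashes the scores with a sigmoid so unreliable views receive gates near zero, and returns the gate-weighted fusion. The real model operates on token sequences inside a transformer.

```python
import math

def gated_view_fusion(geo_prior, view_latents, w_gate):
    """Geometry-guided gated aggregation over K synthesized-view latents.

    geo_prior:    length-d monocular geometric prior feature
    view_latents: K lists of length d (latents of the synthesized views)
    w_gate:       length-2d gate weights (learned in the real model)
    Returns (fused length-d feature, K normalized gate values).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Score each view by projecting [view_latent ; geo_prior] through w_gate,
    # then apply a sigmoid so occluded/noisy views are gated toward zero.
    gates = [1.0 / (1.0 + math.exp(-dot(v + geo_prior, w_gate)))
             for v in view_latents]
    z = sum(gates) or 1.0
    gates = [g / z for g in gates]          # normalize across views
    d = len(geo_prior)
    fused = [sum(g * v[i] for g, v in zip(gates, view_latents))
             for i in range(d)]
    return fused, gates
```

A view whose latent agrees with the geometric prior receives a larger gate, so the fused feature is dominated by views that are geometrically consistent with the monocular scene estimate.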
Unlike traditional methods that predict noise, AML uses a Diffusion Transformer (DiT) to directly estimate clean action chunks $\hat{A}_t$ on a low-dimensional manifold. We optimize using a velocity-consistent loss, which preserves the stability of flow matching while focusing the model's capacity on learning intrinsic action semantics.
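The velocity-consistent objective can be sketched for a single action chunk. This is an illustrative reconstruction under an assumed linear interpolation path $x_t = (1-t)\,\epsilon + t\,A$ (the paper's exact parameterization may differ): the network outputs a clean action estimate $\hat{A}$, which implies the velocity $\hat{v} = (\hat{A} - x_t)/(1-t)$, and the loss matches $\hat{v}$ to the flow-matching target $v = A - \epsilon$.

```python
def aml_velocity_loss(pred_action, x_t, action, noise, t):
    """Velocity-consistent loss for direct (clean-action) prediction.

    Assumes the linear flow x_t = (1 - t) * noise + t * action, whose
    target velocity is v = action - noise. A clean-action prediction
    implies v_hat = (pred_action - x_t) / (1 - t); matching v_hat to v
    keeps flow-matching stability while the network regresses actions
    directly instead of abstract noise.
    """
    v_target = [a - n for a, n in zip(action, noise)]
    v_pred = [(p - x) / (1.0 - t) for p, x in zip(pred_action, x_t)]
    return sum((vp - vt) ** 2 for vp, vt in zip(v_pred, v_target)) / len(action)
```

Note that when the prediction equals the true action, the implied velocity reduces exactly to $A - \epsilon$ and the loss vanishes, so the objective is minimized precisely on the action manifold.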
| Method | Spatial | Object | Goal | Long | Average |
|---|---|---|---|---|---|
| Diffusion Policy | 78.5 | 87.5 | 73.5 | 64.8 | 76.1 |
| OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| SpatialVLA | 88.2 | 89.9 | 78.6 | 55.5 | 78.1 |
| CoT-VLA | 87.5 | 91.6 | 87.6 | 69.0 | 83.9 |
| $\pi_0$-Fast | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 |
| GR00T-N1 | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
| $\pi_0$ | 98.0 | 96.8 | 94.4 | 88.4 | 94.4 |
| F1 | 98.2 | 97.8 | 95.4 | 91.3 | 95.7 |
| InternVLA-M1 | 98.0 | 99.0 | 93.8 | 92.6 | 95.9 |
| Discrete Diffusion VLA | 97.2 | 98.6 | 97.4 | 92.0 | 96.3 |
| $\pi_{0.5}$ | 98.8 | 98.2 | 98.0 | 92.4 | 96.9 |
| GR00T-N1.6 | 97.7 | 98.5 | 97.5 | 94.4 | 97.0 |
| OpenVLA-OFT | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
| UniVLA | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| X-VLA | 98.2 | 98.6 | 97.8 | 97.6 | 98.1 |
| GeoVLA | 98.4 | 99.0 | 96.6 | 96.6 | 97.7 |
| 3D-CAVLA | 98.2 | 99.8 | 98.2 | 96.1 | 98.1 |
| Spatial Forcing | 99.4 | 99.6 | 98.8 | 96.0 | 98.5 |
| Ours | 98.8 | 99.8 | 99.0 | 96.6 | 98.6 |
| Method | Camera | Robot | Language | Light | Background | Noise | Layout | Total |
|---|---|---|---|---|---|---|---|---|
| OpenVLA | 0.8 | 3.5 | 23.0 | 8.1 | 34.8 | 15.2 | 28.5 | 15.6 |
| OpenVLA-OFT | 56.4 | 31.9 | 79.5 | 88.7 | 93.3 | 75.8 | 74.2 | 69.6 |
| OpenVLA-OFT_w | 10.4 | 38.7 | 70.5 | 76.8 | 93.6 | 49.9 | 69.9 | 55.8 |
| OpenVLA-OFT_m | 55.6 | 21.7 | 81.0 | 92.7 | 91.0 | 78.6 | 68.7 | 67.9 |
| NORA | 2.2 | 37.0 | 65.1 | 45.7 | 58.6 | 12.8 | 62.1 | 39.0 |
| WorldVLA | 0.1 | 27.9 | 41.6 | 43.7 | 17.1 | 10.9 | 38.0 | 25.0 |
| UniVLA | 1.8 | 46.2 | 69.6 | 69.0 | 81.0 | 21.2 | 31.9 | 42.9 |
| $\pi_0$ | 13.8 | 6.0 | 58.8 | 85.0 | 81.4 | 79.0 | 68.9 | 53.6 |
| $\pi_0$-Fast | 65.1 | 21.6 | 61.0 | 73.2 | 73.2 | 74.4 | 68.8 | 61.6 |
| RIPT-VLA | 55.2 | 31.2 | 77.6 | 88.4 | 91.6 | 73.5 | 74.2 | 68.4 |
| MergeVLA | 50.7 | 30.3 | 66.0 | 84.2 | 85.7 | 66.0 | 68.1 | 62.5 |
| UnifoLM-VLA-0 | 56.7 | 69.5 | 91.2 | 93.6 | 95.3 | 77.9 | 77.9 | 78.9 |
| Ours | 89.6 | 60.1 | 86.9 | 98.0 | 95.7 | 97.2 | 78.2 | 85.7 |
| Simulation Task | $\pi_{0.5}$ Clean | $\pi_{0.5}$ Rand. | X-VLA Clean | X-VLA Rand. | Ours Clean | Ours Rand. |
|---|---|---|---|---|---|---|
| Place Dual Shoes | 12 | 7 | 79 | 88 | 88 | 86 |
| Move Stapler Pad | 16 | 18 | 78 | 73 | 61 | 60 |
| Stack Blocks Two | 48 | 56 | 92 | 87 | 98 | 99 |
| Scan Object | 42 | 38 | 14 | 36 | 91 | 87 |
| Place Object Stand | 74 | 65 | 86 | 88 | 90 | 87 |
| Place Fan | 25 | 36 | 80 | 75 | 90 | 94 |
| Move Pillbottle Pad | 33 | 29 | 73 | 71 | 96 | 96 |
| Pick Dual Bottles | 10 | 6 | 47 | 36 | 88 | 76 |
| Blocks Ranking Rgb | 43 | 35 | 83 | 83 | 95 | 98 |
| ... (50 tasks in total) | - | - | - | - | - | - |
| Turn Switch | 5 | 6 | 40 | 61 | 55 | 72 |
| Pick Diverse Bottles | 5 | 3 | 58 | 36 | 80 | 74 |
| Place Bread Basket | 48 | 56 | 81 | 71 | 86 | 83 |
| Stack Blocks Three | 15 | 16 | 6 | 10 | 91 | 84 |
| Put Bottles Dustbin | 12 | 9 | 74 | 77 | 62 | 87 |
| Place Can Basket | 19 | 25 | 49 | 52 | 82 | 73 |
| Stamp Seal | 36 | 23 | 76 | 82 | 81 | 88 |
| Handover Block | 18 | 19 | 73 | 37 | 87 | 89 |
| Stack Bowls Three | 33 | 35 | 76 | 86 | 81 | 92 |
| Place Object Basket | 43 | 36 | 44 | 39 | 85 | 93 |
| Open Microwave | 35 | 37 | 79 | 71 | 93 | 90 |
| Average | 42.98 | 43.84 | 72.80 | 72.84 | 85.18 | 86.06 |
Real-world experiments demonstrate robust manipulation in diverse physical scenarios with smooth trajectory execution.
Tests on unseen tasks and objects validate generalization ability.