TL;DR: Pri4R improves the physical-interaction capability of VLA models by leveraging privileged 4D supervision to instill an implicit understanding of world dynamics; it incurs no inference overhead and yields large gains on manipulation tasks.

Introducing Pri4R

Pri4R equips Vision Language Action (VLA) models with an implicit awareness of action-world dynamics through privileged 4D geometric supervision. During training, Pri4R leverages 4D geometry as an auxiliary signal to revise the VLM’s latent representation into a dynamics-aware state. Concretely, it injects the VLM’s internal embeddings into a dedicated point-tracking head and jointly predicts actions and future 3D point trajectories. This joint objective guides the backbone to encode the causal relationship between actions and the resulting geometric evolution of the scene, which directly strengthens the action prediction head for more accurate and robust behavior—while preserving the standard VLA interface with zero inference overhead at test time.
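The joint objective described above can be sketched in a few lines. This is a minimal, hypothetical PyTorch illustration, not the authors' implementation: all module names, hidden sizes, point counts, horizon lengths, and the loss weight are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumed, not from the paper).
D_LATENT = 256   # VLM hidden size
N_POINTS = 32    # tracked 3D points per scene
T_FUTURE = 8     # future steps supervised per point
A_DIM = 7        # action dimension, e.g. 6-DoF pose + gripper

class TrackHead(nn.Module):
    """Auxiliary head: VLM latent -> future 3D point tracks (T, N, 3)."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(D_LATENT, 512), nn.GELU(),
            nn.Linear(512, T_FUTURE * N_POINTS * 3),
        )

    def forward(self, latent):                     # latent: (B, D_LATENT)
        b = latent.shape[0]
        return self.mlp(latent).view(b, T_FUTURE, N_POINTS, 3)

action_head = nn.Linear(D_LATENT, A_DIM)
track_head = TrackHead()

def joint_loss(latent, gt_actions, gt_tracks, lam=0.5):
    """Action loss plus weighted auxiliary 3D-track loss (lam is assumed)."""
    action_loss = nn.functional.mse_loss(action_head(latent), gt_actions)
    track_loss = nn.functional.mse_loss(track_head(latent), gt_tracks)
    return action_loss + lam * track_loss
```

Because the track head is only used to shape the backbone's representation during training, it can simply be dropped at test time, which is why inference cost matches the base VLA.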

Pipeline of Pri4R

Pri4R is a framework that incorporates privileged geometric information to improve the world-dynamics understanding of VLA models. During training, we use high-fidelity 4D signals as auxiliary supervision to refine the internal representations of the VLM backbone. By supervising the model to predict the physical evolution of the scene, we enable the VLA to develop a physically aware context for robot control, without requiring any additional inputs or computational overhead during inference.

Dataset

3D Point Track Supervision

Real World

Simulation

Analysis of 3D Point Track Supervision

Temporality & Spatiality

Temporality matters. When we relax supervision from temporally dense 3D point tracks to a temporally sparse 3D point-set objective, the success rate improves only slightly over the baseline, suggesting that dense, frame-to-frame guidance is a key driver of performance.

Spatiality matters. Replacing 3D point tracks with 2D point tracks yields only a marginal gain over the baseline, indicating that accurate 3D geometric supervision is crucial.

Overall, the largest improvements come from temporally dense and 3D supervision together: 3D point tracks provide the strongest signal because they jointly capture motion dynamics (temporality) and geometry (spatiality).

Spatial Redundancy

We compare 3D point tracks with spatially dense supervision such as depth. While depth helps over the baseline, it consistently underperforms 3D point tracks. The key reason is that depth is often spatially redundant (especially under a mostly static view) and lacks point identity across time, making it a weaker signal for learning interaction-driven scene changes.
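The information content of the compared supervision signals can be made concrete by their tensor shapes. This is an illustrative sketch with assumed sizes; none of the dimensions are taken from the paper.

```python
import numpy as np

# Illustrative shapes for the ablated supervision variants (sizes assumed).
T, N, H, W = 8, 32, 224, 224   # future steps, tracked points, image size

tracks_3d = np.zeros((T, N, 3))   # dense 3D tracks: identity + motion + geometry
points_3d = np.zeros((N, 3))      # sparse 3D point set: geometry, no motion
tracks_2d = np.zeros((T, N, 2))   # 2D tracks: identity + motion, no depth
depth = np.zeros((T, H, W))       # per-frame depth: geometry, but no point
                                  # identity, and largely redundant when the
                                  # camera view is mostly static

# Only dense 3D tracks tie a persistent point identity to a full 3D
# trajectory, which is what lets the model relate actions to geometric change.
```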

Scene–Robot Interaction

We ablate tracking only the environment or only the robot versus our world tracking (environment + robot). Neither alone matches world tracking, indicating that explicitly modeling robot–environment interaction is crucial for improved performance in complex manipulation.
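A minimal sketch of how "world tracking" could combine the two point sets, assuming segmentation masks for the robot and environment are available. The sampling counts and mask construction here are assumptions for illustration, not the authors' procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_points(mask, k, rng):
    """Pick k pixel coordinates where mask is True."""
    ys, xs = np.nonzero(mask)
    idx = rng.choice(len(ys), size=k, replace=False)
    return np.stack([ys[idx], xs[idx]], axis=1)        # (k, 2)

# Toy masks: top half of the image is "robot", bottom half is "environment".
H, W = 64, 64
robot_mask = np.zeros((H, W), dtype=bool)
robot_mask[:32] = True
env_mask = ~robot_mask

# World tracking supervises points from both sets, so the model sees the
# robot's motion and the environment's response to it.
robot_pts = sample_points(robot_mask, 16, rng)
env_pts = sample_points(env_mask, 16, rng)
world_pts = np.concatenate([robot_pts, env_pts], axis=0)   # (32, 2)
```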

Multitask Simulation Benchmarks

We evaluate Pri4R on two multi-task simulation benchmarks: LIBERO and RoboCasa. LIBERO is a language-conditioned tabletop manipulation benchmark that tests multitask generalization across task families with varying spatial layouts, target goals, and object configurations; we report results on four suites (Spatial, Object, Goal, and Long), each with 10 tasks and 50 demonstrations per task. RoboCasa is a large-scale kitchen manipulation benchmark with diverse scenes and articulated interactions; we evaluate 24 atomic tasks using the Human-50 dataset, which provides 50 demonstrations per task.

LIBERO Multitask Benchmark.
RoboCasa Multitask Benchmark.

Real Robot Experiments

Side-by-side comparison of baseline methods and their Pri4R-enhanced versions.

Pick and Place over an Obstacle

OpenVLA-OFT: Baseline vs. + Pri4R
Pi0.5: Baseline vs. + Pri4R

Pick and Place into a Bin

OpenVLA-OFT · Seen: Baseline vs. + Pri4R
Pi0.5 · Seen: Baseline vs. + Pri4R
OpenVLA-OFT · Unseen: Baseline vs. + Pri4R
Pi0.5 · Unseen: Baseline vs. + Pri4R

Pick the Farthest Object

OpenVLA-OFT · Seen: Baseline vs. + Pri4R
Pi0.5 · Seen: Baseline vs. + Pri4R
OpenVLA-OFT · Unseen: Baseline vs. + Pri4R
Pi0.5 · Unseen: Baseline vs. + Pri4R

Pick the Moving Object

OpenVLA-OFT · Unseen: Baseline vs. + Pri4R
OpenVLA-OFT · OOD: Baseline vs. + Pri4R
Pi0.5 · OOD: Baseline vs. + Pri4R

Citation

If you use this work or find it helpful, please consider citing:

Coming Soon