Stop Picking Sides: VLAs, JEPA, World Foundation Models, and WAMs Are All Solving Different Problems

Five terms keep appearing in every robotics paper and every conference talk right now: VLA, JEPA, World Foundation Model, World Action Model, Steerable VLA. They get used interchangeably, or worse, po

May 11, 2026 · 19 min read

They're not. They're different layers of the same stack — each one solving a specific failure mode in what came before it. This post maps all of them: what each one actually does, where it breaks, and how they fit together. No narrative about "why now" — that's Part 2. This is the technical map.

VLAs: Where Everything Started

The basic idea of a Vision-Language-Action model: take a pretrained vision-language model that already understands what "cup," "shelf," and "carefully" mean from billions of internet examples, and extend it to output robot motor commands instead of just text.

RT-2 (arXiv:2307.15818, Google DeepMind, 2023) established the paradigm by co-fine-tuning PaLI-X on robot data with actions encoded as text tokens. The semantic generalization was genuine — you could describe a novel object and the robot had a fighting chance. OpenVLA (arXiv:2406.09246, Stanford, 2024) is the one to actually use if you're doing open research: 7B parameters, fully open weights, trained on the Open X-Embodiment dataset across 22 robot embodiments.
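To make "actions encoded as text tokens" concrete, here is a minimal sketch of uniform binning, the general idea behind token-based action heads. The 256-bin count is the commonly reported choice for this family of models; the [-1, 1] action range and both function names are assumptions for illustration, not RT-2's or OpenVLA's actual code.

```python
def action_to_tokens(action, low=-1.0, high=1.0, n_bins=256):
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)        # keep value inside the range
        frac = (clipped - low) / (high - low)       # normalise to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def tokens_to_action(tokens, low=-1.0, high=1.0, n_bins=256):
    """Recover the bin-centre value for each token."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


tokens = action_to_tokens([-1.0, 0.0, 0.3, 1.0])
recovered = tokens_to_action(tokens)
```

The round trip loses at most half a bin width per dimension. That quantisation is part of why later models moved to continuous action heads.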

π₀ (arXiv:2410.24164, Physical Intelligence, 2024) pushed the action representation forward — replacing discrete text tokens with a flow-matching head that generates smooth, continuous joint trajectories at up to 50Hz. That's the difference between a model that picks things up and a model that folds laundry.
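A rough sketch of what a flow-matching head does at inference: start from noise and integrate a velocity field until you land on an action. Here `oracle_velocity` is a stand-in for the learned network (an assumption: it returns the exact velocity of the straight-line path from noise to a known target), and the step count and names are illustrative, not π₀'s implementation.

```python
import random


def oracle_velocity(x, t, target):
    """Stand-in for the learned network: velocity of the straight-line path.

    On the path x_t = (1 - t) * noise + t * target, the velocity is
    (target - noise), which equals (target - x_t) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(target, n_steps=10, seed=0):
    """Euler-integrate the velocity field from Gaussian noise at t=0 to t=1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]   # initial noise sample
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


joint_targets = [0.25, -0.5, 1.0]
action = sample_action(joint_targets)
```

In a real model the velocity network is conditioned on images and language rather than on a known target, and the output is a whole chunk of future actions rather than a single vector.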

The 2025 wave — Figure Helix, GR00T N1 (arXiv:2503.14734), Gemini Robotics — all converged on what's now called the dual-system pattern: a slow VLM reasoning module at 1–10Hz feeding latent goals into a fast visuomotor policy at 50–200Hz. Not an elegant design choice. An engineering constraint: a 7B-parameter transformer cannot close a control loop at 100Hz. So you split the brain.
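The dual-system split is easy to picture as a two-rate loop. The sketch below is a synchronous simulation under simplifying assumptions: real systems run the two modules asynchronously, and `plan_fn`, `policy_fn`, and `planner_every` are hypothetical names, not any vendor's API. With a 100Hz control loop, `planner_every=10` corresponds to a 10Hz planner.

```python
def run_dual_system(n_ticks, planner_every, plan_fn, policy_fn):
    """Simulate a fast control loop whose goal is refreshed by a slow planner."""
    goal = None
    log = []
    for tick in range(n_ticks):
        if tick % planner_every == 0:
            goal = plan_fn(tick)            # slow system: runs on a few ticks only
        action = policy_fn(tick, goal)      # fast system: runs on every tick
        log.append((tick, goal, action))
    return log


log = run_dual_system(
    n_ticks=20,
    planner_every=10,
    plan_fn=lambda tick: f"goal@{tick}",
    policy_fn=lambda tick, goal: (goal, tick),
)
```

The fast policy keeps acting on a stale goal between planner updates, which is exactly the trade the pattern makes.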

"VLMs benefit from internet-scale knowledge, but are trained on objectives that emphasize visual and semantic understanding over prediction of physical dynamics. Tens of thousands of hours of costly robot data are needed to teach a model how to solve tasks considered simple for a human."
— 1X Technologies, 1X World Model blog, 2025

Where VLAs break:

The language backbone gives you a strong prior over the visual world, but it's the wrong prior for physical interaction. The model has seen pictures of a glass being gripped. It has never felt one slip. This shows up as distribution shift on contact-rich tasks: the demo pipeline works beautifully, and falls apart the moment lighting changes or an object is placed at an unusual angle.

Long-horizon tasks are a separate structural problem — error compounds across 30 sequential steps until catastrophic failure. Reactive policies have no mechanism to notice they're drifting. That's what the next two architectures were built to fix.
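The compounding is worth seeing as arithmetic. A minimal sketch, assuming each step succeeds independently with the same probability and that any single failure ends the task (both simplifications):

```python
def task_success_probability(per_step_success, n_steps):
    """Probability that all n_steps succeed, assuming independent steps."""
    return per_step_success ** n_steps


# Per-step reliabilities chosen for illustration only.
results = {p: task_success_probability(p, 30) for p in (0.99, 0.95, 0.90)}
for p, total in results.items():
    print(f"per-step {p:.2f} -> 30-step task {total:.3f}")
```

Under those assumptions, 99% per-step reliability gives roughly a 74% chance of finishing 30 steps, 95% gives about 21%, and 90% gives about 4%.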

Reactive VLAs vs. Reasoning VLAs: The SmolVLA Question

SmolVLA (arXiv:2506.01844, Hugging Face, 2025) is the clearest example of the reactive approach taken to its logical end: 450M parameters, consumer hardware, (image, language) → action in a single forward pass. No deliberation. No intermediate reasoning step.

For short-horizon, fixed-distribution tasks — pick and place a known object, close a cabinet — this is exactly what you want. Fast, cheap, deployable today.

The SmolVLA paper is also refreshingly honest about a real problem in the field:

"Much of the impactful VLA progress remains proprietary, with many models sharing only weights while withholding full training details and essential methodological components."
— SmolVLA, Hugging Face, 2025 (arXiv:2506.01844)

What reactive VLAs genuinely cannot do: handle conditional logic ("only if the cup is already there"), maintain coherent plans across 20+ steps, or self-correct when an early step fails. An uncorrected error at step 2 surfaces as a failure at step 8, and a single-pass policy has no mechanism to catch it in between.

Reasoning VLAs borrow chain-of-thought prompting from LLMs: force the model to generate intermediate reasoning steps before acting, and it generalises better on hard tasks. The embodied version is ECoT (arXiv:2407.08693, Zawalski et al., 2024):

"ECoT increases the absolute success rate of OpenVLA by 28% across challenging generalization tasks, without any additional robot training data. Additionally, ECoT makes it easier for humans to interpret a policy's failures and correct its behavior using natural language."
— Zawalski et al., ECoT (arXiv:2407.08693)

28% absolute improvement, no new robot data. That's a real number.

The field has since branched into several CoT and reasoning flavours:

Approach	Examples	Best for	Cost
Text CoT	ECoT, RAD	Long-horizon logic, interpretability	Slow (1–5Hz)
Visual CoT	ThinkAct (arXiv:2507.16815)	Spatial precision, self-correction	Medium speed
Latent CoT	LaRA-VLA (arXiv:2602.01166), Fast-ThinkAct (arXiv:2601.09708)	Near-reactive latency + generalization	Uninterpretable
Dual CoT	DualCoT-VLA (arXiv:2603.22280)	Best of text + visual	Complex training
Explicit CoT (Chain of Causation)	Alpamayo-R1 (arXiv:2511.00088)	AV edge cases, explainability	Inference overhead
RL-optimized reasoning (no explicit CoT)	Poutine (arXiv:2506.11234)	Long-tail, OOD robustness	Requires preference data

These last two are worth distinguishing clearly, because they solve the same problem differently. Alpamayo-R1 is chain-of-thought in the strict sense: the model generates explicit, structured "Chain of Causation" reasoning traces — spelling out causal relationships in the driving scene before producing a trajectory. Interpretable; you can read why it made the decision. Poutine doesn't generate CoT text at inference at all. It uses Vision-Language-Trajectory pretraining (self-supervised next-token prediction over vision, language, and trajectory tokens jointly) followed by GRPO reinforcement learning finetuning on a small set of human preference-labelled frames. The "reasoning" is implicit — shaped into the model's weights through RL, not written out as tokens at runtime.
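For readers who haven't met GRPO: its core trick is scoring each sampled output relative to the other samples in its group, rather than against a learned value function. The sketch below shows only that advantage computation. The reward values are made up, and the use of population standard deviation is an assumption; nothing here is taken from Poutine's training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical preference scores for four trajectories sampled from one scene.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Trajectories that beat their group average get a positive advantage and are reinforced; the rest are pushed down.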

Poutine's result is the one that silences skeptics: a 3B-parameter model, no handcrafted tokenisers, no custom components, first place in the 2025 Waymo Vision-Based End-to-End Driving Challenge by a significant margin, with validation performance nearly matching Waymo's own expert ground-truth trajectories. That's the benchmark validation that implicit RL-reasoning works on hard, real-world long-tail scenarios — and that you don't need explicit CoT tokens to get there.

The practical rule: reactive VLA for short-horizon fixed tasks. Add reasoning the moment your task is multi-step, conditional, or requires recovery. For tight latency budgets, latent CoT or RL-optimized variants recover most of the generalization benefit without the token overhead.

JEPA: Building the Right Foundation

While the VLA world was adding action heads to language models, Yann LeCun was making a different argument: the training objective itself is wrong.

JEPA — Joint Embedding Predictive Architecture — doesn't train a model to reconstruct pixels or predict the next token. It trains a model to predict the abstract representation of the future given the abstract representation of the present. Prediction happens entirely in embedding space. The model never has to regenerate irrelevant low-level details — only what semantically matters.
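A toy sketch of what "prediction in embedding space" buys you. The encoder and predictor here are hand-written stand-ins (assumptions for illustration); in a real JEPA both are learned networks, and the target encoder is typically a slowly updated copy that receives no gradient.

```python
def encode(frame):
    """Toy encoder (assumption): keep only the mean of each half of the frame."""
    half = len(frame) // 2
    return [sum(frame[:half]) / half, sum(frame[half:]) / (len(frame) - half)]


def predict(latent):
    """Toy predictor (assumption): the two halves of the scene swap brightness."""
    return [latent[1], latent[0]]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def jepa_loss(context_frame, target_frame):
    """Compare the predicted latent with the encoded target, never the pixels."""
    return sq_dist(predict(encode(context_frame)), encode(target_frame))


context = [0.0, 0.0, 1.0, 1.0]
target_clean = [1.0, 1.0, 0.0, 0.0]
target_noisy = [0.9, 1.1, 0.1, -0.1]   # same half-means, different fine detail

latent_gap = jepa_loss(context, target_noisy)
pixel_gap = sq_dist(target_clean, target_noisy)
```

The latent loss is essentially zero for both targets, while a pixel-space comparison still penalises the fine detail the encoder chose to discard.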

"JEPA models operate at a higher level of abstraction. By making predictions in the abstract representation space, the model can ignore unnecessary details and concentrate on the high-level information present in the data."
— Drozdov, Shwartz-Ziv & LeCun, NYU / Meta, 2024

The family has grown steadily: I-JEPA (arXiv:2301.08243, 2023) for images, V-JEPA (arXiv:2404.08471, 2024) extending to video, V-JEPA 2 (arXiv:2506.09985, 2025) combining ~1M hours of video with robot trajectory data and showing results on physical planning, and V-JEPA 2.1 (March 2026) tightening temporal consistency.

The honest gap: JEPA learns excellent physical representations but doesn't output motor commands. There's no language grounding out of the box. Closing the loop from "here is a rich latent representation of the world" to "here is a joint trajectory" is still active research. Whether V-JEPA 2's physical reasoning translates to manipulation performance at π₀'s level is an empirical question without clean cross-benchmarks yet.

LeCun's argument is a decade play — the physical priors from JEPA-style pretraining will eventually outcompete the semantic priors VLAs borrow from text. That may well be right. "Right in principle" and "ready to deploy today" are still different statements.

World Foundation Models: The Neural Simulator

A World Foundation Model (WFM) is not a controller. It doesn't output motor commands. It takes an observation and an action as inputs and predicts what happens next — as video frames or latent representations. A simulator that learned physics from watching humans, rather than from hand-authored dynamics.

NVIDIA Cosmos (cosmos.nvidia.com) is the most production-ready today. DreamDojo (arXiv:2602.06949, NVIDIA GEAR Lab, 2026) is the more interesting story — trained on 44,000 hours of first-person human video, no robot data, no physics engine. It uses continuous latent actions extracted self-supervised between video frames as a universal proxy for "what caused this state transition," regardless of hardware. Post-trained on small-scale robot data, it runs at 10 FPS for over a minute of continuous rollout, and policy evaluation success rates inside the "dream" correlate near-perfectly with real-world results (r = 0.995).
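The r = 0.995 figure is a Pearson correlation between success rates measured inside the world model and on real hardware. A minimal sketch of that computation, using invented success rates for five policy checkpoints (these numbers are hypothetical, not DreamDojo's data):

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical success rates for five policy checkpoints.
dream_success = [0.10, 0.35, 0.50, 0.70, 0.90]
real_success = [0.12, 0.30, 0.55, 0.68, 0.88]
r = pearson_r(dream_success, real_success)
```

A high r means the world model ranks policies the same way reality does, which is what makes it usable for evaluation even when absolute numbers differ.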

"It's Simulation 2.0. Real-world robot learning is bottlenecked by time, wear, safety, and resets. DreamDojo tries to work around this by learning from humans first."
— Jim Fan, on DreamDojo, February 2026

The WFM vs. VLA distinction in one line: a WFM answers "if the robot does X, what happens?" A VLA answers "given what I see, what should I do?" Complementary, not competing. WFMs live in the training loop; VLAs run on the robot.
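The same distinction as type signatures, plus the loop that composes them. This is a sketch with toy stand-ins: the type aliases, `imagined_rollout`, and both toy functions are hypothetical names, not any library's API.

```python
from typing import Callable, List

Observation = List[float]
Action = List[float]

# Policy (the VLA role): observation + instruction -> action.
Policy = Callable[[Observation, str], Action]
# World model (the WFM role): observation + action -> predicted next observation.
WorldModel = Callable[[Observation, Action], Observation]


def imagined_rollout(policy: Policy, world_model: WorldModel,
                     start: Observation, instruction: str,
                     horizon: int) -> List[Observation]:
    """Roll a policy forward inside a world model instead of on hardware."""
    obs = start
    trajectory = [obs]
    for _ in range(horizon):
        action = policy(obs, instruction)
        obs = world_model(obs, action)
        trajectory.append(obs)
    return trajectory


def toy_policy(obs, instruction):
    # Move each coordinate halfway towards zero; the instruction is ignored.
    return [-0.5 * x for x in obs]


def toy_world(obs, action):
    # Trivial dynamics: the action is added to the state.
    return [x + a for x, a in zip(obs, action)]


traj = imagined_rollout(toy_policy, toy_world, [4.0, -2.0], "go to origin", 3)
```

Swap `toy_world` for a learned video model and `toy_policy` for a VLA checkpoint and you have the skeleton of evaluating a policy "in the dream".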

Where WFMs break: visually plausible video ≠ physically accurate simulation. Contact dynamics, deformable objects, liquids — still hard. DreamDojo's +17% real-world success from model-based planning is real and significant. It's still not a physics engine.

World Action Models: Collapsing the Stack

A World Action Model (WAM) collapses simulation and control into a single model trained jointly. Instead of training a world model and a policy separately and composing them at inference, a WAM learns "what does the world look like after this action" and "what action should I take" in one forward pass, with shared representations.

DreamZero (arXiv:2602.15922, NVIDIA GEAR Lab, 2026) is the clearest current example — 14B parameters, video diffusion backbone (not a language model), jointly predicting future frames and motor commands via inverse dynamics. Using only ~500 hours of teleoperation across 22 diverse environments, it achieved 2× better generalisation to unseen tasks than state-of-the-art VLAs. Just 12 minutes of egocentric human video on an unseen task improved performance by over 42%.

"Compared to VLAs, WAMs learn best from diverse data, breaking away from the conventional wisdom that lots of repeated demos per task are the bread and butter. Diversity beats repetitions."
— Jim Fan, LinkedIn, January 2026

The cross-embodiment argument is the most compelling thing about WAMs: pixels are the universal bridge. Different robots have wildly different kinematics, but they all move through the same visual world. A pixel-space backbone can learn from humans, from other robots, from anything with a camera, with no fancy transfer algorithm needed.

Where WAMs still have problems: joint training on pixel prediction and motor command prediction is technically difficult — loss scales are completely mismatched. Training stability is still an open problem. The Fast-WAM paper (2026) found that for short-horizon tasks, the imagination component doesn't help much anyway — WAMs are most valuable exactly where they're hardest to train. Rigorous head-to-head benchmarks against best-in-class VLAs are still sparse.
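To see why mismatched loss scales hurt, consider a video loss in the hundreds next to an action loss in the hundredths: summed naively, the action term is invisible. One generic multi-task remedy is to divide each loss by a running estimate of its own magnitude. The sketch below illustrates that idea only; it is an assumption for illustration, not the method used by DreamZero or Fast-WAM, and `BalancedLoss` is a hypothetical helper.

```python
class BalancedLoss:
    """Combine losses after dividing each by a running average of its size."""

    def __init__(self, names, momentum=0.9):
        self.momentum = momentum
        self.scale = {name: None for name in names}

    def combine(self, losses):
        total = 0.0
        for name, value in losses.items():
            prev = self.scale[name]
            # The first call initialises the running scale to the observed value.
            self.scale[name] = value if prev is None else (
                self.momentum * prev + (1 - self.momentum) * value)
            total += value / (self.scale[name] + 1e-8)
        return total


balancer = BalancedLoss(["video", "action"])
first = balancer.combine({"video": 1000.0, "action": 0.01})
second = balancer.combine({"video": 500.0, "action": 0.005})
```

Both terms now contribute equally despite a five-orders-of-magnitude gap in raw value. In an autograd framework the running scale would be treated as a constant so that gradients flow only through the raw loss.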

Steerable VLAs: The Most Underrated Development of 2026

Separate research threads use the word "steerable" to mean subtly different things, all solving the same underlying problem: raw VLAs aren't controllable enough for production deployment.

Thread 1: Hierarchical Command Abstraction

Steerable VLA Policies (arXiv:2602.13193, Chen et al., 2026) addresses the interface problem in hierarchical robotics. Today, a high-level VLM hands natural language instructions down to a low-level VLA. "Pick up the cup." That's the whole interface. The VLM can reason about task structure, but natural language can't express how — approach angle, grasp point, intermediate waypoints.

The solution: train the VLA on rich synthetic commands at multiple levels of abstraction — natural language subtasks, trajectory descriptions, and grounded pixel coordinates that literally point at image locations. The VLA learns to respond to all three. Now the high-level planner can express intent precisely, and an off-the-shelf VLM doesn't need fine-tuning — just prompting with examples of how to use the richer command vocabulary.
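One way to picture the richer command vocabulary is as a small tagged union that the planner fills in and the policy consumes. The class names and the rendered string format below are invented for illustration; the paper's actual command encoding is not reproduced here.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass
class SubtaskCommand:
    text: str                       # e.g. a natural-language subtask


@dataclass
class TrajectoryCommand:
    description: str                # e.g. a verbal description of the motion


@dataclass
class PixelCommand:
    verb: str
    point: Tuple[int, int]          # (x, y) location in the camera image


Command = Union[SubtaskCommand, TrajectoryCommand, PixelCommand]


def render_prompt(cmd: Command) -> str:
    """Serialise a command into the text the low-level policy is prompted with."""
    if isinstance(cmd, SubtaskCommand):
        return cmd.text
    if isinstance(cmd, TrajectoryCommand):
        return f"follow trajectory: {cmd.description}"
    if isinstance(cmd, PixelCommand):
        return f"{cmd.verb} at <point x={cmd.point[0]} y={cmd.point[1]}>"
    raise TypeError(f"unknown command type: {type(cmd).__name__}")


prompts = [
    render_prompt(SubtaskCommand("pick up the cup")),
    render_prompt(TrajectoryCommand("approach from the left, then lift")),
    render_prompt(PixelCommand("grasp", (212, 148))),
]
```

The point of the design is that the high-level VLM picks whichever level of abstraction the situation calls for, and the same low-level policy understands all of them.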

Results in real-world manipulation: outperformed prior hierarchical approaches on both generalisation and long-horizon tasks, without touching the VLM.

Thread 2: Activation Steering

Mechanistic Interpretability for VLA Steering (arXiv:2509.00328, Häon et al., 2025) addresses the operator control problem. Classical robotics pipelines gave you explicit handles — tune a gain, set a speed limit. Neural policies are black boxes. If the robot moves too fast, you retrain.

By projecting feedforward activations in VLA transformer layers onto the token embedding basis, the authors found sparse semantic directions — literally a "speed direction" and a "motion direction" — causally linked to what the robot does. Steer these directions at inference time, in real time, without fine-tuning.
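The two moves here, labelling a direction by projecting it onto token embeddings and then nudging activations along it, fit in a few lines. This is a generic activation-steering sketch on a toy 3-dimensional space; the embedding values, vector names, and `alpha` are hypothetical, and it does not reproduce the paper's exact procedure.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_token(vector, embeddings):
    """Name the token whose embedding aligns best with `vector`."""
    return max(embeddings, key=lambda tok: dot(vector, embeddings[tok]))


def steer(hidden, direction, alpha):
    """Shift a hidden state along `direction` with strength `alpha`."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy embedding table (hypothetical values).
embeddings = {
    "fast": [1.0, 0.0, 0.0],
    "slow": [-1.0, 0.0, 0.0],
    "up": [0.0, 1.0, 0.0],
}
ffn_value_vector = [0.9, 0.1, 0.0]      # a direction written by one FFN neuron
hidden = [0.2, 0.5, -0.3]               # a hidden state before steering

label = top_token(ffn_value_vector, embeddings)
steered = steer(hidden, ffn_value_vector, alpha=2.0)
```

In this toy case the direction reads as "fast", and the hidden state, which initially aligns best with "up", aligns best with "fast" after steering. No weights change; only the activation is edited at inference time.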

"We introduce the first framework for interpreting and steering VLA models via their internal representations, enabling real-time, zero-shot behavioral control without fine-tuning."
— Häon et al., vla-mech-interp.github.io, 2025

Tested on π₀ and OpenVLA across LIBERO simulation and a physical UR5. Speed up. Slow down. Change direction. Zero-shot. No retraining.

Thread 3: π₀.7 — Steerable Generalist VLA in Production

π₀.7 (arXiv:2604.15483, Physical Intelligence, April 2026) is the most significant real-world steerable VLA to date — and arguably the clearest proof that the steerable approach scales to production. It's a 5B-parameter model built on Gemma3 4B with an 860M-parameter flow-matching action expert, conditioned on a richer context than any prior VLA: not just language commands, but subgoal images (predicted by a lightweight world model), episode metadata describing data quality and strategy, and execution style.

The key result is compositional generalisation — the ability to recombine skills learned in different contexts to solve tasks the model was never explicitly trained on. Earlier VLAs struggled here because their only steering input was natural language; there was no way to specify how to execute, only what to do. π₀.7's multimodal context conditioning closes that gap.

It also makes a striking training choice: heavy use of suboptimal robot data — failure episodes, demos with mistakes, data collected by earlier model versions during evaluation. Where most VLA training pipelines try to curate clean demonstrations, π₀.7 treats imperfect data as a feature. Combined with Knowledge Insulation and RECAP distillation during training, the model learns to distinguish quality and strategy from context rather than assuming the data is always correct.

The architecture puts a concrete face on what "steerable" actually means at scale: a model that takes direction at multiple levels of abstraction simultaneously — language, images, metadata — and responds coherently to all three.

Unlike WAMs or WFMs, steerable VLAs don't require rethinking your architecture. You get meaningfully more control from the model you already have.

How the Stack Actually Fits

Layer	What lives here	Role
Reasoning	Text CoT · Visual CoT · Latent CoT · RL-optimized (Poutine)	Plans & subgoal decomposition
Policy	Reactive VLA · Steerable VLA · Dual-system (SmolVLA, OpenVLA, π₀, Helix, GR00T N1)	Actions to robot
Simulation	World Foundation Model (DreamDojo, Cosmos, 1XWM)	Synthetic data & planning rollouts
Representation	JEPA (physical world priors) · World Action Model (collapses sim + policy into one)	Foundation

None of these is the final answer. Each addresses a real failure mode in what came before.

What to Actually Use

Short-horizon, fixed-distribution tasks → SmolVLA or OpenVLA-OFT. Use π₀-style flow matching if the task needs dexterity.

Long-horizon or conditional tasks → Add ECoT supervision: the reported gain is 28% absolute on OpenVLA with no additional robot data. For tighter latency, use Fast-ThinkAct or LaRA-VLA.

Need reasoning and high-frequency control together → Dual-system: a slow VLM at 1–10Hz feeding a fast visuomotor policy at 50–200Hz. This is what Helix and GR00T N1 are doing.

Production deployment needing behavioral handles → Activation steering (Häon et al.). Fastest path to runtime controllability without retraining.

Building training infrastructure → DreamDojo. Fully open-source (Apache-2.0), 10 FPS, r=0.995 correlation with real-world results, +17% from model-based planning out of the box.

Research horizon: 2–3 years → WAM trajectory. EgoScale's scaling law means compute is the binding constraint now, not architecture.

References

VLAs — Standard & Reactive

[1] A. Brohan et al., "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control," Proc. 7th Conference on Robot Learning (CoRL), PMLR 229, pp. 2165–2183, 2023. [arXiv:2307.15818] · [Project]

[2] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, "OpenVLA: An Open-Source Vision-Language-Action Model," arXiv preprint arXiv:2406.09246, 2024. [arXiv] · [GitHub]

[3] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky, "π₀: A Vision-Language-Action Flow Model for General Robot Control," arXiv preprint arXiv:2410.24164, 2024. [arXiv] · [Blog]

[4] J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan et al., "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots," arXiv preprint arXiv:2503.14734, 2025. [arXiv]

[5] Google DeepMind, "Gemini Robotics: Bringing AI into the Physical World," Technical Report, 2025. [Blog]

[6] Hugging Face, "SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics," arXiv preprint arXiv:2506.01844, 2025. [arXiv]

VLAs — Reasoning & CoT

[7] M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine, "Robotic Control via Embodied Chain-of-Thought Reasoning," arXiv preprint arXiv:2407.08693, 2024. [arXiv]

[8] C.-P. Huang, Y.-H. Wu, M.-H. Chen, Y.-C. F. Wang, and F.-E. Yang, "ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning," Proc. 39th Annual Conference on Neural Information Processing Systems (NeurIPS), 2025. [arXiv:2507.16815] · [Project]

[9] C.-P. Huang, Y. Man, Z. Yu, M.-H. Chen, J. Kautz, Y.-C. F. Wang, and F.-E. Yang, "Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning," arXiv preprint arXiv:2601.09708, 2026. [arXiv]

[10] H. Tan, P. Co, Y. Xu, S. Rong, Y. Ji, C. Chi, X. Chen, Q. Zhang, Z. Zhao, P. Wang et al., "LaRA-VLA: Latent Thinking and Prediction for Vision-Language-Action Models," arXiv preprint arXiv:2602.01166, 2026. [arXiv]

[11] Z. Zhong et al., "DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models," arXiv preprint arXiv:2603.22280, 2026. [arXiv]

[12] L. Rowe, R. de Schaetzen, R. Girgis, C. Pal, and L. Paull, "Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving," arXiv preprint arXiv:2506.11234, 2025. [1st place, 2025 Waymo Vision-Based End-to-End Driving Challenge] [arXiv] · [Waymo Challenges]

[13] NVIDIA Research, "Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail," Proc. 39th Annual Conference on Neural Information Processing Systems (NeurIPS), 2025. [arXiv:2511.00088]

Steerable VLAs

[14] Physical Intelligence, B. Ai, A. Amin, R. Aniceto, A. Balakrishna, G. Balke, K. Black et al., "π₀.7: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities," arXiv preprint arXiv:2604.15483, 2026. [arXiv] · [Blog]

[15] W. Chen et al., "Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control," arXiv preprint arXiv:2602.13193, 2026. [arXiv]

[16] B. Häon, K. Stocking, I. Chuang, and C. Tomlin, "Mechanistic Interpretability for Steering Vision-Language-Action Models," arXiv preprint arXiv:2509.00328, 2025. [arXiv] · [Project]

JEPA

[17] Y. LeCun, "A Path Towards Autonomous Machine Intelligence," OpenReview, Version 0.9.2, 2022. [OpenReview]

[18] M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas, "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA)," Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. [arXiv:2301.08243] · [GitHub]

[19] A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y. LeCun, M. Assran, and N. Ballas, "Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA)," Transactions on Machine Learning Research, 2024. [arXiv:2404.08471] · [GitHub]

[20] M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Komeili, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus et al., "V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning," arXiv preprint arXiv:2506.09985, 2025. [arXiv] · [Blog]

[21] Y. Huang, "VJEPA: Variational Joint Embedding Predictive Architectures as Probabilistic World Models," arXiv preprint arXiv:2601.14354, 2026. [arXiv]

World Foundation Models & World Action Models

[22] NVIDIA, "Cosmos: World Foundation Model Platform for Physical AI," 2024–2025. [cosmos.nvidia.com] · [GitHub]

[23] S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y. Dong, K. Mo, C.-H. Lin, Q. Ma, S. Nah, L. Magne, J. Xiang, Y. Xie, R. Zheng, D. Niu, Y. L. Tan, K. R. Zentner, G. Kurian, S. Indupuru, P. Jannaty, J. Gu, J. Zhang, J. Malik, P. Abbeel, M.-Y. Liu, Y. Zhu, J. Jang, and L. Fan, "DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos," arXiv preprint arXiv:2602.06949, 2026. [arXiv] · [GitHub] · [Project]

[24] 1X Technologies, "1X World Model," Technical Blog, 2025. [1x.tech]

[25] (Humanoid World Models authors), "Humanoid World Models," arXiv preprint arXiv:2506.01182, 2025. [arXiv]

[26] S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y. L. Tan, C. Zhu, J. Xiang, A. Malik, K. Lee, W. Liang, N. Ranawaka, J. Gu, Y. Xu, G. Wang, F. Hu, A. Narayan, J. Bjorck, J. Wang, G. Kim, D. Niu, R. Zheng, Y. Xie, J. Wu, Q. Wang, R. Julian, D. Xu, Y. Du, Y. Chebotar, S. Reed, J. Kautz, Y. Zhu, L. Fan, and J. Jang, "World Action Models are Zero-shot Policies (DreamZero)," arXiv preprint arXiv:2602.15922, 2026. [arXiv] · [GitHub]

[27] H. Bi, H. Tan, S. Xie, Z. Wang, S. Huang, H. Liu, R. Zhao, Y. Feng, C. Xiang, Y. Rong, H. Zhao, H. Liu, Z. Su, L. Ma, H. Su, and J. Zhu, "Motus: A Unified Latent Action World Model," arXiv preprint arXiv:2512.13030, 2025. [arXiv] · [GitHub]

Datasets

[28] Open X-Embodiment Collaboration, A. O'Neill, A. Rehman, A. Gupta et al., "Open X-Embodiment: Robotic Learning Datasets and RT-X Models," arXiv preprint arXiv:2310.08864, 2023. [arXiv] · [Dataset]

→ Continue to Part 2: Three People Who Never Agree Just Said the Same Thing About Robotics