Three People Who Never Agree Just Said the Same Thing About Robotics

May 11, 2026 · 12 min read

Something unusual is happening in physical AI right now. Not unusual in the sense of a single dramatic breakthrough. Unusual in the sense that people who almost never agree — a researcher who co-authored the two most cited VLA papers in existence, the CEO of the company behind DeepMind, the transformer paper, and the TPU, and an NVIDIA director with a published scaling law — are all independently landing on the same framing within the same few weeks.

That kind of convergence is worth paying attention to. Not because any one of them has the full picture, but because when independent signals triangulate on the same conclusion, it usually means something real is underneath.

Let me lay out what they're each saying, and then what I think it actually means.

Signal 1: Quan Vuong — "π₀ Was the GPT-1 Moment. We're Past It."

Quan Vuong is a co-author of RT-2 and π₀ — the two papers that arguably defined the VLA paradigm and set the benchmark every subsequent model gets measured against. He's not an observer of this field. He helped build it.

In April 2026, Vuong appeared on Y Combinator's Lightcone podcast in an episode titled "The GPT Moment for Robotics Is Here" and placed π₀ — a model he co-built — in historical context:

"Physical Intelligence is building a foundation model that can control any robot to do any task — what the team describes as the GPT moment for robotics. The company's cross-embodiment approach trains across many different robot platforms, and recent results show tasks being performed zero-shot that last year required hundreds of hours of data collection."
— Quan Vuong, Physical Intelligence, Lightcone Podcast, April 2026 · ycombinator.com/library/NS-robots-are-finally-starting-to-work

The specific framing Vuong used: π₀ was the GPT-1 moment — the proof that a single foundation model could generalise across robots and tasks in the same way GPT-1 proved language models could generalise across text. And by April 2026, he's describing it in the past tense. The GPT-1 moment already happened. The field has already moved past it. The question now is what the GPT-2 and GPT-3 equivalents look like — and whether the trajectory holds.

That's not a pitch. That's a researcher who built the reference model telling you where it sits in the arc — and placing it behind us, not ahead.

Chelsea Finn, PI co-founder and Stanford professor, had made a similar case at YC's AI Startup School in June 2025, tracing the arc from early robotic grasping experiments to today's work on folding laundry and generalising across kitchen tasks, all without hand-crafted code. Her framing: robots equipped with generalizable physical intelligence can adapt and assist in the unpredictable world around us. Not in theory. Now.

These are people who have spent careers being careful about overpromising on robotics. The shift in their language is the signal.

Signal 2: Sundar Pichai — "We Were Too Early"

In early April 2026, Sundar Pichai sat down with Stripe's Patrick Collison and Elad Gil on the Cheeky Pint podcast for 72 minutes. No slides, no rehearsed talking points. The most interesting line was also the most understated:

Google was previously too early to robotics. AI has become the missing ingredient for ideas conceived 10 to 15 years ago.

That's worth sitting with. Google has been doing robotics research since at least 2013. They co-built Open X-Embodiment, the dataset behind many of the VLAs in use today. They contributed the foundational transformer architectures the field runs on. And Pichai is saying they were too early.

Not that robotics was impossible. That the enabling technology — specifically, the kind of foundation models that can provide the visual, semantic, and physical grounding that makes general manipulation possible — simply didn't exist yet. And now it does.

He then told TIME, in an interview published later that month ("Sundar Pichai Reveals What AI Will Do Next," April 30, 2026):

"AI is reshaping decision-making. The rise of AI assistants marks a real shift — and Google's role is to build this technology responsibly."
— Sundar Pichai, TIME, April 2026

The word "responsibly" is doing real work in that sentence. Pichai has been consistent about this across multiple appearances: the governance gap between what these systems can do and the frameworks we have to oversee them is real. Software agents that make bad decisions corrupt databases. Physical agents that make bad decisions can hurt people. That's a harder problem, and it's one robotics specifically has to solve.

Gemini Robotics, he noted in the Cheeky Pint interview, has now reached state-of-the-art status for spatial reasoning. Google is partnering again with Boston Dynamics and Agile Robotics. Wing, its drone delivery service, is on track to reach 40 million Americans on a timeline Pichai described as "not years out."

This is not a company hedging on physical AI. This is a company that spent over a decade being "too early" and believes the conditions have finally changed.

Signal 3: Jim Fan — "The Great Parallel"

The most systematic framing came from Jim Fan, NVIDIA's Director of AI, at Sequoia's AI Ascent in April 2026. His argument is worth understanding in full because it's not just a "robotics is exciting" talk — it's a specific structural claim about why now and what comes next.

Fan's thesis, which he calls the Great Parallel: robotics is following the exact same playbook as large language models, one stage at a time.

| What LLMs did | What robotics is doing now |
| --- | --- |
| Pre-train on text, predict next token | Pre-train on video, predict next world state |
| InstructGPT alignment fine-tuning | Action fine-tuning on robot trajectories |
| o1 / o3 reasoning via RL | Massively parallel RL inside neural simulators |
| Language model backbone | World model backbone, not language |
| GPT-2 proof-of-concept | DreamZero (roughly here) |

"Our generation was born too late to explore the earth and too early to explore the stars — but we were born just in time to solve robotics."
— Jim Fan, Sequoia AI Ascent, April 2026 · youtube.com/watch?v=3Y8aq_ofEVs

The poetry is good. But what actually matters is what comes right after it in the talk: the data.

NVIDIA's EgoScale project pre-trained GR00T N1.5 on 21,000 hours of in-the-wild egocentric human video — zero robot data — and then fine-tuned with only 4 hours of teleoperation. The result: a near-perfect log-linear scaling law between human video volume and action prediction loss. R² = 0.998.

In words, the curve runs: more human video hours → lower action prediction loss → higher real-robot success rate. The relationship is log-linear. Clean. Predictable.

The same kind of curve Kaplan et al. found for language models in 2020. The curve that told the entire NLP field: compute is now the bottleneck, not ideas. Scale up training and the performance will follow.
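For reference, the two shapes can be written side by side. The first line is the power-law form reported by Kaplan et al. [15]. The second is one way to write the log-linear relationship described for EgoScale; the symbols $H$ (hours of human video), $a$, and $b$ are assumptions introduced here for illustration and are not taken from the talk.

```latex
\begin{align}
L(X) &\approx \left(\frac{X_c}{X}\right)^{\alpha_X}
  && \text{(language models: } X \text{ is parameters, data, or compute)} \\
\mathcal{L}_{\text{action}}(H) &\approx a - b \log H, \quad b > 0
  && \text{(assumed form for human-video pre-training)}
\end{align}
```

The first is a straight line on log-log axes and the second on semi-log axes. In both cases, each tenfold increase in scale buys a predictable drop in loss, which is what makes the curve useful for planning.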

If that curve holds for robotics — and it appears to — then the field has found its scaling law. The question is no longer "can we make robots that generalise?" It's "how much compute do we want to put in?"
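To make "log-linear with R² = 0.998" concrete, here is a minimal sketch of how such a fit is typically computed. The data points below are synthetic and made up for illustration; they are not EgoScale's measurements, and the helper name `fit_log_linear` is hypothetical.

```python
import numpy as np

def fit_log_linear(hours, loss):
    """Least-squares fit of loss = a + b * log10(hours).

    Returns (a, b, r_squared). A negative b means loss falls as data grows.
    """
    x = np.log10(np.asarray(hours, dtype=float))
    y = np.asarray(loss, dtype=float)
    b, a = np.polyfit(x, y, deg=1)  # polyfit returns the highest degree first
    residual = y - (a + b * x)
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return float(a), float(b), 1.0 - ss_res / ss_tot

# Synthetic example: an assumed "true" curve plus a little noise.
rng = np.random.default_rng(0)
hours = np.array([100, 300, 1_000, 3_000, 10_000, 21_000], dtype=float)
loss = 1.0 - 0.15 * np.log10(hours) + rng.normal(0.0, 0.002, size=hours.size)

a, b, r2 = fit_log_linear(hours, loss)
predicted_at_100k = a + b * np.log10(100_000)  # extrapolation beyond the data

print(f"intercept={a:.3f} slope={b:.3f} R^2={r2:.4f}")
print(f"extrapolated loss at 100k hours: {predicted_at_100k:.3f}")
```

The last line is the part that matters for the argument: once the fit is this clean, you can estimate what more data would buy before you collect it. Whether the extrapolation holds outside the measured range is an empirical question the fit alone cannot answer.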

Fan's critical phrase, repeated across the talk: "compute now equals environment equals data." Once you have a good world model, you can run RL inside it to generate unlimited synthetic experience. You're no longer bottlenecked by how many robot arms you own. The world model is the training environment, and the training environment is data.
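A toy sketch of that idea follows. Everything here is a hypothetical stand-in: `ToyWorldModel` is a hand-written one-dimensional simulator, not a learned neural world model, and the policy is a fixed rule rather than something trained with RL. The only point it illustrates is that the amount of experience is set by how many rollouts you compute, not by how many robots you own.

```python
class ToyWorldModel:
    """Hypothetical stand-in for a learned world model.

    The state is a position on a line; the goal sits at 1.0.
    """

    def step(self, state, action):
        next_state = state + 0.1 * action
        reward = -abs(next_state - 1.0)  # closer to the goal is better
        return next_state, reward

def go_toward_goal(state):
    """Fixed illustrative policy: move right until past the goal, then left."""
    return 1.0 if state < 1.0 else -1.0

def collect_rollouts(model, policy, n_episodes, horizon):
    """Generate (state, action, next_state, reward) tuples inside the model."""
    transitions = []
    for _ in range(n_episodes):
        state = 0.0
        for _ in range(horizon):
            action = policy(state)
            next_state, reward = model.step(state, action)
            transitions.append((state, action, next_state, reward))
            state = next_state
    return transitions

small = collect_rollouts(ToyWorldModel(), go_toward_goal, n_episodes=4, horizon=20)
large = collect_rollouts(ToyWorldModel(), go_toward_goal, n_episodes=400, horizon=20)
print(len(small), len(large))
```

In a real system the rollouts would feed an RL update, and the quality of the world model would bound the quality of the data. The sketch leaves both out.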

Why These Three Signals Matter Together

Each of these voices carries a different kind of authority — and that's the point.

Quan Vuong is the closest to the actual research. He co-authored the papers that defined the last two years of VLA development — including π₀, which he's now describing as the GPT-1 moment in the past tense. That's a precise and significant statement: the first proof-of-concept stage is already behind us. When he says "GPT moment," he's not pattern-matching to a trend piece. He's telling you the benchmark numbers crossed a threshold he can name — and that the field has moved past it.

Sundar Pichai carries the weight of institutional history. Google has been running robotics programs since 2013, contributed foundational transformer architectures, and co-built Open X-Embodiment. When Pichai says "we were too early," that's not humility — it's a precise technical statement about what was missing and when it arrived. Alphabet's $180–190 billion 2026 capex is directed at AI compute and infrastructure broadly — the same stack that powers Gemini Robotics, DeepMind's physical AI work, and Google's robotics partnerships. When the CEO of that company says physical AI is now ready, it lands differently than when a startup says it.

Jim Fan brings the most falsifiable claim of the three. The Great Parallel isn't a metaphor — it's a prediction with evidence attached. EgoScale's R² = 0.998 scaling curve and DreamZero's cross-embodiment transfer numbers are published and reproducible. If the parallel holds, the field's trajectory from here is predictable in the same way language model scaling was predictable after 2020.

Three different kinds of authority (the researcher who built it, the executive who has been trying for a decade, the scientist with a published scaling law) all converging on the same conclusion: **the enabling conditions for general-purpose physical AI are now in place.**

What It Means in Practice

The technical details of how VLAs, WFMs, WAMs, and JEPA fit together are in Part 1. But the broader implication of these converging signals is simpler:

The debate about whether general-purpose physical AI will happen is effectively over. The open questions are when, how fast, and who controls the stack when it gets there.

The scaling law is the decisive evidence. Once you have a clean log-linear curve connecting pre-training scale to action prediction loss, and through it to real-robot performance, the rest is engineering and investment. Language models taught us that once you find the scaling law, the field moves faster than almost anyone predicted. There is no reason to expect robotics to be different.

The governance question — Pichai's "responsibly" — is the one that hasn't been answered yet. Physical agents failing in the real world is categorically different from language models hallucinating. The steerable VLA work (see Part 1) is the beginning of an answer: interpretable activation steering, hierarchical command abstraction, runtime behavioral handles. But these are early-stage tools relative to what full-scale deployment will require.

The field has its scaling law. It doesn't yet have its safety framework. That gap is the most important thing to work on now.

The One-Line Version

Quan Vuong, who co-built π₀, says the GPT-1 moment already happened and we're past it. Jim Fan, who has the scaling law, says compute is now the binding constraint, not ideas. Sundar Pichai says the missing ingredient for robotics just arrived. When three people with that much at stake converge on the same framing within the same few weeks, the right response is to pay attention.

We were born just in time.

References

[1] Q. Vuong, "The GPT Moment for Robotics Is Here," Y Combinator Lightcone Podcast, April 2026. [YC Library] · [YouTube] · [Apple Podcasts]

[2] C. Finn, "Building Robots That Can Do Anything," Y Combinator AI Startup School, San Francisco, June 17, 2025. [YC Podcast]

[3] S. Pichai, "Sundar Pichai Reveals What AI Will Do Next," TIME, April 30, 2026. [YouTube] · [TIME]

[4] S. Pichai (with P. Collison and E. Gil), "The History and Future of AI at Google," Cheeky Pint / Stripe, April 7, 2026. [Transcript]

[5] S. Pichai, "Sundar Pichai on US AI Leadership," CBS 60 Minutes, April 12–13, 2026. [cbsnews.com]

[6] L. Fan (J. Fan), "Nvidia's Jim Fan on the End Game for Robotics," Sequoia AI Ascent, April 30, 2026. [YouTube]

[7] L. Fan, "DreamZero & EgoScale," LinkedIn, January–February 2026. [LinkedIn]

[8] S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu et al., "World Action Models are Zero-shot Policies (DreamZero)," arXiv preprint arXiv:2602.15922, 2026. [arXiv] · [GitHub]

[9] S. Gao, W. Liang, K. Zheng, A. Malik, S. Ye, S. Yu, W.-C. Tseng, Y. Dong, K. Mo, C.-H. Lin, Q. Ma, S. Nah, L. Magne, J. Xiang, Y. Xie, R. Zheng, D. Niu, Y. L. Tan, K. R. Zentner, G. Kurian, S. Indupuru, P. Jannaty, J. Gu, J. Zhang, J. Malik, P. Abbeel, M.-Y. Liu, Y. Zhu, J. Jang, and L. Fan, "DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos," arXiv preprint arXiv:2602.06949, 2026. [arXiv] · [GitHub]

[10] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky, "π₀: A Vision-Language-Action Flow Model for General Robot Control," arXiv preprint arXiv:2410.24164, 2024. [arXiv] · [Blog]

[11] J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan et al., "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots," arXiv preprint arXiv:2503.14734, 2025. [arXiv]

[12] Google DeepMind, "Gemini Robotics: Bringing AI into the Physical World," Technical Report, 2025. [Blog]

[14] CNBC, "Alphabet ups 2026 capex to as much as $190 billion, expects to 'significantly increase' in 2027," CNBC, April 29, 2026. [cnbc.com]

[15] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling Laws for Neural Language Models," arXiv preprint arXiv:2001.08361, 2020. [arXiv]

← Back to Part 1: Stop Picking Sides — VLAs, JEPA, World Models, and WAMs Are All Solving Different Problems