Embodied AI is fast becoming the cornerstone of artificial general intelligence. Why? Because it’s the only branch of AI that does things in the physical world: not just chat, draw, or translate, but move, grasp, fetch, and build. At the heart of this evolution is a new family of models: Vision-Language-Action models (VLAs).
What Are VLAs?
Think of a VLA as an AI brain that sees, reads, and acts—all at once. VLAs process images, understand natural language, and output physical actions. Unlike large language models (LLMs) like ChatGPT, which only generate text, VLAs control real robots, following your verbal instructions to interact with the world.
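In interface terms, the contract is simple: a VLA consumes a camera image plus a text instruction and emits an action. A minimal sketch of that contract, where the function name, action layout, and image size are all illustrative assumptions rather than any particular model’s API:

```python
import numpy as np

def vla_policy(image: np.ndarray, instruction: str) -> np.ndarray:
    """Hypothetical VLA interface: (camera frame, instruction) -> action.

    A real model would run a vision-language backbone here; this stub
    just returns a fixed 8-dim action vector (terminate flag, 3 position
    deltas, 3 rotation deltas, gripper) to show the input/output shape.
    """
    assert image.ndim == 3 and isinstance(instruction, str)
    return np.zeros(8)

frame = np.zeros((224, 224, 3), dtype=np.uint8)  # dummy RGB camera frame
action = vla_policy(frame, "pick up the red block")
```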
The concept took off after Google’s RT-2, which showed that a model pre-trained on web-scale vision and language data could generalize surprisingly well to robotic control. Today, we’re witnessing an explosion of research in VLAs, and it’s time to put some order into the chaos.
VLA Taxonomy: The Big Picture
The research community has converged on a three-part taxonomy for VLAs:
- Component focused:
  - Foundation blocks: vision encoders, language encoders, world models, reasoning modules, etc.
- Control policy oriented:
  - Low-level action policies: given a perception and an instruction, produce the actual robot control commands (e.g., “move left”, “grasp”).
- Task planner driven:
  - High-level planning: break long-horizon goals into subtasks, e.g., decompose “clean up the room” into picking up toys, moving to the shelf, etc.
It’s a useful mental map: think in layers—perception, control, planning.
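The planning and control layers can be made concrete with a toy sketch. Everything below (the function names, the canned plan, the keyword matching) is hypothetical and only illustrates the division of labor: a high-level planner decomposes a goal into subtasks, and a low-level policy maps each subtask plus an observation to a motor command.

```python
def plan(goal: str) -> list[str]:
    """Hypothetical high-level planner: goal -> ordered subtasks."""
    canned_plans = {
        "clean up the room": ["pick up toy", "move to shelf", "place toy"],
    }
    # Fall back to treating the goal itself as a single subtask.
    return canned_plans.get(goal, [goal])

def policy(observation: dict, subtask: str) -> str:
    """Hypothetical low-level policy: (perception, subtask) -> command."""
    if "pick" in subtask:
        return "grasp"
    if "move" in subtask:
        return "move_base"
    return "release"

obs = {"camera": None}  # placeholder perception input
commands = [policy(obs, step) for step in plan("clean up the room")]
```

A real stack would run the policy in a closed loop, re-observing after every command; the list comprehension here is only to show how the layers compose.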
How Did We Get Here?
Recent years have seen huge progress in vision-language models (e.g. LLaVA 1.6 (Hermes 34B), DeepSeek-VL-Chat, Qwen-VL-Chat), which are trained on massive datasets from the web and excel at recognizing patterns in images and text. These models can handle everything from image captioning to visual question answering, even across many languages. But while these models are great at understanding the world, getting a physical robot to perform useful tasks is a different story—robots would need to gather hands-on data about every object, situation, and environment they might encounter. That’s a tall order.
VLAs build on this idea by combining the best of both worlds: the broad visual and semantic knowledge of web-trained models, and the hands-on, task-oriented experience of robots. Researchers usually take a vision-language model pre-trained on web-scale data and further train it on real-world robot demonstrations collected over months in a stable environment. This produces a vision-language-action (VLA) model that can translate general knowledge into instructions for robotic control, going beyond what the robot itself has directly experienced. Where vision-language models output sequences of words, vision-language-action models output sequences of action tokens: instructions that a robot can carry out. Actions are encoded as strings of numbers, each representing things like whether to continue or stop, how to move the robotic arm, and how to operate the gripper.
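That encoding can be sketched in a few lines. In RT-2-style models, each action dimension is discretized into a fixed number of bins, so a continuous action vector becomes a short string of integer tokens. The bin count and normalized action range below are illustrative assumptions, not the exact values of any published model:

```python
import numpy as np

NUM_BINS = 256        # assumed vocabulary size per action dimension
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def encode_action(action: np.ndarray) -> str:
    """Map a continuous action vector to a string of bin indices."""
    clipped = np.clip(action, LOW, HIGH)
    bins = np.round((clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def decode_action(tokens: str) -> np.ndarray:
    """Invert encode_action, up to quantization error."""
    bins = np.array([int(t) for t in tokens.split()])
    return bins / (NUM_BINS - 1) * (HIGH - LOW) + LOW

# terminate flag, 3 position deltas, 3 rotation deltas, gripper
action = np.array([1.0, 0.1, -0.2, 0.0, 0.05, 0.0, 0.0, 0.5])
tokens = encode_action(action)
recovered = decode_action(tokens)
```

Because the tokens are just integers in a small vocabulary, the language model’s existing token machinery can emit them directly, which is what lets a web-pretrained backbone “speak robot”.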
VLA: From Research Prototypes to Real-World Robotics
While RT-2 marks a major step in bringing web-scale knowledge and hands-on robotic experience together, the field of Vision-Language-Action models is advancing rapidly and in several directions. Recent comprehensive surveys and open-source efforts highlight how VLAs are evolving from research prototypes into practical, general-purpose agents that can operate in real-world environments.
1. The Generalist Robot Revolution
Recent work, such as the π0 and π0-FAST models, is pushing towards “generalist” robotic agents—models that can learn across a wide range of tasks, robot types, and environments. Unlike specialized systems, these new models are pre-trained on diverse datasets and designed to adapt quickly to novel situations, much like large language models did for text.
2. Efficient and Accessible Robotics
Traditionally, high-performing VLA models have required enormous compute resources. SmolVLA and similar projects aim to democratize robot learning, showing that compact models can match or even exceed the performance of much larger systems, and can be trained or deployed on affordable, consumer-grade hardware.
3. Architectural Innovations: Better Action Representation
A critical area of VLA progress is in how models represent actions. Early approaches relied on simple tokenization, but new methods—like Frequency-space Action Sequence Tokenization (FAST)—compress and encode action trajectories far more efficiently, making it easier for models to control a wide range of robot types and speeds.
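The core idea behind FAST is to transform a chunk of the action trajectory into frequency space with a discrete cosine transform, where smooth motions concentrate their energy in a few low-frequency coefficients. The toy sketch below shows only that compression step (the real FAST pipeline also quantizes the coefficients and byte-pair-encodes them); the function names and the synthetic trajectory are mine:

```python
import numpy as np

def dct2(x: np.ndarray) -> np.ndarray:
    """Unnormalized DCT-II along the first (time) axis."""
    N = x.shape[0]
    n = np.arange(N)
    k = n.reshape(-1, 1)
    basis = np.cos(np.pi / N * (n + 0.5) * k)  # (freq, time)
    return basis @ x

def idct2(X: np.ndarray) -> np.ndarray:
    """Exact inverse of dct2 (scaled DCT-III)."""
    N = X.shape[0]
    n = np.arange(N).reshape(-1, 1)
    k = np.arange(1, N)
    basis = np.cos(np.pi / N * (n + 0.5) * k)  # (time, freq-1)
    return X[0] / N + (2.0 / N) * (basis @ X[1:])

def compress_trajectory(actions: np.ndarray, keep: int = 8) -> np.ndarray:
    """Zero out all but the first `keep` low-frequency coefficients."""
    coeffs = dct2(actions)
    coeffs[keep:] = 0.0
    return coeffs

# Synthetic smooth 50-step trajectory for a 7-dim action vector.
t = np.linspace(0, 1, 50)
actions = np.stack([np.sin(2 * np.pi * t * (d + 1) / 4) for d in range(7)], axis=1)
recon = idct2(compress_trajectory(actions, keep=8))
```

Keeping 8 of 50 coefficients per dimension is roughly a 6x compression of the sequence the model must emit, which is why frequency-space tokenization helps with high-frequency control.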
4. Cross-Embodiment Learning and Scalability
Modern VLA models face the challenge of learning from many different robot morphologies (single-arm, bimanual, mobile, etc.). By leveraging shared representations and cross-platform data, models like π0 generalize better and can transfer skills across diverse hardware.
The Road Ahead for Robotic Intelligence
Vision-language models can evolve into vision-language-action models that control robots directly, by leveraging web-scale knowledge and real-world experience. The result: more robust, generalizable robots that can interpret and solve tasks in diverse, unfamiliar scenarios. What matters is not just raw task performance, but the ability to generalize and acquire emergent skills. Ultimately, this approach points toward a future where general-purpose robots can reason, problem-solve, and act flexibly in the real world.