Vision-language-action model

The general architecture of a vision-language-action model. The model receives as input a text instruction and an image observation that are encoded in a latent representation. The action decoder receives this representation and generates a sequence of low-level robot actions.

In robot learning, a vision-language-action model (VLA) is a class of multimodal foundation models that integrates vision, language and actions. Given an input image (or video) of the robot's surroundings and a text instruction, a VLA directly outputs low-level robot actions that can be executed to accomplish the requested task.[1]

VLAs are generally constructed by fine-tuning a vision-language model (VLM, i.e. a large language model extended with vision capabilities) on a large-scale dataset that pairs visual observations and language instructions with robot trajectories.[2] These models combine a vision-language encoder (typically a VLM or a vision transformer), which maps an image observation and a natural language description into a latent representation, with an action decoder that transforms this representation into continuous output actions, directly executable on the robot.[3]

The concept was pioneered in July 2023 by Google DeepMind with RT-2, a VLM adapted for end-to-end manipulation tasks, capable of unifying perception, reasoning and control.[4]

Recent examples of VLAs include π0 by Physical Intelligence[5] and OpenVLA[6].

Overview of architecture


VLAs share a common high-level architecture organized in two stages.

In the first stage, a pre-trained VLM serves as the perception and reasoning core. It encodes one or more camera images together with a language instruction into a sequence of tokens in a shared latent space. VLMs are trained on large multimodal datasets and can perform a variety of tasks such as image understanding, visual question answering and reasoning. To directly control robots, however, they must be extended to output robot actions.[7]

In the second stage, an action decoder maps those tokens to discrete symbols that are then de-tokenized into continuous robot commands. These output actions are represented in the same way as language tokens, but they correspond to the degrees of freedom (DoF) of the robot's end effector. For a 6-DoF end effector, the action space typically includes positional and rotational displacements of the end effector together with the gripper state. In RT-2, for instance, each action vector covers the 6 DoF plus the gripper state and a termination flag, each quantized into 256 bins.[2]
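As a rough illustration of this discretization step, the sketch below maps a continuous action vector (6-DoF end-effector displacement, gripper state and termination flag) to 256 bins and back. The bin ranges and dimension ordering are assumptions chosen for illustration, not the values used by RT-2.

```python
import numpy as np

NUM_BINS = 256
# Hypothetical per-dimension ranges: [dx, dy, dz, droll, dpitch, dyaw, gripper, terminate]
LOW = np.array([-0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0, 0.0])
HIGH = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 1.0, 1.0])

def discretize(action):
    """Map a continuous action vector to integer bin indices in [0, 255]."""
    norm = (np.clip(action, LOW, HIGH) - LOW) / (HIGH - LOW)   # scale to [0, 1]
    return np.minimum((norm * NUM_BINS).astype(int), NUM_BINS - 1)

def undiscretize(tokens):
    """Recover a continuous action (bin centers) from the token indices."""
    return LOW + (tokens + 0.5) / NUM_BINS * (HIGH - LOW)

action = np.array([0.02, -0.01, 0.05, 0.0, 0.1, -0.2, 1.0, 0.0])
tokens = discretize(action)        # integer bins, e.g. [153 115 192 128 153 76 255 0]
recovered = undiscretize(tokens)   # close to the original action, up to bin resolution
```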

VLAs usually rely on off-the-shelf VLMs, giving the robot a prior understanding of images and text. During training, the model is fine-tuned on data in the form of (text instruction, visual observation, action trajectory), so that it learns to map visual observations and text instructions to robot actions. The training dataset consists of robot demonstrations, which may be collected on real robots through human teleoperation or generated synthetically in a simulation environment. Because the learning is end-to-end, VLAs inherently learn to associate high-level concepts (e.g. object categories and spatial relations) with low-level actions, eliminating the hand-engineered partitioning into perception, planning and control that is typical of traditional robotic systems.[2][8]
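A minimal sketch of this data format is shown below; the class and field names are placeholders rather than a specific library's API, and a real pipeline would also tokenize the text and preprocess the images.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Demonstration:
    instruction: str      # e.g. "pick up the red block"
    images: np.ndarray    # (T, H, W, 3) camera frames along the trajectory
    actions: np.ndarray   # (T, 8) end-effector deltas, gripper state, termination flag

def to_training_pairs(demo):
    """Turn one demonstration into per-timestep (inputs, target action) pairs
    for behaviour-cloning-style fine-tuning of the VLA."""
    for t in range(len(demo.actions)):
        inputs = {"text": demo.instruction, "image": demo.images[t]}
        target = demo.actions[t]
        yield inputs, target
```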

Action representation


A crucial design choice for the architecture of a VLA is the format in which robot actions are encoded.

Discrete Token Output is the most common approach, used by VLAs such as RT-2 and OpenVLA: each motion primitive is represented as a sequence of discrete tokens. The model thus encodes robot actions as an action string and learns to generate these sequences just as a language model generates text. This token-based approach reuses the same output layer and makes training straightforward, but converting continuous trajectories into vocabulary symbols can limit spatial accuracy or temporal resolution. RT-2 shows that this can be mitigated with special tokens that, for instance, mark the end of an action segment.[2][6]
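One common way to reuse the VLM's existing output layer is to reserve a small range of vocabulary ids for the action bins, so that an action becomes an ordinary token sequence. The sketch below illustrates the idea with an arbitrary offset, not the token layout of any particular model.

```python
ACTION_TOKEN_OFFSET = 32000   # hypothetical start of a reserved 256-token range

def bins_to_token_ids(bins):
    """Map action bin indices (0-255) onto reserved vocabulary ids."""
    return [ACTION_TOKEN_OFFSET + b for b in bins]

def token_ids_to_bins(token_ids):
    """Invert the mapping after the model has generated the action tokens."""
    return [t - ACTION_TOKEN_OFFSET for t in token_ids]

ids = bins_to_token_ids([153, 115, 192, 128, 153, 76, 255, 0])
assert token_ids_to_bins(ids) == [153, 115, 192, 128, 153, 76, 255, 0]
```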

Continuous Output (Diffusion/Flow) is an alternative approach, used by VLAs such as π0, that forgoes discrete tokens and directly outputs continuous actions in order to achieve dexterous, high-frequency control. This is done with diffusion models or flow-matching networks acting as the action decoder; π0 exploited this strategy to output continuous joint trajectories at up to 50 Hz. In practice, continuous output tends to scale better to robots with many degrees of freedom, where discretizing every DoF would be impractical.[9]
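The sketch below shows the sampling loop of a flow-matching action decoder under simplifying assumptions: the learned velocity network is replaced by a dummy function, and the conditioning on the VLM latent is only indicated.

```python
import numpy as np

def velocity_net(actions, t, context):
    """Stand-in for the learned velocity field v_theta(a_t, t | context);
    a real model would condition on the VLM's latent representation."""
    return -actions

def sample_action_chunk(context, horizon=50, action_dim=7, steps=10, seed=0):
    """Transport Gaussian noise to an action chunk with a few Euler steps."""
    rng = np.random.default_rng(seed)
    actions = rng.standard_normal((horizon, action_dim))   # a_0 ~ N(0, I)
    for k in range(steps):
        t = k / steps
        actions = actions + (1.0 / steps) * velocity_net(actions, t, context)
    return actions   # (horizon, action_dim) continuous commands, e.g. 50 steps at 50 Hz

chunk = sample_action_chunk(context=None)
```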

Single-model versus dual-system design

Comparison between single-system and dual-system architectures in a vision-language-action model. A single-system VLA (top) is an end-to-end architecture that couples a pre-trained VLM with an action decoder; the model takes text, images and robot state as input and outputs actions. A dual-system VLA (bottom) is a modular architecture in which the pre-trained VLM and the action decoder are two separate subsystems that communicate through a shared latent space. Each subsystem can run independently, even on different GPUs.

VLAs can be organized either as a single end-to-end network or as a dual-system that employs two coupled models.

The single-model design, employed by RT-2, OpenVLA and π0, simultaneously understands the scene and the language instruction to produce robot actions in a single forward pass, keeping the architecture simple and reducing latency.[2][6][9]

The dual-system design, adopted by Helix and GR00T N1, decouples the architecture into two components. The first component is usually slower and processes the image observations and text instructions received as input; the second runs at a faster rate and produces the robot's actions. The two components are trained end-to-end to communicate. This split improves dexterity and reduces control latency at the cost of increased computational complexity.[10][11]
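The toy control loop below illustrates the dual-system idea under assumed timings: a slow "System 2" refreshes a shared latent a few times per second while a fast "System 1" reads the most recent latent and emits actions at a much higher rate. All functions and rates are stand-ins, not the Helix or GR00T N1 implementations.

```python
import threading, time
import numpy as np

latest = {"z": np.zeros(512)}   # latent written by the slow system, read by the fast one
lock = threading.Lock()

def system2():
    """Slow loop: a VLM encodes the current image and instruction into a latent."""
    while True:
        z = np.random.standard_normal(512)   # stand-in for encode(image, instruction)
        with lock:
            latest["z"] = z
        time.sleep(0.125)                    # ~8 Hz

def system1(steps=1000):
    """Fast loop: a visuomotor policy turns the latest latent into actions."""
    for _ in range(steps):
        with lock:
            z = latest["z"]
        action = np.tanh(z[:7])              # stand-in for policy(z, robot_state)
        time.sleep(0.005)                    # ~200 Hz control

threading.Thread(target=system2, daemon=True).start()
system1()
```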

History


2023


Robotic Transformer 2 (RT-2)


Robotic Transformer 2 (RT-2) was developed by Google DeepMind in mid-2023 and established the vision-language-action model paradigm in robotics. It builds on two state-of-the-art VLMs, PaLI-X[12] and PaLM-E[13], by fine-tuning them on real robot demonstration data. RT-2 takes as input camera images paired with a text description and outputs robot actions encoded as discrete tokens. Compared to its predecessor RT-1[14], which was trained only on robotic data, RT-2 exhibits stronger generalization to new tasks and can also perform multi-step reasoning using chain-of-thought prompting.[4]

2024


OpenVLA

OpenVLA model architecture. Starting from an image observation and a natural language description of a task, the system generates 7D robot actions.[6]

OpenVLA is a 7-billion-parameter open-source VLA model introduced in June 2024 by researchers at Stanford. It was trained on the Open X-Embodiment dataset, a collaboration between 21 institutions that collected over one million episodes across 22 different robot embodiments. The model fuses image features from DINOv2[15] and CLIP encoders with a Llama-2 language backbone, and outputs discrete action tokens. Despite being smaller than Google DeepMind's RT-2, OpenVLA outperforms it on a suite of manipulation tasks. It also supports parameter-efficient fine-tuning methods and quantization for resource-constrained deployment.[6][16][17]
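A hedged sketch of the feature-fusion step: per-patch features from the two vision encoders are concatenated channel-wise and projected to the language model's embedding width. The dimensions and the single linear projection are assumptions for illustration, not OpenVLA's exact implementation.

```python
import numpy as np

def fuse_vision_features(feats_a, feats_b, projection):
    """Concatenate per-patch features from two encoders, then project to the LLM width."""
    fused = np.concatenate([feats_a, feats_b], axis=-1)   # (num_patches, d_a + d_b)
    return fused @ projection                             # (num_patches, d_llm)

num_patches, d_a, d_b, d_llm = 256, 1024, 1152, 4096      # illustrative sizes
rng = np.random.default_rng(0)
visual_tokens = fuse_vision_features(
    rng.standard_normal((num_patches, d_a)),   # e.g. DINOv2 patch features
    rng.standard_normal((num_patches, d_b)),   # e.g. CLIP-style patch features
    rng.standard_normal((d_a + d_b, d_llm)),   # projection into the language model
)
```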

Octo (Open Generalist Policy)


Octo is a lightweight open-source generalist robot policy from UC Berkeley. Trained on Open X-Embodiment, it was released in two configurations (27M and 93M parameters). Octo encodes text instructions with a language model and image observations with a lightweight convolutional neural network. Instead of an autoregressive decoder, Octo uses a diffusion policy that outputs continuous joint trajectories, enabling smoother motion and fast task adaptation. During fine-tuning, Octo's block-wise attention structure allows new observations to be added without modifying the pre-trained parameters.[18]

TinyVLA


TinyVLA is a compact VLA designed for fast inference and efficient training. It addresses the high computational requirements and heavy reliance on large datasets of its predecessors by initializing the policy with a smaller multimodal backbone and then fine-tuning it on robotics data. This work demonstrated the potential for more efficient VLAs by focusing on architecture and data curation rather than on very large models.[19]

π0 (pi-zero)


π0 (pi-zero) is a large-scale generalist VLA announced in late 2024 by the startup Physical Intelligence.[9] π0 uses PaliGemma[20], built from the SigLIP[21] vision encoder and the Gemma[22] language model, as its pre-trained VLM backbone, combined with an action expert trained on robot trajectories from the Open X-Embodiment dataset. Trained on trajectories from 8 different embodiments, it generalizes across embodiments, controls different robotic arms (single-arm and dual-arm) and tackles a wide variety of tasks. For action generation, π0 introduced a flow-matching action expert, closely related to a diffusion policy, that produces high-frequency continuous actions at up to 50 Hz.[23][24] π0-FAST, an extension of π0, uses Frequency-space Action Sequence Tokenization (FAST)[25], a time-series compression approach that transforms continuous action chunks from the time domain to the frequency domain using the discrete cosine transform.
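The sketch below conveys the intuition behind FAST-style frequency-space compression: an action chunk is transformed with the discrete cosine transform and its high-frequency coefficients are discarded. The real method additionally quantizes and encodes the retained coefficients into tokens, which is omitted here, and the cut-off is illustrative.

```python
import numpy as np
from scipy.fft import dct, idct

def compress_action_chunk(actions, keep=10):
    """DCT along time for a (T, D) action chunk, keeping the first `keep` coefficients."""
    coeffs = dct(actions, axis=0, norm="ortho")
    coeffs[keep:] = 0.0                          # drop high-frequency content
    return coeffs

def reconstruct_action_chunk(coeffs):
    """Inverse DCT back to the time domain."""
    return idct(coeffs, axis=0, norm="ortho")

# A smooth 50-step, 7-DoF trajectory is well approximated by a few low frequencies.
chunk = np.cumsum(0.01 * np.random.default_rng(0).standard_normal((50, 7)), axis=0)
approx = reconstruct_action_chunk(compress_action_chunk(chunk))
max_error = np.abs(chunk - approx).max()
```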

2025


Helix


Helix, unveiled in February 2025 by Figure AI, is a generalist VLA specifically tailored for humanoid robots. It is the first VLA able to control the entire upper body of a humanoid (arms, hands, torso, head and fingers) at high frequency. It uses a dual-system architecture, with two complementary systems trained to communicate end-to-end. System 2 (S2) is an internet-scale VLM specialized in scene understanding and language comprehension, while System 1 (S1) is a visuomotor policy that translates the latent representations produced by S2 into continuous robot actions. This decoupled architecture achieves both broad generalization and fast low-level control. Helix is trained on roughly 500 hours of robot teleoperation paired with automatically generated text descriptions. The Helix model underscored the ability of VLAs to scale to complex embodiments such as humanoids.[10]

GR00T N1


GR00T N1, released by NVIDIA in March 2025, is a VLA for humanoid robots that adopts a dual-system architecture similar to that of Helix. It is composed of a System 2, a VLM responsible for perceiving the environment, and a System 1, which generates motor actions. Unlike many other VLAs, it is trained on a heterogeneous mixture of data comprising robot trajectories, human videos and synthetic datasets.[11]

Gemini Robotics


Gemini Robotics, introduced in 2025 by Google DeepMind, is a VLA that builds on the capabilities of Gemini 2.0. While Gemini natively processes multimodal data such as text, images, video and audio, Gemini Robotics extends these capabilities to the physical world, allowing robots to take actions. The reasoning capabilities of the Gemini 2.0 backbone, paired with learned low-level robot actions, allow the robot to perform highly dexterous tasks such as folding origami or playing with cards. The model exhibits a high degree of generalization and can adapt to entirely new robot platforms. In June 2025, the authors released Gemini Robotics On-Device, a lightweight version of the model optimized to run locally on a robot with low latency and high reliability while preserving dexterity.[8][26]

SmolVLA


SmolVLA is a compact, open-source VLA with 450 million parameters released by Hugging Face as part of an effort to democratize research on VLAs. It was trained entirely on open-source datasets collected and curated by the community around the LeRobot project. Despite its compact size, SmolVLA achieves performance comparable to much larger VLAs such as Octo, OpenVLA and π0. Its architecture employs flow matching for continuous control and asynchronous inference to decouple the VLM backbone from action execution. SmolVLA can be fine-tuned and run on a single consumer GPU.[27][28][29]
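The toy loop below illustrates asynchronous inference under assumed timings: a background thread computes the next action chunk while the robot executes the current one, so model latency is hidden behind execution time. The functions and numbers are placeholders, not SmolVLA's implementation.

```python
import queue, threading, time
import numpy as np

chunks = queue.Queue(maxsize=1)   # holds the next precomputed action chunk

def policy_server():
    """Background thread: run the VLA to produce the next chunk of actions."""
    while True:
        time.sleep(0.3)                                  # stand-in for the model forward pass
        chunks.put(np.random.standard_normal((30, 7)))   # next 30-step action chunk

def robot_loop(num_chunks=5):
    """Execute chunks at a fixed control rate while the next one is being computed."""
    threading.Thread(target=policy_server, daemon=True).start()
    for _ in range(num_chunks):
        chunk = chunks.get()
        for action in chunk:
            time.sleep(0.02)                             # ~50 Hz execution of each action

robot_loop()
```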


References

  1. ^ Jeong, Hyeongyo; Lee, Haechan; Kim, Changwon; Shin, Sungta (October 2024). "A Survey of Robot Intelligence with Large Language Models". Applied Sciences. 14 (19): 8868. doi:10.3390/app14198868.
  2. ^ a b c d e Brohan, Anthony; Brown, Noah; Carbajal, Justice; Chebotar, Yevgen; Chen, Xi; Choromanski, Krzysztof; Ding, Tianli; Driess, Danny; Dubey, Avinava (July 28, 2023), RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, Proceedings of The 7th Conference on Robot Learning, PMLR, pp. 2165–2183, doi:10.48550/arXiv.2307.15818, arXiv:2307.15818, retrieved July 10, 2025
  3. ^ Fan, L.; Chen, Z.; Xu, M.; Yuan, M.; Huang, P.; Huang, W. (2024). "Language Reasoning in Vision-Language-Action Model for Robotic Grasping". 2024 China Automation Congress (CAC). pp. 6656–6661. doi:10.1109/CAC63892.2024.10865585. ISBN 979-8-3503-6860-4.
  4. ^ a b Dotson, Kyt (July 28, 2023). "Google unveils RT-2, an AI language model for telling robots what to do". Silicon Angle. Retrieved March 13, 2025.
  5. ^ "Our First Generalist Policy". physicalintelligence.company. October 31, 2024. Retrieved July 9, 2025.
  6. ^ a b c d e Kim, Moo Jin; Pertsch, Karl; Karamcheti, Siddharth; Xiao, Ted; Balakrishna, Ashwin; Nair, Suraj; Rafailov, Rafael; Foster, Ethan; Lam, Grace (September 5, 2024), OpenVLA: An Open-Source Vision-Language-Action Model, 8th Annual Conference on Robot Learning, doi:10.48550/arXiv.2406.09246, arXiv:2406.09246, retrieved July 8, 2025
  7. ^ Zhang, Jingyi; Huang, Jiaxing; Jin, Sheng; Lu, Shijian (August 2024). "Vision-Language Models for Vision Tasks: A Survey". IEEE Transactions on Pattern Analysis and Machine Intelligence. 46 (8): 5625–5644. doi:10.1109/TPAMI.2024.3369699. ISSN 0162-8828.
  8. ^ a b Team, Gemini Robotics; Abeyruwan, Saminda; Ainslie, Joshua; Alayrac, Jean-Baptiste; Arenas, Montserrat Gonzalez; Armstrong, Travis; Balakrishna, Ashwin; Baruch, Robert; Bauza, Maria (March 25, 2025), Gemini Robotics: Bringing AI into the Physical World, arXiv, doi:10.48550/arXiv.2503.20020, arXiv:2503.20020, retrieved July 9, 2025
  9. ^ a b c Black, Kevin; Brown, Noah; Driess, Danny; Esmail, Adnan; Equi, Michael; Finn, Chelsea; Fusai, Niccolo; Groom, Lachy; Hausman, Karol (November 13, 2024), π0: A Vision-Language-Action Flow Model for General Robot Control, arXiv, doi:10.48550/arXiv.2410.24164, arXiv:2410.24164, retrieved July 10, 2025
  10. ^ a b "Helix: A Vision-Language-Action Model for Generalist Humanoid Control". FigureAI. February 20, 2025. Retrieved July 9, 2025.
  11. ^ a b NVIDIA; Bjorck, Johan; Castañeda, Fernando; Cherniadev, Nikita; Da, Xingye; Ding, Runyu; Fan, Linxi "Jim"; Fang, Yu; Fox, Dieter (March 27, 2025), GR00T N1: An Open Foundation Model for Generalist Humanoid Robots, arXiv, doi:10.48550/arXiv.2503.14734, arXiv:2503.14734, retrieved July 9, 2025
  12. ^ Chen, Xi (2024). "On Scaling Up a Multilingual Vision and Language Model". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): 14432–14444. arXiv:2305.18565v1 – via OpenAccess.
  13. ^ Driess, Danny (July 23, 2023). "PaLM-E: an embodied multimodal language model". Proceedings of the 40th International Conference on Machine Learning (ICML). 340: 8469–8488. arXiv:2303.03378v1 – via ACM DL.
  14. ^ Brohan, Anthony; Brown, Noah; Carbajal, Justice; Chebotar, Yevgen; Dabis, Joseph; Finn, Chelsea; Gopalakrishnan, Keerthana; Hausman, Karol; Herzog, Alex (August 11, 2023), RT-1: Robotics Transformer for Real-World Control at Scale, arXiv, doi:10.48550/arXiv.2212.06817, arXiv:2212.06817, retrieved July 8, 2025
  15. ^ Oquab, Maxime; Darcet, Timothée; Moutakanni, Théo; Vo, Huy; Szafraniec, Marc; Khalidov, Vasil; Fernandez, Pierre; Haziza, Daniel; Massa, Francisco (February 2, 2024), DINOv2: Learning Robust Visual Features without Supervision, Transactions on Machine Learning Research Journal: arXiv, doi:10.48550/arXiv.2304.07193, arXiv:2304.07193, retrieved July 8, 2025
  16. ^ Radford, Alec; Kim, Jong Wook; Hallacy, Chris; Ramesh, Aditya; Goh, Gabriel; Agarwal, Sandhini; Sastry, Girish; Askell, Amanda; Mishkin, Pamela (February 26, 2021), Learning Transferable Visual Models From Natural Language Supervision, Proceedings of the 38th International Conference on Machine Learning, doi:10.48550/arXiv.2103.00020, arXiv:2103.00020, retrieved July 8, 2025
  17. ^ O’Neill, Abby; Rehman, Abdul; Maddukuri, Abhiram; Gupta, Abhishek; Padalkar, Abhishek; Lee, Abraham; Pooley, Acorn; Gupta, Agrim; Mandlekar, Ajay; Jain, Ajinkya; Tung, Albert; Bewley, Alex; Herzog, Alex; Irpan, Alex; Khazatsky, Alexander (May 2024). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models". 2024 IEEE International Conference on Robotics and Automation (ICRA): 6892–6903. doi:10.1109/ICRA57147.2024.10611477.
  18. ^ Team, Octo Model; Ghosh, Dibya; Walke, Homer; Pertsch, Karl; Black, Kevin; Mees, Oier; Dasari, Sudeep; Hejna, Joey; Kreiman, Tobias (May 26, 2024), Octo: An Open-Source Generalist Robot Policy, arXiv, doi:10.48550/arXiv.2405.12213, arXiv:2405.12213, retrieved July 8, 2025
  19. ^ Wen, Junjie; Zhu, Yichen; Li, Jinming; Zhu, Minjie; Tang, Zhibin; Wu, Kun; Xu, Zhiyuan; Liu, Ning; Cheng, Ran; Shen, Chaomin; Peng, Yaxin; Feng, Feifei; Tang, Jian (April 2025). "TinyVLA: Toward Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation". IEEE Robotics and Automation Letters. 10 (4): 3988–3995. doi:10.1109/LRA.2025.3544909. ISSN 2377-3766.
  20. ^ Beyer, Lucas; Steiner, Andreas; Pinto, André Susano; Kolesnikov, Alexander; Wang, Xiao; Salz, Daniel; Neumann, Maxim; Alabdulmohsin, Ibrahim; Tschannen, Michael (October 10, 2024), PaliGemma: A versatile 3B VLM for transfer, arXiv, doi:10.48550/arXiv.2407.07726, arXiv:2407.07726, retrieved July 10, 2025
  21. ^ Zhai, Xiaohua; Mustafa, Basil; Kolesnikov, Alexander; Beyer, Lucas (October 1, 2023). "Sigmoid Loss for Language Image Pre-Training". 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE: 11941–11952. doi:10.1109/ICCV51070.2023.01100. ISBN 979-8-3503-0718-4.
  22. ^ Team, Gemma; Mesnard, Thomas; Hardin, Cassidy; Dadashi, Robert; Bhupatiraju, Surya; Pathak, Shreya; Sifre, Laurent; Rivière, Morgane; Kale, Mihir Sanjay (April 16, 2024), Gemma: Open Models Based on Gemini Research and Technology, arXiv, doi:10.48550/arXiv.2403.08295, arXiv:2403.08295, retrieved July 10, 2025
  23. ^ Beyer, Lucas; Steiner, Andreas; Pinto, André Susano; Kolesnikov, Alexander; Wang, Xiao; Salz, Daniel; Neumann, Maxim; Alabdulmohsin, Ibrahim; Tschannen, Michael (October 10, 2024), PaliGemma: A versatile 3B VLM for transfer, arXiv, doi:10.48550/arXiv.2407.07726, arXiv:2407.07726, retrieved July 9, 2025
  24. ^ Black, Kevin; Brown, Noah; Driess, Danny; Esmail, Adnan; Equi, Michael; Finn, Chelsea; Fusai, Niccolo; Groom, Lachy; Hausman, Karol (2024), π0: A Vision-Language-Action Flow Model for General Robot Control, arXiv, doi:10.48550/ARXIV.2410.24164, retrieved July 9, 2025
  25. ^ Pertsch, Karl; Stachowicz, Kyle; Ichter, Brian; Driess, Danny; Nair, Suraj; Vuong, Quan; Mees, Oier; Finn, Chelsea; Levine, Sergey (January 16, 2025), FAST: Efficient Action Tokenization for Vision-Language-Action Models, arXiv, doi:10.48550/arXiv.2501.09747, arXiv:2501.09747, retrieved July 10, 2025
  26. ^ "Gemini Robotics". Google DeepMind. Retrieved July 9, 2025.
  27. ^ "SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data". huggingface.co. June 3, 2025. Retrieved July 9, 2025.
  28. ^ "lerobot (LeRobot)". huggingface.co. June 24, 2025. Retrieved July 9, 2025.
  29. ^ Shukor, Mustafa; Aubakirova, Dana; Capuano, Francesco; Kooijmans, Pepijn; Palma, Steven; Zouitine, Adil; Aractingi, Michel; Pascal, Caroline; Russi, Martino (June 2, 2025), SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics, arXiv, doi:10.48550/arXiv.2506.01844, arXiv:2506.01844, retrieved July 9, 2025
