Vision-Language-Action Model

A neural network architecture that unifies visual perception, language understanding, and motor control in a single model, enabling a robot to turn camera feeds and natural-language commands into physical movements.

Vision-language-action models (VLAs) represent an architectural shift in robotics, collapsing what were previously separate pipelines for perception, reasoning, and action into a single forward pass through a transformer-based network. A VLA takes camera input plus an optional language instruction and outputs motor commands directly. Physical Intelligence's VLA models use 3-5 billion parameters to predict a chunk of 50 robot action steps in roughly 100 milliseconds, while NVIDIA's GR00T N1.6 is purpose-built for humanoid control.
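
The interface described above can be sketched in a few lines. The toy model below is not Physical Intelligence's or NVIDIA's actual architecture; the layer sizes, vocabulary, 7-dimensional action space, and tokenization are illustrative assumptions. It only shows the core idea: one transformer forward pass maps an image and a tokenized instruction to a 50-step chunk of motor commands.

```python
# Minimal sketch of a VLA-style interface (illustrative only, not any vendor's API).
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, action_dim=7, chunk_len=50):
        super().__init__()
        self.chunk_len = chunk_len
        # Vision: split a 224x224 RGB frame into 16x16 patches projected to d_model.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Language: embed instruction tokens into the same d_model space.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # One learned query per future action step in the chunk.
        self.action_queries = nn.Parameter(torch.randn(chunk_len, d_model))
        # Shared transformer over [image patches | instruction tokens | action queries].
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # Head that turns each action query into a motor command vector.
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, image, instruction_ids):
        b = image.shape[0]
        img_tokens = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, 196, D)
        txt_tokens = self.token_embed(instruction_ids)                   # (B, T, D)
        queries = self.action_queries.unsqueeze(0).expand(b, -1, -1)     # (B, 50, D)
        seq = torch.cat([img_tokens, txt_tokens, queries], dim=1)
        out = self.backbone(seq)
        # Decode only the action-query positions into motor commands.
        return self.action_head(out[:, -self.chunk_len:])                # (B, 50, action_dim)

# Usage: one forward pass from a camera frame plus a command to a 50-step action chunk.
model = ToyVLA()
image = torch.rand(1, 3, 224, 224)             # camera frame
instruction = torch.randint(0, 1000, (1, 12))  # tokenized instruction, e.g. "pick up the cup"
actions = model(image, instruction)            # shape: (1, 50, 7)
print(actions.shape)
```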

Also known as

VLA, VLA model, vision-language-action