Xiaomi has released Robotics-0, an open-source Vision-Language-Action (VLA) model intended to power real-time robotic control and perception, underscoring the company’s ambitions beyond consumer electronics and into the foundation layers of physical AI.
The model, made publicly available with pretrained weights and code, integrates multimodal understanding with continuous action generation – an approach increasingly viewed as essential for robots that operate in dynamic, unstructured environments. Unlike models tailored solely for perception or language tasks, Robotics-0 is designed to connect what a robot sees and is told with what it does next.
By open-sourcing the technology, Xiaomi signals its intent to shape the broader research and development ecosystem for generalist robotics, potentially accelerating adoption and experimentation across academia and industry.
Bridging Perception, Language, and Action
At its core, Robotics-0 blends a pretrained Vision-Language Model (VLM) with a Diffusion Transformer (DiT) that generates continuous action sequences conditioned on both visual inputs and language instructions. The combined architecture allows the system to interpret images and textual directives, then produce executable robot control commands in real time.
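The data flow described above can be sketched structurally: a VLM-style encoder fuses an image and an instruction into a conditioning vector, and a diffusion-style head iteratively refines a chunk of continuous actions from noise. This is a toy illustration only – the function names, dimensions, and the linear "denoising" step are placeholders, not Xiaomi's published architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM, ACTION_DIM, CHUNK_LEN = 16, 7, 4  # illustrative sizes, not Robotics-0's

def vlm_encode(image, instruction):
    """Stand-in for the pretrained VLM: pool pixels, embed the instruction
    as bytes, and project both into one shared conditioning vector."""
    img_feat = image.mean(axis=(0, 1))  # (3,) crude global pooling over pixels
    txt_bytes = instruction.encode()[:EMBED_DIM].ljust(EMBED_DIM, b" ")
    txt_feat = np.frombuffer(txt_bytes, dtype=np.uint8) / 255.0
    w = rng.standard_normal((3 + EMBED_DIM, EMBED_DIM)) * 0.1
    return np.concatenate([img_feat, txt_feat]) @ w  # (EMBED_DIM,)

def dit_denoise(cond, steps=10):
    """Stand-in for the Diffusion Transformer head: start from noise and
    iteratively refine a chunk of continuous actions, with the VLM
    embedding injected as conditioning at every step."""
    x = rng.standard_normal((CHUNK_LEN, ACTION_DIM))  # pure noise
    w = rng.standard_normal((EMBED_DIM, ACTION_DIM)) * 0.1
    target = np.tile(cond @ w, (CHUNK_LEN, 1))  # conditioning signal
    for _ in range(steps):
        x = x + 0.2 * (target - x)  # toy denoising update
    return x  # (CHUNK_LEN, ACTION_DIM) chunk of robot actions

image = rng.random((32, 32, 3))
actions = dit_denoise(vlm_encode(image, "fold the towel"))
print(actions.shape)  # (4, 7)
```

The key point the sketch preserves is that the action head never sees raw pixels or text directly – it is conditioned on a single fused embedding, which is what lets one model serve both understanding and control.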
Xiaomi reports that the model contains 4.7 billion parameters and was trained on approximately 200 million robot trajectory steps alongside more than 80 million samples of general vision-language data. The robot data spans tasks collected via teleoperation, including complex bimanual manipulation scenarios such as Lego disassembly and towel folding.
This dual-domain training approach is designed to preserve strong vision-language capabilities while enabling generalizable action generation – allowing the model to interpret instructions and translate them into motion without requiring extensive task-specific retraining.
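One common way to realize this kind of dual-domain training is to mix both data sources in every mini-batch, so gradient updates from robot trajectories never fully displace the vision-language signal. The sketch below shows that mixing pattern; the 0.7 ratio and the `cotrain_batches` helper are illustrative assumptions, not Xiaomi's published recipe.

```python
import random

def cotrain_batches(robot_steps, vl_samples, robot_ratio=0.7, batch_size=4, seed=0):
    """Yield mini-batches with a fixed fraction of robot trajectory steps
    and the remainder drawn from general vision-language data, so action
    learning does not erase the pretrained VL capabilities."""
    rnd = random.Random(seed)
    n_robot = round(batch_size * robot_ratio)
    while True:
        batch = ([rnd.choice(robot_steps) for _ in range(n_robot)] +
                 [rnd.choice(vl_samples) for _ in range(batch_size - n_robot)])
        rnd.shuffle(batch)  # interleave the two domains within the batch
        yield batch

gen = cotrain_batches(["robot_step"] * 10, ["vl_sample"] * 10)
batch = next(gen)
print(batch.count("robot_step"), batch.count("vl_sample"))  # 3 1
```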
A central challenge in combining perception and control is inference latency. Robotics-0 addresses this with asynchronous execution: the robot begins executing one chunk of actions while the system generates the next. By overlapping inference with motion, this pipelined strategy keeps physical movement continuous even under real-time constraints.
Early benchmarks suggest strong performance across simulation environments, with high success rates on standard tasks and competitive results on multimodal benchmarks. According to Xiaomi, the model approaches the underlying VLM’s performance on language and visual understanding metrics while delivering robust action generation in simulation.
Implications for Robotics Development
The open-source release comes at a pivotal moment in robotics. Across academia and industry, there is converging interest in foundation models that can generalize across tasks without retraining from scratch for each new scenario. By providing a publicly accessible, pretrained VLA model, Xiaomi is lowering barriers to experimentation and potentially influencing how next-generation embodied AI systems are built.
For robotics developers, Robotics-0 offers several strategic advantages:
- Multimodal integration: The ability to combine vision, language, and action generation facilitates more intuitive human-robot interaction workflows, as instructions can be delivered in natural language and grounded in visual context.
- Scalability: Pretrained weights and public code allow researchers and integrators to start from a shared baseline rather than building models from scratch.
- Real-time execution: By tackling the latency problem head-on through asynchronous action generation, Xiaomi moves toward practical deployment on physical platforms rather than pure simulation.
Open-source foundation models have accelerated progress in fields such as natural language processing and computer vision. Applying the same principle to embodied AI suggests a roadmap where robots can share knowledge structures and learning strategies across domains rather than isolated task solutions.
However, moving from benchmark performance to robust, commercialized robotics applications remains a substantive challenge. Physical systems must contend with real-world variability – from lighting conditions and sensor noise to mechanical wear and safety constraints – that is difficult to capture fully in simulated training data.
By releasing Robotics-0, Xiaomi invites the global robotics community to contribute toward refining and extending the model’s capabilities. Partnerships between researchers, integrators, and hardware manufacturers could accelerate the pace at which vision-language-action models power real-world robots.
Whether this open-source push will translate into commercial leadership depends on broader ecosystem adoption and how effectively developers leverage the model across diverse robotic platforms. But in a field where software has often lagged hardware advances, Xiaomi’s move marks an important effort to define the software stack for a new generation of embodied intelligence.