NVIDIA has introduced Cosmos Policy, a new AI model designed to improve how robots plan and execute physical actions. The system builds on the company’s Cosmos world foundation models, representing a shift toward unified AI architectures capable of directly controlling robotic systems rather than simply interpreting environments.
The announcement reflects a broader trend in robotics: the convergence of perception, reasoning, and action within a single AI model. As robots move into more complex real-world environments, fragmented software pipelines are giving way to integrated systems that can understand and respond to physical dynamics in real time.
A Unified Model for Perception and Action
At its core, Cosmos Policy addresses one of robotics’ most persistent challenges: translating perception into precise physical movement. Traditionally, robotic systems have relied on separate modules for vision, planning, and motor control, each trained independently and connected through complex software pipelines.
Cosmos Policy simplifies this architecture by embedding perception, action, and planning within a single model. The system builds on Cosmos Predict, a world foundation model trained to anticipate how physical environments evolve over time.
Instead of treating robot commands as separate outputs, Cosmos Policy encodes actions and environmental changes as part of the same predictive framework used for video generation. This allows the model to generate sequences of robot actions while simultaneously predicting how those actions will affect the environment.
The result is a system that can perform multiple functions within one architecture, including generating movement commands, predicting future physical states, and evaluating potential outcomes. This integrated approach allows robots to anticipate consequences before executing actions, improving planning reliability.
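Conceptually, a unified architecture like the one described above exposes a single interface that returns both an action sequence and the states those actions are predicted to produce. The toy sketch below illustrates that interface only; the class name, toy dynamics, and random "policy" are hypothetical stand-ins for the model's learned behavior, not NVIDIA's API.

```python
import random

class UnifiedPolicySketch:
    """Toy sketch of a unified policy/world-model interface.

    Hypothetical illustration: one forward pass yields both an
    action chunk and the predicted future states, instead of
    routing through separate perception/planning/control modules.
    """

    def __init__(self, horizon=4, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)

    def step(self, observation):
        # Stand-in "policy": random actions in [-1, 1].
        actions = [self.rng.uniform(-1, 1) for _ in range(self.horizon)]
        # Stand-in "world model": predicted state accumulates actions,
        # mimicking joint action generation and outcome prediction.
        predicted_states = []
        state = observation
        for a in actions:
            state = state + a
            predicted_states.append(state)
        return actions, predicted_states

policy = UnifiedPolicySketch()
actions, states = policy.step(observation=0.0)
```

The point of the single `step` call is that downstream code never stitches together a separate planner and controller: every action comes paired with the model's own prediction of its consequences.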
Improving Manipulation and Real-World Performance
Early results suggest the unified architecture improves performance in robotic manipulation tasks. NVIDIA evaluated Cosmos Policy on established robotics benchmarks, including LIBERO and RoboCasa, which test multi-step object handling and household task scenarios.
The model demonstrated higher success rates than baseline approaches trained without world foundation model pretraining. In real-world experiments on the ALOHA robot platform, Cosmos Policy completed complex manipulation tasks from visual input alone.
One key advantage lies in the model’s ability to leverage temporal understanding learned during video training. By learning how physical scenes evolve over time, the system develops a form of predictive intuition about motion and interaction.
This capability is particularly important for manipulation tasks requiring precise coordination, such as grasping, moving, and placing objects in dynamic environments.
The model can also operate in two modes. In direct control mode, it generates actions immediately. In planning mode, it evaluates multiple possible action sequences and selects the most effective path based on predicted outcomes.
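The planning mode described above amounts to sampling candidate action sequences, rolling each one forward through the model's predicted dynamics, and executing the best-scoring sequence. The sketch below shows that pattern with a hypothetical one-dimensional "world model" and a distance-to-goal cost; none of these function names or dynamics come from NVIDIA's release.

```python
import random

def rollout(state, actions):
    # Toy dynamics: next state = state + action. Stands in for the
    # model's learned prediction of how actions change the scene.
    for a in actions:
        state += a
    return state

def plan(state, goal, n_candidates=16, horizon=5, seed=0):
    """Planning mode (sketch): sample candidate action sequences,
    score each by its predicted outcome, and return the best one."""
    rng = random.Random(seed)
    best_seq, best_cost = None, float("inf")
    for _ in range(n_candidates):
        seq = [rng.uniform(-1, 1) for _ in range(horizon)]
        cost = abs(goal - rollout(state, seq))  # predicted miss distance
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost

best_seq, best_cost = plan(state=0.0, goal=2.0)
```

Direct control mode corresponds to skipping the candidate loop and executing the first sampled sequence immediately; planning mode trades extra model evaluations for a lower predicted cost.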
A Shift Toward World Foundation Models for Robotics
Cosmos Policy reflects NVIDIA’s broader strategy to apply world foundation models to robotics and physical AI. Unlike traditional AI systems trained primarily on static images or text, world foundation models are trained on video data to understand how physical systems evolve over time.
This temporal modeling capability is essential for robotics, where actions and consequences unfold dynamically.
The approach contrasts with vision-language models commonly used in robotics research. While those models can interpret scenes and suggest actions, they often lack the detailed physical understanding required for precise motor control.
World foundation models, by contrast, are trained to predict motion and interaction directly, making them better suited for controlling physical systems.
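The training signal that distinguishes a world foundation model is temporal: given the current state and an action, predict the next state. The tiny example below fits that kind of next-state predictor from transition data via closed-form least squares; it is illustrative only (NVIDIA's models are large video transformers, not one-parameter linear fits).

```python
def fit_dynamics(transitions):
    """Least-squares fit of k in: next_state ≈ k * state + action.

    A minimal stand-in for the temporal objective described in the
    article: learn from (state, action, next_state) transitions how
    the physical system evolves, rather than from static labels.
    """
    num = sum((nxt - a) * s for s, a, nxt in transitions)
    den = sum(s * s for s, a, nxt in transitions)
    return num / den

# Synthetic transitions generated by true dynamics k = 0.9.
data = [(s, 0.5, 0.9 * s + 0.5) for s in [1.0, 2.0, 3.0]]
k = fit_dynamics(data)  # recovers 0.9
```

Once such a predictor is learned, it can be queried inside a planner, which is exactly the role the predictive component plays in the unified architecture described earlier.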
As embodied AI continues to evolve, unified models like Cosmos Policy could simplify robotics software stacks, reduce training complexity, and improve generalization across tasks.
NVIDIA’s latest release highlights a broader shift in robotics architecture. Rather than assembling separate perception, planning, and control systems, the industry is moving toward integrated AI models capable of understanding and acting within the physical world.
If successful, this approach could accelerate the development of robots capable of operating reliably in unstructured environments, bringing physical AI closer to large-scale commercial deployment.