The Allen Institute for AI (Ai2) has released Molmo 2, a new open multimodal model focused on precise spatial and temporal understanding across video, images, and multi-image inputs. The 8B-parameter model reportedly outperforms Ai2’s earlier 72B Molmo system on accuracy, video tracking, and pixel-level grounding, while also surpassing several proprietary models on emerging video reasoning tasks.
Molmo 2 was trained using a combination of synthetic and real-world data, relying on roughly 9 million videos – far fewer than comparable large-scale perception models. The system supports frame-level grounding, multi-object tracking, dense video captioning, and anomaly detection, enabling it to identify where events occur, when they happen, and how objects persist across complex scenes.
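Grounded output from Molmo-family models is typically returned as inline point annotations with coordinates expressed as percentages of image width and height; the snippet below is a minimal sketch of parsing such output, assuming the `<point x="…" y="…">` tag format used by the first Molmo release (the exact Molmo 2 format may differ).

```python
import re

# Assumed Molmo-style point annotation, e.g.:
#   'I see one dog: <point x="31.2" y="54.8" alt="dog">dog</point>'
# Coordinates are percentages (0-100) of image width/height.
POINT_RE = re.compile(
    r'<point\s+x="(?P<x>[\d.]+)"\s+y="(?P<y>[\d.]+)"[^>]*>(?P<label>[^<]*)</point>'
)

def parse_points(text: str) -> list[dict]:
    """Extract labeled points from a model response string."""
    return [
        {"label": m["label"], "x": float(m["x"]), "y": float(m["y"])}
        for m in POINT_RE.finditer(text)
    ]

def to_pixels(point: dict, width: int, height: int) -> tuple[int, int]:
    # Convert percentage coordinates to pixel coordinates for a given image size.
    return round(point["x"] / 100 * width), round(point["y"] / 100 * height)
```

For example, `parse_points('<point x="31.2" y="54.8" alt="dog">dog</point>')` yields one entry with label `dog` at (31.2, 54.8), which `to_pixels` can map onto a specific frame resolution.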
The release highlights growing interest in data-efficient multimodal models for robotics and automation. By publishing open weights, datasets, and evaluation tools, Ai2 is positioning Molmo 2 as a transparent foundation for building real-world systems that depend on reliable visual understanding rather than on closed, proprietary pipelines.