Meta Segment Anything Model 2 (SAM 2) – Unified Video and Image Segmentation
SAM 2 is Meta’s next-generation unified model for segmenting objects across images and videos. It extends the original Segment Anything Model (SAM) with a per-session memory module that enables robust, real-time interactive segmentation in video while preserving SAM’s simple, promptable design.
Key capabilities
- Segment any object in images or video frames using a click, box, or mask as input. You can select one or multiple objects within a frame and refine predictions with additional prompts (a minimal image-prompting sketch follows this list).
- Strong zero-shot performance on objects, images, and videos not seen during training, enabling wide real-world applicability.
- Real-time interactivity with streaming inference for interactive applications, including tracking objects across frames as the video plays.
- State-of-the-art performance for both image and video segmentation, outperforming prior models on established benchmarks, especially when tracking parts of objects across frames.
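As a rough illustration of the prompting interface, the sketch below follows the image-predictor usage published in the SAM 2 code repository; the checkpoint path, config name, image file, and click coordinates are placeholders, and exact class or argument names may differ between releases.

```python
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder checkpoint/config names; substitute the files you downloaded.
predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt"))

image = np.array(Image.open("example.jpg").convert("RGB"))

with torch.inference_mode():
    predictor.set_image(image)
    # One positive click (x, y) on the object of interest; label 1 marks foreground.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,  # return several candidate masks for an ambiguous click
    )

best_mask = masks[np.argmax(scores)]  # keep the highest-scoring candidate
```

Box prompts can be passed to the same call in place of point prompts, and further clicks can be added to refine the selection.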
How it works
- Input prompts (click/box/mask) select targets on an image or a video frame.
- For video, SAM 2 uses a per-session memory module that captures information about the target object, allowing tracking across subsequent frames, even if the object temporarily disappears.
- Corrections can be made by providing extra prompts on any frame to refine the mask.
- A streaming architecture processes video frames one at a time, enabling real-time results (see the video sketch after this list).
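To make the streaming flow concrete, here is a sketch of the video workflow modeled on the video-predictor interface from the SAM 2 repository; the frame directory, click location, and exact function names are assumptions that may differ between releases.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")

with torch.inference_mode():
    # One per-session state holds the memory for this video (a folder of frames here).
    state = predictor.init_state(video_path="videos/example_frames")

    # Prompt the target once: a positive click on frame 0.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Frames are processed one at a time; the session memory carries the target
    # forward, even if it is briefly occluded.
    for frame_idx, obj_ids, masks in predictor.propagate_in_video(state):
        pass  # e.g. overlay or save the mask for this frame
```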
Model architecture highlights
- Extends SAM’s promptable capability to the video domain with a memory module that maintains object context across frames.
- When applied to images, the memory module is empty and SAM 2 behaves like the original SAM.
- Video segmentation relies on the per-session memory to maintain tracking, combined with a memory-conditioned attention mechanism that grounds each frame’s prediction in what the model has already seen (a schematic sketch follows this list).
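The snippet below is a purely schematic sketch (not the actual SAM 2 implementation) of the idea behind memory-conditioned attention: features of the current frame cross-attend to a bank of remembered frame features, and an empty bank leaves the features untouched, mirroring the image-only case described above.

```python
import torch
import torch.nn as nn

class MemoryAttentionSketch(nn.Module):
    """Illustrative only: fuse current-frame features with a bank of past-frame features."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor, memory_bank: list[torch.Tensor]) -> torch.Tensor:
        if not memory_bank:
            # Image case: no memories, so the model falls back to plain promptable segmentation.
            return frame_feats
        memory = torch.cat(memory_bank, dim=1)             # (B, total_memory_tokens, dim)
        fused, _ = self.attn(frame_feats, memory, memory)  # queries come from the current frame
        return frame_feats + fused                         # residual fusion with the memories

# Toy usage: one frame of 64x64 feature tokens conditioned on two remembered frames.
feats = torch.randn(1, 64 * 64, 256)
bank = [torch.randn(1, 64 * 64, 256) for _ in range(2)]
print(MemoryAttentionSketch()(feats, bank).shape)  # torch.Size([1, 4096, 256])
```

This sketch only conveys the conditioning pattern; the real model’s memory also incorporates prompted frames and recent predictions rather than raw features alone.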
Data, training, and openness
- Trained on a large and diverse video dataset (SA-V) created by interactive use of SAM 2 in a model-in-the-loop setup.
- Dataset details: more than 600k masklets across roughly 51k videos collected in 47 countries, with annotations covering whole objects, parts, and challenging occlusions.
- Open access: pretrained SAM 2 model, the SA-V dataset, a demo, and code are publicly released to enable research and development.
- The dataset and model releases emphasize transparency and geographic diversity, along with fairness considerations.
Applications and potential use cases
- Fine-grained video editing, object tracking, and content creation.
- Real-time object segmentation for AR/VR, video compositing, and post-production tools.
- Use as input segmentation component for downstream AI systems, such as video generation or editing models.
Safety, ethics, and usage notes
- The model is a research-oriented tool intended for segmentation tasks. Users should consider privacy, consent, and licensing when applying segmentation results to media featuring real people or proprietary content.
How to try
- Try interactive segmentation by selecting an object in a video frame and tracking it across frames, with the option to refine masks via additional prompts (a short refinement sketch follows).
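Continuing the video sketch from the “How it works” section, where the video predictor and per-session state were created, a corrective prompt on a later frame might look like the following; the frame index and click coordinates are illustrative assumptions.

```python
import numpy as np
import torch

# Assumes `predictor` and `state` from the earlier video sketch are still in scope.
with torch.inference_mode():
    # Suppose the mask drifted on a later frame: add a corrective negative click
    # (label 0 = background) to trim the prediction on that frame.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=90,
        obj_id=1,
        points=np.array([[300, 120]], dtype=np.float32),
        labels=np.array([0], dtype=np.int32),
    )

    # Re-propagate so the correction is reflected across the rest of the video.
    for frame_idx, obj_ids, masks in predictor.propagate_in_video(state):
        pass  # consume the refined masks frame by frame
```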
Feature Highlights
- Unified segmentation for images and videos in a single model
- Real-time streaming inference enabling interactive video segmentation
- Per-session memory to track objects across video frames
- Robust zero-shot performance on unfamiliar objects and scenes
- Interactive corrections with frame-wise prompts
- Open releases: pretrained SAM 2 model, SA-V dataset, demo, and code
- Extensive training data with broad geographic diversity