Meta Segment Anything Model 2 (SAM 2) – Unified Video and Image Segmentation
SAM 2 is Meta’s next-generation unified model for segmenting objects across images and videos. It extends the original Segment Anything Model (SAM) with a per-session memory module that enables robust, real-time interactive segmentation in video while preserving SAM’s simple, promptable design.
Key capabilities
- Segment any object in images or video frames using a click, box, or mask as input. You can select one or multiple objects within a frame and refine predictions with additional prompts (a minimal image-prompting sketch follows this list).
- Strong zero-shot performance on objects, images, and videos not seen during training, enabling wide real-world applicability.
- Real-time interactivity with streaming inference for interactive applications, including tracking objects across frames as the video plays.
- State-of-the-art performance for both image and video segmentation, outperforming prior models on established benchmarks, especially when tracking parts of objects across frames.
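As a rough illustration of the prompting interface, the sketch below follows the image-predictor usage published in the SAM 2 code repository; the checkpoint path, config name, image file, and click coordinates are placeholders, and exact class or argument names may differ between releases.

```python
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder checkpoint/config names; substitute the files you downloaded.
predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt"))

image = np.array(Image.open("example.jpg").convert("RGB"))

with torch.inference_mode():
    predictor.set_image(image)
    # One positive click (x, y) on the object of interest; label 1 marks foreground.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,  # return several candidate masks for an ambiguous click
    )

best_mask = masks[np.argmax(scores)]  # keep the highest-scoring candidate
```

Box prompts can be passed to the same call in place of point prompts, and further clicks can be added to refine the selection.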
How it works
- Input prompts (click/box/mask) select targets on an image or a video frame.
- For video, SAM 2 uses a per-session memory module that captures information about the target object, allowing tracking across subsequent frames, even if the object temporarily disappears.
- Corrections can be made by providing extra prompts on any frame to refine the mask.
- A streaming architecture processes video frames one at a time, enabling real-time results (see the video sketch after this list).
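To make the streaming flow concrete, here is a sketch of the video workflow modeled on the video-predictor interface from the SAM 2 repository; the frame directory, click location, and exact function names are assumptions that may differ between releases.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")

with torch.inference_mode():
    # One per-session state holds the memory for this video (a folder of frames here).
    state = predictor.init_state(video_path="videos/example_frames")

    # Prompt the target once: a positive click on frame 0.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Frames are processed one at a time; the session memory carries the target
    # forward, even if it is briefly occluded.
    for frame_idx, obj_ids, masks in predictor.propagate_in_video(state):
        pass  # e.g. overlay or save the mask for this frame
```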
Model architecture highlights
- Extends SAM’s promptable capability to the video domain with a memory module that maintains object context across frames.
- When applied to images, the memory module is empty and SAM 2 behaves like the original SAM.
- Video segmentation relies on the per-session memory to maintain tracking, combined with a memory-conditioned attention mechanism that grounds each frame’s prediction in what the model has already seen (a schematic sketch follows this list).
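The snippet below is a purely schematic sketch (not the actual SAM 2 implementation) of the idea behind memory-conditioned attention: features of the current frame cross-attend to a bank of remembered frame features, and an empty bank leaves the features untouched, mirroring the image-only case described above.

```python
import torch
import torch.nn as nn

class MemoryAttentionSketch(nn.Module):
    """Illustrative only: fuse current-frame features with a bank of past-frame features."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor, memory_bank: list[torch.Tensor]) -> torch.Tensor:
        if not memory_bank:
            # Image case: no memories, so the model falls back to plain promptable segmentation.
            return frame_feats
        memory = torch.cat(memory_bank, dim=1)             # (B, total_memory_tokens, dim)
        fused, _ = self.attn(frame_feats, memory, memory)  # queries come from the current frame
        return frame_feats + fused                         # residual fusion with the memories

# Toy usage: one frame of 64x64 feature tokens conditioned on two remembered frames.
feats = torch.randn(1, 64 * 64, 256)
bank = [torch.randn(1, 64 * 64, 256) for _ in range(2)]
print(MemoryAttentionSketch()(feats, bank).shape)  # torch.Size([1, 4096, 256])
```

This sketch only conveys the conditioning pattern; the real model’s memory also incorporates prompted frames and recent predictions rather than raw features alone.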
Data, training, and openness
- Trained on a large and diverse video dataset (SA-V) created by interactive use of SAM 2 in a model-in-the-loop setup.
- Dataset details: more than 600k masklets across roughly 51k videos collected in 47 countries, with annotations covering whole objects, parts, and challenging occlusions.
- Open access: pretrained SAM 2 model, the SA-V dataset, a demo, and code are publicly released to enable research and development.
- The dataset and model releases emphasize transparency and geographic diversity, along with fairness considerations.
Applications and potential use cases
- Fine-grained video editing, object tracking, and content creation.
- Real-time object segmentation for AR/VR, video compositing, and post-production tools.
- Use as input segmentation component for downstream AI systems, such as video generation or editing models.
Safety, ethics, and usage notes
- The model is a research-oriented tool intended for segmentation tasks. Users should consider privacy, consent, and licensing when applying segmentation results to media featuring real people or proprietary content.
How to try
- Try interactive segmentation by selecting an object in a video frame and tracking it across frames, with the option to refine masks via additional prompts (a short refinement sketch follows).
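Continuing the video sketch from the “How it works” section, where the video predictor and per-session state were created, a corrective prompt on a later frame might look like the following; the frame index and click coordinates are illustrative assumptions.

```python
import numpy as np
import torch

# Assumes `predictor` and `state` from the earlier video sketch are still in scope.
with torch.inference_mode():
    # Suppose the mask drifted on a later frame: add a corrective negative click
    # (label 0 = background) to trim the prediction on that frame.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=90,
        obj_id=1,
        points=np.array([[300, 120]], dtype=np.float32),
        labels=np.array([0], dtype=np.int32),
    )

    # Re-propagate so the correction is reflected across the rest of the video.
    for frame_idx, obj_ids, masks in predictor.propagate_in_video(state):
        pass  # consume the refined masks frame by frame
```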
Feature Highlights
- Unified segmentation for images and videos in a single model
- Real-time streaming inference enabling interactive video segmentation
- Per-session memory to track objects across video frames
- Robust zero-shot performance on unfamiliar objects and scenes
- Interactive corrections with frame-wise prompts
- Open releases: pretrained SAM 2 model, SA-V dataset, demo, and code
- Extensive training data with broad geographic diversity