
Meta Segment Anything Model 2 Product Information

Meta Segment Anything Model 2 (SAM 2) – Unified Video and Image Segmentation

SAM 2 is Meta’s next-generation unified model for segmenting objects across images and videos. It extends the original Segment Anything Model (SAM) with a per-session memory module that enables robust, real-time interactive segmentation in video, while preserving the simple, promptable design of the original model.

Key capabilities

  • Segment any object in images or video frames using a click, box, or mask as input. You can select one or multiple objects within a frame and refine predictions with additional prompts (see the image-prompting sketch after this list).
  • Strong zero-shot performance on objects, images, and videos not seen during training, enabling wide real-world applicability.
  • Real-time streaming inference for interactive applications, including tracking objects across frames as the video plays.
  • State-of-the-art performance for both image and video segmentation, outperforming existing models in the literature and on benchmarks, especially at tracking object parts across frames.
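
To make the promptable interface above concrete, here is a minimal sketch of single-image segmentation from one foreground click, assuming the publicly released sam2 Python package, a downloaded checkpoint, and a CUDA device; the checkpoint/config file names and the input image path are example values that may differ between releases.

```python
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Example paths; use whichever checkpoint/config pair you downloaded.
checkpoint = "./checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"

predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.array(Image.open("photo.jpg").convert("RGB"))

with torch.inference_mode():
    predictor.set_image(image)
    # One positive click (label 1) at pixel (x=500, y=375) selects the target object.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,   # return several candidate masks with confidence scores
    )

best_mask = masks[np.argmax(scores)]  # (H, W) mask of the selected object
```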

How it works

  • Input prompts (click/box/mask) select targets on an image or a video frame.
  • For video, SAM 2 uses a per-session memory module that captures information about the target object, allowing tracking across subsequent frames, even if the object temporarily disappears.
  • Corrections can be made by providing extra prompts on any frame to refine the mask.
  • A streaming architecture processes video frames one at a time, enabling real-time results (see the video-prompting sketch after this list).
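
The following sketch shows this video workflow with the video predictor from the publicly released sam2 package: a click on the first frame defines a target object, and the per-session state is then propagated through the remaining frames. The checkpoint/config names, the frames directory, and the click coordinates are example values; method names reflect the public repository at the time of writing.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

checkpoint = "./checkpoints/sam2_hiera_large.pt"   # example paths
model_cfg = "sam2_hiera_l.yaml"

predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode():
    # The session state holds the per-session memory for this video
    # (a directory of extracted frames in the public repo).
    state = predictor.init_state(video_path="./video_frames")

    # A single positive click (label 1) on frame 0 defines object 1.
    predictor.add_new_points_or_box(
        state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Stream through the video one frame at a time, collecting masks as they arrive.
    video_masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        video_masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```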

Model architecture highlights

  • Extends SAM’s promptable capability to the video domain with a memory module that maintains object context across frames.
  • When applied to images, the memory module is empty and SAM 2 behaves like the original SAM.
  • Video segmentation leverages the per-session memory to maintain tracking, with a memory attention mechanism that conditions each frame's prediction on past frames (a schematic sketch follows this list).
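
Purely as an illustration of this design (not the released SAM 2 code), the sketch below shows how a streaming, memory-conditioned loop can be organized: encode the current frame, attend to a bounded memory bank of past predictions, decode with any prompts, then write the new prediction back to memory. All class and method names here are hypothetical placeholders.

```python
from collections import deque

class StreamingSegmenter:
    """Hypothetical schematic of a streaming, memory-conditioned segmenter."""

    def __init__(self, max_memories: int = 7):
        # Bounded per-session memory bank of recent frame predictions.
        self.memory_bank = deque(maxlen=max_memories)

    def encode(self, frame):
        # Stand-in for the image encoder.
        return {"frame_features": frame}

    def attend_to_memory(self, features):
        # Stand-in for memory attention: condition the current frame's features
        # on stored memories. With an empty memory bank (single images), this is
        # effectively a no-op and the model behaves like a per-image segmenter.
        return {"conditioned": features, "memories": list(self.memory_bank)}

    def decode(self, conditioned, prompts):
        # Stand-in for the prompt-aware mask decoder.
        return {"mask": conditioned, "prompts": prompts}

    def segment_frame(self, frame, prompts=None):
        features = self.encode(frame)
        conditioned = self.attend_to_memory(features)
        prediction = self.decode(conditioned, prompts)
        self.memory_bank.append(prediction)   # remember this frame's result
        return prediction

# Frames are processed one at a time, so results are available as the video streams.
segmenter = StreamingSegmenter()
for frame in ["frame0", "frame1", "frame2"]:
    _ = segmenter.segment_frame(frame, prompts=None)
```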

Data, training, and openness

  • Trained on a large and diverse video dataset (SA-V) created by interactive use of SAM 2 in a model-in-the-loop setup.
  • Dataset details: over 600k masklets across roughly 51k videos from 47 countries, with annotations for whole objects, parts, and challenging occlusions.
  • Open access: pretrained SAM 2 model, the SA-V dataset, a demo, and code are publicly released to enable research and development.
  • The dataset and model releases emphasize transparency, geographic diversity, and fairness considerations.

Applications and potential use cases

  • Fine-grained video editing, object tracking, and content creation.
  • Real-time object segmentation for AR/VR, video compositing, and post-production tools (see the compositing sketch after this list).
  • Use as an input segmentation component for downstream AI systems, such as video generation or editing models.
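
As one hypothetical downstream example, a per-frame object mask from a segmentation model such as SAM 2 can drive a simple compositing step. The sketch below pastes a masked object onto a new background with NumPy; the array shapes and variable names are assumptions for illustration.

```python
import numpy as np

def composite(frame: np.ndarray, background: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Paste the masked object from `frame` onto `background`.

    `frame` and `background` are (H, W, 3) uint8 images; `mask` is a boolean
    (H, W) array, e.g. a per-frame segmentation mask.
    """
    alpha = mask[..., None].astype(np.float32)   # (H, W, 1) alpha channel
    out = alpha * frame.astype(np.float32) + (1.0 - alpha) * background.astype(np.float32)
    return out.astype(np.uint8)

# Example with synthetic data standing in for a real frame and mask.
frame = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
background = np.zeros_like(frame)
mask = np.zeros((480, 640), dtype=bool)
mask[100:300, 200:400] = True
result = composite(frame, background, mask)
```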

Safety, ethics, and usage notes

  • The model is a research-oriented tool intended for segmentation tasks. Users should consider privacy, consent, and licensing when applying segmentation results to media featuring real people or proprietary content.

How to try

  • Try interactive segmentation by selecting an object in a video frame and tracking it across frames, refining the masks with additional prompts where needed (a minimal end-to-end sketch follows).
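
The sketch below walks through that end-to-end flow with the video predictor from the publicly released sam2 package: a box prompt on the first frame selects the object, a corrective negative click on a later frame refines it, and the masks are then propagated through the video. The checkpoint/config names, the frames directory, frame indices, and coordinates are example values, and method names reflect the public repository at the time of writing.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Example checkpoint/config names; use whichever pair you downloaded.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt")

with torch.inference_mode():
    state = predictor.init_state(video_path="./video_frames")  # directory of frames

    # Initial selection: a box prompt (x0, y0, x1, y1) on frame 0 defines object 1.
    predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1,
        box=np.array([150, 80, 420, 360], dtype=np.float32),
    )

    # Refinement: a negative click (label 0) on frame 60 removes a region that was
    # wrongly included in object 1's mask; a positive click (label 1) would add a
    # missed region instead.
    predictor.add_new_points_or_box(
        state, frame_idx=60, obj_id=1,
        points=np.array([[330, 120]], dtype=np.float32),
        labels=np.array([0], dtype=np.int32),
    )

    # Propagate so both the initial prompt and the correction shape the final masks.
    masks_per_frame = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks_per_frame[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```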

Feature Highlights

  • Unified segmentation for images and videos in a single model
  • Real-time streaming inference enabling interactive video segmentation
  • Per-session memory to track objects across video frames
  • Robust zero-shot performance on unfamiliar objects and scenes
  • Interactive corrections with frame-wise prompts
  • Open releases: pretrained SAM 2 model, SA-V dataset, demo, and code
  • Extensive training data with broad geographic diversity