Lilac Product Information

Lilac ("Better data, better AI") is a data platform for searching, quantifying, and editing data for LLMs. It enables fast dataset computations, semantic and keyword search, PII and duplicate detection, language detection, custom signals, and fuzzy-concept search with refinement. Lilac emphasizes fast data processing, high-quality data selection, and democratized data sharing across organizations.


Overview

  • Lilac provides tools for searching, quantifying, and editing datasets used for large language models (LLMs).
  • It offers clustering, dataset embedding at high throughput, and rapid data transformations to accelerate data preparation pipelines.
  • The platform is trusted by teams for data quality evaluation, dataset understanding, and task-specific data selection.

How to Use Lilac

  1. Install the Python package: pip install lilac
  2. Launch the Python UI and use it to interact with your datasets (a minimal launch sketch follows this list).
  3. Get started quickly: follow guided workflows to search, cluster, and refine data for LLM tasks.
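A minimal sketch of those first two steps in Python, assuming Lilac's documented set_project_dir and start_server entry points; the project directory path is a placeholder.

    # Install first (shell): pip install lilac
    import lilac as ll

    # Placeholder path: point this wherever Lilac should store project data.
    ll.set_project_dir('~/my_lilac_project')

    # Start the local web UI; open the printed URL in a browser to explore datasets.
    ll.start_server()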

Key Capabilities

  • Search and quantify data for LLMs
  • Semantic & keyword search for precise data retrieval (see the search sketch after this list)
  • Edit & compare fields to reconcile dataset differences
  • Detect PII, duplicates, language, or custom signals
  • Fuzzy-concept search with refinement for nuanced data retrieval
  • Blazing-fast dataset computations: cluster and title 1 million data points in 20 minutes
  • High-throughput embeddings: embed your dataset at half a billion tokens per minute
  • Accelerate your own data transformations with scalable processing
  • Quick-start demos and documentation to onboard teams rapidly
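To illustrate the search capabilities above, here is a hedged sketch of keyword and semantic search over a text field with Lilac's Python API. The dataset ('local/imdb'), field path ('text'), and embedding name ('gte-small') are placeholder assumptions, and exact signatures may differ between versions; check the Lilac docs for your release.

    import lilac as ll

    dataset = ll.get_dataset('local', 'imdb')  # placeholder namespace/name

    # Keyword search over the 'text' field.
    rows = dataset.select_rows(
        columns=['text'],
        searches=[ll.KeywordSearch(path='text', query='great acting')],
        limit=10)

    # Semantic search needs an embedding computed for the field first.
    dataset.compute_embedding('gte-small', 'text')
    rows = dataset.select_rows(
        columns=['text'],
        searches=[ll.SemanticSearch(path='text', query='films about friendship',
                                    embedding='gte-small')],
        limit=10)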

Use Cases

  • Data quality evaluation pipelines
  • Dataset understanding and topic discovery
  • Selecting the right data for a given AI task
  • Democratizing datasets across the organization for broader collaboration

Testimonials

  • Jonathan Talmi, Lead of Data Acquisition: “Lilac is an incredibly powerful tool for data exploration and quality control. We use Lilac daily to inspect and evaluate datasets, and then democratize them across the org. It is a critical part of our data quality evaluation pipeline.”
  • Jonathan Frankle, Chief Neural Network Scientist: “Lilac provides a simple path to understanding the concepts in datasets and selecting the right data for a task.”
  • Teknium, Co-founder of NousResearch: “Everyone working with LLM Datasets should check out @lilac_ai data platform… Their clustering helped determine a lot of topics Hermes-2.5 covers today.”

How It Works

  • Install and configure Lilac in your environment.
  • Ingest your datasets and run semantic/keyword searches to locate relevant data points.
  • Use clustering and embeddings to organize and compare data at scale (sketched after this list).
  • Apply edits and signals to refine your dataset, improving downstream LLM performance.
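The workflow above in a short, hedged sketch: ingest a dataset from Hugging Face, then cluster a text field so related rows are grouped and titled for exploration. The dataset name and field path are placeholders, and the DatasetConfig / create_dataset / cluster calls follow Lilac's documented API as I understand it; verify the signatures for your version.

    import lilac as ll

    # Ingest: create a Lilac dataset backed by a Hugging Face dataset (placeholder name).
    config = ll.DatasetConfig(
        namespace='local',
        name='imdb',
        source=ll.HuggingFaceSource(dataset_name='imdb'))
    dataset = ll.create_dataset(config)

    # Organize: cluster the 'text' field; clusters are titled automatically
    # and can be browsed in the UI to compare and curate data at scale.
    dataset.cluster('text')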

Safety and Legal Considerations

  • Ensure proper handling of sensitive data (PII) and comply with your organization’s data governance policies when using Lilac.
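For example, before sharing a dataset you might surface PII with Lilac's built-in PII signal. A hedged sketch: the dataset name and field path are placeholders, and compute_signal / PIISignal follow the documented API as I understand it.

    import lilac as ll

    dataset = ll.get_dataset('local', 'imdb')  # placeholder

    # Annotate the 'text' field with detected PII (emails, secrets, etc.),
    # then review flagged rows before redacting, filtering, or sharing.
    dataset.compute_signal(ll.PIISignal(), 'text')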

Core Features

  • Get Started: Quick installation and onboarding for rapid value
  • Search, Quantify, and Edit Data for LLMs: End-to-end data preparation workflow
  • Semantic & Keyword Search: Flexible retrieval across large datasets
  • Edit & Compare Fields: Reconcile and harmonize data attributes
  • PII, Duplicate, and Language Detection, or Custom Signals: Robust data quality checks
  • Fuzzy-Concept Search with Refinement: Nuanced data discovery
  • Blazing-fast Dataset Computations: Cluster and title 1M data points in 20 minutes
  • High-Throughput Embeddings: Half a billion tokens per minute
  • Accelerated Data Transformations: Scalable data processing pipelines
  • Python Integration: Simple pip installation and Python UI for developers

Getting Started

  • Install: pip install lilac
  • Launch: Access the Python UI and begin exploring datasets
  • Learn More: Explore docs and demos to maximize data quality and task-specific data selection