Lilac ("Better data, better AI") is a data platform for searching, quantifying, and editing data for LLMs. It supports fast dataset computations, semantic and keyword search, detection of PII, duplicates, and language, custom signals, and fuzzy-concept search with refinement. Lilac emphasizes fast data processing, high-quality data selection, and democratized data sharing across organizations.
Overview
- Lilac provides tools for searching, quantifying, and editing datasets used for large language models (LLMs).
- It offers clustering, dataset embedding at high throughput, and rapid data transformations to accelerate data preparation pipelines.
- The platform is trusted by teams for data quality evaluation, dataset understanding, and task-specific data selection.
How to Use Lilac
- Install the Python package.
pip install lilac
- Launch the Python UI. Use it to interact with your datasets.
- Get started quickly. Follow guided workflows to search, cluster, and refine data for LLM tasks.
Key Capabilities
- Search and quantify data for LLMs
- Semantic & keyword search for precise data retrieval
- Edit & compare fields to reconcile dataset differences
- Detect PII, duplicates, language, or custom signals
- Fuzzy-concept search with refinement for nuanced data retrieval
- Blazing-fast dataset computations: cluster and title 1 million data points in 20 minutes
- High-throughput embeddings: embed your dataset at half a billion tokens per minute
- Accelerate your own data transformations with scalable processing
- Quick-start demos and documentation to onboard teams rapidly
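The search capabilities above can be illustrated independently of Lilac itself. Below is a minimal, self-contained sketch of keyword search and cosine-similarity ("semantic") search over a toy corpus; the bag-of-words vectors stand in for real embeddings, and none of these function names come from Lilac's API.

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words vector: token -> count (a toy stand-in for an embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

corpus = [
    "how to fine-tune a language model",
    "recipe for chocolate cake",
    "evaluating language model outputs",
]

def keyword_search(query, docs):
    """Exact keyword match: return docs containing every query term."""
    terms = query.lower().split()
    return [d for d in docs if all(t in d.lower() for t in terms)]

def semantic_search(query, docs):
    """Rank docs by cosine similarity to the query vector."""
    q = bow(query)
    return sorted(docs, key=lambda d: cosine(q, bow(d)), reverse=True)

print(keyword_search("language model", corpus))
print(semantic_search("language model evaluation", corpus)[0])
```

Keyword search only finds literal matches; the similarity ranking also surfaces the evaluation document for the query "language model evaluation", which is the gap fuzzy-concept search fills at scale.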
Use Cases
- Data quality evaluation pipelines
- Dataset understanding and topic discovery
- Selecting the right data for a given AI task
- Democratizing datasets across the organization for broader collaboration
Testimonials
- Jonathan Talmi, Lead of Data Acquisition: “Lilac is an incredibly powerful tool for data exploration and quality control. We use Lilac daily to inspect and evaluate datasets, and then democratize them across the org. It is a critical part of our data quality evaluation pipeline.”
- Jonathan Frankle, Chief Neural Network Scientist: “Lilac provides a simple path to understanding the concepts in datasets and selecting the right data for a task.”
- Teknium, Co-founder of NousResearch: “Everyone working with LLM Datasets should check out @lilac_ai data platform…Their clustering helped determine a lot of topics Hermes-2.5 covers today.”
How It Works
- Install and configure Lilac in your environment.
- Ingest your datasets and run semantic/keyword searches to locate relevant data points.
- Use clustering and embedding to organize and compare data at scale.
- Apply edits and signals to refine your dataset, improving downstream LLM performance.
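The last step, applying signals, can be pictured as per-row enrichment: a signal computes a derived value for each row, stored alongside the original field. The sketch below implements a hypothetical PII-style signal, a regex email detector, and applies it row by row. It is only an illustration under that assumption; Lilac's actual PII detection covers far more than email addresses, and no name here is taken from its API.

```python
import re

# Hypothetical signal: flags rows whose text contains an email address.
# Real PII detection covers many more entity types; this is only a sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def email_signal(text):
    """Return the list of email-like spans found in a row's text."""
    return EMAIL_RE.findall(text)

rows = [
    {"text": "Contact me at jane.doe@example.com for the dataset."},
    {"text": "No personal information here."},
]

# Enrich each row with the signal's output, mirroring how a computed
# signal adds a derived column next to the original field.
for row in rows:
    row["emails"] = email_signal(row["text"])

print(rows[0]["emails"])  # ['jane.doe@example.com']
```

Once the derived column exists, refinement is a filter: keep, edit, or drop rows where the signal fired before the data reaches LLM training.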
Safety and Legal Considerations
- Ensure proper handling of sensitive data (PII) and comply with your organization’s data governance policies when using Lilac.
Core Features
- Get Started: Quick installation and onboarding for rapid value
- Search, Quantify, and Edit Data for LLMs: End-to-end data preparation workflow
- Semantic & Keyword Search: Flexible retrieval across large datasets
- Edit & Compare Fields: Reconcile and harmonize data attributes
- PII, Duplicate, and Language Detection, plus Custom Signals: Robust data quality checks
- Fuzzy-Concept Search with Refinement: Nuanced data discovery
- Blazing-fast Dataset Computations: Cluster and title 1M data points in 20 minutes
- High-Throughput Embeddings: Half a billion tokens per minute
- Accelerated Data Transformations: Scalable data processing pipelines
- Python Integration: Simple pip installation and Python UI for developers
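One of the quality checks listed above, duplicate detection, can be sketched with token-set Jaccard similarity: two rows whose normalized token sets overlap beyond a threshold are flagged as near-duplicates. This is a generic sketch of the idea under that simplification, not Lilac's algorithm, which works at much larger scale.

```python
def tokens(text):
    """Normalize a row's text into a set of lowercase tokens."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def near_duplicates(rows, threshold=0.8):
    """Return index pairs of rows whose token overlap meets the threshold."""
    sets = [tokens(r) for r in rows]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs

rows = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog",
    "A completely different sentence about data quality.",
]
print(near_duplicates(rows))  # [(0, 1)]
```

Normalization makes the first two rows identical despite the trailing period, so only that pair is flagged; lowering the threshold trades precision for recall.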
Getting Started
- Install:
pip install lilac
- Launch: Access the Python UI and begin exploring datasets
- Learn More: Explore docs and demos to maximize data quality and task-specific data selection