ChatTTS: Text-to-Speech For Chat is a voice generation model optimized for conversational scenarios. It is designed to produce natural, fluid speech for dialogue tasks typical of large language model (LLM) assistants, as well as for applications such as conversational audio and video introductions. The model supports English and Chinese and is trained on a large, diverse dataset to deliver high-quality, natural-sounding speech. The project emphasizes ease of use, multi-language support, and openness through planned open-source baselines.
How ChatTTS Works
- Multi-language support (English and Chinese) enables usage across diverse audiences.
- Trained on approximately 100,000 hours of Chinese and English data to achieve natural speech quality.
- Tailored for dialogue tasks, providing coherent, context-aware voice responses in conversations.
- Plans to open-source a base model trained on 40,000 hours of data to foster research and collaboration.
- Focus on controllability and safety, including audio watermarking and integration with LLMs for reliable deployment.
- Simple input: plain text only, converted directly into speech audio files.
How to Use ChatTTS
- Download from GitHub: clone the repository (example: git clone https://github.com/2noise/ChatTTS).
- Install dependencies (e.g., torch and ChatTTS via pip).
- Import required libraries (torch, ChatTTS, and Audio from IPython.display).
- Initialize ChatTTS and load pre-trained models.
- Prepare your text input(s).
- Generate speech with the infer method (use_decoder option can be enabled).
- Play or save the generated audio using standard audio playback tools.
- Refer to the example script provided in the project for quick setup; a minimal sketch based on these steps follows below.
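Based on the steps above and the project's example script, a minimal usage sketch might look like the following. The sample text, output filename, and 24 kHz sample rate are illustrative assumptions, and method names such as load_models can differ between releases:

```python
import torch
import torchaudio
import ChatTTS
from IPython.display import Audio

# Initialize ChatTTS and load the pre-trained models
# (older releases expose load_models(); newer ones use load()).
chat = ChatTTS.Chat()
chat.load_models()

# Prepare the text input(s) to synthesize.
texts = ["Hello, this is a quick ChatTTS demo."]  # illustrative text

# Generate speech; the use_decoder option can be enabled as noted above.
wavs = chat.infer(texts, use_decoder=True)

# Play inline in a notebook (assuming 24 kHz output audio) ...
Audio(wavs[0], rate=24_000, autoplay=True)

# ... or save the waveform to disk with standard audio tooling.
torchaudio.save("output.wav", torch.from_numpy(wavs[0]).reshape(1, -1), 24_000)
```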
Use Cases
- Conversational tasks for LLM assistants.
- Generating dialogue speech for video intros or educational content.
- Any application requiring natural, dynamic speech synthesis in Chinese or English.
- Potential integration into web, mobile, desktop, or embedded environments by exposing the model behind an API or service layer (see the sketch after this list).
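As one illustration of the last point, the Python library could be wrapped in a small HTTP service that other environments call over the network. This is a hypothetical sketch, not something the project ships: the FastAPI framework, the /tts route, and the 24 kHz sample rate are all assumptions here.

```python
# Hypothetical service wrapper; FastAPI and the /tts route are assumptions,
# not part of the ChatTTS project itself.
import io

import torch
import torchaudio
import ChatTTS
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
chat = ChatTTS.Chat()
chat.load_models()  # load pre-trained weights once at startup

@app.get("/tts")
def tts(text: str):
    # Synthesize the requested text and stream back a WAV file.
    wavs = chat.infer([text], use_decoder=True)
    buf = io.BytesIO()
    torchaudio.save(buf, torch.from_numpy(wavs[0]).reshape(1, -1), 24_000, format="wav")
    buf.seek(0)
    return StreamingResponse(buf, media_type="audio/wav")
```

Run behind a server such as uvicorn, a client in any environment (web, mobile, desktop, embedded) could then fetch audio with a plain HTTP GET like /tts?text=Hello.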
Language and Data Details
- Languages: English and Chinese.
- Training data: ~100,000 hours of Chinese and English speech.
- Open-source plans: base model trained on ~40,000 hours of data planned for release to researchers and developers.
Safety, Customization, and Extensibility
- Open to customization via fine-tuning with user datasets for specific voices or domains.
- Controllability enhancements, including watermarking, to improve safety and traceability when deployed with LLMs.
- Open-source baselines enable experimentation and improvement by the community.
Core Features
- Multi-language (English and Chinese) voice synthesis tailored for conversational tasks
- High-quality, natural-sounding speech due to large-scale training data (~100k hours)
- Open-source strategy with plans to release a base model trained on ~40k hours
- Easy integration into LLM-powered applications and conversational systems
- Controllability features and potential watermarks for safer deployments
- Simple, text-to-speech workflow: input text, generate audio, and playback/save
- Cross-environment compatibility (web, mobile, desktop, embedded) when the model is exposed behind an API or service layer