ChatTTS: Text-to-Speech For Chat is a voice generation model optimized for conversational scenarios. It is designed to produce natural, fluid speech for dialogue tasks typical of large language model (LLM) assistants, as well as for applications such as conversational audio and video introductions. The model supports English and Chinese and is trained on a large, diverse dataset to deliver high-quality, natural-sounding speech. The project emphasizes ease of use, multi-language support, and openness through planned open-source baselines.
How ChatTTS Works
- Multi-language support (English and Chinese) enables usage across diverse audiences.
- Trained on approximately 100,000 hours of Chinese and English data to achieve natural speech quality.
- Tailored for dialogue tasks, providing coherent, context-aware voice responses in conversations.
- Plans to open-source a base model trained on 40,000 hours of data to foster research and collaboration.
- Focus on controllability and safety, including audio watermarking and integration with LLMs for reliable deployment.
- Simple input: plain text only, converted directly into speech audio files.
How to Use ChatTTS
- Download from GitHub: clone the repository (example: git clone https://github.com/2noise/ChatTTS).
- Install dependencies (e.g., torch and ChatTTS via pip).
- Import required libraries (torch, ChatTTS, and Audio from IPython.display).
- Initialize ChatTTS and load pre-trained models.
- Prepare your text input(s).
- Generate speech with the infer method (use_decoder option can be enabled).
- Play or save the generated audio using standard audio playback tools.
- Refer to the example script provided in the project for quick setup; a minimal sketch based on these steps follows below.
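Based on the steps above and the project's example script, a minimal usage sketch might look like the following. The sample text, output filename, and 24 kHz sample rate are illustrative assumptions, and method names such as load_models can differ between releases:

```python
import torch
import torchaudio
import ChatTTS
from IPython.display import Audio

# Initialize ChatTTS and load the pre-trained models
# (older releases expose load_models(); newer ones use load()).
chat = ChatTTS.Chat()
chat.load_models()

# Prepare the text input(s) to synthesize.
texts = ["Hello, this is a quick ChatTTS demo."]  # illustrative text

# Generate speech; the use_decoder option can be enabled as noted above.
wavs = chat.infer(texts, use_decoder=True)

# Play inline in a notebook (assuming 24 kHz output audio) ...
Audio(wavs[0], rate=24_000, autoplay=True)

# ... or save the waveform to disk with standard audio tooling.
torchaudio.save("output.wav", torch.from_numpy(wavs[0]).reshape(1, -1), 24_000)
```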
Use Cases
- Conversational tasks for LLM assistants.
- Generating dialogue speech for video intros or educational content.
- Any application requiring natural, dynamic speech synthesis in Chinese or English.
- Potential integration into web, mobile, desktop, or embedded environments by exposing the model behind an API or service layer (see the sketch after this list).
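As one illustration of the last point, the Python library could be wrapped in a small HTTP service that other environments call over the network. This is a hypothetical sketch, not something the project ships: the FastAPI framework, the /tts route, and the 24 kHz sample rate are all assumptions here.

```python
# Hypothetical service wrapper; FastAPI and the /tts route are assumptions,
# not part of the ChatTTS project itself.
import io

import torch
import torchaudio
import ChatTTS
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
chat = ChatTTS.Chat()
chat.load_models()  # load pre-trained weights once at startup

@app.get("/tts")
def tts(text: str):
    # Synthesize the requested text and stream back a WAV file.
    wavs = chat.infer([text], use_decoder=True)
    buf = io.BytesIO()
    torchaudio.save(buf, torch.from_numpy(wavs[0]).reshape(1, -1), 24_000, format="wav")
    buf.seek(0)
    return StreamingResponse(buf, media_type="audio/wav")
```

Run behind a server such as uvicorn, a client in any environment (web, mobile, desktop, embedded) could then fetch audio with a plain HTTP GET like /tts?text=Hello.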
Language and Data Details
- Languages: English and Chinese.
- Training data: ~100,000 hours of Chinese and English speech.
- Open-source plans: base model trained on ~40,000 hours of data planned for release to researchers and developers.
Safety, Customization, and Extensibility
- Open to customization via fine-tuning with user datasets for specific voices or domains.
- Controllability enhancements, including watermarking, to improve safety and traceability when deployed with LLMs.
- Open-source baselines enable experimentation and improvement by the community.
Core Features
- Multi-language (English and Chinese) voice synthesis tailored for conversational tasks
- High-quality, natural-sounding speech due to large-scale training data (~100k hours)
- Open-source strategy with plans to release a base model trained on ~40k hours
- Easy integration into LLM-powered applications and conversational systems
- Controllability features and potential watermarks for safer deployments
- Simple, text-to-speech workflow: input text, generate audio, and playback/save
- Cross-environment compatibility (web, mobile, desktop, embedded) when the model is exposed behind an API or service layer