const axios = require('axios');

const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/chatterbox-turbo-tts";

const data = {
  "text": "Introducing the AI-enhanced Skyline T-800. [laugh] It's got integrated AI for superior performance. Want to explore pricing?",
  "reference_audio": "https://segmind-resources.s3.amazonaws.com/output/70e9a9dd-e0f4-450f-a590-872996b44a01-chatterbox-turbo-input.mp3",
  "temperature": 0.7,
  "seed": 42,
  "min_p": 0,
  "top_p": 0.9,
  "top_k": 500,
  "repetition_penalty": 1.3,
  "norm_loudness": true
};

(async function() {
  try {
    const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
    console.log(response.data);
  } catch (error) {
    // error.response is undefined for network errors, so guard before reading it
    console.error('Error:', error.response ? error.response.data : error.message);
  }
})();

text: Enter text to synthesize into speech. Use sound tags like [chuckle] for expression.
reference_audio: Provide a reference audio URL for voice matching. Use a public link for seamless access.
temperature: Adjust creativity and expressiveness in speech. Use lower values for consistency, higher for diversity. (min: 0.05, max: 2)
seed: Set a random seed for reproducible results. Zero for random, a specific value for consistency.
min_p: Set the minimum token probability. Higher values filter out unlikely outputs. (min: 0, max: 1)
top_p: Control randomness via cumulative token probability. Lower for predictable, higher for varied outputs. (min: 0, max: 1)
top_k: Limit sampling to the top K most likely tokens. Lower for concise, higher for diverse expressions. (min: 0, max: 1000)
repetition_penalty: Penalty for repeated words. Higher values reduce redundancy in speech. (min: 1, max: 2)
norm_loudness: Normalize audio to standard loudness. Recommended for consistent volume across outputs.
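The numeric ranges above can be checked client-side before a request is sent, which avoids burning a call on a payload the API will reject. A minimal sketch; the `validateParams` helper and its messages are illustrative, not part of the Segmind API:

```javascript
// Documented ranges for Chatterbox-Turbo's numeric parameters
const RANGES = {
  temperature: [0.05, 2],
  min_p: [0, 1],
  top_p: [0, 1],
  top_k: [0, 1000],
  repetition_penalty: [1, 2],
};

// Return a list of human-readable problems; an empty list means the payload looks valid
function validateParams(data) {
  const problems = [];
  for (const [name, [min, max]] of Object.entries(RANGES)) {
    if (name in data && (data[name] < min || data[name] > max)) {
      problems.push(`${name}=${data[name]} is outside [${min}, ${max}]`);
    }
  }
  return problems;
}
```

For example, `validateParams({ temperature: 3 })` flags the out-of-range temperature, while the payload from the sample code above passes cleanly.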
To keep track of your credit usage, inspect the response headers of each API call. The x-remaining-credits header indicates the number of credits remaining in your account. Monitor this value to avoid disruptions in your API usage.
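With axios, response headers are available on `response.headers` (axios lowercases header names), so the credit count can be read after each call. A small sketch; the `creditsRemaining` helper name and the warning threshold are our own:

```javascript
// Read the x-remaining-credits header from an axios-style response object.
// Returns a number, or null when the header is absent.
function creditsRemaining(response) {
  const raw = response.headers && response.headers['x-remaining-credits'];
  return raw === undefined || raw === null ? null : Number(raw);
}

// Typical use after a successful request:
// const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
// const left = creditsRemaining(response);
// if (left !== null && left < 10) console.warn('Running low on Segmind credits');
```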
Chatterbox-Turbo is an open-source text-to-speech (TTS) model from Resemble AI designed for developers who need lightning-fast, high-quality speech synthesis without heavy compute requirements. With just 350 million parameters, this distilled model generates natural-sounding English speech in a single inference step—making it one of the most efficient TTS solutions for real-time applications. Unlike traditional TTS systems, Chatterbox-Turbo supports paralinguistic tags like [laugh], [cough], and [chuckle], enabling more expressive and human-like voice outputs. It also features voice cloning capabilities through reference audio, allowing you to match specific voice characteristics for consistent brand experiences or personalized interactions.
Use paralinguistic sound tags ([laugh], [sigh], [cough]) to add realism.
Effective text formatting: Embed sound tags naturally within sentences: "Welcome back! [chuckle] Let's dive in." works better than appending tags awkwardly.
Voice cloning guidance: Use clean reference audio (at least 3-5 seconds) with minimal background noise. Public URLs ensure seamless processing.
Parameter tuning:
Seed control: Set a fixed seed for reproducible outputs during testing; use random (seed=0) for production diversity.
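That seed policy can be captured in a small helper: pin the seed while iterating on a voice, then pass 0 in production so each generation varies. A sketch under the semantics described above; `buildPayload` and the base object are illustrative:

```javascript
// Build a request payload, overriding only the seed.
// seed = 0 asks the API for a random seed; any other value is reproducible.
function buildPayload(base, { reproducible = false, seed = 42 } = {}) {
  return { ...base, seed: reproducible ? seed : 0 };
}

const base = { text: "Welcome back! [chuckle] Let's dive in.", temperature: 0.7 };
const testPayload = buildPayload(base, { reproducible: true, seed: 42 }); // fixed seed for testing
const prodPayload = buildPayload(base);                                   // seed 0 for production diversity
```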
Is Chatterbox-Turbo open-source?
Yes. The model is fully open-source and available for both commercial and research projects without licensing restrictions.
How does voice cloning work?
Provide a reference audio URL (WAV format recommended). The model analyzes vocal characteristics like pitch, timbre, and cadence to match the target voice in generated speech.
What languages does it support?
Chatterbox-Turbo is optimized for English. While it may handle other languages, quality and accuracy are not guaranteed outside English inputs.
Can I use this for real-time applications?
Absolutely. The single-step generation architecture and low VRAM requirements make it ideal for live voice agents, streaming narration, and interactive systems.
What's the difference between Chatterbox-Turbo and other TTS models?
Chatterbox-Turbo prioritizes speed and efficiency without sacrificing quality. Its distilled 350M-parameter design runs faster than multi-billion-parameter models like Tortoise or Bark, while paralinguistic tag support adds expressiveness missing from simpler TTS systems.
How do I control speech emotion and style?
Use paralinguistic tags ([laugh], [sigh], [whisper]) embedded in your text. Adjust temperature for overall expressiveness—higher values create more dynamic delivery, while lower values maintain neutral tone.
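One way to combine the two controls is a small preset table mapping delivery style to temperature, with tags embedded directly in the text. The preset values below are our own starting points within the documented 0.05 to 2 range, not official recommendations:

```javascript
// Illustrative temperature presets within the documented [0.05, 2] range
const STYLE_PRESETS = {
  neutral: 0.3,    // flat, consistent narration
  balanced: 0.7,   // the value used in the sample request above
  expressive: 1.2, // more dynamic delivery; pairs well with [laugh]/[sigh] tags
};

// Build the text/temperature portion of a request for a given style
function styledRequest(text, style = 'balanced') {
  if (!(style in STYLE_PRESETS)) throw new Error(`Unknown style: ${style}`);
  return { text, temperature: STYLE_PRESETS[style] };
}

const req = styledRequest("That was close! [sigh] Let's try again.", 'expressive');
```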