POST
javascript
const axios = require('axios');

const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/chatterbox-turbo-tts";

const data = {
  "text": "Introducing the AI-enhanced Skyline T-800. [laugh] It's got integrated AI for superior performance. Want to explore pricing?",
  "reference_audio": "https://segmind-resources.s3.amazonaws.com/output/70e9a9dd-e0f4-450f-a590-872996b44a01-chatterbox-turbo-input.mp3",
  "temperature": 0.7,
  "seed": 42,
  "min_p": 0,
  "top_p": 0.9,
  "top_k": 500,
  "repetition_penalty": 1.3,
  "norm_loudness": true
};

(async function() {
  try {
    const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
    console.log(response.data);
  } catch (error) {
    // error.response is undefined for network-level failures, so guard before reading it
    console.error('Error:', error.response ? error.response.data : error.message);
  }
})();
RESPONSE
image/jpeg
HTTP Response Codes
200 - OK : Output generated
401 - Unauthorized : User authentication failed
404 - Not Found : The requested URL does not exist
405 - Method Not Allowed : The requested HTTP method is not allowed
406 - Not Acceptable : Not enough credits
500 - Server Error : Server had some issue with processing
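For quick debugging, the status-code table above can be mirrored in a small lookup helper. This is an illustrative sketch: the message strings are paraphrased from the table and are not returned by the API itself.

```javascript
// Illustrative lookup of the documented Segmind status codes.
const STATUS_MESSAGES = {
  200: 'OK: output generated',
  401: 'Unauthorized: user authentication failed',
  404: 'Not Found: the requested URL does not exist',
  405: 'Method Not Allowed: the requested HTTP method is not allowed',
  406: 'Not Acceptable: not enough credits',
  500: 'Server Error: server had some issue with processing',
};

function describeStatus(code) {
  return STATUS_MESSAGES[code] || `Unexpected status code: ${code}`;
}
```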

Attributes


text (str, required)

Enter text to synthesize into speech. Use sound tags like [chuckle] for expression.


reference_audio (str, required)

Provide a reference audio URL for voice matching. Use a public link for seamless access.


temperature (float, default: 0.7)

Adjust creativity and expressiveness in speech. Use lower for consistency, higher for diversity.

min: 0.05, max: 2


seed (int, default: 42)

Set random seed for reproducible results. Zero for random, specific value for consistency.


min_p (float, default: 1)

Set minimum token probability. Higher values filter out unlikely outputs.

min: 0, max: 1


top_p (float, default: 0.9)

Control randomness via cumulative probability tokens. Lower for predictable, higher for varied outputs.

min: 0, max: 1


top_k (int, default: 500)

Limit to top K likely tokens. Lower for concise, higher for diverse expressions.

min: 0, max: 1000


repetition_penalty (float, default: 1.3)

Penalty for repeated words. Higher values reduce redundancy in speech.

min: 1, max: 2


norm_loudness (bool, default: true)

Normalize audio to standard loudness. Recommended for consistent volume across outputs.

To keep track of your credit usage, you can inspect the response headers of each API call. The x-remaining-credits property will indicate the number of remaining credits in your account. Ensure you monitor this value to avoid any disruptions in your API usage.
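As a sketch, the x-remaining-credits header can be read from the response like this. It assumes axios-style headers (lowercased keys); the `warnAt` threshold is an illustrative convenience, not an API feature.

```javascript
// Parse the x-remaining-credits header from an axios-style headers object.
function remainingCredits(headers, warnAt = 10) {
  const raw = headers['x-remaining-credits'];
  if (raw === undefined) return null;  // header absent from the response
  const credits = Number(raw);
  if (credits <= warnAt) {
    console.warn(`Low on credits: ${credits} remaining`);
  }
  return credits;
}
```

Checking this after every call lets you alert or throttle before a 406 (not enough credits) interrupts production traffic.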

Chatterbox-Turbo: Fast Text-to-Speech Model with Voice Cloning

What is Chatterbox-Turbo?

Chatterbox-Turbo is an open-source text-to-speech (TTS) model from Resemble AI designed for developers who need lightning-fast, high-quality speech synthesis without heavy compute requirements. With just 350 million parameters, this distilled model generates natural-sounding English speech in a single inference step—making it one of the most efficient TTS solutions for real-time applications. Unlike traditional TTS systems, Chatterbox-Turbo supports paralinguistic tags like [laugh], [cough], and [chuckle], enabling more expressive and human-like voice outputs. It also features voice cloning capabilities through reference audio, allowing you to match specific voice characteristics for consistent brand experiences or personalized interactions.

Key Features

  • Ultra-low latency: Single-step audio generation optimized for real-time conversational agents
  • Voice cloning: Match any voice using a reference audio URL for consistent character voices
  • Paralinguistic control: Built-in support for emotional tags ([laugh], [sigh], [cough]) to add realism
  • Lightweight architecture: 350M parameters require less VRAM than comparable models
  • High-fidelity output: Natural prosody and intonation without quality compromise
  • Open-source: Fully accessible for commercial and research applications

Best Use Cases

  • Voice agents and chatbots: Power customer service bots with natural, low-latency responses
  • Interactive media: Build dynamic narration for games, virtual assistants, and AR/VR experiences
  • Content creation: Generate audiobooks, podcast drafts, or video voiceovers at scale
  • Accessibility tools: Create screen readers and navigation aids with consistent voice quality
  • E-learning platforms: Produce course narration with expressive, engaging delivery
  • Brand voice consistency: Clone corporate voices for product demos and marketing content

Prompt Tips and Output Quality

Effective text formatting: Embed sound tags naturally within sentences—"Welcome back! [chuckle] Let's dive in." works better than appending tags awkwardly.

Voice cloning guidance: Use clean reference audio (at least 3-5 seconds) with minimal background noise. Public URLs ensure seamless processing.

Parameter tuning:

  • Temperature (0.05–2.0): Lower values (0.5–0.7) yield consistent, predictable speech; higher values (1.0–1.5) add expressiveness and variation
  • Top P/Top K: Keep defaults (0.9 / 500) for balanced outputs; reduce for more conservative pronunciation
  • Repetition penalty (1.0–2.0): Increase to 1.5+ if the model repeats phrases unnaturally
  • Normalize loudness: Always enable for production use to maintain consistent volume

Seed control: Set a fixed seed for reproducible outputs during testing; use random (seed=0) for production diversity.
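The tuning guidance above can be bundled into presets. The values below follow the ranges suggested in this section but are illustrative starting points, not API defaults; `ttsPreset` is a hypothetical helper name.

```javascript
// Illustrative sampling presets derived from the tuning guidance above.
function ttsPreset(style) {
  const base = { top_p: 0.9, top_k: 500, repetition_penalty: 1.3, norm_loudness: true };
  switch (style) {
    case 'consistent':   // predictable narration, reproducible across runs
      return { ...base, temperature: 0.6, seed: 42 };
    case 'expressive':   // dynamic delivery, varied takes each run
      return { ...base, temperature: 1.2, seed: 0 };
    default:             // documented defaults
      return { ...base, temperature: 0.7, seed: 42 };
  }
}
```

Spread the preset into your request body alongside `text` and `reference_audio`, and switch presets per use case (e.g. "consistent" for audiobooks, "expressive" for character voices).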

FAQs

Is Chatterbox-Turbo open-source?
Yes. The model is fully open-source and available for both commercial and research projects without licensing restrictions.

How does voice cloning work?
Provide a reference audio URL (WAV format recommended). The model analyzes vocal characteristics like pitch, timbre, and cadence to match the target voice in generated speech.

What languages does it support?
Chatterbox-Turbo is optimized for English. While it may handle other languages, quality and accuracy are not guaranteed outside English inputs.

Can I use this for real-time applications?
Absolutely. The single-step generation architecture and low VRAM requirements make it ideal for live voice agents, streaming narration, and interactive systems.

What's the difference between Chatterbox-Turbo and other TTS models?
Chatterbox-Turbo prioritizes speed and efficiency without sacrificing quality. Its distilled 350M-parameter design runs faster than multi-billion-parameter models like Tortoise or Bark, while paralinguistic tag support adds expressiveness missing from simpler TTS systems.

How do I control speech emotion and style?
Use paralinguistic tags ([laugh], [sigh], [whisper]) embedded in your text. Adjust temperature for overall expressiveness—higher values create more dynamic delivery, while lower values maintain neutral tone.
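Putting the FAQ guidance together, a request body for expressive delivery might be assembled like this. This is a hypothetical builder, not part of any SDK; the reference-audio URL is a placeholder you would supply.

```javascript
// Hypothetical builder for an expressive Chatterbox-Turbo request body.
function expressivePayload(text, referenceAudio) {
  return {
    text,                       // may embed tags like [laugh] or [sigh] mid-sentence
    reference_audio: referenceAudio,
    temperature: 1.2,           // higher temperature for more dynamic delivery
    min_p: 0,
    top_p: 0.9,
    top_k: 500,
    repetition_penalty: 1.3,
    norm_loudness: true,
    seed: 0,                    // random seed for varied takes
  };
}
```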