const axios = require('axios');
const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/tts-elevenlabs-with-timestamps";
const data = {
"prompt": "Explore the depths of the ocean with us, where mystery and wonder await.",
"voice_id": "21m00Tcm4TlvDq8ikWAM",
"model_id": "eleven_multilingual_v2",
"stability": 0.7,
"use_speaker_boost": true,
"similarity_boost": 0.85,
"style": 0.3,
"speed": 1.1,
"seed": 12345678,
"apply_text_normalization": "auto",
"apply_language_text_normalization": false
};
(async function() {
try {
const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
console.log(response.data);
} catch (error) {
console.error('Error:', error.response ? error.response.data : error.message);
}
})();

prompt: Text for generating audio. Use short phrases for quick outputs and longer text for detailed responses.
voice_id: Set a voice ID for personalized output. Use the default for standard output or a specific ID for a custom voice.
model_id: Choose a TTS model. Use 'multilingual' for global languages and 'turbo' for faster processing.
language_code: Specify the output language with an ISO 639-1 code. Set 'en' for English, or leave the default for automatic detection.
stability: Controls the emotional range of the voice. Use lower values for more emotion or higher values for a consistent tone. (min: 0, max: 1)
use_speaker_boost: Enhances similarity to the original voice. Enable for maximum likeness; disable for a more neutral tone.
similarity_boost: Adjusts voice similarity to the original. Higher values favor likeness, lower values allow more variability. (min: 0, max: 1)
style: Sets style exaggeration. Use low values for a subtle, natural sound or high values for a dramatic effect. (min: 0, max: 1)
speed: Adjusts speech pace. Set lower for clarity, higher for quick delivery. (min: 0.25, max: 4)
seed: Ensures repeatable outputs. Use 0 for randomization or set a fixed number for consistency. (min: 0, max: 4294967295)
apply_text_normalization: Handles text normalization. Set 'auto' for default behavior, 'on' to force normalization, or 'off' to use the text as is. Allowed values: 'auto', 'on', 'off'.
apply_language_text_normalization: Enables language-specific text normalization where supported, currently mostly for Japanese. Use the default otherwise.
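As a minimal sketch, a request can carry just the text and a voice, assuming the remaining tuning parameters fall back to service defaults (an assumption, since the list of required fields is not shown here; it reuses url and api_key from the example above):

// Minimal request body sketch (assumes omitted tuning parameters use service defaults)
const minimalData = {
  "prompt": "Welcome aboard. Let's begin the tour.",
  "voice_id": "21m00Tcm4TlvDq8ikWAM"
};
axios.post(url, minimalData, { headers: { 'x-api-key': api_key } })
  .then(res => console.log(res.data))
  .catch(err => console.error(err.response ? err.response.data : err.message));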
To keep track of your credit usage, you can inspect the response headers of each API call. The x-remaining-credits property will indicate the number of remaining credits in your account. Ensure you monitor this value to avoid any disruptions in your API usage.
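As a sketch, the header can be read straight off the axios response object (this reuses url, data, and api_key from the example above; axios exposes response header names in lower case):

// Log remaining credits after a call (reuses url, data, api_key from the example above)
(async function() {
  const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
  console.log('Remaining credits:', response.headers['x-remaining-credits']);
})();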
ElevenLabs Text-to-Speech (TTS) is a suite of advanced AI models that transform written text into highly realistic, emotionally expressive spoken audio. Unlike traditional TTS systems that sound robotic or flat, ElevenLabs uses deep learning to generate voices with natural prosody, emotion, and cultural nuance. The platform offers multiple model variants optimized for different use cases—from Eleven v3's emotion-rich delivery across 70+ languages to Flash v2.5's ultra-low latency for real-time streaming applications.
These models adapt tone and inflection based on textual context, making them suitable for professional voiceovers, audiobook narration, interactive conversational agents, and multilingual content localization. Developers can control voice characteristics through parameters like stability, similarity boost, and style exaggeration to fine-tune output quality.
Media Production: Generate voiceovers for video content, podcasts, and documentary narration with professional-grade audio quality.
Accessibility: Convert written content into speech for visually impaired users or learning applications requiring audio formats.
Gaming & Interactive Media: Create dynamic NPC dialogue and character voices that respond contextually to player actions.
E-Learning Platforms: Produce multilingual course materials with consistent voice quality across languages.
Customer Service: Power conversational AI agents with natural-sounding voices for phone systems and chatbots.
Audiobook Production: Scale audiobook creation across genres and languages while maintaining narrative consistency and emotional depth.
Text Structure: Use punctuation strategically—commas add natural pauses, ellipses create suspense, and exclamation marks increase energy. Short phrases (under 500 characters) generate faster; longer text enables better contextual understanding.
Parameter Optimization: Tune stability, similarity_boost, and style together; lower stability gives more expressive delivery, higher similarity_boost stays closer to the reference voice, and moderate style values keep emphasis natural.
Model Selection: Use Eleven v3 for maximum emotional range and language support; Flash v2.5 for real-time chat applications; Multilingual v2 for long-form audiobooks; Turbo v2.5 when balancing quality and latency.
Reproducibility: Set a fixed seed value to generate identical audio outputs across API calls—critical for A/B testing and iterative refinement.
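A small sketch tying these tips together: a punctuation-driven prompt plus a fixed seed so repeated calls return the same audio (the prompt text and seed value are illustrative only):

// Fixed seed => repeatable output for the same request body (prompt text is illustrative)
const repeatableRequest = {
  "prompt": "Wait... do you hear that? Something is moving, slowly, just beyond the ridge!",
  "voice_id": "21m00Tcm4TlvDq8ikWAM",
  "seed": 424242  // any integer up to 4294967295; 0 randomizes instead
};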
Which model should I choose for my application?
Use Eleven v3 for premium emotional content across 70+ languages, Flash v2.5 for latency-critical real-time streaming, Multilingual v2 for stable long-form narration, and Turbo v2.5 when you need both quality and speed across 32 languages.
How do I customize voice characteristics?
Adjust stability to control emotional variation, similarity_boost to match a reference voice, and style to exaggerate delivery. Enable use_speaker_boost to enhance voice clone accuracy. Combine parameters iteratively based on your quality requirements.
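As a rough sketch, those tuning parameters can be layered onto the base request from the example above (the specific values are only illustrative, not recommended settings):

// Illustrative voice-tuning overrides (values are examples, not recommendations)
const voiceTuning = {
  "stability": 0.4,          // lower = more expressive delivery
  "similarity_boost": 0.9,   // higher = closer match to the reference voice
  "style": 0.6,              // higher = more exaggerated delivery
  "use_speaker_boost": true  // boosts likeness to the original speaker
};
const tunedPayload = { ...data, ...voiceTuning }; // merge with the base request body from the example above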
What's the difference between stability and similarity boost?
Stability controls emotional range (low = expressive, high = consistent), while similarity_boost determines how closely the output matches the original Voice ID (high = accurate clone, low = more interpretation).
Can I generate the same audio output repeatedly?
Yes. Set a fixed seed value (any integer between 0 and 4,294,967,295) to ensure deterministic outputs. Use seed: 0 for randomization across API calls.
What languages are supported?
The platform supports 30+ languages including English, Spanish, French, German, Italian, Portuguese, Polish, Hindi, Arabic, Chinese, Japanese, Korean, and more. Specify language using ISO 639-1 codes via language_code parameter, or leave empty for automatic detection.
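For example, requesting Spanish output explicitly rather than relying on auto-detection might look like this sketch (exact auto-detection behavior depends on the chosen model):

// Explicit language selection via ISO 639-1 code (here: Spanish)
const spanishRequest = {
  "prompt": "Explora con nosotros las profundidades del océano.",
  "voice_id": "21m00Tcm4TlvDq8ikWAM",
  "language_code": "es"
};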
How does text normalization affect output?
Text normalization converts written formats (numbers, dates, abbreviations) into speakable text. Set apply_text_normalization: "auto" for smart handling, "on" to force normalization, or "off" to preserve text exactly as written. Enable apply_language_text_normalization for language-specific rules (currently optimized for Japanese).
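For instance, forcing normalization so that numbers and dates are expanded into spoken words might look like this sketch:

// Force text normalization so "3rd" and "1997" are read out in full
const normalizedRequest = {
  "prompt": "The expedition launched on the 3rd of March, 1997, at 9:45 AM.",
  "voice_id": "21m00Tcm4TlvDq8ikWAM",
  "apply_text_normalization": "on",
  "apply_language_text_normalization": false  // language-specific rules, mainly for Japanese
};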