POST https://api.segmind.com/v1/tts-elevenlabs-with-timestamps

```javascript
const axios = require('axios');

const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/tts-elevenlabs-with-timestamps";

const data = {
  "prompt": "Explore the depths of the ocean with us, where mystery and wonder await.",
  "voice_id": "21m00Tcm4TlvDq8ikWAM",
  "model_id": "eleven_multilingual_v2",
  "stability": 0.7,
  "use_speaker_boost": true,
  "similarity_boost": 0.85,
  "style": 0.3,
  "speed": 1.1,
  "seed": 12345678,
  "apply_text_normalization": "auto",
  "apply_language_text_normalization": false
};

(async function() {
  try {
    const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
    console.log(response.data);
  } catch (error) {
    // error.response is undefined for network-level failures, so guard before reading it
    console.error('Error:', error.response ? error.response.data : error.message);
  }
})();
```
RESPONSE
application/json
HTTP Response Codes
200 - OK: Audio generated
401 - Unauthorized: User authentication failed
404 - Not Found: The requested URL does not exist
405 - Method Not Allowed: The requested HTTP method is not allowed
406 - Not Acceptable: Not enough credits
500 - Server Error: The server had an issue processing the request

Attributes


prompt str ( required )

Text for generating audio. Use short phrases for quick outputs and longer text for detailed responses.


voice_id str ( default: 21m00Tcm4TlvDq8ikWAM )

Set a voice ID for personalized output. Use default for standard or specific ID for a custom voice.


model_id enum:str ( default: eleven_multilingual_v2 )

Choose a TTS model. Use 'multilingual' for global languages and 'turbo' for faster processing.

Allowed values:


language_code enum:str ( default: none, auto-detect )

Specify language for output using ISO 639-1 code. Set 'en' for English, or default for automatic detection.

Allowed values:


stability float ( default: 0.7 )

Controls emotional range in voice. Use lower values for more emotion or higher for consistent tone.

min: 0, max: 1


use_speaker_boost boolean ( default: true )

Enhance similarity to the original voice. Enable for maximum likeness; disable for more neutral tone.


similarity_boost float ( default: 0.85 )

Adjusts voice similarity to original. Higher values for likeness, lower for variability.

min: 0, max: 1


style float ( default: 0.3 ) Affects Pricing

Sets style exaggeration. Use low for subtle, natural sound or high for dramatic effect.

min: 0, max: 1


speed float ( default: 1.1 )

Adjusts speech pace. Set lower for clarity, higher for quick delivery.

min: 0.25, max: 4


seed int ( default: 12345678 )

Ensure repeatable outputs with a seed. Use '0' for randomization or set a number for consistency.

min: 0, max: 4294967295


apply_text_normalization enum:str ( default: auto )

Handle text normalization. Set 'auto' for default, 'on' for guaranteed, or 'off' for text as is.

Allowed values:


apply_language_text_normalization boolean ( default: false )

Enable language-specific text normalization where supported, mostly for Japanese. Leave at the default otherwise.

To keep track of your credit usage, inspect the response headers of each API call. The x-remaining-credits header indicates the number of credits remaining in your account. Monitor this value to avoid disruptions in your API usage.
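As a minimal sketch, this header can be read with a small helper. The header name `x-remaining-credits` comes from this page; the helper assumes axios's default behavior of lower-casing header names in `response.headers`.

```javascript
// Read the remaining-credits counter from a response's headers.
// Returns null if the header is absent (e.g. in mocked responses).
function remainingCredits(headers) {
  const raw = headers['x-remaining-credits'];
  return raw === undefined ? null : Number(raw);
}

// Usage after a successful axios call:
//   const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
//   const credits = remainingCredits(response.headers);
//   if (credits !== null && credits < 100) console.warn('Low credits:', credits);
```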

ElevenLabs Text-to-Speech: Advanced AI Voice Generation

What is ElevenLabs Text-to-Speech?

ElevenLabs Text-to-Speech (TTS) is a suite of advanced AI models that transform written text into highly realistic, emotionally expressive spoken audio. Unlike traditional TTS systems that sound robotic or flat, ElevenLabs uses deep learning to generate voices with natural prosody, emotion, and cultural nuance. The platform offers multiple model variants optimized for different use cases—from Eleven v3's emotion-rich delivery across 70+ languages to Flash v2.5's ultra-low latency for real-time streaming applications.

These models adapt tone and inflection based on textual context, making them suitable for professional voiceovers, audiobook narration, interactive conversational agents, and multilingual content localization. Developers can control voice characteristics through parameters like stability, similarity boost, and style exaggeration to fine-tune output quality.

Key Features

  • Multi-Model Architecture: Choose from eight specialized models including Eleven v3 (70+ languages), Multilingual v2 (29 languages), Flash v2.5 (ultra-low latency), and Turbo v2.5 (32 languages with speed optimization)
  • Emotional Intelligence: Models naturally interpret text sentiment and adjust tone, pitch, and pacing accordingly
  • Voice Cloning Support: Use custom Voice IDs to replicate specific speakers with high fidelity
  • Multilingual Excellence: Native support for 30+ languages with proper pronunciation and cultural intonation
  • Real-Time Streaming: Flash and Turbo models deliver sub-300ms latency for live applications
  • Granular Control: Adjust stability, similarity boost, style, speed, and speaker boost for precise voice tuning

Best Use Cases

Media Production: Generate voiceovers for video content, podcasts, and documentary narration with professional-grade audio quality.

Accessibility: Convert written content into speech for visually impaired users or learning applications requiring audio formats.

Gaming & Interactive Media: Create dynamic NPC dialogue and character voices that respond contextually to player actions.

E-Learning Platforms: Produce multilingual course materials with consistent voice quality across languages.

Customer Service: Power conversational AI agents with natural-sounding voices for phone systems and chatbots.

Audiobook Production: Scale audiobook creation across genres and languages while maintaining narrative consistency and emotional depth.

Prompt Tips and Output Quality

Text Structure: Use punctuation strategically—commas add natural pauses, ellipses create suspense, and exclamation marks increase energy. Short phrases (under 500 characters) generate faster; longer text enables better contextual understanding.
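As an illustration, here is a punctuated prompt together with a check against the under-500-character quick path described above. The threshold is this guide's rule of thumb, not a hard API limit.

```javascript
// Punctuation shapes delivery: commas add pauses, an ellipsis builds
// suspense, an exclamation mark raises energy.
const prompt = "Explore the depths of the ocean... where mystery, and wonder, await!";

// Rule of thumb from this guide: prompts under 500 characters return faster.
function isQuickPrompt(text) {
  return text.length < 500;
}

console.log(isQuickPrompt(prompt)); // true
```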

Parameter Optimization:

  • Stability (0.0-1.0): Lower values (0.3-0.5) create expressive, emotional speech; higher values (0.7-1.0) produce consistent, stable narration
  • Similarity Boost (0.0-1.0): Set to 0.85+ for maximum voice clone accuracy; reduce to 0.5-0.7 for more variation
  • Style (0.0-1.0): Use 0.0-0.3 for natural delivery; increase to 0.7+ for dramatic, exaggerated performances
  • Speed (0.25-4.0): 1.0 is a natural pace; 1.5+ for fast-paced content; 0.75 for deliberate, educational material
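The ranges above can be bundled into reusable presets. This is an illustrative sketch: the preset names and values are starting points drawn from the guidance above, not official defaults.

```javascript
// Illustrative voice-setting presets built from the parameter guidance above.
function voicePreset(kind) {
  const presets = {
    expressive: { stability: 0.4, similarity_boost: 0.6, style: 0.7, speed: 1.0 },
    narration:  { stability: 0.85, similarity_boost: 0.85, style: 0.2, speed: 0.9 },
  };
  if (!(kind in presets)) {
    throw new Error('Unknown preset: ' + kind);
  }
  return presets[kind];
}

// Spread a preset into a request body:
// const data = { prompt, voice_id, model_id, ...voicePreset('narration') };
```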

Model Selection: Use Eleven v3 for maximum emotional range and language support; Flash v2.5 for real-time chat applications; Multilingual v2 for long-form audiobooks; Turbo v2.5 when balancing quality and latency.

Reproducibility: Set a fixed seed value to generate identical audio outputs across API calls—critical for A/B testing and iterative refinement.
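A minimal sketch of seed handling, using the range documented in the attribute table (0 to 4294967295, with 0 requesting randomization):

```javascript
// Validate a seed against the documented range before sending.
const MAX_SEED = 4294967295; // 2^32 - 1, per the attribute table
function isValidSeed(seed) {
  return Number.isInteger(seed) && seed >= 0 && seed <= MAX_SEED;
}

// For reproducible A/B tests, reuse the exact same request body
// (including seed) across calls; seed 0 requests a random result instead.
```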

FAQs

Which model should I choose for my application?
Use Eleven v3 for premium emotional content across 70+ languages, Flash v2.5 for latency-critical real-time streaming, Multilingual v2 for stable long-form narration, and Turbo v2.5 when you need both quality and speed across 32 languages.

How do I customize voice characteristics?
Adjust stability to control emotional variation, similarity_boost to match a reference voice, and style to exaggerate delivery. Enable use_speaker_boost to enhance voice clone accuracy. Combine parameters iteratively based on your quality requirements.

What's the difference between stability and similarity boost?
Stability controls emotional range (low = expressive, high = consistent), while similarity_boost determines how closely the output matches the original Voice ID (high = accurate clone, low = more interpretation).

Can I generate the same audio output repeatedly?
Yes. Set a fixed seed value (any integer between 0 and 4,294,967,295) to ensure deterministic outputs. Use seed: 0 for randomization across API calls.

What languages are supported?
The platform supports 30+ languages including English, Spanish, French, German, Italian, Portuguese, Polish, Hindi, Arabic, Chinese, Japanese, Korean, and more. Specify language using ISO 639-1 codes via language_code parameter, or leave empty for automatic detection.

How does text normalization affect output?
Text normalization converts written formats (numbers, dates, abbreviations) into speakable text. Set apply_text_normalization: "auto" for smart handling, "on" to force normalization, or "off" to preserve text exactly as written. Enable apply_language_text_normalization for language-specific rules (currently optimized for Japanese).
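As a sketch, the two normalization fields can be set through one helper. The allowed values ('auto', 'on', 'off') come from the attribute table above; the helper itself is illustrative, not part of the API.

```javascript
// Attach the documented normalization options to a request body.
function withNormalization(data, mode, languageSpecific = false) {
  const allowed = ['auto', 'on', 'off'];
  if (!allowed.includes(mode)) {
    throw new Error('mode must be one of: ' + allowed.join(', '));
  }
  return {
    ...data,
    apply_text_normalization: mode,
    apply_language_text_normalization: languageSpecific,
  };
}

// Example: force normalization and enable Japanese-specific rules.
// const data = withNormalization({ prompt: "2024年4月1日" }, 'on', true);
```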