
ElevenLabs Dubbing: AI-Powered Audio & Video Translation API

What is ElevenLabs Dubbing?

ElevenLabs Dubbing is an AI-powered dubbing API that translates audio and video content into 29 languages while preserving each speaker's original voice, tone, emotion, and timing. Unlike traditional localization workflows that require re-recording with human voice actors, ElevenLabs Dubbing automates the entire pipeline — from speaker separation and transcription to translation, speech synthesis, and audio re-sync — in a single API call.

Built on ElevenLabs' Multilingual v2 model, it handles complex real-world content: overlapping dialogue, background music, ambient noise, whispers, and shouted lines. The result is natural-sounding, voice-cloned multilingual audio that maintains your original speaker's identity across languages.
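The single-call pipeline described above can be sketched in Python. This is a sketch, not the authoritative API reference: the base URL, the form-field names (which mirror the parameters named in this article), the xi-api-key header, and the dubbing_id response field are assumptions you should check against the current ElevenLabs docs, and ELEVENLABS_API_KEY is a hypothetical environment variable.

```python
import os

API_BASE = "https://api.elevenlabs.io/v1"  # assumed base URL


def build_dub_form(target_lang: str, source_lang: str = "auto", **options) -> dict:
    """Collect form fields for one dubbing request.

    Field names mirror the parameters described in this article
    (num_speakers, start_time, drop_background_audio, ...); treat
    them as a sketch rather than the official reference.
    """
    form = {"target_lang": target_lang, "source_lang": source_lang}
    # Only send options the caller actually set.
    form.update({k: v for k, v in options.items() if v is not None})
    return form


def start_dub(file_path: str, target_lang: str, **options) -> str:
    """Submit a media file for dubbing and return the job id."""
    import requests  # imported here so the form builder stays dependency-free

    with open(file_path, "rb") as media:
        resp = requests.post(
            f"{API_BASE}/dubbing",
            headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
            data=build_dub_form(target_lang, **options),
            files={"file": media},
        )
    resp.raise_for_status()
    return resp.json()["dubbing_id"]
```

The returned job id is then used with the status-polling endpoint discussed in the FAQs below.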


Key Features

  • 29-language support — English, Hindi, Spanish, Japanese, Arabic, French, German, Korean, Tamil, and more
  • Automatic speaker separation — detects and isolates multiple overlapping voices with zero manual configuration
  • Voice cloning with emotion retention — preserves accent, tone, and emotional nuance using Multilingual v2
  • Background audio preservation — music, SFX, and ambient noise survive the dubbing process intact
  • Segment-level dubbing — use the start_time and end_time parameters to dub specific clips within longer files
  • Auto language detection — set source_lang: auto for hands-free source identification
  • Advanced controls — profanity filtering, highest-resolution output, voice cloning toggle, and CSV-based manual mode
  • Python and TypeScript SDKs — production-ready async API with straightforward status polling

Best Use Cases

  • Content creators and YouTubers localizing videos for Hindi, Spanish, or Arabic-speaking audiences
  • Podcast producers generating multilingual versions of long-form audio without re-recording
  • Media studios dubbing trailers, courses, or documentary content at scale
  • EdTech platforms delivering educational video in regional languages without hiring voice actors
  • App developers building programmatic translation pipelines for UGC platforms or streaming products
  • Corporate teams localizing training videos and product demos for global rollouts

Prompt Tips and Output Quality

Start with source_lang: auto unless you know the source language precisely — auto-detection is accurate and simplifies your workflow. For content with a known fixed language, specifying it directly speeds up processing.

Set num_speakers manually for dense dialogue. The default auto-detection works well for 1–3 speakers, but for panel discussions, interviews, or multi-character audio, providing an explicit count improves speaker separation quality significantly.

Use start_time and end_time for iteration. When testing output quality on long-form video, dub a representative 2–3 minute segment first before committing to full-file processing.

Keep drop_background_audio: false for most content. ElevenLabs Dubbing's ability to retain background music is a core differentiator — disabling it is best reserved for clean voiceover or podcast-only content.

Enable highest_resolution: true when dubbing video destined for broadcast, YouTube, or professional distribution.
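The two flag recommendations above can be captured as request defaults. A minimal sketch; recommended_flags is a hypothetical helper and the field names are the ones quoted in these tips:

```python
def recommended_flags(for_broadcast: bool = False) -> dict:
    """Defaults reflecting the tips above: keep background audio, and
    request highest resolution only for professional distribution."""
    return {
        "drop_background_audio": False,        # retain music and SFX
        "highest_resolution": bool(for_broadcast),
    }
```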

Avoid CSV/manual mode in production. The manual mode with custom CSV transcripts is experimental and better suited for testing edge cases, not live pipelines.


FAQs

How long does dubbing take for a 30-minute video? Processing is asynchronous and scales with content length. A 30-minute video can take several minutes to process. Use the status polling endpoint to check job completion — avoid setting fixed timeouts.
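A polling loop along these lines avoids fixed timeouts. This is a sketch: fetch_status stands in for whatever call your SDK or HTTP client makes to the job-status endpoint, and the terminal status strings ("dubbed", "failed") are assumptions to verify against the API reference:

```python
import time
from typing import Callable


def wait_for_dub(fetch_status: Callable[[], str],
                 poll_interval_s: float = 5.0,
                 max_polls: int = 720) -> str:
    """Poll a dubbing job until it reaches a terminal state.

    `fetch_status` wraps the status endpoint for one job id and returns
    its current status string; the terminal values used here are
    assumptions for this sketch.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status in ("dubbed", "failed"):
            return status
        time.sleep(poll_interval_s)
    raise TimeoutError("dubbing job still in progress after max_polls")
```

Injecting the status fetcher as a callable also makes the loop easy to unit-test without network access.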

Which audio and video formats are supported? The API accepts MP3, MP4, and most common audio/video formats via URL. You can also pass direct URLs from YouTube, TikTok, or cloud storage buckets.

Does voice cloning work for all 29 languages? Yes — voice cloning is applied by default across all supported languages using the Multilingual v2 model. Set disable_voice_cloning: true if you prefer generic ElevenLabs library voices instead.

What happens to background music during dubbing? By default, background audio (music, ambient sound, SFX) is separated and re-layered into the dubbed output. Set drop_background_audio: true only if you want a clean speech-only track.

Can I target a specific accent for dubbed voices? The target_accent parameter (e.g., "american", "british") is available but experimental. It's not recommended for production use and may produce inconsistent results across languages.

Is there a character or length limit? The API applies a character limit of approximately 3,000 characters per minute of content. Plan your content segmentation accordingly for very long or text-dense files.
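Using the approximate 3,000 characters-per-minute figure above, you can budget segmentation before submitting. A quick sketch with hypothetical helper names:

```python
import math

CHARS_PER_MINUTE = 3000  # approximate limit quoted above


def minutes_needed(char_count: int) -> int:
    """Minimum whole minutes of content allowed for this many characters."""
    return math.ceil(char_count / CHARS_PER_MINUTE)


def fits(char_count: int, duration_minutes: float) -> bool:
    """True if the script stays within the per-minute character budget."""
    return char_count <= duration_minutes * CHARS_PER_MINUTE
```

For example, a 9,000-character script needs at least three minutes of content to stay within the budget.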