POST

javascript

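const axios = require('axios');

const api_key = "YOUR API-KEY"; // replace with your Segmind API key
const url = "https://api.segmind.com/v1/gemini-2.5-pro-tts";

// Request body; each field is described under Attributes.
const data = {
  "text": "Narrate softly: I hope you find peace and joy.",
  "voice_1": "Kore",
  "temperature": 0.4
};

(async function() {
    try {
        const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
        console.log(response.data);
    } catch (error) {
        // error.response is undefined for network failures and timeouts,
        // so fall back to the error message in that case.
        console.error('Error:', error.response ? error.response.data : error.message);
    }
})();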

RESPONSE

image/jpeg

HTTP Response Codes

200 - OK: Audio generated

401 - Unauthorized: User authentication failed

404 - Not Found: The requested URL does not exist

405 - Method Not Allowed: The requested HTTP method is not allowed

406 - Not Acceptable: Not enough credits

500 - Server Error: Server had some issue with processing

Attributes

text (str, required)

Input text for voice synthesis. Use styles like: 'soft, calm' or 'excited narration'.

voice_1 (str, default: Kore)

Select a unique voice for Speaker 1. Ideal for character roles or distinct narrator styles.

voice_2 (str, default: 1)

Choose a second speaker voice for dialogues. Use contrasting voices for role distinction.

temperature (float, default: 0.4)

Adjusts speech variance. Use 0.2 for stable narration or 0.7 for dynamic expression.

min: 0, max: 2

To track your credit usage, inspect the response headers of each API call. The x-remaining-credits header reports the number of credits remaining in your account. Monitor this value to avoid disruptions in your API usage.
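As a rough sketch of how that header might be read, the helpers below parse x-remaining-credits from a headers object such as the `response.headers` that axios exposes (with lowercase keys). The function names and the low-credit threshold are hypothetical, and the header is assumed to carry a plain numeric value.

```javascript
// Hypothetical helper: returns the remaining credits as a number,
// or null when the header is missing or not numeric.
// Assumes `headers` uses lowercase keys, as axios response.headers does.
function getRemainingCredits(headers) {
  const raw = headers ? headers['x-remaining-credits'] : undefined;
  if (raw === undefined || raw === null || raw === '') return null;
  const credits = Number(raw);
  return Number.isFinite(credits) ? credits : null;
}

// Hypothetical check: true when the credit count is known and below
// the threshold. The default of 10 is an arbitrary illustration.
function isRunningLow(headers, threshold = 10) {
  const credits = getRemainingCredits(headers);
  return credits !== null && credits < threshold;
}
```

After a successful call, something like `if (isRunningLow(response.headers)) console.warn('Credits are running low');` could flag the situation before requests start failing with 406.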

Gemini 2.5 TTS: Multi-Speaker Text-to-Speech API

Edited by Segmind Team on December 29, 2025.

What is Gemini 2.5 TTS?

Gemini 2.5 Text-to-Speech is an AI voice-synthesis model that turns plain text into expressive speech in 24 languages. Developed by Google, it comes in two variants, Gemini 2.5 Flash and Gemini 2.5 Pro. Both interpret the context of the text and adjust tone, pace, and emotion accordingly. It suits applications ranging from audiobooks and online learning to automated podcast creation, producing natural-sounding speech with fine control over style and the ability to generate multi-speaker dialogue.

Key Features of Gemini 2.5 Text-to-Speech

Context-Aware Pacing: It automatically speeds up speech to express excitement or slows it down for emphasis, matching natural speech patterns.
Multi-Speaker Support: It maintains distinct, consistent character voices across conversations, with 30+ voice options.
Style Control: It accepts direct prompt-based styling such as "soft, calm narrator" or "excited storytelling" for granular tone management.
24 Language Support: It synthesizes speech natively in each supported language, with authentic pronunciation and regional nuances.
Temperature Controls: It lets you fine-tune speech variance, from robotic consistency (0.2) to expressive dynamism (0.7+).
Two-Voice Dialogue: It has a built-in dual-speaker system suited to interviews, audiobook dialogues, or interactive content.

Best Use Cases

Audiobook Production: Generate distinct narrator and character voices with an emotional range that adapts to the plot, letting indie publishers and content creators scale audio production.
E-Learning & Training: Create engaging educational content with varied speaking styles, using pacing and tone to emphasize key concepts.
Podcast Automation: Synthesize intro/outro segments, sponsored content, or entire script readings with consistent, professional-quality voices.
Marketing & Localization: Produce voiceovers for video ads, product demos, or multilingual campaigns without expensive studio sessions.
Virtual Assistants & Gaming: Build characters with unique personalities through voice selection and style prompting for immersive user experiences.

Prompt Tips and Output Quality

Use Effective Prompts: Embed stylistic directions directly in your text, for example "Narrate softly: The forest whispered ancient secrets." or "Excitedly announce: We've just launched our biggest update ever!"

Voice Selection: Choose contrasting voices for better clarity in dialogue: pair deep, measured voices like "Charon" with lighter, energetic options like "Puck" for dramatic effect.

Temperature Tuning:

0.2-0.4: Stable, professional narration for corporate or educational content
0.5-0.7: Dynamic storytelling with emotional variation
0.7+: Experimental, highly expressive delivery for creative projects
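One way to apply these ranges in code is a small lookup of presets. The sketch below is illustrative only: the preset names and the exact values picked inside each range are assumptions, while the 0.4 fallback and the 0 to 2 bounds come from the Attributes section.

```javascript
// Hypothetical presets; each value is an arbitrary pick inside the
// corresponding range from the Temperature Tuning list.
const TEMPERATURE_PRESETS = {
  narration: 0.3,    // 0.2-0.4: stable, professional narration
  storytelling: 0.6, // 0.5-0.7: dynamic storytelling
  experimental: 0.9, // 0.7+: highly expressive delivery
};

// Returns the preset for a content type, or the documented default (0.4)
// when the type is not one of the presets above.
function pickTemperature(contentType) {
  const known = Object.prototype.hasOwnProperty.call(TEMPERATURE_PRESETS, contentType);
  return known ? TEMPERATURE_PRESETS[contentType] : 0.4;
}

// Keeps a custom value inside the documented bounds (min 0, max 2).
function clampTemperature(value) {
  return Math.min(2, Math.max(0, value));
}
```

The result can be passed as the `temperature` field of the request body, for example `temperature: pickTemperature('storytelling')`.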

Multi-Speaker Syntax: When the audio needs two voices, structure your text with clear speaker tags or natural dialogue formatting to maintain voice consistency.
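The page does not specify an exact speaker-tag format, so the following is only a sketch under the assumption that plain "Speaker 1:" and "Speaker 2:" prefixes are acceptable tags. The helper name is hypothetical. The field names (`text`, `voice_1`, `voice_2`, `temperature`) come from the Attributes section, and "Charon" and "Puck" are the contrasting voices mentioned under Voice Selection.

```javascript
// Hypothetical helper that turns a list of dialogue turns into a request body.
// ASSUMPTION: "Speaker N:" line prefixes are used as speaker tags; the
// documentation does not confirm a specific tag syntax.
function buildDialoguePayload(turns, options = {}) {
  const { voice1 = 'Charon', voice2 = 'Puck', temperature = 0.4 } = options;
  if (temperature < 0 || temperature > 2) {
    throw new RangeError('temperature must be between 0 and 2');
  }
  const text = turns
    .map(({ speaker, line }) => {
      if (speaker !== 1 && speaker !== 2) {
        throw new RangeError('speaker must be 1 or 2');
      }
      return `Speaker ${speaker}: ${line}`;
    })
    .join('\n');
  return { text, voice_1: voice1, voice_2: voice2, temperature };
}

const payload = buildDialoguePayload([
  { speaker: 1, line: 'Did you hear the announcement?' },
  { speaker: 2, line: 'I did, and I can hardly wait!' },
]);
```

The resulting `payload` could then be sent with the same `axios.post(url, payload, ...)` call used in the request example.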

FAQs

Is Gemini 2.5 TTS open-source?
No, Gemini 2.5 TTS is a proprietary Google AI model accessible via API. However, it offers flexible integration options for developers through standard REST endpoints.

How does it compare to ElevenLabs or OpenAI TTS?
Gemini 2.5 TTS excels at context-aware pacing and offers native multi-speaker support, producing continuous, seamless speech without audio splicing. It also offers deeper style control through prompt engineering than traditional parameter adjustments allow, which suits narrative content that depends on emotional delivery.

What parameters should I adjust for best results?
Start with the default temperature (0.4) and adjust it for the type of content: keep it at 0.3-0.5 for audiobooks and try 0.5-0.7 for marketing. Always test voice combinations as well, since some voices handle certain emotions more naturally than others.

Can I use this for commercial projects?
Yes, the model supports commercial use cases. Check Google's specific licensing terms for your deployment scenario, particularly for large-scale content production.

Does it handle multiple languages in one request?
Gemini 2.5 TTS supports 24 languages, but it works best with single-language requests. For multilingual content, make a separate API call for each language segment to maintain pronunciation accuracy.

How do I maintain voice consistency across long documents?
Use the same voice selection and temperature settings throughout. Break very long texts into logical chapter or section calls and keep these parameters identical in every call, so the audio stays consistent across segments.
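A minimal sketch of that approach is shown below: it splits a long text on blank lines, groups paragraphs into chunks, and builds one request body per chunk with identical settings. The helper names and the 2000-character limit are hypothetical; the page does not state a maximum input length.

```javascript
// Hypothetical helper: groups paragraphs (separated by blank lines) into
// chunks of at most maxChars characters. A single paragraph longer than
// maxChars is kept whole rather than cut mid-sentence.
// ASSUMPTION: 2000 is an illustrative limit, not a documented one.
function splitIntoChunks(text, maxChars = 2000) {
  const paragraphs = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const chunks = [];
  let current = '';
  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (candidate.length > maxChars && current) {
      chunks.push(current);
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Builds one request body per chunk, reusing the same voice and
// temperature settings so every segment is synthesized identically.
function buildChunkPayloads(text, settings, maxChars) {
  return splitIntoChunks(text, maxChars).map((chunk) => ({ ...settings, text: chunk }));
}
```

For example, `buildChunkPayloads(bookText, { voice_1: 'Kore', temperature: 0.4 })` yields an array of bodies that can be posted one after another, with the returned audio segments joined in order.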
