const axios = require('axios');

const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/gemini-2.5-pro-tts";

// Request body: the text to speak, the Speaker 1 voice, and the speech variance.
const data = {
  "text": "Narrate softly: I hope you find peace and joy.",
  "voice_1": "Kore",
  "temperature": 0.4
};

(async function() {
  try {
    const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
    console.log(response.data);
  } catch (error) {
    // error.response is undefined when no HTTP response arrived (e.g. a network failure).
    console.error('Error:', error.response ? error.response.data : error.message);
  }
})();

Input text for voice synthesis. Use styles like: 'soft, calm' or 'excited narration'.
Select a unique voice for Speaker 1. Ideal for character roles or distinct narrator styles.
Choose a second speaker voice for dialogues. Use contrasting voices for role distinction.
Adjusts speech variance (temperature). Use 0.2 for stable narration or 0.7 for dynamic expression. Range: 0 (min) to 2 (max).
To keep track of your credit usage, inspect the response headers of each API call. The x-remaining-credits header reports the number of credits remaining in your account. Monitor this value to avoid disruptions in your API usage.
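As a rough illustration of the header check described above, the sketch below reads x-remaining-credits from a response's headers. The helper names and the warning threshold of 100 are hypothetical, and it assumes the header carries a plain numeric string; the page does not specify the exact format.

```javascript
// Parse the x-remaining-credits header into a number.
// In Node.js, HTTP response header names arrive lowercased, so the key is lowercase.
// Returns null when the header is missing or is not numeric (format is an assumption).
function remainingCredits(headers) {
  const raw = headers['x-remaining-credits'];
  if (raw === undefined || raw === null) return null;
  const value = Number(raw);
  return Number.isNaN(value) ? null : value;
}

// Log a warning when the balance drops below a threshold.
// The default of 100 is an arbitrary illustrative value, not a Segmind recommendation.
function warnIfLow(headers, threshold = 100) {
  const credits = remainingCredits(headers);
  if (credits !== null && credits < threshold) {
    console.warn(`Only ${credits} credits remaining`);
  }
  return credits;
}
```

With axios, this could be called as warnIfLow(response.headers) right after a successful request.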
Edited by Segmind Team on December 29, 2025.
Gemini 2.5 Text-to-Speech is an AI voice synthesis model that transforms plain text into rich, expressive speech across 24 languages. Developed by Google, it comes in two variants, Gemini 2.5 Flash and Gemini 2.5 Pro, each of which interprets context and adjusts tone, pace, and emotion accordingly. It is designed for a wide range of applications, from audiobooks and online learning to automated podcast creation. Gemini 2.5 TTS aims for natural, human-sounding speech, with nuanced control over style and the ability to generate multi-speaker dialogue.
Use Effective Prompts: Embed stylistic directions directly in your text, for example "Narrate softly: The forest whispered ancient secrets." or "Excitedly announce: We've just launched our biggest update ever!"
Voice Selection: Choose contrasting voices for better clarity in dialogue: pair deep, measured voices like "Charon" with lighter, energetic options like "Puck" for dramatic effect.
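Temperature Tuning: Lower values such as 0.2 give stable narration, while higher values such as 0.7 give more dynamic expression. The allowed range is 0 to 2.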
Multi-Speaker Syntax: When the audio needs two voices, structure your text with clear speaker tags or natural dialogue formatting to maintain voice consistency.
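A minimal sketch of a two-speaker request body, following the tips above. Only text, voice_1, and temperature appear in the sample request on this page; the voice_2 field name and the "Speaker 1:" / "Speaker 2:" tag format are assumptions for illustration and should be checked against the API reference before use.

```javascript
// Build a request body for a two-voice dialogue.
// turns:  [{ speaker: 'Speaker 1', line: '...' }, ...]
// voices: [voiceForSpeaker1, voiceForSpeaker2]
// NOTE: "voice_2" and the "Speaker N:" tag format are assumed, not confirmed by this page.
function buildDialogueRequest(turns, voices, temperature = 0.4) {
  // One tagged line per turn keeps it unambiguous who is speaking.
  const text = turns
    .map(({ speaker, line }) => `${speaker}: ${line}`)
    .join('\n');
  return {
    text,
    voice_1: voices[0],
    voice_2: voices[1], // assumed parameter name
    temperature
  };
}

const dialogue = buildDialogueRequest(
  [
    { speaker: 'Speaker 1', line: 'Did you hear the news?' },
    { speaker: 'Speaker 2', line: 'Excitedly: Yes, we launched today!' }
  ],
  ['Charon', 'Puck'] // contrasting voices, as suggested in the tips
);
```

The resulting object could be passed as the data argument of the axios.post call shown in the sample request.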
Is Gemini 2.5 TTS open-source?
No, Gemini 2.5 TTS is a proprietary Google AI model accessible via API. However, it offers flexible integration options for developers through standard REST endpoints.
How does it compare to ElevenLabs or OpenAI TTS?
Gemini 2.5 TTS excels at context-aware pacing and supports multiple speakers natively, producing continuous, seamless speech without audio splicing. It also offers style control through prompt engineering rather than only through traditional parameter adjustments, which makes it well suited to narrative content that needs emotional nuance.
What parameters should I adjust for best results?
Start with the default temperature (0.4) and adjust it to suit the type of content. For audiobooks, keep it between 0.3 and 0.5; for marketing content, try 0.5 to 0.7. Always test voice combinations as well, since some voices handle certain emotions better than others.
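One way to encode this guidance is a small lookup of starting temperatures. The sketch below is illustrative only: the content-type keys and the choice of range midpoints (0.4 and 0.6) are assumptions, while the default of 0.4 and the 0 to 2 limits come from this page.

```javascript
const DEFAULT_TEMPERATURE = 0.4; // default stated on this page

// Starting points picked as midpoints of the suggested ranges (an assumption).
const TEMPERATURE_PRESETS = {
  audiobook: 0.4, // suggested range 0.3 to 0.5
  marketing: 0.6  // suggested range 0.5 to 0.7
};

// Keep a value inside the parameter's documented range of 0 to 2.
function clampTemperature(value) {
  return Math.min(2, Math.max(0, value));
}

// Look up a starting temperature, falling back to the default for unknown types.
function temperatureFor(contentType) {
  const preset = TEMPERATURE_PRESETS[contentType];
  return clampTemperature(preset === undefined ? DEFAULT_TEMPERATURE : preset);
}
```

These values are only starting points; the final setting is still best found by listening to the output for each voice.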
Can I use this for commercial projects?
Yes, the model supports commercial use cases. Check Google's specific licensing terms for your deployment scenario, particularly for large-scale content production.
Does it handle multiple languages in one request?
Gemini 2.5 TTS supports 24 languages, but single-language requests give the best results. For multilingual content, make a separate API call for each language segment to maintain pronunciation accuracy.
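The per-language approach might look like the sketch below, which turns pre-split segments into one request body each. The segment shape and function name are hypothetical, and the language tag is kept only for bookkeeping on the caller's side; it is not sent to the API, since this page documents no language parameter.

```javascript
// segments: [{ language: 'en', text: '...' }, ...], already split by the caller.
// Returns one request body per non-empty segment, all sharing voice and temperature.
function requestsPerLanguage(segments, voice, temperature = 0.4) {
  return segments
    .filter((segment) => segment.text.trim().length > 0)
    .map((segment) => ({
      language: segment.language, // bookkeeping only, not part of the payload
      payload: { text: segment.text, voice_1: voice, temperature }
    }));
}

const multilingual = requestsPerLanguage(
  [
    { language: 'en', text: 'Welcome to the tour.' },
    { language: 'fr', text: 'Bienvenue à la visite.' },
    { language: 'en', text: '   ' } // empty segments are skipped
  ],
  'Kore'
);
```

Each payload can then be posted in turn, and the resulting audio clips joined in the original order.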
How do I maintain voice consistency across long documents?
To maintain voice consistency, use the same voice selection and temperature settings throughout. Break very long texts into logical chapter or section calls, and reuse those parameters in every call so that the audio stays consistent from one segment to the next.
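A minimal sketch of that approach: split the document at blank lines, pack paragraphs into sections, and reuse one settings object for every request. The 2000-character default is an arbitrary illustrative limit, not a documented API constraint, and the function names are hypothetical.

```javascript
// Split text into sections at paragraph boundaries (blank lines),
// packing paragraphs together until adding another would exceed maxChars.
// A single paragraph longer than maxChars is kept whole rather than cut mid-sentence.
function splitIntoSections(text, maxChars = 2000) {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);
  const sections = [];
  let current = '';
  for (const paragraph of paragraphs) {
    // +2 accounts for the blank line that would join the two pieces.
    if (current && current.length + paragraph.length + 2 > maxChars) {
      sections.push(current);
      current = paragraph;
    } else {
      current = current ? `${current}\n\n${paragraph}` : paragraph;
    }
  }
  if (current) sections.push(current);
  return sections;
}

// Build one request body per section, copying the same voice and temperature into each.
function sectionPayloads(text, settings, maxChars) {
  return splitIntoSections(text, maxChars).map((section) => ({
    ...settings,
    text: section
  }));
}
```

For example, sectionPayloads(bookText, { voice_1: 'Kore', temperature: 0.4 }) yields payloads that differ only in their text field.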