const axios = require('axios');
const FormData = require('form-data');

const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/dia";

const reqBody = {
  "text": "[S1] Segmind lets you build powerful image and video workflows — no code needed. \n [S2] Over 200 open and closed models. Just drag, drop, and deploy. \n [S1] Wait, seriously? Even custom models? \n [S2] Yup. Even fine-tuned ones. (chuckles) \n [S1] That's wild. I’ve spent weeks writing code for this. \n [S2] Now you can do it in minutes. Go try Segmind on the cloud. \n [S1] I'm sold. Let’s go. (laughs)",
  "top_p": 0.95,
  "cfg_scale": 4,
  "temperature": 1.3,
  "speed_factor": 0.94,
  "max_new_tokens": 3072,
  "cfg_filter_top_k": 35
};

(async function() {
  try {
    const formData = new FormData();
    // Append each request field as a form-data part
    for (const key in reqBody) {
      if (reqBody.hasOwnProperty(key)) {
        formData.append(key, reqBody[key]);
      }
    }
    const response = await axios.post(url, formData, {
      headers: {
        'x-api-key': api_key,
        ...formData.getHeaders()
      }
    });
    console.log(response.data);
  } catch (error) {
    console.error('Error:', error.response ? error.response.data : error.message);
  }
})();
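The sample above only logs the response body. If you want to write the generated speech to disk instead, the variation below is a minimal sketch that reuses the api_key, url, and reqBody defined above. It assumes the endpoint returns raw audio bytes (the exact content type may differ), so it requests the response as an arraybuffer and saves it as output.wav; adjust the filename and extension to match what the API actually returns.

const fs = require('fs');

// Sketch: request binary audio and write it to a file.
// Assumes the endpoint returns raw audio bytes; verify the content type.
async function generateAndSave() {
  const formData = new FormData();
  for (const key in reqBody) {
    if (reqBody.hasOwnProperty(key)) {
      formData.append(key, reqBody[key]);
    }
  }
  const response = await axios.post(url, formData, {
    headers: { 'x-api-key': api_key, ...formData.getHeaders() },
    responseType: 'arraybuffer' // keep the audio bytes intact
  });
  fs.writeFileSync('output.wav', Buffer.from(response.data));
  console.log('Saved output.wav');
}

generateAndSave().catch(err =>
  console.error('Error:', err.response ? err.response.data : err.message)
);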
The request supports the following parameters:

seed: Use a seed for reproducible results. Leave blank for random output.
text: Input text for speech generation. Use [S1], [S2] for speakers and ( ) for actions like (laughs) or (whispers). Non-verbal tags will be recognized but might result in unexpected output.
top_p (min: 0.1, max: 1): Controls word variety. Higher values allow rarer words. Most users can leave this as is.
cfg_scale (min: 1, max: 5): Controls how strictly the audio follows the text. Higher = more accurate, lower = more natural.
temperature (min: 0.1, max: 2): Controls randomness. Higher values (1.4–2.0) = more variety, lower values (0.1–1.0) = more consistency.
Voice-cloning audio: Audio file in .wav, .mp3, or .flac format. The model will clone this voice style.
speed_factor (min: 0.5, max: 1.5): Controls playback speed. 1.0 = normal, below 1.0 = slower.
max_new_tokens (min: 500, max: 4096): Controls audio length. Higher values = longer audio (≈86 tokens per second).
cfg_filter_top_k (min: 10, max: 100): Filters audio tokens. Higher values = more diverse sounds, lower = more consistent.
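To make these ranges concrete, here is a sketch of an alternative request body tuned for more consistent, reproducible output (lower temperature, fixed seed). Note that the "seed" field name is inferred from the description above and does not appear in the sample request, so treat it as an assumption and confirm it against the endpoint's schema.

// Sketch: a more deterministic configuration based on the ranges above.
// NOTE: the "seed" field name is assumed from the parameter description;
// verify it against the endpoint schema before relying on it.
const conservativeReqBody = {
  "text": "[S1] Welcome back to the show. \n [S2] Thanks for having me. (laughs)",
  "top_p": 0.9,            // 0.1–1: word variety
  "cfg_scale": 4,          // 1–5: higher = follows the text more strictly
  "temperature": 0.7,      // 0.1–2: lower = more consistent delivery
  "speed_factor": 1.0,     // 0.5–1.5: 1.0 = normal speed
  "max_new_tokens": 1500,  // 500–4096: ≈86 tokens per second of audio
  "cfg_filter_top_k": 25,  // 10–100: lower = more consistent sound
  "seed": 42               // assumed field name; fixes output for reproducibility
};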
To keep track of your credit usage, you can inspect the response headers of each API call. The x-remaining-credits property will indicate the number of remaining credits in your account. Ensure you monitor this value to avoid any disruptions in your API usage.
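For example, with axios the remaining-credit count can be read directly off the response object returned by the call above (axios lower-cases header names):

// Read the credit counter from the response headers of any API call
const remainingCredits = response.headers['x-remaining-credits'];
console.log('Remaining credits:', remainingCredits);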
Dia is a cutting-edge, 1.6 billion parameter text-to-speech model developed by Nari Labs — where "Nari" (나리) means lily in pure Korean. It is designed to produce ultra-realistic, podcast-style dialogue directly from text inputs. Unlike traditional TTS systems that often sound robotic or lack expressive nuance, Dia excels at generating lifelike, multi-speaker conversations complete with emotional tone adjustments and non-verbal cues such as pauses, laughter, and coughing. This level of expressiveness and control makes Dia a game changer in the field, enabling creators to craft engaging audio for podcasts, audiobooks, video game characters, and conversational interfaces without the need for high-end proprietary solutions. It is ideal for conversational AI, storytelling, dubbing, and interactive voice applications.
Technically, Dia is optimized specifically for natural dialogue synthesis, which distinguishes it from general-purpose TTS models. The architecture supports advanced features such as audio conditioning, where users can guide the generated speech’s tone, emotion, or delivery style using short audio samples. It also allows script-level control with embedded commands for non-verbal sounds, enhancing the realism of the output. The model was trained using Google’s TPU Cloud and is efficient enough to run on most modern computers, though the full version requires around 10GB of VRAM, with plans for a more lightweight, quantized release in the future. By releasing both the model weights and inference code openly, Nari Labs fosters community-driven innovation and transparency, positioning Dia as a versatile and accessible tool for next-generation speech synthesis.