POST
```javascript
const fs = require('fs');
const axios = require('axios');
const FormData = require('form-data');

const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/orpheus-3b-0.1";

const reqBody = {
  "text": "Today has been... exhausting. <sigh> First, I missed the bus. Then it started pouring rain—of course, I forgot my umbrella. <groan> And just when I thought things couldn’t get worse, I spilled coffee all over my white shirt right before the presentation. <cough> But hey, at least I survived... kind of.",
  "top_p": 0.95,
  "voice": "dan",
  "temperature": 0.6,
  "max_new_tokens": 1200,
  "repetition_penalty": 1.1
};

(async function() {
  try {
    const formData = new FormData();

    // Append each request parameter as a multipart form field
    for (const [key, value] of Object.entries(reqBody)) {
      formData.append(key, String(value));
    }

    const response = await axios.post(url, formData, {
      headers: {
        'x-api-key': api_key,
        ...formData.getHeaders()
      },
      // The model returns binary audio, so ask axios for raw bytes rather than text
      responseType: 'arraybuffer'
    });

    // The remaining credit balance is reported in a response header
    console.log('Remaining credits:', response.headers['x-remaining-credits']);

    // Assumes WAV output; adjust the file extension to match the Content-Type header
    console.log('Content-Type:', response.headers['content-type']);
    fs.writeFileSync('output.wav', Buffer.from(response.data));
  } catch (error) {
    // With responseType 'arraybuffer', error bodies also arrive as bytes
    const details = error.response
      ? Buffer.from(error.response.data).toString('utf8')
      : error.message;
    console.error('Error:', details);
  }
})();
```
RESPONSE
Binary audio file (check the Content-Type response header for the exact format)
HTTP Response Codes
200 - OK: Audio generated
401 - Unauthorized: User authentication failed
404 - Not Found: The requested URL does not exist
405 - Method Not Allowed: The requested HTTP method is not allowed
406 - Not Acceptable: Not enough credits
500 - Server Error: The server encountered an error while processing the request
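When handling errors from the request, it can help to turn these status codes into actionable messages. The sketch below uses a hypothetical helper, `explainStatus`, that is not part of any Segmind SDK; treating only 500 as worth retrying is an assumption, since a 401, 404, 405, or 406 will fail again until the key, URL, method, or credit balance is fixed.

```javascript
// Hypothetical helper: maps the documented status codes to a short hint and
// whether retrying the same request might help (retry policy is an assumption).
const STATUS_HINTS = {
  200: { meaning: 'OK: request succeeded', retryable: false },
  401: { meaning: 'Unauthorized: check the x-api-key header', retryable: false },
  404: { meaning: 'Not Found: check the endpoint URL', retryable: false },
  405: { meaning: 'Method Not Allowed: use POST', retryable: false },
  406: { meaning: 'Not Acceptable: not enough credits', retryable: false },
  500: { meaning: 'Server Error: processing failed on the server', retryable: true },
};

function explainStatus(status) {
  // Fall back to a generic entry for codes not listed in the documentation
  return STATUS_HINTS[status] || { meaning: `Unexpected status ${status}`, retryable: false };
}

console.log(explainStatus(406).meaning); // Not Acceptable: not enough credits
```

In the request's `catch` block, `explainStatus(error.response.status)` could be called whenever `error.response` is present.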

Attributes


text (str, required)

Input text to convert to speech. It may include expressive tags such as <sigh> or <laugh>.


top_p (float, default: 0.95)

Top-p for nucleus sampling. Recommended ranges: 0.2–0.4 for a neutral tone, 0.6–0.8 for conversational speech, 0.8–1.0 for expressive speech, and 0.3–0.5 for assistants.

Min: 0.1, max: 1


voice (enum: str, default: dan)

Voice preset used for speech synthesis.

Allowed values:


temperature (float, default: 0.6)

Sampling temperature, which controls expressiveness: 0.1–0.5 for stable speech, 0.6–1.0 for a natural tone, and 1.1–1.5 for expressive or dramatic delivery.

Min: 0.1, max: 1.5


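max_new_tokens (int, default: 1200)

Maximum number of tokens to generate. Generation stops at this limit, so long input text may need a higher value to avoid cutting the audio short.

Min: 100, max: 2000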


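repetition_penalty (float, default: 1.1)

Penalty applied to tokens that have already been generated. A value of 1 applies no penalty; higher values discourage repetition.

Min: 1, max: 2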
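Because every numeric parameter has a documented default and range, a request body can be checked locally before it is sent. This is a minimal sketch with a hypothetical `buildOrpheusRequest` helper; it encodes only the defaults and min/max values listed above and does not validate `voice`, since its allowed values are not listed here.

```javascript
// Defaults and [min, max] ranges taken from the attribute reference above.
const PARAM_SPEC = {
  top_p: { default: 0.95, min: 0.1, max: 1 },
  temperature: { default: 0.6, min: 0.1, max: 1.5 },
  max_new_tokens: { default: 1200, min: 100, max: 2000, integer: true },
  repetition_penalty: { default: 1.1, min: 1, max: 2 },
};

// Hypothetical helper: fills in defaults and rejects out-of-range values
// before the request is sent, instead of waiting for an API error.
function buildOrpheusRequest(text, overrides = {}) {
  if (typeof text !== 'string' || text.trim() === '') {
    throw new Error('text is required and must be a non-empty string');
  }
  // voice is passed through unchecked; its allowed values are not listed here
  const body = { text, voice: overrides.voice ?? 'dan' };
  for (const [name, spec] of Object.entries(PARAM_SPEC)) {
    const value = overrides[name] ?? spec.default;
    if (typeof value !== 'number' || Number.isNaN(value)) {
      throw new TypeError(`${name} must be a number`);
    }
    if (spec.integer && !Number.isInteger(value)) {
      throw new TypeError(`${name} must be an integer`);
    }
    if (value < spec.min || value > spec.max) {
      throw new RangeError(`${name}=${value} is outside [${spec.min}, ${spec.max}]`);
    }
    body[name] = value;
  }
  return body;
}

console.log(buildOrpheusRequest('Hello there. <chuckle>', { temperature: 0.8 }));
```

The returned object can be used in place of `reqBody` in the request example.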

To keep track of your credit usage, inspect the response headers of each API call: the x-remaining-credits header gives the number of credits remaining in your account. Monitor this value to avoid interruptions to your API usage.
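A small check on this header can warn you before requests start failing with 406. The sketch below uses a hypothetical `checkCredits` helper; the threshold of 10 is an arbitrary assumption, and it relies on axios exposing header names in lowercase under Node.js.

```javascript
// Hypothetical helper: reads the x-remaining-credits response header and
// flags when the balance drops below a threshold you choose.
function checkCredits(headers, threshold = 10) {
  // axios exposes response header names in lowercase
  const raw = headers['x-remaining-credits'];
  if (raw === undefined) {
    return { remaining: null, low: false };
  }
  const remaining = Number(raw);
  if (Number.isNaN(remaining)) {
    return { remaining: null, low: false };
  }
  return { remaining, low: remaining < threshold };
}

// Example with a plain object standing in for response.headers
console.log(checkCredits({ 'x-remaining-credits': '4.25' })); // { remaining: 4.25, low: true }
```

In the request example, this would be called as `checkCredits(response.headers)` after a successful call.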

3B Orpheus TTS (0.1): Powering Audio Innovation

The 3B Orpheus TTS (0.1) by Canopy Labs is an open-source text-to-speech model for developers and creators. It is built on a 3-billion-parameter Llama-based speech LLM, trained on 100,000 hours of audio, and released under the Apache 2.0 license.

Key Features

  • Zero-Shot Voice Cloning: Replicate a voice without additional training.
  • Emotion Control: Add emotion with simple inline tags.
  • Low Latency: ~200 ms streaming latency, reducible to ~100 ms with optimization.
  • Multimodal Ready: Sync speech with visuals or animations.

For Developers

  • Easy integration via Python, Colab, or APIs.
  • Streaming support for real-time apps.
  • Lightweight for mobile, edge, or cloud use.

For Creators

  • Natural, emotive speech for narration or dialogue.
  • Clone voices for podcasts, games, or personal projects.
  • Apache 2.0 licensing, with no proprietary usage restrictions.

Tips and Tricks

To get the best results from your TTS model, start with a top-p between 0.6 and 0.9 and a temperature around 0.7 to 1.0 for natural, conversational speech. If you need highly expressive or emotional voices—like for storytelling or character dialogue—increase both parameters (top-p closer to 1.0 and temperature up to 1.5). For more stable, clear, and predictable speech—such as virtual assistants or system prompts—use a lower top-p (0.2–0.5) and temperature (0.3–0.6). These settings help balance clarity, emotion, and control, depending on your use case. Experiment with small increments to fine-tune the voice to your specific needs.
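These recommendations can be captured as named starting points. In the sketch below, the exact preset values are illustrative assumptions picked from inside the ranges above, not values published by Canopy Labs or Segmind, and `samplingPreset` is a hypothetical helper. The optional nudge supports the "small increments" approach while keeping temperature inside the API's documented 0.1–1.5 range.

```javascript
// Illustrative starting points chosen from inside the ranges in the tips above.
// These exact numbers are assumptions, not officially published values.
const SAMPLING_PRESETS = {
  conversational: { top_p: 0.75, temperature: 0.85 }, // tips: top-p 0.6–0.9, temperature 0.7–1.0
  expressive: { top_p: 0.95, temperature: 1.3 },      // tips: top-p near 1.0, temperature up to 1.5
  stable: { top_p: 0.35, temperature: 0.45 },         // tips: top-p 0.2–0.5, temperature 0.3–0.6
};

// Hypothetical helper: returns a preset, optionally nudging temperature by a
// small increment and clamping it to the documented range [0.1, 1.5].
function samplingPreset(useCase, temperatureNudge = 0) {
  const preset = SAMPLING_PRESETS[useCase];
  if (!preset) {
    const known = Object.keys(SAMPLING_PRESETS).join(', ');
    throw new Error(`Unknown use case "${useCase}"; expected one of: ${known}`);
  }
  const raw = preset.temperature + temperatureNudge;
  // Round to two decimals to avoid floating-point noise in the sum
  const temperature = Math.min(1.5, Math.max(0.1, Math.round(raw * 100) / 100));
  return { top_p: preset.top_p, temperature };
}

console.log(samplingPreset('stable')); // { top_p: 0.35, temperature: 0.45 }
console.log(samplingPreset('conversational', 0.1)); // { top_p: 0.75, temperature: 0.95 }
```

A preset can be merged into a request body with `{ ...reqBody, ...samplingPreset('stable') }`.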

This TTS model goes beyond plain speech—it can bring your audio to life with natural vocal expressions. It supports a range of tags like <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, and <gasp>, allowing you to add realistic human touches to the voice. You can even use filler sounds like "uhm" to make the speech feel more casual and conversational. Whether you're building dialogue for games, interactive stories, or lifelike voice agents, these expressive tags help deliver a more immersive and emotionally rich experience.
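Since tags are written inline in the `text` field, a misspelled tag such as `<laughs>` is easy to miss. This minimal sketch, using a hypothetical `findUnsupportedTags` helper, flags any angle-bracket tag that is not in the list above; it assumes tags are matched exactly as documented (lowercase), and it does not treat filler words like "uhm" as tags.

```javascript
// Tags listed as supported in the paragraph above.
const SUPPORTED_TAGS = new Set([
  'laugh', 'chuckle', 'sigh', 'cough', 'sniffle', 'groan', 'yawn', 'gasp',
]);

// Hypothetical helper: returns tags in the text that are not in the supported
// list, so typos can be caught before sending a request.
function findUnsupportedTags(text) {
  const unsupported = [];
  // Match <tag> with no whitespace or nested angle brackets inside
  for (const match of text.matchAll(/<([^<>\s]+)>/g)) {
    if (!SUPPORTED_TAGS.has(match[1])) {
      unsupported.push(match[1]);
    }
  }
  return unsupported;
}

console.log(findUnsupportedTags('Well <sigh> fine. <laughs> <gasp>')); // [ 'laughs' ]
```

An empty result means every tag in the text appears in the supported list.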