// Requires the axios package: npm install axios
const axios = require('axios');

const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/higgsfield-speech2video";

// Request payload; each field is described in the parameter reference below.
const data = {
  "input_image": "https://segmind-resources.s3.amazonaws.com/input/03cea2dd-87e9-41d7-9932-fbe45d4b2dd5-434b7481-1ddb-43da-a2df-10928effc900.png",
  "input_audio": "https://segmind-resources.s3.amazonaws.com/input/a846542c-c555-43ae-bdb0-8795ef78e0bb-8fe7c335-9e7f-4729-8230-b3eabc2af49c.wav",
  "prompt": "Generate an educational video with clear articulation, gentle hand gestures, and warm facial expressions appropriate for teaching content. All transitions need to be super realistic and smooth.",
  "quality": "high",
  "enhance_prompt": false,
  "seed": 42,
  "duration": 10
};

(async function () {
  try {
    // Authenticate with your Segmind API key via the x-api-key header.
    const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
    console.log(response.data);
  } catch (error) {
    // error.response is undefined for network errors, so fall back to error.message.
    console.error('Error:', error.response ? error.response.data : error.message);
  }
})();
input_image
Provide a URL of the image that drives the animation. Use a clear, high-quality image for best results.

input_audio
URL of the audio that guides the avatar's speech. Use articulate speech for clear lip-sync results.

prompt
Describe the desired video output scenario. Write an engaging, emotional prompt for vibrant expressions.

quality
Choose the video quality preference. 'high' is best for detailed videos, while 'mid' helps with speed.
Allowed values: high, mid

enhance_prompt
Automatically refine your prompt. Enable this to achieve balanced expression across the video.

seed
Set a seed number for consistent outputs. Use different seeds for variation; 42 is a common default.
min: 1, max: 1000000

duration
Decide the video length in seconds. Choose longer durations for in-depth content.
Allowed values: 5, 10, 15
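As a quick sketch of how these parameters combine, the payload below requests a faster draft render; the asset URLs are placeholders rather than real assets, and the values stay within the allowed ranges listed above.

// Sketch: a faster draft configuration of the same request body.
const draftData = {
  "input_image": "https://example.com/your-avatar.png",   // placeholder; use your own image URL
  "input_audio": "https://example.com/your-speech.wav",   // placeholder; use your own audio URL
  "prompt": "A friendly presenter explaining the topic with calm gestures and warm expressions.",
  "quality": "mid",          // trades detail for faster generation
  "enhance_prompt": true,    // let the model refine the prompt automatically
  "seed": 42,
  "duration": 5              // shortest supported duration
};
// Post draftData exactly as shown in the example at the top of this page.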
To keep track of your credit usage, inspect the response headers of each API call. The x-remaining-credits header indicates how many credits remain in your account; monitor this value to avoid disruptions in your API usage.
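For example, reusing the axios call from the snippet above, a minimal sketch of reading that header could look like this (axios lower-cases response header names):

// Sketch: log remaining credits after each generation request.
async function generateWithCreditCheck() {
  const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
  console.log('Remaining credits:', response.headers['x-remaining-credits']);
  return response.data;
}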
Edited by Segmind Team on September 21, 2025.
Speak v2 is a premium speech-to-video generation AI model developed by Higgsfield. It transforms a static image and an audio clip into a realistic, synchronized talking-avatar video whose output mirrors natural facial expressions and speech patterns with tight audio-visual sync. Speak v2 also offers fine control over the creative process: the video's style and quality can be shaped through parameters that can be adjusted at any point in the workflow. This makes it well suited to crafting video avatars for a wide range of creative fields and use cases.
How do I achieve the best lip-sync quality? High-quality audio with clear articulation produces better results, and setting quality to 'high' improves them further. Your reference image should be a clear, front-facing shot of the face for the best visual fidelity.
Can I control the speaking style and expressions? Yes. Use a detailed prompt and the enhance_prompt parameter to shape these aspects, and describe the desired emotional state and speaking style explicitly in your prompt for more precise control.
What image formats work best with Speak v2? The model accepts image URLs; high-resolution, well-lit, front-facing photos produce the best output. Always provide a reference image with a clearly visible, centered face.
How can I ensure consistent results across multiple generations? Use the seed parameter: the same seed value produces similar animation patterns when the other inputs remain unchanged, while different seeds introduce variation. A short sketch follows these FAQs.
What's the maximum video duration possible? Speak v2 supports video durations of 5, 10, or 15 seconds; choose the longer options for more extensive content.
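As noted in the consistency FAQ above, the seed is the main tool for repeatable output. Here is a minimal sketch of that pattern, reusing the url, data, and api_key constants from the example at the top of this page:

// Sketch: keep one fixed seed for repeatable output; change it only when you want variation.
const FIXED_SEED = 42;
const consistentRun = { ...data, seed: FIXED_SEED };      // repeat runs stay visually consistent
const variationRun = { ...data, seed: FIXED_SEED + 1 };   // a deliberately different take
// Each payload is posted the same way, e.g.:
// await axios.post(url, consistentRun, { headers: { 'x-api-key': api_key } });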