POST

```javascript
const axios = require('axios');

const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/higgsfield-speech2video";

const data = {
  "input_image": "https://segmind-resources.s3.amazonaws.com/input/03cea2dd-87e9-41d7-9932-fbe45d4b2dd5-434b7481-1ddb-43da-a2df-10928effc900.png",
  "input_audio": "https://segmind-resources.s3.amazonaws.com/input/a846542c-c555-43ae-bdb0-8795ef78e0bb-8fe7c335-9e7f-4729-8230-b3eabc2af49c.wav",
  "prompt": "Generate an educational video with clear articulation, gentle hand gestures, and warm facial expressions appropriate for teaching content. All transitions need to be super realistic and smooth.",
  "quality": "high",
  "enhance_prompt": false,
  "seed": 42,
  "duration": 10
};

(async function() {
  try {
    const response = await axios.post(url, data, {
      headers: { 'x-api-key': api_key }
    });
    console.log(response.data);
  } catch (error) {
    // error.response is undefined for network-level failures, so guard before reading it
    console.error('Error:', error.response ? error.response.data : error.message);
  }
})();
```
RESPONSE
image/jpeg
HTTP Response Codes
200 - OK: Image generated
401 - Unauthorized: User authentication failed
404 - Not Found: The requested URL does not exist
405 - Method Not Allowed: The requested HTTP method is not allowed
406 - Not Acceptable: Not enough credits
500 - Server Error: Server had some issue with processing
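When handling responses in your own client code, it can help to map these status codes to actionable messages. The sketch below is illustrative only; `describeStatus` is a hypothetical helper name, not part of the Segmind API or any SDK.

```javascript
// Hypothetical helper mapping the documented HTTP status codes to messages.
function describeStatus(code) {
  const messages = {
    200: 'OK: generation succeeded',
    401: 'Unauthorized: check your x-api-key header',
    404: 'Not Found: the requested URL does not exist',
    405: 'Method Not Allowed: this endpoint expects POST',
    406: 'Not Acceptable: not enough credits in your account',
    500: 'Server Error: the server had an issue processing; consider retrying',
  };
  return messages[code] || `Unexpected status ${code}`;
}
```

A 406 in particular is worth surfacing distinctly, since it indicates a billing condition rather than a request error.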

Attributes


input_image (str, required)

Provide a URL of the image to drive animation. Use a clear, high-quality image for best results.


input_audio (str, required)

URL for the audio guiding avatar speech. Use articulate speech for clear lip-sync results.


prompt (str, required)

Describe the video output scenario. Create an engaging, emotional prompt for vibrant expressions.


quality (enum: str, default: high) Affects Pricing

Choose video quality preference. 'High' is best for detailed videos, while 'mid' helps with speed.

Allowed values: high, mid


enhance_prompt (boolean, default: true)

Automatically refine your prompt. Enable to achieve a balanced expression across the video.


seed (int, default: 42)

Set a seed number for consistent outputs. Use different seeds for variation, 42 is common.

min: 1, max: 1000000


duration (enum: int, default: 10) Affects Pricing

Decide video length in seconds. Choose longer durations for in-depth content.

Allowed values: 5, 10, 15
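Validating a payload against these constraints before sending it can save a round trip. The following client-side check is a sketch based on the constraints documented above; `validatePayload` is a hypothetical helper, not part of the Segmind API.

```javascript
// Hypothetical pre-flight validator for the documented parameter constraints.
function validatePayload(data) {
  const errors = [];
  // input_image, input_audio, and prompt are required strings
  for (const key of ['input_image', 'input_audio', 'prompt']) {
    if (typeof data[key] !== 'string' || data[key].length === 0) {
      errors.push(`${key} is required and must be a non-empty string`);
    }
  }
  if (data.quality !== undefined && !['high', 'mid'].includes(data.quality)) {
    errors.push("quality must be 'high' or 'mid'");
  }
  if (data.seed !== undefined &&
      (!Number.isInteger(data.seed) || data.seed < 1 || data.seed > 1000000)) {
    errors.push('seed must be an integer between 1 and 1000000');
  }
  if (data.duration !== undefined && ![5, 10, 15].includes(data.duration)) {
    errors.push('duration must be 5, 10, or 15 seconds');
  }
  return errors;
}
```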

To keep track of your credit usage, you can inspect the response headers of each API call. The x-remaining-credits property will indicate the number of remaining credits in your account. Ensure you monitor this value to avoid any disruptions in your API usage.
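Reading that header from an axios-style response object could look like the sketch below. The helper names and the warning threshold are illustrative assumptions, not part of the API.

```javascript
// Hypothetical helpers for monitoring the x-remaining-credits response header.
function remainingCredits(response) {
  const raw = response.headers && response.headers['x-remaining-credits'];
  const credits = Number.parseInt(raw, 10);
  // Return null when the header is absent or unparseable
  return Number.isNaN(credits) ? null : credits;
}

function shouldWarnLowCredits(response, threshold = 50) {
  // threshold is an arbitrary example value, not an API rule
  const credits = remainingCredits(response);
  return credits !== null && credits < threshold;
}
```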

Speak v2: Speech-to-Video Generation Model

Edited by Segmind Team on September 21, 2025.

What is Speak v2?

Speak v2 is a premium speech-to-video generation AI model developed by Higgsfield. It transforms a static image and an audio track into dynamic storytelling via a realistic, synchronized talking-avatar video. The output is a tightly audio-visual-synced generative video that mirrors natural facial expressions and speech patterns. Speak v2 offers fine-grained control over the creative process, maintaining the video's style and quality through multiple parameters that can be adjusted at any stage of the workflow. This next-gen AI model empowers users to craft video avatars for a wide range of creative fields and use cases.

Key Features of Speak v2

  • It produces videos with exceptional realism, accuracy in lip-sync, and natural facial expressions
  • It supports custom image inputs and audio files (MP3 format)
  • It provides adjustable video quality settings (high/mid) for different use cases
  • It has customizable video duration options (5, 10, or 15 seconds)
  • It comes with a prompt enhancement capability for optimized expressions
  • It can render reproducible results through seed parameter control
  • It includes seamless API integration with a structured response format

Best Use Cases

  • Virtual Presenters: It is ideal for creating professional spokesperson-avatar videos for corporate communications
  • Educational Content: It can be utilized to generate engaging teacher avatars for e-learning platforms
  • Marketing Materials: It can produce customized video advertisements with consistent messaging
  • Digital Avatars: It makes it convenient to build interactive character animations for gaming and entertainment
  • Social Media Content: Creative teams can design dynamic talking-head videos for social platforms
  • Multilingual Content: It makes it possible to generate videos with synchronized translations for mass reach

Prompt Tips and Output Quality

  • Use clear, high-quality reference images for a realistic avatar animation
  • Provide detailed prompts describing desired emotional expressions and speaking style for the avatar/video
  • Enable prompt enhancement for more balanced and natural expressions
  • Consider using "high" quality setting for client-facing content
  • Experiment with different seeds to find the most appealing animation style
  • Keep audio inputs clean and well-articulated for better lip-sync results
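The tips above can be folded into a small request-builder so every generation starts from the recommended settings. This is an illustrative sketch; `buildSpeakRequest` is a hypothetical helper name, not part of any Segmind SDK.

```javascript
// Hypothetical builder applying the recommended defaults from the tips above.
function buildSpeakRequest({ imageUrl, audioUrl, prompt }) {
  return {
    input_image: imageUrl,
    input_audio: audioUrl,
    prompt,
    quality: 'high',      // suggested for client-facing content
    enhance_prompt: true, // more balanced, natural expressions
    seed: 42,             // fix the seed for reproducible results
    duration: 10,
  };
}
```

Swapping the seed value between calls is then the only change needed to explore animation variations while keeping everything else constant.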

FAQs

How do I achieve the best lip-sync quality? High-quality audio input with clear articulation will generate better results, and enabling the high-quality setting will further enhance the result. Your reference image should have a clear, front-facing shot of the face for better visual fidelity.

Can I control the speaking style and expressions? Yes. Use detailed prompts and the enhance_prompt parameter to shape these aspects, and clearly describe the desired emotional state and speaking style in your prompt for more precise control.

What image formats work best with Speak v2? The model accepts image URLs, but high-resolution, well-lit, front-facing photos can generate even better quality output. Always provide a reference image with a clearly visible face that is centered.

How can I ensure consistent results across multiple generations? If you use the seed parameter, it will maintain consistency throughout the workflow and multiple generations. The same seed value will produce similar animation patterns (when other inputs remain unchanged).

What's the maximum video duration possible? Speak v2 supports video durations of 5, 10, or 15 seconds; choose the 15-second option for more extensive content needs.