POST

```javascript
const axios = require('axios');

const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/higgsfield-speech2video";

const data = {
  "input_image": "https://segmind-resources.s3.amazonaws.com/input/03cea2dd-87e9-41d7-9932-fbe45d4b2dd5-434b7481-1ddb-43da-a2df-10928effc900.png",
  "input_audio": "https://segmind-resources.s3.amazonaws.com/input/a846542c-c555-43ae-bdb0-8795ef78e0bb-8fe7c335-9e7f-4729-8230-b3eabc2af49c.wav",
  "prompt": "Generate an educational video with clear articulation, gentle hand gestures, and warm facial expressions appropriate for teaching content. All transitions need to be super realistic and smooth.",
  "quality": "high",
  "enhance_prompt": false,
  "seed": 42,
  "duration": 10
};

(async function() {
  try {
    const response = await axios.post(url, data, {
      headers: { 'x-api-key': api_key }
    });
    console.log(response.data);
  } catch (error) {
    // error.response is undefined for network-level failures, so guard before reading it
    console.error('Error:', error.response ? error.response.data : error.message);
  }
})();
```
RESPONSE
image/jpeg
HTTP Response Codes
200 - OK: Image generated
401 - Unauthorized: User authentication failed
404 - Not Found: The requested URL does not exist
405 - Method Not Allowed: The requested HTTP method is not allowed
406 - Not Acceptable: Not enough credits
500 - Server Error: Server had some issue with processing
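When handling responses in your own client code, it can help to map these status codes to actionable messages. The sketch below is illustrative only; `describeStatus` is a hypothetical helper name, not part of the Segmind API or any SDK.

```javascript
// Hypothetical helper mapping the documented HTTP status codes to messages.
function describeStatus(code) {
  const messages = {
    200: 'OK: generation succeeded',
    401: 'Unauthorized: check your x-api-key header',
    404: 'Not Found: the requested URL does not exist',
    405: 'Method Not Allowed: this endpoint expects POST',
    406: 'Not Acceptable: not enough credits in your account',
    500: 'Server Error: the server had an issue processing; consider retrying',
  };
  return messages[code] || `Unexpected status ${code}`;
}
```

A 406 in particular is worth surfacing distinctly, since it indicates a billing condition rather than a request error.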

Attributes


input_image (str, required)

Provide a URL of the image to drive animation. Use a clear, high-quality image for best results.


input_audio (str, required)

URL for the audio guiding avatar speech. Use articulate speech for clear lip-sync results.


prompt (str, required)

Describe the video output scenario. Create an engaging, emotional prompt for vibrant expressions.


quality (enum: str, default: high) Affects Pricing

Choose video quality preference. 'High' is best for detailed videos, while 'mid' helps with speed.

Allowed values: high, mid


enhance_prompt (boolean, default: true)

Automatically refine your prompt. Enable to achieve a balanced expression across the video.


seed (int, default: 42)

Set a seed number for consistent outputs. Use different seeds for variation, 42 is common.

min: 1, max: 1000000


duration (enum: int, default: 10) Affects Pricing

Decide video length in seconds. Choose longer durations for in-depth content.

Allowed values: 5, 10, 15
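Validating a payload against these constraints before sending it can save a round trip. The following client-side check is a sketch based on the constraints documented above; `validatePayload` is a hypothetical helper, not part of the Segmind API.

```javascript
// Hypothetical pre-flight validator for the documented parameter constraints.
function validatePayload(data) {
  const errors = [];
  // input_image, input_audio, and prompt are required strings
  for (const key of ['input_image', 'input_audio', 'prompt']) {
    if (typeof data[key] !== 'string' || data[key].length === 0) {
      errors.push(`${key} is required and must be a non-empty string`);
    }
  }
  if (data.quality !== undefined && !['high', 'mid'].includes(data.quality)) {
    errors.push("quality must be 'high' or 'mid'");
  }
  if (data.seed !== undefined &&
      (!Number.isInteger(data.seed) || data.seed < 1 || data.seed > 1000000)) {
    errors.push('seed must be an integer between 1 and 1000000');
  }
  if (data.duration !== undefined && ![5, 10, 15].includes(data.duration)) {
    errors.push('duration must be 5, 10, or 15 seconds');
  }
  return errors;
}
```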

To keep track of your credit usage, you can inspect the response headers of each API call. The x-remaining-credits property will indicate the number of remaining credits in your account. Ensure you monitor this value to avoid any disruptions in your API usage.
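Reading that header from an axios-style response object could look like the sketch below. The helper names and the warning threshold are illustrative assumptions, not part of the API.

```javascript
// Hypothetical helpers for monitoring the x-remaining-credits response header.
function remainingCredits(response) {
  const raw = response.headers && response.headers['x-remaining-credits'];
  const credits = Number.parseInt(raw, 10);
  // Return null when the header is absent or unparseable
  return Number.isNaN(credits) ? null : credits;
}

function shouldWarnLowCredits(response, threshold = 50) {
  // threshold is an arbitrary example value, not an API rule
  const credits = remainingCredits(response);
  return credits !== null && credits < threshold;
}
```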

Speak v2: Speech-to-Video Generation Model

Edited by Segmind Team on September 21, 2025.

What is Speak v2?

Speak v2 is a premium speech-to-video generation AI model developed by Higgsfield. It transforms a static image and an audio track into dynamic storytelling via a realistic, synchronized talking-avatar video. The output is a tightly audio-visual-synced generative video that mirrors natural facial expressions and speech patterns. Speak v2 offers fine-grained control over the creative process, maintaining the video's style and quality through multiple parameters that can be adjusted at any stage of the workflow. This next-gen AI model empowers users to craft video avatars for a wide range of creative fields and use cases.

Key Features of Speak v2

  • It produces videos with exceptional realism, accuracy in lip-sync, and natural facial expressions
  • It supports custom image inputs and audio files (MP3 format)
  • It provides adjustable video quality settings (high/mid) for different use cases
  • It has customizable video duration options (5, 10, or 15 seconds)
  • It comes with a prompt enhancement capability for optimized expressions
  • It can render reproducible results through seed parameter control
  • It includes seamless API integration with a structured response format

Best Use Cases

  • Virtual Presenters: It is ideal for creating professional spokesperson-avatar videos for corporate communications
  • Educational Content: It can be utilized to generate engaging teacher avatars for e-learning platforms
  • Marketing Materials: It can produce customized video advertisements with consistent messaging
  • Digital Avatars: It makes it convenient to build interactive character animations for gaming and entertainment
  • Social Media Content: Creative teams can design dynamic talking-head videos for social platforms
  • Multilingual Content: It makes it possible to generate videos with synchronized translations for mass reach

Prompt Tips and Output Quality

  • Use clear, high-quality reference images for a realistic avatar animation
  • Provide detailed prompts describing desired emotional expressions and speaking style for the avatar/video
  • Enable prompt enhancement for more balanced and natural expressions
  • Consider using "high" quality setting for client-facing content
  • Experiment with different seeds to find the most appealing animation style
  • Keep audio inputs clean and well-articulated for better lip-sync results
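The tips above can be folded into a small request-builder so every generation starts from the recommended settings. This is an illustrative sketch; `buildSpeakRequest` is a hypothetical helper name, not part of any Segmind SDK.

```javascript
// Hypothetical builder applying the recommended defaults from the tips above.
function buildSpeakRequest({ imageUrl, audioUrl, prompt }) {
  return {
    input_image: imageUrl,
    input_audio: audioUrl,
    prompt,
    quality: 'high',      // suggested for client-facing content
    enhance_prompt: true, // more balanced, natural expressions
    seed: 42,             // fix the seed for reproducible results
    duration: 10,
  };
}
```

Swapping the seed value between calls is then the only change needed to explore animation variations while keeping everything else constant.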

FAQs

How do I achieve the best lip-sync quality? High-quality audio input with clear articulation will generate better results, and enabling the high-quality setting will further enhance the result. Your reference image should have a clear, front-facing shot of the face for better visual fidelity.

Can I control the speaking style and expressions? Yes. Use detailed prompts and the enhance_prompt parameter to shape these aspects, and clearly describe the desired emotional state and speaking style in your prompt for more precise control.

What image formats work best with Speak v2? The model accepts image URLs, but high-resolution, well-lit, front-facing photos can generate even better quality output. Always provide a reference image with a clearly visible face that is centered.

How can I ensure consistent results across multiple generations? If you use the seed parameter, it will maintain consistency throughout the workflow and multiple generations. The same seed value will produce similar animation patterns (when other inputs remain unchanged).

What's the maximum video duration possible? Speak v2 supports video durations of 5, 10, or 15 seconds; choose the 15-second option for more extensive content needs.