POST
```javascript
const axios = require('axios');

const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/bytedance-humo";

// Request body; every value shown is the documented default
const data = {
  "frames": 30,
  "scale_a": 5,
  "scale_t": 5,
  "mode": "TA",
  "height": 720,
  "width": 1280,
  "steps": 30
};

(async function() {
  try {
    const response = await axios.post(url, data, {
      headers: { 'x-api-key': api_key }
    });
    console.log(response.data);
  } catch (error) {
    // error.response is undefined when no HTTP reply arrived (e.g. network failure)
    console.error('Error:', error.response ? error.response.data : error.message);
  }
})();
```
RESPONSE
image/jpeg
HTTP Response Codes
200 - OK: Image generated
401 - Unauthorized: User authentication failed
404 - Not Found: The requested URL does not exist
405 - Method Not Allowed: The requested HTTP method is not allowed
406 - Not Acceptable: Not enough credits
500 - Server Error: The server had an issue processing the request
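The status codes listed above can be turned into readable messages on the client. The sketch below is a minimal, hypothetical helper: `describeStatus` and `isRetryable` are illustrative names, and treating only 500 as retryable is an assumption, not documented Segmind behavior.

```javascript
// Status codes and meanings as listed in the response-code list.
const STATUS_MESSAGES = {
  200: 'OK: image generated',
  401: 'Unauthorized: user authentication failed',
  404: 'Not Found: the requested URL does not exist',
  405: 'Method Not Allowed: the requested HTTP method is not allowed',
  406: 'Not Acceptable: not enough credits',
  500: 'Server Error: the server had an issue processing the request',
};

// Returns a readable description, with a generic fallback for codes
// the documentation does not list.
function describeStatus(status) {
  return STATUS_MESSAGES[status] || `Unexpected status ${status}`;
}

// Assumed policy: only a server error is worth retrying as-is; 401, 404,
// 405 and 406 require a change by the caller (key, URL, method, credits).
function isRetryable(status) {
  return status === 500;
}
```

In the request sample's `catch` block, `describeStatus(error.response.status)` could replace the raw body in the log when `error.response` is present.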

Attributes


frames (int, default: 30)

Number of frames for the generated video.

min: 10, max: 100


scale_a (float, default: 5)

Strength of audio guidance. Higher values give tighter audio-motion synchronization.

min: 1, max: 10


scale_t (float, default: 5)

Strength of text guidance. Higher values give closer adherence to the text prompt.

min: 1, max: 10


mode (enum: str, default: TA)

Input mode: TA for text + audio; TIA for text + image + audio.

Allowed values: TA, TIA


height (enum: int, default: 720)

Video height in pixels (e.g., 720 or 480).


width (enum: int, default: 1280)

Video width in pixels (e.g., 1280 or 832).


steps (int, default: 30)

Number of denoising (inference) steps.

min: 1, max: 100
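Because a rejected request still costs a round trip, the documented ranges can be checked client-side first. This is a minimal sketch: `validateHumoPayload` is a hypothetical name, height and width are not checked because the full list of allowed values is not given here, and it assumes omitted fields fall back to the server-side defaults.

```javascript
// Documented inclusive ranges for the numeric attributes.
const RANGES = {
  frames: [10, 100],
  scale_a: [1, 10],
  scale_t: [1, 10],
  steps: [1, 100],
};
// Attributes documented as int rather than float.
const INTEGER_FIELDS = new Set(['frames', 'steps']);
const MODES = new Set(['TA', 'TIA']);

// Returns a list of problems; an empty array means the payload passed
// these client-side checks (the server remains the final authority).
function validateHumoPayload(payload) {
  const errors = [];
  for (const [field, [min, max]] of Object.entries(RANGES)) {
    if (!(field in payload)) continue; // assumed: omitted fields use defaults
    const value = payload[field];
    if (typeof value !== 'number' || Number.isNaN(value)) {
      errors.push(`${field} must be a number`);
    } else if (INTEGER_FIELDS.has(field) && !Number.isInteger(value)) {
      errors.push(`${field} must be an integer`);
    } else if (value < min || value > max) {
      errors.push(`${field} must be between ${min} and ${max}`);
    }
  }
  if ('mode' in payload && !MODES.has(payload.mode)) {
    errors.push('mode must be "TA" or "TIA"');
  }
  return errors;
}
```

Calling it on the request body before `axios.post` and aborting when the returned array is non-empty keeps obviously invalid requests from reaching the API.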

To track your credit usage, inspect the response headers of each API call: the x-remaining-credits header reports the number of credits left in your account. Monitor this value to avoid interruptions in your API usage.
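A minimal sketch of reading that header, assuming it carries a plain numeric value. `readRemainingCredits`, `checkCredits`, and the warning threshold are hypothetical; Node's HTTP stack lowercases incoming header names, so the lowercase key is used.

```javascript
// Parses x-remaining-credits into a number; returns null when the header
// is missing, empty, or not numeric.
function readRemainingCredits(headers) {
  const raw = headers['x-remaining-credits'];
  if (raw === undefined || raw === null || String(raw).trim() === '') return null;
  const credits = Number(raw);
  return Number.isFinite(credits) ? credits : null;
}

// Hypothetical threshold below which a warning is printed.
const LOW_CREDIT_THRESHOLD = 10;

// Logs a warning when credits run low and returns the parsed value.
function checkCredits(headers) {
  const credits = readRemainingCredits(headers);
  if (credits !== null && credits < LOW_CREDIT_THRESHOLD) {
    console.warn(`Only ${credits} credits remaining`);
  }
  return credits;
}
```

With axios, `checkCredits(response.headers)` can be called right after `axios.post` resolves.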

HuMo: Multimodal Human Video Generation Model

What is HuMo?

HuMo is a cutting-edge AI video generation model developed by ByteDance Research that creates high-quality, human-centric videos from multiple input types. This versatile model excels at generating detailed videos up to 1080p resolution, offering unprecedented control through text prompts, reference images, and audio inputs. Whether you need to create custom character animations, audio-synchronized performances, or scene-specific human interactions, HuMo provides a powerful framework for video synthesis.

Key Features

  • Three Flexible Generation Modes:
    • Text + image mode for appearance and scene customization
    • Text + audio mode for audio-driven video generation
    • Full multimodal mode combining text, image, and audio inputs
  • High-Resolution Output supporting up to 1080p video quality
  • Adjustable Frame Rates (30fps or 60fps) for optimal motion smoothness
  • Fine-grained Control over human poses, emotions, and scene settings
  • Audio Synchronization capabilities for matched movement and sound
  • Consistent Subject Preservation throughout generated sequences

Best Use Cases

  • Content Creation: Generate custom character videos for digital content
  • Entertainment: Create dynamic human performances synchronized with music
  • Education: Produce instructional videos with specific poses and movements
  • Marketing: Design personalized video content with controlled environments
  • Virtual Production: Generate preliminary human animations for pre-visualization

Prompt Tips and Output Quality

  • Scene Setting: Begin prompts with clear environment descriptions (e.g., "sunny beach morning" or "busy urban street")
  • Pose Guidance: Specify human poses explicitly using the 'human_pose' parameter
  • Emotional Context: Use the 'emotion' parameter to control subject expression
  • Audio Integration: Match background music to your scene's mood using provided options
  • Resolution Balance: Choose 720p for faster generation or 1080p for highest quality

FAQs

Q: How long can HuMo-generated videos be? A: Length is set by the frames parameter, which accepts 10 to 100 frames; the duration in seconds depends on the output frame rate.

Q: Can I control the background environment? A: Yes, using the 'scene' parameter, you can specify environments like 'city' or 'beach' for contextual setting.

Q: What frame rates are supported? A: HuMo supports both 30fps (standard) and 60fps (smooth motion) frame rates.
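Clip length in seconds follows from the frame count and the frame rate: duration = frames / fps. The sketch below applies this to the documented frames range (10 to 100) at the two rates named above; it assumes `frames` counts every output frame and that the rate is fixed for the whole clip, since the API attributes do not say which rate applies. `framesToSeconds` is a hypothetical helper.

```javascript
// Converts a frame count to clip length in seconds at a given frame rate.
function framesToSeconds(frames, fps) {
  if (!(fps > 0)) throw new RangeError('fps must be positive');
  return frames / fps;
}

// Shortest and longest clips allowed by the frames attribute at each rate.
for (const fps of [30, 60]) {
  const shortest = framesToSeconds(10, fps).toFixed(2);
  const longest = framesToSeconds(100, fps).toFixed(2);
  console.log(`${fps} fps: ${shortest}s to ${longest}s`);
}
```

Under these assumptions the maximum of 100 frames yields roughly 3.3 seconds at 30 fps and 1.7 seconds at 60 fps.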

Q: How does emotion control work? A: The 'emotion' parameter allows you to specify expressions like 'happy' or 'neutral', influencing the subject's facial expressions and body language.

Q: Can I combine multiple input types? A: Yes, HuMo's multimodal architecture allows combination of text, image, and audio inputs for maximum creative control.