Prompt

Visual: São Paulo, Brazil – a narrow street packed with vibrant murals in neon greens, reds, and yellows. The atmosphere is electric: street vendors, kids playing football, and local dancers vibing in the background. The city feels loud, alive, and unapologetic. Subject: A Black male rapper in raw streetwear — fitted cap, graphic tee, worn sneakers, simple chain. He stands front and center, locked in with the camera, body moving naturally to the rhythm, commanding the street like it’s his stage. Audio: [Male rapper, high-energy, gritty voice] Rapping over a hard street beat blended with Brazilian percussion: “Painted walls talk loud, yeah, they know my name, Came up from the block where the heat stay flame. City never soft, had to hustle, stay sharp, Turned struggle into fire, now I’m leavin’ my mark. Bass hit heavy, feel the ground when I step, From these streets to the world, yeah, I’m reppin’ respect.” Background: Heavy bass layered with live drum hits, claps, and subtle street noise — bikes passing, distant voices, raw urban texture. Camera: Rapid cuts between tight close-ups of his face and hands, wide shots of the colorful murals, quick flashes of dancers and street life. Handheld movement keeps it gritty and real, ending on a strong close-up as the beat hits.

Kling 2.6 Pro: Image-to-Video Model (with Native Audio)

What is Kling 2.6 Pro?

Kling 2.6 Pro is an image-to-video generative AI model that turns a single still image plus a text prompt into a short, cinematic clip with native audio. Provide image_url as the visual starting frame and describe the scene in prompt; the model generates smooth motion, coherent visuals, and synchronized sound (voice-like audio cues, ambience, and effects) that matches the on-screen action.

It’s designed for teams who need fast, production-ready video generation via REST API, especially when you want the output to feel like a complete “scene” instead of silent B-roll.

Key Features

Image-to-video generation from one reference image (image_url)
Native audio generation with generate_audio for richer, immersive clips
Cinematic motion + realism: smooth camera movement and consistent scene continuity
Duration control: generate 5s or 10s videos (duration)
Prompt steering + cleanup: prompt and negative_prompt for precision
CFG control: cfg_scale (0–1) to balance prompt adherence vs. natural motion
Aspect ratios: 16:9, 9:16, 1:1 (aspect_ratio) for web, social, and product
Pro mode: set mode="pro" for polished outputs
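
The parameters listed above can be assembled into a request body along these lines. This is a minimal sketch, not the provider's documented client: the endpoint URL, the x-api-key header, the default values, and the helper names build_payload and submit are assumptions for illustration. Only the parameter names and their allowed values come from the feature list.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the URL from your provider's API docs.
API_URL = "https://api.example.com/v1/kling-2.6-pro-image-to-video"

ALLOWED_DURATIONS = {"5", "10"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16", "1:1"}


def build_payload(image_url, prompt, negative_prompt="", cfg_scale=0.5,
                  duration="5", aspect_ratio="16:9", generate_audio=True):
    """Check inputs against the documented ranges and return a request body.

    The defaults here are illustrative choices, not documented API defaults.
    """
    if not image_url:
        raise ValueError("image_url is required")
    if not 0.0 <= cfg_scale <= 1.0:
        raise ValueError("cfg_scale must be between 0 and 1")
    duration = str(duration)  # the page lists durations as the strings "5" and "10"
    if duration not in ALLOWED_DURATIONS:
        raise ValueError('duration must be "5" or "10"')
    if aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        raise ValueError("aspect_ratio must be 16:9, 9:16, or 1:1")
    return {
        "image_url": image_url,
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "cfg_scale": cfg_scale,
        "duration": duration,
        "aspect_ratio": aspect_ratio,
        "generate_audio": generate_audio,
        "mode": "pro",
    }


def submit(payload, api_key):
    """POST the payload as JSON and return the raw response bytes.

    The auth header name is an assumption, and no response format is assumed.
    """
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Validating locally before sending keeps obviously invalid requests (an out-of-range cfg_scale, an unsupported duration) from costing a round trip. A call such as build_payload("https://example.com/frame.png", "slow dolly-in on the product", duration=10) returns a dict that can be passed to submit along with your API key.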

Best Use Cases

Product explainers & promos: animate packshots into lifestyle motion with ambience
Cinematic social content: vertical 9:16 reels with camera moves and sound
Storytelling shorts: consistent scene building from a keyframe image
Brand mood films: landscapes, interiors, and stylized establishing shots
App/landing page hero media: loopable 5–10s visuals with audio atmosphere

Prompt Tips and Output Quality

Start with a clear scene + action + camera: “slow dolly-in”, “handheld”, “crane up”.
Add lighting and mood: “golden hour”, “neon reflections”, “soft fog”.
Use negative_prompt to prevent artifacts: “no distortion, no jitter, no clutter”.
Tune cfg_scale:
- Higher (~0.7–1.0): stronger style/detail adherence
- Lower (~0.3–0.6): more natural motion, fewer “over-directed” frames
Pick aspect_ratio early (e.g., 16:9 cinematic, 9:16 social) to avoid reframing.
Enable generate_audio=true when you want the scene to feel complete (ambience + SFX).
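
The tips above can be turned into a small helper that composes the prompt and picks a cfg_scale value. This is only a sketch: compose_prompt, CFG_PRESETS, and tips_to_params are hypothetical names, and the preset values are simply the midpoints of the two ranges given above, not official settings.

```python
def compose_prompt(scene, action, camera, lighting=None, mood=None):
    """Join the non-empty parts into one comma-separated prompt string."""
    parts = [scene, action, camera, lighting, mood]
    return ", ".join(part.strip() for part in parts if part and part.strip())


# Illustrative midpoints of the ranges in the tips (0.3-0.6 and 0.7-1.0).
CFG_PRESETS = {
    "natural_motion": 0.45,
    "strict_adherence": 0.85,
}


def tips_to_params(scene, action, camera, lighting=None, mood=None,
                   style="natural_motion", aspect_ratio="16:9",
                   generate_audio=True):
    """Build a parameter dict that follows the prompt tips."""
    return {
        "prompt": compose_prompt(scene, action, camera, lighting, mood),
        # Example negative prompt taken from the tips.
        "negative_prompt": "no distortion, no jitter, no clutter",
        "cfg_scale": CFG_PRESETS[style],
        "aspect_ratio": aspect_ratio,
        "generate_audio": generate_audio,
    }


params = tips_to_params(
    scene="a rain-soaked city street at night",
    action="a cyclist rides past",
    camera="slow dolly-in",
    lighting="neon reflections",
    aspect_ratio="9:16",
)
print(params["prompt"])
# prints: a rain-soaked city street at night, a cyclist rides past, slow dolly-in, neon reflections
```

Keeping scene, action, and camera as separate arguments makes it easy to vary one element at a time while comparing outputs.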

FAQs

Is Kling 2.6 Pro text-to-video?
It is primarily image-to-video: you supply image_url as the visual anchor, plus a detailed prompt that describes the scene.

Does it generate audio automatically?
Yes, when generate_audio is set to true, native audio is generated alongside the video.

What video lengths are supported?
duration supports "5" or "10" seconds.

How is it different from other image-to-video models?
Its standout feature is native audio generation, combined with coherent cinematic motion from a single image.

What parameters should I tweak first for best results?
Start with prompt + negative_prompt, then adjust cfg_scale, duration, and aspect_ratio.

What mode should I use?
Use mode="pro" (the available option) for high-quality, polished outputs.
