POST
javascript
const axios = require('axios');
const fs = require('fs');
const path = require('path');

// Helper to convert a local image into base64 format
async function toB64(imgPath) {
  const data = fs.readFileSync(path.resolve(imgPath));
  return Buffer.from(data).toString('base64');
}

const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/qvq-max";

const data = {
  "messages": [
    { "role": "user", "content": "tell me a joke on cats" },
    { "role": "assistant", "content": "here is a joke about cats..." },
    { "role": "user", "content": "now a joke on dogs" }
  ]
};

(async function () {
  try {
    const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
    console.log(response.data);
  } catch (error) {
    // error.response is undefined for network-level failures, so fall back to error.message
    console.error('Error:', error.response ? error.response.data : error.message);
  }
})();
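The toB64 helper in the snippet above is defined but never used. Here is a minimal sketch of attaching a local image as a data URL; the content-part shape below is an assumption (OpenAI-style multimodal messages), so verify it against Segmind's request schema:

```javascript
// Build a user message that pairs a text question with a base64-encoded image.
// The { type: "image_url" } content-part shape is assumed (OpenAI-style);
// confirm the actual schema in the provider's documentation.
function buildImageMessage(question, b64Png) {
  return {
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: question },
          { type: "image_url", image_url: { url: `data:image/png;base64,${b64Png}` } }
        ]
      }
    ]
  };
}

// Usage with the helper above:
// const payload = buildImageMessage("Solve this step by step.", await toB64("diagram.png"));
```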
RESPONSE
application/json
HTTP Response Codes
200 - OK: Request processed successfully
401 - Unauthorized: User authentication failed
404 - Not Found: The requested URL does not exist
405 - Method Not Allowed: The requested HTTP method is not allowed
406 - Not Acceptable: Not enough credits
500 - Server Error: The server encountered an issue while processing the request

Attributes


messages (array)

An array of objects, each containing a role and content.


role (str)

One of "user", "assistant", or "system".


content (str)

A string containing the user's query or the assistant's response.

To keep track of your credit usage, you can inspect the response headers of each API call. The x-remaining-credits property will indicate the number of remaining credits in your account. Ensure you monitor this value to avoid any disruptions in your API usage.
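As a sketch, the header can be read from an axios response object with a small helper like the hypothetical one below (axios exposes response headers as a plain object with lower-cased keys):

```javascript
// Extract the remaining-credit count from a response's headers.
// Returns null when the header is absent. The helper name is illustrative.
function remainingCredits(headers) {
  const value = headers['x-remaining-credits'];
  return value === undefined ? null : Number(value);
}

// e.g. after an API call: const left = remainingCredits(response.headers);
console.log(remainingCredits({ 'x-remaining-credits': '42' })); // 42
```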

QVQ Max — Visual Reasoning Model API

What is QVQ Max?

QVQ Max is Alibaba Cloud's flagship visual reasoning model, developed by the Qwen team as the production successor to QVQ-72B-Preview. It is purpose-built for tasks that require extended logical reasoning over visual information — not just recognizing what is in an image, but thinking through it step-by-step to arrive at a well-reasoned answer.

Unlike general vision-language models that return instant direct answers, QVQ Max is a thinking-only model: every response begins with a transparent chain-of-thought reasoning process, visible in real time via streaming output. This makes it uniquely suited for complex visual problem-solving in education, research, analytics, and enterprise AI applications where accuracy and explainability are critical.

The model accepts text, images, and video frames within a 131,072-token context window, and always outputs its reasoning chain alongside the final answer.

Key Features

  • Always-on chain-of-thought reasoning — the model thinks through every problem before answering; reasoning is streamed in real time
  • 131K token context window — supports long multi-image prompts, video frames, and complex visual documents
  • Multi-image support — compare, contrast, and reason across multiple images in a single call
  • Video understanding — extract sequential reasoning from video frames with temporal grounding
  • Streaming-only output — responses are delivered progressively, ideal for interactive and real-time applications
  • High accuracy on STEM problems — excels at mathematics, physics, and chart interpretation tasks requiring inference, not just recognition

Best Use Cases

STEM Education and Tutoring: Solve math, physics, and chemistry problems from textbook diagrams, handwritten notes, or whiteboard photos — with every reasoning step shown.

Data Analysis and Business Intelligence: Interpret complex charts, financial dashboards, and multi-variable graphs with detailed written explanations of trends and anomalies.

Scientific and Medical Research: Analyze annotated figures, experimental data plots, pathology slides, or research paper diagrams requiring domain-aware reasoning.

Engineering and Architecture Review: Read circuit schematics, system architecture diagrams, or CAD-style drawings and reason about design decisions or implementation steps.

Visual QA for Enterprise: Audit technical documents, annotated maps, compliance screenshots, or instructional manuals where inference over visual content is required.

Code and UI Analysis: Analyze screenshots of interfaces, error outputs, or code diagrams and reason about bugs, design improvements, or implementation steps.

Prompt Tips and Output Quality

For best results with QVQ Max, frame your prompts as explicit questions that require reasoning, not just description. Instead of asking "What is in this image?", ask "What mathematical principle does this diagram demonstrate? Solve the problem step by step." The model performs best when it knows what conclusion to reason toward.

Since QVQ Max always shows its reasoning chain, you can reference intermediate steps in follow-up prompts. This makes it ideal for interactive, multi-turn analytical sessions.

For video inputs, specify the time range or frame numbers you want analyzed. For multi-image comparisons, number your images in the prompt and ask for explicit comparisons.
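The multi-image advice above can be sketched as a payload builder that numbers each image inline. The content-part shape is again an OpenAI-style assumption; adapt it to the provider's actual schema:

```javascript
// Build a user message that interleaves numbered labels with image URLs,
// so the prompt can refer to "Image 1", "Image 2", etc. unambiguously.
function buildComparisonMessage(question, imageUrls) {
  const parts = [{ type: "text", text: question }];
  imageUrls.forEach((url, i) => {
    parts.push({ type: "text", text: `Image ${i + 1}:` });
    parts.push({ type: "image_url", image_url: { url } });
  });
  return { role: "user", content: parts };
}

const msg = buildComparisonMessage(
  "Compare Image 1 and Image 2: which chart shows the steeper growth trend, and why?",
  ["https://example.com/q1.png", "https://example.com/q2.png"]
);
console.log(msg.content.length); // 5 parts: 1 question + 2 labels + 2 images
```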

Be aware that thinking tokens are billed as output — for simple tasks, consider Qwen3 VL Flash as a more cost-effective alternative.

FAQs

Is the chain-of-thought reasoning optional? No — QVQ Max is a thinking-only model. Reasoning is always generated and cannot be disabled. This is by design, ensuring high accuracy on complex tasks.

Does QVQ Max support video inputs? Yes — it can process video frames within its 131K context window, with temporal reasoning to understand sequences and events over time.

How does QVQ Max compare to Qwen3 VL Flash? QVQ Max is optimized for deep, step-by-step reasoning on complex tasks and is always in thinking mode. Qwen3 VL Flash offers optional thinking mode at significantly lower cost, making it better for high-volume or simpler visual tasks.

Why is QVQ Max streaming-only? Extended reasoning chains make streaming the natural output method — you see the model's thinking as it happens, which is valuable for interactive applications and educational use cases.

What image formats are supported? QVQ Max accepts URLs and base64-encoded images in PNG, JPEG, and WebP formats. Multiple images can be included in a single prompt.

Is QVQ Max suitable for real-time production workloads? It depends on the use case. For tasks requiring deep reasoning accuracy, QVQ Max is excellent. For high-throughput batch processing, Qwen3 VL Flash is more appropriate.