POST
javascript
const axios = require('axios');
const fs = require('fs');
const path = require('path');

// helper to convert local images to base64 (used in the image request sketch further below)
async function toB64(imgPath) {
    const data = fs.readFileSync(path.resolve(imgPath));
    return Buffer.from(data).toString('base64');
}

const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/qwen2p5-vl-32b-instruct";

const data = {
    "messages": [
        {
            "role": "user",
            "content": "tell me a joke on cats"
        },
        {
            "role": "assistant",
            "content": "here is a joke about cats..."
        },
        {
            "role": "user",
            "content": "now a joke on dogs"
        }
    ]
};

(async function () {
    try {
        const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
        console.log(response.data);
    } catch (error) {
        // error.response is undefined for network-level failures, so guard before reading .data
        console.error('Error:', error.response ? error.response.data : error.message);
    }
})();
RESPONSE
application/json
HTTP Response Codes
200 - OK: Request processed successfully
401 - Unauthorized: User authentication failed
404 - Not Found: The requested URL does not exist
405 - Method Not Allowed: The requested HTTP method is not allowed
406 - Not Acceptable: Not enough credits
500 - Server Error: Server had some issue with processing

Attributes


messages (array)

An array of objects, each containing a role and content.


role (str)

Could be "user", "assistant" or "system".


content (str)

A string containing the user's query or the assistant's response.
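
Put together, a minimal request body that exercises all three roles might look like the sketch below; the message text is purely illustrative.

const body = {
    "messages": [
        { "role": "system", "content": "You are a concise, friendly assistant." }, // optional behavior-setting turn
        { "role": "user", "content": "Give me a haiku about autumn." },            // user query
        { "role": "assistant", "content": "Golden leaves drift down..." },         // earlier model reply, passed back for context
        { "role": "user", "content": "Now one about winter." }                     // follow-up query
    ]
};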

To keep track of your credit usage, you can inspect the response headers of each API call. The x-remaining-credits property will indicate the number of remaining credits in your account. Ensure you monitor this value to avoid any disruptions in your API usage.
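
For instance, with the axios example above, the header can be read straight off the response object (axios lower-cases header names). This fragment slots into the try block of the request example; the threshold of 10 is an arbitrary illustration, not an API limit.

const remaining = response.headers['x-remaining-credits'];
console.log(`Credits remaining: ${remaining}`);
// 10 is an illustrative threshold, not an API limit
if (Number(remaining) < 10) {
    console.warn('Low on credits; top up to avoid 406 Not Acceptable errors.');
}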

Qwen2.5-VL 32B Instruct – Multimodal Large Language Model

What is Qwen2.5-VL 32B Instruct?

Qwen2.5-VL 32B Instruct is a state-of-the-art multimodal AI model from the Qwen team at Alibaba Cloud. Built on roughly 33 billion parameters, it processes both text and image inputs and generates text responses, making it ideal for complex instruction-following across modalities. With a large context window of up to 125,000 tokens, Qwen2.5-VL excels at handling long documents, extended conversations, and deep multi-step reasoning. The model supports fine-tuning on domain-specific data and offers serverless deployment for automatic scaling and low-latency inference.

Key Features

  • 33 Billion Parameters: Robust neural architecture for nuanced language and vision understanding.
  • 125,000-Token Context: Best-in-class context length to capture full conversations, legal documents, and codebases.
  • Multimodal Fusion: Joint embedding space for text and images enables tasks like visual question answering and content summarization.
  • Instruction Fine-Tuning: Fine-tuned on instruction datasets to follow user prompts accurately.
  • Serverless Deployment: Instant scaling and simplified API management for production workloads.
  • Versatile Output: Rich text generation, step-by-step explanations, image captioning, and more.

Best Use Cases

  • Advanced Chatbots: Build customer support agents that understand screenshots, scans, and long chat histories.
  • Document Understanding: Summarize reports, extract key facts, and answer questions from PDF or HTML.
  • Visual Question Answering: Analyze diagrams or photos to provide descriptions, insights, and annotations.
  • Multimodal Content Generation: Create interactive tutorials combining text, code snippets, and images.
  • Knowledge Retrieval: Search and reason over enterprise data vaults or research archives.
  • Instructional AI: Develop tutoring systems that accept textbook excerpts and illustrations.

Prompt Tips and Output Quality

  1. Be Explicit: Start with “Analyze this image…” or “Summarize the following text…” to guide the model’s objective.
  2. Leverage Context: Provide longer context windows when working with large documents or multi-turn dialogues.
  3. Image Clarity: Use high-resolution, well-lit images for accurate visual reasoning.
  4. Step-by-Step Instructions: Break complex tasks into numbered steps in your prompt (see the sketch after this list).
  5. Iterate and Refine: Review outputs, adjust prompt phrasing, and re-submit to improve response quality.
  6. Combine Modalities: Pair text instructions with relevant images to unlock richer, multimodal insights.
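
Putting tips 1 and 4 into practice, a request that states the objective up front and enumerates the steps might look like this; the wording is purely illustrative.

const promptData = {
    "messages": [
        {
            "role": "user",
            "content": "Summarize the following text for a non-technical reader. " +
                       "1) List the three main findings. " +
                       "2) Explain each finding in one sentence. " +
                       "3) End with a one-line takeaway. " +
                       "Text: <paste your document here>"
        }
    ]
};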

FAQs

Q: What types of inputs does Qwen2.5-VL 32B support?
A: It accepts free-form text prompts and image URLs or binary data for analysis and generation tasks.
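
As a hedged sketch of what an image request could look like, the snippet below reuses axios, url, api_key, and the toB64 helper from the request example above. The content-array shape ("type", "text", "image_url" with a base64 data URI) is an assumption borrowed from OpenAI-style chat schemas, not confirmed by this page, so verify the exact field names against the official request schema.

// Hypothetical image request -- field names in the content array are assumptions.
(async function () {
    const b64 = await toB64('./chart.png'); // path to a local image
    const payload = {
        "messages": [
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "What trend does this chart show?" },
                    { "type": "image_url", "image_url": { "url": `data:image/png;base64,${b64}` } }
                ]
            }
        ]
    };
    try {
        const response = await axios.post(url, payload, { headers: { 'x-api-key': api_key } });
        console.log(response.data);
    } catch (error) {
        console.error('Error:', error.response ? error.response.data : error.message);
    }
})();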

Q: How long is the maximum context length?
A: Up to 125,000 tokens, enabling the processing of entire books, code repositories, or lengthy legal contracts.

Q: Can I fine-tune Qwen2.5-VL 32B on my own data?
A: Yes. The model provides a fine-tuning API that tailors responses to your domain, style, or industry vocabulary.

Q: Is serverless deployment available?
A: Absolutely—deploy Qwen2.5-VL via serverless endpoints that handle auto-scaling and reduce operational overhead.

Q: What are common applications for Qwen2.5-VL?
A: Popular use cases include multimodal chatbots, document QA, image captioning, code analysis, and research summarization.