POST
javascript
const axios = require('axios');
const fs = require('fs');
const path = require('path');

// helper to convert local images to base64 (used in the image request sketch further below)
async function toB64(imgPath) {
    const data = fs.readFileSync(path.resolve(imgPath));
    return Buffer.from(data).toString('base64');
}

const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/qwen2p5-vl-32b-instruct";

const data = {
    "messages": [
        {
            "role": "user",
            "content": "tell me a joke on cats"
        },
        {
            "role": "assistant",
            "content": "here is a joke about cats..."
        },
        {
            "role": "user",
            "content": "now a joke on dogs"
        }
    ]
};

(async function () {
    try {
        const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
        console.log(response.data);
    } catch (error) {
        // error.response is undefined for network-level failures, so guard before reading .data
        console.error('Error:', error.response ? error.response.data : error.message);
    }
})();
RESPONSE
application/json
HTTP Response Codes
200 - OK: Request processed successfully
401 - Unauthorized: User authentication failed
404 - Not Found: The requested URL does not exist
405 - Method Not Allowed: The requested HTTP method is not allowed
406 - Not Acceptable: Not enough credits
500 - Server Error: Server had some issue with processing

Attributes


messages (array)

An array of objects, each containing a role and content.


role (str)

Could be "user", "assistant" or "system".


content (str)

A string containing the user's query or the assistant's response.
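
Put together, a minimal request body that exercises all three roles might look like the sketch below; the message text is purely illustrative.

const body = {
    "messages": [
        { "role": "system", "content": "You are a concise, friendly assistant." }, // optional behavior-setting turn
        { "role": "user", "content": "Give me a haiku about autumn." },            // user query
        { "role": "assistant", "content": "Golden leaves drift down..." },         // earlier model reply, passed back for context
        { "role": "user", "content": "Now one about winter." }                     // follow-up query
    ]
};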

To keep track of your credit usage, you can inspect the response headers of each API call. The x-remaining-credits property will indicate the number of remaining credits in your account. Ensure you monitor this value to avoid any disruptions in your API usage.
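
For instance, with the axios example above, the header can be read straight off the response object (axios lower-cases header names). This fragment slots into the try block of the request example; the threshold of 10 is an arbitrary illustration, not an API limit.

const remaining = response.headers['x-remaining-credits'];
console.log(`Credits remaining: ${remaining}`);
// 10 is an illustrative threshold, not an API limit
if (Number(remaining) < 10) {
    console.warn('Low on credits; top up to avoid 406 Not Acceptable errors.');
}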

Qwen2.5-VL 32B Instruct – Multimodal Large Language Model

What is Qwen2.5-VL 32B Instruct?

Qwen2.5-VL 32B Instruct is a state-of-the-art multimodal AI model from the Qwen team at Alibaba Cloud. Built on roughly 33 billion parameters, it processes both text and image inputs and generates text responses, making it ideal for complex instruction-following across modalities. With a large context window of up to 125,000 tokens, Qwen2.5-VL excels at handling long documents, extended conversations, and deep multi-step reasoning. The model supports fine-tuning on domain-specific data and offers serverless deployment for automatic scaling and low-latency inference.

Key Features

  • 33 Billion Parameters: Robust neural architecture for nuanced language and vision understanding.
  • 125,000-Token Context: Best-in-class context length to capture full conversations, legal documents, and codebases.
  • Multimodal Fusion: Joint embedding space for text and images enables tasks like visual question answering and content summarization.
  • Instruction Fine-Tuning: Fine-tuned on instruction datasets to follow user prompts accurately.
  • Serverless Deployment: Instant scaling and simplified API management for production workloads.
  • Versatile Output: Rich text generation, step-by-step explanations, image captioning, and more.

Best Use Cases

  • Advanced Chatbots: Build customer support agents that understand screenshots, scans, and long chat histories.
  • Document Understanding: Summarize reports, extract key facts, and answer questions from PDF or HTML.
  • Visual Question Answering: Analyze diagrams or photos to provide descriptions, insights, and annotations.
  • Multimodal Content Generation: Create interactive tutorials combining text, code snippets, and images.
  • Knowledge Retrieval: Search and reason over enterprise data vaults or research archives.
  • Instructional AI: Develop tutoring systems that accept textbook excerpts and illustrations.

Prompt Tips and Output Quality

  1. Be Explicit: Start with “Analyze this image…” or “Summarize the following text…” to guide the model’s objective.
  2. Leverage Context: Provide longer context windows when working with large documents or multi-turn dialogues.
  3. Image Clarity: Use high-resolution, well-lit images for accurate visual reasoning.
  4. Step-by-Step Instructions: Break complex tasks into numbered steps in your prompt (see the sketch after this list).
  5. Iterate and Refine: Review outputs, adjust prompt phrasing, and re-submit to improve response quality.
  6. Combine Modalities: Pair text instructions with relevant images to unlock richer, multimodal insights.
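
Putting tips 1 and 4 into practice, a request that states the objective up front and enumerates the steps might look like this; the wording is purely illustrative.

const promptData = {
    "messages": [
        {
            "role": "user",
            "content": "Summarize the following text for a non-technical reader. " +
                       "1) List the three main findings. " +
                       "2) Explain each finding in one sentence. " +
                       "3) End with a one-line takeaway. " +
                       "Text: <paste your document here>"
        }
    ]
};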

FAQs

Q: What types of inputs does Qwen2.5-VL 32B support?
A: It accepts free-form text prompts and image URLs or binary data for analysis and generation tasks.
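
As a hedged sketch of what an image request could look like, the snippet below reuses axios, url, api_key, and the toB64 helper from the request example above. The content-array shape ("type", "text", "image_url" with a base64 data URI) is an assumption borrowed from OpenAI-style chat schemas, not confirmed by this page, so verify the exact field names against the official request schema.

// Hypothetical image request -- field names in the content array are assumptions.
(async function () {
    const b64 = await toB64('./chart.png'); // path to a local image
    const payload = {
        "messages": [
            {
                "role": "user",
                "content": [
                    { "type": "text", "text": "What trend does this chart show?" },
                    { "type": "image_url", "image_url": { "url": `data:image/png;base64,${b64}` } }
                ]
            }
        ]
    };
    try {
        const response = await axios.post(url, payload, { headers: { 'x-api-key': api_key } });
        console.log(response.data);
    } catch (error) {
        console.error('Error:', error.response ? error.response.data : error.message);
    }
})();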

Q: How long is the maximum context length?
A: Up to 125,000 tokens, enabling the processing of entire books, code repositories, or lengthy legal contracts.

Q: Can I fine-tune Qwen2.5-VL 32B on my own data?
A: Yes. The model provides a fine-tuning API that tailors responses to your domain, style, or industry vocabulary.

Q: Is serverless deployment available?
A: Absolutely—deploy Qwen2.5-VL via serverless endpoints that handle auto-scaling and reduce operational overhead.

Q: What are common applications for Qwen2.5-VL?
A: Popular use cases include multimodal chatbots, document QA, image captioning, code analysis, and research summarization.