const axios = require('axios');
const fs = require('fs');
const path = require('path');

// Helper to convert a local image into a base64-encoded string
// (used when sending images to the model; not needed for text-only requests).
async function toB64(imgPath) {
  const data = fs.readFileSync(path.resolve(imgPath));
  return Buffer.from(data).toString('base64');
}

const api_key = "YOUR API-KEY";
const url = "https://api.segmind.com/v1/llama-v3p2-11b-vision-instruct";

// A multi-turn, text-only conversation: alternating user and assistant messages.
const data = {
  "messages": [
    {
      "role": "user",
      "content": "tell me a joke on cats"
    },
    {
      "role": "assistant",
      "content": "here is a joke about cats..."
    },
    {
      "role": "user",
      "content": "now a joke on dogs"
    }
  ]
};

(async function() {
  try {
    const response = await axios.post(url, data, { headers: { 'x-api-key': api_key } });
    console.log(response.data);
  } catch (error) {
    // error.response is undefined for network-level failures, so fall back to the message.
    console.error('Error:', error.response ? error.response.data : error.message);
  }
})();
messages: An array of message objects, each containing a role and content.
role: Can be "user", "assistant", or "system".
content: A string containing the user's query or the assistant's response.
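The model also accepts images alongside text. The snippet below is a minimal sketch of such a request, reusing the toB64 helper, url, and api_key from the sample above. It assumes an OpenAI-style content array in which the image is passed as a base64 data URL; field names such as "image_url" are assumptions, so verify the exact schema against the Segmind API reference.

// Sketch of an image + text request (assumed schema; field names like
// "image_url" should be checked against the Segmind API reference).
(async function() {
  const imageB64 = await toB64('./my-image.jpg'); // hypothetical local file
  const visionData = {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant that describes images."
      },
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "What is shown in this image?" },
          { "type": "image_url", "image_url": { "url": `data:image/jpeg;base64,${imageB64}` } }
        ]
      }
    ]
  };
  try {
    const response = await axios.post(url, visionData, { headers: { 'x-api-key': api_key } });
    console.log(response.data);
  } catch (error) {
    console.error('Error:', error.response ? error.response.data : error.message);
  }
})();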
To keep track of your credit usage, you can inspect the response headers of each API call. The x-remaining-credits property will indicate the number of remaining credits in your account. Ensure you monitor this value to avoid any disruptions in your API usage.
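With axios, the header can be read directly from the response object returned by the call above (axios lower-cases response header names):

// Log remaining credits after each call.
const remaining = response.headers['x-remaining-credits'];
console.log('Remaining credits:', remaining);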
The Llama 3.2-11B Vision-Instruct model is a multimodal large language model (LLM) that processes both text and images to generate text. It is part of Meta's Llama 3.2 family and is intended for commercial and research applications. With 11 billion parameters, it is optimized for visual recognition, image reasoning, captioning, and answering questions about images. The model builds on the Llama 3.1 text-only model and adds a vision adapter for image processing, and it was aligned using supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF).
Multimodal Input: Takes both text and images as input.
Output: Generates text as output.
Image Reasoning Tasks: Supports Visual Question Answering (VQA), Document Visual Question Answering (DocVQA), Image Captioning, Image-Text Retrieval, and Visual Grounding.
Language Support: For text-only tasks, it officially supports English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. For image+text, only English is supported.
Context Length: 128K tokens.
Architecture: Uses an auto-regressive language model with an optimized transformer architecture.
Vision Adapter: Employs a separately trained vision adapter consisting of cross-attention layers to integrate image encoder representations into the core LLM.
Training Data: Trained on 6 billion (image, text) pairs, with a data cutoff of December 2023. Instruction tuning data includes public vision instruction datasets and over 3 million synthetically generated examples.
Inference Scalability: Uses Grouped-Query Attention (GQA).
Commercial and research use
Visual recognition, image reasoning, captioning, and assistant-like chat with images.
Adaptable for a variety of image reasoning tasks.
Leveraging model outputs to improve other models.