Last Updated: 11 Aug 2025
Qwen-Image was developed and trained by Alibaba as part of the broader Qwen series of models. It was officially released in early August 2025. Earlier Qwen models were primarily LLMs that accepted text and images as input and produced text as output. With Qwen-Image, the series can now generate images as well. The new model excels at complex text rendering. The Qwen image-to-image version (coming soon) is aimed at precise image editing tasks.
Qwen-Image is a **20-billion-parameter** model built on the Multimodal Diffusion Transformer (MMDiT) architecture. The architecture consists of three key components working in tandem.
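To put the 20-billion parameter figure in practical terms, here is a rough back-of-the-envelope sketch of how much memory the weights alone would occupy at different numeric precisions. The bytes-per-parameter values and the `weight_memory_gb` helper are illustrative assumptions, not figures from the Qwen team, and the estimate ignores activations and any components not counted in the 20 billion.

```python
# Rough weight-memory estimate for a 20-billion-parameter model.
PARAMS = 20_000_000_000  # parameter count stated in the article

# Bytes per parameter for common storage formats (assumed for illustration).
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(params: int, bytes_per_param: float) -> float:
    """Return weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    for fmt, size in BYTES_PER_PARAM.items():
        print(f"{fmt}: ~{weight_memory_gb(PARAMS, size):.0f} GB")
```

At 16-bit precision the weights alone come to roughly 40 GB, which is why lower-precision formats are a common way to fit models of this size on consumer hardware.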
This model leverages a comprehensive data pipeline, progressive training strategies, and enhanced multi-task learning to achieve state-of-the-art results across multiple benchmarks.
Advanced Text Rendering: The model can render both English letters and Chinese characters within the image. It handles everything from single words to multi-line layouts and even paragraph-level text. Prior to this model, only GPT-Image from OpenAI was capable of text rendering with this precision.
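A minimal sketch of how one might prompt for rendered text follows. It assumes the model is loaded through the Hugging Face `diffusers` library under the repository id `Qwen/Qwen-Image`. Both that id and the quoting convention in the hypothetical `build_text_prompt` helper are assumptions for illustration, not a documented recipe.

```python
def build_text_prompt(scene: str, lines: list[str]) -> str:
    """Compose a prompt that names each string to render, wrapped in quotes."""
    quoted = ", ".join(f'"{line}"' for line in lines)
    return f"{scene}, with the following text rendered clearly: {quoted}"


def generate(prompt: str, out_path: str = "out.png") -> None:
    """Generate one image; needs a CUDA GPU plus the torch and diffusers packages."""
    # Imported lazily so the prompt helper works without these packages installed.
    import torch
    from diffusers import DiffusionPipeline

    # "Qwen/Qwen-Image" is the assumed Hugging Face repository id.
    pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)
    pipe = pipe.to("cuda")
    image = pipe(prompt=prompt, num_inference_steps=50).images[0]
    image.save(out_path)


# Mixed English and Chinese text, matching the two scripts the article mentions.
prompt = build_text_prompt("A coffee shop storefront", ["Open Daily", "欢迎光临"])
```

Calling `generate(prompt)` would then write the result to `out.png`. The step count is just a starting point to tune.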
Qwen-Image combines strong general image generation with highly precise text rendering in English and Chinese. Its prompt adherence is also state-of-the-art. As of this writing, it is the leading open-source multimodal foundation model, bridging artistic flexibility, textual accuracy, and robust editing capabilities.