Unveiling Qwen3-ASR: A Robust, Production-Ready Open Source Speech Family

By the Technical Team

In the rapidly evolving landscape of audio intelligence, the gap between academic research models and production-ready tools is often significant. Real-world audio is rarely clean; it is filled with interruptions, background noise, diverse accents, and overlapping speech.

Today, we are pleased to announce the open-source release of Qwen3-ASR and Qwen3-ForcedAligner. These models have been engineered not just for benchmarks, but for the messy reality of practical application. Designed to deliver competitive performance and exceptional robustness, this family of models aims to provide developers and researchers with a powerful toolkit for next-generation speech recognition tasks.

Breaking Down Language Barriers

One of the primary challenges in global speech recognition is the nuance of local dialects. Standard models often perform well on "broadcast quality" speech but struggle with regional variations.

Qwen3-ASR addresses this by supporting a total of 52 languages and dialects. The system features automatic Language Identification (LID), seamlessly distinguishing between:

  • 30 core languages, ensuring broad global coverage.
  • 22 specific dialects and accents, capturing the subtleties of regional speech patterns that general models often miss.

This capability makes Qwen3-ASR a versatile choice for applications requiring multi-lingual support without the need for manual language switching.
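To illustrate how automatic LID is typically consumed downstream, here is a minimal sketch. Note that the score dictionary and the "und" fallback are hypothetical illustrations, not the actual Qwen3-ASR output schema; the model performs this selection internally.

```python
# Hypothetical post-processing of LID scores. Qwen3-ASR selects the
# language automatically; this sketch only illustrates the idea of
# argmax-with-fallback language identification.

def pick_language(scores: dict[str, float], threshold: float = 0.5) -> str:
    """Return the highest-scoring language tag, falling back to 'und'
    (undetermined) when no candidate clears the confidence threshold."""
    tag, score = max(scores.items(), key=lambda kv: kv[1])
    return tag if score >= threshold else "und"

# Made-up scores for three of the supported languages/dialects.
print(pick_language({"en": 0.91, "zh": 0.06, "yue": 0.03}))  # → en
print(pick_language({"en": 0.40, "zh": 0.35, "yue": 0.25}))  # → und
```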

Engineered for "Messy" Audio and Music

Ideally, Automatic Speech Recognition (ASR) input would always be recorded in a soundproof studio. In reality, users record audio in cafes, on windy streets, or while music is playing.

We focused heavily on robustness during the training of Qwen3-ASR. The model demonstrates stability in complex acoustic environments where noise floors are high. Notably, this robustness extends to the domain of music. Qwen3-ASR is capable of transcribing singing voices and songs, maintaining accuracy even when vocals are intertwined with instrumental backing tracks. This opens up exciting possibilities for lyric transcription and media analysis workflows.

Long-Form Audio Support

Context is key in transcription. Many ASR systems require long audio files to be chunked into very short segments (often 30 seconds) to function, which can lead to disjointed outputs and context loss across segment boundaries.

Qwen3-ASR supports long audio inputs, capable of processing up to 20 minutes of audio per pass. This extended context window allows the model to maintain better coherence over long conversations, meetings, or speeches, reducing the complexity required in pre-processing pipelines.
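For inputs that exceed the 20-minute window, a pre-processing pipeline still needs a chunking step, but a far simpler one than with 30-second models. A minimal sketch, assuming the audio duration is known in seconds; the overlap value is an illustrative choice for preserving context at seams, not a recommendation from the model card:

```python
def chunk_spans(duration_s: float, max_chunk_s: float = 20 * 60,
                overlap_s: float = 10.0) -> list[tuple[float, float]]:
    """Split an audio duration into (start, end) spans no longer than
    max_chunk_s, with consecutive spans overlapping by overlap_s."""
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # re-read a little context across the seam
    return spans

# A 45-minute file fits in three passes instead of ~90 thirty-second chunks.
print(chunk_spans(45 * 60))
```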

Next-Generation Forced Alignment

Alongside the core ASR model, we are introducing Qwen3-ForcedAligner.

For applications requiring precise synchronization between text and audio—such as subtitle generation, karaoke apps, or phonological research—standard ASR outputs are often insufficient. Qwen3-ForcedAligner provides word- and phrase-level timestamps with high precision across 11 supported languages.

Technical evaluations indicate that this model offers stronger alignment performance compared to traditional methods like MFA (Montreal Forced Aligner), CTC-based alignment, or CIF-style aligners. It represents a significant step forward for developers who need exact temporal data alongside their transcripts.
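A common consumer of forced-alignment output is subtitle generation. The sketch below converts word-level timestamps into SRT cues; the input format, a list of (word, start, end) tuples in seconds, is a hypothetical stand-in for whatever schema Qwen3-ForcedAligner actually emits:

```python
def srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(t * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def words_to_srt(words: list[tuple[str, float, float]],
                 max_words: int = 7) -> str:
    """Group word-level alignments into SRT cues of up to max_words words."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start, end = group[0][1], group[-1][2]
        text = " ".join(w for w, _, _ in group)
        cues.append(f"{len(cues) + 1}\n"
                    f"{srt_time(start)} --> {srt_time(end)}\n{text}")
    return "\n\n".join(cues)

# Hypothetical alignment for a short clip.
print(words_to_srt([("hello", 0.32, 0.61), ("world", 0.70, 1.05)]))
```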

A Complete Inference Stack

Releasing model weights is only half the battle; providing a usable software stack is equally important. We are releasing a full open-source inference and fine-tuning stack designed for modern deployment needs.

Key features of the toolkit include:

  • vLLM Integration: Leveraging vLLM for high-throughput batch inference.
  • Streaming Support: Enabling real-time transcription applications.
  • Async Serving: Facilitating scalable backend architectures.

This comprehensive stack ensures that Qwen3-ASR can be integrated into production environments with minimal friction.
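As a sketch of what async serving with batching looks like in practice, the snippet below implements a generic asyncio micro-batching pattern: requests are queued, and a worker drains them in batches for a single high-throughput inference call. This is not the actual Qwen3-ASR serving code; the dummy `infer_batch` callable stands in for a real batched ASR call (e.g. via vLLM).

```python
import asyncio

async def batch_worker(queue: asyncio.Queue, infer_batch, max_batch: int = 8):
    """Drain queued (audio, future) pairs and run inference in batches."""
    while True:
        items = [await queue.get()]
        while len(items) < max_batch and not queue.empty():
            items.append(queue.get_nowait())
        texts = infer_batch([audio for audio, _ in items])
        for (_, fut), text in zip(items, texts):
            fut.set_result(text)

async def transcribe(queue: asyncio.Queue, audio) -> str:
    """Client-side helper: enqueue one request and await its result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((audio, fut))
    return await fut

async def main():
    queue = asyncio.Queue()
    # Dummy model: "transcribes" each clip by echoing its name.
    worker = asyncio.create_task(
        batch_worker(queue, lambda batch: [f"transcript of {a}" for a in batch]))
    results = await asyncio.gather(
        *(transcribe(queue, f"clip{i}") for i in range(3)))
    worker.cancel()
    print(results)
    return results

asyncio.run(main())
```

The same pattern generalizes to streaming: instead of resolving one future per request, the worker would push partial hypotheses onto per-request queues as audio frames arrive.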

Getting Started

We invite the community to explore, evaluate, and build upon the Qwen3-ASR family. The models and code are available now across major platforms.

Code and Documentation:

Model Downloads:

Interactive Demos:

We look forward to seeing how the community utilizes these tools to advance the field of speech processing.