Qwen3-ASR: Free Online AI Speech-to-Text & Forced Alignment Tool
An all-in-one platform for Tongyi Qwen3-ASR. Access the 1.7B and 0.6B models for fast multilingual transcription, dialect recognition, and word-level timestamp alignment powered by vLLM.
Massive Multilingual Support (52 Languages and Dialects)
Break language barriers with the Tongyi Qwen3-ASR family. The models support automatic language identification and speech recognition across 52 languages and dialects, including English, Japanese, Korean, and French, along with 22 Chinese dialects (Cantonese, Sichuanese, etc.). They deliver robust performance even with mixed-language audio and complex acoustic environments.
Precise Word-Level Forced Alignment
Achieve professional-grade synchronization with Qwen3-ForcedAligner-0.6B. Unlike a traditional ASR model, which produces text from audio, this specialized model takes a transcript and its audio and aligns the two, providing word- and character-level timestamps for 11 major languages. It is well suited to generating subtitles and karaoke lyrics and to analyzing speech data with high temporal precision.
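As an illustration of how word-level timestamps turn into subtitles, here is a minimal Python sketch that groups timed words into SRT cues. The `Word` record, the `words_to_srt` helper, and the `max_words` and `max_gap` thresholds are assumptions made for this example; they are not part of the Qwen3-ASR or Qwen3-ForcedAligner API, whose actual output format may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical record for one aligned word (times in seconds)."""
    text: str
    start: float
    end: float


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Word], max_words: int = 7, max_gap: float = 0.6) -> str:
    """Group words into cues, starting a new cue when the current one
    is full or when the pause before the next word exceeds max_gap."""
    cues: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        too_long = len(current) >= max_words
        long_pause = bool(current) and word.start - current[-1].end > max_gap
        if current and (too_long or long_pause):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w.text for w in cue)
        blocks.append(f"{index}\n{srt_time(cue[0].start)} --> {srt_time(cue[-1].end)}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [Word("Hello", 0.0, 0.4), Word("world", 0.45, 0.9), Word("again", 2.0, 2.5)]
    print(words_to_srt(demo))
```

The pause-based split keeps a cue from spanning a long silence; for languages written without spaces, the join step would need to concatenate characters instead of inserting spaces.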
Real-Time Streaming Inference
Experience low-latency transcription designed for real-world applications. Leveraging the vLLM backend, Qwen3-ASR supports unified offline and streaming inference. Whether you are processing live meeting audio or building a voice assistant, the model delivers immediate text output with high throughput, ensuring a seamless user experience.
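To give a sense of what streaming input looks like on the client side, the sketch below cuts raw audio into fixed-duration chunks that could be sent to a streaming endpoint as they arrive. It assumes 16 kHz, 16-bit mono PCM and a 500 ms chunk size purely for illustration; the `pcm_chunks` helper is hypothetical, and the actual Qwen3-ASR streaming interface and its preferred chunk size are not shown here.

```python
from typing import Iterator


def pcm_chunks(
    pcm: bytes,
    sample_rate: int = 16_000,  # assumed sample rate (Hz)
    chunk_ms: int = 500,        # assumed chunk duration (milliseconds)
    sample_width: int = 2,      # bytes per sample (16-bit PCM, mono)
) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices of a raw PCM buffer.
    The final chunk may be shorter than the others."""
    chunk_bytes = sample_rate * chunk_ms // 1000 * sample_width
    if chunk_bytes <= 0:
        raise ValueError("chunk size must be at least one sample")
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


if __name__ == "__main__":
    # 1.2 seconds of silence stands in for microphone input.
    silence = bytes(int(16_000 * 1.2) * 2)
    for i, chunk in enumerate(pcm_chunks(silence)):
        # A real client would send each chunk to the streaming backend here.
        print(f"chunk {i}: {len(chunk)} bytes")
```

Smaller chunks lower the delay before the first text appears but increase the number of requests, so the chunk duration is a latency versus overhead trade-off.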
Robust Handling of Accents & Dialects
Tongyi Qwen3-ASR sets a new standard for dialect robustness. It is trained to recognize diverse English accents and specific Chinese regional dialects (such as Wu, Minnan, and Dongbei) that often challenge other models. This ensures that speakers from different regions are understood accurately without needing fine-tuning.
State-of-the-Art Performance & Efficiency
The two model sizes balance speed and accuracy. The Qwen3-ASR-1.7B model achieves top-tier results on OpenASR benchmarks, rivaling proprietary commercial APIs. Meanwhile, the lightweight 0.6B version prioritizes efficiency (capable of 2000x throughput at high concurrency), making it accessible for local deployment on consumer hardware.
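One way to read a throughput figure such as 2000x is as seconds of audio transcribed per second of wall-clock time, summed across all concurrent requests. That interpretation is an assumption made for this sketch, not a definition taken from the benchmark, and the helper names below are hypothetical.

```python
def throughput_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def wall_time_for(audio_seconds: float, factor: float) -> float:
    """Wall-clock seconds needed to process audio at a given throughput factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


if __name__ == "__main__":
    # Under the assumed reading, one hour of audio at a 2000x factor:
    print(wall_time_for(3600, 2000), "seconds")
```

Under this reading, the factor describes aggregate batch throughput, so the latency of any single request can be much higher than the figure suggests.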
Video Subtitling & Captioning
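Automatically transcribe videos in over 50 languages and dialects, then use the Forced Aligner to sync the text with the audio track and produce precisely timed subtitles. Word-level timestamps are available for the 11 languages the aligner supports.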
Global Meeting Transcription
Transcribe international business meetings with mixed languages. Identify speakers' languages automatically and produce accurate meeting minutes.
Dialect Analysis & Research
A valuable tool for linguists and researchers working with specific Chinese dialects or regional English accents that standard ASR tools often do not support.
Voice Assistants & Chatbots
Integrate the streaming capability to power responsive voice interfaces that understand user commands instantly with low latency.
Karaoke & Music Lyrics
Utilize the timestamp prediction to align lyrics with songs (supports singing voice), creating synchronized karaoke experiences.
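For karaoke-style output, aligned lyric lines are commonly written in the LRC format, where each line starts with a `[mm:ss.xx]` time tag. The sketch below formats line start times that way; the input shape (a list of `(start_seconds, text)` pairs) and the helper names are assumptions for illustration, not the aligner's actual output.

```python
from typing import List, Tuple


def lrc_tag(seconds: float) -> str:
    """Format seconds as an LRC time tag: [mm:ss.xx] (hundredths of a second)."""
    centis = round(seconds * 100)
    minutes, centis = divmod(centis, 6000)
    secs, centis = divmod(centis, 100)
    return f"[{minutes:02d}:{secs:02d}.{centis:02d}]"


def lines_to_lrc(lines: List[Tuple[float, str]]) -> str:
    """Render (start_seconds, lyric_text) pairs as LRC, ordered by start time."""
    ordered = sorted(lines, key=lambda item: item[0])
    return "\n".join(f"{lrc_tag(start)}{text}" for start, text in ordered)


if __name__ == "__main__":
    lyrics = [(75.5, "Second line"), (12.34, "First line")]
    print(lines_to_lrc(lyrics))
```

Taking the start time of the first word in each lyric line as that line's tag is the simplest mapping from word-level timestamps to LRC.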
Accessibility Services
Provide real-time captions for deaf and hard-of-hearing users, ensuring digital content is accessible across different languages and accents.
