Qwen3 TTS Install Guide and Model Overview
In the rapidly evolving landscape of speech synthesis, finding a model that balances speed, emotional expressiveness, and multilingual support can be a challenge. Today, we are exploring Qwen3-TTS, a robust speech generation model that has recently caught the attention of the developer community.
This post will serve as a detailed guide on what makes this model special, followed by a step-by-step Qwen3 TTS install and usage tutorial. Whether you are looking to integrate real-time voice synthesis into an application or simply experiment with voice cloning, this guide covers the essentials.
What is Qwen3-TTS?
Qwen3-TTS is a versatile text-to-speech framework designed to meet global application needs. It supports 10 major languages, including English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian, along with various dialectal profiles.
Beyond simple translation, the model excels at contextual understanding. It can adapt its tone, speaking rate, and emotional expression based on the semantic meaning of the text and specific user instructions.
Key Technical Features
- Universal End-to-End Architecture: Unlike traditional pipelines that rely on a Language Model (LM) followed by a Diffusion Transformer (DiT), Qwen3-TTS utilizes a discrete multi-codebook LM architecture. This allows it to bypass common information bottlenecks and cascading errors, resulting in higher generation efficiency.
- Ultra-Low Latency: For developers building real-time interactions, this is a standout feature. The model uses a "Dual-Track" hybrid streaming architecture. It can output audio immediately after the first character is processed, achieving end-to-end latency as low as 97ms.
- Powerful Speech Representation: Powered by the custom Qwen3-TTS-Tokenizer-12Hz, the model achieves high-fidelity speech reconstruction. It effectively captures paralinguistic information (like breath and tone) and environmental acoustics.
- Intelligent Control: The model supports natural language instructions. You can explicitly tell the model to sound "angry," "whispering," or "joyful," and it will adjust the prosody accordingly.
Choosing the Right Model
Before we jump into the installation, it is helpful to know which version of the model suits your needs. The weights are generally available in 0.6B (lightweight) and 1.7B (higher quality) sizes.
- CustomVoice Models: These come with 9 premium, pre-set timbres (speakers) covering various genders and ages. They allow for style control via instructions.
- VoiceDesign Models: These allow you to "design" a voice from scratch using a natural language description (e.g., "A deep, rasping male voice").
- Base Models: These are designed for Zero-Shot Voice Cloning. By providing a 3-second audio clip, the model can clone the voice. This version is also suitable for fine-tuning.
Qwen3 TTS Install Guide
Getting Qwen3-TTS up and running is straightforward, thanks to the provided Python package. We recommend using a Linux environment with NVIDIA GPUs for optimal performance.
Step 1: Environment Setup
To ensure a clean installation without dependency conflicts, we highly recommend creating a fresh Conda environment using Python 3.12.
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts
Step 2: Install the Package
The easiest way to install the necessary libraries and runtime dependencies is via PyPI:
pip install -U qwen-tts
If you prefer to work with the source code directly (useful for development or debugging), you can clone the repository:
git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS
pip install -e .
Step 3: Optimization (Recommended)
To reduce GPU memory usage and improve inference speed, installing FlashAttention 2 is strongly advised.
pip install -U flash-attn --no-build-isolation
Note for Lower-Resource Machines: If your system has limited RAM (less than 96GB) but a high core count, the compilation process for FlashAttention might crash. You can limit the number of parallel jobs to prevent this:
MAX_JOBS=4 pip install -U flash-attn --no-build-isolation
Reminder: FlashAttention 2 requires the model to be loaded in torch.float16 or torch.bfloat16.
Downloading Model Weights
The qwen-tts package handles model downloads automatically from Hugging Face when you first run the code. However, if you are in an environment with restricted internet access or wish to manage weights manually, you can download them beforehand.
Using ModelScope or the Hugging Face CLI:
# Download through ModelScope (recommended for users in Mainland China)
pip install -U modelscope
modelscope download --model Qwen/Qwen3-TTS-Tokenizer-12Hz --local_dir ./Qwen3-TTS-Tokenizer-12Hz
modelscope download --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --local_dir ./Qwen3-TTS-12Hz-1.7B-CustomVoice
modelscope download --model Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --local_dir ./Qwen3-TTS-12Hz-1.7B-VoiceDesign
modelscope download --model Qwen/Qwen3-TTS-12Hz-1.7B-Base --local_dir ./Qwen3-TTS-12Hz-1.7B-Base
modelscope download --model Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice --local_dir ./Qwen3-TTS-12Hz-0.6B-CustomVoice
modelscope download --model Qwen/Qwen3-TTS-12Hz-0.6B-Base --local_dir ./Qwen3-TTS-12Hz-0.6B-Base
# Download through Hugging Face
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-TTS-Tokenizer-12Hz --local-dir ./Qwen3-TTS-Tokenizer-12Hz
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --local-dir ./Qwen3-TTS-12Hz-1.7B-CustomVoice
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --local-dir ./Qwen3-TTS-12Hz-1.7B-VoiceDesign
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base --local-dir ./Qwen3-TTS-12Hz-1.7B-Base
huggingface-cli download Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice --local-dir ./Qwen3-TTS-12Hz-0.6B-CustomVoice
huggingface-cli download Qwen/Qwen3-TTS-12Hz-0.6B-Base --local-dir ./Qwen3-TTS-12Hz-0.6B-Base
You only need to download the tokenizer plus the variants relevant to your use case (for example, Qwen3-TTS-12Hz-1.7B-VoiceDesign or Qwen3-TTS-12Hz-0.6B-CustomVoice); the rest can be skipped.
Using Qwen3-TTS in Python
Once installed, usage is intuitive. Below are examples for the three main use cases.
1. Generating with Custom Voice (Preset Speakers)
This is the simplest method. You select a pre-defined speaker (such as "Ryan" for English or "Vivian" for Chinese) and provide text.
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel
# Load the model
model = Qwen3TTSModel.from_pretrained(
"Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
device_map="cuda:0",
dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
# Generate audio: single inference
wavs, sr = model.generate_custom_voice(
text="I noticed that I am particularly good at observing other people's emotions.",
language="English",
speaker="Ryan", # Supported speakers include Vivian, Serena, Ryan, Aiden, etc.
instruct="Speak in a very calm and analytical tone."
)
# Save the output
sf.write("output_custom.wav", wavs[0], sr)
# batch inference
wavs, sr = model.generate_custom_voice(
text=[
"In fact, I have really discovered that I am a person who is particularly good at observing others' emotions.",
"She said she would be here by noon."
],
language=["Chinese", "English"],
speaker=["Vivian", "Ryan"],
instruct=["", "Very happy."]
)
sf.write("output_custom_voice_1.wav", wavs[0], sr)
sf.write("output_custom_voice_2.wav", wavs[1], sr)
2. Voice Design (Text-to-Voice Creation)
If you don't want to use a preset speaker, you can describe the voice you want.
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel
model = Qwen3TTSModel.from_pretrained(
"Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
device_map="cuda:0",
dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
# single inference
wavs, sr = model.generate_voice_design(
text="Brother, you're back! I've been waiting for you for so, so long! Give me a hug!",
language="Chinese",
instruct="This is a sweet and childish loli-like female voice, with a high pitch and noticeable fluctuations, creating a clingy, affected, and deliberately cute auditory effect.",
)
sf.write("output_voice_design.wav", wavs[0], sr)
# batch inference
wavs, sr = model.generate_voice_design(
text=[
"Brother, you're back! I've been waiting for you for so, so long! Give me a hug!",
"It's in the top drawer... wait, it's empty? No way, that's impossible! I'm sure I put it there!"
],
language=["Chinese", "English"],
instruct=[
"This is a sweet and childish loli-like female voice, with a high pitch and noticeable fluctuations, creating a clingy, affected, and deliberately cute auditory effect.",
"Speak in an incredulous tone, but with a hint of panic beginning to creep into your voice."
]
)
sf.write("output_voice_design_1.wav", wavs[0], sr)
sf.write("output_voice_design_2.wav", wavs[1], sr)
3. Voice Cloning (Base Model)
This is one of the most powerful features. You provide a reference audio file (ref_audio) and its transcript (ref_text), and the model clones the voice to speak new text.
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel
model = Qwen3TTSModel.from_pretrained(
"Qwen/Qwen3-TTS-12Hz-1.7B-Base",
device_map="cuda:0",
dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
# Define reference audio (can be a URL or local path)
ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."
# Generate new speech
wavs, sr = model.generate_voice_clone(
text="The calculations are complete, and the results are fascinating!",
language="English",
ref_audio=ref_audio,
ref_text=ref_text,
)
sf.write("output_clone.wav", wavs[0], sr)
Pro Tip: If you plan to generate many sentences with the same cloned voice, you should use model.create_voice_clone_prompt first. This extracts the voice features once and reuses them, saving significant computation time.
Running the Local Web UI Demo
For users who prefer a graphical interface over code, Qwen3-TTS includes a built-in demo based on Gradio.
To launch the demo for any of the model variants:
# CustomVoice model
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 0.0.0.0 --port 8000
# VoiceDesign model
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --ip 0.0.0.0 --port 8000
# Base model
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000
Important: HTTPS for Base Model
If you are running the Base (Voice Clone) model, browsers often block microphone access on non-secure (HTTP) connections. If you are accessing the server remotely, you must enable HTTPS.
- Generate a self-signed certificate:
openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem -days 365 -nodes -subj "/CN=localhost"
- Run the demo with SSL:
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --ip 0.0.0.0 --port 8000 \
  --ssl-certfile cert.pem \
  --ssl-keyfile key.pem \
  --no-ssl-verify
You can then access the interface at https://<your-ip>:8000.
Alternative: vLLM Support
It is also worth noting that vLLM (specifically the vLLM-Omni project) provides Day-0 support for Qwen3-TTS. This is excellent for production deployment as it offers optimized inference speeds. Currently, vLLM supports offline inference, with online serving and streaming capabilities planned for future updates.
Conclusion
Qwen3-TTS represents a significant step forward in open-weight speech synthesis. Its combination of low latency, instructional control, and high-fidelity cloning makes it a compelling choice for developers. Whether you are building an interactive AI agent or simply exploring audio generation, the installation process is accessible and the results are impressive.
We hope this Qwen3 TTS install guide helps you get started. Happy coding!
