Microsoft's newly open-sourced VibeVoice

前沿情报 · Yesterday 13:21

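Microsoft released its latest open-source TTS model, VibeVoice, yesterday.
I got off work early today, so I set up an environment on a test machine and am sharing it for anyone on L站 who needs it.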

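1. Hardware: i7-13700KF, 128 GB DDR5, NVIDIA RTX 3090 (24 GB VRAM)
2. OS: Ubuntu 22.04.5 LTS
3. Inference environment: Python 3.10.18
4. Model: WestZhang/VibeVoice-Large-pt · Hugging Face
This is the 7B "large" version. If you have 8 GB of VRAM you can run the 1.5B "medium" version, and a 0.5B "small" version is still awaiting release.
5. Directory structure: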

(vibevoice) root@dev-3090:/workspace/VibeVoice# tree
.
├── demo
│   ├── example
│   │   ├── 1p_EN2CH.mp4
│   │   ├── 2p_see_u_again.mp4
│   │   └── 4p_climate_45min.mp4
│   ├── gradio_demo.py
│   ├── inference_from_file.py
│   ├── text_examples
│   │   ├── 1p_abs.txt
│   │   ├── 1p_Ch2EN.txt
│   │   ├── 2p_goat.txt
│   │   ├── 2p_music.txt
│   │   ├── 2p_short.txt
│   │   ├── 2p_yayi.txt
│   │   ├── 3p_gpt5.txt
│   │   ├── 4p_climate_100min.txt
│   │   └── 4p_climate_45min.txt
│   └── voices
│       ├── en-Alice_woman.wav
│       ├── en-Carter_man.wav
│       ├── en-Frank_man.wav
│       ├── en-Mary_woman_bgm.wav
│       ├── en-Maya_woman.wav
│       ├── in-Samuel_man.wav
│       ├── zh-Anchen_man_bgm.wav
│       ├── zh-Bowen_man.wav
│       └── zh-Xinran_woman.wav
├── Figures
│   ├── Google_AI_Studio_2025-08-25T21_48_13.452Z.png
│   ├── MOS-preference.png
│   └── VibeVoice.jpg
├── flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
├── LICENSE
├── pyproject.toml
├── README.md
├── SECURITY.md
├── vibevoice
│   ├── configs
│   │   ├── qwen2.5_1.5b_64k.json
│   │   └── qwen2.5_7b_32k.json
│   ├── __init__.py
│   ├── modular
│   │   ├── configuration_vibevoice.py
│   │   ├── __init__.py
│   │   ├── modeling_vibevoice_inference.py
│   │   ├── modeling_vibevoice.py
│   │   ├── modular_vibevoice_diffusion_head.py
│   │   ├── modular_vibevoice_text_tokenizer.py
│   │   ├── modular_vibevoice_tokenizer.py
│   │   ├── __pycache__
│   │   │   ├── configuration_vibevoice.cpython-310.pyc
│   │   │   ├── __init__.cpython-310.pyc
│   │   │   ├── modeling_vibevoice.cpython-310.pyc
│   │   │   ├── modeling_vibevoice_inference.cpython-310.pyc
│   │   │   ├── modular_vibevoice_diffusion_head.cpython-310.pyc
│   │   │   ├── modular_vibevoice_text_tokenizer.cpython-310.pyc
│   │   │   ├── modular_vibevoice_tokenizer.cpython-310.pyc
│   │   │   └── streamer.cpython-310.pyc
│   │   └── streamer.py
│   ├── processor
│   │   ├── __init__.py
│   │   ├── __pycache__
│   │   │   ├── __init__.cpython-310.pyc
│   │   │   ├── vibevoice_processor.cpython-310.pyc
│   │   │   └── vibevoice_tokenizer_processor.cpython-310.pyc
│   │   ├── vibevoice_processor.py
│   │   └── vibevoice_tokenizer_processor.py
│   ├── __pycache__
│   │   └── __init__.cpython-310.pyc
│   ├── schedule
│   │   ├── dpm_solver.py
│   │   ├── __init__.py
│   │   ├── __pycache__
│   │   │   ├── dpm_solver.cpython-310.pyc
│   │   │   └── __init__.cpython-310.pyc
│   │   └── timestep_sampler.py
│   └── scripts
│       ├── convert_nnscaler_checkpoint_to_transformers.py
│       └── __init__.py
├── vibevoice.egg-info
│   ├── dependency_links.txt
│   ├── PKG-INFO
│   ├── requires.txt
│   ├── SOURCES.txt
│   └── top_level.txt
└── WestZhang
    ├── config.json
    ├── model-00001-of-00010.safetensors
    ├── model-00002-of-00010.safetensors
    ├── model-00003-of-00010.safetensors
    ├── model-00004-of-00010.safetensors
    ├── model-00005-of-00010.safetensors
    ├── model-00006-of-00010.safetensors
    ├── model-00007-of-00010.safetensors
    ├── model-00008-of-00010.safetensors
    ├── model-00009-of-00010.safetensors
    ├── model-00010-of-00010.safetensors
    ├── model.safetensors.index.json
    └── preprocessor_config.json

17 directories, 82 files
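
The WestZhang/ folder in the tree holds the 7B weights as ten safetensors shards. Here is a minimal sketch of one way to fetch them and confirm nothing is missing. It assumes the `huggingface_hub` package is installed and that the repo id is the model address from item 4. The helper names `fetch_model` and `missing_shards` are made up for illustration and are not part of the VibeVoice project.

```python
from pathlib import Path


def fetch_model(repo_id: str = "WestZhang/VibeVoice-Large-pt",
                local_dir: str = "WestZhang") -> Path:
    """Download every file of a Hugging Face model repo into local_dir."""
    # Imported lazily so this file can be loaded without the package installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return Path(path)


def missing_shards(model_dir: str, total: int = 10) -> list[str]:
    """Return the expected shard file names that are not present in model_dir."""
    # Shard naming follows the files listed in the tree output.
    expected = [
        f"model-{i:05d}-of-{total:05d}.safetensors" for i in range(1, total + 1)
    ]
    return [name for name in expected if not (Path(model_dir) / name).is_file()]


if __name__ == "__main__":
    target = fetch_model()
    print("missing shards:", missing_shards(str(target)))
```

An empty list from `missing_shards` only means the files exist; it does not verify their contents.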


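After cloning the project with git, I installed it the normal way following the install instructions, but you need to install flash-attention separately yourself. The Python packages in the final inference environment are as follows: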

pip list
Package Version Editable project location
------------------------ ----------- -------------------------
absl-py 2.3.1
accelerate 1.6.0
aiofiles 24.1.0
aioice 0.10.1
aiortc 1.13.0
annotated-types 0.7.0
anyio 4.10.0
audioread 3.0.1
av 14.4.0
Brotli 1.1.0
certifi 2025.8.3
cffi 1.17.1
charset-normalizer 3.4.3
click 8.2.1
cryptography 45.0.6
decorator 5.2.1
diffusers 0.35.1
dnspython 2.7.0
einops 0.8.1
exceptiongroup 1.3.0
fastapi 0.116.1
ffmpy 0.6.1
filelock 3.19.1
flash_attn 2.8.3
fsspec 2025.7.0
google-crc32c 1.7.1
gradio 5.44.0
gradio_client 1.12.1
groovy 0.1.2
h11 0.16.0
hf-xet 1.1.8
httpcore 1.0.9
httpx 0.28.1
huggingface-hub 0.34.4
idna 3.10
ifaddr 0.2.0
importlib_metadata 8.7.0
Jinja2 3.1.6
joblib 1.5.1
lazy_loader 0.4
librosa 0.11.0
llvmlite 0.44.0
markdown-it-py 4.0.0
MarkupSafe 3.0.2
mdurl 0.1.2
ml_collections 1.1.0
mpmath 1.3.0
msgpack 1.1.1
networkx 3.4.2
numba 0.61.2
numpy 2.2.6
nvidia-cublas-cu12 12.8.4.1
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
nvidia-cudnn-cu12 9.10.2.21
nvidia-cufft-cu12 11.3.3.83
nvidia-cufile-cu12 1.13.1.3
nvidia-curand-cu12 10.3.9.90
nvidia-cusolver-cu12 11.7.3.90
nvidia-cusparse-cu12 12.5.8.93
nvidia-cusparselt-cu12 0.7.1
nvidia-nccl-cu12 2.27.3
nvidia-nvjitlink-cu12 12.8.93
nvidia-nvtx-cu12 12.8.90
orjson 3.11.3
packaging 25.0
pandas 2.3.2
pillow 11.3.0
pip 25.1
platformdirs 4.4.0
pooch 1.8.2
psutil 7.0.0
pycparser 2.22
pydantic 2.11.7
pydantic_core 2.33.2
pydub 0.25.1
pyee 13.0.0
Pygments 2.19.2
pylibsrtp 0.12.0
pyOpenSSL 25.1.0
python-dateutil 2.9.0.post0
python-multipart 0.0.20
pytz 2025.2
PyYAML 6.0.2
regex 2025.7.34
requests 2.32.5
rich 14.1.0
ruff 0.12.10
safehttpx 0.1.6
safetensors 0.6.2
scikit-learn 1.7.1
scipy 1.15.3
semantic-version 2.10.0
setuptools 78.1.1
shellingham 1.5.4
six 1.17.0
sniffio 1.3.1
soundfile 0.13.1
soxr 0.5.0.post1
starlette 0.47.3
sympy 1.14.0
threadpoolctl 3.6.0
tokenizers 0.21.4
tomlkit 0.13.3
torch 2.8.0
tqdm 4.67.1
transformers 4.51.3
triton 3.4.0
typer 0.16.1
typing_extensions 4.15.0
typing-inspection 0.4.1
tzdata 2025.2
urllib3 2.5.0
uvicorn 0.35.0
vibevoice 0.0.1 /workspace/VibeVoice
websockets 15.0.1
wheel 0.45.1
zipp 3.23.0
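
If you want to compare your own environment against the list above, here is a small sketch using only the standard library. The package subset in `REPORTED` is my own pick of the ones most likely to matter for inference. Treat the versions as one combination that worked on this machine, not as hard requirements. The function names are hypothetical.

```python
from importlib import metadata

# Versions copied from the pip list above (a hand-picked subset).
REPORTED = {
    "torch": "2.8.0",
    "transformers": "4.51.3",
    "flash_attn": "2.8.3",
    "accelerate": "1.6.0",
    "diffusers": "0.35.1",
    "gradio": "5.44.0",
}


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare(reported: dict) -> dict:
    """Map each package name to a (reported, installed) pair."""
    return {name: (want, installed_version(name)) for name, want in reported.items()}


if __name__ == "__main__":
    for name, (want, have) in compare(REPORTED).items():
        status = "ok" if have == want else "differs"
        print(f"{name:<14} reported={want:<8} installed={have} [{status}]")
```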


Start the Gradio inference front end. Note that the 7B model needs roughly 19 GB of VRAM for inference.

python demo/gradio_demo.py --model_path WestZhang/ --share

APEX FusedRMSNorm not available, using native implementation
🎙️ Initializing VibeVoice Demo with Streaming Support...
Loading processor & model from WestZhang/
loading file vocab.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/merges.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/tokenizer_config.json
loading file chat_template.jinja from cache at None
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'Qwen2Tokenizer'.
The class this function is called from is 'VibeVoiceTextTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
loading configuration file WestZhang/config.json
Model config VibeVoiceConfig {
"acoustic_tokenizer_config": {
"causal": true,
"channels": 1,
"conv_bias": true,
"conv_norm": "none",
"corpus_normalize": 0.0,
"decoder_depths": null,
"decoder_n_filters": 32,
"decoder_ratios": [
8,
5,
5,
4,
2,
2
],
"disable_last_norm": true,
"encoder_depths": "3-3-3-3-3-3-8",
"encoder_n_filters": 32,
"encoder_ratios": [
8,
5,
5,
4,
2,
2
],
"fix_std": 0.5,
"layer_scale_init_value": 1e-06,
"layernorm": "RMSNorm",
"layernorm_elementwise_affine": true,
"layernorm_eps": 1e-05,
"mixer_layer": "depthwise_conv",
"model_type": "vibevoice_acoustic_tokenizer",
"pad_mode": "constant",
"std_dist_type": "gaussian",
"vae_dim": 64,
"weight_init_value": 0.01
},
"acoustic_vae_dim": 64,
"architectures": [
"VibeVoiceForConditionalGeneration"
],
"decoder_config": {
"attention_dropout": 0.0,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": null,
"torch_dtype": "bfloat16",
"use_cache": true,
"use_mrope": false,
"use_sliding_window": false,
"vocab_size": 152064
},
"diffusion_head_config": {
"ddpm_batch_mul": 4,
"ddpm_beta_schedule": "cosine",
"ddpm_num_inference_steps": 20,
"ddpm_num_steps": 1000,
"diffusion_type": "ddpm",
"head_ffn_ratio": 3.0,
"head_layers": 4,
"hidden_size": 3584,
"latent_size": 64,
"model_type": "vibevoice_diffusion_head",
"prediction_type": "v_prediction",
"rms_norm_eps": 1e-05,
"speech_vae_dim": 64
},
"model_type": "vibevoice",
"semantic_tokenizer_config": {
"causal": true,
"channels": 1,
"conv_bias": true,
"conv_norm": "none",
"corpus_normalize": 0.0,
"disable_last_norm": true,
"encoder_depths": "3-3-3-3-3-3-8",
"encoder_n_filters": 32,
"encoder_ratios": [
8,
5,
5,
4,
2,
2
],
"fix_std": 0,
"layer_scale_init_value": 1e-06,
"layernorm": "RMSNorm",
"layernorm_elementwise_affine": true,
"layernorm_eps": 1e-05,
"mixer_layer": "depthwise_conv",
"model_type": "vibevoice_semantic_tokenizer",
"pad_mode": "constant",
"std_dist_type": "none",
"vae_dim": 128,
"weight_init_value": 0.01
},
"semantic_vae_dim": 128,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.51.3"
}
loading weights file WestZhang/model.safetensors.index.json
Instantiating VibeVoiceForConditionalGenerationInference model under default dtype torch.bfloat16.
Generate config GenerationConfig {}
Instantiating Qwen2Model model under default dtype torch.bfloat16.
Instantiating VibeVoiceAcousticTokenizerModel model under default dtype torch.bfloat16.
Instantiating VibeVoiceSemanticTokenizerModel model under default dtype torch.bfloat16.
Instantiating VibeVoiceDiffusionHead model under default dtype torch.bfloat16.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:02<00:00, 3.73it/s]
All model checkpoint weights were used when initializing VibeVoiceForConditionalGenerationInference.
All the weights of VibeVoiceForConditionalGenerationInference were initialized from the model checkpoint at WestZhang/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use VibeVoiceForConditionalGenerationInference for predictions without further training.
Generation config file not found, using a generation config created from the model config.
Language model attention: flash_attention_2
Found 9 voice files in /workspace/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Loaded example: 1p_Ch2EN.txt with 1 speakers
Loaded example: 1p_abs.txt with 1 speakers
Loaded example: 2p_goat.txt with 2 speakers
Loaded example: 2p_music.txt with 2 speakers
Loaded example: 2p_short.txt with 2 speakers
Loaded example: 2p_yayi.txt with 2 speakers
Loaded example: 3p_gpt5.txt with 3 speakers
Skipping 4p_climate_100min.txt: duration 100 minutes exceeds 15-minute limit
Skipping 4p_climate_45min.txt: duration 45 minutes exceeds 15-minute limit
Successfully loaded 7 example scripts
🚀 Launching demo on port 7860
📁 Model path: WestZhang/
🎭 Available voices: 9
🔴 Streaming mode: ENABLED
🔒 Session isolation: ENABLED
* Running on local URL: http://0.0.0.0:7860
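
To get a feel for where the VRAM goes, the `decoder_config` values printed in the log are enough for a back-of-the-envelope count of the language-model weights. This is only a sketch. It assumes the decoder uses the stock Qwen2 layer layout (biases on q/k/v projections, none on the output projection, a three-matrix gated MLP). It ignores the acoustic and semantic tokenizers, the diffusion head, the KV cache and activations.

```python
# Values copied from the "decoder_config" block in the log above.
HIDDEN = 3584
INTERMEDIATE = 18944
LAYERS = 28
HEADS = 28
KV_HEADS = 4
VOCAB = 152064


def qwen2_decoder_params() -> int:
    """Count weights in a Qwen2-style decoder stack, token embedding included."""
    head_dim = HIDDEN // HEADS                  # 128
    kv_dim = KV_HEADS * head_dim                # 512 (grouped-query attention)
    attn = HIDDEN * HIDDEN + HIDDEN             # q_proj weight + bias
    attn += 2 * (HIDDEN * kv_dim + kv_dim)      # k_proj and v_proj, each with bias
    attn += HIDDEN * HIDDEN                     # o_proj, no bias
    mlp = 3 * HIDDEN * INTERMEDIATE             # gate, up and down projections
    norms = 2 * HIDDEN                          # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    # Add the final RMSNorm and the token embedding matrix.
    return LAYERS * per_layer + HIDDEN + VOCAB * HIDDEN


def weight_gb(params: int, bits: int) -> float:
    """Storage needed for the weights alone, in decimal gigabytes."""
    return params * bits / 8 / 1e9


if __name__ == "__main__":
    n = qwen2_decoder_params()
    print(f"decoder parameters: {n / 1e9:.2f} B")
    print(f"bfloat16 weights:   {weight_gb(n, 16):.1f} GB")
    print(f"4-bit weights:      {weight_gb(n, 4):.1f} GB")
```

Under these assumptions the decoder comes to roughly 7 billion parameters, or about 14 GB in bfloat16. The gap up to the ~19 GB observed is presumably taken by the other components and runtime buffers, though I have not measured that breakdown.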


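You can then open the web page of the inference service. It should look like this: [screenshot]

The official release ships with 3 Chinese voices.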
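
The count can be read straight off the "Available voices" line in the log. Here is a small sketch that groups the voice names by their prefix. It assumes the part before the first hyphen is a language tag (`zh` for Chinese, `en` for English), which is my reading of the file names rather than something the project documents here. `group_by_prefix` and `voices_in` are hypothetical helpers.

```python
from collections import defaultdict
from pathlib import Path

# Voice names as printed in the "Available voices" log line.
VOICES = [
    "en-Alice_woman", "en-Carter_man", "en-Frank_man", "en-Mary_woman_bgm",
    "en-Maya_woman", "in-Samuel_man", "zh-Anchen_man_bgm", "zh-Bowen_man",
    "zh-Xinran_woman",
]


def group_by_prefix(names):
    """Group voice names by the text before the first hyphen."""
    groups = defaultdict(list)
    for name in names:
        prefix, _, _ = name.partition("-")
        groups[prefix].append(name)
    return dict(groups)


def voices_in(directory: str):
    """List the .wav file stems in a directory such as demo/voices."""
    return sorted(p.stem for p in Path(directory).glob("*.wav"))


if __name__ == "__main__":
    for prefix, names in group_by_prefix(VOICES).items():
        print(prefix, len(names), names)
```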

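zhuxuan · Yesterday 13:45

Microsoft is doing pretty well with this.

橙电儿 · Yesterday 14:10

I hope they release something more lightweight too.

shadowdow · Yesterday 14:24

I only have a 3060 Laptop, so I'll have to wait until the requirements come down before I can try it.

79awcn · Yesterday 15:37

I'll go look into it.

快兔悦 · Yesterday 15:59

I tested the 1.5B model and its output is almost indistinguishable from the 7B. It also needs only a little over 6 GB of VRAM.

bngvixqjyj · Yesterday 16:07

I've heard from people overseas that a 3060 12G can handle it, but you have to quantize it to 4-bit yourself.

xiaoji1543 · 2 hours ago

Microsoft released its latest open-source TTS model, VibeVoice, yesterday.
I got off work early today, so I found a test machine, set up an environment, and am sharing it for anyone on L站 who needs it.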
