VibeVoice, newly open-sourced by Microsoft_Frontier Intelligence_BigSeek AI Community

chnnic posted on 2025-8-28 13:21:08

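Microsoft released its latest open-source TTS model, VibeVoice, yesterday.
I got off work early today, set up an environment on a test machine, and am sharing it with anyone on L站 who needs it.
1. Physical machine: i7-13700KF, 128 GB DDR5, NVIDIA RTX 3090 (24 GB VRAM)
2. Operating system: Ubuntu 22.04.5 LTS
3. Inference environment: Python 3.10.18
4. Model: WestZhang/VibeVoice-Large-pt · Hugging Face
This is the 7B "large" version. With 8 GB of VRAM you can run the 1.5B "medium" version, and a 0.5B "small" version is still awaiting release.
5. Directory structure:
(vibevoice) root@dev-3090:/workspace/VibeVoice# tree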
.
├── demo
│ ├── example
│ │ ├── 1p_EN2CH.mp4
│ │ ├── 2p_see_u_again.mp4
│ │ └── 4p_climate_45min.mp4
│ ├── gradio_demo.py
│ ├── inference_from_file.py
│ ├── text_examples
│ │ ├── 1p_abs.txt
│ │ ├── 1p_Ch2EN.txt
│ │ ├── 2p_goat.txt
│ │ ├── 2p_music.txt
│ │ ├── 2p_short.txt
│ │ ├── 2p_yayi.txt
│ │ ├── 3p_gpt5.txt
│ │ ├── 4p_climate_100min.txt
│ │ └── 4p_climate_45min.txt
│ └── voices
│    ├── en-Alice_woman.wav
│    ├── en-Carter_man.wav
│    ├── en-Frank_man.wav
│    ├── en-Mary_woman_bgm.wav
│    ├── en-Maya_woman.wav
│    ├── in-Samuel_man.wav
│    ├── zh-Anchen_man_bgm.wav
│    ├── zh-Bowen_man.wav
│    └── zh-Xinran_woman.wav
├── Figures
│ ├── Google_AI_Studio_2025-08-25T21_48_13.452Z.png
│ ├── MOS-preference.png
│ └── VibeVoice.jpg
├── flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
├── LICENSE
├── pyproject.toml
├── README.md
├── SECURITY.md
├── vibevoice
│ ├── configs
│ │ ├── qwen2.5_1.5b_64k.json
│ │ └── qwen2.5_7b_32k.json
│ ├── __init__.py
│ ├── modular
│ │ ├── configuration_vibevoice.py
│ │ ├── __init__.py
│ │ ├── modeling_vibevoice_inference.py
│ │ ├── modeling_vibevoice.py
│ │ ├── modular_vibevoice_diffusion_head.py
│ │ ├── modular_vibevoice_text_tokenizer.py
│ │ ├── modular_vibevoice_tokenizer.py
│ │ ├── __pycache__
│ │ │ ├── configuration_vibevoice.cpython-310.pyc
│ │ │ ├── __init__.cpython-310.pyc
│ │ │ ├── modeling_vibevoice.cpython-310.pyc
│ │ │ ├── modeling_vibevoice_inference.cpython-310.pyc
│ │ │ ├── modular_vibevoice_diffusion_head.cpython-310.pyc
│ │ │ ├── modular_vibevoice_text_tokenizer.cpython-310.pyc
│ │ │ ├── modular_vibevoice_tokenizer.cpython-310.pyc
│ │ │ └── streamer.cpython-310.pyc
│ │ └── streamer.py
│ ├── processor
│ │ ├── __init__.py
│ │ ├── __pycache__
│ │ │ ├── __init__.cpython-310.pyc
│ │ │ ├── vibevoice_processor.cpython-310.pyc
│ │ │ └── vibevoice_tokenizer_processor.cpython-310.pyc
│ │ ├── vibevoice_processor.py
│ │ └── vibevoice_tokenizer_processor.py
│ ├── __pycache__
│ │ └── __init__.cpython-310.pyc
│ ├── schedule
│ │ ├── dpm_solver.py
│ │ ├── __init__.py
│ │ ├── __pycache__
│ │ │ ├── dpm_solver.cpython-310.pyc
│ │ │ └── __init__.cpython-310.pyc
│ │ └── timestep_sampler.py
│ └── scripts
│    ├── convert_nnscaler_checkpoint_to_transformers.py
│    └── __init__.py
├── vibevoice.egg-info
│ ├── dependency_links.txt
│ ├── PKG-INFO
│ ├── requires.txt
│ ├── SOURCES.txt
│ └── top_level.txt
└── WestZhang
    ├── config.json
    ├── model-00001-of-00010.safetensors
    ├── model-00002-of-00010.safetensors
    ├── model-00003-of-00010.safetensors
    ├── model-00004-of-00010.safetensors
    ├── model-00005-of-00010.safetensors
    ├── model-00006-of-00010.safetensors
    ├── model-00007-of-00010.safetensors
    ├── model-00008-of-00010.safetensors
    ├── model-00009-of-00010.safetensors
    ├── model-00010-of-00010.safetensors
    ├── model.safetensors.index.json
    └── preprocessor_config.json

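17 directories, 82 files

After pulling the project with git, I installed it following the normal install instructions, but flash-attention has to be installed separately. The final Python packages in the inference environment are: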
pip list
Package                Version Editable project location
------------------------ ----------- -------------------------
absl-py                2.3.1
accelerate             1.6.0
aiofiles             24.1.0
aioice                0.10.1
aiortc                1.13.0
annotated-types       0.7.0
anyio                4.10.0
audioread             3.0.1
av                   14.4.0
Brotli                1.1.0
certifi                2025.8.3
cffi                   1.17.1
charset-normalizer    3.4.3
click                8.2.1
cryptography          45.0.6
decorator             5.2.1
diffusers             0.35.1
dnspython             2.7.0
einops                0.8.1
exceptiongroup       1.3.0
fastapi                0.116.1
ffmpy                0.6.1
filelock             3.19.1
flash_attn             2.8.3
fsspec                2025.7.0
google-crc32c          1.7.1
gradio                5.44.0
gradio_client          1.12.1
groovy                0.1.2
h11                   0.16.0
hf-xet                1.1.8
httpcore             1.0.9
httpx                0.28.1
huggingface-hub       0.34.4
idna                   3.10
ifaddr                0.2.0
importlib_metadata    8.7.0
Jinja2                3.1.6
joblib                1.5.1
lazy_loader          0.4
librosa                0.11.0
llvmlite             0.44.0
markdown-it-py       4.0.0
MarkupSafe             3.0.2
mdurl                0.1.2
ml_collections       1.1.0
mpmath                1.3.0
msgpack                1.1.1
networkx             3.4.2
numba                0.61.2
numpy                2.2.6
nvidia-cublas-cu12    12.8.4.1
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
nvidia-cudnn-cu12    9.10.2.21
nvidia-cufft-cu12    11.3.3.83
nvidia-cufile-cu12    1.13.1.3
nvidia-curand-cu12    10.3.9.90
nvidia-cusolver-cu12 11.7.3.90
nvidia-cusparse-cu12 12.5.8.93
nvidia-cusparselt-cu12 0.7.1
nvidia-nccl-cu12       2.27.3
nvidia-nvjitlink-cu12 12.8.93
nvidia-nvtx-cu12       12.8.90
orjson                3.11.3
packaging             25.0
pandas                2.3.2
pillow                11.3.0
pip                   25.1
platformdirs          4.4.0
pooch                1.8.2
psutil                7.0.0
pycparser             2.22
pydantic             2.11.7
pydantic_core          2.33.2
pydub                0.25.1
pyee                   13.0.0
Pygments             2.19.2
pylibsrtp             0.12.0
pyOpenSSL             25.1.0
python-dateutil       2.9.0.post0
python-multipart       0.0.20
pytz                   2025.2
PyYAML                6.0.2
regex                2025.7.34
requests             2.32.5
rich                   14.1.0
ruff                   0.12.10
safehttpx             0.1.6
safetensors          0.6.2
scikit-learn          1.7.1
scipy                1.15.3
semantic-version       2.10.0
setuptools             78.1.1
shellingham          1.5.4
six                   1.17.0
sniffio                1.3.1
soundfile             0.13.1
soxr                   0.5.0.post1
starlette             0.47.3
sympy                1.14.0
threadpoolctl          3.6.0
tokenizers             0.21.4
tomlkit                0.13.3
torch                2.8.0
tqdm                   4.67.1
transformers          4.51.3
triton                3.4.0
typer                0.16.1
typing_extensions    4.15.0
typing-inspection    0.4.1
tzdata                2025.2
urllib3                2.5.0
uvicorn                0.35.0
vibevoice             0.0.1    /workspace/VibeVoice
websockets             15.0.1
wheel                0.45.1
zipp                   3.23.0

Start the Gradio inference front end. Note that inference with the 7B model needs roughly 19 GB of VRAM:
python demo/gradio_demo.py --model_path WestZhang/ --share
APEX FusedRMSNorm not available, using native implementation
🎙️ Initializing VibeVoice Demo with Streaming Support...
Loading processor & model from WestZhang/
loading file vocab.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/merges.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/tokenizer_config.json
loading file chat_template.jinja from cache at None
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'Qwen2Tokenizer'.
The class this function is called from is 'VibeVoiceTextTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
loading configuration file WestZhang/config.json
Model config VibeVoiceConfig {
"acoustic_tokenizer_config": {
"causal": true,
"channels": 1,
"conv_bias": true,
"conv_norm": "none",
"corpus_normalize": 0.0,
"decoder_depths": null,
"decoder_n_filters": 32,
"decoder_ratios": [
   8,
   5,
   5,
   4,
   2,
   2
],
"disable_last_norm": true,
"encoder_depths": "3-3-3-3-3-3-8",
"encoder_n_filters": 32,
"encoder_ratios": [
   8,
   5,
   5,
   4,
   2,
   2
],
"fix_std": 0.5,
"layer_scale_init_value": 1e-06,
"layernorm": "RMSNorm",
"layernorm_elementwise_affine": true,
"layernorm_eps": 1e-05,
"mixer_layer": "depthwise_conv",
"model_type": "vibevoice_acoustic_tokenizer",
"pad_mode": "constant",
"std_dist_type": "gaussian",
"vae_dim": 64,
"weight_init_value": 0.01
},
"acoustic_vae_dim": 64,
"architectures": [
"VibeVoiceForConditionalGeneration"
],
"decoder_config": {
"attention_dropout": 0.0,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": null,
"torch_dtype": "bfloat16",
"use_cache": true,
"use_mrope": false,
"use_sliding_window": false,
"vocab_size": 152064
},
"diffusion_head_config": {
"ddpm_batch_mul": 4,
"ddpm_beta_schedule": "cosine",
"ddpm_num_inference_steps": 20,
"ddpm_num_steps": 1000,
"diffusion_type": "ddpm",
"head_ffn_ratio": 3.0,
"head_layers": 4,
"hidden_size": 3584,
"latent_size": 64,
"model_type": "vibevoice_diffusion_head",
"prediction_type": "v_prediction",
"rms_norm_eps": 1e-05,
"speech_vae_dim": 64
},
"model_type": "vibevoice",
"semantic_tokenizer_config": {
"causal": true,
"channels": 1,
"conv_bias": true,
"conv_norm": "none",
"corpus_normalize": 0.0,
"disable_last_norm": true,
"encoder_depths": "3-3-3-3-3-3-8",
"encoder_n_filters": 32,
"encoder_ratios": [
   8,
   5,
   5,
   4,
   2,
   2
],
"fix_std": 0,
"layer_scale_init_value": 1e-06,
"layernorm": "RMSNorm",
"layernorm_elementwise_affine": true,
"layernorm_eps": 1e-05,
"mixer_layer": "depthwise_conv",
"model_type": "vibevoice_semantic_tokenizer",
"pad_mode": "constant",
"std_dist_type": "none",
"vae_dim": 128,
"weight_init_value": 0.01
},
"semantic_vae_dim": 128,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.51.3"
}

loading weights file WestZhang/model.safetensors.index.json
Instantiating VibeVoiceForConditionalGenerationInference model under default dtype torch.bfloat16.
Generate config GenerationConfig {}

Instantiating Qwen2Model model under default dtype torch.bfloat16.
Instantiating VibeVoiceAcousticTokenizerModel model under default dtype torch.bfloat16.
Instantiating VibeVoiceSemanticTokenizerModel model under default dtype torch.bfloat16.
Instantiating VibeVoiceDiffusionHead model under default dtype torch.bfloat16.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 10/10
All model checkpoint weights were used when initializing VibeVoiceForConditionalGenerationInference.

All the weights of VibeVoiceForConditionalGenerationInference were initialized from the model checkpoint at WestZhang/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use VibeVoiceForConditionalGenerationInference for predictions without further training.
Generation config file not found, using a generation config created from the model config.
Language model attention: flash_attention_2
Found 9 voice files in /workspace/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Loaded example: 1p_Ch2EN.txt with 1 speakers
Loaded example: 1p_abs.txt with 1 speakers
Loaded example: 2p_goat.txt with 2 speakers
Loaded example: 2p_music.txt with 2 speakers
Loaded example: 2p_short.txt with 2 speakers
Loaded example: 2p_yayi.txt with 2 speakers
Loaded example: 3p_gpt5.txt with 3 speakers
Skipping 4p_climate_100min.txt: duration 100 minutes exceeds 15-minute limit
Skipping 4p_climate_45min.txt: duration 45 minutes exceeds 15-minute limit
Successfully loaded 7 example scripts
🚀 Launching demo on port 7860
📁 Model path: WestZhang/
🎭 Available voices: 9
🔴 Streaming mode: ENABLED
🔒 Session isolation: ENABLED
* Running on local URL:  http://0.0.0.0:7860

You can then open the inference service's web page in a browser.
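
As an aside on the config dump in the startup log: if the six `encoder_ratios` values are per-stage downsampling strides (an interpretation, not something the log states), their product gives the number of audio samples folded into one latent frame. The sketch below does that arithmetic. The 24 kHz sample rate is an assumption, since no sample rate appears anywhere in the log, and the function names are made up for illustration.

```python
from math import prod

# Copied from "encoder_ratios" in the acoustic_tokenizer_config section of the
# log (the semantic tokenizer lists the same values).
ENCODER_RATIOS = [8, 5, 5, 4, 2, 2]

# ASSUMPTION: 24 kHz audio. The log does not print a sample rate.
ASSUMED_SAMPLE_RATE_HZ = 24_000


def samples_per_frame(ratios):
    """Total downsampling factor, treating each ratio as a stride."""
    return prod(ratios)


def frames_per_second(sample_rate_hz, ratios):
    """Latent frames produced per second of audio under that reading."""
    return sample_rate_hz / samples_per_frame(ratios)


if __name__ == "__main__":
    print("samples per latent frame:", samples_per_frame(ENCODER_RATIOS))
    print("latent frames per second:",
          frames_per_second(ASSUMED_SAMPLE_RATE_HZ, ENCODER_RATIOS))
```

Under those assumptions, a low frame rate like this would keep the sequence the language model has to process short, even for long scripts.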

The official release ships with 3 Chinese voices.
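
The voice names printed in the startup log appear to follow a `<lang>-<Name>_<gender>[_bgm]` pattern. That pattern is inferred from the nine file names, not documented in the post. Here is a minimal sketch that splits the names on that assumed convention and filters by language prefix. `Voice`, `parse_voice` and `voices_for` are illustrative helpers, not part of the VibeVoice code.

```python
from dataclasses import dataclass

# Voice names copied from the "Available voices" line in the startup log.
VOICES = [
    "en-Alice_woman", "en-Carter_man", "en-Frank_man", "en-Mary_woman_bgm",
    "en-Maya_woman", "in-Samuel_man", "zh-Anchen_man_bgm", "zh-Bowen_man",
    "zh-Xinran_woman",
]


@dataclass(frozen=True)
class Voice:
    lang: str      # prefix before the first "-", e.g. "zh"
    name: str      # speaker name, e.g. "Bowen"
    gender: str    # "man" or "woman" in the shipped files
    has_bgm: bool  # True when the name ends in "_bgm"


def parse_voice(stem: str) -> Voice:
    """Split '<lang>-<Name>_<gender>[_bgm]' (assumed naming convention)."""
    lang, rest = stem.split("-", 1)
    parts = rest.split("_")
    has_bgm = parts[-1] == "bgm"
    if has_bgm:
        parts = parts[:-1]
    name, gender = parts[0], parts[1]
    return Voice(lang, name, gender, has_bgm)


def voices_for(lang: str) -> list[str]:
    """Return the voice names whose language prefix matches `lang`."""
    return [v for v in VOICES if parse_voice(v).lang == lang]


if __name__ == "__main__":
    print(voices_for("zh"))
```

Filtering on the `zh` prefix returns the three Chinese voices mentioned above, one of which carries the `_bgm` suffix.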

zhuxuan posted on 2025-8-28 13:45:04

Nice one, Microsoft.

橙电儿 posted on 2025-8-28 14:10:03

I hope they release something more lightweight.

shadowdow posted on 2025-8-28 14:24:03

All I have is a 3060 Laptop, so I'll have to wait until the requirements come down before I can try it.

79awcn posted on 2025-8-28 15:37:10

Going to look into it.

快兔悦 posted on 2025-8-28 15:59:04

I tested the 1.5B model, and its output is almost indistinguishable from the 7B's. It also needs only a little over 6 GB of VRAM.

bngvixqjyj posted on 2025-8-28 16:07:04

I've heard from people overseas that a 3060 12G can handle it, but you have to quantize the model to 4-bit yourself.
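
A rough way to sanity-check the 4-bit claim is weights-only arithmetic: parameter count times bits per parameter. The sketch below uses the nominal counts from the model names (7B and 1.5B), which is an assumption: the real checkpoints also contain the tokenizers and diffusion head, and quantized formats store extra scale data. It also ignores activations and the KV cache, so actual usage will be higher. The thread reports about 19 GB in practice for the 7B model.

```python
GIB = 1024 ** 3

# Bits stored per parameter for each precision (weights only; real 4-bit
# formats add a small overhead for scales that is ignored here).
BITS_PER_PARAM = {"bf16": 16, "int8": 8, "4bit": 4}


def weight_gib(n_params: float, precision: str) -> float:
    """Approximate size of the weights alone, in GiB."""
    return n_params * BITS_PER_PARAM[precision] / 8 / GIB


if __name__ == "__main__":
    # Nominal parameter counts taken from the model names (assumption).
    for n_params, label in ((7e9, "7B"), (1.5e9, "1.5B")):
        for precision in BITS_PER_PARAM:
            size = weight_gib(n_params, precision)
            print(f"{label} @ {precision}: {size:.1f} GiB")
```

By this estimate, 4-bit weights take a quarter of the bf16 footprint, which is why a 12 GB card becomes plausible for the 7B model.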

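xiaoji1543 posted on 2025-8-29 08:10:23

Microsoft released its latest open-source TTS model, VibeVoice, yesterday.
I got off work early today, set up an environment on a test machine, and am sharing it with anyone on L站 who needs it.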

125372883 posted 6 days ago

Since it was only just released, all sorts of quantized versions will probably be out within a few days.

dxking posted 5 days ago

Can't afford to play with this... I don't have the hardware.

亮情电 posted 5 days ago

The 7B model really is quite large, but there is also an official 1.5B "medium" version. It runs with just 8 GB of VRAM.
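
The VRAM figures quoted in this thread can be turned into a simple selection rule. The sketch below is only that: the thresholds are the numbers posters reported (about 19 GB for 7B, 8 GB for 1.5B), not official requirements, and `pick_variant` and its `headroom_gb` margin are hypothetical.

```python
from typing import Optional

# (variant label, VRAM in GB as reported in this thread), largest first.
# These are forum-reported figures, not official requirements.
REPORTED_VRAM_GB = [("7B", 19.0), ("1.5B", 8.0)]


def pick_variant(free_vram_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the largest variant that fits, or None if neither does.

    `headroom_gb` is an arbitrary safety margin added for illustration.
    """
    for name, needed_gb in REPORTED_VRAM_GB:
        if free_vram_gb >= needed_gb + headroom_gb:
            return name
    return None


if __name__ == "__main__":
    for vram in (24, 12, 6):
        print(f"{vram} GB free -> {pick_variant(vram)}")
```

One poster found that 1.5B needs only a little over 6 GB, so the 8 GB threshold used here is on the conservative side.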
