Frontier Intel · posted yesterday 13:21
Microsoft released its latest open-source TTS model, VibeVoice, yesterday.
I got off work early today, set up an environment on a test machine, and am sharing it with any fellow L站 members who need it.
1. Host machine: i7-13700KF, 128 GB DDR5, NVIDIA RTX 3090 (24 GB VRAM)
2. OS: Ubuntu 22.04.5 LTS
3. Inference environment: Python 3.10.18
4. Model: WestZhang/VibeVoice-Large-pt · Hugging Face
This is the 7B "large" version. If you only have 8 GB of VRAM, you can run the 1.5B "medium" version instead; a 0.5B "small" version is still awaiting release.
5. Directory structure:
(vibevoice) root@dev-3090:/workspace/VibeVoice# tree
.
├── demo
│   ├── example
│   │   ├── 1p_EN2CH.mp4
│   │   ├── 2p_see_u_again.mp4
│   │   └── 4p_climate_45min.mp4
│   ├── gradio_demo.py
│   ├── inference_from_file.py
│   ├── text_examples
│   │   ├── 1p_abs.txt
│   │   ├── 1p_Ch2EN.txt
│   │   ├── 2p_goat.txt
│   │   ├── 2p_music.txt
│   │   ├── 2p_short.txt
│   │   ├── 2p_yayi.txt
│   │   ├── 3p_gpt5.txt
│   │   ├── 4p_climate_100min.txt
│   │   └── 4p_climate_45min.txt
│   └── voices
│       ├── en-Alice_woman.wav
│       ├── en-Carter_man.wav
│       ├── en-Frank_man.wav
│       ├── en-Mary_woman_bgm.wav
│       ├── en-Maya_woman.wav
│       ├── in-Samuel_man.wav
│       ├── zh-Anchen_man_bgm.wav
│       ├── zh-Bowen_man.wav
│       └── zh-Xinran_woman.wav
├── Figures
│   ├── Google_AI_Studio_2025-08-25T21_48_13.452Z.png
│   ├── MOS-preference.png
│   └── VibeVoice.jpg
├── flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
├── LICENSE
├── pyproject.toml
├── README.md
├── SECURITY.md
├── vibevoice
│   ├── configs
│   │   ├── qwen2.5_1.5b_64k.json
│   │   └── qwen2.5_7b_32k.json
│   ├── __init__.py
│   ├── modular
│   │   ├── configuration_vibevoice.py
│   │   ├── __init__.py
│   │   ├── modeling_vibevoice_inference.py
│   │   ├── modeling_vibevoice.py
│   │   ├── modular_vibevoice_diffusion_head.py
│   │   ├── modular_vibevoice_text_tokenizer.py
│   │   ├── modular_vibevoice_tokenizer.py
│   │   ├── __pycache__
│   │   │   ├── configuration_vibevoice.cpython-310.pyc
│   │   │   ├── __init__.cpython-310.pyc
│   │   │   ├── modeling_vibevoice.cpython-310.pyc
│   │   │   ├── modeling_vibevoice_inference.cpython-310.pyc
│   │   │   ├── modular_vibevoice_diffusion_head.cpython-310.pyc
│   │   │   ├── modular_vibevoice_text_tokenizer.cpython-310.pyc
│   │   │   ├── modular_vibevoice_tokenizer.cpython-310.pyc
│   │   │   └── streamer.cpython-310.pyc
│   │   └── streamer.py
│   ├── processor
│   │   ├── __init__.py
│   │   ├── __pycache__
│   │   │   ├── __init__.cpython-310.pyc
│   │   │   ├── vibevoice_processor.cpython-310.pyc
│   │   │   └── vibevoice_tokenizer_processor.cpython-310.pyc
│   │   ├── vibevoice_processor.py
│   │   └── vibevoice_tokenizer_processor.py
│   ├── __pycache__
│   │   └── __init__.cpython-310.pyc
│   ├── schedule
│   │   ├── dpm_solver.py
│   │   ├── __init__.py
│   │   ├── __pycache__
│   │   │   ├── dpm_solver.cpython-310.pyc
│   │   │   └── __init__.cpython-310.pyc
│   │   └── timestep_sampler.py
│   └── scripts
│       ├── convert_nnscaler_checkpoint_to_transformers.py
│       └── __init__.py
├── vibevoice.egg-info
│   ├── dependency_links.txt
│   ├── PKG-INFO
│   ├── requires.txt
│   ├── SOURCES.txt
│   └── top_level.txt
└── WestZhang
    ├── config.json
    ├── model-00001-of-00010.safetensors
    ├── model-00002-of-00010.safetensors
    ├── model-00003-of-00010.safetensors
    ├── model-00004-of-00010.safetensors
    ├── model-00005-of-00010.safetensors
    ├── model-00006-of-00010.safetensors
    ├── model-00007-of-00010.safetensors
    ├── model-00008-of-00010.safetensors
    ├── model-00009-of-00010.safetensors
    ├── model-00010-of-00010.safetensors
    ├── model.safetensors.index.json
    └── preprocessor_config.json

17 directories, 82 files
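A rough back-of-the-envelope way to see where the VRAM numbers above come from (my own estimate, not an official figure): weight memory is just parameter count times bytes per parameter, and runtime overhead (activations, KV cache, diffusion head) sits on top of that, which is why the 7B run further down peaks near 19 GB rather than the bare ~14 GB of bf16 weights.

```python
def weight_vram_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate VRAM (GB) for model weights alone: 1e9 params * bits/8 bytes."""
    return params_billion * bits_per_param / 8

print(weight_vram_gb(7, 16))    # 7B "large" in bfloat16 -> 14.0 GB of weights
print(weight_vram_gb(1.5, 16))  # 1.5B "medium" in bfloat16 -> 3.0 GB, fits an 8 GB card
print(weight_vram_gb(7, 4))     # 7B quantized to 4-bit -> 3.5 GB of weights
```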
After cloning the project with git, I installed it normally following the official install instructions; the one extra step is installing flash-attention yourself (I used the prebuilt wheel visible in the tree above). The Python packages in the final inference environment:
pip list
Package                  Version     Editable project location
------------------------ ----------- -------------------------
absl-py                  2.3.1
accelerate               1.6.0
aiofiles                 24.1.0
aioice                   0.10.1
aiortc                   1.13.0
annotated-types          0.7.0
anyio                    4.10.0
audioread                3.0.1
av                       14.4.0
Brotli                   1.1.0
certifi                  2025.8.3
cffi                     1.17.1
charset-normalizer       3.4.3
click                    8.2.1
cryptography             45.0.6
decorator                5.2.1
diffusers                0.35.1
dnspython                2.7.0
einops                   0.8.1
exceptiongroup           1.3.0
fastapi                  0.116.1
ffmpy                    0.6.1
filelock                 3.19.1
flash_attn               2.8.3
fsspec                   2025.7.0
google-crc32c            1.7.1
gradio                   5.44.0
gradio_client            1.12.1
groovy                   0.1.2
h11                      0.16.0
hf-xet                   1.1.8
httpcore                 1.0.9
httpx                    0.28.1
huggingface-hub          0.34.4
idna                     3.10
ifaddr                   0.2.0
importlib_metadata       8.7.0
Jinja2                   3.1.6
joblib                   1.5.1
lazy_loader              0.4
librosa                  0.11.0
llvmlite                 0.44.0
markdown-it-py           4.0.0
MarkupSafe               3.0.2
mdurl                    0.1.2
ml_collections           1.1.0
mpmath                   1.3.0
msgpack                  1.1.1
networkx                 3.4.2
numba                    0.61.2
numpy                    2.2.6
nvidia-cublas-cu12       12.8.4.1
nvidia-cuda-cupti-cu12   12.8.90
nvidia-cuda-nvrtc-cu12   12.8.93
nvidia-cuda-runtime-cu12 12.8.90
nvidia-cudnn-cu12        9.10.2.21
nvidia-cufft-cu12        11.3.3.83
nvidia-cufile-cu12       1.13.1.3
nvidia-curand-cu12       10.3.9.90
nvidia-cusolver-cu12     11.7.3.90
nvidia-cusparse-cu12     12.5.8.93
nvidia-cusparselt-cu12   0.7.1
nvidia-nccl-cu12         2.27.3
nvidia-nvjitlink-cu12    12.8.93
nvidia-nvtx-cu12         12.8.90
orjson                   3.11.3
packaging                25.0
pandas                   2.3.2
pillow                   11.3.0
pip                      25.1
platformdirs             4.4.0
pooch                    1.8.2
psutil                   7.0.0
pycparser                2.22
pydantic                 2.11.7
pydantic_core            2.33.2
pydub                    0.25.1
pyee                     13.0.0
Pygments                 2.19.2
pylibsrtp                0.12.0
pyOpenSSL                25.1.0
python-dateutil          2.9.0.post0
python-multipart         0.0.20
pytz                     2025.2
PyYAML                   6.0.2
regex                    2025.7.34
requests                 2.32.5
rich                     14.1.0
ruff                     0.12.10
safehttpx                0.1.6
safetensors              0.6.2
scikit-learn             1.7.1
scipy                    1.15.3
semantic-version         2.10.0
setuptools               78.1.1
shellingham              1.5.4
six                      1.17.0
sniffio                  1.3.1
soundfile                0.13.1
soxr                     0.5.0.post1
starlette                0.47.3
sympy                    1.14.0
threadpoolctl            3.6.0
tokenizers               0.21.4
tomlkit                  0.13.3
torch                    2.8.0
tqdm                     4.67.1
transformers             4.51.3
triton                   3.4.0
typer                    0.16.1
typing_extensions        4.15.0
typing-inspection        0.4.1
tzdata                   2025.2
urllib3                  2.5.0
uvicorn                  0.35.0
vibevoice                0.0.1       /workspace/VibeVoice
websockets               15.0.1
wheel                    0.45.1
zipp                     3.23.0
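Before launching the demo, a quick sanity check that the key packages above actually resolve on the current interpreter, without loading the model. This helper is my own, not part of the repo:

```python
import importlib.util

def installed(package: str) -> bool:
    """Return True if `package` can be found on the current Python path."""
    return importlib.util.find_spec(package) is not None

# Packages the Gradio demo depends on, per the pip list above
for pkg in ("torch", "transformers", "gradio", "flash_attn", "vibevoice"):
    print(f"{pkg}: {'OK' if installed(pkg) else 'MISSING'}")
```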
Now start the Gradio inference web UI. Note that the 7B model needs roughly 19 GB of VRAM for inference:
python demo/gradio_demo.py --model_path WestZhang/ --share
APEX FusedRMSNorm not available, using native implementation
🎙️ Initializing VibeVoice Demo with Streaming Support...
Loading processor & model from WestZhang/
loading file vocab.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/merges.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-7B/snapshots/d149729398750b98c0af14eb82c78cfe92750796/tokenizer_config.json
loading file chat_template.jinja from cache at None
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'Qwen2Tokenizer'.
The class this function is called from is 'VibeVoiceTextTokenizerFast'.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
loading configuration file WestZhang/config.json
Model config VibeVoiceConfig {
  "acoustic_tokenizer_config": {
    "causal": true,
    "channels": 1,
    "conv_bias": true,
    "conv_norm": "none",
    "corpus_normalize": 0.0,
    "decoder_depths": null,
    "decoder_n_filters": 32,
    "decoder_ratios": [
      8,
      5,
      5,
      4,
      2,
      2
    ],
    "disable_last_norm": true,
    "encoder_depths": "3-3-3-3-3-3-8",
    "encoder_n_filters": 32,
    "encoder_ratios": [
      8,
      5,
      5,
      4,
      2,
      2
    ],
    "fix_std": 0.5,
    "layer_scale_init_value": 1e-06,
    "layernorm": "RMSNorm",
    "layernorm_elementwise_affine": true,
    "layernorm_eps": 1e-05,
    "mixer_layer": "depthwise_conv",
    "model_type": "vibevoice_acoustic_tokenizer",
    "pad_mode": "constant",
    "std_dist_type": "gaussian",
    "vae_dim": 64,
    "weight_init_value": 0.01
  },
  "acoustic_vae_dim": 64,
  "architectures": [
    "VibeVoiceForConditionalGeneration"
  ],
  "decoder_config": {
    "attention_dropout": 0.0,
    "hidden_act": "silu",
    "hidden_size": 3584,
    "initializer_range": 0.02,
    "intermediate_size": 18944,
    "max_position_embeddings": 32768,
    "max_window_layers": 28,
    "model_type": "qwen2",
    "num_attention_heads": 28,
    "num_hidden_layers": 28,
    "num_key_value_heads": 4,
    "rms_norm_eps": 1e-06,
    "rope_scaling": null,
    "rope_theta": 1000000.0,
    "sliding_window": null,
    "torch_dtype": "bfloat16",
    "use_cache": true,
    "use_mrope": false,
    "use_sliding_window": false,
    "vocab_size": 152064
  },
  "diffusion_head_config": {
    "ddpm_batch_mul": 4,
    "ddpm_beta_schedule": "cosine",
    "ddpm_num_inference_steps": 20,
    "ddpm_num_steps": 1000,
    "diffusion_type": "ddpm",
    "head_ffn_ratio": 3.0,
    "head_layers": 4,
    "hidden_size": 3584,
    "latent_size": 64,
    "model_type": "vibevoice_diffusion_head",
    "prediction_type": "v_prediction",
    "rms_norm_eps": 1e-05,
    "speech_vae_dim": 64
  },
  "model_type": "vibevoice",
  "semantic_tokenizer_config": {
    "causal": true,
    "channels": 1,
    "conv_bias": true,
    "conv_norm": "none",
    "corpus_normalize": 0.0,
    "disable_last_norm": true,
    "encoder_depths": "3-3-3-3-3-3-8",
    "encoder_n_filters": 32,
    "encoder_ratios": [
      8,
      5,
      5,
      4,
      2,
      2
    ],
    "fix_std": 0,
    "layer_scale_init_value": 1e-06,
    "layernorm": "RMSNorm",
    "layernorm_elementwise_affine": true,
    "layernorm_eps": 1e-05,
    "mixer_layer": "depthwise_conv",
    "model_type": "vibevoice_semantic_tokenizer",
    "pad_mode": "constant",
    "std_dist_type": "none",
    "vae_dim": 128,
    "weight_init_value": 0.01
  },
  "semantic_vae_dim": 128,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.51.3"
}

loading weights file WestZhang/model.safetensors.index.json
Instantiating VibeVoiceForConditionalGenerationInference model under default dtype torch.bfloat16.
Generate config GenerationConfig {}

Instantiating Qwen2Model model under default dtype torch.bfloat16.
Instantiating VibeVoiceAcousticTokenizerModel model under default dtype torch.bfloat16.
Instantiating VibeVoiceSemanticTokenizerModel model under default dtype torch.bfloat16.
Instantiating VibeVoiceDiffusionHead model under default dtype torch.bfloat16.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:02<00:00,  3.73it/s]
All model checkpoint weights were used when initializing VibeVoiceForConditionalGenerationInference.

All the weights of VibeVoiceForConditionalGenerationInference were initialized from the model checkpoint at WestZhang/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use VibeVoiceForConditionalGenerationInference for predictions without further training.
Generation config file not found, using a generation config created from the model config.
Language model attention: flash_attention_2
Found 9 voice files in /workspace/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Loaded example: 1p_Ch2EN.txt with 1 speakers
Loaded example: 1p_abs.txt with 1 speakers
Loaded example: 2p_goat.txt with 2 speakers
Loaded example: 2p_music.txt with 2 speakers
Loaded example: 2p_short.txt with 2 speakers
Loaded example: 2p_yayi.txt with 2 speakers
Loaded example: 3p_gpt5.txt with 3 speakers
Skipping 4p_climate_100min.txt: duration 100 minutes exceeds 15-minute limit
Skipping 4p_climate_45min.txt: duration 45 minutes exceeds 15-minute limit
Successfully loaded 7 example scripts
🚀 Launching demo on port 7860
📁 Model path: WestZhang/
🎭 Available voices: 9
🔴 Streaming mode: ENABLED
🔒 Session isolation: ENABLED
* Running on local URL:  http://0.0.0.0:7860
Then you can open the inference web page; it should look like this:
[screenshot of the Gradio demo page]
The release ships with three Chinese voices (zh-Anchen, zh-Bowen, zh-Xinran).
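Besides the Gradio UI, the repo also ships a file-based inference script (demo/inference_from_file.py, visible in the tree above). A sketch of how I'd expect to call it on one of the bundled example scripts; the flag names are my assumption from the demo code, so check `--help` before relying on them:

```shell
# Batch synthesis from a bundled two-speaker example script.
# --speaker_names maps Speaker 1/2/... in the txt to voice files in demo/voices.
python demo/inference_from_file.py \
    --model_path WestZhang/ \
    --txt_path demo/text_examples/2p_short.txt \
    --speaker_names Alice Frank
```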





Replies:
Microsoft is really on a roll here.
Hope they follow up with something more lightweight.
I'm stuck with a 3060 Laptop, so I'll have to wait for the requirements to come down before I can try it.
Off to dig into this.
I tested the 1.5B model and its output is almost indistinguishable from the 7B's. Better yet, it only needs a little over 6 GB of VRAM.
I've heard from folks overseas that a 3060 12G can handle it, but you have to quantize it to 4-bit yourself.