Project Icon

voicebox-pytorch

新一代MetaAI文本到语音模型Voicebox的Pytorch实现

该项目实现了MetaAI的最新文本到语音模型Voicebox,利用旋转嵌入和自适应归一化技术提升模型效果。还融合了SpearTTS和Conditioned Flow Matching等技术,提高训练和采样效率。项目获得Imminent Grant资助,致力于推动开源文本到语音技术的发展,并感谢各大赞助商的支持。用户可以通过pip install命令轻松安装和使用该项目。

Voicebox - Pytorch

在Pytorch中实现Voicebox,来自MetaAI的新一代文本转语音模型。新闻稿

在这项工作中,我们将使用旋转嵌入。作者似乎没有意识到ALiBi不能简单地用于双向模型。

该论文还解决了时间嵌入错误地受到相对距离影响的问题(他们沿音频令牌的帧维度连接时间嵌入)。这个库将使用自适应归一化,正如在Paella中成功应用的那样。

感谢

  • 感谢Translated授予我Imminent Grant,以推进开源码文本转语音解决方案的状态。本项目在这项资助下启动并将完成。

  • 感谢StabilityAI的慷慨赞助,以及我的其他赞助商,使我能够有独立性开发开源人工智能。

  • 感谢Bryan Chiang的持续代码审查,分享他在TTS方面的专业知识,并指引我到一个开源实现的条件流匹配中。

  • 感谢Manmay帮助这个库以对齐代码开始。

  • 感谢@chenht2010发现旋转位置的一个bug,并验证库中的代码可以收敛。

  • 感谢Lucas Newman(再次)提交了所有用于Spear-TTS条件Voicebox训练的训练代码的pull request!

  • 感谢Lucas Newman展示了整个系统在Spear-TTS条件下运行正常。训练收敛效果比Soundstorm还要好。

安装

$ pip install voicebox-pytorch

使用

使用SpearTTS中的TextToSemantic模块进行训练和采样

import torch

from voicebox_pytorch import (
    VoiceBox,
    EncodecVoco,
    ConditionalFlowMatcherWrapper,
    HubertWithKmeans,
    TextToSemantic
)

# https://github.com/facebookresearch/fairseq/tree/main/examples/hubert

wav2vec = HubertWithKmeans(
    checkpoint_path = '/path/to/hubert/checkpoint.pt',
    kmeans_path = '/path/to/hubert/kmeans.bin'
)

text_to_semantic = TextToSemantic(
    wav2vec = wav2vec,
    dim = 512,
    source_depth = 1,
    target_depth = 1,
    use_openai_tokenizer = True
)

text_to_semantic.load('/path/to/trained/spear-tts/model.pt')

model = VoiceBox(
    dim = 512,
    audio_enc_dec = EncodecVoco(),
    num_cond_tokens = 500,
    depth = 2,
    dim_head = 64,
    heads = 16
)

cfm_wrapper = ConditionalFlowMatcherWrapper(
    voicebox = model,
    text_to_semantic = text_to_semantic
)

# 模拟数据

audio = torch.randn(2, 12000)

# 训练

loss = cfm_wrapper(audio)
loss.backward()

# 经过大量训练之后

texts = [
    '西班牙的雨水主要落在平原上',
    '她在海边卖海贝壳'
]

cond = torch.randn(2, 12000)
sampled = cfm_wrapper.sample(cond = cond, texts = texts) # (2, 1, <音频长度>)

对于无条件训练,VoiceBox上的condition_on_text必须设置为False

import torch
from voicebox_pytorch import (
    VoiceBox,
    ConditionalFlowMatcherWrapper
)

model = VoiceBox(
    dim = 512,
    num_cond_tokens = 500,
    depth = 2,
    dim_head = 64,
    heads = 16,
    condition_on_text = False
)

cfm_wrapper = ConditionalFlowMatcherWrapper(
    voicebox = model
)

# 模拟数据

x = torch.randn(2, 1024, 512)

# 训练

loss = cfm_wrapper(x)

loss.backward()

# 经过大量训练之后

cond = torch.randn(2, 1024, 512)

sampled = cfm_wrapper.sample(cond = cond) # (2, 1024, 512)

待办事项

  • 阅读并内化原始流匹配论文

    • 基本损失
    • 使神经ODE在torchdyn中工作
  • 获取带有0.2-0.3的p_drop的基本掩码生成逻辑用于ICL

  • 处理p_drop,不同于voicebox和持续时间模型

  • 支持torchdiffeq和torchode

  • 切换到自适应rmsnorm用于时间调节

  • 添加encodec / voco作为起步

  • 设置原始音频的训练和采样,如果传入audio_enc_dec

  • 与对数mel频谱/encodec - vocos整合

  • spear-tts集成

  • 基本加速训练器 - 感谢@lucasnewman!

  • 清理NS2对齐器类,然后设置持续时间预测训练

  • 找出MelVoco编码的正确设置,因为重构的音频长度较长

  • 计算每帧对应的秒数,并在AudioEncoderDecoder上添加为属性 - 采样时允许指定秒数

引用

@article{Le2023VoiceboxTM,
    title   = {Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale},
    author  = {Matt Le and Apoorv Vyas and Bowen Shi and Brian Karrer and Leda Sari and Rashel Moritz and Mary Williamson and Vimal Manohar and Yossi Adi and Jay Mahadeokar and Wei-Ning Hsu},
    journal = {ArXiv},
    year    = {2023},
    volume  = {abs/2306.15687},
    url     = {https://api.semanticscholar.org/CorpusID:259275061}
}
@inproceedings{dao2022flashattention,
    title   = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
    author  = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
    booktitle = {Advances in Neural Information Processing Systems},
    year    = {2022}
}
@misc{torchdiffeq,
    author  = {Chen, Ricky T. Q.},
    title   = {torchdiffeq},
    year    = {2018},
    url     = {https://github.com/rtqichen/torchdiffeq},
}
@inproceedings{lienen2022torchode,
    title     = {torchode: A Parallel {ODE} Solver for PyTorch},
    author    = {Marten Lienen and Stephan G{\"u}nnemann},
    booktitle = {The Symbiosis of Deep Learning and Differential Equations II, NeurIPS},
    year      = {2022},
    url       = {https://openreview.net/forum?id=uiKVKTiUYB0}
}
@article{siuzdak2023vocos,
    title   = {Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
    author  = {Siuzdak, Hubert},
    journal = {arXiv preprint arXiv:2306.00814},
    year    = {2023}
}
@misc{darcet2023vision,
    title   = {Vision Transformers Need Registers},
    author  = {Timothée Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
    year    = {2023},
    eprint  = {2309.16588},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
@inproceedings{Dehghani2023ScalingVT,
    title   = {Scaling Vision Transformers to 22 Billion Parameters},
    author  = {Mostafa Dehghani and Josip Djolonga and Basil Mustafa and Piotr Padlewski and Jonathan Heek and Justin Gilmer and Andreas Steiner and Mathilde Caron and Robert Geirhos and Ibrahim M. Alabdulmohsin and Rodolphe Jenatton and Lucas Beyer and Michael Tschannen and Anurag Arnab and Xiao Wang and Carlos Riquelme and Matthias Minderer and Joan Puigcerver and Utku Evci and Manoj Kumar and Sjoerd van Steenkiste and Gamaleldin F. Elsayed and Aravindh Mahendran and Fisher Yu and Avital Oliver and Fantine Huot and Jasmijn Bastings and Mark Collier and Alexey A. Gritsenko and Vighnesh Birodkar and Cristina Nader Vasconcelos and Yi Tay and Thomas Mensink and Alexander Kolesnikov and Filip Paveti'c and Dustin Tran and Thomas Kipf and Mario Luvci'c and Xiaohua Zhai and Daniel Keysers and Jeremiah Harmsen and Neil Houlsby},
    booktitle = {International Conference on Machine Learning},
    year    = {2023},
    url     = {https://api.semanticscholar.org/CorpusID:256808367}
}
@inproceedings{Katsch2023GateLoopFD,
    title   = {GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling},
    author  = {Tobias Katsch},
    year    = {2023},
    url     = {https://api.semanticscholar.org/CorpusID:265018962}
}
项目侧边栏1项目侧边栏2
推荐项目
Project Cover

豆包MarsCode

豆包 MarsCode 是一款革命性的编程助手,通过AI技术提供代码补全、单测生成、代码解释和智能问答等功能,支持100+编程语言,与主流编辑器无缝集成,显著提升开发效率和代码质量。

Project Cover

AI写歌

Suno AI是一个革命性的AI音乐创作平台,能在短短30秒内帮助用户创作出一首完整的歌曲。无论是寻找创作灵感还是需要快速制作音乐,Suno AI都是音乐爱好者和专业人士的理想选择。

Project Cover

白日梦AI

白日梦AI提供专注于AI视频生成的多样化功能,包括文生视频、动态画面和形象生成等,帮助用户快速上手,创造专业级内容。

Project Cover

有言AI

有言平台提供一站式AIGC视频创作解决方案,通过智能技术简化视频制作流程。无论是企业宣传还是个人分享,有言都能帮助用户快速、轻松地制作出专业级别的视频内容。

Project Cover

Kimi

Kimi AI助手提供多语言对话支持,能够阅读和理解用户上传的文件内容,解析网页信息,并结合搜索结果为用户提供详尽的答案。无论是日常咨询还是专业问题,Kimi都能以友好、专业的方式提供帮助。

Project Cover

讯飞绘镜

讯飞绘镜是一个支持从创意到完整视频创作的智能平台,用户可以快速生成视频素材并创作独特的音乐视频和故事。平台提供多样化的主题和精选作品,帮助用户探索创意灵感。

Project Cover

讯飞文书

讯飞文书依托讯飞星火大模型,为文书写作者提供从素材筹备到稿件撰写及审稿的全程支持。通过录音智记和以稿写稿等功能,满足事务性工作的高频需求,帮助撰稿人节省精力,提高效率,优化工作与生活。

Project Cover

阿里绘蛙

绘蛙是阿里巴巴集团推出的革命性AI电商营销平台。利用尖端人工智能技术,为商家提供一键生成商品图和营销文案的服务,显著提升内容创作效率和营销效果。适用于淘宝、天猫等电商平台,让商品第一时间被种草。

Project Cover

AIWritePaper论文写作

AIWritePaper论文写作是一站式AI论文写作辅助工具,简化了选题、文献检索至论文撰写的整个过程。通过简单设定,平台可快速生成高质量论文大纲和全文,配合图表、参考文献等一应俱全,同时提供开题报告和答辩PPT等增值服务,保障数据安全,有效提升写作效率和论文质量。

投诉举报邮箱: service@vectorlightyear.com
@2024 懂AI·鲁ICP备2024100362号-6·鲁公网安备37021002001498号