Introduction to the larger_clap_music Project
Project Overview
larger_clap_music is a CLAP model optimized for music. CLAP (Contrastive Language-Audio Pretraining) is a neural network analogous to CLIP for images: trained on paired audio and text, it can predict the most relevant text snippet for a given audio clip without being optimized directly for that task. Technically, the CLAP model uses a Swin Transformer to extract audio features from a log-Mel spectrogram input and a RoBERTa model to extract text features. Both sets of features are then projected into a latent space of the same dimensionality, and similarity scores are computed as dot products between the projected features.
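The snippet below is a minimal sketch of that scoring step using the transformers ClapModel and ClapProcessor classes; the random waveform and the two candidate texts are placeholders for illustration only.

import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/larger_clap_music")
processor = ClapProcessor.from_pretrained("laion/larger_clap_music")

# Placeholder inputs: one second of random noise at CLAP's 48 kHz sampling
# rate, plus two illustrative text descriptions
audio = np.random.randn(48_000).astype(np.float32)
texts = ["an upbeat electronic track", "a quiet solo piano piece"]

inputs = processor(text=texts, audios=audio, sampling_rate=48_000,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_audio holds the (temperature-scaled) dot products between the
# projected audio embedding and each projected text embedding
probs = outputs.logits_per_audio.softmax(dim=-1)
print(probs)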
Features and Applications
Zero-Shot Audio Classification
The larger_clap_music model supports zero-shot audio classification, meaning it can assign audio to categories without any additional training data. In practice, the pipeline API classifies an audio sample against a set of candidate labels in a few lines of Python. For example, given an audio sample, the model can decide whether it is the "sound of a dog" or the "sound of a vacuum cleaner".
from datasets import load_dataset
from transformers import pipeline

# Load the ESC-50 environmental sound dataset and take the last training sample
dataset = load_dataset("ashraq/esc50")
audio = dataset["train"]["audio"][-1]["array"]

# Build a zero-shot audio classification pipeline backed by larger_clap_music
audio_classifier = pipeline(task="zero-shot-audio-classification", model="laion/larger_clap_music")
output = audio_classifier(audio, candidate_labels=["Sound of a dog", "Sound of vacuum cleaner"])
print(output)
>>> [{"score": 0.999, "label": "Sound of a dog"}, {"score": 0.001, "label": "Sound of vacuum cleaner"}]
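The pipeline returns one dictionary per candidate label, sorted by descending score; the scores are softmax-normalized over the provided labels, so they sum to 1.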
Getting Audio and Text Embeddings
Beyond classification, the larger_clap_music model also lets users extract feature embeddings for audio and text. By running the model on a CPU or GPU through ClapModel and ClapProcessor, you can obtain feature embeddings for audio samples. Such embeddings can feed into more complex audio processing tasks, such as audio-text alignment or semantic analysis.
Example of running on a CPU:
from datasets import load_dataset
from transformers import ClapModel, ClapProcessor

# Load a small dummy LibriSpeech split and take the first sample
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio_sample = librispeech_dummy[0]

model = ClapModel.from_pretrained("laion/larger_clap_music")
processor = ClapProcessor.from_pretrained("laion/larger_clap_music")

# Convert the raw waveform into model inputs, then compute the audio embedding
inputs = processor(audios=audio_sample["audio"]["array"], return_tensors="pt")
audio_embed = model.get_audio_features(**inputs)
Example of running on a GPU:
from datasets import load_dataset
from transformers import ClapModel, ClapProcessor

# Load a small dummy LibriSpeech split and take the first sample
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio_sample = librispeech_dummy[0]

# Move both the model and the processed inputs onto GPU 0
model = ClapModel.from_pretrained("laion/larger_clap_music").to(0)
processor = ClapProcessor.from_pretrained("laion/larger_clap_music")

inputs = processor(audios=audio_sample["audio"]["array"], return_tensors="pt").to(0)
audio_embed = model.get_audio_features(**inputs)
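The examples above only produce audio embeddings; text embeddings come from get_text_features on the same model. The sketch below pairs the two to score audio-text alignment with cosine similarity. The text descriptions are illustrative placeholders, and the audio side reuses the same dummy LibriSpeech sample.

from datasets import load_dataset
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/larger_clap_music")
processor = ClapProcessor.from_pretrained("laion/larger_clap_music")

# Audio embedding, as in the CPU example above
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio_inputs = processor(audios=librispeech_dummy[0]["audio"]["array"], return_tensors="pt")
audio_embed = model.get_audio_features(**audio_inputs)

# Illustrative text descriptions; both embeddings live in the same latent space
texts = ["a person speaking", "a barking dog"]
text_inputs = processor(text=texts, return_tensors="pt", padding=True)
text_embed = model.get_text_features(**text_inputs)

# Cosine similarity between the audio embedding and each text embedding;
# a higher value indicates better audio-text alignment
similarity = torch.nn.functional.cosine_similarity(audio_embed, text_embed)
print(similarity)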
Citation
If you use this model in your work, please cite the original paper:
@misc{https://doi.org/10.48550/arxiv.2211.06687,
doi = {10.48550/ARXIV.2211.06687},
url = {https://arxiv.org/abs/2211.06687},
author = {Wu, Yusong and Chen, Ke and Zhang, Tianyu and Hui, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},
keywords = {Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering},
title = {Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}
With this project, developers can build more efficient and innovative applications in music and audio analysis, especially in scenarios that call for audio-text interaction and classification.