xglm-564M: a multilingual autoregressive language model for advancing cross-lingual AI

Project overview: XGLM-564M

What is XGLM-564M?

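XGLM-564M is a multilingual autoregressive language model with 564 million parameters. It was trained on a large multilingual corpus covering 30 languages and totaling 500 billion sub-tokens. The model was developed for few-shot learning across languages and was introduced in the paper "Few-shot Learning with Multilingual Language Models". Its implementation is available in an open-source repository.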

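Training data statistics

The training data for XGLM-564M spans many languages. The main figures are:

English (en): the largest share, at roughly 803.5 billion tokens, or 32.59% of the training data.
Russian (ru): roughly 147.7 billion tokens, or 6.02% of the data.
Chinese (zh): roughly 132.7 billion tokens, or 4.83% of the data.
German (de), Spanish (es), French (fr) and other languages also make up a substantial part of the training data.

Languages with smaller shares include Finnish, Turkish, Arabic and Vietnamese, among others. In total there are 30 languages, and their distribution is kept relatively balanced to support multilingual training.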

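Model usage

Detailed usage guidance is given in the model card published by the XGLM-564M development team, which is the primary resource for users of the model.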

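Example application (COPA task)

The following example applies XGLM-564M to the Choice of Plausible Alternatives (COPA) task. COPA is a causal reasoning benchmark in which the model must pick the more plausible of two alternatives as the cause or effect of a given premise. The Python code below runs a zero-shot evaluation on a few COPA examples in English and Chinese: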

import torch
import torch.nn.functional as F
from transformers import XGLMTokenizer, XGLMForCausalLM

tokenizer = XGLMTokenizer.from_pretrained("facebook/xglm-564M")
model = XGLMForCausalLM.from_pretrained("facebook/xglm-564M")

data_samples = {
    'en': [
        {
            "premise": "I wanted to conserve energy.",
            "choice1": "I swept the floor in the unoccupied room.",
            "choice2": "I shut off the light in the unoccupied room.",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "The flame on the candle went out.",
            "choice1": "I blew on the wick.",
            "choice2": "I put a match to the wick.",
            "question": "cause",
            "label": "0"
        }
    ],
    'zh': [
        {
            "premise": "我想节约能源。",
            "choice1": "我在空着的房间里扫了地板。",
            "choice2": "我把空房间里的灯关了。",
            "question": "effect",
            "label": "1"
        },
        {
            "premise": "蜡烛上的火焰熄灭了。",
            "choice1": "我吹灭了灯芯。",
            "choice2": "我把一根火柴放在灯芯上。",
            "question": "cause",
            "label": "0"
        }
    ]
}

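def get_logprobs(prompt):
    # Return the log-probability the model assigns to each token of the prompt.
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"]
    # The logits at position t predict the token at position t + 1, so the
    # targets are the tokens from position 1 onward.
    target_ids = input_ids[:, 1:]
    with torch.no_grad():  # evaluation only, no gradients needed
        logits = model(**inputs).logits[:, :-1]  # drop the last position, which has no target
    logprobs = torch.gather(F.log_softmax(logits, dim=2), 2, target_ids.unsqueeze(2))
    return logprobs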

def COPA_eval(prompt, alternative1, alternative2):
    lprob1 = get_logprobs(prompt + "\n" + alternative1).sum()
    lprob2 = get_logprobs(prompt + "\n" + alternative2).sum()
    return 0 if lprob1 > lprob2 else 1

for lang in data_samples:
    for idx, example in enumerate(data_samples[lang]):
        predict = COPA_eval(example["premise"], example["choice1"], example["choice2"])
        print(f'{lang}-{idx}', predict, example['label'])

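This code shows how XGLM-564M can be used to assess reasoning ability across languages. For each example, the script concatenates the premise with each alternative, sums the token log-probabilities of the resulting text, and predicts the alternative with the higher total. The "question" field (cause or effect) is not used in the prompt, so the model judges purely from the premise and the two alternatives which continuation is more likely.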
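The decision rule and the final comparison against the labels can be illustrated without loading the model. The following is a minimal sketch, not part of the original script: the helper names `pick_alternative` and `accuracy` are hypothetical, and the per-token probabilities are invented purely for illustration. It mirrors the rule in COPA_eval (the larger summed log-probability wins) and shows one way to turn the printed predictions into an accuracy figure, assuming labels are stored as strings as in data_samples.

```python
import math

def pick_alternative(logprobs1, logprobs2):
    """Return 0 if the first alternative has the higher summed log-probability, else 1.

    Same rule as COPA_eval: ties go to the second alternative.
    """
    return 0 if sum(logprobs1) > sum(logprobs2) else 1

def accuracy(predictions, labels):
    """Fraction of predictions matching the gold labels.

    Labels may be strings such as "0" or "1", so both sides are cast to int.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not predictions:
        return 0.0
    correct = sum(int(p) == int(l) for p, l in zip(predictions, labels))
    return correct / len(predictions)

# Hypothetical per-token probabilities, invented for illustration only;
# they are not real model output.
toy_examples = [
    {"alt1": [0.5, 0.1], "alt2": [0.5, 0.4], "label": "1"},
    {"alt1": [0.6, 0.3], "alt2": [0.6, 0.2], "label": "0"},
]

predictions = [
    pick_alternative(
        [math.log(p) for p in ex["alt1"]],
        [math.log(p) for p in ex["alt2"]],
    )
    for ex in toy_examples
]
gold = [ex["label"] for ex in toy_examples]
print(predictions, accuracy(predictions, gold))
```

Because the log-probabilities are summed rather than averaged, a longer alternative accumulates more negative terms and tends to score lower. Whether to normalize by length is a design choice that this sketch, like the script above, leaves out.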