persian_xlm_roberta_large - An XLM-RoBERTa model that improves Persian question answering

Project overview: Persian XLM-RoBERTa Large for question answering

Background

XLM-RoBERTa is a multilingual model pre-trained on 2.5TB of CommonCrawl data covering 100 languages. It was introduced by Conneau et al. in the paper "Unsupervised Cross-lingual Representation Learning at Scale".

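The base model is a multilingual XLM-RoBERTa large checkpoint that has already been fine-tuned for question answering (QA). Here it is further fine-tuned on the training set of PQuAD, the largest Persian QA dataset, and published ready to use, which saves users a great deal of training time.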

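Training details

The model was fine-tuned on the PQuAD training set. Because of the GPU memory limits of Google Colab, the batch size was set to 4. Training ran for several epochs, and the checkpoint obtained after one epoch was chosen for evaluation.

The training hyperparameters are as follows: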

batch_size = 4
n_epochs = 1
base_LM_model = "deepset/xlm-roberta-large-squad2"
max_seq_len = 256
learning_rate = 3e-5
evaluation_strategy = "epoch"
save_strategy = "epoch"
warmup_ratio = 0.1
gradient_accumulation_steps = 8
weight_decay = 0.01
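
As a rough illustration of what these settings imply, the sketch below computes the effective batch size and an approximate step count. The number of training features (64,000) is a hypothetical placeholder, not the PQuAD figure, and the rounding is an assumption, since training frameworks differ in how they round.

```python
import math

# Values taken from the hyperparameter list above.
batch_size = 4
gradient_accumulation_steps = 8
n_epochs = 1
warmup_ratio = 0.1


def training_schedule(num_train_features):
    """Return (effective batch size, optimizer steps, warmup steps)."""
    # Gradients from 8 micro-batches of 4 are accumulated before each update.
    effective_batch_size = batch_size * gradient_accumulation_steps
    optimizer_steps_per_epoch = math.ceil(num_train_features / effective_batch_size)
    optimizer_steps = optimizer_steps_per_epoch * n_epochs
    # Assumption: warmup length is the rounded fraction of all optimizer steps.
    warmup_steps = round(optimizer_steps * warmup_ratio)
    return effective_batch_size, optimizer_steps, warmup_steps


# Hypothetical feature count, chosen only to make the arithmetic easy to follow.
effective, steps, warmup = training_schedule(64_000)
print(effective, steps, warmup)
```

With these settings each optimizer update sees 32 examples, so a small per-device batch can still yield reasonably stable gradient estimates within limited GPU memory.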

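Performance

Evaluated on the test set of the Persian QA dataset PQuAD, the XLM-RoBERTa model reaches an Exact Match (EM) score of 66.56% and an F1 score of 87.31%, compared with 47.44% and 81.96% for ParsBert. That is a gain of 19.12 points in EM and 5.35 points in F1. XLM-RoBERTa is the larger model, which makes the comparison somewhat unfair, but the difference is still notable.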

Metric	XLM-RoBERTa Large	ParsBert
Exact Match	66.56*	47.44
F1	87.31*	81.96
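
The gap between the two metrics follows from how they are usually defined: Exact Match gives credit only for an identical answer, while F1 rewards partial token overlap. Below is a minimal sketch of SQuAD-style scoring, assuming plain whitespace tokenization and no text normalization. The evaluation script actually used for the numbers above is not specified here, and the function names are illustrative.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the token sequences are identical, else 0.0."""
    return float(prediction.split() == reference.split())


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers agree; an empty answer against a non-empty one does not.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A prediction with one extra token fails Exact Match but keeps partial F1 credit.
print(exact_match("26 years", "26"), token_f1("26 years", "26"))
```

Dataset-level scores are typically the average of these per-question values, expressed as percentages.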

How to use

For PyTorch users:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
path = 'pedramyazdipoor/persian_xlm_roberta_large'
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForQuestionAnswering.from_pretrained(path)

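Inference considerations

The start index of the answer must be smaller than the end index.
The answer span must lie within the context.
The selected span must be the most probable choice among N candidate pairs.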
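
The three conditions above can be sketched as a small span-selection routine. This is a hypothetical illustration, not the `generate_indexes` helper called in the inference example of this section. The function name, its parameters, and the choice to allow single-token answers (start equal to end) are assumptions. It works on plain Python lists of logits.

```python
from itertools import product


def select_answer_span(start_logits, end_logits, context_start, context_end, n_best=5):
    """Pick the best valid (start, end) token span, or None if no candidate is valid.

    context_start and context_end are the inclusive token indexes that bound
    the context inside the encoded (question, context) sequence.
    """
    def top_indexes(logits):
        # Indexes of the n_best highest logits, best first.
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return ranked[:n_best]

    best_span, best_score = None, float("-inf")
    # Score every pairing of the top start and top end candidates.
    for start, end in product(top_indexes(start_logits), top_indexes(end_logits)):
        if start > end:
            continue  # condition 1: the answer cannot end before it starts
        if start < context_start or end > context_end:
            continue  # condition 2: the span must lie inside the context
        score = start_logits[start] + end_logits[end]
        if score > best_score:  # condition 3: keep the most probable pair
            best_span, best_score = (start, end), score
    return best_span


# Toy logits for a 6-token sequence whose context occupies tokens 3 to 5.
start_logits = [0.1, 5.0, 0.2, 3.0, 0.0, 0.3]
end_logits = [0.0, 0.5, 4.0, 0.1, 6.0, 0.2]
print(select_answer_span(start_logits, end_logits, context_start=3, context_end=5, n_best=3))
```

In the toy input the highest-scoring pair overall is (1, 4), but it starts outside the context, so the routine returns (3, 4) instead. Summing the start and end logits ranks pairs in the same order as multiplying their softmax probabilities.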

The following example code shows how to run inference:

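import torch

# Run on the GPU when one is available, otherwise on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.eval().to(device)
text = 'سلام من پدرامم 26 سالمه'
question = 'چند سالمه؟'
# Encode the (question, context) pair; 'only_first' truncates only the
# question if the pair exceeds max_length.
encoding = tokenizer(question, text,
                     add_special_tokens=True,
                     return_token_type_ids=True,
                     return_tensors='pt',
                     padding=True,
                     return_offsets_mapping=True,
                     truncation='only_first',
                     max_length=32)
# Inference only, so gradients are not needed.
with torch.no_grad():
    out = model(input_ids=encoding['input_ids'].to(device),
                attention_mask=encoding['attention_mask'].to(device),
                token_type_ids=encoding['token_type_ids'].to(device))
# generate_indexes is a helper that is not defined in this document and is not
# part of transformers; it has to be supplied separately. The [1:] slices drop
# the logits of the leading special token.
answer_start_index, answer_end_index = generate_indexes(out['start_logits'][0][1:], out['end_logits'][0][1:], 5, 0)
print(tokenizer.tokenize(text + question))
print(tokenizer.tokenize(text + question)[answer_start_index : (answer_end_index + 1)])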
>>> ['▁سلام', '▁من', '▁پدر', 'ام', 'م', '▁26', '▁سالم', 'ه', 'نام', 'م', '▁چیست', '؟']
>>> ['▁26']

Acknowledgments

Special thanks to Newsha Shahbodaghkhan for her support in collecting the dataset.

Contributors

Pedram Yazdipoor: LinkedIn

Releases

Release v0.2 (September 18, 2022)

This is the second version of the Persian XLM-RoBERTa Large model. The previous version had problems in use.