[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/rebel-relation-extraction-by-end-to-end/relation-extraction-on-nyt)](https://paperswithcode.com/sota/relation-extraction-on-nyt?p=rebel-relation-extraction-by-end-to-end)
## Update
mREBEL is out. We release two new datasets for multilingual Relation Extraction, along with a suite of mREBEL versions. Head to the REDFM section below for details.

# REBEL: Relation Extraction By End-to-end Language generation
This is the repository for the EMNLP 2021 paper "REBEL: Relation Extraction By End-to-end Language generation". We present a new linearization approach and reframe Relation Extraction as a seq2seq task. The paper can be found [here](https://aclanthology.org/2021.findings-emnlp.204). If you use the code, please cite this work in your paper:
```bibtex
@inproceedings{huguet-cabot-navigli-2021-rebel-relation,
    title = "{REBEL}: Relation Extraction By End-to-end Language generation",
    author = "Huguet Cabot, Pere-Llu{\'\i}s and
      Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.204",
    pages = "2370--2381",
    abstract = "Extracting relation triplets from raw text is a crucial task in Information Extraction, enabling multiple applications such as populating or validating knowledge bases, factchecking, and other downstream tasks. However, it usually involves multiple-step pipelines that propagate errors or are limited to a small number of relation types. To overcome these issues, we propose the use of autoregressive seq2seq models. Such models have previously been shown to perform well not only in language generation, but also in NLU tasks such as Entity Linking, thanks to their framing as seq2seq tasks. In this paper, we show how Relation Extraction can be simplified by expressing triplets as a sequence of text and we present REBEL, a seq2seq model based on BART that performs end-to-end relation extraction for more than 200 different relation types. We show our model{'}s flexibility by fine-tuning it on an array of Relation Extraction and Relation Classification benchmarks, with it attaining state-of-the-art performance in most of them.",
}
```
## Repository structure
```
| conf                    # contains Hydra config files
  | data
  | model
  | train
  root.yaml               # hydra root config file
| data                    # data
| datasets                # dataset scripts
| model                   # model files should be stored here
| src
  | pl_data_modules.py    # LightningDataModule
  | pl_modules.py         # LightningModule
  | train.py              # main script to train the network
  | test.py               # main script to test the network
| README.md
| requirements.txt
| demo.py                 # Streamlit demo to try out the model
| setup.sh                # environment setup script
```
## Initialize environment
To set up the python interpreter we use conda: the setup.sh script creates a conda environment and installs pytorch plus the dependencies listed in "requirements.txt".
## REBEL Model and Dataset
The model and dataset files can be downloaded here:

https://osf.io/4x3r9/?view_only=87e7af84c0564bd1b3eadff23e4b7e54

Alternatively, you can use the model directly from the Huggingface repo:

https://huggingface.co/Babelscape/rebel-large
```python
from transformers import pipeline

triplet_extractor = pipeline('text2text-generation', model='Babelscape/rebel-large', tokenizer='Babelscape/rebel-large')
# We need to use the tokenizer manually since we need special tokens.
extracted_text = triplet_extractor.tokenizer.batch_decode([triplet_extractor("Punta Cana is a resort town in the municipality of Higuey, in La Altagracia Province, the eastern most province of the Dominican Republic", return_tensors=True, return_text=False)[0]["generated_token_ids"]])

print(extracted_text[0])

# Function to parse the generated text and extract the triplets
def extract_triplets(text):
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
    return triplets

extracted_triplets = extract_triplets(extracted_text[0])
print(extracted_triplets)
```
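The parser above inverts REBEL's triplet linearization: the model emits `<triplet> head <subj> tail <obj> relation`, grouping triplets that share a head entity. As a minimal sketch of the encoding direction (a hypothetical helper for illustration, not part of the repo):

```python
def linearize_triplets(triplets):
    """Encode triplets as REBEL's linearized string:
    <triplet> head <subj> tail <obj> relation, with consecutive
    triplets sharing a head grouped under one <triplet> marker."""
    parts = []
    previous_head = None
    for t in triplets:
        if t['head'] != previous_head:
            parts.append('<triplet> ' + t['head'])
            previous_head = t['head']
        parts.append('<subj> ' + t['tail'] + ' <obj> ' + t['type'])
    return ' '.join(parts)

print(linearize_triplets([
    {'head': 'Punta Cana', 'type': 'country', 'tail': 'Dominican Republic'},
]))
# <triplet> Punta Cana <subj> Dominican Republic <obj> country
```

Feeding this string back through `extract_triplets` recovers the original triplet list, which is a quick sanity check on the format.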
## CROCODILE: Automatic Relation Extraction Dataset with NLI filtering
The REBEL dataset can be recreated using CROCODILE, our Relation Extraction dataset creator.
## Training and testing
There are conf files to train and test each model. Within the src folder, for example, to train on CONLL04:

```
train.py model=rebel_model data=conll04_data train=conll04_train
```

Once the model is trained, the checkpoint can be evaluated by running:

```
test.py model=rebel_model data=conll04_data train=conll04_train do_predict=True checkpoint_path="path_to_checkpoint"
```

src/model_saving.py can be used to convert a pytorch lightning checkpoint into an hf transformers Model and Tokenizer.
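The core of that conversion is renaming the Lightning checkpoint's state-dict keys so they match what the wrapped transformers model expects. A minimal sketch of that step, assuming (as is typical for LightningModules like this repo's) that the HF model lives under a `model` attribute; the helper name is illustrative:

```python
def strip_lightning_prefix(state_dict, prefix="model."):
    """Keep only the weights of the wrapped HF model and drop the
    LightningModule's attribute prefix from their keys."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }

# Toy checkpoint: two wrapped-model tensors plus an unrelated buffer.
ckpt = {"model.encoder.weight": 1, "model.decoder.weight": 2, "extra.buffer": 3}
print(strip_lightning_prefix(ckpt))
# {'encoder.weight': 1, 'decoder.weight': 2}
```

The cleaned dict can then be loaded into a transformers model with `load_state_dict` and persisted with `save_pretrained`, alongside the tokenizer.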
## Demo
We suggest running the demo to test REBEL. Once the model files are unzipped in the model folder, run:

```
streamlit run demo.py
```

and a demo will be available in the browser. It accepts free input as well as the sample data in data/rebel/.
## spaCy
You can also use REBEL with spaCy (>=3.0), giving you a seamless interface to our system for full end-to-end relation extraction. To add REBEL as a custom component, you need the transformers library installed and:
```python
import spacy
import spacy_component

nlp = spacy.load("en_core_web_sm")

nlp.add_pipe("rebel", after="senter", config={
    'device': 0,  # Number of the GPU, -1 if you want to use CPU
    'model_name': 'Babelscape/rebel-large'}  # Model used; defaults to 'Babelscape/rebel-large' if not given
    )

input_sentence = "Gràcia is a district of the city of Barcelona, Spain. It comprises five neighborhoods: Vila de Gràcia, Vallcarca i els Penitents, El Coll, La Salut and Camp d'en Grassot i Gràcia Nova. Gràcia is bordered by the district of Eixample to the south, Sarrià-Sant Gervasi to the west and Horta-Guinardó to the east. A vibrant and diverse enclave of Catalan life, Gràcia was an independent municipality for centuries before being formally annexed to Barcelona in 1897."

doc = nlp(input_sentence)
doc_list = nlp.pipe([input_sentence])

for value, rel_dict in doc._.rel.items():
    print(f"{value}: {rel_dict}")

# (0, 8): {'relation': 'located in the administrative territorial entity', 'head_span': Gràcia, 'tail_span': Barcelona}
# (0, 10): {'relation': 'country', 'head_span': Gràcia, 'tail_span': Spain}
# (8, 0): {'relation': 'contains administrative territorial entity', 'head_span': Barcelona, 'tail_span': Gràcia}
# (8, 10): {'relation': 'country', 'head_span': Barcelona, 'tail_span': Spain}
# (17, 0): {'relation': 'located in the administrative territorial entity', 'head_span': Vila de Gràcia, 'tail_span': Gràcia}
# (21, 0): {'relation': 'located in the administrative territorial entity', 'head_span': Vallcarca i els Penitents, 'tail_span': Gràcia}
# (26, 0): {'relation': 'located in the administrative territorial entity', 'head_span': El Coll, 'tail_span': Gràcia}
# (29, 0): {'relation': 'located in the administrative territorial entity', 'head_span': La Salut, 'tail_span': Gràcia}
# (0, 46): {'relation': 'shares border with', 'head_span': Gràcia, 'tail_span': Eixample}
# (0, 51): {'relation': 'shares border with', 'head_span': Gràcia, 'tail_span': Sarrià-Sant Gervasi}
# (0, 59): {'relation': 'shares border with', 'head_span': Gràcia, 'tail_span': Horta-Guinardó}
# (46, 0): {'relation': 'shares border with', 'head_span': Eixample, 'tail_span': Gràcia}
# (46, 51): {'relation': 'shares border with', 'head_span': Eixample, 'tail_span': Sarrià-Sant Gervasi}
# (51, 0): {'relation': 'shares border with', 'head_span': Sarrià-Sant Gervasi, 'tail_span': Gràcia}
# (51, 46): {'relation': 'shares border with', 'head_span': Sarrià-Sant Gervasi, 'tail_span': Eixample}
# (51, 59): {'relation': 'shares border with', 'head_span': Sarrià-Sant Gervasi, 'tail_span': Horta-Guinardó}
```
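As the printed output shows, the component stores its results in the `doc._.rel` extension: keys are (head, tail) token-offset pairs and values are dicts with `relation`, `head_span` and `tail_span`. A small sketch of post-processing that structure, here filtering by relation type against a plain dict standing in for `doc._.rel` (spans shown as strings for simplicity):

```python
def pairs_for_relation(rel_dict, relation):
    """Collect (head_span, tail_span) pairs for one relation type
    from a mapping shaped like doc._.rel."""
    return [
        (info['head_span'], info['tail_span'])
        for info in rel_dict.values()
        if info['relation'] == relation
    ]

# Stand-in for doc._.rel with two of the relations printed above.
rel = {
    (0, 10): {'relation': 'country', 'head_span': 'Gràcia', 'tail_span': 'Spain'},
    (0, 46): {'relation': 'shares border with', 'head_span': 'Gràcia', 'tail_span': 'Eixample'},
}
print(pairs_for_relation(rel, 'country'))
# [('Gràcia', 'Spain')]
```

With real spaCy output the spans are `Span` objects, so you can go on to use their `.start`, `.end` and `.text` attributes.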
## Datasets
TACRED is not freely available, but instructions on how to create Re-TACRED from it can be found [here](https://github.com/gstoica27/Re-TACRED).
For CONLL04 and ADE, the script on the [SpERT github](https://github.com/lavis-nlp/spert/blob/master/scripts/fetch_datasets.sh) can be used.
For NYT, the dataset can be downloaded from the [JointER github](https://github.com/yubowen-ph/JointER/tree/master/dataset/NYT-multi/data).
Finally, DocRED for RE can be found at the [JEREX github](https://github.com/lavis-nlp/jerex/blob/main/scripts/fetch_datasets.sh).
<br>
# REDFM
## RED<sup>FM</sup>: a Filtered and Multilingual Relation Extraction Dataset
![image](https://github.com/Babelscape/rebel/assets/26126169/979fb259-00e3-462b-a720-bc818079b0ce)
This is the repository for the ACL 2023 paper RED<sup>FM</sup>: a Filtered and Multilingual Relation Extraction Dataset. We provide two new resources, along with several multilingual versions of REBEL. The paper can be found [here](https://arxiv.org/abs/2306.09802). If you use any of these resources, please cite this work in your paper:
```bibtex
@inproceedings{huguet-cabot-et-al-2023-redfm-dataset,
    title = "RED$^{\rm FM}$: a Filtered and Multilingual Relation Extraction Dataset",
    author = "Huguet Cabot, Pere-Llu{\'\i}s and Tedeschi, Simone and Ngonga Ngomo, Axel-Cyrille and
      Navigli, Roberto",
    booktitle = "Proc. of the 61st Annual Meeting of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2306.09802",
}
```
## Datasets
- [RED<sup>FM</sup>](https://huggingface.co/datasets/Babelscape/REDFM) is a human-filtered Relation Extraction dataset covering 32 relation types in Arabic, Chinese, French, English, German, Italian and Spanish. You can find it [here](https://huggingface.co/datasets/Babelscape/REDFM).
- [SRED<sup>FM</sup>](https://huggingface.co/datasets/Babelscape/SREDFM) is a machine-filtered Relation Extraction dataset covering 17 different languages and up to 400 relation types. You can find it [here](https://huggingface.co/datasets/Babelscape/SREDFM). SREDFM was filtered using the Triplet Critic, which you can find [here](https://huggingface.co/Babelscape/mdeberta-v3-base-triplet-critic-xnli).
## Models
- [mREBEL<sub>400</sub>](https://huggingface.co/Babelscape/mrebel-large). This version of mREBEL is trained on all of SRED<sup>FM</sup>, covering 400 relation types in 17 languages and including entity types. It can be used as a standalone model or fine-tuned on your own multilingual Relation Extraction dataset.
- [mREBEL<sub>32</sub>](https://huggingface.co/Babelscape/mrebel-large-32). This version is trained only on the subset of SRED<sup>FM</sup> covering the 32 relation types of RED<sup>FM</sup>.
- [mREBEL<sub>B400</sub>](https://huggingface.co/Babelscape/mrebel-large). Same as mREBEL<sub>400</sub>, but trained on M2M100 instead of mBART, providing a base version with a smaller footprint.
# License
The code of REBEL and RED<sup>FM</sup> is licensed under the CC BY-NC-SA 4.0 license. The text of the license can be found [here](https://github.com/Babelscape/rebel/blob/master/LICENSE.md).