bert-base-romanian-ner - an advanced BERT model for Romanian named entity recognition

Project overview: bert-base-romanian-ner

Model description

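bert-base-romanian-ner is a fine-tuned BERT model for named entity recognition (NER) with state-of-the-art performance. It recognizes 15 entity types: persons, geo-political entities, locations, organizations, languages, national, religious or political entities, datetimes, periods, quantities, money, numerics, ordinals, facilities, works of art, and events.

Specifically, the model is bert-base-romanian-cased-v1 fine-tuned on RONEC version 2.0. RONEC v2.0 contains 12,330 sentences with more than 0.5 million tokens and a total of 80,283 distinctly annotated entities. The corpus is annotated in the BIO2 style: the first token of an entity gets a "B-" tag and every following token of the same entity gets an "I-" tag, for example 'B-PERSON', 'I-PERSON'. The tag 'O' marks tokens that are outside any entity.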
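To make the BIO2 scheme concrete, the following is a minimal sketch that groups a tagged token sequence into entity spans. The function name `bio2_to_spans` and the sample tokens and tags are illustrative assumptions; they are not part of the model or of any RONEC tooling.

```python
from typing import List, Tuple


def bio2_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO2-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens: List[str] = []
    current_type = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A "B-" tag always opens a new entity; close the previous one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An "I-" tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an "I-" tag with no matching open entity) closes the entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical tokens and tags, chosen only to illustrate the scheme.
sample_tokens = ["Alex", "merge", "la", "Cluj", "Napoca"]
sample_tags = ["B-PERSON", "O", "O", "B-GPE", "I-GPE"]
print(bio2_to_spans(sample_tokens, sample_tags))
```

In this sketch an "I-" tag that does not follow a matching "B-" tag is simply dropped; a stricter decoder could raise an error instead.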

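How to use

There are two ways to use this model:

Directly in Transformers:

You can run NER with the pipeline helper from the Transformers library. Note that a word split into several subword tokens can receive a different label for each subtoken, so you have to handle that alignment yourself.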

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
# Load the fine-tuned tokenizer and token-classification model.
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-ner")
model = AutoModelForTokenClassification.from_pretrained("dumitrescustefan/bert-base-romanian-ner")
# Build the NER pipeline and tag one example sentence.
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "Alex cumpără un bilet pentru trenul 3118 în direcția Cluj cu plecare la ora 13:00."
ner_results = nlp(example)
# Each result describes one labeled subword token.
print(ner_results)
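One way to handle the subtoken issue is to glue WordPiece continuation pieces (those starting with "##") back onto the preceding piece and keep the label of the first piece. The sketch below assumes the pipeline's per-token output format with the keys `word`, `entity`, `index`, `start` and `end`; the helper name `merge_subtokens` and the sample predictions are hypothetical and were not produced by the model.

```python
def merge_subtokens(ner_results):
    """Merge "##" continuation pieces into whole words, keeping the first piece's label."""
    merged = []
    for item in ner_results:
        is_continuation = item["word"].startswith("##")
        # Only merge when the piece directly follows the previous one in the input.
        if is_continuation and merged and item["index"] == merged[-1]["last_index"] + 1:
            word = merged[-1]
            word["word"] += item["word"][2:]
            word["end"] = item["end"]
            word["last_index"] = item["index"]
        else:
            merged.append({
                "word": item["word"][2:] if is_continuation else item["word"],
                "entity": item["entity"],
                "start": item["start"],
                "end": item["end"],
                "last_index": item["index"],
            })
    return merged


# Hypothetical pipeline output in which "Cluj" was split into two pieces.
sample_results = [
    {"entity": "B-PERSON", "score": 0.99, "index": 1, "word": "Alex", "start": 0, "end": 4},
    {"entity": "B-GPE", "score": 0.98, "index": 12, "word": "Clu", "start": 53, "end": 56},
    {"entity": "I-GPE", "score": 0.61, "index": 13, "word": "##j", "start": 56, "end": 57},
]
for word in merge_subtokens(sample_results):
    print(word["word"], word["entity"], word["start"], word["end"])
```

Taking the first piece's label is only one possible policy. Recent Transformers releases also accept an `aggregation_strategy` argument on the token-classification pipeline, which may make a manual merge like this unnecessary.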

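Use in a Python package

Install the companion Python package with pip install roner. In a few lines of code it handles word-to-subtoken alignment, long sequences, and similar details automatically. See the roner GitHub page for details.

Important note

Always sanitize your text before passing it to these models. Replace the Romanian cedilla letters ş and ţ with their comma-below counterparts ș and ț, for example with the following command: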

text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
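The same replacement can be wrapped in a reusable helper. This is a minimal sketch that spells out the Unicode code points, since the cedilla and comma-below letters are hard to tell apart on screen; the name `sanitize_romanian` is an assumption made for illustration.

```python
# Map the cedilla letters to the comma-below letters.
_CEDILLA_TO_COMMA = str.maketrans({
    "\u0163": "\u021B",  # t with cedilla -> t with comma below
    "\u015F": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021A",  # T with cedilla -> T with comma below
    "\u015E": "\u0218",  # S with cedilla -> S with comma below
})


def sanitize_romanian(text: str) -> str:
    """Return text with cedilla s/t replaced by their comma-below forms."""
    return text.translate(_CEDILLA_TO_COMMA)


# The input below uses cedilla letters, written as escapes to make that explicit.
raw = "direc\u0163ia \u015Fi \u0162ara"
print(sanitize_romanian(raw))
```

Text that already uses the comma-below letters passes through unchanged, so the helper is safe to apply to every input.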

NER evaluation results

The model's evaluation results on the named entity recognition task are:

 'test/ent_type': 0.9276865720748901,
 'test/exact': 0.9118986129760742,
 'test/partial': 0.9356381297111511,
 'test/strict': 0.8921924233436584

Corpus details

The corpus contains the following classes, distributed across the train/validation/test splits as shown:

Class	Total	Train		Val		Test
	#	#	%	#	%	#	%
PERSON	26130	19167	73.35	2733	10.46	4230	16.19
GPE	11103	8193	73.79	1182	10.65	1728	15.56
LOC	2467	1824	73.94	270	10.94	373	15.12
ORG	7880	5688	72.18	880	11.17	1312	16.65
LANGUAGE	467	342	73.23	52	11.13	73	15.63
NAT_REL_POL	4970	3673	73.90	516	10.38	781	15.71
DATETIME	9614	6960	72.39	1029	10.70	1625	16.90
PERIOD	1188	862	72.56	129	10.86	197	16.58
QUANTITY	1588	1161	73.11	181	11.40	246	15.49
MONEY	1424	1041	73.10	159	11.17	224	15.73
NUMERIC	7735	5734	74.13	814	10.52	1187	15.35
ORDINAL	1893	1377	72.74	212	11.20	304	16.06
FACILITY	1126	840	74.60	113	10.04	173	15.36
WORK_OF_ART	1596	1157	72.49	176	11.03	263	16.48
EVENT	1102	826	74.95	107	9.71	169	15.34
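The percentage columns follow directly from the counts: each split's share is its count divided by the class total. The short sketch below recomputes them from the table and checks that the class totals add up to the 80,283 entities quoted in the model description. The names `split_counts` and `split_percentages` are illustrative assumptions.

```python
# (train, val, test) entity counts per class, copied from the table above.
split_counts = {
    "PERSON": (19167, 2733, 4230),
    "GPE": (8193, 1182, 1728),
    "LOC": (1824, 270, 373),
    "ORG": (5688, 880, 1312),
    "LANGUAGE": (342, 52, 73),
    "NAT_REL_POL": (3673, 516, 781),
    "DATETIME": (6960, 1029, 1625),
    "PERIOD": (862, 129, 197),
    "QUANTITY": (1161, 181, 246),
    "MONEY": (1041, 159, 224),
    "NUMERIC": (5734, 814, 1187),
    "ORDINAL": (1377, 212, 304),
    "FACILITY": (840, 113, 173),
    "WORK_OF_ART": (1157, 176, 263),
    "EVENT": (826, 107, 169),
}


def split_percentages(train, val, test):
    """Return each split's share of the class total, in percent with two decimals."""
    total = train + val + test
    return tuple(round(100 * n / total, 2) for n in (train, val, test))


grand_total = sum(sum(counts) for counts in split_counts.values())
print("all entities:", grand_total)
for name, counts in split_counts.items():
    print(name, sum(counts), split_percentages(*counts))
```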

Citation

If you use RONEC in your research, please cite the following paper as a thank-you to its authors, even if you are working with v2 of the corpus:

Dumitrescu, Stefan Daniel, Andrei-Marius Avram. "Introducing RONEC--the Romanian Named Entity Corpus." arXiv preprint arXiv:1909.01247 (2019)

Or in .bibtex format:

@article{dumitrescu2019introducing,
  title={Introducing RONEC--the Romanian Named Entity Corpus},
  author={Dumitrescu, Stefan Daniel and Avram, Andrei-Marius},
  journal={arXiv preprint arXiv:1909.01247},
  year={2019}
}