SimAlign: Similarity-Based Word Aligner

(Figure: alignment example)

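SimAlign is a high-quality word alignment tool that uses static and contextualized embeddings and does not require parallel training data.

The following table compares it with popular statistical alignment models: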

Method	English-Czech	English-German	English-Persian	English-French	English-Hindi	English-Romanian
fast-align	.78	.71	.46	.84	.38	.68
eflomal	.85	.77	.63	.93	.52	.72
mBERT-Argmax	.87	.81	.67	.94	.55	.65

The scores are F1 values, taken as the maximum over subword-level and word-level alignment. For more details, see the paper.

Installation and Usage

Tested with Python 3.7, Transformers 3.1.0, and Torch 1.5.0. Networkx 2.4 is optional (needed only for the Match algorithm). For the full list of dependencies, see setup.py. To install transformers, see its repository.
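Because Networkx is optional, a script may want to request the Match method only when Networkx can be imported. The sketch below is one way to do that. The helper name `choose_matching_methods` is hypothetical, and the assumption that the letter `m` in `matching_methods` selects the Match method is inferred from the usage example in this README rather than from documented behavior.

```python
import importlib.util


def choose_matching_methods(module_name="networkx"):
    """Return a matching_methods string for SentenceAligner.

    Assumption: "m" selects the Match method (which needs Networkx),
    while "a" and "i" select ArgMax and IterMax.
    """
    # find_spec returns None when a top-level module cannot be found.
    has_module = importlib.util.find_spec(module_name) is not None
    return "mai" if has_module else "ai"


methods = choose_matching_methods()
print("matching_methods =", methods)
```

The returned string can then be passed as the `matching_methods` argument when constructing `SentenceAligner`.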

Download the repository to use it, or install it from PyPI:

pip install simalign

or install it directly from GitHub with pip:

pip install --upgrade git+https://github.com/cisnlp/simalign.git#egg=simalign

An example of using our code:

from simalign import SentenceAligner

# Create an instance of our model.
# The embedding model and all alignment settings can be specified in the constructor.
myaligner = SentenceAligner(model="bert", token_type="bpe", matching_methods="mai")

# The source and target sentences should be tokenized into words.
src_sentence = ["This", "is", "a", "test", "."]
trg_sentence = ["Das", "ist", "ein", "Test", "."]

# The output is a dictionary keyed by matching method.
# Each method maps to a list of index pairs of aligned words (indices are zero-based).
alignments = myaligner.get_word_aligns(src_sentence, trg_sentence)

for matching_method in alignments:
    print(matching_method, ":", alignments[matching_method])

# Expected output:
# mwmf (Match): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
# inter (ArgMax): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
# itermax (IterMax): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]

For more examples of how to use our code, see scripts/align_example.py.
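The index pairs returned by `get_word_aligns` are often easier to inspect as word pairs or as a compact `i-j` string. The following is a minimal sketch of two post-processing helpers. Both function names are hypothetical and are not part of the simalign package, and the space-separated `i-j` layout is an assumption chosen for illustration.

```python
def aligned_word_pairs(src_tokens, trg_tokens, index_pairs):
    """Map (source index, target index) pairs to (source word, target word) pairs."""
    return [(src_tokens[i], trg_tokens[j]) for i, j in index_pairs]


def to_index_string(index_pairs):
    """Render index pairs as a space-separated string such as '0-0 1-1'."""
    return " ".join(f"{i}-{j}" for i, j in sorted(index_pairs))


src_sentence = ["This", "is", "a", "test", "."]
trg_sentence = ["Das", "ist", "ein", "Test", "."]
# Index pairs in the same shape as one entry of the alignments dictionary.
pairs = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]

print(aligned_word_pairs(src_sentence, trg_sentence, pairs[:2]))
# [('This', 'Das'), ('is', 'ist')]
print(to_index_string(pairs))
# 0-0 1-1 2-2 3-3 4-4
```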

Demo

An online demo is available here.

Gold Standards

Links to the gold standards used in the paper:

Language Pair	Citation	Type	Link
English-Czech	Marecek et al. 2008	Gold alignment	http://ufal.mff.cuni.cz/czech-english-manual-word-alignment
English-German	EuroParl-based	Gold alignment	www-i6.informatik.rwth-aachen.de/goldAlignment/
English-Persian	Tavakoli et al. 2014	Gold alignment	http://eceold.ut.ac.ir/en/node/940
English-French	WPT2003, Och et al. 2000	Gold alignment	http://web.eecs.umich.edu/~mihalcea/wpt/
English-Hindi	WPT2005	Gold alignment	http://web.eecs.umich.edu/~mihalcea/wpt05/
English-Romanian	WPT2005, Mihalcea et al. 2003	Gold alignment	http://web.eecs.umich.edu/~mihalcea/wpt05/

Evaluation Script

Use scripts/calc_align_score.py to evaluate the output alignments.

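The gold alignment file should have the same format as the SimAlign output. In the gold standard, sure alignment edges have a '-' between the source and target indices, and possible edges have a 'p' between the indices. For sample parallel sentences and their gold alignments for English-German, see samples.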
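To make the sure/possible distinction concrete, the following sketch parses a string of gold edges and scores a predicted alignment against it. It is not the implementation in scripts/calc_align_score.py. It assumes the edges are whitespace-separated tokens such as `0-0` or `2p3` (any sentence identifier in the file is expected to be removed beforehand), and it uses the common convention that precision is measured against the possible edges and recall against the sure edges. The function names `parse_gold` and `score` are hypothetical.

```python
import re

# One edge: source index, '-' (sure) or 'p' (possible), target index.
_EDGE = re.compile(r"^(\d+)([-p])(\d+)$")


def parse_gold(edges_text):
    """Return (sure, possible) edge sets; every sure edge also counts as possible."""
    sure, possible = set(), set()
    for token in edges_text.split():
        match = _EDGE.match(token)
        if match is None:
            raise ValueError(f"unrecognized edge: {token!r}")
        edge = (int(match.group(1)), int(match.group(3)))
        possible.add(edge)
        if match.group(2) == "-":
            sure.add(edge)
    return sure, possible


def score(predicted, sure, possible):
    """Compute precision, recall, and F1 for one predicted alignment."""
    predicted = set(predicted)
    if not predicted or not sure:
        raise ValueError("predicted and sure edge sets must be non-empty")
    precision = len(predicted & possible) / len(predicted)
    recall = len(predicted & sure) / len(sure)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


sure, possible = parse_gold("0-0 1-1 2p3")
precision, recall, f1 = score([(0, 0), (2, 3), (4, 4)], sure, possible)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
# P=0.667 R=0.500 F1=0.571
```

In this example two of the three predicted edges are in the possible set and one of the two sure edges is recovered, which gives the precision and recall shown.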

Publication

If you use this code, please cite:

@inproceedings{jalili-sabet-etal-2020-simalign,
    title = "{S}im{A}lign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings",
    author = {Jalili Sabet, Masoud  and
      Dufter, Philipp  and
      Yvon, Fran{\c{c}}ois  and
      Sch{\"u}tze, Hinrich},
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.147",
    pages = "1627--1643",
}