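g2pW: Mandarin Grapheme-to-Phoneme Converter
Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang and Yi-Ren Yeh
This is the official repository for our paper g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin (INTERSPEECH 2022).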
News
- g2pW has been included in PaddlePaddle/PaddleSpeech
- g2pW has been included in mozillazg/pypinyin-g2pW
Getting Started
Dependency / Install
(This project was tested with PyTorch 1.7.0, CUDA 10.1, Python 3.6, and Ubuntu 16.04.)
- Install PyTorch
- $ pip install g2pw
Quick Demo
>>> from g2pw import G2PWConverter
>>> conv = G2PWConverter()
>>> sentence = '上校請技術人員校正FN儀器'
>>> conv(sentence)
[['ㄕㄤ4', 'ㄒㄧㄠ4', 'ㄑㄧㄥ3', 'ㄐㄧ4', 'ㄕㄨ4', 'ㄖㄣ2', 'ㄩㄢ2', 'ㄐㄧㄠ4', 'ㄓㄥ4', None, None, 'ㄧ2', 'ㄑㄧ4']]
>>> sentences = ['銀行', '行動']
>>> conv(sentences)
[['ㄧㄣ2', 'ㄏㄤ2'], ['ㄒㄧㄥ2', 'ㄉㄨㄥ4']]
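The demo output above appears to hold one entry per input character, with `None` for characters that have no Mandarin reading (such as the Latin letters `F` and `N`). Assuming that alignment holds, the sketch below pairs each character with its predicted reading. The helper `pair_with_readings` is hypothetical and not part of g2pw; the readings are copied from the demo output so the snippet runs without loading the model.

```python
def pair_with_readings(sentence, readings):
    """Pair each character with its reading (hypothetical helper, not a g2pw API)."""
    if len(sentence) != len(readings):
        raise ValueError("expected one reading per character")
    return list(zip(sentence, readings))


sentence = '上校請技術人員校正FN儀器'
# Copied from the demo output above; in practice this would be conv(sentence)[0].
readings = ['ㄕㄤ4', 'ㄒㄧㄠ4', 'ㄑㄧㄥ3', 'ㄐㄧ4', 'ㄕㄨ4', 'ㄖㄣ2', 'ㄩㄢ2',
            'ㄐㄧㄠ4', 'ㄓㄥ4', None, None, 'ㄧ2', 'ㄑㄧ4']

pairs = pair_with_readings(sentence, readings)

# The polyphonic character 校 occurs twice and gets a different reading each time.
xiao_readings = [reading for char, reading in pairs if char == '校']
print(xiao_readings)  # ['ㄒㄧㄠ4', 'ㄐㄧㄠ4']
```

The length check is there because the pairing is only meaningful if the one-entry-per-character assumption holds for the input.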
Load Offline Model
conv = G2PWConverter(model_dir='./G2PWModel-v2-onnx/', model_source='./path-to/bert-base-chinese/')
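When working offline, it may help to confirm that both local directories exist before constructing the converter. The following is a minimal sketch under that assumption; `missing_offline_paths` is a hypothetical helper, not part of g2pw, and the two paths are the placeholders from the example above.

```python
from pathlib import Path


def missing_offline_paths(model_dir, model_source):
    """Return the given paths that are not existing directories (hypothetical helper)."""
    return [p for p in (model_dir, model_source) if not Path(p).is_dir()]


model_dir = './G2PWModel-v2-onnx/'
model_source = './path-to/bert-base-chinese/'

missing = missing_offline_paths(model_dir, model_source)
if missing:
    print('Missing local directories:', missing)
else:
    # Only construct the converter once both directories are present.
    from g2pw import G2PWConverter
    conv = G2PWConverter(model_dir=model_dir, model_source=model_source)
```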
Support Simplified Chinese and Pinyin
>>> from g2pw import G2PWConverter
>>> conv = G2PWConverter(style='pinyin', enable_non_tradional_chinese=True)
>>> conv('然而,他红了20年以后,他竟退出了大家的视线。')
[['ran2', 'er2', None, 'ta1', 'hong2', 'le5', None, None, 'nian2', 'yi3', 'hou4', None, 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', None]]
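The pinyin output above contains `None` for punctuation and digits. One way to turn it into a readable string is to fall back to the original character wherever the reading is `None`. This is only a sketch: `render_pinyin` is a hypothetical helper (not a g2pw function), it assumes one reading per input character as in the output above, and the readings are copied from that output so the snippet runs without the model.

```python
def render_pinyin(sentence, readings):
    """Join readings with spaces, keeping the original character where the
    reading is None (hypothetical helper, not a g2pw API)."""
    if len(sentence) != len(readings):
        raise ValueError("expected one reading per character")
    return ' '.join(
        reading if reading is not None else char
        for char, reading in zip(sentence, readings)
    )


sentence = '然而,他红了20年以后,他竟退出了大家的视线。'
# Copied from the output above; in practice this would be conv(sentence)[0].
readings = ['ran2', 'er2', None, 'ta1', 'hong2', 'le5', None, None, 'nian2',
            'yi3', 'hou4', None, 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4',
            'jia1', 'de5', 'shi4', 'xian4', None]

rendered = render_pinyin(sentence, readings)
print(rendered)
```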
Scripts
$ git clone https://github.com/GitYCC/g2pW.git
Train Model
For example, to train the model on the CPP dataset:
$ bash cpp_dataset/download.sh
$ python scripts/train_g2p_bert.py --config configs/config_cpp.py
Testing
$ python scripts/test_g2p_bert.py \
--config saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/config.py \
--checkpoint saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/best_accuracy.pth \
--sent_path cpp_dataset/test.sent \
--output_path output_pred.txt
Prediction
$ python scripts/predict_g2p_bert.py \
--config saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/config.py \
--checkpoint saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/best_accuracy.pth \
--sent_path cpp_dataset/test.sent \
--lb_path cpp_dataset/test.lb
Checkpoints
Citation
To cite the code, data, or paper, please use the following BibTeX:
@article{chen2022g2pw,
author={Yi-Chang Chen and Yu-Chuan Chang and Yen-Cheng Chang and Yi-Ren Yeh},
title = {g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin},
journal={Proc. Interspeech 2022},
url = {https://arxiv.org/abs/2203.10430},
year = {2022}
}