text-dedup

GitHub

Installation

pip install text-dedup

or

pip install git+https://github.com/ChenghaoMou/text-dedup

Documentation

Github Pages

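Features

This repository contains a collection of text deduplication scripts that are ready to use as-is or to modify for your own needs:

RETSim/UniSim, an embedding-based near-deduplication method (work in progress)
MinHash + MinHashLSH, including a Spark implementation suitable for large (TB-scale) datasets
64-bit or 128-bit SimHash
SuffixArray Substring
Bloom Filter
Exact Hash (document-level, line-level/ccnet)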

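I also have big plans for the future:

Memory benchmark for streaming processing
Inter-dataset deduplication
Rewrite suffix array in Python
A collection of other deduplication methods: SuperMinHash, ProbMinHash, TreeMinHash, BagMinHash, Optimal Densification for Fast and Accurate Minwise Hashing, Fast Similarity Sketching

However, I do not intend to build a general-purpose deduplication library, which was the original goal of this repository, and I will gradually retire the PyPI package. The reason is that use cases can differ widely, and each one requires careful design and consideration. I sincerely encourage you to read the scripts first (they are relatively short) so that you understand what is at stake when you use them. You can use them to bootstrap your own script, or simply as a reference.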

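Acknowledgements

This repository is inspired by the following projects and is heavily influenced by lessons learned from my participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about this journey. Feedback is welcome!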

Datasketch (MIT)
simhash-py and simhash-cpp (MIT)
Deduplicating Training Data Makes Language Models Better (Apache 2.0)
Gaoya (MIT)

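Quick Examples

Native PySpark

Modify text_dedup/minhash_spark.py for your own project and dataset first!

Assuming your downloaded dataset (parquet files) is under "./temp-data", you can process the files with your local compute resources: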

export PYSPARK_PYTHON="path to your python with scipy, xxhash, and numpy installed"
spark-submit --executor-memory 16g \
    --driver-memory 20g \
    --executor-cores 3 \
    --num-executors 2 \
    --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=./log4j.properties" \
    text_dedup/minhash_spark.py \
    --input "./temp-data" \
    --output "./temp-output" \
    --column "text" \
    --threshold 0.7

DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Using B=25, R=10
DEBUG __main__ - Loaded documents: 88803
DEBUG __main__ - Argument args.input='./temp-data'
DEBUG __main__ - Argument args.output='./temp-output'
DEBUG __main__ - Argument args.threshold=0.7
DEBUG __main__ - Argument args.ngram_size=5
DEBUG __main__ - Argument args.min_length=5
DEBUG __main__ - Argument args.num_perm=250
DEBUG __main__ - Argument args.column='text'
DEBUG __main__ - id                                                              : bigint
DEBUG __main__ - text                                                            : string
DEBUG __main__ - meta                                                            : struct<warc_headers:struct<warc-record-id:string,warc-date:string,content-type:string,content-length:int,warc-type:string,warc-identified-content-language:string,warc-refers-to:string,warc-target-uri:string,warc-block-digest:string>,identification:struct<label:string,prob:float>,annotations:array<string>,line_identifications:array<struct<label:string,prob:float>>>
DEBUG __main__ - __id__                                                          : bigint
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Initial edges: 52102
DEBUG __main__ - Edges DataFrame: 52102
DEBUG __main__ - Vertices DataFrame: 50206
DEBUG __main__ - Assignment DataFrame: 50206
DEBUG __main__ - Merging records: 88803
INFO  __main__ - Saving with 1 partitions and 44092 rows each
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
DEBUG __main__ - Number of rows before:    88803
DEBUG __main__ - Number of rows after:     44092
DEBUG __main__ - Percentage of rows kept:  49.65%
DEBUG __main__ - Output directory:         ./temp-output
DEBUG __main__ - Time:                     68.80s
DEBUG __main__ - ------------------------------------------------------------------------------------------------------------------------
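
The log above reports B=25 bands of R=10 rows, which multiply to the 250 permutations in args.num_perm. As a rough illustration of how such a split relates to a 0.7 threshold, the sketch below evaluates the standard LSH banding formula 1 - (1 - s^R)^B. The function names are hypothetical and are not part of text_dedup; the script's own parameter search may work differently.

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """Probability that two documents with Jaccard similarity `s` agree on
    every row of at least one band, and so become a candidate pair."""
    return 1.0 - (1.0 - s ** rows) ** bands


def approximate_threshold(bands: int, rows: int) -> float:
    """Common approximation of where the S-curve switches on: (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


if __name__ == "__main__":
    B, R = 25, 10  # values taken from the log; B * R == 250 permutations
    print(f"approximate threshold: {approximate_threshold(B, R):.3f}")
    for s in (0.5, 0.6, 0.7, 0.8, 0.9):
        print(f"s={s:.1f} -> P(candidate)={candidate_probability(s, B, R):.3f}")
```

Under this formula a pair at similarity 0.7 becomes a candidate roughly half the time, while a pair at 0.9 is almost always caught and a pair at 0.5 rarely is.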

Or take a look at bigcode-v2/run.sh to see how to run the job on GCP DataProc.

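UniSim (work in progress)

Based on Google's RETSim model (Github, Arxiv), this is an embedding-based near-deduplication method.

For large datasets, GPU-accelerated inference may be required.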

python text_dedup/ann_unisim.py --path truthful_qa --name generation --split validation --output temp --column question

Output:

INFO     Load Dataset                  : 5.56s
INFO     Index Dataset                 : 8.13s
INFO     Clustering                    : 8.72s
INFO     Filtering                     : 0.35s
INFO     Saving                        : 0.01s
INFO     Cleaning                      : 0.00s
INFO     Total                         : 22.77s
INFO     Before                        : 817
INFO     After                         : 788

Suffix Array Substring Exact Deduplication

# input
python -m text_dedup.suffix_array \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/suffix_array/oscar_gl_dedup" \
    --column "text" \
    --google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets" \
    --use_auth_token true

# output
INFO     Loading                     : 2.75s
INFO     Preprocessing               : 4.78s
INFO     SuffixArray                 : 98.29s
INFO     SelfSimilar                 : 4.24s
INFO     Restore                     : 0.25s
INFO     Deduplicate                 : 6.23s
INFO     Saving                      : 8.91s
INFO     Total                       : 125.45s
INFO     Before                      : 180332342 bytes (88803)
INFO     After                       : 97646271 bytes (40404)
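
To make the substring-level idea concrete, here is a minimal pure-Python sketch that builds a suffix array and flags spans covered by a substring occurring at least twice. It is only an illustration and not the implementation that --google_repo_path points to: the function names, the naive construction, and the character-level min_len threshold are all assumptions.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive construction: sort suffix start positions by suffix content.
    Fine for a demo, far too slow for real corpora."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def repeated_spans(text: str, min_len: int) -> list[tuple[int, int]]:
    """Return merged (start, end) spans that are covered by a substring of
    at least `min_len` characters occurring two or more times in `text`."""
    sa = build_suffix_array(text)
    spans = []
    # Suffixes sharing a long prefix sit next to each other in the array,
    # so comparing neighbours is enough to find every repeat.
    for i, j in zip(sa, sa[1:]):
        k = common_prefix_len(text[i:], text[j:])
        if k >= min_len:
            spans.append((i, i + k))
            spans.append((j, j + k))
    spans.sort()
    merged: list[tuple[int, int]] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    sample = "abcXYZabc"
    print(build_suffix_array("banana"))
    print(repeated_spans(sample, min_len=3))
```

Which copy of a repeated span to drop (all but the first, for example) is a policy decision that this sketch leaves open.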

MinHash Near Deduplication

# input
python -m text_dedup.minhash \
  --path "oscar-corpus/OSCAR-2201" \
  --name "gl" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/minhash/oscar_gl_dedup" \
  --column "text" \
  --batch_size 10000 \
  --use_auth_token true

# output
INFO     Loading                         : 2.62s
INFO     MinHashing                      : 0.08s
INFO     Clustering                      : 2.20s
INFO     Filtering                       : 0.53s
INFO     Saving                          : 9.86s
INFO     Total                           : 15.29s
INFO     Data Number (before)            : 88803
INFO     Data Number (after)             : 44124 (49.69%)
INFO     Duplicate Number                : 44679 (50.31%)
INFO     🤗 Happy Deduplicating 🤗
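
As a minimal sketch of what the MinHashing step computes, the code below builds MinHash signatures over word n-grams and estimates Jaccard similarity from the fraction of matching signature slots. The hashing scheme, the seed, and every function name here are assumptions chosen for illustration; text_dedup.minhash may differ in its details.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1
MAX_HASH = (1 << 32) - 1


def shingles(text: str, n: int = 5) -> set[str]:
    """Word n-grams of `text`; short texts yield a single shingle."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}


def base_hash(shingle: str) -> int:
    """32-bit hash of one shingle, derived from SHA-1."""
    return int.from_bytes(hashlib.sha1(shingle.encode("utf-8")).digest()[:4], "little")


def make_permutations(num_perm: int, seed: int = 42) -> list[tuple[int, int]]:
    """Random (a, b) pairs defining hash functions h -> (a * h + b) mod p."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_perm)
    ]


def signature(shingle_set: set[str], perms: list[tuple[int, int]]) -> list[int]:
    """MinHash signature: the minimum permuted hash for each (a, b) pair."""
    hashes = [base_hash(s) for s in shingle_set]
    return [
        min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashes)
        for a, b in perms
    ]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature slots on which the two documents agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    perms = make_permutations(250)
    a = shingles("the quick brown fox jumps over the lazy dog by the river", n=3)
    b = shingles("the quick brown fox jumps over the lazy dog by the lake", n=3)
    print("exact    :", len(a & b) / len(a | b))
    print("estimated:", estimate_jaccard(signature(a, perms), signature(b, perms)))
```

The estimate gets tighter as the number of permutations grows, which is the trade-off behind a setting such as num_perm.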

SimHash Near Deduplication

# input
python -m text_dedup.simhash \
  --path "oscar-corpus/OSCAR-2201" \
  --name "gl" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/simhash/oscar_gl_dedup" \
  --column "text" \
  --batch_size 10000 \
  --use_auth_token true

# output
INFO     Loading                         : 2.60s
INFO     SimHashing                      : 0.04s
INFO     Indexing                        : 28.88s
INFO     Filtering                       : 0.88s
INFO     Saving                          : 10.41s
INFO     Total                           : 42.80s
INFO     Data Number (before)            : 88803
INFO     Data Number (after)             : 46163 (51.98%)
INFO     Duplicate Number                : 42640 (48.02%)
INFO     🤗 Happy Deduplicating 🤗
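
For readers new to SimHash, the following is a minimal sketch of a 64-bit or 128-bit fingerprint compared by Hamming distance. The word n-gram features, the BLAKE2b feature hash, the max_distance default, and the function names are assumptions made for illustration rather than the choices in text_dedup.simhash.

```python
import hashlib


def simhash(text: str, bits: int = 64, ngram: int = 3) -> int:
    """Fingerprint in which each bit is the majority vote of that bit
    across the hashes of all word n-gram features."""
    tokens = text.split()
    features = [
        " ".join(tokens[i:i + ngram])
        for i in range(max(1, len(tokens) - ngram + 1))
    ]
    votes = [0] * bits
    for feature in features:
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(bits):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(bits):
        if votes[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, max_distance: int = 3, bits: int = 64) -> bool:
    """Treat two texts as near-duplicates if their fingerprints are close.
    The default max_distance is an illustrative assumption."""
    return hamming_distance(simhash(a, bits), simhash(b, bits)) <= max_distance


if __name__ == "__main__":
    x = "the quick brown fox jumps over the lazy dog by the river bank"
    y = "the quick brown fox jumps over the lazy dog by the river side"
    print(hamming_distance(simhash(x), simhash(y)))
    print(hamming_distance(simhash(x, bits=128), simhash(y, bits=128)))
```

Comparing every pair of fingerprints is quadratic, so a real pipeline needs an index over the fingerprints; that part is outside this sketch.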

Exact Hash Exact Deduplication

# input
python -m text_dedup.exact_hash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/exact_hash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 1000 \
    --use_auth_token true

# output
INFO     Loading                       : 2.95s
INFO     Processing                    : 3.79s
INFO     Filtering                     : 0.10s
INFO     Saving                        : 2.89s
INFO     Total                         : 9.72s
INFO     Before                        : 88803
INFO     After                         : 47049
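
The two granularities listed under Features (document-level and line-level) can be sketched as follows. This is a minimal illustration that assumes SHA-256 as the hash and a strip-and-lowercase normalization for lines; the function names are hypothetical and the script's own hash and normalization may differ.

```python
import hashlib


def content_hash(text: str) -> bytes:
    """Digest used as the identity of a document or a line."""
    return hashlib.sha256(text.encode("utf-8")).digest()


def dedup_documents(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each byte-identical document."""
    seen: set[bytes] = set()
    kept = []
    for doc in docs:
        digest = content_hash(doc)
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def dedup_lines(docs: list[str]) -> list[str]:
    """Drop any line whose normalized form was already seen anywhere in
    the corpus, keeping the first occurrence."""
    seen: set[bytes] = set()
    cleaned = []
    for doc in docs:
        kept_lines = []
        for line in doc.splitlines():
            # Normalization here is an assumption made for the example.
            digest = content_hash(line.strip().lower())
            if digest not in seen:
                seen.add(digest)
                kept_lines.append(line)
        cleaned.append("\n".join(kept_lines))
    return cleaned


if __name__ == "__main__":
    print(dedup_documents(["a", "b", "a"]))
    print(dedup_lines(["header\nfirst body", "header\nsecond body"]))
```

Storing digests instead of full texts keeps memory proportional to the number of unique items rather than to their total size.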

Bloom Filter Exact Deduplication

# input
python -m text_dedup.bloom_filter \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/bloom_filter/oscar_gl_dedup" \
    --error_rate 1e-5 \
    --column "text" \
    --use_auth_token true \
    --batch_size 1000

# output
INFO     Loading                       : 2.72s
INFO     Processing                    : 4.84s
INFO     Filtering                     : 0.10s
INFO     Saving                        : 2.88s
INFO     Total                         : 10.54s
INFO     Before                        : 88803
INFO     After                         : 47045
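
The --error_rate flag suggests the usual Bloom filter trade-off: membership tests can return false positives, so a small share of unique documents may be dropped as if they were duplicates. Below is a minimal sketch, assuming the textbook sizing formulas and double hashing over SHA-256; the class, its methods, and the capacity argument are hypothetical and unrelated to the library the script actually uses.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, capacity: int, error_rate: float) -> None:
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(1, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str) -> list[int]:
        """Derive k bit positions from two 64-bit halves of one digest."""
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> bool:
        """Insert `item`; return True if it was (possibly) present before."""
        present = True
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                present = False
                self.bits[byte] |= mask
        return present


def dedup_bloom(docs: list[str], capacity: int, error_rate: float = 1e-5) -> list[str]:
    """Keep documents the filter has not seen; false positives are possible."""
    bloom = BloomFilter(capacity, error_rate)
    return [doc for doc in docs if not bloom.add(doc)]


if __name__ == "__main__":
    bf = BloomFilter(capacity=1000, error_rate=1e-5)
    print(bf.num_bits, bf.num_hashes)
    print(dedup_bloom(["a", "b", "a", "c", "b"], capacity=1000))
```

Unlike the exact-hash approach, memory is fixed up front by capacity and error_rate, at the cost of occasionally discarding a document that was never seen.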

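Benchmarks

[!NOTE] The Spark implementation has some overhead on small datasets, so I recommend using that script only when you have a large dataset and enough compute resources.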

pinecone/core-2020-05-10-deduplication

See tests/benchmark_core.py for reproduction.

Algorithm	Precision (Duplicates)	Recall (Duplicates)	Precision (Non Duplicates)	Recall (Non Duplicates)	Macro F1 score	Accuracy	Time
UniSim	0.9307	0.8924	0.9055	0.9394	0.9181	0.9054	1305.79s
MinHash Spark	0.957	0.9445	0.9471	0.959	0.952	0.9202	691.77s
MinHash	0.9594	0.9445	0.9474	0.9616	0.9534	0.924	18.88s
SimHash	0.9042	0.721	0.792	0.9329	0.8481	0.8321	644.36s
Exact Title	0.8302	0.5521	0.7098	0.9065	0.77	0.7456	-
Exact Title Matching ¹	0.830	0.50	0.709	0.992	0.757	0.746	-
Simhash Matching ¹	0.697	0.247	0.598	0.985	0.631	0.616	-
Document Vector Similarity ¹	0.912	0.779	0.861	0.986	0.885	0.883	-
Hybrid Method ¹	0.908	0.828	0.899	0.979	0.904	0.903	-
LaBSE²	0.937	0.923	0.930	0.943	0.933	0.919	-
Multilingual USE²	0.917	0.907	0.918	0.927	0.917	0.909	-
Multilingual E5-Base²	0.931	0.908	0.919	0.939	0.924	0.920	-
MinHash + LSH²	0.929	0.902	0.915	0.938	0.921	0.918	-
RETSim Partial-Dup²	0.945	0.941	0.945	0.949	0.945	0.928	-
RETSim Near-Dup²	0.928	0.937	0.942	0.934	0.935	0.926	-

NEWS-COPY

See tests/benchmark_news.py for reproduction.

Adjusted Rand Index (ARI) on the NEWS-COPY dataset:

Model/Algorithm	ARI
SimHash	0.612
MinHash (Spark)	0.740
MinHash	0.742
RETSim Near-Dup + ANN*	0.051
n-gram ³	0.440
SimHash²	0.695
MinHash³	0.737
MinHash²	0.783
Multilingual USE²	0.730
Multilingual E5-Base²	0.742
S-BERT³	0.700
RETSim Partial-Dup²	0.831
RETSim Near-Dup²	0.704
Re-ranking ³	0.937
Bi-encoder ³	0.915

*: I was unable to reproduce the results from the paper.

License

Apache 2.0

Citations

In general, you can cite this repository as follows:

@software{chenghao_mou_2023_8364980,
  author       = {Chenghao Mou and
                  Chris Ha and
                  Kenneth Enevoldsen and
                  Peiyuan Liu},
  title        = {ChenghaoMou/text-dedup: Reference Snapshot},
  month        = sep,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {2023.09.20},
  doi          = {10.5281/zenodo.8364980},
  url          = {https://doi.org/10.5281/zenodo.8364980}
}

The Spark version grew out of BigCode (Apache 2.0) and BigScience (Apache 2.0); you can cite the original paper if you wish:

@article{
kocetkov2023the,
title={The Stack: 3 {TB} of permissively licensed source code},
author={Denis Kocetkov and Raymond Li and Loubna Ben allal and Jia LI and Chenghao Mou and Yacine Jernite and Margaret Mitchell and Carlos Mu{\~n}oz Ferrandis and Sean Hughes and Thomas Wolf and Dzmitry Bahdanau and Leandro Von Werra and Harm de Vries},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=pxpbTdUEpD},
note={}
}
