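bm25s: an ultra-fast Python BM25 implementation for text retrieval

Introduction to BM25S

What is BM25S?

BM25S is an ultra-fast BM25 library implemented in pure Python that relies on Scipy sparse matrices for its performance. BM25 is a widely used ranking function for text retrieval tasks and a core component of search services such as Elasticsearch.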

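Features of BM25S

BM25S is known for two main features:

Fast: BM25S precomputes scores for the terms in every document and stores them in Scipy sparse matrices, so scoring at query time is very fast and overall performance improves substantially.
Simple: BM25S installs through pip and is ready to use right away. It has no dependency on Java or Pytorch; it needs only Scipy and Numpy, plus optional lightweight dependencies for stemming.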
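
To make the precomputation idea concrete, here is a minimal sketch of eager scoring with a Scipy sparse matrix. It is an illustration and not the library's internal code: the function names `build_score_matrix` and `score_query`, the term-by-document layout, the default `k1` and `b` values, and the particular BM25 formula variant are all assumptions chosen for the example.

```python
import math
from collections import Counter

import numpy as np
from scipy import sparse


def build_score_matrix(docs, k1=1.5, b=0.75):
    """Precompute one BM25 score per (term, document) pair that occurs.

    `docs` is a list of token lists. Hypothetical helper, not bm25s API.
    """
    vocab = {t: i for i, t in enumerate(sorted({t for d in docs for t in d}))}
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(t for d in docs for t in set(d))

    rows, cols, vals = [], [], []
    for j, doc in enumerate(docs):
        for term, tf in Counter(doc).items():
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf + k1 * (1 - b + b * len(doc) / avgdl)
            rows.append(vocab[term])
            cols.append(j)
            vals.append(idf * tf * (k1 + 1) / norm)

    # Rows are terms, columns are documents; absent pairs stay implicit zeros.
    matrix = sparse.csr_matrix((vals, (rows, cols)), shape=(len(vocab), n_docs))
    return vocab, matrix


def score_query(query_tokens, vocab, matrix):
    """Score all documents by summing the precomputed rows of the query terms."""
    ids = [vocab[t] for t in query_tokens if t in vocab]
    if not ids:
        return np.zeros(matrix.shape[1])
    return np.asarray(matrix[ids].sum(axis=0)).ravel()


docs = [["cat", "purr"], ["dog", "play"], ["fish", "swim"]]
vocab, matrix = build_score_matrix(docs)
scores = score_query(["fish", "cat"], vocab, matrix)
```

At query time no BM25 arithmetic is left to do: scoring reduces to selecting a few sparse rows and adding them up, which is why this kind of precomputation pays off.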

Installation and Quick Start

Installing BM25S

You can install BM25S with the following command:

pip install bm25s

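Optional extras are also available. Installing a stemmer gives better retrieval results, and jax can speed up top-k selection:

# Install all extra dependencies
pip install "bm25s[full]"
# Stemming for better retrieval results
pip install PyStemmer
# Faster top-k selection
pip install "jax[cpu]"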

Quick Start

Below is a short example showing how to retrieve text with BM25S:

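import bm25s
import Stemmer  # provided by the optional PyStemmer package

# A small corpus of documents to search
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

# Tokenize the corpus, removing English stopwords and stemming each token
stemmer = Stemmer.Stemmer("english")
corpus_tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)

# Build the BM25 index over the tokenized corpus
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# Tokenize the query with the same stemmer
query = "does the fish purr like a cat?"
query_tokens = bm25s.tokenize(query, stemmer=stemmer)

# Retrieve the top-2 documents; both arrays have shape (n_queries, k)
results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)

# Print the ranked documents for the first (and only) query
for i in range(results.shape[1]):
    doc, score = results[0, i], scores[0, i]
    print(f"Rank {i+1} (score: {score:.2f}): {doc}")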
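
The `k=2` argument above asks for only the two best-scoring documents. As a rough sketch of how such a top-k step can be done without sorting every score, the snippet below uses Numpy's `argpartition`. The helper name `top_k` and the sample scores are made up for illustration, and this is not a claim about how bm25s implements its own selection.

```python
import numpy as np


def top_k(scores, k):
    """Return (indices, values) of the k largest scores, best first.

    Hypothetical helper for illustration; not part of the bm25s API.
    """
    k = min(k, scores.shape[0])
    if k <= 0:
        return np.array([], dtype=int), np.array([], dtype=scores.dtype)
    # argpartition puts the k largest entries in the last k slots, unordered.
    candidates = np.argpartition(scores, -k)[-k:]
    # Only those k candidates need a full sort, in descending order.
    order = candidates[np.argsort(scores[candidates])[::-1]]
    return order, scores[order]


sample_scores = np.array([0.1, 0.9, 0.0, 0.5])
indices, values = top_k(sample_scores, 2)
```

Partitioning first keeps the cost close to linear in the number of documents, with a full sort applied only to the k survivors.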