raptor - A technique that uses a recursive tree structure to make retrieval over large text collections more efficient

Project Overview: RAPTOR

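RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) is a method for improving the information retrieval abilities of language models. It builds a recursive tree structure from documents so that context can be used more efficiently when retrieving information from large amounts of text. This addresses a common limitation of conventional language models when they work with large text collections.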

Method and Implementation

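RAPTOR builds a recursive tree structure over the source text so that retrieval can target the information a query actually needs. The approach is described as improving both retrieval speed and context awareness on large text collections. For the full method and implementation details, see the original paper: RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval.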
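
To make the idea of a recursive tree more concrete, here is a minimal, self-contained sketch. It is not the RAPTOR implementation: the names Node, toy_summarize, and build_tree are hypothetical, nodes are grouped by adjacent position purely for simplicity, and the "summary" is a plain truncation standing in for a language model. See the paper for how the real system groups and summarizes nodes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    text: str
    layer: int
    children: List["Node"] = field(default_factory=list)


def toy_summarize(texts: List[str], max_words: int = 8) -> str:
    """Stand-in for a language-model summarizer: keep the first words of each text."""
    parts = [" ".join(t.split()[:max_words]) for t in texts]
    return " / ".join(parts)


def build_tree(chunks: List[str], group_size: int = 2) -> List[List[Node]]:
    """Build layers bottom-up until a single root node remains.

    Grouping by adjacent position is an assumption of this sketch,
    not the grouping strategy used by RAPTOR itself.
    """
    if not chunks:
        raise ValueError("need at least one chunk")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    layers = [[Node(text=c, layer=0) for c in chunks]]
    while len(layers[-1]) > 1:
        prev = layers[-1]
        nxt = []
        for i in range(0, len(prev), group_size):
            group = prev[i:i + group_size]
            summary = toy_summarize([n.text for n in group])
            nxt.append(Node(text=summary, layer=len(layers), children=group))
        layers.append(nxt)
    return layers


layers = build_tree(["chunk one", "chunk two", "chunk three", "chunk four", "chunk five"])
print([len(layer) for layer in layers])  # prints [5, 3, 2, 1]
```

Each layer is a coarser view of the layer below it, so a retriever can match a question against detailed leaves, broad summaries, or both.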

Installation

Before using RAPTOR, make sure Python 3.8 or later is installed. Then clone the RAPTOR repository with git and install the required dependencies:

git clone https://github.com/parthsarthi03/raptor.git
cd raptor
pip install -r requirements.txt

Basic Usage

Initial Setup

Set your OpenAI API key and initialize RAPTOR:

import os
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

from raptor import RetrievalAugmentation

RA = RetrievalAugmentation()
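
Hardcoding a key in source code is easy to leak. As one possible alternative, the sketch below reads the key from the environment and fails early with a clear message when it is missing. The helper name require_api_key is hypothetical and is not part of RAPTOR.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the value of the environment variable `name`, or raise if unset."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting Python")
    return value
```

Calling require_api_key() before constructing RetrievalAugmentation surfaces a missing key at startup rather than on the first model call.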

Adding Documents to the Tree

Add a text document to RAPTOR so that it can be indexed:

with open('sample.txt', 'r') as file:
    text = file.read()
RA.add_documents(text)
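
The example above indexes a single file. If the source material is spread over several files, one simple option is to join them into a single string first. The helper below is a sketch with a hypothetical name (load_corpus); it assumes the files are UTF-8 text.

```python
from pathlib import Path
from typing import Iterable


def load_corpus(paths: Iterable[str], separator: str = "\n\n") -> str:
    """Read each file as UTF-8 and join the non-empty contents into one string."""
    texts = []
    for p in paths:
        texts.append(Path(p).read_text(encoding="utf-8").strip())
    return separator.join(t for t in texts if t)
```

The result can then be passed to RA.add_documents in the same way as the text variable above.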

Question Answering

RAPTOR can answer questions based on the indexed documents:

question = "How did Cinderella reach her happy ending?"
answer = RA.answer_question(question=question)
print("Answer: ", answer)
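
Behind a call like this, a retrieval step has to pick which tree nodes to hand to the QA model. The following toy sketch shows one way that could work: score every node, leaves and summaries alike, against the question and keep the best matches. Searching all nodes at once and using word counts as an embedding are both assumptions of this sketch, and the function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def bag_of_words(text: str) -> Counter:
    """Toy embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question: str, nodes: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k nodes most similar to the question, best first."""
    q = bag_of_words(question)
    scored = [(cosine(q, bag_of_words(n)), n) for n in nodes]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


nodes = [
    "Cinderella lost her glass slipper at the ball.",
    "The prince searched the kingdom for the owner of the slipper.",
    "Summary: Cinderella attends the ball, loses a slipper, and marries the prince.",
]
context = top_k("Who searched for the owner of the slipper?", nodes, k=1)
print(context[0][1])
```

A real embedding model would replace bag_of_words, but the ranking logic stays the same.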

Saving and Loading the Tree

The constructed tree can be saved to a path of your choice and reloaded later when needed:

SAVE_PATH = "demo/cinderella"
RA.save(SAVE_PATH)

RA = RetrievalAugmentation(tree=SAVE_PATH)
answer = RA.answer_question(question=question)
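
Saving matters because building the tree is the expensive step, while loading it is cheap. The sketch below shows a generic save and reload round trip using Python's pickle module on a hypothetical stand-in structure. Whether RA.save uses the same on-disk format is an assumption, not something stated here.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a built tree: one list of node texts per layer.
tree = {"layers": [["leaf a", "leaf b"], ["summary of a and b"]]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_tree.pkl"
    with path.open("wb") as f:
        pickle.dump(tree, f)      # persist once, after the expensive build
    with path.open("rb") as f:
        restored = pickle.load(f)  # reload without rebuilding

print(restored == tree)
```

Pickle files can execute code when loaded, so only load files from sources you trust.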

Extending RAPTOR with Other Models

RAPTOR is designed to be flexible: you can plug in different models for summarization, question answering, and embedding generation.

Custom Summarization Model

To use a custom language model for summarization, extend the BaseSummarizationModel class:

from raptor import BaseSummarizationModel

class CustomSummarizationModel(BaseSummarizationModel):
    def __init__(self):
        pass

    def summarize(self, context, max_tokens=150):
        summary = "Your summary here"
        return summary
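
The stub above returns a fixed string. For local experiments without any model calls, something like the following sketch could stand in. So that it runs without RAPTOR installed, it uses a locally defined SummarizerInterface as a hypothetical substitute for BaseSummarizationModel, and it treats max_tokens as a budget of whitespace-separated words, which is a simplification.

```python
import re
from abc import ABC, abstractmethod


class SummarizerInterface(ABC):
    """Hypothetical stand-in for raptor.BaseSummarizationModel."""

    @abstractmethod
    def summarize(self, context, max_tokens=150):
        ...


class LeadSentenceSummarizer(SummarizerInterface):
    """Keep leading sentences until a word budget is used up."""

    def summarize(self, context, max_tokens=150):
        sentences = re.split(r"(?<=[.!?])\s+", context.strip())
        kept, used = [], 0
        for sentence in sentences:
            n_words = len(sentence.split())
            # The first sentence is always kept, even if it exceeds the budget.
            if kept and used + n_words > max_tokens:
                break
            kept.append(sentence)
            used += n_words
        return " ".join(kept)


print(LeadSentenceSummarizer().summarize("One two three. Four five six. Seven.", max_tokens=4))
```

To use it with RAPTOR, the class would inherit from BaseSummarizationModel instead of the local interface.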

Custom QA Model

To use a custom question answering model, extend the BaseQAModel class:

from raptor import BaseQAModel

class CustomQAModel(BaseQAModel):
    def __init__(self):
        pass

    def answer_question(self, context, question):
        answer = "Your answer here"
        return answer
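
As a rough illustration of what answer_question receives and returns, the sketch below picks the context sentence that shares the most words with the question. OverlapQAModel is a hypothetical, library-free stand-in. A real QA model would generate an answer rather than copy a sentence.

```python
import re


def words(text):
    """Lowercase word set used for overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class OverlapQAModel:
    """Hypothetical stand-in exposing the answer_question signature shown above."""

    def answer_question(self, context, question):
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", context.strip()) if s]
        if not sentences:
            return ""
        q = words(question)
        # Return the sentence with the largest word overlap with the question.
        return max(sentences, key=lambda s: len(words(s) & q))


qa = OverlapQAModel()
context = "The fairy godmother gave Cinderella a gown. The prince found the slipper."
print(qa.answer_question(context, "Who found the slipper?"))
```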

Custom Embedding Model

Likewise, to use a different embedding model, extend the BaseEmbeddingModel class:

from raptor import BaseEmbeddingModel

class CustomEmbeddingModel(BaseEmbeddingModel):
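    def __init__(self, embedding_dim=768):
        # 768 is only a placeholder; set this to your model's output size
        self.embedding_dim = embedding_dim

    def create_embedding(self, text):
        # Replace this zero vector with a call to your embedding model
        embedding = [0.0] * self.embedding_dim
        return embedding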
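
For tests that should run offline, a deterministic embedding can be handy. The sketch below hashes each word into a fixed-size vector (the "hashing trick") and normalizes the result. HashingEmbeddingModel is a hypothetical, library-free stand-in. It captures word overlap only, not meaning, so it is no substitute for a real embedding model.

```python
import hashlib
import math
import re


class HashingEmbeddingModel:
    """Hypothetical stand-in exposing the create_embedding method shown above."""

    def __init__(self, dim=64):
        if dim <= 0:
            raise ValueError("dim must be positive")
        self.dim = dim

    def create_embedding(self, text):
        vec = [0.0] * self.dim
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            # Map each token to a stable bucket using a cryptographic hash.
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % self.dim
            vec[index] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        # Return a unit-length vector, or all zeros for empty input.
        return [v / norm for v in vec] if norm else vec


model = HashingEmbeddingModel(dim=16)
embedding = model.create_embedding("glass slipper")
print(len(embedding))  # prints 16
```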

Integrating with RAPTOR

Once the custom models are implemented, pass them to RAPTOR through a configuration object:

from raptor import RetrievalAugmentation, RetrievalAugmentationConfig

custom_summarizer = CustomSummarizationModel()
custom_qa = CustomQAModel()
custom_embedding = CustomEmbeddingModel()

custom_config = RetrievalAugmentationConfig(
    summarization_model=custom_summarizer,
    qa_model=custom_qa,
    embedding_model=custom_embedding
)

RA = RetrievalAugmentation(config=custom_config)

More examples and configuration guides will be provided in future updates.

Contributing

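RAPTOR is an open-source project and welcomes contributions of all kinds. Whether you are fixing a bug, adding a feature, or improving the documentation, the project team appreciates the help.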

License

RAPTOR is released under the MIT License. See the license file in the repository for details.

Citation

If RAPTOR helps your research, please cite it as follows:

@inproceedings{sarthi2024raptor,
    title={RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval},
    author={Sarthi, Parth and Abdullah, Salman and Tuli, Aditi and Khanna, Shubh and Goldie, Anna and Manning, Christopher D.},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2024}
}
