doctran - An LLM-based framework for intelligent document transformation and processing

🐛 Doctran

Document transformation framework - use LLMs to process complex strings with natural language instructions.

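Certain applications need to parse documents where human-level judgment matters more than speed, for example labelling transactions or extracting semantic information from text. In these cases regular expressions can be too inflexible, and LLMs are a good fit. Think of Doctran as an LLM-powered black box: messy strings go in, and tidy, clean, labelled strings come out. Another way to think about it is as a modular, declarative wrapper around OpenAI's function calling feature that significantly improves the developer experience.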

Doctran is lightly maintained by jasonwcfan.

Examples

Clone or download examples.ipynb for an interactive demo.

Doctran converts messy, unstructured text

<doc type="Confidential Document - For Internal Use Only">
<metadata>
<date> &#x004A; &#x0075; &#x006C; &#x0079; &#x0020; &#x0031; , &#x0032; &#x0030; &#x0032; &#x0033; </date>
<subject> Updates and Discussions on Various Topics; </subject>
</metadata>
<body>
<p>Dear Team,</p>
<p>I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.</p>
<section>
<header>Security and Privacy Measures</header>
<p>As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: john.doe&#64;example.com) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at security&#64;example.com.</p>
</section>
<section>
<header>HR Updates and Employee Benefits</header>
<p>Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: &#x0030; &#x0034; &#x0039; - &#x0034; &#x0035; - &#x0035; &#x0039; &#x0032; &#x0038;) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: &#x0034; &#x0031; 
...

into semi-structured documents that are optimized for vector search.

{
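  "topics": ["Security and Privacy", "HR Updates", "Marketing", "R&D"],
  "summary": "The document discusses updates on security measures, HR, marketing initiatives, and R&D projects. It commends John Doe for his work enhancing network security, welcomes new team members, and recognizes Jane Smith for her performance in customer service. It also mentions the open enrollment period for the employee benefits program, thanks Sarah Thompson for her social media efforts, and announces a product launch on July 15th. David Rodriguez is acknowledged for his contributions to R&D. The document emphasizes the importance of confidentiality.",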
  "contact_info": [
    {
      "name": "John Doe",
      "contact_info": {
        "phone": "",
        "email": "john.doe@example.com"
      }
    },
    {
      "name": "Michael Johnson",
      "contact_info": {
        "phone": "418-492-3850",
        "email": "michael.johnson@example.com"
      }
    },
    {
      "name": "Sarah Thompson",
      "contact_info": {
        "phone": "415-555-1234",
        "email": ""
      }
    }
  ],
  "questions_and_answers": [
    {
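      "question": "What is the purpose of this document?",
      "answer": "The purpose of this document is to provide important updates and discuss various topics that require the team's attention."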
    },
    {
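      "question": "Who is commended for enhancing the company's network security?",
      "answer": "John Doe from the IT department is commended for enhancing the company's network security."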
    }
  ]
}

Getting Started

pip install doctran

from doctran import Doctran

doctran = Doctran(openai_api_key=OPENAI_API_KEY)
document = doctran.parse(content="your_content_as_string")

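Chaining transformations

Doctran is designed to make chaining document transformations easy. For example, you may want to redact all personally identifiable information from a document before sending it to OpenAI for summarization.

Order matters when chaining transformations: the transformation called first runs first, and its result is passed to the next transformation.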

document = await document.redact(entities=["EMAIL_ADDRESS", "PHONE_NUMBER"]).extract(properties).summarize().execute()
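To make the ordering rule concrete, here is a minimal, self-contained sketch of a deferred chain. It does not use Doctran itself: `Pipeline`, `redact_emails`, and `fake_remote_summary` are hypothetical stand-ins invented for this illustration, and the email regex is deliberately crude. The sketch records what the "remote" step receives, which shows why redaction should be queued before any step that sends text to a third party.

```python
import asyncio
import re
from typing import Awaitable, Callable, List

# Hypothetical stand-in for a transformation: an async function from text to text.
Step = Callable[[str], Awaitable[str]]


class Pipeline:
    """Queues steps while chaining; runs them in call order on execute()."""

    def __init__(self, content: str) -> None:
        self.content = content
        self._steps: List[Step] = []

    def then(self, step: Step) -> "Pipeline":
        self._steps.append(step)  # nothing runs yet, the step is only queued
        return self               # returning self is what makes chaining possible

    async def execute(self) -> str:
        text = self.content
        for step in self._steps:  # first queued, first executed
            text = await step(text)
        return text


seen_by_remote: List[str] = []  # records what would have left the machine


async def redact_emails(text: str) -> str:
    # Local step: replace anything that looks like an email address.
    return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "<EMAIL_ADDRESS>", text)


async def fake_remote_summary(text: str) -> str:
    # Stand-in for a remote call: log the input, then keep the first three words.
    seen_by_remote.append(text)
    return " ".join(text.split()[:3])


raw = "Contact john.doe@example.com for details."
safe = asyncio.run(Pipeline(raw).then(redact_emails).then(fake_remote_summary).execute())
leaky = asyncio.run(Pipeline(raw).then(fake_remote_summary).then(redact_emails).execute())

print(seen_by_remote[0])  # input to the remote step when redaction ran first
print(seen_by_remote[1])  # input to the remote step when redaction ran last
```

Both chains end with the same redacted string, but only the first one keeps the email address away from the remote step. Note also that top-level `await`, as used in the snippets in this README, works in notebooks; in a plain script the call would typically be wrapped with `asyncio.run` as shown here.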

Doctran Transformers

Extract

Given any valid JSON schema, uses OpenAI function calling to extract structured data from a document.

from doctran import ExtractProperty

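# extract() takes a list of properties, so the single property is wrapped in a list
properties = [
    ExtractProperty(
        name="millenial_or_boomer",
        description="A prediction of whether this document was written by a millennial or a boomer",
        type="string",
        enum=["millenial", "boomer"],
        required=True
    )
]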
document = await document.extract(properties=properties).execute()
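Because extraction is driven by a JSON schema, it can help to see what schema a property definition like the one above might correspond to. The following is a rough sketch, not Doctran's internal code: `Property` and `to_parameters_schema` are hypothetical names that only mirror the fields passed to `ExtractProperty`, and the output follows the general JSON Schema object shape used for function-call parameters.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Property:
    """Hypothetical mirror of the fields passed to ExtractProperty above."""
    name: str
    description: str
    type: str
    enum: Optional[List[str]] = None
    required: bool = True


def to_parameters_schema(props: List[Property]) -> Dict[str, Any]:
    """Build a JSON Schema object describing the requested properties."""
    schema: Dict[str, Any] = {"type": "object", "properties": {}, "required": []}
    for prop in props:
        entry: Dict[str, Any] = {"type": prop.type, "description": prop.description}
        if prop.enum is not None:
            entry["enum"] = prop.enum  # restrict the model to the listed values
        schema["properties"][prop.name] = entry
        if prop.required:
            schema["required"].append(prop.name)
    return schema


schema = to_parameters_schema([
    Property(
        name="millenial_or_boomer",
        description="A prediction of whether this document was written by a millennial or a boomer",
        type="string",
        enum=["millenial", "boomer"],
    )
])
print(json.dumps(schema, indent=2))
```

The `enum` constraint is what turns a free-text guess into a label that downstream code can rely on.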

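Redact

Uses a spaCy model to remove names, emails, phone numbers, and other sensitive information from a document. Runs locally to avoid sending sensitive data to third-party APIs.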

document = await document.redact(entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"]).execute()
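The idea of local redaction can be illustrated without any model at all. The sketch below swaps in plain regular expressions for the spaCy model, so it is a simplification and not how Doctran detects entities; the `<TYPE>` placeholder format and the `redact` helper are assumptions made for this example. Entity types such as PERSON are left out because names cannot be matched reliably with a pattern.

```python
import re
from typing import Dict, Iterable

# Simplified patterns for illustration only; real PII detection needs far more care.
PATTERNS: Dict[str, re.Pattern] = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def redact(text: str, entities: Iterable[str]) -> str:
    """Replace every match of the requested entity types with a <TYPE> placeholder."""
    for entity in entities:
        text = PATTERNS[entity].sub(f"<{entity}>", text)
    return text


sample = "Reach Michael Johnson at 418-492-3850 or michael.johnson@example.com."
redacted = redact(sample, ["EMAIL_ADDRESS", "PHONE_NUMBER"])
print(redacted)
```

Everything here runs in-process, which is the property that matters: the sensitive values are gone before any later step can send the text elsewhere.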

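Summarize

Summarizes the information in a document. token_limit can be passed to configure the size of the summary, but OpenAI may not strictly respect it.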

document = await document.summarize().execute()

Refine

Removes all information from a document unless it is relevant to a specific set of topics.

document = await document.refine(topics=['marketing', 'meetings']).execute()

Translate

Translates text into another language.

document = await document.translate(language="spanish").execute()

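Interrogate

Converts the information in a document into question-and-answer format. End-user queries usually take the form of questions, so converting information into questions and building indexes from those questions often yields better results when a vector database is used for context retrieval.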

document = await document.interrogate().execute()
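As a toy illustration of why indexing on questions can help, the sketch below matches a user query against the questions of a small Q&A set and returns the paired answer. It uses word-overlap (Jaccard) similarity as a stand-in for embedding similarity in a vector database, so it only demonstrates the shape of the idea; `tokens`, `jaccard`, and `answer` are hypothetical helpers, and the Q&A pairs are taken from the example output earlier in this README.

```python
import re
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap similarity, used here instead of vector similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# (question, answer) pairs; the index is built on the questions only.
qa_pairs: List[Tuple[str, str]] = [
    (
        "What is the purpose of this document?",
        "The purpose of this document is to provide important updates and "
        "discuss various topics that require the team's attention.",
    ),
    (
        "Who is commended for enhancing the company's network security?",
        "John Doe from the IT department is commended for enhancing the "
        "company's network security.",
    ),
]


def answer(query: str) -> str:
    """Return the answer whose question is most similar to the query."""
    query_tokens = tokens(query)
    _, best_answer = max(qa_pairs, key=lambda pair: jaccard(query_tokens, tokens(pair[0])))
    return best_answer


print(answer("Who was commended for network security?"))
```

A question-shaped query tends to share more structure with a stored question than with a paragraph of prose, which is the intuition behind indexing the questions.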

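Contributing

Contributions to Doctran are welcome! The best way to get started is to contribute a new document transformer. Transformers that don't rely on API calls (e.g. OpenAI) are especially valuable, since they run quickly and require no external dependencies.

Adding a new doctran transformer

Contributing a new transformer is straightforward.

Add a new class that extends DocumentTransformer or OpenAIDocumentTransformer
Implement the __init__ and transform methods
Add corresponding methods to the DocumentTransformationBuilder and Document classes to enable chaining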
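The three steps above can be sketched end to end as follows. This is a self-contained approximation rather than code that plugs into Doctran: the `Document`, `DocumentTransformer`, and `DocumentTransformationBuilder` classes below are local stand-ins that only borrow the names mentioned above, their signatures are assumptions, and `execute` is synchronous here for brevity even though the snippets in this README await it. `WhitespaceNormalizer` is a hypothetical transformer that needs no API call.

```python
import re
from abc import ABC, abstractmethod
from typing import List


class Document:
    """Local stand-in for a document; it only holds the text."""

    def __init__(self, content: str) -> None:
        self.transformed_content = content

    def normalize_whitespace(self) -> "DocumentTransformationBuilder":
        # Third step: an entry point on Document so a chain can start here.
        return DocumentTransformationBuilder(self).normalize_whitespace()


class DocumentTransformer(ABC):
    """Local stand-in base class; the real signature may differ."""

    @abstractmethod
    def transform(self, document: Document) -> Document:
        ...


class WhitespaceNormalizer(DocumentTransformer):
    """First and second steps: a new class with __init__ and transform."""

    def __init__(self) -> None:
        self._runs = re.compile(r"\s+")

    def transform(self, document: Document) -> Document:
        # Collapse every run of whitespace into one space and trim the ends.
        document.transformed_content = self._runs.sub(" ", document.transformed_content).strip()
        return document


class DocumentTransformationBuilder:
    """Local stand-in builder that queues transformers and runs them in order."""

    def __init__(self, document: Document) -> None:
        self.document = document
        self.transformers: List[DocumentTransformer] = []

    def normalize_whitespace(self) -> "DocumentTransformationBuilder":
        # Third step: a builder method so the transformer can appear mid-chain.
        self.transformers.append(WhitespaceNormalizer())
        return self

    def execute(self) -> Document:
        for transformer in self.transformers:
            self.document = transformer.transform(self.document)
        return self.document


result = Document("  Dear   Team,\n\n I hope  this finds you well. ").normalize_whitespace().execute()
print(result.transformed_content)
```

A transformer like this one runs entirely locally, which matches the kind of contribution the section above describes as especially valuable.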