instructor-embedding - An instruction-finetuned text embedding model

Project Introduction: Instructor-Embedding

Background and Overview

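This Instructor-Embedding project is a personal fork, created because the original repository is no longer updated. It aims to fix several technical problems and add features. In particular, it fixes compatibility with the sentence-transformers library (versions above 2.2.2), downloads models from Huggingface through the newer "snapshot download" API, and lets users choose where models are downloaded through the cache_dir parameter.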
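
As a rough illustration of the "snapshot download" mechanism mentioned above, the sketch below calls snapshot_download from the huggingface_hub package with an explicit cache_dir. How the fork wires cache_dir into INSTRUCTOR is not shown in this document, so resolve_cache_dir and download_model are hypothetical helper names used only for illustration, and the target directory is an assumption.

```python
from pathlib import Path


def resolve_cache_dir(cache_dir: str) -> Path:
    """Expand '~' and return an absolute path for the model cache (hypothetical helper)."""
    return Path(cache_dir).expanduser().resolve()


def download_model(repo_id: str, cache_dir: str) -> str:
    """Download all files of a Hub repository into cache_dir and return the local snapshot path."""
    # Imported lazily so this module loads even when huggingface_hub is not installed.
    from huggingface_hub import snapshot_download

    target = resolve_cache_dir(cache_dir)
    target.mkdir(parents=True, exist_ok=True)
    return snapshot_download(repo_id=repo_id, cache_dir=str(target))


# Example call (assumed directory; needs network access and huggingface_hub):
# local_path = download_model("hkunlp/instructor-large", "~/models/instructor")
```

Keeping the path handling separate from the download makes it easy to check where files will land before any network request is made.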

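The core model behind the project is Instructor👨‍🏫. It is an instruction-finetuned text embedding model that generates embeddings tailored to a given task (such as classification, retrieval, clustering, or text evaluation) and domain (such as science or finance). Notably, the user only has to supply a task instruction; no further finetuning is needed to produce the embeddings. Instructor achieves state-of-the-art results on 70 diverse embedding tasks.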

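Main Updates

January 2023: Restructured the code to support an easy-to-install package.
December 2022: Updated the model checkpoints trained with hard negatives.
December 2022: Released the paper together with the full code, project page, and model checkpoints.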

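Installation and Usage

Installation

To use the INSTRUCTOR embedding model on a local machine, first create a virtual environment. The following commands set one up: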

conda create -n instructor python=3.7
conda activate instructor
git clone https://github.com/HKUNLP/instructor-embedding
cd instructor-embedding
pip install -r requirements.txt

Then install the InstructorEmbedding package:

pip install InstructorEmbedding

Or install directly from source, from the repository root:

pip install -e .

Quick Start

Download a pretrained model, for example hkunlp/instructor-large, and pass it your sentences together with custom task instructions:

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')

text_instruction_pairs = [
    {"instruction": "Represent the Science title:", "text": "3D ActionSLAM: wearable person tracking in multi-floor environments"},
    {"instruction": "Represent the Medicine sentence for retrieving a duplicate sentence:", "text": "Recent studies have suggested that statins..."}
]

texts_with_instructions = [
    [pair["instruction"], pair["text"]] for pair in text_instruction_pairs
]

customized_embeddings = model.encode(texts_with_instructions)
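
Once embeddings are available, a common next step is to compare them. The sketch below is a minimal, dependency-free cosine similarity, assuming each embedding is a one-dimensional sequence of floats (for example, one row of customized_embeddings). The function name and the toy vectors are illustrative and not part of the InstructorEmbedding package.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two embedding rows (made-up values).
toy_a = [1.0, 0.0, 1.0]
toy_b = [1.0, 1.0, 0.0]
score = cosine_similarity(toy_a, toy_b)  # close to 0.5

# With the real model output, the call would look like:
# cosine_similarity(customized_embeddings[0], customized_embeddings[1])
```

For large batches a vectorized implementation would be faster; the pure-Python version is used here only to keep the example self-contained.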