Voy

A WASM vector similarity search engine written in Rust


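Tiny: 75KB gzipped, 69KB with Brotli compression.
Fast: built to give users a responsive search experience. Voy indexes embeddings with a k-d tree and provides fast search.
Tree-shakable: keeps the bundle size small and supports asynchronous use with modern Web APIs such as Web Workers.
Resumable: generate a portable embeddings index anywhere, anytime.
Worldwide: designed to deploy and run on CDN edge servers.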
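
To make the k-d tree idea concrete, here is a minimal sketch of building a tree and running a single nearest-neighbor query. It is an illustration only, not Voy's Rust implementation: the names `Point`, `KdNode`, `buildKdTree`, and `nearest` are hypothetical, and squared Euclidean distance is assumed as the metric.

```typescript
// Hypothetical types for illustration; not part of the voy-search API.
type Point = { id: string; vector: number[] };

interface KdNode {
  point: Point;
  axis: number; // dimension used to split at this node
  left: KdNode | null;
  right: KdNode | null;
}

// Build the tree by splitting on the median, cycling through the axes by depth.
function buildKdTree(points: Point[], depth = 0): KdNode | null {
  if (points.length === 0) return null;
  const axis = depth % points[0].vector.length;
  const sorted = [...points].sort((a, b) => a.vector[axis] - b.vector[axis]);
  const mid = Math.floor(sorted.length / 2);
  return {
    point: sorted[mid],
    axis,
    left: buildKdTree(sorted.slice(0, mid), depth + 1),
    right: buildKdTree(sorted.slice(mid + 1), depth + 1),
  };
}

function squaredDistance(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Find the single closest point, skipping subtrees that cannot contain a better match.
function nearest(root: KdNode | null, query: number[]): Point | null {
  let best: Point | null = null;
  let bestDist = Infinity;
  const visit = (node: KdNode | null): void => {
    if (node === null) return;
    const dist = squaredDistance(node.point.vector, query);
    if (dist < bestDist) {
      bestDist = dist;
      best = node.point;
    }
    const delta = query[node.axis] - node.point.vector[node.axis];
    const goLeft = delta < 0;
    visit(goLeft ? node.left : node.right);
    // Only cross the splitting plane if it is closer than the best match so far.
    if (delta * delta < bestDist) visit(goLeft ? node.right : node.left);
  };
  visit(root);
  return best;
}
```

The pruning step is what saves work compared with scanning every vector, although the benefit generally shrinks as the number of dimensions grows.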

🚜 Work in Progress

Voy is under active development, so the API is not yet stable. Expect breaking changes before the upcoming 1.0 release.

Here is a preview of what we are working on:

Built-in text transformation in WebAssembly: for now, voy relies on JavaScript libraries such as transformers.js to generate text embeddings. See Usage for details.

Index update: currently, the index must be rebuilt whenever a resource is updated.

TypeScript support: due to limitations of the WASM tooling, complex data types are not generated automatically.

Installation

# with npm
npm i voy-search

# with Yarn
yarn add voy-search

# with pnpm
pnpm add voy-search

API

`class Voy`

The Voy class encapsulates an index and exposes all the public methods Voy offers.

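class Voy {
  /**
   * Instantiating Voy with a resource builds the index. If no resource is
   * given, an empty index is built. Calling Voy.index() later overrides the empty index.
   * @param {Resource | undefined} resource
   */
  constructor(resource?: Resource);
  /**
   * Index the given resource. Voy.index() is designed for the case where a Voy
   * instance was created without a resource. It overrides the existing index.
   * To keep the existing index, use Voy.add() to add the resource to it instead.
   * @param {Resource} resource
   */
  index(resource: Resource): void;
  /**
   * Search the top k results for the given query embedding.
   * @param {Float32Array} query - Query embedding
   * @param {number} k - Number of items in the search result
   * @returns {SearchResult}
   */
  search(query: Float32Array, k: number): SearchResult;
  /**
   * Add the given resource to the index.
   * @param {Resource} resource
   */
  add(resource: Resource): void;
  /**
   * Remove the given resource from the index.
   * @param {Resource} resource
   */
  remove(resource: Resource): void;
  /**
   * Remove all resources from the index.
   */
  clear(): void;
  /**
   * Return the size of the index.
   * @returns {number}
   */
  size(): number;
  /**
   * Serialize the Voy instance.
   * @returns {string}
   */
  serialize(): string;
  /**
   * Deserialize a serialized index into a Voy instance.
   * @param {string} serialized_index
   * @returns {Voy}
   */
  static deserialize(serialized_index: string): Voy;
}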

interface Resource {
  embeddings: Array<{
    id: string; // id of the resource
    title: string; // title of the resource
    url: string; // url to the resource
    embeddings: number[]; // embeddings of the resource
  }>;
}

interface SearchResult {
  neighbors: Array<{
    id: string; // id of the resource
    title: string; // title of the resource
    url: string; // url to the resource
  }>;
}
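
Embedding libraries often return typed arrays, while `Resource` expects plain `number[]` embeddings. The sketch below shows one way to assemble a `Resource` from page metadata and raw vectors. The `PageInfo` type and the `toResource` helper are hypothetical and not part of voy-search. The dimension check is an assumption: a k-d tree generally needs every vector to have the same length, and the positional ids are only a placeholder scheme.

```typescript
// Shapes copied from the interfaces documented above.
interface EmbeddedResource {
  id: string;
  title: string;
  url: string;
  embeddings: number[];
}

interface Resource {
  embeddings: EmbeddedResource[];
}

// Hypothetical input type for illustration.
interface PageInfo {
  title: string;
  url: string;
}

// Pair each page with its vector and convert typed arrays to number[].
function toResource(pages: PageInfo[], vectors: ArrayLike<number>[]): Resource {
  if (pages.length !== vectors.length) {
    throw new Error(`Expected ${pages.length} vectors, got ${vectors.length}`);
  }
  // Assumption: all embeddings must share one dimensionality.
  const dimension = vectors.length > 0 ? vectors[0].length : 0;
  const embeddings = pages.map((page, i) => {
    if (vectors[i].length !== dimension) {
      throw new Error(`Vector ${i} has length ${vectors[i].length}, expected ${dimension}`);
    }
    return {
      id: String(i), // placeholder id: position in the input
      title: page.title,
      url: page.url,
      embeddings: Array.from(vectors[i]),
    };
  });
  return { embeddings };
}
```

The result can then be passed to `new Voy(resource)` or to `index.add(resource)` as described above.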

Individual Functions

Besides the Voy class, Voy also exports all the instance methods as individual functions.

`index(resource: Resource): SerializedIndex`

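It indexes the given resource and returns a serialized index.

Parameters

interface Resource {
  embeddings: Array<{
    id: string; // id of the resource
    title: string; // title of the resource
    url: string; // url to the resource
    embeddings: number[]; // embeddings of the resource
  }>;
}

Return value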

type SerializedIndex = string;

`search(index: SerializedIndex, query: Query, k: NumberOfResult): SearchResult`

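It deserializes the given index and searches for the k nearest neighbors of the query.

Parameters

type SerializedIndex = string;

type Query = Float32Array; // embedding of the search query

type NumberOfResult = number; // number of top results to return (k)

Return value

interface SearchResult {
  neighbors: Array<{
    id: string; // id of the resource
    title: string; // title of the resource
    url: string; // url to the resource
  }>;
}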

`add(index: SerializedIndex, resource: Resource): SerializedIndex`

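It adds the resource to the index and returns the updated serialized index.

Parameters

type SerializedIndex = string;

interface Resource {
  embeddings: Array<{
    id: string; // id of the resource
    title: string; // title of the resource
    url: string; // url to the resource
    embeddings: number[]; // embeddings of the resource
  }>;
}

Return value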

type SerializedIndex = string;

`remove(index: SerializedIndex, resource: Resource): SerializedIndex`

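It removes the resource from the index and returns the updated serialized index.

Parameters

type SerializedIndex = string;

interface Resource {
  embeddings: Array<{
    id: string; // id of the resource
    title: string; // title of the resource
    url: string; // url to the resource
    embeddings: number[]; // embeddings of the resource
  }>;
}

Return value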

type SerializedIndex = string;

`clear(index: SerializedIndex): SerializedIndex`

It removes all items from the index and returns an empty serialized index.

Parameters

type SerializedIndex = string;

Return value

type SerializedIndex = string;

`size(index: SerializedIndex): number`

It returns the size of the index.

Parameters

type SerializedIndex = string;
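
Because the individual functions take and return a `SerializedIndex` string, the index can be kept in any string store between calls. The following sketch shows one way to load, update, and save it in a single step. `StringStore` and `updateStoredIndex` are hypothetical names, not part of voy-search. The interface is shaped so that `window.localStorage` would satisfy it.

```typescript
type SerializedIndex = string;

// Minimal storage interface (hypothetical); window.localStorage matches this shape.
interface StringStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Load the stored index (or null if absent), apply an update, and persist the result.
function updateStoredIndex(
  store: StringStore,
  key: string,
  update: (current: SerializedIndex | null) => SerializedIndex
): SerializedIndex {
  const next = update(store.getItem(key));
  store.setItem(key, next);
  return next;
}
```

With the functions documented above, a call could look like `updateStoredIndex(localStorage, "voy-index", (current) => current === null ? index(resource) : add(current, resource))`, where the key `"voy-index"` is an arbitrary choice.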

Usage

With Transformers

For now, voy relies on libraries such as transformers.js and web-ai to generate embeddings for text:

import { TextModel } from "@visheratin/web-ai";

const { Voy } = await import("voy-search");

const phrases = [
  "That is a very happy person",
  "That is a happy dog",
  "Today is a sunny day",
];
const query = "That is a happy person";

// Create text embeddings
const model = await (await TextModel.create("gtr-t5-quant")).model;
const processed = await Promise.all(phrases.map((q) => model.process(q)));

// Index the embeddings with voy
const data = processed.map(({ result }, i) => ({
  id: String(i),
  title: phrases[i],
  url: `/path/${i}`,
  embeddings: result,
}));
const resource = { embeddings: data };
const index = new Voy(resource);

// Perform a similarity search for the query embedding
const q = await model.process(query);
const result = index.search(q.result, 1);

// Display the search results
result.neighbors.forEach((neighbor) =>
  console.log(`✨ voy similarity search result: "${neighbor.title}"`)
);

Multiple Indexes

import { TextModel } from "@visheratin/web-ai";

const { Voy } = await import("voy-search");
const phrases = [
  "That is a very happy person",
  "That is a happy dog",
  "Today is a sunny day",
  "Sunflowers are blooming",
];
const model = await (await TextModel.create("gtr-t5-quant")).model;
const processed = await Promise.all(phrases.map((q) => model.process(q)));

const data = processed.map(({ result }, i) => ({
  id: String(i),
  title: phrases[i],
  url: `/path/${i}`,
  embeddings: result,
}));
const resourceA = { embeddings: data.slice(0, 2) };
const resourceB = { embeddings: data.slice(2) };

const indexA = new Voy(resourceA);
const indexB = new Voy(resourceB);
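
Once several indexes exist, one query can be run against each of them. The sketch below is one possible way to do that. The `searchAll` helper and the `Searchable` interface are hypothetical: `Searchable` describes anything with a `search(query, k)` method, which the `Voy` class documented above provides. `SearchResult` as documented carries no distance value, so the results are tagged with their source index rather than merged into a single ranking.

```typescript
// Shapes copied from the SearchResult interface documented above.
interface Neighbor {
  id: string;
  title: string;
  url: string;
}

interface SearchResult {
  neighbors: Neighbor[];
}

// Structural stand-in (hypothetical) for a Voy instance.
interface Searchable {
  search(query: Float32Array, k: number): SearchResult;
}

type LabeledNeighbor = Neighbor & { source: string };

// Query every index with the same embedding and tag each hit with its index name.
function searchAll(
  indexes: { [name: string]: Searchable },
  query: Float32Array,
  k: number
): LabeledNeighbor[] {
  const results: LabeledNeighbor[] = [];
  for (const source of Object.keys(indexes)) {
    for (const neighbor of indexes[source].search(query, k).neighbors) {
      results.push({ ...neighbor, source });
    }
  }
  return results;
}
```

With the example above, this could be called as `searchAll({ a: indexA, b: indexB }, q.result, 1)`, where `q` is a query embedding produced by the same model.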