UCall

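JSON Remote Procedure Calls Library
Up to 100x Faster than FastAPI

Most modern networking is built either on slow and ambiguous REST APIs or on unnecessarily complex gRPC. FastAPI, for example, looks very approachable. We aim for the same ease of use, or an even simpler one.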

FastAPI:

pip install fastapi uvicorn

from fastapi import FastAPI
import uvicorn

server = FastAPI()

@server.get('/sum')
def sum(a: int, b: int):
    return a + b

uvicorn.run(...)

UCall:

pip install ucall

from ucall.posix import Server
# On Linux 5.19+, use: from ucall.uring import Server

server = Server()

@server
def sum(a: int, b: int):
    return a + b

server.run()
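For context on what travels over the wire: UCall is a JSON-RPC library, so a call to `sum` would plausibly be expressed as a JSON-RPC 2.0 request object. The sketch below only builds and parses such messages with the standard library. The helper names `make_request` and `parse_response` are hypothetical and are not part of UCall.

```python
import json

def make_request(method: str, params: dict, request_id: int) -> str:
    """Serialize a JSON-RPC 2.0 request object (hypothetical helper, not a UCall API)."""
    return json.dumps({
        'jsonrpc': '2.0',
        'method': method,
        'params': params,
        'id': request_id,
    })

def parse_response(payload: str):
    """Return the `result` of a JSON-RPC 2.0 response, or raise if it carries an `error`."""
    message = json.loads(payload)
    if 'error' in message:
        raise RuntimeError(message['error'].get('message', 'unknown error'))
    return message['result']

request = make_request('sum', {'a': 2, 'b': 3}, request_id=1)
# A compliant server would answer with a message shaped like this one:
reply = json.dumps({'jsonrpc': '2.0', 'result': 5, 'id': 1})
print(request)
print(parse_response(reply))
```

The `id` field lets a client match responses to requests when several calls share one connection.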

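Handling a trivial FastAPI call takes over a millisecond on a recent 8-core CPU. In that time, light travels about 300 km in vacuum, or roughly 200 km through optical fiber, enough to reach a neighboring city or country. So how does UCall compare to FastAPI and gRPC?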
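The distance comparison is simple arithmetic, sketched here for one millisecond. The refractive index of 1.47 is an assumed typical value for silica fiber, not a figure from this document.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in vacuum
FIBER_REFRACTIVE_INDEX = 1.47          # assumption: typical silica fiber

def distance_km(seconds: float, refractive_index: float = 1.0) -> float:
    """Distance light covers in `seconds` inside a medium with the given refractive index."""
    return SPEED_OF_LIGHT_KM_PER_S / refractive_index * seconds

one_millisecond = 1e-3
vacuum_km = distance_km(one_millisecond)
fiber_km = distance_km(one_millisecond, FIBER_REFRACTIVE_INDEX)
print(f"{vacuum_km:.0f} km in vacuum, {fiber_km:.0f} km in fiber")
```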

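| Setup                   |   🔁   | Server | Single-client latency | 32-client throughput |
| :---------------------- | :---: | :----: | --------------------: | -------------------: |
| FastAPI over REST       |   ❌   |   🐍    |              1'203 μs |            3'184 rps |
| FastAPI over WebSocket  |   ✅   |   🐍    |                 86 μs |         11'356 rps ¹ |
| gRPC ²                  |   ✅   |   🐍    |                164 μs |            9'849 rps |
|                         |       |        |                       |                      |
| UCall with POSIX        |   ❌   |   C    |                 62 μs |           79'000 rps |
| UCall with io_uring     |   ✅   |   🐍    |                 40 μs |          210'000 rps |
| UCall with io_uring     |   ✅   |   C    |                 22 μs |          231'000 rps |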

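Table legend

All benchmarks were run on AWS general-purpose instances with the Ubuntu 22.10 AMI. It is the first major AMI to ship with Linux kernel 5.19, which brings much wider io_uring support for networking operations. These specific numbers were obtained on beefy c7g.metal instances with Graviton 3 chips.

- The 🔁 column marks whether the TCP/IP connection is reused across subsequent requests.
- The "Server" column gives the programming language in which the server was implemented.
- The "latency" column reports the time between sending a request and receiving a response. μ stands for micro, so μs means microseconds.
- The "throughput" column reports the number of requests handled per second when the same server application is queried by multiple client processes running on the same machine.

¹ FastAPI couldn't process concurrent requests over WebSockets.

² We tried generating C++ backends with gRPC, but the numbers, suspiciously, weren't any better. There is also an async gRPC option that we haven't tried.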
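To put the table into relative terms, the sketch below computes each setup's gain over the FastAPI-over-REST baseline. The numbers are copied from the table above; nothing else is measured here.

```python
# (single-client latency in μs, 32-client throughput in rps), copied from the table above
results = {
    'FastAPI over REST': (1203, 3184),
    'FastAPI over WebSocket': (86, 11356),
    'gRPC': (164, 9849),
    'UCall with POSIX (C)': (62, 79000),
    'UCall with io_uring (Python)': (40, 210000),
    'UCall with io_uring (C)': (22, 231000),
}

baseline_latency, baseline_throughput = results['FastAPI over REST']

def speedups(name: str):
    """Return (latency gain, throughput gain) relative to FastAPI over REST."""
    latency, throughput = results[name]
    return baseline_latency / latency, throughput / baseline_throughput

for name in results:
    latency_gain, throughput_gain = speedups(name)
    print(f"{name:30s} latency {latency_gain:5.1f}x  throughput {throughput_gain:5.1f}x")
```

With these figures, the C server on io_uring comes out at roughly 55x lower latency and roughly 73x higher throughput than FastAPI over REST.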

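How is that possible?!

How can a small project with only a few thousand lines of code compete with two of the best-known networking libraries? UCall stands on the shoulders of giants:

- io_uring for interrupt-less IO.
  - io_uring_prep_read_fixed on 5.1+.
  - io_uring_prep_accept_direct on 5.19+.
  - io_uring_register_files_sparse on 5.19+.
  - IORING_SETUP_COOP_TASKRUN, optional, on 5.19+.
  - IORING_SETUP_SINGLE_ISSUER, optional, on 6.0+.
- SIMD-accelerated parsers with manual memory control.
  - [simdjson][simdjson] to parse JSON faster than gRPC can unpack ProtoBuf.
  - [Turbo-Base64][base64] to decode binary values from Base64 form.
  - [picohttpparser][picohttpparser] to navigate HTTP headers.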
You have already seen the round-trip latency and the throughput in requests per second. Want to see the bandwidth? Try it yourself!

@server
def echo(data: bytes):
    return data
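JSON has no native binary type, and the list of dependencies above mentions decoding binary values from Base64, so a `bytes` argument such as `data` presumably travels as a Base64 string. The sketch below assumes that encoding and uses only the standard library; how UCall's own client frames binary data is not shown in this document.

```python
import base64
import json

def encode_bytes_param(data: bytes) -> str:
    """Encode raw bytes as an ASCII Base64 string that fits into JSON."""
    return base64.b64encode(data).decode('ascii')

def decode_bytes_param(text: str) -> bytes:
    """Decode a Base64 string back into bytes, rejecting invalid characters."""
    return base64.b64decode(text, validate=True)

payload = bytes(range(256)) * 4  # 1 KiB of sample binary data
request = json.dumps({
    'jsonrpc': '2.0',
    'method': 'echo',
    'params': {'data': encode_bytes_param(payload)},
    'id': 1,
})
received = json.loads(request)['params']['data']
assert decode_bytes_param(received) == payload
print(len(payload), 'raw bytes ->', len(received), 'Base64 characters')
```

Base64 inflates the payload by about a third, which is worth keeping in mind when comparing bandwidth figures.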

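More Functionality than FastAPI

FastAPI supports native types, while UCall also supports numpy.ndarray, PIL.Image, and other custom types. This comes in handy when you build real applications or want to deploy multimodal AI, as we do with UForm.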

from ucall.rich_posix import Server
import numpy
import PIL.Image
import uform

server = Server()
model = uform.get_model('unum-cloud/uform-vl-multilingual')

@server
def vectorize(description: str, photo: PIL.Image.Image) -> numpy.ndarray:
    image = model.preprocess_image(photo)
    tokens = model.preprocess_text(description)
    joint_embedding = model.encode_multimodal(image=image, text=tokens)

    return joint_embedding.cpu().detach().numpy()

We also have our own optional Client class that helps with those custom types.

from ucall.client import Client

client = Client()
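# Explicit JSON-RPC call
# (`description` and `photo` are assumed to be defined already;
# the parameter names must match the server-side `vectorize` signature):
response = client({
    'method': 'vectorize',
    'params': {
        'description': description,
        'photo': photo,
    },
    'jsonrpc': '2.0',
    'id': 100,
})
# Or the same call with syntactic sugar:
response = client.vectorize(description=description, photo=photo)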
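The two call styles above carry the same information. As an illustration of how attribute-style sugar can map onto an explicit JSON-RPC request, here is a minimal sketch built on `__getattr__`. `SugarClient` and `fake_transport` are hypothetical names; this is not UCall's Client implementation.

```python
import itertools

class SugarClient:
    """Hypothetical stand-in for illustration; not UCall's Client implementation."""

    def __init__(self, transport):
        self._transport = transport    # callable: request dict -> response dict
        self._ids = itertools.count(1)

    def __call__(self, request: dict) -> dict:
        # Explicit form: the caller supplies the full JSON-RPC request.
        return self._transport(request)

    def __getattr__(self, method: str):
        # Invoked only for names not found normally, i.e. remote method names.
        def remote_call(**params):
            return self({
                'method': method,
                'params': params,
                'jsonrpc': '2.0',
                'id': next(self._ids),
            })
        return remote_call

def fake_transport(request: dict) -> dict:
    """In-process transport that echoes the request back as the result."""
    return {'jsonrpc': '2.0', 'id': request['id'], 'result': request}

client = SugarClient(fake_transport)
response = client.vectorize(description='a red shoe')
print(response['result']['method'], response['result']['params'])
```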

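CLI like cURL

Aside from the Python Client, we provide an easy-to-use command-line interface, which comes with pip install ucall. It lets you call a remote server and upload files, with direct support for images and NumPy arrays. Translating the previous example into a Bash script, to call a server on the same machine: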

ucall vectorize description='Product description' -i photo=./local/path.png

To address a remote server:

ucall vectorize description='Product description' -i photo=./local/path.png --uri 0.0.0.0 --port 8545

To print the docs, use ucall -h:

usage: ucall [-h] [--uri URI] [--port PORT] [-f [FILE ...]] [-i [IMAGE ...]] [--positional [POSITIONAL ...]] method [kwargs ...]

UCall Client CLI

positional arguments:
  method                method name
  kwargs                method arguments

options:
  -h, --help            show this help message and exit
  --uri URI             server URI
  --port PORT           server port
  -f [FILE ...], --file [FILE ...]
                        method positional arguments
  -i [IMAGE ...], --image [IMAGE ...]
                        method positional arguments
  --positional [POSITIONAL ...]
                        method positional arguments

You can also annotate types explicitly, to distinguish integers, floats, and strings and avoid ambiguity.

ucall auth id=256
ucall auth id:int=256
ucall auth id:str=256
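To make the ambiguity concrete, here is a sketch of how `name=value` and `name:type=value` arguments might be interpreted. `parse_kwarg` is a hypothetical helper, and the inference order for unannotated values (int, then float, then str) is an assumption, not documented UCall behavior.

```python
def parse_kwarg(argument: str):
    """Parse 'name=value' or 'name:type=value' into (name, typed_value).

    Hypothetical sketch: unannotated values are tried as int, then float,
    and otherwise kept as str.
    """
    key, separator, raw = argument.partition('=')
    if not separator:
        raise ValueError(f'expected name=value, got {argument!r}')
    name, _, type_name = key.partition(':')
    converters = {'int': int, 'float': float, 'str': str}
    if type_name:
        if type_name not in converters:
            raise ValueError(f'unsupported type annotation: {type_name!r}')
        return name, converters[type_name](raw)
    for convert in (int, float):
        try:
            return name, convert(raw)
        except ValueError:
            continue
    return name, raw

for argument in ('id=256', 'id:int=256', 'id:str=256'):
    print(argument, '->', parse_kwarg(argument))
```

Without the annotation, `256` could equally be a number or a string, which is why `id:str=256` is needed when the server expects text.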


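Free Tier Throughput

We will leave bandwidth measurements to enthusiasts, but will share some more numbers.
The general logic is that you can't squeeze high performance out of Free-Tier machines.
Currently AWS provides the following options: `t2.micro` and `t4g.small`, on older Intel and newer Graviton 2 chips respectively.
This library is fast enough that it needs only 1 core, so you can run a fast server even on a tiny Free-Tier machine!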

| Setup                   |   🔁   | Server | Clients | `t2.micro` | `t4g.small` |
| :---------------------- | :---: | :----: | :-----: | ---------: | ----------: |
| FastAPI over REST       |   ❌   |   🐍    |    1    |    328 rps |     424 rps |
| FastAPI over WebSocket  |   ✅   |   🐍    |    1    |  1'504 rps |   3'051 rps |
| gRPC                    |   ✅   |   🐍    |    1    |  1'169 rps |   1'974 rps |
|                         |       |        |         |            |             |
| UCall with POSIX        |   ❌   |   C    |    1    |  1'082 rps |   2'438 rps |
| UCall with io_uring     |   ✅   |   C    |    1    |          - |   5'864 rps |
| UCall with POSIX        |   ❌   |   C    |   32    |  3'399 rps |  39'877 rps |
| UCall with io_uring     |   ✅   |   C    |   32    |          - |  88'455 rps |

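In this case, every server was bombarded with requests from 1 or 32 other instances in the same availability zone.
If you want to reproduce these benchmarks, check out the [`sum` examples][sum-examples] on GitHub.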

Quick Start

For Python:

```sh
pip install ucall
```

For CMake projects:

include(FetchContent)
FetchContent_Declare(
    ucall
    GIT_REPOSITORY https://github.com/unum-cloud/ucall
    GIT_SHALLOW TRUE
)
FetchContent_MakeAvailable(ucall)
include_directories(${ucall_SOURCE_DIR}/include)

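The C usage example is more verbose than the Python one. We wanted to keep it as lightweight as possible and to allow optional arguments without dynamic allocations or named lookups. So, unlike in the Python layer, we expect the user to extract the arguments from the call context manually, with ucall_param_named_i64() and its siblings.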

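// Uses the C API from C++ (note the brace initializers).
#include <cstdint>
#include <cstdio>
#include <ucall/ucall.h>

// Handler for "sum": reads the named integers "a" and "b" and replies with their sum.
static void sum(ucall_call_t call, ucall_callback_tag_t) {
    int64_t a{}, b{};
    char printed_sum[256]{};
    bool got_a = ucall_param_named_i64(call, "a", 0, &a);
    bool got_b = ucall_param_named_i64(call, "b", 0, &b);
    if (!got_a || !got_b)
        return ucall_call_reply_error_invalid_params(call);

    // int64_t is not always `long long`, so cast explicitly to match "%lld".
    int len = std::snprintf(printed_sum, sizeof(printed_sum), "%lld", static_cast<long long>(a + b));
    ucall_call_reply_content(call, printed_sum, len);
}

int main(int argc, char** argv) {

    ucall_server_t server{};
    ucall_config_t config{};

    ucall_init(&config, &server);                   // create the server from a zero-initialized config
    ucall_add_procedure(server, "sum", &sum, NULL); // register the handler under the name "sum"
    ucall_take_calls(server, 0);                    // process incoming calls
    ucall_free(server);
    return 0;
}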

Roadmap