polars - A high-performance data analysis engine with multi-language support

Documentation: Python - Rust - Node.js - R | StackOverflow: Python - Rust - Node.js - R | User guide | Discord

Polars: Blazingly fast DataFrames in Rust, Python, Node.js, R, and SQL

Polars is a DataFrame interface on top of an OLAP query engine. It is implemented in Rust and uses the Apache Arrow columnar format as its memory model.

Lazy | eager execution
Multi-threaded
SIMD
Query optimization
Powerful expression API
Hybrid streaming (larger-than-RAM datasets)
Rust | Python | NodeJS | R | ...

To learn more, read the user guide.

Python

>>> import polars as pl
>>> df = pl.DataFrame(
...     {
...         "A": [1, 2, 3, 4, 5],
...         "fruits": ["banana", "banana", "apple", "apple", "banana"],
...         "B": [5, 4, 3, 2, 1],
...         "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
...     }
... )

# highly parallel execution and a very expressive query language
>>> df.sort("fruits").select(
...     "fruits",
...     "cars",
...     pl.lit("fruits").alias("literal_string_fruits"),
...     pl.col("B").filter(pl.col("cars") == "beetle").sum(),
...     pl.col("A").filter(pl.col("B") > 2).sum().over("cars").alias("sum_A_by_cars"),
...     pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),
...     pl.col("A").reverse().over("fruits").alias("rev_A_by_fruits"),
...     pl.col("A").sort_by("B").over("fruits").alias("sort_A_by_B_by_fruits"),
... )
shape: (5, 8)
┌──────────┬──────────┬──────────────┬─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ fruits   ┆ cars     ┆ literal_stri ┆ B   ┆ sum_A_by_ca ┆ sum_A_by_fr ┆ rev_A_by_fr ┆ sort_A_by_B │
│ ---      ┆ ---      ┆ ng_fruits    ┆ --- ┆ rs          ┆ uits        ┆ uits        ┆ _by_fruits  │
│ str      ┆ str      ┆ ---          ┆ i64 ┆ ---         ┆ ---         ┆ ---         ┆ ---         │
│          ┆          ┆ str          ┆     ┆ i64         ┆ i64         ┆ i64         ┆ i64         │
╞══════════╪══════════╪══════════════╪═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 4           ┆ 4           │
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 3           ┆ 3           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 5           ┆ 5           │
│ "banana" ┆ "audi"   ┆ "fruits"     ┆ 11  ┆ 2           ┆ 8           ┆ 2           ┆ 2           │
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 1           ┆ 1           │
└──────────┴──────────┴──────────────┴─────┴─────────────┴─────────────┴─────────────┴─────────────┘

SQL

>>> df = pl.scan_csv("docs/data/iris.csv")
>>> ## OPTION 1
>>> # run SQL queries at the frame level
>>> df.sql("""
... SELECT species,
...   AVG(sepal_length) AS avg_sepal_length
... FROM self
... GROUP BY species
... """).collect()
shape: (3, 2)
┌────────────┬──────────────────┐
│ species    ┆ avg_sepal_length │
│ ---        ┆ ---              │
│ str        ┆ f64              │
╞════════════╪══════════════════╡
│ Virginica  ┆ 6.588            │
│ Versicolor ┆ 5.936            │
│ Setosa     ┆ 5.006            │
└────────────┴──────────────────┘
>>> ## OPTION 2
>>> # use pl.sql() to operate on the global context
>>> df2 = pl.LazyFrame({
...    "species": ["Setosa", "Versicolor", "Virginica"],
...    "blooming_season": ["Spring", "Summer", "Fall"]
... })
>>> pl.sql("""
... SELECT df.species,
...     AVG(df.sepal_length) AS avg_sepal_length,
...     df2.blooming_season
... FROM df
... LEFT JOIN df2 ON df.species = df2.species
... GROUP BY df.species, df2.blooming_season
... """).collect()

SQL commands can also be run directly from your terminal using the Polars CLI:

# run an inline SQL query
> polars -c "SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/data/iris.csv') GROUP BY species;"

# run interactively
> polars
Polars CLI v0.3.0
Type .help for help.

> SELECT species, AVG(sepal_length) AS avg_sepal_length, AVG(sepal_width) AS avg_sepal_width FROM read_csv('docs/data/iris.csv') GROUP BY species;

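Refer to the Polars CLI repository for more information.

Performance 🚀🚀

Blazingly fast

Polars is very fast. In fact, it is one of the best-performing solutions available. See the TPC-H benchmark results.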

Lightweight

Polars is also very lightweight. It comes with zero required dependencies, and this shows in the import times:

polars: 70ms
numpy: 104ms
pandas: 520ms
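
Import times like these depend on the machine and the installed versions. The following is a rough sketch of one way to measure them yourself, using only the standard library; it is not how the figures above were produced. The helper name import_time_ms is hypothetical. Each import runs in a fresh interpreter so that nothing is already cached, and the interpreter start-up time is subtracted as a baseline.

```python
import importlib.util
import subprocess
import sys
import time


def import_time_ms(module_name: str) -> float:
    """Estimate the cold import time of a module, in milliseconds."""

    def run_ms(code: str) -> float:
        # Use a fresh interpreter per measurement so that nothing is
        # already cached in sys.modules from an earlier import.
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", code], check=True)
        return (time.perf_counter() - start) * 1000.0

    baseline = run_ms("pass")  # interpreter start-up cost alone
    return run_ms(f"import {module_name}") - baseline


if __name__ == "__main__":
    for name in ("polars", "numpy", "pandas"):
        # Skip packages that are not installed in this environment.
        if importlib.util.find_spec(name) is not None:
            print(f"{name}: {import_time_ms(name):.0f}ms")
```

A single run is noisy, and the result can even come out slightly negative for very cheap imports, so averaging several runs gives a fairer comparison.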

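Handles larger-than-RAM data

If your data does not fit into memory, Polars' query engine can process your query (or parts of it) in a streaming fashion. This drastically reduces memory requirements, so you might be able to process a 250GB dataset on your laptop. Call collect(streaming=True) to run the query in streaming mode. (This might be a little slower, but it is still very fast!)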

Setup

Python

Install the latest Polars version with:

pip install polars

We also have a conda package (conda install -c conda-forge polars), but pip is the preferred way to install Polars.

Install Polars with all optional dependencies.

pip install 'polars[all]'

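You can also install a subset of the optional dependencies.

pip install 'polars[numpy,pandas,pyarrow]'

See the user guide for more details on optional dependencies.

To see the current Polars version and a full list of its optional dependencies, run: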

pl.show_versions()
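
If Polars itself is not importable, or you only want to check a few packages, a similar report can be sketched with the standard library alone. This is not what pl.show_versions() does internally. The helper name installed_versions and the package list are illustrative assumptions.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Hand-picked list for illustration; extend it with the extras you rely on.
print(installed_versions(["polars", "numpy", "pandas", "pyarrow"]))
```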

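Releases currently happen quite often (weekly or every few days), so it is a good idea to update Polars regularly to get the latest bug fixes and features.

Rust

You can take the latest release from crates.io, or, if you want to use the newest features and performance improvements, point to the main branch of this repo.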

polars = { git = "https://github.com/pola-rs/polars", rev = "<optional git tag>" }

Requires Rust version >=1.71.

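Contributing

Want to contribute? Read our contributing guide.

Python: compile Polars from source

If you want a bleeding-edge release or maximal performance, you should compile Polars from source.

This can be done by going through the following steps in sequence:

Install the latest Rust compiler
Install maturin: pip install maturin
cd py-polars and choose one of the following: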

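make build-release, fastest binary, very long compile times
make build-opt, fast binary with debug symbols, long compile times
make build-debug-opt, medium-speed binary with debug assertions and symbols, medium compile times
make build, slow binary with debug assertions and symbols, fast compile times

Append -native (e.g. make build-release-native) to enable further optimizations specific to your CPU. Note that this produces a non-portable binary/wheel.

Note that the Rust crate implementing the Python bindings is called py-polars to distinguish it from the wrapped Rust crate polars itself. However, both the Python package and the Python module are named polars, so you can pip install polars and import polars.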

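Use custom Rust functions in Python

Extending Polars with UDFs compiled in Rust is easy. We expose PyO3 extensions for the DataFrame and Series data structures. See more at https://github.com/pola-rs/pyo3-polars.

Going big...

Do you expect more than 2^32 (about 4.2 billion) rows? Compile Polars with the bigidx feature flag or, for Python users, install pip install polars-u64-idx.

Don't use this unless you hit the row limit, as the default build of Polars is faster and consumes less memory.

Legacy

Do you want Polars to run on an old CPU (e.g. dating from before 2011), or on an x86-64 build of Python on Apple Silicon under Rosetta? Install pip install polars-lts-cpu. This version of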
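
The arithmetic behind the limit, as a small sketch. The helper name needs_big_index is hypothetical, and treating exactly 2^32 rows as still within the default build follows the wording above rather than a documented boundary.

```python
# Row limit quoted for the default build: 2^32 rows.
U32_ROW_LIMIT = 2**32  # 4_294_967_296, about 4.2 billion


def needs_big_index(n_rows: int) -> bool:
    """Return True if the expected row count exceeds the quoted 2^32 limit."""
    return n_rows > U32_ROW_LIMIT


print(U32_ROW_LIMIT)
print(needs_big_index(1_000_000_000))   # one billion rows
print(needs_big_index(10_000_000_000))  # ten billion rows
```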