.. image:: https://raw.githubusercontent.com/mars-project/mars/master/docs/source/images/mars-logo-title.png

|PyPI version| |Docs| |Build| |Coverage| |Quality| |License|

Mars is a tensor-based unified framework for large-scale data computation
that scales NumPy, pandas, scikit-learn and many other libraries.

`Documentation`_, `中文文档`_

Installation
------------

Mars is easy to install with ``pip``:

.. code-block:: bash

    pip install pymars

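Installation for Developers
```````````````````````````

If you want to contribute code to Mars, follow the instructions below to install Mars for development:

.. code-block:: bash

    git clone https://github.com/mars-project/mars.git
    cd mars
    pip install -e ".[dev]"

For more details about installing Mars, refer to the `installation <https://mars-project.readthedocs.io/en/latest/installation/index.html>`_ section of the Mars documentation.
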
Architecture Overview
---------------------

.. image:: https://raw.githubusercontent.com/mars-project/mars/master/docs/source/images/architecture.png

Getting Started
---------------

Start a new runtime locally via:

.. code-block:: python

    >>> import mars
    >>> mars.new_session()

Or connect to a Mars cluster that is already initialized:

.. code-block:: python

    >>> import mars
    >>> mars.new_session('http://<web_ip>:<ui_port>')

Mars Tensor
-----------

Mars tensor provides a familiar interface like NumPy.

+-----------------------------------------------+-----------------------------------------------+
| **NumPy**                                     | **Mars tensor**                               |
+-----------------------------------------------+-----------------------------------------------+
|.. code-block:: python                         |.. code-block:: python                         |
|                                               |                                               |
|    import numpy as np                         |    import mars.tensor as mt                   |
|    N = 200_000_000                            |    N = 200_000_000                            |
|    a = np.random.uniform(-1, 1, size=(N, 2))  |    a = mt.random.uniform(-1, 1, size=(N, 2))  |
|    print((np.linalg.norm(a, axis=1) < 1)      |    print(((mt.linalg.norm(a, axis=1) < 1)     |
|          .sum() * 4 / N)                      |            .sum() * 4 / N).execute())         |
|                                               |                                               |
+-----------------------------------------------+-----------------------------------------------+
|.. code-block::                                |.. code-block::                                |
|                                               |                                               |
|    3.14174502                                 |    3.14161908                                 |
|    CPU times: user 11.6 s, sys: 8.22 s,       |    CPU times: user 966 ms, sys: 544 ms,       |
|               total: 19.9 s                   |               total: 1.51 s                   |
|    Wall time: 22.5 s                          |    Wall time: 3.77 s                          |
|                                               |                                               |
+-----------------------------------------------+-----------------------------------------------+

Mars can leverage multiple cores, even on a laptop, and can run even faster in a distributed setting.

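
As a brief note on why the snippet above estimates π (this is the standard Monte Carlo argument, not anything specific to Mars): the rows of ``a`` are points drawn uniformly from the square :math:`[-1, 1]^2`, and a point falls inside the unit circle with probability equal to the ratio of the two areas. Writing :math:`N_{\text{in}}` for the number of the :math:`N` samples whose norm is below 1:

.. math::

    \frac{N_{\text{in}}}{N} \approx \frac{\text{area of unit circle}}{\text{area of square}} = \frac{\pi}{4}
    \quad\Longrightarrow\quad
    \pi \approx \frac{4\,N_{\text{in}}}{N}

The expression ``(norm < 1).sum() * 4 / N`` computes the right-hand side, which is why both columns print a value close to 3.1416.
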
Mars DataFrame
--------------

Mars DataFrame provides a familiar interface like pandas.

+-----------------------------------------+-----------------------------------------+
| **Pandas**                              | **Mars DataFrame**                      |
+-----------------------------------------+-----------------------------------------+
|.. code-block:: python                   |.. code-block:: python                   |
|                                         |                                         |
|    import numpy as np                   |    import mars.tensor as mt             |
|    import pandas as pd                  |    import mars.dataframe as md          |
|    df = pd.DataFrame(                   |    df = md.DataFrame(                   |
|        np.random.rand(100000000, 4),    |        mt.random.rand(100000000, 4),    |
|        columns=list('abcd'))            |        columns=list('abcd'))            |
|    print(df.sum())                      |    print(df.sum().execute())            |
|                                         |                                         |
+-----------------------------------------+-----------------------------------------+
|.. code-block::                          |.. code-block::                          |
|                                         |                                         |
|    CPU times: user 10.9 s, sys: 2.69 s, |    CPU times: user 1.21 s, sys: 212 ms, |
|               total: 13.6 s             |               total: 1.42 s             |
|    Wall time: 11 s                      |    Wall time: 2.75 s                    |
+-----------------------------------------+-----------------------------------------+

Mars Learn
----------

Mars learn provides a familiar interface like scikit-learn.

+---------------------------------------------+----------------------------------------------------+
| **Scikit-learn**                            | **Mars learn**                                     |
+---------------------------------------------+----------------------------------------------------+
|.. code-block:: python                       |.. code-block:: python                              |
|                                             |                                                    |
|    from sklearn.datasets import make_blobs  |    from mars.learn.datasets import make_blobs      |
|    from sklearn.decomposition import PCA    |    from mars.learn.decomposition import PCA        |
|    X, y = make_blobs(                       |    X, y = make_blobs(                              |
|        n_samples=100000000, n_features=3,   |        n_samples=100000000, n_features=3,          |
|        centers=[[3, 3, 3], [0, 0, 0],       |        centers=[[3, 3, 3], [0, 0, 0],              |
|                 [1, 1, 1], [2, 2, 2]],      |                 [1, 1, 1], [2, 2, 2]],             |
|        cluster_std=[0.2, 0.1, 0.2, 0.2],    |        cluster_std=[0.2, 0.1, 0.2, 0.2],           |
|        random_state=9)                      |        random_state=9)                             |
|    pca = PCA(n_components=3)                |    pca = PCA(n_components=3)                       |
|    pca.fit(X)                               |    pca.fit(X)                                      |
|    print(pca.explained_variance_ratio_)     |    print(pca.explained_variance_ratio_)            |
|    print(pca.explained_variance_)           |    print(pca.explained_variance_)                  |
|                                             |                                                    |
+---------------------------------------------+----------------------------------------------------+

Mars learn also integrates with many libraries:

- `TensorFlow <https://mars-project.readthedocs.io/en/latest/user_guide/learn/tensorflow.html>`_
- `PyTorch <https://mars-project.readthedocs.io/en/latest/user_guide/learn/pytorch.html>`_
- `XGBoost <https://mars-project.readthedocs.io/en/latest/user_guide/learn/xgboost.html>`_
- `LightGBM <https://mars-project.readthedocs.io/en/latest/user_guide/learn/lightgbm.html>`_
- `Joblib <https://mars-project.readthedocs.io/en/latest/user_guide/learn/joblib.html>`_
- `Statsmodels <https://mars-project.readthedocs.io/en/latest/user_guide/learn/statsmodels.html>`_

Mars Remote
-----------

Mars remote allows users to execute functions in parallel.

+-------------------------------------------+--------------------------------------------+
| **Vanilla function calls**                | **Mars remote**                            |
+-------------------------------------------+--------------------------------------------+
|.. code-block:: python                     |.. code-block:: python                      |
|                                           |                                            |
|    import numpy as np                     |    import numpy as np                      |
|                                           |    import mars.remote as mr                |
|                                           |                                            |
|    def calc_chunk(n, i):                  |    def calc_chunk(n, i):                   |
|        rs = np.random.RandomState(i)      |        rs = np.random.RandomState(i)       |
|        a = rs.uniform(-1, 1, size=(n, 2)) |        a = rs.uniform(-1, 1, size=(n, 2))  |
|        d = np.linalg.norm(a, axis=1)      |        d = np.linalg.norm(a, axis=1)       |
|        return (d < 1).sum()               |        return (d < 1).sum()                |
|                                           |                                            |
|    def calc_pi(fs, N):                    |    def calc_pi(fs, N):                     |
|        return sum(fs) * 4 / N             |        return sum(fs) * 4 / N              |
|                                           |                                            |
|    N = 200_000_000                        |    N = 200_000_000                         |
|    n = 10_000_000                         |    n = 10_000_000                          |
|                                           |                                            |
|    fs = [calc_chunk(n, i)                 |    fs = [mr.spawn(calc_chunk, args=(n, i)) |
|          for i in range(N // n)]          |          for i in range(N // n)]           |
|    pi = calc_pi(fs, N)                    |    pi = mr.spawn(calc_pi, args=(fs, N))    |
|    print(pi)                              |    print(pi.execute().fetch())             |
|                                           |                                            |
+-------------------------------------------+--------------------------------------------+
|.. code-block::                            |.. code-block::                             |
|                                           |                                            |
|    3.1416312                              |    3.1416312                               |
|    CPU times: user 32.2 s, sys: 4.86 s,   |    CPU times: user 616 ms, sys: 307 ms,    |
|               total: 37.1 s               |               total: 923 ms                |
|    Wall time: 12.4 s                      |    Wall time: 3.99 s                       |
|                                           |                                            |
+-------------------------------------------+--------------------------------------------+

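
The right-hand column above has a fan-out and fan-in shape: many independent ``calc_chunk`` tasks, followed by one ``calc_pi`` task that consumes their results. As a rough analogy only, the sketch below reproduces that shape with the Python standard library's ``concurrent.futures`` instead of Mars. It uses ``random`` rather than NumPy and much smaller sample counts so that it runs anywhere; the thread pool and the ``max_workers`` value are assumptions made for illustration and say nothing about how Mars schedules work.

.. code-block:: python

    import random
    from concurrent.futures import ThreadPoolExecutor


    def calc_chunk(n, seed):
        # Count how many of n uniform points in [-1, 1]^2 fall inside the unit circle.
        rng = random.Random(seed)
        inside = 0
        for _ in range(n):
            x = rng.uniform(-1, 1)
            y = rng.uniform(-1, 1)
            if x * x + y * y < 1:
                inside += 1
        return inside


    def calc_pi(counts, total):
        # Reduce step: combine the per-chunk counts into a single estimate.
        return sum(counts) * 4 / total


    N = 200_000  # total number of samples (far smaller than in the table above)
    n = 20_000   # samples per chunk

    # Fan out: submit one task per chunk. Each submit() returns a future,
    # loosely analogous to the object returned by mr.spawn().
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(calc_chunk, n, i) for i in range(N // n)]
        # Fan in: gather the chunk results and reduce them.
        pi_estimate = calc_pi([f.result() for f in futures], N)

    print(pi_estimate)

The threads here only illustrate the task structure. Because of the global interpreter lock they do not speed up this pure-Python loop, whereas the timings in the table show the benefit of Mars running the spawned functions in parallel.
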
DASK on Mars
------------

Refer to `DASK on Mars`_ for more information.

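Eager Mode
``````````

Mars supports eager mode, which makes development friendlier and debugging easier.

Users can enable eager mode through options, setting the option at the beginning of the program or console session.

.. code-block:: python

    >>> from mars.config import options
    >>> options.eager_mode = True

Or use a context.

.. code-block:: python

    >>> from mars.config import option_context
    >>> with option_context() as options:
    ...     options.eager_mode = True
    ...     # eager mode is on only inside the with statement
    ...     ...

If eager mode is on, tensors, DataFrames, etc. are executed immediately by the default session once they are created.
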
.. code-block:: python

    >>> import mars.tensor as mt
    >>> import mars.dataframe as md
    >>> from mars.config import options
    >>> options.eager_mode = True
    >>> t = mt.arange(6).reshape((2, 3))
    >>> t
    array([[0, 1, 2],
           [3, 4, 5]])
    >>> df = md.DataFrame(t)
    >>> df.sum()
    0    3
    1    5
    2    7
    dtype: int64

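
To make the difference concrete, here is a toy model of deferred versus eager evaluation in plain Python. The ``LazyValue`` class and its ``EAGER`` flag are hypothetical names invented for this sketch and are unrelated to Mars' actual implementation. They only mimic the observable behaviour described above: work is recorded until ``execute()`` is called, unless eager mode is on.

.. code-block:: python

    class LazyValue:
        """Hypothetical stand-in for a deferred object; not a Mars class."""

        EAGER = False  # plays the role of options.eager_mode in this toy

        def __init__(self, func, *args):
            self._func = func
            self._args = args
            self._result = None
            self._done = False
            if LazyValue.EAGER:
                # Eager mode: run as soon as the object is created.
                self.execute()

        def execute(self):
            # Run the recorded computation once and cache the result.
            if not self._done:
                self._result = self._func(*self._args)
                self._done = True
            return self

        def fetch(self):
            if not self._done:
                raise RuntimeError("call execute() first")
            return self._result


    calls = []


    def total(values):
        calls.append("ran")  # records when the work actually happens
        return sum(values)


    lazy = LazyValue(total, [0, 1, 2, 3, 4, 5])
    ran_before_execute = len(calls)       # nothing has run yet
    lazy_result = lazy.execute().fetch()  # the work runs here

    LazyValue.EAGER = True
    eager = LazyValue(total, [0, 1, 2, 3, 4, 5])
    eager_result = eager.fetch()          # no explicit execute() needed

In the lazy case the function body runs only when ``execute()`` is called; with the flag set, constructing the object is enough, which mirrors why ``t`` and ``df.sum()`` print their values directly in the session above.
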
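
Mars on Ray
-----------

Mars also integrates deeply with Ray and can run efficiently on `Ray <https://docs.ray.io/en/latest/>`_,
interacting with the large ecosystem of machine learning and distributed systems built on top of Ray core.

Start a new Mars on Ray runtime locally via:

.. code-block:: python

    import mars
    mars.new_session(backend='ray')
    # Perform computation
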
Interact with Ray Dataset:

.. code-block:: python

    import mars.tensor as mt
    import mars.dataframe as md
    df = md.DataFrame(
        mt.random.rand(10_000_000, 4),
        columns=list('abcd'))
    # Convert the Mars DataFrame to a Ray dataset
    ds = md.to_ray_dataset(df)
    print(ds.schema(), ds.count())
    ds.filter(lambda row: row["a"] > 0.5).show(5)
    # Convert the Ray dataset back to a Mars DataFrame
    df2 = md.read_ray_dataset(ds)
    print(df2.head(5).execute())

Refer to `Mars on Ray`_ for more information.

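Easy to scale in and scale out
------------------------------

Mars can scale in to a single machine and scale out to a cluster with thousands of machines.
Migrating from a single machine to a cluster, to process more data or gain better performance, is fairly simple.
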
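Bare Metal Deployment
`````````````````````

Mars is easy to scale out to a cluster by starting the different components of
the Mars distributed runtime on different machines in the cluster.

One node can be selected as the supervisor, which integrates a web service,
leaving the other nodes as workers. The supervisor can be started with the following command:

.. code-block:: bash

    mars-supervisor -h <host_name> -p <supervisor_port> -w <web_port>

Workers can be started with the following command:

.. code-block:: bash

    mars-worker -h <host_name> -p <worker_port> -s <supervisor_endpoint>

After all Mars processes are started, users can run

.. code-block:: python

    >>> import mars
    >>> sess = mars.new_session('http://<web_ip>:<ui_port>')
    >>> # Perform computation
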
Kubernetes Deployment
`````````````````````

Refer to `Run on Kubernetes`_ for more information.

Yarn Deployment
```````````````

Refer to `Run on Yarn`_ for more information.

Getting involved
----------------

- Read the `development guide <https://mars-project.readthedocs.io/en/latest/development/index.html>`_.
- Join our Slack workgroup: `Slack <https://join.slack.com/t/mars-computing/shared_invite/zt-17pw2cfua-NRb2H4vrg77pr9T4g3nQOQ>`_.
- Join the mailing list: send an email to `mars-dev@googlegroups.com`_.
- Please report bugs by submitting a `GitHub issue`_.
- Submit contributions using `pull requests`_.

Thank you in advance for your contributions!

.. |Build| image:: https://github.com/mars-project/mars/workflows/Mars%20CI%20Core/badge.svg
   :target: https://github.com/mars-project/mars/actions
.. |Coverage| image:: https://codecov.io/gh/mars-project/mars/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/mars-project/mars
.. |Quality| image:: https://img.shields.io/codacy/grade/6a80bb4659ed410eb33795f580c8615e.svg
   :target: https://app.codacy.com/project/mars-project/mars/dashboard
.. |PyPI version| image:: https://img.shields.io/pypi/v/pymars.svg
   :target: https://pypi.python.org/pypi/pymars
.. |Docs| image:: https://img.shields.io/badge/docs-latest-brightgreen.svg
   :target: `Documentation`_
.. |License| image:: https://img.shields.io/pypi/l/pymars.svg
   :target: https://github.com/mars-project/mars/blob/master/LICENSE
.. _`mars-dev@googlegroups.com`: https://groups.google.com/forum/#!forum/mars-dev
.. _`GitHub issue`: https://github.com/mars-project/mars/issues
.. _`pull requests`: https://github.com/mars-project/mars/pulls
.. _`Documentation`: https://mars-project.readthedocs.io
.. _`中文文档`: https://mars-project.readthedocs.io/zh_CN/latest/
.. _`Mars on Ray`: https://mars-project.readthedocs.io/en/latest/installation/ray.html
.. _`Run on Kubernetes`: https://mars-project.readthedocs.io/en/latest/installation/kubernetes.html
.. _`Run on Yarn`: https://mars-project.readthedocs.io/en/latest/installation/yarn.html
.. _`DASK on Mars`: https://mars-project.readthedocs.io/en/latest/user_guide/contrib/dask.html