datasketch: Big Data Looks Small
================================

.. image:: https://static.pepy.tech/badge/datasketch/month
    :target: https://pepy.tech/project/datasketch

.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.598238.svg
    :target: https://zenodo.org/doi/10.5281/zenodo.598238

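datasketch provides probabilistic data structures that can process and
search very large amounts of data quickly while keeping accuracy high.

This package contains the following data sketches:
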
+-------------------------+-----------------------------------------------+
| Data Sketch             | Usage                                         |
+=========================+===============================================+
| `MinHash`_              | Estimate Jaccard similarity and cardinality   |
+-------------------------+-----------------------------------------------+
| `Weighted MinHash`_     | Estimate weighted Jaccard similarity          |
+-------------------------+-----------------------------------------------+
| `HyperLogLog`_          | Estimate cardinality                          |
+-------------------------+-----------------------------------------------+
| `HyperLogLog++`_        | Estimate cardinality                          |
+-------------------------+-----------------------------------------------+

The following indexes for data sketches are provided to support
sub-linear query time:

+---------------------------+-----------------------------+------------------------+
| Index                     | For Data Sketch             | Supported Query Type   |
+===========================+=============================+========================+
| `MinHash LSH`_            | MinHash, Weighted MinHash   | Jaccard threshold      |
+---------------------------+-----------------------------+------------------------+
| `MinHash LSH Forest`_     | MinHash, Weighted MinHash   | Jaccard top-K          |
+---------------------------+-----------------------------+------------------------+
| `MinHash LSH Ensemble`_   | MinHash                     | Containment threshold  |
+---------------------------+-----------------------------+------------------------+
| `HNSW`_                   | Any                         | Custom metric top-K    |
+---------------------------+-----------------------------+------------------------+

datasketch must be used with Python 3.7 or above, NumPy 1.11 or above,
and SciPy.

Note that `MinHash LSH`_ and `MinHash LSH Ensemble`_ also support Redis
and Cassandra storage layers (see `MinHash LSH at Scale`_).

Install
-------

To install datasketch using ``pip``:

::

    pip install datasketch

This will also install NumPy as a dependency.

To install with the Redis dependency:

::

    pip install datasketch[redis]

To install with the Cassandra dependency:

::

    pip install datasketch[cassandra]

.. _`MinHash`: https://ekzhu.github.io/datasketch/minhash.html
.. _`Weighted MinHash`: https://ekzhu.github.io/datasketch/weightedminhash.html
.. _`HyperLogLog`: https://ekzhu.github.io/datasketch/hyperloglog.html
.. _`HyperLogLog++`: https://ekzhu.github.io/datasketch/hyperloglog.html#hyperloglog-plusplus
.. _`MinHash LSH`: https://ekzhu.github.io/datasketch/lsh.html
.. _`MinHash LSH Forest`: https://ekzhu.github.io/datasketch/lshforest.html
.. _`MinHash LSH Ensemble`: https://ekzhu.github.io/datasketch/lshensemble.html
.. _`MinHash LSH at Scale`: http://ekzhu.github.io/datasketch/lsh.html#minhash-lsh-at-scale
.. _`HNSW`: https://ekzhu.github.io/datasketch/documentation.html#hnsw