STUMPY
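STUMPY is a powerful and scalable Python library that efficiently computes something called the matrix profile, which is just an academic way of saying "for every (green) subsequence within your time series, automatically identify its corresponding nearest neighbor (grey)".

What's important is that once you've computed your matrix profile (middle panel above), it can be used for a variety of time series data mining tasks such as:

- pattern/motif (approximately repeated subsequences within a longer time series) discovery
- anomaly/novelty (discord) discovery
- shapelet discovery
- semantic segmentation
- streaming (on-line) data
- fast approximate matrix profiles
- time series chains (temporally ordered sets of subsequence patterns)
- snippets for summarizing long time series
- pan matrix profiles for selecting the best subsequence window size(s)
- and more ...

Whether you are an academic, data scientist, software developer, or time series enthusiast, STUMPY is straightforward to install, and our goal is to get you to your time series insights faster. See the documentation for more information.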
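
To make the matrix profile definition above concrete, here is a minimal brute-force sketch in pure Python. It is not how STUMPY computes the matrix profile (STUMPY uses much faster Numba-compiled algorithms). The function names and the exclusion zone of ``ceil(m / 4)`` around each subsequence are assumptions made for this illustration.

```python
import math


def znorm(values):
    """Z-normalize a subsequence; a constant subsequence maps to all zeros."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def naive_matrix_profile(ts, m):
    """Brute-force matrix profile, for illustration only.

    Returns (profile, indices): for every length-m subsequence, the
    z-normalized Euclidean distance to its nearest neighbor and the start
    index of that neighbor.
    """
    n_subs = len(ts) - m + 1
    subs = [znorm(ts[i:i + m]) for i in range(n_subs)]
    excl = math.ceil(m / 4)  # assumed exclusion zone to skip trivial self-matches
    profile = [math.inf] * n_subs
    indices = [-1] * n_subs
    for i in range(n_subs):
        for j in range(n_subs):
            if abs(i - j) <= excl:
                continue  # too close to i to count as a real neighbor
            dist = math.dist(subs[i], subs[j])
            if dist < profile[i]:
                profile[i], indices[i] = dist, j
    return profile, indices


if __name__ == "__main__":
    pattern = [0.0, 1.0, 3.0, 1.0]
    ts = [0.3, 0.1] + pattern + [0.5, 0.2, 0.9, 0.4] + pattern + [0.7, 0.6]
    profile, indices = naive_matrix_profile(ts, m=4)
    # The smallest profile value marks a motif, the largest a discord
    motif = min(range(len(profile)), key=profile.__getitem__)
    discord = max(range(len(profile)), key=profile.__getitem__)
    print("motif starts at", motif, "and matches index", indices[motif])
    print("discord starts at", discord)
```

Low values in the profile point to repeated patterns (motifs), while high values point to subsequences that have no close match anywhere else (discords).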
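
How to use STUMPY

Please see our API documentation for a complete list of available functions and our informative tutorials for more comprehensive example use cases. Below, you will find code snippets that quickly demonstrate how to use STUMPY.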

Typical usage (1-dimensional time series data) with `STUMP <https://stumpy.readthedocs.io/en/latest/api.html#stumpy.stump>`__:

.. code:: python

    import stumpy
    import numpy as np

    if __name__ == "__main__":
        your_time_series = np.random.rand(10000)
        window_size = 50  # Approximately, how many data points might be found in a pattern

        matrix_profile = stumpy.stump(your_time_series, m=window_size)

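
The later snippets slice the returned array by column (``[:, 1]``, ``[:, 2]``, ``[:, 3]``), so it helps to name those columns and to know how many rows to expect. The following is a small sketch; the constant names and the helper ``n_subsequences`` are hypothetical and are not part of the STUMPY API.

```python
# Column layout of the array returned by stumpy.stump, as used in the snippets here.
# These constant names are illustrative only.
MP_DISTANCE = 0      # distance to the nearest neighbor (the matrix profile itself)
MP_INDEX = 1         # start index of the nearest neighbor
MP_LEFT_INDEX = 2    # start index of the nearest neighbor to the left
MP_RIGHT_INDEX = 3   # start index of the nearest neighbor to the right


def n_subsequences(n, m):
    """Number of length-m sliding windows in a time series of length n."""
    if not 1 <= m <= n:
        raise ValueError("window size m must satisfy 1 <= m <= n")
    return n - m + 1


if __name__ == "__main__":
    # One matrix profile row per subsequence of the example series above
    print(n_subsequences(10000, 50))
```

With these names, ``matrix_profile[:, MP_DISTANCE]`` would be the profile values and ``matrix_profile[:, MP_INDEX]`` the location of each nearest neighbor.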

Distributed usage for 1-dimensional time series data with Dask Distributed via `STUMPED <https://stumpy.readthedocs.io/en/latest/api.html#stumpy.stumped>`__:

.. code:: python

    import stumpy
    import numpy as np
    from dask.distributed import Client

    if __name__ == "__main__":
        with Client() as dask_client:
            your_time_series = np.random.rand(10000)
            window_size = 50  # Approximately, how many data points might be found in a pattern

            matrix_profile = stumpy.stumped(dask_client, your_time_series, m=window_size)


GPU usage for 1-dimensional time series data with `GPU-STUMP <https://stumpy.readthedocs.io/en/latest/api.html#stumpy.gpu_stump>`__:

.. code:: python

    import stumpy
    import numpy as np
    from numba import cuda

    if __name__ == "__main__":
        your_time_series = np.random.rand(10000)
        window_size = 50  # Approximately, how many data points might be found in a pattern
        all_gpu_devices = [device.id for device in cuda.list_devices()]  # Get a list of all available GPU devices

        matrix_profile = stumpy.gpu_stump(your_time_series, m=window_size, device_id=all_gpu_devices)


Multi-dimensional time series data with `MSTUMP <https://stumpy.readthedocs.io/en/latest/api.html#stumpy.mstump>`__:

.. code:: python

    import stumpy
    import numpy as np

    if __name__ == "__main__":
        # Each row holds the data for one dimension; each column holds one time point
        your_time_series = np.random.rand(3, 1000)
        window_size = 50  # Approximately, how many data points might be found in a pattern

        matrix_profile, matrix_profile_indices = stumpy.mstump(your_time_series, m=window_size)


Distributed multi-dimensional time series data analysis with Dask Distributed via `MSTUMPED <https://stumpy.readthedocs.io/en/latest/api.html#stumpy.mstumped>`__:

.. code:: python

    import stumpy
    import numpy as np
    from dask.distributed import Client

    if __name__ == "__main__":
        with Client() as dask_client:
            # Each row holds the data for one dimension; each column holds one time point
            your_time_series = np.random.rand(3, 1000)
            window_size = 50  # Approximately, how many data points might be found in a pattern

            matrix_profile, matrix_profile_indices = stumpy.mstumped(dask_client, your_time_series, m=window_size)


Time series chains with `Anchored Time Series Chains (ATSC) <https://stumpy.readthedocs.io/en/latest/api.html#stumpy.atsc>`__:

.. code:: python

    import stumpy
    import numpy as np

    if __name__ == "__main__":
        your_time_series = np.random.rand(10000)
        window_size = 50  # Approximately, how many data points might be found in a pattern

        matrix_profile = stumpy.stump(your_time_series, m=window_size)

        left_matrix_profile_index = matrix_profile[:, 2]
        right_matrix_profile_index = matrix_profile[:, 3]
        idx = 10  # Subsequence index for which to retrieve the anchored time series chain

        # Chain anchored at a single subsequence
        anchored_chain = stumpy.atsc(left_matrix_profile_index, right_matrix_profile_index, idx)

        # All chains, plus the longest unanchored chain
        all_chain_set, longest_unanchored_chain = stumpy.allc(left_matrix_profile_index, right_matrix_profile_index)

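
To see why the left and right matrix profile indices are what a chain needs, the sketch below follows mutually agreeing links from an anchor: step to the right nearest neighbor, and keep going only if that neighbor's left nearest neighbor points back. This is a simplified illustration of the idea, not STUMPY's implementation; the function name, the toy index lists, and the use of ``-1`` for "no neighbor" are assumptions.

```python
def anchored_chain(left_idx, right_idx, anchor):
    """Follow bidirectional nearest-neighbor links starting from `anchor`.

    left_idx[i]  : start index of the nearest neighbor to the left of i
    right_idx[i] : start index of the nearest neighbor to the right of i
    A value of -1 is assumed to mean "no such neighbor".
    """
    chain = [anchor]
    current = anchor
    for _ in range(len(left_idx)):  # a chain can never exceed the series length
        nxt = right_idx[current]
        if nxt == -1 or left_idx[nxt] != current:
            break  # the link is not mutual, so the chain ends here
        chain.append(nxt)
        current = nxt
    return chain


if __name__ == "__main__":
    # Hand-made toy indices for six subsequences
    left_idx = [-1, 0, 0, 1, 2, 3]
    right_idx = [2, 3, 4, 5, 5, -1]
    print(anchored_chain(left_idx, right_idx, 0))
    print(anchored_chain(left_idx, right_idx, 1))
```

In the toy data, the chain anchored at 0 stops after index 4 because subsequence 5 names 3, not 4, as its left nearest neighbor.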

Semantic segmentation with `Fast Low-cost Unipotent Semantic Segmentation (FLUSS) <https://stumpy.readthedocs.io/en/latest/api.html#stumpy.fluss>`__:

.. code:: python

    import stumpy
    import numpy as np

    if __name__ == "__main__":
        your_time_series = np.random.rand(10000)
        window_size = 50  # Approximately, how many data points might be found in a pattern

        matrix_profile = stumpy.stump(your_time_series, m=window_size)

        subseq_len = 50
        correct_arc_curve, regime_locations = stumpy.fluss(
            matrix_profile[:, 1],  # matrix profile indices
            L=subseq_len,
            n_regimes=2,
            excl_factor=1,
        )

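
The intuition behind the arc curve is that each subsequence draws an "arc" to its nearest neighbor, and few arcs cross a boundary between two regimes. The sketch below only counts raw arc crossings from a list of matrix profile indices. It is a simplified illustration under that assumption: FLUSS additionally corrects the counts against an idealized arc curve to compensate for the edges of the series, which this sketch omits, and the function name is hypothetical.

```python
def arc_curve(mp_indices):
    """Count, for each position, how many nearest-neighbor arcs pass over it.

    An arc between i and mp_indices[i] is counted at every position p with
    min(i, j) <= p < max(i, j).
    """
    n = len(mp_indices)
    delta = [0] * n
    for i, j in enumerate(mp_indices):
        lo, hi = min(i, j), max(i, j)
        delta[lo] += 1  # the arc starts covering positions here
        delta[hi] -= 1  # and stops covering them here
    counts, running = [], 0
    for change in delta:
        running += change
        counts.append(running)
    return counts


if __name__ == "__main__":
    # Toy example: subsequences 0-3 only match each other, and so do 4-7
    mp_indices = [2, 3, 0, 1, 6, 7, 4, 5]
    print(arc_curve(mp_indices))
```

In the toy example no arc crosses from position 3 to position 4, so the count drops to zero there, which is where a regime change would be placed. The count is also zero at the very end of the series, which illustrates the edge effect that the correction step addresses.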

Dependencies

Supported Python and NumPy versions are determined according to the `NEP 29 deprecation policy <https://numpy.org/neps/nep-0029-deprecation_policy.html>`__.

- `NumPy <http://www.numpy.org/>`__
- `Numba <http://numba.pydata.org/>`__
- `SciPy <https://www.scipy.org/>`__

Where to get it

Conda install (preferred):

.. code:: bash

    conda install -c conda-forge stumpy

PyPI install, assuming that you already have numpy, scipy, and numba installed:

.. code:: bash

    python -m pip install stumpy

To install stumpy from source, see the instructions in the `documentation <https://stumpy.readthedocs.io/en/latest/install.html>`__.
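
After installing, one way to confirm what ended up in your environment is to query the installed package metadata. This is a small sketch using only the Python standard library (``importlib.metadata``, available in Python 3.8+); the helper name ``installed_version`` is hypothetical.

```python
from importlib import metadata


def installed_version(package="stumpy"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Report STUMPY together with the dependencies listed above
    for name in ("stumpy", "numpy", "scipy", "numba"):
        print(name, installed_version(name) or "not installed")
```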
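
Documentation

To fully understand and appreciate the underlying algorithms and applications, it is imperative that you read the original publications. For more detailed examples of how to use STUMPY, please consult the latest documentation or explore our hands-on tutorials.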
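
Performance

We tested the performance of computing the exact matrix profile using the Numba JIT-compiled version of the code on randomly generated time series data of various lengths (i.e., ``np.random.rand(n)``), along with different CPU and GPU hardware resources.

[Image: STUMPY performance plot]

The raw results are displayed in the table below as hours:minutes:seconds (with fractional seconds) and with a constant window size of ``m = 50``. Note that these reported runtimes include the time that it takes to move the data from the host to all of the GPU device(s). You may need to scroll to the right side of the table in order to see all of the runtimes.
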
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| i        | n = 2^i           | GPU-STOMP    | STUMP.2     | STUMP.16    | STUMPED.128 | STUMPED.256 | GPU-STUMP.1 | GPU-STUMP.2 | GPU-STUMP.DGX1 | GPU-STUMP.DGX2 |
+==========+===================+==============+=============+=============+=============+=============+=============+=============+================+================+
| 6        | 64                | 00:00:10.00  | 00:00:00.00 | 00:00:00.00 | 00:00:05.77 | 00:00:06.08 | 00:00:00.03 | 00:00:01.63 | NaN            | NaN            |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 7        | 128               | 00:00:10.00  | 00:00:00.00 | 00:00:00.00 | 00:00:05.93 | 00:00:07.29 | 00:00:00.04 | 00:00:01.66 | NaN            | NaN            |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 8        | 256               | 00:00:10.00  | 00:00:00.00 | 00:00:00.01 | 00:00:05.95 | 00:00:07.59 | 00:00:00.08 | 00:00:01.69 | 00:00:06.68    | 00:00:25.68    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 9        | 512               | 00:00:10.00  | 00:00:00.00 | 00:00:00.02 | 00:00:05.97 | 00:00:07.47 | 00:00:00.13 | 00:00:01.66 | 00:00:06.59    | 00:00:27.66    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 10       | 1024              | 00:00:10.00  | 00:00:00.02 | 00:00:00.04 | 00:00:05.69 | 00:00:07.64 | 00:00:00.24 | 00:00:01.72 | 00:00:06.70    | 00:00:30.49    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 11       | 2048              | NaN          | 00:00:00.05 | 00:00:00.09 | 00:00:05.60 | 00:00:07.83 | 00:00:00.53 | 00:00:01.88 | 00:00:06.87    | 00:00:31.09    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 12       | 4096              | NaN          | 00:00:00.22 | 00:00:00.19 | 00:00:06.26 | 00:00:07.90 | 00:00:01.04 | 00:00:02.19 | 00:00:06.91    | 00:00:33.93    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 13       | 8192              | NaN          | 00:00:00.50 | 00:00:00.41 | 00:00:06.29 | 00:00:07.73 | 00:00:01.97 | 00:00:02.49 | 00:00:06.61    | 00:00:33.81    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 14       | 16384             | NaN          | 00:00:01.79 | 00:00:00.99 | 00:00:06.24 | 00:00:08.18 | 00:00:03.69 | 00:00:03.29 | 00:00:07.36    | 00:00:35.23    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 15       | 32768             | NaN          | 00:00:06.17 | 00:00:02.39 | 00:00:06.48 | 00:00:08.29 | 00:00:07.45 | 00:00:04.93 | 00:00:07.02    | 00:00:36.09    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 16       | 65536             | NaN          | 00:00:22.94 | 00:00:06.42 | 00:00:07.33 | 00:00:09.01 | 00:00:14.89 | 00:00:08.12 | 00:00:08.10    | 00:00:36.54    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 17       | 131072            | 00:00:10.00  | 00:01:29.27 | 00:00:19.52 | 00:00:09.75 | 00:00:10.53 | 00:00:29.97 | 00:00:15.42 | 00:00:09.45    | 00:00:37.33    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 18       | 262144            | 00:00:18.00  | 00:05:56.50 | 00:01:08.44 | 00:00:33.38 | 00:00:24.07 | 00:00:59.62 | 00:00:27.41 | 00:00:13.18    | 00:00:39.30    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 19       | 524288            | 00:00:46.00  | 00:25:34.58 | 00:03:56.82 | 00:01:35.27 | 00:03:43.66 | 00:01:56.67 | 00:00:54.05 | 00:00:19.65    | 00:00:41.45    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 20       | 1048576           | 00:02:30.00  | 01:51:13.43 | 00:19:54.75 | 00:04:37.15 | 00:03:01.16 | 00:05:06.48 | 00:02:24.73 | 00:00:32.95    | 00:00:46.14    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 21       | 2097152           | 00:09:15.00  | 09:25:47.64 | 03:05:07.64 | 00:13:36.51 | 00:08:47.47 | 00:20:27.94 | 00:09:41.43 | 00:01:06.51    | 00:01:02.67    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 22       | 4194304           | NaN          | 36:12:23.74 | 10:37:51.21 | 00:55:44.43 | 00:32:06.70 | 01:21:12.33 | 00:38:30.86 | 00:04:03.26    | 00:02:23.47    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 23       | 8388608           | NaN          | 143:16:09.94| 38:42:51.42 | 03:33:30.53 | 02:00:49.37 | 05:11:44.45 | 02:33:14.60 | 00:15:46.26    | 00:08:03.76    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 24       | 16777216          | NaN          | NaN         | NaN         | 14:39:11.99 | 07:13:47.12 | 20:43:03.80 | 09:48:43.42 | 01:00:24.06    | 00:29:07.84    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| NaN      | 17729800          | 09:16:12.00  | NaN         | NaN         | 15:31:31.75 | 07:18:42.54 | 23:09:22.43 | 10:54:08.64 | 01:07:35.39    | 00:32:51.55    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 25       | 33554432          | NaN          | NaN         | NaN         | 56:03:46.81 | 26:27:41.29 | 83:29:21.06 | 39:17:43.82 | 03:59:32.79    | 01:54:56.52    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 26       | 67108864          | NaN          | NaN         | NaN         | 211:17:37.60| 106:40:17.17| 328:58:04.68| 157:18:30.50| 15:42:15.94    | 07:18:52.91    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| NaN      | 100000000         | 291:07:12.00 | NaN         | NaN         | NaN         | 234:51:35.39| NaN         | NaN         | 35:03:44.61    | 16:22:40.81    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
| 27       | 134217728         | NaN          | NaN         | NaN         | NaN         | NaN         | NaN         | NaN         | 64:41:55.09    | 29:13:48.12    |
+----------+-------------------+--------------+-------------+-------------+-------------+-------------+-------------+-------------+----------------+----------------+
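
To compare columns of the table numerically, the runtime strings can be converted to seconds. The sketch below is a convenience for reading the table and is not part of STUMPY; the helper name ``to_seconds`` is hypothetical, and the runtime strings are copied from the rows for ``i = 23``, ``i = 24``, and ``i = 25``.

```python
def to_seconds(runtime):
    """Convert an 'hours:minutes:seconds' string (fractional seconds allowed) to seconds."""
    hours, minutes, seconds = runtime.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


if __name__ == "__main__":
    # Row i = 23 (n = 8388608): STUMP.2 versus GPU-STUMP.DGX2
    stump_2 = to_seconds("143:16:09.94")
    dgx2 = to_seconds("00:08:03.76")
    print(f"GPU-STUMP.DGX2 speedup over STUMP.2: {stump_2 / dgx2:.0f}x")

    # STUMPED.256 when n doubles from 2^24 to 2^25
    ratio = to_seconds("26:27:41.29") / to_seconds("07:13:47.12")
    print(f"runtime grows by a factor of {ratio:.2f} when n doubles")
```

The second comparison shows the runtime growing by a little under 4x when the series length doubles, which is in line with the roughly quadratic cost of computing an exact matrix profile.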
^^^^^^^^^^^^^^^^^^
Hardware Resources
^^^^^^^^^^^^^^^^^^
.. _hardware:
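
GPU-STOMP: These results are reproduced from the original Matrix Profile II paper, which used an NVIDIA Tesla K80 (contains 2 GPUs), and serve as the performance benchmark to compare against.

STUMP.2: stumpy.stump executed with 2 CPUs in total: 2x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors, parallelized with Numba on a single server without Dask.

STUMP.16: stumpy.stump executed with 16 CPUs in total: 16x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors, parallelized with Numba on a single server without Dask.

STUMPED.128: stumpy.stumped executed with 128 CPUs in total: 8x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors x 16 servers, parallelized with Numba and distributed with Dask Distributed.

STUMPED.256: stumpy.stumped executed with 256 CPUs in total: 8x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors x 32 servers, parallelized with Numba and distributed with Dask Distributed.

GPU-STUMP.1: stumpy.gpu_stump executed with 1x NVIDIA GeForce GTX 1080 Ti GPU, 512 threads per block, 200W power limit, compiled to CUDA with Numba, and parallelized with Python multiprocessing.

GPU-STUMP.2: stumpy.gpu_stump executed with 2x NVIDIA GeForce GTX 1080 Ti GPUs, 512 threads per block, 200W power limit, compiled to CUDA with Numba, and parallelized with Python multiprocessing.

GPU-STUMP.DGX1: stumpy.gpu_stump executed with 8x NVIDIA Tesla V100 GPUs, 512 threads per block, compiled to CUDA with Numba, and parallelized with Python multiprocessing.

GPU-STUMP.DGX2: stumpy.gpu_stump executed with 16x NVIDIA Tesla V100 GPUs, 512 threads per block, compiled to CUDA with Numba, and parallelized with Python multiprocessing.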

Running Tests

Tests are located in the tests directory, are processed with PyTest, and require coverage.py for code coverage analysis. Tests can be executed with:

.. code:: bash

    ./test.sh

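
Python Version

STUMPY supports Python 3.8+ and, due to the use of unicode variable names/identifiers, is not compatible with Python 2.x. Given the small number of dependencies, STUMPY may work on older versions of Python, but this is beyond the scope of our support, and we strongly recommend that you upgrade to the most recent version of Python.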
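
As a small illustration of the unicode point above: Python 3 accepts non-ASCII identifiers such as Greek letters, whereas Python 2 rejects them at parse time. The names ``μ`` and ``σ`` and both functions below are illustrative only and are not taken from the STUMPY source.

```python
import sys

# Non-ASCII identifiers are valid in Python 3 but are a syntax error in Python 2
μ = 2.0  # illustrative mean
σ = 0.5  # illustrative standard deviation


def z_score(x):
    """Standardize x using the unicode-named parameters above."""
    return (x - μ) / σ


def meets_minimum_python(version_info=sys.version_info):
    """Return True when the interpreter satisfies the stated Python 3.8+ requirement."""
    return tuple(version_info[:2]) >= (3, 8)


if __name__ == "__main__":
    print(z_score(3.0))
    print(meets_minimum_python())
```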
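
Getting Help

First, please check the discussions and issues on GitHub to see if your question has already been answered there. If no solution is available there, feel free to open a new discussion or issue, and the authors will attempt to respond in a reasonably timely fashion.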
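
Contributing

We welcome contributions in any form! Assistance with documentation, particularly expanding tutorials, is always welcome. To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.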

Citing

If you have used this codebase in a scientific publication and wish to cite it, please use the Journal of Open Source Software article.

S.M. Law, (2019). STUMPY: A Powerful and Scalable Python Library for Time Series Data Mining. Journal of Open Source Software, 4(39), 1504.

.. code:: bibtex

    @article{law2019stumpy,
      author  = {Law, Sean M.},
      title   = {{STUMPY: A Powerful and Scalable Python Library for Time Series Data Mining}},
      journal = {{The Journal of Open Source Software}},
      volume  = {4},
      number  = {39},
      pages   = {1504},
      year    = {2019}
    }

References
.. publications:
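
Yeh, Chin-Chia Michael, et al. (2016) Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords, and Shapelets. ICDM:1317-1322. `Link <https://ieeexplore.ieee.org/abstract/document/7837992>`__

Zhu, Yan, et al. (2016) Matrix Profile II: Exploiting a Novel Algorithm and GPUs to Break the One Hundred Million Barrier for Time Series Motifs and Joins. ICDM:739-748. `Link <https://ieeexplore.ieee.org/abstract/document/7837898>`__

Yeh, Chin-Chia Michael, et al. (2017) Matrix Profile VI: Meaningful Multidimensional Motif Discovery. ICDM:565-574. `Link <https://ieeexplore.ieee.org/abstract/document/8215529>`__

Zhu, Yan, et al. (2017) Matrix Profile VII: Time Series Chains: A New Primitive for Time Series Data Mining. ICDM:695-704. `Link <https://ieeexplore.ieee.org/abstract/document/8215542>`__

Gharghabi, Shaghayegh, et al. (2017) Matrix Profile VIII: Domain Agnostic Online Semantic Segmentation at Superhuman Performance Levels. ICDM:117-126. `Link <https://ieeexplore.ieee.org/abstract/document/8215484>`__

Zhu, Yan, et al. (2017) Exploiting a Novel Algorithm and GPUs to Break the Ten Quadrillion Pairwise Comparisons Barrier for Time Series Motifs and Joins. KAIS:203-236. `Link <https://link.springer.com/article/10.1007%2Fs10115-017-1138-x>`__

Zhu, Yan, et al. (2018) Matrix Profile XI: SCRIMP++: Time Series Motif Discovery at Interactive Speeds. ICDM:837-846. `Link <https://ieeexplore.ieee.org/abstract/document/8594908>`__

Yeh, Chin-Chia Michael, et al. (2018) Time Series Joins, Motifs, Discords and Shapelets: a Unifying View that Exploits the Matrix Profile. Data Mining and Knowledge Discovery:83-123. `Link <https://link.springer.com/article/10.1007/s10618-017-0519-9>`__

Gharghabi, Shaghayegh, et al. (2018) Matrix Profile XII: MPdist: A Novel Time Series Distance Measure to Allow Data Mining in More Challenging Scenarios. ICDM:965-970. `Link <https://ieeexplore.ieee.org/abstract/document/8594928>`__

Zimmerman, Zachary, et al. (2019) Matrix Profile XIV: Scaling Time Series Motif Discovery with GPUs to Break a Quintillion Pairwise Comparisons a Day and Beyond. SoCC '19:74-86. `Link <https://dl.acm.org/doi/10.1145/3357223.3362721>`__

Akbarinia, Reza, and Bertrand Cloez. (2019) Efficient Matrix Profile Computation Using Different Distance Functions. arXiv:1901.05708. `Link <https://arxiv.org/abs/1901.05708>`__

Kamgar, Kaveh, et al. (2019) Matrix Profile XV: Exploiting Time Series Consensus Motifs to Find Structure in Time Series Sets. ICDM:1156-1161. `Link <https://ieeexplore.ieee.org/abstract/document/8970797>`__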

License & Trademark

| STUMPY
| Copyright 2019 TD Ameritrade. Released under the terms of the 3-Clause BSD license.
| STUMPY is a trademark of TD Ameritrade IP Company, Inc. All rights reserved.