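k-Shape: Efficient and Accurate Clustering of Time Series
k-Shape is a highly accurate and efficient unsupervised method for univariate and multivariate time-series clustering. k-Shape appeared at the ACM SIGMOD 2015 conference, where it was selected as one of the (2) best papers and received the inaugural 2015 ACM SIGMOD Research Highlight Award. An extended version appeared in the ACM TODS 2017 journal. Since then, k-Shape has achieved state-of-the-art performance on both univariate and multivariate time-series datasets (i.e., k-Shape is among the fastest and most accurate time-series clustering methods, ranking near the top of established benchmarks that contain more than 100 datasets).
k-Shape has been widely adopted across scientific areas (e.g., computer science, social science, space science, engineering, econometrics, biology, neuroscience, and medicine), Fortune 100-500 enterprises (e.g., Exelon, Nokia, and many financial firms), and organizations such as the European Space Agency.
If you use k-Shape in your project or research, cite the following two papers: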
References
"k-Shape: Efficient and Accurate Clustering of Time Series"
John Paparrizos and Luis Gravano
2015 ACM SIGMOD International Conference on Management of Data (ACM SIGMOD 2015)
@inproceedings{paparrizos2015k,
title={{k-Shape: Efficient and Accurate Clustering of Time Series}},
author={Paparrizos, John and Gravano, Luis},
booktitle={Proceedings of the 2015 ACM SIGMOD international conference on management of data},
pages={1855--1870},
year={2015}
}
"Fast and Accurate Time-Series Clustering"
John Paparrizos and Luis Gravano
ACM Transactions on Database Systems (ACM TODS 2017), volume 42(2), pages 1-49
@article{paparrizos2017fast,
title={{Fast and Accurate Time-Series Clustering}},
author={Paparrizos, John and Gravano, Luis},
journal={ACM Transactions on Database Systems (ACM TODS)},
volume={42},
number={2},
pages={1--49},
year={2017}
}
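Acknowledgements
We thank Teja Bogireddy for valuable help with this repository.
We also thank the initial contributors, Jörg Thalheim and Gregory Rehm. The initial code was used for Sieve.
k-Shape's Python Repository
This repository contains the Python implementation of k-Shape. For the Matlab version, see here.
Data
To ease reproducibility, we share our results on two established benchmarks:
- The UCR Univariate Archive, which contains 128 univariate time-series datasets.
  - Download all 128 preprocessed datasets here.
- The UAE Multivariate Archive, which contains 28 multivariate time-series datasets.
For the preprocessing steps, see here.
Installation
Our code depends on the following Python packages:
Install from pip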
$ pip install kshape
Install from source
$ git clone https://github.com/thedatumorg/kshape-python
$ cd kshape-python
$ python setup.py install
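Benchmarking
We present the runtime performance of k-Shape when varying the number of time series, the number of clusters, and the length of the time series. (All results are the average of 5 runs.)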
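The averaging over 5 runs described above can be sketched with a small timing harness. This is only an illustration, not the benchmark script used for the reported numbers: `average_runtime` and `stand_in_workload` are hypothetical names, and the workload is a placeholder so the block runs without k-Shape installed.

```python
import statistics
import time


def average_runtime(fit_fn, runs=5):
    """Return the mean wall-clock time in seconds of calling fit_fn `runs` times."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fit_fn()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


def stand_in_workload(n_series, length):
    # Placeholder for a clustering call (assumption): it only sums synthetic values.
    return sum(i * j for i in range(n_series) for j in range(length))


# Vary the number of series while holding the series length fixed at 60.
results = {
    n_series: average_runtime(lambda n=n_series: stand_in_workload(n, 60))
    for n_series in (100, 200, 400)
}
for n_series, seconds in results.items():
    print(f"{n_series} series: {seconds:.6f} s (mean of 5 runs)")
```

To time k-Shape itself, one would presumably pass a callable such as `lambda: KShapeClusteringCPU(3, centroid_init='zero', max_iter=100, n_jobs=-1).fit(data)` in place of the stand-in workload, and repeat the sweep for the number of clusters and the series length.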
Usage
Univariate example:
import numpy as np
from kshape.core import KShapeClusteringCPU
from kshape.core_gpu import KShapeClusteringGPU
# 200 random series of length 60, with a trailing axis for the single channel
univariate_ts_datasets = np.expand_dims(np.random.rand(200, 60), axis=2)
num_clusters = 3
# CPU model
ksc = KShapeClusteringCPU(num_clusters, centroid_init='zero', max_iter=100, n_jobs=-1)
ksc.fit(univariate_ts_datasets)
labels = ksc.labels_  # or ksc.predict(univariate_ts_datasets)
cluster_centroids = ksc.centroids_
# GPU model
ksg = KShapeClusteringGPU(num_clusters, centroid_init='zero', max_iter=100)
ksg.fit(univariate_ts_datasets)
labels = ksg.labels_
# Move the centroids back to host memory
cluster_centroids = ksg.centroids_.detach().cpu()
Multivariate example:
import numpy as np
from kshape.core import KShapeClusteringCPU
from kshape.core_gpu import KShapeClusteringGPU
# 200 random series of length 60 with 6 channels each
multivariate_ts_datasets = np.random.rand(200, 60, 6)
num_clusters = 3
# CPU model
ksc = KShapeClusteringCPU(num_clusters, centroid_init='zero', max_iter=100, n_jobs=-1)
ksc.fit(multivariate_ts_datasets)
labels = ksc.labels_
cluster_centroids = ksc.centroids_
# GPU model
ksg = KShapeClusteringGPU(num_clusters, centroid_init='zero', max_iter=100)
ksg.fit(multivariate_ts_datasets)
labels = ksg.labels_
# Move the centroids back to host memory
cluster_centroids = ksg.centroids_.detach().cpu()
See also the examples for clustering UCR/UAE datasets.
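Both examples pass a 3-D array to `fit`, which suggests an input layout of (number of series, series length, number of channels). The sketch below shows one way to bring data into that layout before clustering. `as_kshape_input` is a hypothetical helper and not part of the `kshape` package, and the axis interpretation is inferred from the examples rather than stated by the library.

```python
import numpy as np


def as_kshape_input(series):
    """Return a float array shaped (n_series, length, n_channels).

    Hypothetical helper: 2-D input (n_series, length) is treated as univariate
    and gets a trailing channel axis; 3-D input is passed through unchanged.
    """
    data = np.asarray(series, dtype=float)
    if data.ndim == 2:
        data = np.expand_dims(data, axis=2)
    if data.ndim != 3:
        raise ValueError(f"expected 2-D or 3-D input, got {data.ndim}-D")
    return data


univariate = as_kshape_input(np.random.rand(200, 60))
multivariate = as_kshape_input(np.random.rand(200, 60, 6))
print(univariate.shape, multivariate.shape)  # (200, 60, 1) (200, 60, 6)
```

Rejecting other dimensionalities early keeps shape mistakes from surfacing later inside the clustering call.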
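Results
The following tables contain the average Rand Index (RI), Adjusted Rand Index (ARI), and Normalized Mutual Information (NMI) accuracy values over 10 runs of k-Shape on the univariate and multivariate datasets.
Note: We collected these results using a single-core implementation.
Server specifications: AMD Ryzen 9 5900HX, 8 cores, 3.30 GHz, 16 GB RAM.
GPU specifications: NVIDIA GeForce RTX 3070, 8 GB memory.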
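RI, ARI, and NMI all compare a predicted clustering against ground-truth class labels. The following is a minimal standard-library sketch of the three measures, included only to make the columns concrete; it is not the evaluation code behind the reported numbers. `clustering_scores` is a hypothetical name, and normalizing NMI by the arithmetic mean of the two entropies is an assumption, since the variant used for the tables is not stated.

```python
from collections import Counter
from math import comb, log


def clustering_scores(true_labels, pred_labels):
    """Return (RI, ARI, NMI) for two labelings of the same items (needs >= 2 items)."""
    n = len(true_labels)
    pairs = Counter(zip(true_labels, pred_labels))  # contingency table
    rows = Counter(true_labels)                     # class sizes
    cols = Counter(pred_labels)                     # cluster sizes

    # Count item pairs that are grouped together in each labeling.
    same_both = sum(comb(c, 2) for c in pairs.values())
    same_true = sum(comb(c, 2) for c in rows.values())
    same_pred = sum(comb(c, 2) for c in cols.values())
    total = comb(n, 2)

    # Rand Index: fraction of pairs on which the two labelings agree.
    ri = (total + 2 * same_both - same_true - same_pred) / total

    # Adjusted Rand Index: pair agreement corrected for chance.
    expected = same_true * same_pred / total
    max_index = (same_true + same_pred) / 2
    ari = 1.0 if max_index == expected else (same_both - expected) / (max_index - expected)

    # NMI: mutual information divided by the mean entropy (assumed variant).
    mi = sum((c / n) * log(n * c / (rows[t] * cols[p])) for (t, p), c in pairs.items())
    h_true = -sum((c / n) * log(c / n) for c in rows.values())
    h_pred = -sum((c / n) * log(c / n) for c in cols.values())
    mean_entropy = (h_true + h_pred) / 2
    nmi = 1.0 if mean_entropy == 0 else mi / mean_entropy
    return ri, ari, nmi


ri, ari, nmi = clustering_scores([0, 0, 1, 1], [0, 0, 1, 2])
print(f"RI={ri:.4f} ARI={ari:.4f} NMI={nmi:.4f}")
```

All three scores are invariant to how cluster ids are numbered, so the labels returned by a clustering run can be compared against class labels directly.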
Univariate results:
Dataset | RI | ARI | NMI | Runtime (secs) |
---|---|---|---|---|
ACSF1 | 0.728889447 | 0.139127178 | 0.385362576 | 181.97282 |
Adiac | 0.948199219 | 0.237456072 | 0.585026777 | 150.23389 |
AllGestureWiimoteX | 0.830988989 | 0.091833105 | 0.19967124 | 132.64325 |
AllGestureWiimoteY | 0.83356036 | 0.1306081 | 0.265320116 | 68.32064 |
AllGestureWiimoteZ | 0.831796196 | 0.08184644 | 0.184288361 | 117.54415 |
ArrowHead | 0.623696682 | 0.176408828 | 0.251716443 | 1.42841 |
Beef | 0.666553672 | 0.102291622 | 0.274983496 | 2.04646 |
BeetleFly | 0.518461538 | 0.037243262 | 0.049170634 | 0.62138 |
BirdChicken | 0.522948718 | 0.046863444 | 0.055805713 | 0.46606 |
BME | 0.623662322 | 0.209189215 | 0.337562447 | 0.75734 |
Car | 0.668095238 | 0.142785926 | 0.222574613 | 4.87239 |
CBF | 0.875577393 | 0.724563717 | 0.770334057 | 7.47873 |
Chinatown | 0.526075568 | 0.041117166 | 0.015693819 | 0.548231 |
ChlorineConcentration | 0.526233814 | -0.001019087 | 0.000772354 | 68.01957 |
CinCECGTorso | 0.625307149 | 0.051803606 | 0.093350668 | 271.74131 |
Coffee | 0.726493506 | 0.453837834 | 0.421820948 | 0.41349 |
Computers | 0.529187976 | 0.058481715 | 0.0485609 | 3.01130 |
CricketX | 0.869701787 | 0.174655947 | 0.357916915 | 55.23645 |
CricketY | 0.873153945 | 0.206381317 | 0.373656368 | 48.83094 |
CricketZ | 0.869909812 | 0.172669605 | 0.355604411 | 44.52660 |
Crop | 0.924108349 | 0.241974335 | 0.4388123 | 5420.01129 |
DiatomSizeReduction | 0.919179195 | 0.807710845 | 0.827117298 | 1.59904 |
DistalPhalanxOutlineAgeGroup | 0.722184825 | 0.435943568 | 0.329905608 | 2.12145 |
DistalPhalanxOutlineCorrect | 0.499455708 | -0.001030351 | 2.97E-05 | 2.26317 |
DistalPhalanxTW | 0.839607976 | 0.59272726 | 0.531060255 | 10.96752 |
DodgerLoopDay | 0.781988229 | 0.210916925 | 0.402897375 | 1.69891 |
DodgerLoopGame | 0.570071757 | 0.140620499 | 0.117161969 | 0.86779 |
DodgerLoopWeekend | 0.830807063 | 0.657966909 | 0.628131221 | 0.495587 |
Earthquakes | 0.541659908 | 0.024267193 | 0.006262268 | 9.69413 |
ECG200 | 0.613753769 | 0.215794222 | 0.12870574 | 0.74401 |
ECG5000 | 0.771307998 | 0.530703353 | 0.523220504 | 163.82402 |
ECGFiveDays | 0.811446734 | 0.623122565 | 0.586492573 | 4.52766 |
ElectricDevices | 0.693551963 | 0.071161449 | 0.177107461 | 591.80007 |
EOGHorizontalSignal | 0.86864851 | 0.227034804 | 0.408923026 | 357.01975 |
EOGVerticalSignal | 0.87082521 | 0.200763231 | 0.37416983 | 236.19376 |
EthanolLevel | 0.622273617 | 0.003480205 | 0.007896876 | 188.62335 |
FaceAll | 0.910295025 | 0.433266026 | 0.610598916 | 317.37956 |
FaceFour | 0.757335907 | 0.374239896 | 0.466746543 | 1.38740 |
FacesUCR | 0.910295025 | 0.433266026 | 0.610598916 | 136.62772 |
FiftyWords | 0.951558207 | 0.358925864 | 0.651569015 | 198.84656 |
Fish | 0.785345886 | 0.189885615 | 0.327951361 | 17.13432 |
FordA | 0.564619244 | 0.129237686 | 0.096210429 | 344.81591 |
FordB | 0.516109383 | 0.032218211 | 0.023938345 | 254.47971 |
FreezerRegularTrain | 0.638744137 | 0.277488682 | 0.211547387 | 18.45565 |
FreezerSmallTrain | 0.639049682 | 0.278099783 | 0.212045663 | 26.71921 |
Fungi | 0.829126823 | 0.357543672 | 0.731173267 | 6.11174 |
GestureMidAirD1 | 0.944819412 | 0.2937662 | 0.635503444 | 30.88751 |
GestureMidAirD2 | 0.947697224 | 0.348582475 | 0.677310905 | 43.38524 |
GestureMidAirD3 | 0.931266132 | 0.126759199 | 0.458782509 | 18.98568 |
GesturePebbleZ1 | 0.883081466 | 0.585931482 | 0.675293127 | 11.72848 |
GesturePebbleZ2 | 0.881353135 | 0.580554538 | 0.66392792 | 7.60654 |
GunPoint | 0.497487437 | -0.005050505 | 0 | 0.431333 |
GunPointAgeSpan | 0.531991131 | 0.064141145 | 0.053146884 | 1.59410 |
GunPointMaleVersusFemale | 0.790127618 | 0.580242081 | 0.571776535 | 1.08047 |
GunPointOldVersusYoung | 0.518734664 | 0.037473134 | 0.028207614 | 3.55970 |
Ham | 0.528831556 | 0.057673104 | 0.044612673 | 2.13764 |
HandOutlines | 0.682856686 | 0.360051947 | 0.251176285 | 247.46488 |
Haptics | 0.689075575 | 0.063709939 | 0.09042192 | 97.01234 |
Herring | 0.501464075 | 0.003160642 | 0.007650463 | 1.22652 |
HouseTwenty | 0.520197437 | 0.040014774 | 0.03248788 | 49.73466 |
InlineSkate | 0.734065189 | 0.039846163 | 0.104643365 | 372.13227 |
InsectEPGRegularTrain | 0.706511773 | 0.363941816 | 0.379556522 | 7.86684 |
InsectEPGSmallTrain | 0.70409136 | 0.361370964 | 0.379504988 | 5.37182 |
InsectWingbeatSound | 0.792640539 | 0.196225831 | 0.402373638 | 220.85374 |
ItalyPowerDemand | 0.60972886 | 0.219608406 | 0.188152403 | 3.01081 |
LargeKitchenAppliances | 0.570070672 | 0.125576669 | 0.130422376 | 12.03511 |
Lightning2 | 0.531294766 | 0.057017617 | 0.089783145 | 1.93780 |
Lightning7 | 0.806175515 | 0.322963065 | 0.506494431 | 4.51913 |
Mallat | 0.924756461 | 0.721656055 | 0.869891088 | 84.35894 |
Meat | 0.761918768 | 0.494403401 | 0.580422751 | 0.86227 |
MedicalImages | 0.672005013 | 0.073490231 | 0.2287366 | 32.23141 |
MelbournePedestrian | 0.869441656 | 0.349104777 | 0.470402239 | 275.40925 |
MiddlePhalanxOutlineAgeGroup | 0.729585262 | 0.423115226 | 0.401722498 | 1.57184 |
MiddlePhalanxOutlineCorrect | 0.49977175 | -0.00373634 | 0.000894849 | 2.28809 |
MiddlePhalanxTW | 0.809347564 | 0.449636118 | 0.431364361 | 8.09901 |
MixedShapesRegularTrain | 0.800991079 | 0.420414418 | 0.488448041 | 285.77452 |
MixedShapesSmallTrain | 0.800795029 | 0.419036374 | 0.4766379 | 115.97755 |
MoteStrain | 0.804809143 | 0.609589015 | 0.501865061 | 4.56190 |
NonInvasiveFetalECGThorax1 | 0.950981974 | 0.33373922 | 0.676420909 | 2995.88974 |
NonInvasiveFetalECGThorax2 | 0.967174335 | 0.465761156 | 0.765614776 | 1748.11823 |
OliveOil | 0.806892655 | 0.570012361 | 0.607418333 | 1.97315 |
OSULeaf | 0.785105837 | 0.263550973 | 0.361580708 | 18.38517 |
PhalangesOutlinesCorrect | 0.505362413 | 0.01070369 | 0.010221576 | 6.79001 |
Phoneme | 0.92769786 | 0.034705732 | 0.210108984 | 1747.00270 |
PickupGestureWiimoteZ | 0.854545455 | 0.288210152 | 0.540234358 | 3.61598 |
PigAirwayPressure | 0.903229862 | 0.03338252 | 0.427579631 | 1632.92364 |
PigArtPressure | 0.959821502 | 0.273442178 | 0.717389411 | 914.99103 |
PigCVP | 0.961346772 | 0.194516974 | 0.658363736 | 1304.41961 |
PLAID | 0.859444881 | 0.281634259 | 0.40487855 | 555.89190 |
Plane | 0.911765778 | 0.708344209 | 0.851592604 | 1.14514 |
PowerCons | 0.57637883 | 0.153069982 | 0.137929689 | 1.74243 |
ProximalPhalanxOutlineAgeGroup | 0.752674183 | 0.477154395 | 0.468537655 | 1.72700 |
ProximalPhalanxOutlineCorrect | 0.53390585 | 0.066453288 | 0.08535263 | 1.15338 |
ProximalPhalanxTW | 0.831222703 | 0.569454692 | 0.550694374 | 5.31783 |
RefrigerationDevices | 0.556208278 | 0.007595278 | 0.009437609 | 28.19549 |
Rock | 0.696935818 | 0.218081493 | 0.322230745 | 179.14048 |
ScreenType | 0.559603738 | 0.010528249 | 0.011742597 | 26.81045 |
SemgHandGenderCh2 | 0.546315412 | 0.091559428 | 0.058471281 | 39.87313 |
SemgHandMovementCh2 | 0.739443579 | 0.116429522 | 0.209097135 | 195.28737 |
SemgHandSubjectCh2 | 0.724787047 | 0.19660949 | 0.263889093 | 211.94098 |
ShakeGestureWiimoteZ | 0.903171717 | 0.471533102 | 0.684959604 | 3.51105 |
ShapeletSim | 0.699939698 | 0.400050425 | 0.377331686 | 3.14061 |
ShapesAll | 0.978735474 | 0.42589872 | 0.742885495 | 201.26739 |
SmallKitchenAppliances | 0.398853939 | 0.004907405 | 0.02514159 | 25.50886 |
SmoothSubspace | 0.642434783 | 0.198252944 | 0.19954272 | 2.06081 |
SonyAIBORobotSurface1 | 0.728057763 | 0.455518203 | 0.464021606 | 2.53491 |
SonyAIBORobotSurface2 | 0.589140522 | 0.172496802 | 0.11750294 | 4.86348 |
StarLightCurves | 0.769194065 | 0.520688962 | 0.610221341 | 64.50148 |
Strawberry | 0.504165518 | -0.019398783 | 0.123396507 | 6.72441 |
SwedishLeaf | 0.890254013 | 0.312306779 | 0.556179611 | 58.87581 |
Symbols | 0.880314418 | 0.619222941 | 0.757594317 | 23.11830 |
SyntheticControl | 0.881984975 | 0.600681896 | 0.712533175 | 6.90626 |
ToeSegmentation1 | 0.50200682 | 0.004059369 | 0.005057191 | 1.78287 |
ToeSegmentation2 | 0.635618839 | 0.260242738 | 0.191505717 | 1.96561 |
Trace | 0.711065327 | 0.455900994 | 0.598951999 | 2.30357 |
TwoLeadECG | 0.538024968 | 0.076155916 | 0.059000693 | 8.53791 |
TwoPatterns | 0.677979172 | 0.207830772 | 0.318418523 | 185.70084 |
UMD | 0.597057728 | 0.130992637 | 0.189184137 | 0.93842 |
UWaveGestureLibraryAll | 0.90364952 | 0.576024048 | 0.662693972 | 288.38747 |
UWaveGestureLibraryX | 0.85435587 | 0.353963525 | 0.457132359 | 348.93967 |
UWaveGestureLibraryY | 0.830476288 | 0.24845414 | 0.342123959 | 471.75583 |
UWaveGestureLibraryZ | 0.849091206 | 0.350080637 | 0.46397562 | 448.39118 |
Wafer | 0.541995609 | 0.026459678 | 0.010367784 | 41.34034 |
Wine | 0.496478296 | -0.005187919 | 0.001056479 | 0.57659 |
WordSynonyms | 0.892537036 | 0.221578306 | 0.451754722 | 74.17649 |
Worms | 0.647528127 | 0.028458575 | 0.062591393 | 24.33412 |
WormsTwoClass | 0.503616566 | 0.00695446 | 0.009827969 | 8.10779 |
Yoga | 0.499909412 | -0.000340663 | 7.76E-05 | 146.22124 |
Multivariate results:
Dataset | RI | ARI | NMI | Runtime (secs) |
---|---|---|---|---|
ArticularyWordRecognition | 0.97284653 | 0.682936 | 0.864209 | 2532.5272 |
AtrialFibrillation | 0.560919540 | 0.01633812 | 0.128106259 | 76.43405 |
BasicMotions | 0.725 | 0.3090610 | 0.4459239 | 38.9816293 |
CharacterTrajectories | 0.9365907 | 0.459423 | 0.7025514 | 6976.2988 |
Cricket | 0.93382991 | 0.624538 | 0.82573024 | 1370.6116316 |
DuckDuckGeese | 0.625656 | 0.01100873 | 0.08130332 | 13447.5819 |
ERing | 0.87868450 | 0.5742014 | 0.647674 | 291.04038 |
Epilepsy | 0.81000 | 0.50352 | 0.54805851 | 83.565232 |
EthanolConcentration | 0.59969 | -0.00394874 | 0.0010586 | 471.00570 |
FaceDetection | 0.50010 | 0.000212347 | 0.0002300 | 54983.670330 |
FingerMovements | 0.5025486 | 0.0050935 | 0.005415024 | 977.18741 |
HandMovementDirection | 0.600674 | 0.04846741 | 0.05801 | 942.49626 |
Handwriting | 0.916650 | 0.120414 | 0.40797 | 2304.06015 |
Heartbeat | 0.502037 | 0.0040379 | 0.0032608 | 4857.40035 |
InsectWingbeat | 0.65513 | 0.00222 | 0.01020 | 705605.323 |
JapaneseVowels | 0.859639 | 0.314733 | 0.4591541 | 1286.313125 |
LSST | 0.760442 | 0.0608486 | 0.124402 | 5608.39906 |
Libras | 0.90685 | 0.30682997 | 0.560319 | 437.0877 |
MotorImagery | 0.49957194 | 0.00049311 | 0.0033261 | 18263.257795 |
NATOPS | 0.82175796 | 0.3739007 | 0.45782146 | 265.831900 |
PenDigits | 0.9146977 | 0.573592 | 0.69841860 | 5172.9306 |
PhonemeSpectra | 0.807146 | 0.0143122 | 0.08947696 | 28615.90575 |
RacketSports | 0.7666819 | 0.38386 | 0.442636255 | 289.75656 |
SelfRegulationSCP1 | 0.515991 | 0.032366 | 0.035956 | 543.48927 |
SelfRegulationSCP2 | 0.498805 | -0.002369 | 0.00018238 | 1194.50309 |
SpokenArabicDigits | 0.95415 | 0.7455001 | 0.80269645 | 12275.5243 |
StandWalkJump | 0.4957264 | 0.040850354 | 0.16682 | 412.409304 |
UWaveGestureLibrary | 0.86596 | 0.474113616 | 0.629729 | 184.98871 |