deepsparse - Efficient use of sparsity to optimize deep learning inference on CPUs

Project Overview: DeepSparse

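DeepSparse is a CPU inference runtime built to accelerate neural network inference. It exploits sparsity to substantially improve inference performance on CPU hardware. When models are first pruned and quantized with the optimization library SparseML, DeepSparse can run them significantly faster.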
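To make the sparsity idea concrete, here is a minimal pure-Python sketch of unstructured magnitude pruning followed by compressed sparse row (CSR) storage. The helper names (`magnitude_prune`, `to_csr`, `csr_matvec`) and the toy matrix are illustrative assumptions. SparseML's actual pruning algorithms and DeepSparse's kernels are far more sophisticated. The point is that once half the weights are zero, only the nonzeros need to be stored and multiplied.

```python
# Illustrative sketch only: not SparseML's pruning algorithm or DeepSparse's kernel.

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest |w| (unstructured)."""
    rows, cols = len(weights), len(weights[0])
    # Rank every (row, col) position by the absolute value of its weight.
    order = sorted(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: abs(weights[rc[0]][rc[1]]),
    )
    k = int(rows * cols * sparsity)  # number of weights to remove
    pruned = [row[:] for row in weights]
    for r, c in order[:k]:
        pruned[r][c] = 0.0
    return pruned


def to_csr(matrix):
    """Store only nonzeros: values, their column indices, and per-row offsets."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for c, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(c)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Matrix-vector product that touches only the stored nonzeros."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for i in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[i] * x[col_idx[i]]
        y.append(acc)
    return y


def dense_matvec(matrix, x):
    """Reference dense product, multiplying every entry including zeros."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


W = [
    [0.9, -0.1, 0.05, 0.7],
    [-0.02, 0.4, -0.8, 0.03],
    [0.3, 0.01, -0.6, -0.2],
]
x = [1.0, 2.0, 3.0, 4.0]

W_pruned = magnitude_prune(W, sparsity=0.5)
vals, cols, ptr = to_csr(W_pruned)
print(f"{len(vals)} of {len(W) * len(W[0])} weights stored")  # 6 of 12 weights stored
print(csr_matvec(vals, cols, ptr, x) == dense_matvec(W_pruned, x))  # True
```

The sparse product skips every multiplication by a pruned weight, so arithmetic shrinks in proportion to the zeros removed. Storage also pays for the column indices and row pointers, and handling the irregular memory access of unstructured sparsity efficiently is exactly what a specialized runtime must do.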

DeepSparse Model Support

Neural Magic recently added LLM (large language model) inference support to DeepSparse. The new features include:

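Sparse kernels that accelerate unstructured sparse weights, delivering speedups and memory savings.
Quantization of both weights and activations to 8 bits.
Efficient use of cached attention keys and values to minimize memory movement.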
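The 8-bit quantization item can be illustrated with a minimal sketch of symmetric, per-tensor int8 quantization. Each tensor gets one scale so that its largest magnitude maps to 127, values are rounded to integers, and a dot product is accumulated in integers and rescaled once at the end. This is an assumed, simplified scheme for illustration. The exact scheme DeepSparse uses (for example per-channel scales or zero points) is not described here.

```python
# Simplified symmetric per-tensor int8 quantization (an illustrative assumption,
# not necessarily the scheme DeepSparse uses internally).

def quantize_int8(values):
    """Map floats to integers in [-127, 127] with a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [qi * scale for qi in q]


def int8_dot(q_a, scale_a, q_b, scale_b):
    """Integer multiply-accumulate, then a single float rescale at the end."""
    acc = sum(a * b for a, b in zip(q_a, q_b))  # a real kernel uses a wide (e.g. int32) accumulator
    return acc * scale_a * scale_b


weights = [0.5, -1.27, 0.02, 0.9]
activations = [1.0, 0.3, -2.0, 0.4]

q_w, s_w = quantize_int8(weights)
q_a, s_a = quantize_int8(activations)

exact = sum(w * a for w, a in zip(weights, activations))
approx = int8_dot(q_w, s_w, q_a, s_a)
print("int8 weights:", q_w)
print(f"float dot = {exact:.4f}, int8 dot = {approx:.4f}")
```

Each value is stored in one byte instead of four, and the inner loop runs entirely on integers. The price is a small rounding error of at most half a scale step per element.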
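The cached-attention item refers to the standard key/value (KV) cache used in autoregressive decoding. The sketch below uses a single attention head in pure Python with hypothetical toy projection matrices `Wq`, `Wk`, and `Wv`. It compares decoding with and without a cache. With the cache, each step projects only the newest token into a key and a value. Without it, every previous token is re-projected at every step. This shows the general technique only, not DeepSparse's internal implementation.

```python
import math

# Toy single-head, single-layer setup; all matrices and tokens are hypothetical.
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.1], [0.2, 0.3]]
Wv = [[1.0, -1.0], [0.0, 2.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]


def project(W, x):
    """Matrix-vector product used for the Q, K and V projections."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all given positions."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(len(values[0]))]


def decode_with_cache(tokens):
    keys, values = [], []  # the KV cache grows by one entry per step
    outputs, kv_projections = [], 0
    for x in tokens:
        keys.append(project(Wk, x))  # only the newest token is projected
        values.append(project(Wv, x))
        kv_projections += 2
        outputs.append(attend(project(Wq, x), keys, values))
    return outputs, kv_projections


def decode_without_cache(tokens):
    outputs, kv_projections = [], 0
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [project(Wk, x) for x in prefix]  # recomputed at every step
        values = [project(Wv, x) for x in prefix]
        kv_projections += 2 * len(prefix)
        outputs.append(attend(project(Wq, tokens[t]), keys, values))
    return outputs, kv_projections


out_cached, n_cached = decode_with_cache(tokens)
out_uncached, n_uncached = decode_without_cache(tokens)
print(out_cached == out_uncached)  # True
print(n_cached, n_uncached)  # 8 20
```

Both paths produce identical outputs. For 4 tokens, however, the cached path performs 8 key/value projections versus 20, and the uncached cost grows quadratically with sequence length. The cache trades that recomputation for memory, which is why a runtime cares about moving cached keys and values as little as possible.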

How to Use DeepSparse

Installation (requires Linux):
```
pip install -U deepsparse-nightly[llm]
```

Run an inference example:

```
from deepsparse import TextGeneration

# Load a 50%-pruned, quantized MPT-7B Dolly model from its SparseZoo stub
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")

prompt = """
Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is sparsity? ### Response:
"""
print(pipeline(prompt, max_new_tokens=75).generations[0].text)

# Sample output:
# Sparsity is the property of a matrix or other data structure in which a large number of elements are zero and a smaller number of elements are non-zero. In the context of machine learning, sparsity can be used to improve the efficiency of training and prediction.
```