
SmartSim

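A machine learning integration framework optimized for high-performance computing environments

SmartSim is a workflow library designed for high-performance computing (HPC) environments. It simplifies the use of machine learning libraries such as PyTorch and TensorFlow in HPC simulations and applications. The framework launches machine learning infrastructure on HPC systems and runs it alongside user workloads. Through the Infrastructure Library and the SmartRedis clients, SmartSim enables efficient data exchange and remote execution between HPC applications and machine learning models. It supports Fortran, C, C++, and Python, and exchanges data at runtime without MPI.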





SmartSim

SmartSim is made up of two parts:

  1. SmartSim Infrastructure Library (This repository)
  2. SmartRedis

The two library components are designed to work together, but can also be used independently.

SmartSim is a workflow library that makes it easier to use common Machine Learning (ML) libraries, like PyTorch and TensorFlow, in High Performance Computing (HPC) simulations and applications. SmartSim launches ML infrastructure on HPC systems alongside user workloads.

SmartRedis provides an API to connect HPC workloads, particularly (MPI + X) simulations, to the ML infrastructure, namely the Orchestrator database, launched by SmartSim.

Applications integrated with the SmartRedis clients, written in Fortran, C, C++, and Python, can send data to the SmartSim infrastructure and remotely request that it execute ML models and scripts on GPU or CPU. The distributed client-server paradigm allows data to be exchanged between applications at runtime without MPI.




Quick Start

The documentation has a number of tutorials that make it easy to get used to SmartSim locally before using it on your system. Each tutorial is a Jupyter notebook that can be run through the SmartSim Tutorials Docker image, which starts JupyterLab with the tutorials, SmartSim, and SmartRedis installed.

docker pull ghcr.io/craylabs/smartsim-tutorials:latest
docker run -p 8888:8888 ghcr.io/craylabs/smartsim-tutorials:latest
# click on link to open jupyter lab

SmartSim Infrastructure Library

The Infrastructure Library (IL), the smartsim Python package, facilitates the launch of Machine Learning and simulation workflows. The Python interface of the IL creates, configures, launches, and monitors applications.

Experiments

The Experiment object is the main interface of SmartSim. Through the Experiment users can create references to user applications called Models.

Hello World

Below is a simple example of a workflow that uses the IL to launch a hello world program with the local launcher, which is designed for laptops and single nodes.

from smartsim import Experiment

exp = Experiment("simple", launcher="local")

settings = exp.create_run_settings("echo", exe_args="Hello World")
model = exp.create_model("hello_world", settings)

exp.start(model, block=True)
print(exp.get_status(model))

Hello World MPI

The Experiment.create_run_settings method returns a RunSettings object which defines how a model is launched. There are many types of RunSettings supported by SmartSim.

  • RunSettings
  • MpirunSettings
  • SrunSettings
  • AprunSettings
  • JsrunSettings

The following example launches a hello world MPI program using the local launcher, which is intended for single compute nodes, workstations, and laptops.

from smartsim import Experiment

exp = Experiment("hello_world", launcher="local")
mpi_settings = exp.create_run_settings(exe="echo",
                                       exe_args="Hello World!",
                                       run_command="mpirun")
mpi_settings.set_tasks(4)

mpi_model = exp.create_model("hello_world", mpi_settings)

exp.start(mpi_model, block=True)
print(exp.get_status(mpi_model))

If run_command="auto" (the default) is passed to Experiment.create_run_settings, SmartSim will attempt to find a run command on the system for which it has a corresponding RunSettings class. If one is found, Experiment.create_run_settings will instantiate and return an object of that type.
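The detection described above can be pictured as a search of the PATH for known run commands. The sketch below is an illustration only and is not SmartSim's implementation: the function name `detect_run_command`, the candidate order, and the injectable `which` parameter are assumptions made for this example.

```python
import shutil
from typing import Callable, Optional, Sequence

# Candidate run commands, in a hypothetical order of preference.
CANDIDATES = ("srun", "aprun", "jsrun", "mpirun")


def detect_run_command(
    candidates: Sequence[str] = CANDIDATES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first candidate found on PATH, or None if none is found."""
    for command in candidates:
        if which(command) is not None:
            return command
    return None


if __name__ == "__main__":
    # The result depends on what is installed on the current machine.
    print(detect_run_command())
```

Passing `which` as a parameter keeps the lookup testable on machines where none of these commands are installed.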


Experiments on HPC Systems

SmartSim integrates with common HPC schedulers providing batch and interactive launch capabilities for all applications:

  • Slurm
  • LSF
  • PBSPro
  • Local (for laptops/single node, no batch)

In addition, on Slurm and PBS systems, Dragon can be used as a launcher. Please refer to the documentation for instructions on how to install it on your system and use it in SmartSim.

Interactive Launch Example

The following launches the same hello_world model in an interactive allocation.

# get interactive allocation (Slurm)
salloc -N 3 --ntasks-per-node=20 --ntasks 60 --exclusive -t 00:10:00

# get interactive allocation (PBS)
qsub -l select=3:ncpus=20 -l walltime=00:10:00 -l place=scatter -I -q <queue>

# get interactive allocation (LSF)
bsub -Is -W 00:10 -nnodes 3 -P <project> $SHELL

The script below runs unchanged on a Slurm, PBS, or LSF system because the launcher is set to auto in the Experiment initialization. The run command, such as mpirun, aprun, or srun, is detected automatically from what is available on the system.

# hello_world.py
from smartsim import Experiment

exp = Experiment("hello_world_exp", launcher="auto")
run = exp.create_run_settings(exe="echo", exe_args="Hello World!")
run.set_tasks(60)
run.set_tasks_per_node(20)

model = exp.create_model("hello_world", run)
exp.start(model, block=True, summary=True)

print(exp.get_status(model))

# in interactive terminal
python hello_world.py
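The values passed to set_tasks and set_tasks_per_node should agree with the allocation: 60 tasks at 20 tasks per node fill the 3 nodes requested in the interactive allocation. The helper below is hypothetical and not part of SmartSim; it only makes the arithmetic explicit.

```python
def nodes_required(tasks: int, tasks_per_node: int) -> int:
    """Smallest number of nodes that can hold `tasks` at `tasks_per_node` each."""
    if tasks <= 0 or tasks_per_node <= 0:
        raise ValueError("tasks and tasks_per_node must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (tasks + tasks_per_node - 1) // tasks_per_node


# Values from the hello_world.py example.
print(nodes_required(tasks=60, tasks_per_node=20))  # 3
```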

This script could also be launched in a batch file instead of an interactive terminal. For example, for Slurm:

#!/bin/bash
#SBATCH --exclusive
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=20
#SBATCH --time=00:10:00

python /path/to/hello_world.py

# on Slurm system
sbatch run_hello_world.sh
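When the node count or time limit changes between runs, a batch file like this one can be generated from Python. This is a sketch using only the standard library; the function name `render_sbatch_script` and its parameters are made up for this example and are not part of SmartSim.

```python
def render_sbatch_script(script_path: str, nodes: int,
                         tasks_per_node: int, time: str) -> str:
    """Build the text of a Slurm batch file that runs a Python script."""
    lines = [
        "#!/bin/bash",
        "#SBATCH --exclusive",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --time={time}",
        "",
        f"python {script_path}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Prints the script text; save it to a file before passing it to sbatch.
    print(render_sbatch_script("/path/to/hello_world.py", nodes=3,
                               tasks_per_node=20, time="00:10:00"))
```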

Batch Launch Examples

SmartSim can also launch workloads in a batch directly from Python, without the need for a batch script. Users can launch groups of Model instances in an Ensemble.

The following launches 4 replicas of the same hello_world model.

# hello_ensemble.py
from smartsim import Experiment

exp = Experiment("hello_world_batch", launcher="auto")

# define resources for all ensemble members
batch = exp.create_batch_settings(nodes=4, time="00:10:00", account="12345-Cray")
batch.set_queue("premium")

# define how each member should run
run = exp.create_run_settings(exe="echo", exe_args="Hello World!")
run.set_tasks(60)
run.set_tasks_per_node(20)

ensemble = exp.create_ensemble("hello_world",
                               batch_settings=batch,
                               run_settings=run,
                               replicas=4)
exp.start(ensemble, block=True, summary=True)

print(exp.get_status(ensemble))

python hello_ensemble.py

As in the interactive example, this script runs unchanged on a Slurm, PBS, or LSF system because the launcher is set to auto in the Experiment initialization. Local launching does not support batch workloads.


Infrastructure Library Applications

  • Orchestrator - In-memory data store and Machine Learning Inference (Redis + RedisAI)

Redis + RedisAI

The Orchestrator is an in-memory database that utilizes Redis and RedisAI to provide a distributed database and access to ML runtimes from Fortran, C, C++ and Python.

SmartSim provides classes that make it simple to launch the database in many configurations and optionally form a distributed database cluster. The examples below will show how to launch the database. Later in this document we will show how to use the database to perform ML inference and processing.

Local Launch

The following script launches a single database using the local launcher.

Experiment.create_database will initialize an Orchestrator instance corresponding to the specified launcher.

# run_db_local.py
from smartsim import Experiment

exp = Experiment("local-db", launcher="local")
db = exp.create_database(port=6780,       # database port
                         interface="lo")  # network interface to use

# by default, SmartSim never blocks execution after the database is launched.
exp.start(db)

# launch models, analysis, training, inference sessions, etc
# that communicate with the database using the SmartRedis clients

# stop the database
exp.stop(db)

Interactive Launch

The Orchestrator, like Ensemble instances, can be launched locally, in interactive allocations, or in a batch.

The following example launches a distributed (3 node) database cluster in an interactive allocation.

# get interactive allocation (Slurm)
salloc -N 3 --ntasks-per-node=1 --exclusive -t 00:10:00

# get interactive allocation (PBS)
qsub -l select=3:ncpus=1 -l walltime=00:10:00 -l place=scatter -I -q <queue>

# get interactive allocation (LSF)
bsub -Is -W 00:10 -nnodes 3 -P <project> $SHELL

# run_db.py
from smartsim import Experiment

# auto specified to work across launcher types
exp = Experiment("db-on-slurm", launcher="auto")
db_cluster = exp.create_database(db_nodes=3,
                                 port=6780,
                                 batch=False,
                                 interface="ipogif0")
exp.start(db_cluster)

print(f"Orchestrator launched on nodes: {db_cluster.hosts}")
# launch models, analysis, training, inference sessions, etc
# that communicate with the database using the SmartRedis clients

exp.stop(db_cluster)

# in interactive terminal
python run_db.py

Batch Launch

The Orchestrator can also be launched in a batch without the need for an interactive allocation. SmartSim will create the batch file, submit it to the batch system, and then wait for the database to be launched. Users can hit CTRL-C to cancel the launch if needed.

# run_db_batch.py
from smartsim import Experiment

exp = Experiment("batch-db-on-pbs", launcher="auto")
db_cluster = exp.create_database(db_nodes=3,
                                 port=6780,
                                 batch=True,
                                 time="00:10:00",
                                 interface="ib0",
                                 account="12345-Cray",
                                 queue="cl40")

exp.start(db_cluster)

print(f"Orchestrator launched on nodes: {db_cluster.hosts}")
# launch models, analysis, training, inference sessions, etc
# that communicate with the database using the SmartRedis clients

exp.stop(db_cluster)

python run_db_batch.py

SmartRedis

The SmartSim IL Clients (SmartRedis) are implementations of Redis clients that implement the RedisAI API with additions specific to scientific workflows.

SmartRedis clients are available in Fortran, C, C++, and Python. Users can push data to and pull data from the Orchestrator from any of these languages.

Tensors

Tensors are the fundamental data structure for the SmartRedis clients. The Clients use the native array format of the language. For example, in Python, a tensor is a NumPy array while the C/C++ clients accept nested and contiguous arrays.

When stored in the database, all tensors are stored in the same format. Hence, any language can receive a tensor from the database no matter what supported language the array was sent from. This enables applications in different languages to communicate numerical data with each other at runtime.
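To see why a shared storage format lets any client read any tensor, consider a toy encoding that stores the element type, the shape, and the raw bytes together. This is an illustration using only the Python standard library. It is not the format that SmartRedis or RedisAI actually uses, and `pack_tensor` and `unpack_tensor` are hypothetical names.

```python
import struct
from typing import List, Sequence, Tuple

# Fixed little-endian layout: a reader in any language can decode it
# without knowing which language wrote it.
_FORMATS = {"float64": "d", "int32": "i"}


def pack_tensor(dtype: str, shape: Sequence[int], values: Sequence[float]) -> bytes:
    """Encode a flat, row-major sequence of values with its dtype and shape."""
    count = 1
    for dim in shape:
        count *= dim
    if count != len(values):
        raise ValueError("shape does not match number of values")
    name = dtype.encode("ascii")
    header = struct.pack("<B", len(name)) + name
    header += struct.pack("<B", len(shape)) + struct.pack(f"<{len(shape)}q", *shape)
    return header + struct.pack(f"<{count}{_FORMATS[dtype]}", *values)


def unpack_tensor(blob: bytes) -> Tuple[str, Tuple[int, ...], List[float]]:
    """Decode a blob produced by pack_tensor."""
    name_len = blob[0]
    dtype = blob[1:1 + name_len].decode("ascii")
    offset = 1 + name_len
    ndim = blob[offset]
    offset += 1
    shape = struct.unpack_from(f"<{ndim}q", blob, offset)
    offset += 8 * ndim  # each dimension is an 8-byte integer
    count = 1
    for dim in shape:
        count *= dim
    values = list(struct.unpack_from(f"<{count}{_FORMATS[dtype]}", blob, offset))
    return dtype, shape, values
```

Because the header carries the dtype and shape, the receiver needs no knowledge of the sender's array type, which is the property the paragraph above describes.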

For more information on the tensor data structure, see the documentation.

Datasets

Datasets are collections of Tensors and associated metadata. The Dataset class is a user space object that can be created, added to, sent to, and retrieved from the Orchestrator.
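The idea of grouping tensors with metadata under one name can be sketched with plain Python objects. The classes below, `ToyDataset` and `ToyStore`, are stand-ins invented for this illustration; they are not the SmartRedis Dataset class or client, and a dictionary takes the place of the Orchestrator.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyDataset:
    """Named group of tensors plus metadata (stand-in, not the SmartRedis class)."""
    name: str
    tensors: Dict[str, List[float]] = field(default_factory=dict)
    metadata: Dict[str, List[str]] = field(default_factory=dict)

    def add_tensor(self, key: str, values: List[float]) -> None:
        self.tensors[key] = list(values)

    def add_meta(self, key: str, value: str) -> None:
        # Metadata fields accumulate values under the same key.
        self.metadata.setdefault(key, []).append(value)


class ToyStore:
    """Dictionary standing in for the database."""

    def __init__(self) -> None:
        self._data: Dict[str, ToyDataset] = {}

    def put_dataset(self, dataset: ToyDataset) -> None:
        # Store a copy so later local changes do not alter the stored version.
        self._data[dataset.name] = copy.deepcopy(dataset)

    def get_dataset(self, name: str) -> ToyDataset:
        return copy.deepcopy(self._data[name])


store = ToyStore()
ds = ToyDataset("timestep_0")
ds.add_tensor("pressure", [1.0, 2.0, 3.0])
ds.add_meta("units", "Pa")
store.put_dataset(ds)
print(store.get_dataset("timestep_0").metadata)  # {'units': ['Pa']}
```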

For an example of how to use the Dataset class, see the Online Analysis example.

For more information on the API, see the API documentation.

SmartSim + SmartRedis Tutorials

SmartSim and SmartRedis were designed to work together. When launched through SmartSim, applications using the SmartRedis clients are directly connected to any Orchestrator launched in the same Experiment.

In this way, a SmartSim Experiment becomes a driver for coupled ML and Simulation workflows. The following are simple examples of how to use SmartSim and SmartRedis together.

Run the Tutorials

Each tutorial is a Jupyter notebook that can be run through the SmartSim Tutorials Docker image, which starts JupyterLab with the tutorials, SmartSim, and SmartRedis installed.
