xLSTM: Extended Long Short-Term Memory
About
xLSTM is a new Recurrent Neural Network architecture based on the ideas of the original LSTM. Through exponential gating with appropriate normalization and stabilization techniques, and a new matrix memory, it overcomes the limitations of the original LSTM and shows promising performance on language modeling when compared to Transformers or State Space Models.
Minimal Installation
Create a conda environment from the file environment_pt220cu121.yaml. Install only the model code (i.e. the module xlstm) as a package:
Install via pip:
pip install xlstm
Clone from GitHub:
git clone https://github.com/NX-AI/xlstm.git
cd xlstm
pip install -e .
Requirements
This package is based on PyTorch and has been tested for versions >=1.8. For the CUDA version of sLSTM, you need a GPU with Compute Capability >= 8.0, see https://developer.nvidia.com/cuda-gpus. For a well-tested environment, install environment_pt220cu121.yaml as follows:
conda env create -n xlstm -f environment_pt220cu121.yaml
conda activate xlstm
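If you are unsure whether your GPU meets the requirement for the CUDA sLSTM kernels, a minimal check using PyTorch's standard torch.cuda.get_device_capability looks like this (a sketch, assuming a CUDA-capable device is visible):
import torch
# The CUDA backend of sLSTM requires Compute Capability >= 8.0.
major, minor = torch.cuda.get_device_capability()
print(f"Compute Capability: {major}.{minor}")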
Usage
For non-language applications, or for integration into other architectures, you can use the xLSTMBlockStack; for language modeling or other token-based applications, you can use the xLSTMLMModel.
xLSTM Block Stack
The xLSTMBlockStack is meant to be used as an alternative backbone in existing projects. It is similar to a stack of Transformer blocks, but uses xLSTM blocks instead:
import torch
from xlstm import (
    xLSTMBlockStack,
    xLSTMBlockStackConfig,
    mLSTMBlockConfig,
    mLSTMLayerConfig,
    sLSTMBlockConfig,
    sLSTMLayerConfig,
    FeedForwardConfig,
)
cfg = xLSTMBlockStackConfig(
    mlstm_block=mLSTMBlockConfig(
        mlstm=mLSTMLayerConfig(
            conv1d_kernel_size=4, qkv_proj_blocksize=4, num_heads=4
        )
    ),
    slstm_block=sLSTMBlockConfig(
        slstm=sLSTMLayerConfig(
            backend="cuda",
            num_heads=4,
            conv1d_kernel_size=4,
            bias_init="powerlaw_blockdependent",
        ),
        feedforward=FeedForwardConfig(proj_factor=1.3, act_fn="gelu"),
    ),
    context_length=256,
    num_blocks=7,
    embedding_dim=128,
    slstm_at=[1],
)
xlstm_stack = xLSTMBlockStack(cfg)
x = torch.randn(4, 256, 128).to("cuda")
xlstm_stack = xlstm_stack.to("cuda")
y = xlstm_stack(x)
y.shape == x.shape
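As a sketch of the drop-in-backbone use case, the stack can be wrapped with a task-specific head. The pooling and classifier below are illustrative choices, not part of the xlstm package; the snippet reuses the xlstm_stack from above:
import torch
from torch import nn

class SequenceClassifier(nn.Module):
    """Illustrative wrapper: xLSTM backbone followed by a linear head."""
    def __init__(self, backbone, embedding_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(embedding_dim, num_classes)

    def forward(self, x):
        h = self.backbone(x)             # (batch, seq_len, embedding_dim)
        return self.head(h.mean(dim=1))  # mean-pool over time, then classify

model = SequenceClassifier(xlstm_stack, embedding_dim=128, num_classes=10).to("cuda")
logits = model(torch.randn(4, 256, 128, device="cuda"))  # shape (4, 10)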
In case you are working with yaml strings/files for configuration, you can also use dacite to create the config dataclasses. The following snippet is equivalent to the code above:
from omegaconf import OmegaConf
from dacite import from_dict
from dacite import Config as DaciteConfig
from xlstm import xLSTMBlockStack, xLSTMBlockStackConfig
xlstm_cfg = """
mlstm_block:
  mlstm:
    conv1d_kernel_size: 4
    qkv_proj_blocksize: 4
    num_heads: 4
slstm_block:
  slstm:
    backend: cuda
    num_heads: 4
    conv1d_kernel_size: 4
    bias_init: powerlaw_blockdependent
  feedforward:
    proj_factor: 1.3
    act_fn: gelu
context_length: 256
num_blocks: 7
embedding_dim: 128
slstm_at: [1]
"""
cfg = OmegaConf.create(xlstm_cfg)
cfg = from_dict(data_class=xLSTMBlockStackConfig, data=OmegaConf.to_container(cfg), config=DaciteConfig(strict=True))
xlstm_stack = xLSTMBlockStack(cfg)
x = torch.randn(4, 256, 128).to("cuda")
xlstm_stack = xlstm_stack.to("cuda")
y = xlstm_stack(x)
y.shape == x.shape
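The same configuration can also be kept in a YAML file and loaded with OmegaConf.load instead of being embedded as a string. The file name below is only an illustrative placeholder:
from omegaconf import OmegaConf
from dacite import from_dict
from dacite import Config as DaciteConfig
from xlstm import xLSTMBlockStack, xLSTMBlockStackConfig

# "xlstm_stack.yaml" is a hypothetical file holding the config shown above.
cfg = OmegaConf.load("xlstm_stack.yaml")
cfg = from_dict(data_class=xLSTMBlockStackConfig, data=OmegaConf.to_container(cfg), config=DaciteConfig(strict=True))
xlstm_stack = xLSTMBlockStack(cfg)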
xLSTM Language Model
The xLSTMLMModel is a wrapper around the xLSTMBlockStack that adds the token embedding and the lm head.
from omegaconf import OmegaConf
from dacite import from_dict
from dacite import Config as DaciteConfig
from xlstm import xLSTMLMModel, xLSTMLMModelConfig
xlstm_cfg = """
vocab_size: 50304
mlstm_block:
  mlstm:
    conv1d_kernel_size: 4
    qkv_proj_blocksize: 4
    num_heads: 4
slstm_block:
  slstm:
    backend: cuda
    num_heads: 4
    conv1d_kernel_size: 4
    bias_init: powerlaw_blockdependent
  feedforward:
    proj_factor: 1.3
    act_fn: gelu
context_length: 256
num_blocks: 7
embedding_dim: 128
slstm_at: [1]
"""
cfg = OmegaConf.create(xlstm_cfg)
cfg = from_dict(data_class=xLSTMLMModelConfig, data=OmegaConf.to_container(cfg), config=DaciteConfig(strict=True))
xlstm_stack = xLSTMLMModel(cfg)
x = torch.randint(0, 50304, size=(4, 256)).to("cuda")
xlstm_stack = xlstm_stack.to("cuda")
y = xlstm_stack(x)
y.shape[1:] == (256, 50304)
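A minimal sketch of a next-token training step with this model follows; the optimizer choice and the random token batch are illustrative only and reuse the xlstm_stack from the snippet above:
import torch
from torch import nn

optimizer = torch.optim.AdamW(xlstm_stack.parameters(), lr=1e-3)

tokens = torch.randint(0, 50304, size=(4, 257)).to("cuda")
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # shift by one position for next-token prediction

logits = xlstm_stack(inputs)                      # (4, 256, 50304)
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()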
Experiments
The synthetic experiments showcasing the benefits of sLSTM over mLSTM are the Parity task and the Multi-Query Associative Recall task. The Parity task can only be solved with the state-tracking capabilities provided by sLSTM, whereas the Multi-Query Associative Recall task tests memorization capabilities, where the matrix memory and state expansion of mLSTM are very beneficial. In combination, the two do well on both tasks.
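For illustration, parity data is simple to generate; the sketch below is hypothetical and not the data pipeline used by the experiment scripts:
import torch

def parity_batch(batch_size, seq_len):
    """Random 0/1 sequences; label is 1 if the number of ones is odd."""
    bits = torch.randint(0, 2, (batch_size, seq_len))
    labels = bits.sum(dim=1) % 2
    return bits, labels

x, y = parity_batch(batch_size=4, seq_len=64)  # x: (4, 64), y: (4,)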
To run each of the experiments, run main.py in the experiments folder:
python experiments/main.py --config experiments/parity_xLSTM01.yaml # xLSTM[0:1], sLSTM only
python experiments/main.py --config experiments/parity_xLSTM10.yaml # xLSTM[1:0], mLSTM only
python experiments/main.py --config experiments/parity_xLSTM11.yaml # xLSTM[1:1], mLSTM and sLSTM
Note that the training loop does not include early stopping or test evaluation.
Citation
If you use this codebase, or otherwise find our work valuable, please cite the xLSTM paper:
@article{xlstm,
  title={xLSTM: Extended Long Short-Term Memory},
  author={Beck, Maximilian and P{\"o}ppel, Korbinian and Spanring, Markus and Auer, Andreas and Prudnikova, Oleksandra and Kopp, Michael and Klambauer, G{\"u}nter and Brandstetter, Johannes and Hochreiter, Sepp},
  journal={arXiv preprint arXiv:2405.04517},
  year={2024}
}