fast_rnnt

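This project implements a faster and more memory-efficient way to compute the RNN-T loss, called "pruned rnnt".

Note: the k2 project also contains a fast RNN-T loss implementation, which is identical to the code here. We keep fast_rnnt as a standalone project for users who need only this RNN-T loss.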

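How does pruned-rnnt work?

We first use a simple joiner network, which is just the sum of the encoder and decoder outputs, to obtain pruning bounds for the RNN-T recursion. We then use those pruning bounds to evaluate the full, non-linear joiner network.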
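
As a rough illustration of the simple joiner, the sketch below adds one encoder vector and one decoder vector and normalizes the sum with a log-softmax. It is plain Python for readability. The names simple_joiner, am_t and lm_s and the toy values are assumptions made for this example and are not part of the fast_rnnt API.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def simple_joiner(am_t, lm_s):
    """Hypothetical trivial joiner: add the two vectors, then normalize."""
    return log_softmax([a + l for a, l in zip(am_t, lm_s)])


# Toy encoder output for one frame and decoder output for one symbol position
am_t = [0.5, -1.0, 2.0]
lm_s = [1.0, 0.0, -0.5]
log_probs = simple_joiner(am_t, lm_s)
print(log_probs)
```

Because this joiner has no non-linearity, it is cheap to evaluate for every (frame, symbol position) pair, which is what makes it suitable for finding the pruning bounds.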

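The figure below shows the gradients of the lattice nodes (obtained with rnnt_loss_simple and return_grad=True). At each time frame, only a small set of nodes has a non-zero gradient. This justifies the pruned RNN-T loss, i.e. putting a limit on the number of symbols per frame.

This picture is taken from here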
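
To make the effect of limiting the symbols per frame concrete, here is a small arithmetic sketch that counts the elements of the joiner output with and without pruning. It assumes the full joiner output covers S + 1 symbol positions per frame and the pruned one covers s_range positions per frame. The sizes below are hypothetical and chosen only for illustration.

```python
def joiner_output_elements(B, T, S, C, s_range=None):
    """Number of elements in the joiner output.

    Without pruning every frame is combined with S + 1 symbol positions;
    with pruning only s_range positions per frame are kept (assumption).
    """
    positions_per_frame = S + 1 if s_range is None else s_range
    return B * T * positions_per_frame * C


# Hypothetical sizes: batch, frames, symbols per utterance, vocabulary size
B, T, S, C = 8, 500, 100, 500
full = joiner_output_elements(B, T, S, C)
pruned = joiner_output_elements(B, T, S, C, s_range=5)
print(full, pruned, full / pruned)
```

With these numbers the pruned joiner output is (S + 1) / s_range times smaller than the full one.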

Installation

You can install it via pip:

pip install fast_rnnt

You can also install from source:

git clone https://github.com/danpovey/fast_rnnt.git
cd fast_rnnt
python setup.py install

To check that fast_rnnt was installed successfully, run

python3 -c "import fast_rnnt; print(fast_rnnt.__version__)"

This should print the installed version of fast_rnnt, e.g. 1.0.

How to display the installation log?

Use

pip install --verbose fast_rnnt

How to reduce installation time?

Use

export FT_MAKE_ARGS="-j"
pip install --verbose fast_rnnt

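This passes -j to make.

Which versions of PyTorch are supported?

It has been tested on PyTorch >= 1.5.0.

Note: the CUDA version of PyTorch should match the CUDA version in your environment; otherwise compilation will fail.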

How to install a CPU version of `fast_rnnt`?

Use

export FT_CMAKE_ARGS="-DCMAKE_BUILD_TYPE=Release -DFT_WITH_CUDA=OFF"
export FT_MAKE_ARGS="-j"
pip install --verbose fast_rnnt

This passes -DCMAKE_BUILD_TYPE=Release -DFT_WITH_CUDA=OFF to cmake.

Where to get help if I run into problems during installation?

Please file an issue at https://github.com/danpovey/fast_rnnt/issues and describe your problem there.

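Usage

rnnt_loss_simple

This is the simple case of the RNN-T loss, where the joiner network is just addition.

Note: termination_symbol plays the role of blank in other RNN-T loss implementations. We call it termination_symbol because it terminates the symbols of the current frame.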

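import torch
import fast_rnnt

# Example sizes: batch, frames, symbols per utterance, vocabulary size
B, T, S, C = 2, 10, 5, 20
# In this toy example every utterance uses the full length
target_lengths = torch.full((B,), S, dtype=torch.int64)
num_frames = torch.full((B,), T, dtype=torch.int64)

am = torch.randn((B, T, C), dtype=torch.float32)
lm = torch.randn((B, S + 1, C), dtype=torch.float32)
symbols = torch.randint(0, C, (B, S))
termination_symbol = 0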

boundary = torch.zeros((B, 4), dtype=torch.int64)
boundary[:, 2] = target_lengths
boundary[:, 3] = num_frames

loss = fast_rnnt.rnnt_loss_simple(
    lm=lm,
    am=am,
    symbols=symbols,
    termination_symbol=termination_symbol,
    boundary=boundary,
    reduction="sum",
)
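
The boundary tensor in the snippet above has one row of four integers per utterance. The sketch below builds such rows in plain Python, assuming the column layout is [begin_symbol, begin_frame, end_symbol, end_frame], which matches the snippet writing target_lengths to column 2 and num_frames to column 3. The helper make_boundary is hypothetical and not part of fast_rnnt.

```python
def make_boundary(target_lengths, num_frames):
    """Build one [begin_symbol, begin_frame, end_symbol, end_frame] row
    per utterance (assumed layout), with both begin values set to 0."""
    if len(target_lengths) != len(num_frames):
        raise ValueError("need one target length and one frame count per utterance")
    return [[0, 0, s, t] for s, t in zip(target_lengths, num_frames)]


# Two utterances: 3 symbols over 10 frames, 5 symbols over 12 frames
rows = make_boundary([3, 5], [10, 12])
print(rows)
```

The rows can then be turned into the tensor expected by the loss with torch.tensor(rows, dtype=torch.int64).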

rnnt_loss_smoothed

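The same as rnnt_loss_simple, except that it supports am_only and lm_only smoothing, which lets the loss function take the form:

      lm_only_scale * lm_probs +
      am_only_scale * am_probs +
      (1-lm_only_scale-am_only_scale) * combined_probs

where lm_probs and am_probs are the probabilities given only the language model and only the acoustic model, respectively.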

am = torch.randn((B, T, C), dtype=torch.float32)
lm = torch.randn((B, S + 1, C), dtype=torch.float32)
symbols = torch.randint(0, C, (B, S))
termination_symbol = 0

boundary = torch.zeros((B, 4), dtype=torch.int64)
boundary[:, 2] = target_lengths
boundary[:, 3] = num_frames

loss = fast_rnnt.rnnt_loss_smoothed(
    lm=lm,
    am=am,
    symbols=symbols,
    termination_symbol=termination_symbol,
    lm_only_scale=0.25,
    am_only_scale=0.0,
    boundary=boundary,
    reduction="sum",
)
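
The interpolation used by rnnt_loss_smoothed can be sketched as a weighted sum whose three weights add up to 1. The function smoothed_score and the input values are hypothetical. The sketch only illustrates the formula given above and says nothing about how the library computes the individual terms.

```python
def smoothed_score(lm_score, am_score, combined_score,
                   lm_only_scale, am_only_scale):
    """Interpolate LM-only, AM-only and combined scores (illustrative)."""
    combined_scale = 1.0 - lm_only_scale - am_only_scale
    if min(lm_only_scale, am_only_scale, combined_scale) < 0.0:
        raise ValueError("scales must be non-negative and sum to at most 1")
    return (lm_only_scale * lm_score
            + am_only_scale * am_score
            + combined_scale * combined_score)


# Same scales as the call above: 25% LM-only, 0% AM-only, 75% combined
value = smoothed_score(-2.0, -1.0, -0.5, lm_only_scale=0.25, am_only_scale=0.0)
print(value)
```

Setting both scales to 0.0 reduces the expression to the combined term alone, which corresponds to the rnnt_loss_simple case.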

rnnt_loss_pruned

rnnt_loss_pruned cannot be used on its own: it needs the gradients returned by rnnt_loss_simple/rnnt_loss_smoothed to obtain the pruning bounds.

am = torch.randn((B, T, C), dtype=torch.float32)
lm = torch.randn((B, S + 1, C), dtype=torch.float32)
symbols = torch.randint(0, C, (B, S))
termination_symbol = 0

boundary = torch.zeros((B, 4), dtype=torch.int64)
boundary[:, 2] = target_lengths
boundary[:, 3] = num_frames

# rnnt_loss_simple can be replaced by rnnt_loss_smoothed here
simple_loss, (px_grad, py_grad) = fast_rnnt.rnnt_loss_simple(
    lm=lm,
    am=am,
    symbols=symbols,
    termination_symbol=termination_symbol,
    boundary=boundary,
    reduction="sum",
    return_grad=True,
)
s_range = 5  # can be another value
ranges = fast_rnnt.get_rnnt_prune_ranges(
    px_grad=px_grad,
    py_grad=py_grad,
    boundary=boundary,
    s_range=s_range,
)

am_pruned, lm_pruned = fast_rnnt.do_rnnt_pruning(am=am, lm=lm, ranges=ranges)

logits = model.joiner(am_pruned, lm_pruned)
pruned_loss = fast_rnnt.rnnt_loss_pruned(
    logits=logits,
    symbols=symbols,
    ranges=ranges,
    termination_symbol=termination_symbol,
    boundary=boundary,
    reduction="sum",
)

You can also find recipes here that use rnnt_loss_pruned to train a model.
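
To give an intuition for how gradients can be turned into pruning bounds, the sketch below picks, for every frame, the window of s_range consecutive symbol positions that carries the most gradient mass. This is a simplified illustration under stated assumptions and not the implementation of get_rnnt_prune_ranges, which may apply additional constraints that are not modelled here. The function best_window_starts and the toy occupation values are hypothetical.

```python
def best_window_starts(occupation, s_range):
    """For each frame, return the start of the s_range-wide window of
    symbol positions with the largest total gradient mass.

    occupation[t][s] is the (toy) gradient mass at frame t, position s.
    """
    starts = []
    for row in occupation:
        n_windows = len(row) - s_range + 1
        if n_windows < 1:
            raise ValueError("s_range is larger than the number of positions")
        sums = [sum(row[i:i + s_range]) for i in range(n_windows)]
        # max() returns the first maximum, so ties go to the earliest window
        starts.append(max(range(n_windows), key=sums.__getitem__))
    return starts


# Toy lattice: 4 frames, 5 symbol positions; the mass moves to the right over time
occupation = [
    [0.9, 0.1, 0.0, 0.0, 0.0],
    [0.2, 0.7, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.6, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.6],
]
starts = best_window_starts(occupation, s_range=2)
print(starts)
```

Each start index, together with s_range, describes which symbol positions would be kept for that frame in this simplified picture.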

rnnt_loss

The unpruned rnnt_loss is the same as torchaudio's rnnt_loss: it produces the same output as torchaudio for the same input.

# Joiner output: (batch, frames, symbols + 1, classes)
logits = torch.randn((B, T, S + 1, C), dtype=torch.float32)
symbols = torch.randint(0, C, (B, S))
termination_symbol = 0

boundary = torch.zeros((B, 4), dtype=torch.int64)
boundary[:, 2] = target_lengths
boundary[:, 3] = num_frames

loss = fast_rnnt.rnnt_loss(
    logits=logits,
    symbols=symbols,
    termination_symbol=termination_symbol,
    boundary=boundary,
    reduction="sum",
)
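
For reference, the quantity that an unpruned RNN-T loss computes can be written as a short forward recursion over the lattice. The sketch below is a plain-Python version of the textbook recursion for a single utterance, intended for checking tiny examples by hand. It is not the fast_rnnt implementation, and rnnt_loss_reference is a hypothetical name.

```python
import math


def logaddexp(a, b):
    """log(exp(a) + exp(b)), safe when either argument is -inf."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rnnt_loss_reference(log_probs, symbols, blank=0):
    """Negative log-likelihood of `symbols` for one utterance.

    log_probs[t][s][c] is the log-probability of class c at frame t after
    s symbols have been emitted; its shape is (T, S + 1, C).
    """
    T, S = len(log_probs), len(symbols)
    alpha = [[-math.inf] * (S + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for s in range(S + 1):
            if t > 0:  # arrive by emitting blank at frame t - 1
                alpha[t][s] = logaddexp(
                    alpha[t][s], alpha[t - 1][s] + log_probs[t - 1][s][blank])
            if s > 0:  # arrive by emitting symbol s - 1 at frame t
                alpha[t][s] = logaddexp(
                    alpha[t][s],
                    alpha[t][s - 1] + log_probs[t][s - 1][symbols[s - 1]])
    # A final blank terminates the last frame
    return -(alpha[T - 1][S] + log_probs[T - 1][S][blank])


# Uniform distribution over 2 classes, 2 frames, 1 target symbol
uniform = [[[math.log(0.5)] * 2 for _ in range(2)] for _ in range(2)]
loss = rnnt_loss_reference(uniform, symbols=[1])
print(loss)
```

In this toy case there are two alignments, each with probability 0.5 ** 3, so the total probability is 0.25 and the loss is log(4).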

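Benchmarking

This repo compares the speed and memory usage of several transducer losses. The summary in the table below is taken from that repo; see it for more details.

Note: as mentioned above, fast_rnnt is also implemented in the k2 project, so k2 and fast_rnnt are equivalent in the benchmark.

Name	Average step time (us)	Peak memory usage (MB)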
torchaudio	601447	12959.2
fast_rnnt(unpruned)	274407	15106.5
fast_rnnt(pruned)	38112	2647.8
optimized_transducer	567684	10903.1
warprnnt_numba	229340	13061.8
warp-transducer	210772	13061.8
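
The numbers in the table are easier to compare as ratios. The short script below copies the table values and expresses each entry relative to fast_rnnt(pruned). It is only arithmetic on the figures above and introduces no new measurements.

```python
# (average step time in us, peak memory in MB), copied from the table above
results = {
    "torchaudio": (601447, 12959.2),
    "fast_rnnt(unpruned)": (274407, 15106.5),
    "fast_rnnt(pruned)": (38112, 2647.8),
    "optimized_transducer": (567684, 10903.1),
    "warprnnt_numba": (229340, 13061.8),
    "warp-transducer": (210772, 13061.8),
}

base_time, base_mem = results["fast_rnnt(pruned)"]
# Ratio of each entry to the pruned loss: values above 1 mean slower or larger
relative = {
    name: (time / base_time, mem / base_mem)
    for name, (time, mem) in results.items()
}

for name, (time_ratio, mem_ratio) in relative.items():
    print(f"{name:22s} {time_ratio:6.2f}x time  {mem_ratio:5.2f}x memory")
```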