Deploy BEV 3D Detection on TensorRT

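This repository deploys BEV 3D detection models (BEVFormer and BEVDet) on TensorRT, with support for FP32, FP16, and INT8 inference. To speed up BEVFormer inference on TensorRT, the project also implements several TensorRT ops that support nv_half, nv_half2, and INT8. With almost no loss of accuracy, BEVFormer base runs more than four times faster, its engine is more than 90% smaller, and its GPU memory usage drops by more than 80%. The project also supports common 2D object detection models from MMDetection, which need only minor code changes for INT8 quantization and TensorRT deployment.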

Benchmarks

BEVFormer

BEVFormer PyTorch

Model	Dataset	Batch Size	NDS/mAP	FPS	Size (MB)	Memory (MB)	Device
BEVFormer tiny (download)	NuScenes	1	NDS: 0.354 mAP: 0.252	15.9	383	2167	RTX 3090
BEVFormer small (download)	NuScenes	1	NDS: 0.478 mAP: 0.370	5.1	680	3147	RTX 3090
BEVFormer base (download)	NuScenes	1	NDS: 0.517 mAP: 0.416	2.4	265	5435	RTX 3090

BEVFormer TensorRT with MMDeploy Plugins (plugins support FP32 only)

Model	Dataset	Batch Size	Float/Int	Quantization Method	NDS/mAP	FPS	Size (MB)	Memory (MB)	Device
BEVFormer tiny	NuScenes	1	FP32	-	NDS: 0.354 mAP: 0.252	37.9 (x1)	136 (x1)	2159 (x1)	RTX 3090
BEVFormer tiny	NuScenes	1	FP16	-	NDS: 0.354 mAP: 0.252	69.2 (x1.83)	74 (x0.54)	1729 (x0.80)	RTX 3090
BEVFormer tiny	NuScenes	1	FP32/INT8	PTQ entropy per-tensor	NDS: 0.353 mAP: 0.249	65.1 (x1.72)	58 (x0.43)	1737 (x0.80)	RTX 3090
BEVFormer tiny	NuScenes	1	FP16/INT8	PTQ entropy per-tensor	NDS: 0.353 mAP: 0.249	70.7 (x1.87)	54 (x0.40)	1665 (x0.77)	RTX 3090
BEVFormer small	NuScenes	1	FP32	-	NDS: 0.478 mAP: 0.370	6.6 (x1)	245 (x1)	4663 (x1)	RTX 3090
BEVFormer small	NuScenes	1	FP16	-	NDS: 0.478 mAP: 0.370	12.8 (x1.94)	126 (x0.51)	3719 (x0.80)	RTX 3090
BEVFormer small	NuScenes	1	FP32/INT8	PTQ entropy per-tensor	NDS: 0.476 mAP: 0.367	8.7 (x1.32)	158 (x0.64)	4079 (x0.87)	RTX 3090
BEVFormer small	NuScenes	1	FP16/INT8	PTQ entropy per-tensor	NDS: 0.477 mAP: 0.368	13.3 (x2.02)	106 (x0.43)	3441 (x0.74)	RTX 3090
BEVFormer base *	NuScenes	1	FP32	-	NDS: 0.517 mAP: 0.416	1.5 (x1)	1689 (x1)	13893 (x1)	RTX 3090
BEVFormer base	NuScenes	1	FP16	-	NDS: 0.517 mAP: 0.416	1.8 (x1.20)	849 (x0.50)	11865 (x0.85)	RTX 3090
BEVFormer base *	NuScenes	1	FP32/INT8	PTQ entropy per-tensor	NDS: 0.516 mAP: 0.414	1.8 (x1.20)	426 (x0.25)	12429 (x0.89)	RTX 3090
BEVFormer base *	NuScenes	1	FP16/INT8	PTQ entropy per-tensor	NDS: 0.515 mAP: 0.414	2.2 (x1.47)	244 (x0.14)	11011 (x0.79)	RTX 3090

* onnx2trt fails with an out-of-memory error under TensorRT-8.5.1.7 but converts successfully under TensorRT-8.4.3.1, so these engines were built with TensorRT-8.4.3.1.

BEVFormer TensorRT with Custom Plugins (support nv_half, nv_half2, and INT8)

FP16 Plugins with nv_half

Model	Dataset	Batch Size	Float/Int	Quantization Method	NDS/mAP	FPS (speedup)	Size (MB)	Memory (MB)	Device
BEVFormer tiny	NuScenes	1	FP32	-	NDS: 0.354 mAP: 0.252	40.0 (x1.06)	135 (x0.99)	1693 (x0.78)	RTX 3090
BEVFormer tiny	NuScenes	1	FP16	-	NDS: 0.355 mAP: 0.252	81.2 (x2.14)	70 (x0.51)	1203 (x0.56)	RTX 3090
BEVFormer tiny	NuScenes	1	FP32/INT8	PTQ entropy per-tensor	NDS: 0.351 mAP: 0.249	90.1 (x2.38)	58 (x0.43)	1105 (x0.51)	RTX 3090
BEVFormer tiny	NuScenes	1	FP16/INT8	PTQ entropy per-tensor	NDS: 0.351 mAP: 0.249	107.4 (x2.83)	52 (x0.38)	1095 (x0.51)	RTX 3090
BEVFormer small	NuScenes	1	FP32	-	NDS: 0.478 mAP: 0.37	7.4 (x1.12)	250 (x1.02)	2585 (x0.55)	RTX 3090
BEVFormer small	NuScenes	1	FP16	-	NDS: 0.479 mAP: 0.37	15.8 (x2.40)	127 (x0.52)	1729 (x0.37)	RTX 3090
BEVFormer small	NuScenes	1	FP32/INT8	PTQ entropy per-tensor	NDS: 0.477 mAP: 0.367	17.9 (x2.71)	166 (x0.68)	1637 (x0.35)	RTX 3090
BEVFormer small	NuScenes	1	FP16/INT8	PTQ entropy per-tensor	NDS: 0.476 mAP: 0.366	20.4 (x3.10)	108 (x0.44)	1467 (x0.31)	RTX 3090
BEVFormer base	NuScenes	1	FP32	-	NDS: 0.517 mAP: 0.416	3.0 (x2.00)	292 (x0.17)	5715 (x0.41)	RTX 3090
BEVFormer base	NuScenes	1	FP16	-	NDS: 0.517 mAP: 0.416	4.9 (x3.27)	148 (x0.09)	3417 (x0.25)	RTX 3090
BEVFormer base	NuScenes	1	FP32/INT8	PTQ entropy per-tensor	NDS: 0.515 mAP: 0.414	6.9 (x4.60)	202 (x0.12)	3307 (x0.24)	RTX 3090
BEVFormer base	NuScenes	1	FP16/INT8	PTQ entropy per-tensor	NDS: 0.514 mAP: 0.413	8.0 (x5.33)	131 (x0.08)	2429 (x0.17)	RTX 3090

FP16 Plugins with nv_half2

Model	Dataset	Batch Size	Float/Int	Quantization Method	NDS/mAP	FPS	Size (MB)	Memory (MB)	Device
BEVFormer tiny	NuScenes	1	FP16	-	NDS: 0.355 mAP: 0.251	84.2 (x2.22)	72 (x0.53)	1205 (x0.56)	RTX 3090
BEVFormer tiny	NuScenes	1	FP16/INT8	PTQ entropy per-tensor	NDS: 0.354 mAP: 0.250	108.3 (x2.86)	52 (x0.38)	1093 (x0.51)	RTX 3090
BEVFormer small	NuScenes	1	FP16	-	NDS: 0.479 mAP: 0.371	18.6 (x2.82)	124 (x0.51)	1725 (x0.37)	RTX 3090
BEVFormer small	NuScenes	1	FP16/INT8	PTQ entropy per-tensor	NDS: 0.477 mAP: 0.368	22.9 (x3.47)	110 (x0.45)	1487 (x0.32)	RTX 3090
BEVFormer base	NuScenes	1	FP16	-	NDS: 0.517 mAP: 0.416	6.6 (x4.40)	146 (x0.09)	3415 (x0.25)	RTX 3090
BEVFormer base	NuScenes	1	FP16/INT8	PTQ entropy per-tensor	NDS: 0.516 mAP: 0.415	8.6 (x5.73)	159 (x0.09)	2479 (x0.18)	RTX 3090
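
The multipliers in parentheses appear to be each value divided by the matching FP32 row of the MMDeploy-plugin table. The Python sketch below reproduces that arithmetic for BEVFormer base with the FP16/INT8 nv_half2 plugins, using figures copied from the tables above. The helper name `relative_factors` is only illustrative and is not part of this repository.

```python
# Baseline: BEVFormer base, FP32, MMDeploy plugins (values from the tables above).
BASELINE = {"fps": 1.5, "size_mb": 1689, "memory_mb": 13893}
# Candidate: BEVFormer base, FP16/INT8, custom nv_half2 plugins.
CANDIDATE = {"fps": 8.6, "size_mb": 159, "memory_mb": 2479}


def relative_factors(baseline, candidate):
    """Return candidate / baseline for every metric, rounded to two decimals."""
    return {key: round(candidate[key] / baseline[key], 2) for key in baseline}


factors = relative_factors(BASELINE, CANDIDATE)
print(factors)
```

The resulting factors match the x5.73, x0.09, and x0.18 entries in the last row, which is where the "more than four times faster" and "more than 90% smaller" figures in the introduction come from.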

BEVDet

BEVDet PyTorch

Model	Dataset	Batch Size	NDS/mAP	FPS	Size (MB)	Memory (MB)	Device
BEVDet R50 CBGS	NuScenes	1	NDS: 0.38 mAP: 0.298	29.0	170	1858	RTX 2080Ti

BEVDet TensorRT

Uses the custom plugin bev_pool_v2 (supports nv_half, nv_half2, and INT8), modified from the official BEVDet implementation.

Model	Dataset	Batch Size	Float/Int	Quantization Method	NDS/mAP	FPS	Size (MB)	Memory (MB)	Device
BEVDet R50 CBGS	NuScenes	1	FP32	-	NDS: 0.38 mAP: 0.298	44.6	245	1032	RTX 2080Ti
BEVDet R50 CBGS	NuScenes	1	FP16	-	NDS: 0.38 mAP: 0.298	135.1	86	790	RTX 2080Ti
BEVDet R50 CBGS	NuScenes	1	FP32/INT8	PTQ entropy per-tensor	NDS: 0.355 mAP: 0.274	234.7	44	706	RTX 2080Ti
BEVDet R50 CBGS	NuScenes	1	FP16/INT8	PTQ entropy per-tensor	NDS: 0.357 mAP: 0.277	236.4	44	706	RTX 2080Ti

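2D Detection Models

This project also supports common 2D object detection models from MMDetection with only minor modifications. Deployment examples for YOLOx and CenterNet follow.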

YOLOx

Model	Dataset	Framework	Batch Size	Float/Int	Quantization Method	mAP	FPS	Size (MB)	Memory (MB)	Device
YOLOx (download)	COCO	PyTorch	32	FP32	-	mAP: 0.506	63.1	379	7617	RTX 3090
YOLOx	COCO	TensorRT	32	FP32	-	mAP: 0.506	71.3 (x1)	546 (x1)	9943 (x1)	RTX 3090
YOLOx	COCO	TensorRT	32	FP16	-	mAP: 0.506	296.8 (x4.16)	192 (x0.35)	4567 (x0.46)	RTX 3090
YOLOx	COCO	TensorRT	32	FP32/INT8	PTQ entropy per-tensor	mAP: 0.488	556.4 (x7.80)	99 (x0.18)	5225 (x0.53)	RTX 3090
YOLOx	COCO	TensorRT	32	FP16/INT8	PTQ entropy per-tensor	mAP: 0.479	550.6 (x7.72)	99 (x0.18)	5119 (x0.51)	RTX 3090

CenterNet

Model	Dataset	Framework	Batch Size	Float/Int	Quantization Method	mAP	FPS	Size (MB)	Memory (MB)	Device
CenterNet (download)	COCO	PyTorch	32	FP32	-	mAP: 0.299	337.4	56	5171	RTX 3090
CenterNet	COCO	TensorRT	32	FP32	-	mAP: 0.299	475.6 (x1)	58 (x1)	8241 (x1)	RTX 3090
CenterNet	COCO	TensorRT	32	FP16	-	mAP: 0.297	1247.1 (x2.62)	29 (x0.50)	5183 (x0.63)	RTX 3090
CenterNet	COCO	TensorRT	32	FP32/INT8	PTQ entropy per-tensor	mAP: 0.27	1534.0 (x3.22)	20 (x0.34)	6549 (x0.79)	RTX 3090
CenterNet	COCO	TensorRT	32	FP16/INT8	PTQ entropy per-tensor	mAP: 0.285	1889.0 (x3.97)	17 (x0.29)	6453 (x0.78)	RTX 3090

Clone

git clone git@github.com:DerryHub/BEVFormer_tensorrt.git
cd BEVFormer_tensorrt
PROJECT_DIR=$(pwd)

Data Preparation

MS COCO (for 2D detection)

Download the COCO 2017 dataset to /path/to/coco and unzip it.

cd ${PROJECT_DIR}/data
ln -s /path/to/coco coco

NuScenes and CAN bus (for BEVFormer)

Download the nuScenes V1.0 full dataset and the CAN bus expansion data to /path/to/nuscenes and /path/to/can_bus.

Prepare the nuScenes data in the same way as BEVFormer.

cd ${PROJECT_DIR}/data
ln -s /path/to/nuscenes nuscenes
ln -s /path/to/can_bus can_bus

cd ${PROJECT_DIR}
sh samples/bevformer/create_data.sh

Directory Structure

${PROJECT_DIR}/data/.
├── can_bus
│   ├── scene-0001_meta.json
│   ├── scene-0001_ms_imu.json
│   ├── scene-0001_pose.json
│   └── ...
├── coco
│   ├── annotations
│   ├── test2017
│   ├── train2017
│   └── val2017
└── nuscenes
    ├── maps
    ├── samples
    ├── sweeps
    └── v1.0-trainval
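
Before running the conversion scripts, it can help to confirm that the symlinks resolve to the layout shown above. A minimal Python sketch, assuming `PROJECT_DIR` is exported as in the Clone step; the helper `missing_dirs` is hypothetical and not part of this repository.

```python
import os
from pathlib import Path

# Sub-directories listed in the tree above, relative to ${PROJECT_DIR}/data.
EXPECTED_DIRS = [
    "can_bus",
    "coco/annotations", "coco/test2017", "coco/train2017", "coco/val2017",
    "nuscenes/maps", "nuscenes/samples", "nuscenes/sweeps", "nuscenes/v1.0-trainval",
]


def missing_dirs(data_root):
    """Return the expected sub-directories that do not exist under data_root."""
    root = Path(data_root)
    # Path.is_dir() follows symlinks, so `ln -s` targets are checked as well.
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    project_dir = os.environ.get("PROJECT_DIR", ".")
    for rel in missing_dirs(Path(project_dir) / "data"):
        print(f"missing: data/{rel}")
```

An empty output means every directory in the tree was found. The check does not inspect file contents such as the CAN bus JSON files.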

Install

With Docker

cd ${PROJECT_DIR}
docker build -t trt85 -f docker/Dockerfile .
docker run -it --gpus all -v ${PROJECT_DIR}:/workspace/BEVFormer_tensorrt/ \
-v /path/to/can_bus:/workspace/BEVFormer_tensorrt/data/can_bus \
-v /path/to/coco:/workspace/BEVFormer_tensorrt/data/coco \
-v /path/to/nuscenes:/workspace/BEVFormer_tensorrt/data/nuscenes \
--shm-size 8G trt85 /bin/bash

# Inside the container
cd /workspace/BEVFormer_tensorrt/TensorRT/build
cmake .. -DCMAKE_TENSORRT_PATH=/usr
make -j$(nproc)
make install
cd /workspace/BEVFormer_tensorrt/third_party/bev_mmdet3d
python setup.py build develop --user

Note: A prebuilt Docker image is also available for download.

From Source

CUDA/cuDNN/TensorRT

Download and install CUDA-11.6, cuDNN-8.6.0, and TensorRT-8.5.1.7 following NVIDIA's instructions.

PyTorch

Install PyTorch and TorchVision following the official instructions.

pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116

MMCV-full

git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.5.0
pip install -r requirements/optional.txt
MMCV_WITH_OPS=1 pip install -e .

MMDetection

git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
git checkout v2.25.1
pip install -v -e .
# "-v" means verbose output
# "-e" installs the project in editable mode,
# so any local changes to the code take effect without reinstalling.

MMDeploy

git clone git@github.com:open-mmlab/mmdeploy.git
cd mmdeploy
git checkout v0.10.0

git clone git@github.com:NVIDIA/cub.git third_party/cub
cd third_party/cub
git checkout c3cceac115

# Go back to the third_party directory and clone pybind11
cd ..
git clone git@github.com:pybind/pybind11.git pybind11
cd pybind11
git checkout 70a58c5

Build the TensorRT Plugins of MMDeploy

Make sure the cmake version is >= 3.14.0 and the gcc version is >= 7.
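
One way to verify these two prerequisites before starting the build is sketched below in Python. It assumes `cmake` and `gcc` are on `PATH` and that each prints a dotted version number in response to `--version`. The helpers `parse_version` and `tool_version` are illustrative names, not part of MMDeploy.

```python
import re
import subprocess


def parse_version(text):
    """Extract the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def tool_version(command):
    """Run `<command> --version`; return the parsed version or None on failure."""
    try:
        out = subprocess.run(
            [command, "--version"], capture_output=True, text=True, check=True
        )
        return parse_version(out.stdout)
    except (FileNotFoundError, subprocess.CalledProcessError, ValueError):
        return None


if __name__ == "__main__":
    # Minimum versions taken from the requirement stated above.
    for tool, minimum in (("cmake", (3, 14, 0)), ("gcc", (7,))):
        found = tool_version(tool)
        ok = found is not None and found >= minimum
        print(tool, found, "OK" if ok else "too old or missing")
```

Tuples compare element by element, so `(3, 22, 1) >= (3, 14, 0)` holds while `(3, 9, 0) >= (3, 14, 0)` does not, which a plain string comparison would get wrong.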

export MMDEPLOY_DIR=/path/to/mmdeploy
export TENSORRT_DIR=/path/to/tensorrt
export CUDNN_DIR=/path/to/cuda

export LD_LIBRARY_PATH=$TENSORRT_DIR/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$CUDNN_DIR/lib64:$LD_LIBRARY_PATH

cd ${MMDEPLOY_DIR}
mkdir -p build
cd build
cmake -DCMAKE_CXX_COMPILER=g++-7 -DMMDEPLOY_TARGET_BACKENDS=trt -DTENSORRT_DIR=${TENSORRT_DIR} -DCUDNN_DIR=${CUDNN_DIR} ..
make -j$(nproc) 
make install

Install MMDeploy

cd ${MMDEPLOY_DIR}
pip install -v -e .
# "-v" means verbose output
# "-e" installs the project in editable mode,
# so any local changes to the code take effect without reinstalling.

Install This Project

cd ${PROJECT_DIR}
pip install -r requirements.txt

Build and Install the Custom TensorRT Plugins

Note: requires CUDA >= 11.4 and SM version >= 7.5.
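
The requirement can be checked programmatically. The Python sketch below assumes PyTorch is already installed as described earlier and uses `torch.version.cuda`, which reports the CUDA version PyTorch was built against and may differ from the toolkit that compiles the plugins. The function `meets_requirements` is a hypothetical helper.

```python
def meets_requirements(cuda_version, compute_capability):
    """Check CUDA >= 11.4 and SM >= 7.5.

    cuda_version: string such as "11.6".
    compute_capability: (major, minor) tuple, e.g. (8, 6) for an RTX 3090.
    """
    cuda = tuple(int(part) for part in cuda_version.split(".")[:2])
    return cuda >= (11, 4) and tuple(compute_capability) >= (7, 5)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; cannot query the GPU")
    else:
        if torch.cuda.is_available():
            capability = torch.cuda.get_device_capability(0)
            print(meets_requirements(torch.version.cuda, capability))
        else:
            print("no CUDA device visible")
```

Both benchmark GPUs pass this check: the RTX 2080Ti is SM 7.5 and the RTX 3090 is SM 8.6.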

cd ${PROJECT_DIR}/TensorRT/build
cmake .. -DCMAKE_TENSORRT_PATH=/path/to/TensorRT
make -j$(nproc)
make install

Run the Unit Tests of the Custom TensorRT Plugins

cd ${PROJECT_DIR}
sh samples/test_trt_ops.sh

Build and Install Part of the Ops in MMDetection3D

cd ${PROJECT_DIR}/third_party/bev_mmdet3d
python setup.py build develop

Prepare the Checkpoints

Download the PyTorch checkpoints listed above to ${PROJECT_DIR}/checkpoints/pytorch/. The ONNX files and TensorRT engines will be saved in ${PROJECT_DIR}/checkpoints/onnx/ and ${PROJECT_DIR}/checkpoints/tensorrt/.
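
If the three directories do not exist yet, they can be created up front. This is a small Python sketch only: the conversion scripts may well create the output directories themselves, and `ensure_checkpoint_dirs` is a hypothetical helper rather than something shipped with the project.

```python
from pathlib import Path


def ensure_checkpoint_dirs(project_dir):
    """Create checkpoints/{pytorch,onnx,tensorrt} under project_dir if absent."""
    base = Path(project_dir) / "checkpoints"
    created = []
    for name in ("pytorch", "onnx", "tensorrt"):
        path = base / name
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `ensure_checkpoint_dirs(os.environ["PROJECT_DIR"])` after `import os` would apply it to the cloned repository.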

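Custom TensorRT Plugins

Custom plugins are provided for the TensorRT ops commonly used in BEVFormer:

Grid Sampler
Multi-scale Deformable Attention
Modulated Deformable Conv2d
Rotate
Inverse
BEV Pool V2
Flash Multi-Head Attention

Each op is implemented in two versions: FP32/FP16 (nv_half)/INT8 and FP32/FP16 (nv_half2)/INT8.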
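
For readers unfamiliar with the two FP16 variants: in CUDA, a half2 value packs two 16-bit floats into one 32-bit word, so a single instruction can operate on both lanes. The Python sketch below only illustrates that packing with the standard `struct` module. It is not how the plugins are implemented, and the function names are made up for this example.

```python
import struct


def pack_half2(lo, hi):
    """Pack two floats as IEEE 754 half precision into one 32-bit integer."""
    # "<2e" writes two little-endian 16-bit floats; "<I" reads them back as one word.
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]


def unpack_half2(word):
    """Recover the two half-precision lanes from a 32-bit integer."""
    return struct.unpack("<2e", struct.pack("<I", word))


word = pack_half2(1.0, -2.0)
print(hex(word), unpack_half2(word))
```

Here 1.0 occupies the low 16 bits (0x3c00) and -2.0 the high 16 bits (0xc000). Processing elements in pairs like this is a plausible reason the nv_half2 engines in the benchmark tables run somewhat faster than the nv_half ones.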

For a detailed speed comparison, see Custom TensorRT Plugins.

Run

The following tutorial uses BEVFormer base as an example.

Evaluate with PyTorch

cd ${PROJECT_DIR}
# gpu_id defaults to 0
sh samples/bevformer/base/pth_evaluate.sh -d ${gpu_id}

Evaluate with TensorRT and MMDeploy Plugins

# Convert .pth to .onnx
sh samples/bevformer/base/pth2onnx.sh -d ${gpu_id}
# Convert .onnx to a TensorRT engine (FP32)
sh samples/bevformer/base/onnx2trt.sh -d ${gpu_id}
# Convert .onnx to a TensorRT engine (FP16)
sh samples/bevformer/base/onnx2trt_fp16.sh -d ${gpu_id}
# Evaluate with the TensorRT engine (FP32)
sh samples/bevformer/base/trt_evaluate.sh -d ${gpu_id}
# Evaluate with the TensorRT engine (FP16)
sh samples/bevformer/base/trt_evaluate_fp16.sh -d ${gpu_id}

# Quantization
# Calibrate and convert .onnx to a TensorRT engine (FP32/INT8)
sh samples/bevformer/base/onnx2trt_int8.sh -d ${gpu_id}
# Calibrate and convert .onnx to a TensorRT engine (FP16/INT8)
sh samples/bevformer/base/onnx2trt_int8_fp16.sh -d ${gpu_id}
# Evaluate with the TensorRT engine (FP32/INT8)
sh samples/bevformer/base/trt_evaluate_int8.sh -d ${gpu_id}
# Evaluate with the TensorRT engine (FP16/INT8)
sh samples/bevformer/base/trt_evaluate_int8_fp16.sh -d ${gpu_id}

# Quantization-aware training
# gpu_ids defaults to 0,1,2,3,4,5,6,7
sh samples/bevformer/base/quant_aware_train.sh -d ${gpu_ids}
# Then follow the post-training quantization steps above
Evaluate with TensorRT and Custom Plugins

# nv_half
# Convert .pth to .onnx
sh samples/bevformer/plugin/base/pth2onnx.sh -d ${gpu_id}
# Convert .onnx to a TensorRT engine (FP32)
sh samples/bevformer/plugin/base/onnx2trt.sh -d ${gpu_id}
# Convert .onnx to a TensorRT engine (FP16-nv_half)
sh samples/bevformer/plugin/base/onnx2trt_fp16.sh -d ${gpu_id}
# Evaluate with the TensorRT engine (FP32)
sh samples/bevformer/plugin/base/trt_evaluate.sh -d ${gpu_id}
# Evaluate with the TensorRT engine (FP16-nv_half)
sh samples/bevformer/plugin/base/trt_evaluate_fp16.sh -d ${gpu_id}

# nv_half2
# Convert .pth to .onnx
sh samples/bevformer/plugin/base/pth2onnx_2.sh -d ${gpu_id}
# Convert .onnx to a TensorRT engine (FP16-nv_half2)
sh samples/bevformer/plugin/base/onnx2trt_fp16_2.sh -d ${gpu_id}
# Evaluate with the TensorRT engine (FP16-nv_half2)
sh samples/bevformer/plugin/base/trt_evaluate_fp16_2.sh -d ${gpu_id}

# Quantization
# nv_half
# Calibrate and convert .onnx to a TensorRT engine (FP32/INT8)
sh samples/bevformer/plugin/base/onnx2trt_int8.sh -d ${gpu_id}
# Calibrate and convert .onnx to a TensorRT engine (FP16-nv_half/INT8)
sh samples/bevformer/plugin/base/onnx2trt_int8_fp16.sh -d ${gpu_id}
# Evaluate with the TensorRT engine (FP32/INT8)
sh samples/bevformer/plugin/base/trt_evaluate_int8.sh -d ${gpu_id}
# Evaluate with the TensorRT engine (FP16-nv_half/INT8)
sh samples/bevformer/plugin/base/trt_evaluate_int8_fp16.sh -d ${gpu_id}

# nv_half2
# Calibrate and convert .onnx to a TensorRT engine (FP16-nv_half2/INT8)
sh samples/bevformer/plugin/base/onnx2trt_int8_fp16_2.sh -d ${gpu_id}
# Evaluate with the TensorRT engine (FP16-nv_half2/INT8)
sh samples/bevformer/plugin/base/trt_evaluate_int8_fp16_2.sh -d ${gpu_id}
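
The script names above follow a regular pattern: an optional `_int8` and `_fp16` suffix for the precision, plus a trailing `_2` for the nv_half2 variants. The Python sketch below builds the three-step sequence for one configuration from that pattern. The function `plugin_scripts` is hypothetical, and using a `model` value other than `base` assumes the tiny and small models use the same directory layout, which this section does not confirm.

```python
def plugin_scripts(model="base", int8=False, fp16=False, half2=False):
    """List the custom-plugin sample scripts, in run order, for one configuration.

    half2 selects the "_2" (nv_half2) variants, which the commands above
    only show for FP16 configurations.
    """
    if half2 and not fp16:
        raise ValueError("nv_half2 scripts are only listed for FP16 configurations")
    root = f"samples/bevformer/plugin/{model}"
    suffix = ("_int8" if int8 else "") + ("_fp16" if fp16 else "")
    tail = "_2" if half2 else ""
    return [
        f"{root}/pth2onnx{tail}.sh",              # export .pth to .onnx
        f"{root}/onnx2trt{suffix}{tail}.sh",      # build (and calibrate) the engine
        f"{root}/trt_evaluate{suffix}{tail}.sh",  # evaluate the engine
    ]


# Print the commands for FP16-nv_half2/INT8 on GPU 0 without running them.
for script in plugin_scripts(int8=True, fp16=True, half2=True):
    print(f"sh {script} -d 0")
```

Printing rather than executing keeps the sketch a dry run; the printed lines can be compared against the commands listed above before being run by hand.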