# The Official Repository of LL3DA
## 🏃 Introducing LL3DA
LL3DA is a Large Language 3D Assistant that can respond to both visual and textual interactions within complex 3D environments.

Recent advances in Large Multimodal Models (LMMs) have enabled a wide range of human-machine interaction applications. However, developing LMMs that can understand, reason, and plan in complex and diverse 3D environments remains a challenging topic, especially given the need to understand permutation-invariant point cloud representations of 3D scenes. Existing works seek help from multi-view images and project 2D features into 3D space as the 3D scene representation, which, however, leads to huge computational overhead and performance degradation. In this paper, we present LL3DA, a Large Language 3D Assistant that takes point clouds as the direct input and responds to both textual instructions and visual prompts. This helps LMMs better comprehend human interactions and further removes ambiguities in cluttered 3D scenes. Experiments show that LL3DA achieves remarkable results and surpasses various 3D vision-language models on both 3D Dense Captioning and 3D Question Answering.
## 🚩 News
- 2024-03-04. 💥 The code is fully released! You can now train your customized models!
- 2024-02-27. 🎉 LL3DA is accepted by CVPR 2024! See you in Seattle!
- 2023-11-30. 📣 Upload the paper and initialize the project.
TODO:
- Upload our paper to arXiv and build the project page.
- Pray for acceptance.
- Upload all the code and training scripts.
- Release the pre-trained weights. (see checkpoints)
- Add a local demo interface.
- Train and scale up the model on larger-scale 3D VL benchmarks.
## ⚡ Quick Start
<details>
<summary><b>Environment Setup</b></summary>
**Step 1. Build Dependencies.** Our code is tested with CUDA 11.6 and Python 3.8.16. To run the code, you should first install the following packages:
```
h5py
scipy
cython
plyfile
'trimesh>=2.35.39,<2.35.40'
'networkx>=2.2,<2.3'
'torch==1.13.1+cu116'
'transformers>=4.37.0'
```
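For reference, a minimal installation sketch; the cu116 wheel index assumes a CUDA 11.6 setup and should be adapted to your own CUDA version:

```{bash}
# Install the listed dependencies; the PyTorch extra index URL assumes CUDA 11.6.
pip install h5py scipy cython plyfile \
    'trimesh>=2.35.39,<2.35.40' 'networkx>=2.2,<2.3' 'transformers>=4.37.0'
pip install 'torch==1.13.1+cu116' --extra-index-url https://download.pytorch.org/whl/cu116
```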
Then, build `pointnet2` and the accelerated `giou` from source:
```{bash}
cd third_party/pointnet2
python setup.py install
```

```{bash}
cd utils
python cython_compile.py build_ext --inplace
```
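Optionally, a quick check that the extension is importable; the top-level module name `pointnet2` is an assumption based on the build directory above, so adjust if the installed name differs:

```{bash}
# Should print without ImportError if the extension installed correctly.
# The module name 'pointnet2' is assumed from the build directory name.
python -c "import pointnet2; print('pointnet2 installed')"
```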
**Step 2. Download pre-trained embeddings.** Download the pre-processed BERT embedding weights from huggingface and store them under the `./bert-base-embedding` folder. The weights are the same as the official BERT model; we only modified the names of certain parameters.
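One hypothetical way to fetch the weights from the command line; `<EMBEDDING_REPO_ID>` is a placeholder for the huggingface repo linked above, not an actual repo id:

```{bash}
# Replace <EMBEDDING_REPO_ID> with the huggingface repo id of the embedding weights.
huggingface-cli download <EMBEDDING_REPO_ID> --local-dir ./bert-base-embedding
```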
</details>

<details>
<summary><b>Data Preparation</b></summary>
Our repo requires the 3D data from ScanNet, the natural language annotations, and the pre-trained LLM weights.
**Step 1. Download and Prepare the ScanNet 3D Data.**
**<font color="#dd0000">Updates 2024-07-01:</font>** You can download the pre-processed data from here.
- Follow the instructions here to download the ScanNetV2 dataset.
- Change `SCANNET_DIR` in `data/scannet/batch_load_scannet_data.py` to your scans folder, and run the following commands (a scripted way to make this edit is sketched after the commands).
```{bash}
cd data/scannet/
python batch_load_scannet_data.py
```
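If you prefer not to edit the file by hand, a minimal sketch of the change, run from the repo root before the commands above; it assumes `SCANNET_DIR` is assigned as a plain top-level string in `batch_load_scannet_data.py`, so check the file first:

```{bash}
# Rewrite the SCANNET_DIR assignment to point at your local ScanNet scans folder.
# The assignment format inside the script is an assumption; verify it before running.
sed -i "s|^SCANNET_DIR = .*|SCANNET_DIR = '/path/to/scannet/scans'|" data/scannet/batch_load_scannet_data.py
```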
**Step 2. Prepare Language Annotations.**
To train the model, you are required to prepare language annotations from `ScanRefer`, `Nr3D`, `ScanQA`, and the ScanNet part of `3D-LLM`.
- **ScanRefer.** Follow the commands here to download the `ScanRefer` dataset.
- **Nr3D.** Follow the commands here to download the `Nr3D` dataset, and pre-process it.
- **ScanQA.** Follow the commands here to download the `ScanQA` dataset.
- **3D-LLM.** The data are located here. We have also shared our pre-processing script here.

We will update with the latest released data (V3) from 3D-LLM.
Finally, organize the files into the following folders:
```
./data/
  ScanRefer/
    ScanRefer_filtered_train.json
    ScanRefer_filtered_train.txt
    ScanRefer_filtered_val.json
    ScanRefer_filtered_val.txt
  Nr3D/
    nr3d_train.json
    nr3d_train.txt
    nr3d_val.json
    nr3d_val.txt
  ScanQA/
    ScanQA_v1.0_test_w_obj.json
    ScanQA_v1.0_test_wo_obj.json
    ScanQA_v1.0_train.json
    ScanQA_v1.0_val.json
  3D_LLM/
    3d_llm_embodied_dialogue_filtered_train.json
    3d_llm_embodied_dialogue_filtered_val.json
    3d_llm_embodied_planning_filtered_train.json
    3d_llm_embodied_planning_filtered_val.json
    3d_llm_scene_description_train.json
    3d_llm_scene_description_val.json
```
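A quick sanity check that the annotations landed where the code expects them, using one representative file per source from the layout above:

```{bash}
# Report any missing annotation file from the expected layout.
for f in data/ScanRefer/ScanRefer_filtered_train.json \
         data/Nr3D/nr3d_train.json \
         data/ScanQA/ScanQA_v1.0_train.json \
         data/3D_LLM/3d_llm_scene_description_train.json; do
  [ -f "$f" ] && echo "OK      $f" || echo "MISSING $f"
done
```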
**Step 3. \[Optional\] Download Pre-trained LLM weights.** If your server has no trouble auto-downloading weights from huggingface🤗, feel free to skip this step.
Download files from the `opt-1.3b` checkpoint (or any other decoder-only LLM) at [huggingface](https://huggingface.co/facebook/opt-1.3b/tree/main), and store them under the `./facebook/opt-1.3b` directory. Make sure the required files are downloaded:
```
./facebook/opt-1.3b/
  config.json
  merges.txt
  pytorch_model.bin
  special_tokens_map.json
  tokenizer_config.json
  vocab.json
```
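One way to fetch the whole checkpoint into the expected directory, assuming `git-lfs` is available:

```{bash}
# Clone the opt-1.3b checkpoint from huggingface (requires git-lfs for the weights).
git lfs install
git clone https://huggingface.co/facebook/opt-1.3b ./facebook/opt-1.3b
```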
</details>
## 💻 Train your own models
**<font color="#dd0000">Updates 2024-07-01:</font>** The released version is slightly different from our paper implementation. In our released version, we *standardized the data format* and *dropped duplicated text annotations*. To reproduce our reported results, please use the scripts provided in `scripts-v0` to produce the generalist weights.
```{bash}
bash scripts-v0/opt-1.3b/train.generalist.sh
```
Our code should support **any decoder-only LLMs** (`facebook/opt-1.3b`, `gpt2-xl`, `meta-llama/Llama-2-7b`, or even the **<font color="#dd0000">LATEST</font>** `Qwen/Qwen1.5-1.8B` and `Qwen/Qwen1.5-4B`). Check out the following table for recommended LLMs at different scales! **By default, the models are trained with eight GPUs.**
| <1B | 1B-4B | ~7B |
|:-------------------------:|:-------------------------:|:--------------------------------:|
| `gpt2`(124m) | `TinyLlama-1.1B`(1.1b) | `facebook/opt-6.7b`(6.7b) |
| `facebook/opt-125m`(125m) | `facebook/opt-1.3b`(1.3b) | `meta-llama/Llama-2-7b-hf`(6.7b) |
| `gpt2-medium`(355m) | `gpt2-xl`(1.6b) | `Qwen/Qwen1.5-7B`(7.7b) |
| `Qwen/Qwen1.5-0.5B`(620m) | `Qwen/Qwen1.5-1.8B`(1.8b) | - |
| `gpt2-large`(774m) | `facebook/opt-2.7b`(2.7b) | - |
| - | `microsoft/phi-2`(2.8b) | - |
| - | `Qwen/Qwen1.5-4B`(3.9b) | - |
We provide training scripts in the `scripts` folder with different LLM backends. Feel free to modify the hyperparameters in those commands.

For other LLM backends, please modify the commands manually by changing `--vocab` to the desired LLM.
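For example, a hedged way to derive a script for another backend; it assumes the provided script passes `--vocab facebook/opt-1.3b` verbatim, so check the script first:

```{bash}
# Create a Qwen variant of the generalist training script by swapping the --vocab value.
sed 's|--vocab facebook/opt-1.3b|--vocab Qwen/Qwen1.5-1.8B|' \
    scripts/opt-1.3b/train.generalist.sh > train.generalist.qwen.sh
bash train.generalist.qwen.sh
```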
<details>
<summary><b>Training</b></summary>
To train the model as a 3D generalist: (We have also uploaded the pre-trained weights to [huggingface](https://huggingface.co/CH3COOK/LL3DA-weight-release/blob/main/ll3da-opt-1.3b.pth).)
```{bash}
bash scripts/opt-1.3b/train.generalist.sh
```
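If you would rather start from the released generalist weights linked above, they can be fetched directly; the `resolve` URL below is the download form of that huggingface link:

```{bash}
# Download the released ll3da-opt-1.3b generalist checkpoint.
wget https://huggingface.co/CH3COOK/LL3DA-weight-release/resolve/main/ll3da-opt-1.3b.pth
```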
After the model is trained, you can tune the model on ScanQA for 3D Question Answering:
```{bash}
bash scripts/opt-1.3b/tuning.scanqa.sh
```
And, on ScanRefer / Nr3D for 3D Dense Captioning:
```{bash}
bash scripts/opt-1.3b/tuning.scanrefer.sh
bash scripts/opt-1.3b/tuning.nr3d.sh
```
You can also tune the model to predict bounding boxes for open vocabulary object detection!
```{bash}
bash scripts/opt-1.3b/tuning.ovdet.sh
```
</details>

<details>
<summary><b>Evaluation</b></summary>
To evaluate the model as a 3D generalist:
```{bash}
bash scripts/opt-1.3b/eval.generalist.sh
```
On ScanQA for 3D Question Answering:
```{bash}
bash scripts/opt-1.3b/eval.scanqa.sh
```
And, on ScanRefer / Nr3D for 3D Dense Captioning:
```{bash}
bash scripts/opt-1.3b/eval.scanrefer.sh
bash scripts/opt-1.3b/eval.nr3d.sh
```
</details>

## 📖 Citation
If you find our code or paper helpful, please consider starring ⭐ us and citing:
```
@misc{chen2023ll3da,
      title={LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning},
      author={Sijin Chen and Xin Chen and Chi Zhang and Mingsheng Li and Gang Yu and Hao Fei and Hongyuan Zhu and Jiayuan Fan and Tao Chen},
      year={2023},
      eprint={2311.18651},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```
## Acknowledgments

Thanks to Vote2Cap-DETR, 3D-LLM, Scan2Cap, and 3DETR. We borrow some of their code and data.
## License

This code is distributed under an MIT LICENSE. If there are any problems regarding our paper and code, feel free to open an issue!