⚡Features:
- Added a ChatGPT fine-tuning implementation; users with API quota are encouraged to run fine-tuning experiments on ChatGPT;
- Supports deploying fine-tuned models with ChatGPT-Next-Web;
- Supports deploying fine-tuned models with Gradio;
- Supports training the full LLaMA and LLaMA-2 model series;
- Supports LoRA and QLoRA, as well as subsequent PPO and DPO reinforcement-learning training;
- Supports question answering that combines the model with a knowledge base;
- Open-sourced triage and guidance materials from more than 60 hospital departments;
- Developed a tool that distills medical data with GPT-4/ChatGPT and can batch-generate data for building knowledge bases and for fine-tuning;
- Aggregates rich resources on open-source medical LLMs, medical data for LLM training, LLM deployment, LLM evaluation, and related LLM material;
- We took part in the CMB leaderboard evaluation of medical LLMs with IvyGPT, which outperformed ChatGPT and a number of open-source medical LLMs;
- We have trained and open-sourced several medical LLMs on different base LLMs using our own datasets, which you can download and try directly;
🎁Datasets
Pre-training data
- LLM-Pretrain-FineTune/data_pretrain
- MedicalGPT/pretrain
- zysj
- TCM-Ancient-Books (nearly 700 ancient TCM texts)
- epfl-llm/guidelines
Supervised fine-tuning data
- icliniq-10k(en)
- HealthCareMagic-100k(en)
- ShenNong_TCM_Dataset
- ✅ChatMed_Consult_Dataset
- Chinese-medical-dialogue-data
- cMedQA2
- ✅Huatuo-26M
- webMedQA
- PubMedQA
- CMCQA
- ✅QiZhenGPT
- ✅LLM-Pretrain-FineTune/data_sft
- Medical-Dialogue-System
- IMCS-V2
- CHIP-MDCFNPC
- MedDG
- ✅HuatuoGPT-sft-data-v1
- MedicalGPT/finetune
- ✅shibing624/medical
- medAlpaca/data
- ✅Zhongjing/sft
- medical_dialog
- huatuo_encyclopedia_qa
- Med-ChatGLM/data
- CMB
- GenMedGPT-5k(en)
- Alpaca-CoT(general)
- ✅DISC-Med-SFT
- ✅HuatuoGPT2_sft_instruct
- FreedomIntelligence/Medbase_data
- openmedlab/Awesome-Medical-Dataset
Reward training data
🗜️Full-Pipeline Training
1. Install Dependencies
conda create -n llm python=3.11
conda activate llm
python -m pip install -r requirements.txt
# Convert to HF format
python -m transformers.models.llama.convert_llama_weights_to_hf \
--input_dir path_to_llama_weights --model_size 7B --output_dir path_to_llama_model
- LLaMA-2 model download: https://huggingface.co/meta-llama
2. Data Configuration
Dataset configuration and PT / SFT / RW data formats
dataset_info
If you use a custom dataset, be sure to define it in the dataset_info.json file in the following format.
"数据集名称": {
"hf_hub_url": "HuggingFace上的项目地址(若指定,则忽略下列三个参数)",
"script_url": "包含数据加载脚本的本地文件夹名称(若指定,则忽略下列两个参数)",
"file_name": "该目录下数据集文件的名称(若上述参数未指定,则此项必需)",
"file_sha1": "数据集文件的SHA-1哈希值(可选)",
"columns": {
"prompt": "数据集代表提示词的表头名称(默认:instruction)",
"query": "数据集代表请求的表头名称(默认:input)",
"response": "数据集代表回答的表头名称(默认:output)",
"history": "数据集代表历史对话的表头名称(默认:None)"
}
}
The prompt and response columns must be non-empty strings. The content of the query column is concatenated with the prompt column to form the model input. The history column should be a list in which each element is a pair of strings representing, respectively, a user request and the model's reply.
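For example, here is a minimal sketch of registering a hypothetical local file my_medical_sft.json; the dataset name, the file name, and the assumption that dataset_info.json lives under data/ are placeholders for illustration:

```python
import hashlib
import json
from pathlib import Path

DATA_DIR = Path("data")                  # assumed location of dataset_info.json
FILE_NAME = "my_medical_sft.json"        # hypothetical local SFT file inside data/

entry = {
    "file_name": FILE_NAME,
    # Optional integrity check computed from the local file.
    "file_sha1": hashlib.sha1((DATA_DIR / FILE_NAME).read_bytes()).hexdigest(),
    "columns": {
        "prompt": "instruction",
        "query": "input",
        "response": "output",
        "history": "history",
    },
}

info_path = DATA_DIR / "dataset_info.json"
dataset_info = json.loads(info_path.read_text(encoding="utf-8"))
dataset_info["my_medical_sft"] = entry   # this key is the name passed to --dataset
info_path.write_text(json.dumps(dataset_info, ensure_ascii=False, indent=2), encoding="utf-8")
```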
PT example data
A .txt file, one unsupervised sample per line.
Machine learning (ML) is a field devoted to understanding and building methods that let machines "learn" – that is, methods that leverage data to improve computer performance on some set of tasks.
Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, agriculture, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.
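If your raw corpus is a folder of documents, a minimal sketch for flattening it into this one-sample-per-line layout might look as follows (the directory and file names are placeholders):

```python
from pathlib import Path

# Hypothetical folder of raw medical documents; replace with your own corpus.
docs = sorted(Path("raw_corpus").glob("*.txt"))

with open("pt_corpus.txt", "w", encoding="utf-8") as out:
    for path in docs:
        text = path.read_text(encoding="utf-8")
        # Collapse internal whitespace so each document occupies exactly one line.
        out.write(" ".join(text.split()) + "\n")
```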
SFT example data 1
[
{
"instruction": "听起来很不错。人工智能可能在哪些方面面临挑战呢?",
"input": "",
"output": "人工智能面临的挑战包括数据隐私、安全和道德方面的问题,以及影响就业机会的自动化等问题。",
"history": [
["你好,你能帮我解答一个问题吗?", "当然,请问有什么问题?"],
["我想了解人工智能的未来发展方向,你有什么想法吗?", "人工智能在未来的发展方向可能包括更强大的机器学习算法,更先进的自然语言处理技术,以及更加智能的机器人。"]
]
}
]
SFT example data 2
[
{
"instruction": "听起来很不错。人工智能可能在哪些方面面临挑战呢?",
"input": "",
"output": "人工智能面临的挑战包括数据隐私、安全和道德方面的问题,以及影响就业机会的自动化等问题。",
"history": []
}
]
RW example data
[
{
"instruction": "生成三个与“道歉”意思相同的动词",
"input": "",
"output": [
"承认,表示遗憾,弥补。",
"道歉"
]
}
]
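In this reward-modeling format, the output list holds two candidate answers for the same instruction; as in the example above, the preferred answer comes first and the weaker one second. If your preference data comes as chosen/rejected pairs, a minimal conversion sketch (the field names and file names below are placeholders) could be:

```python
import json

# Hypothetical preference records; replace with your own chosen/rejected pairs.
pairs = [
    {
        "instruction": "患者主诉头痛三天,应如何进一步问诊?",
        "chosen": "应询问头痛的部位、性质、持续时间、诱因以及伴随症状,例如发热、呕吐或视物模糊。",
        "rejected": "多喝水就可以了。",
    },
]

records = [
    {
        "instruction": p["instruction"],
        "input": "",
        # Preferred answer first, rejected answer second.
        "output": [p["chosen"], p["rejected"]],
    }
    for p in pairs
]

with open("rw_data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```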
3. Training Configuration
Training parameters and commands
Distributed setup
Check whether your GPUs are connected via NVLink; only NVLink-connected GPUs can make effective use of accelerate for parallel training acceleration.
nvidia-smi topo -m
accelerate config # configure the environment
accelerate launch src/train_bash.py # pass in the arguments (same as above)
Supervised fine-tuning
# LLaMA-2: same as the LLaMA command below, except swap in
# --model_name_or_path ./Llama-2-7b-chat-hf, --template llama2 and --output_dir output
# LLaMA
accelerate launch src/train_bash.py \
--stage sft \
--model_name_or_path ./Llama-7b-hf \
--do_train \
--dataset mm,hm \
--finetuning_type lora \
--overwrite_cache \
--output_dir output-1 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 2000 \
--learning_rate 5e-5 \
--num_train_epochs 2.0 \
--plot_loss \
--fp16 \
--template default \
--lora_target q_proj,v_proj
Reinforcement learning
# LLaMA-2, DPO
accelerate launch src/train_bash.py \
--stage dpo \
--model_name_or_path ./Llama-2-7b-chat-hf \
--do_train \
--dataset rlhf \
--template llama2 \
--finetuning_type lora \
--quantization_bit 4 \
--lora_target q_proj,v_proj \
--resume_lora_training False \
--checkpoint_dir ./output-2 \
--output_dir output-dpo \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 1000 \
--learning_rate 1e-5 \
--num_train_epochs 1.0 \
--plot_loss \
--fp16
4. Inference Configuration
Inference parameters and commands
Web access
# LLaMA-2
python src/web_demo.py \
--model_name_or_path ./Llama-2-7b-chat-hf \
--checkpoint_dir output \
--finetuning_type lora \
--template llama2
# LLaMA
python src/web_demo.py \
--model_name_or_path ./Llama-7b-hf \
--checkpoint_dir output-1 \
--finetuning_type lora \
--template default
# DPO
python src/web_demo.py \
--model_name_or_path ./Llama-2-7b-chat-hf \
--checkpoint_dir output-dpo \
--finetuning_type lora \
--template llama2
API access
# LLaMA-2
python src/api_demo.py \
--model_name_or_path ./Llama-2-7b-chat-hf \
--checkpoint_dir output \
--finetuning_type lora \
--template llama2
# LLaMA
python src/api_demo.py \
--model_name_or_path ./Llama-7b-hf \
--checkpoint_dir output-1 \
--finetuning_type lora \
--template default
# DPO
python src/api_demo.py \
--model_name_or_path ./Llama-2-7b-chat-hf \
--checkpoint_dir output-dpo \
--finetuning_type lora \
--template llama2
Test the API:
curl -X 'POST' \
'http://127.0.0.1:8888/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "string",
"messages": [
{
"role": "user",
"content": "你好"
}
],
"temperature": 0,
"top_p": 0,
"max_new_tokens": 0,
"stream": false
}'
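The same request can be issued from Python; below is a minimal sketch with requests, assuming the service exposes the OpenAI-style chat-completions schema shown above (the host, port, and sampling values are just examples):

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8888/v1/chat/completions",
    json={
        "model": "string",
        "messages": [{"role": "user", "content": "你好"}],
        "temperature": 0.7,
        "top_p": 0.9,
        "max_new_tokens": 512,
        "stream": False,
    },
    timeout=60,
)
resp.raise_for_status()
# Assumes an OpenAI-compatible response body.
print(resp.json()["choices"][0]["message"]["content"])
```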
CLI access
# LLaMA-2
python src/cli_demo.py \
--model_name_or_path ./Llama-2-7b-chat-hf \
--checkpoint_dir output \
--finetuning_type lora \
--template llama2
# LLaMA
python src/cli_demo.py \
--model_name_or_path ./Llama-7b-hf \
--checkpoint_dir output-1 \
--finetuning_type lora \
--template default
# DPO
python src/cli_demo.py \
--model_name_or_path ./Llama-2-7b-chat-hf \
--checkpoint_dir output-dpo \
--finetuning_type lora \
--template llama2
Batch prediction
# LLaMA-2
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
--stage sft \
--model_name_or_path ./Llama-2-7b-chat-hf \
--do_predict \
--dataset mm \
--template llama2 \
--finetuning_type lora \
--checkpoint_dir output \
--output_dir predict_output \
--per_device_eval_batch_size 8 \
--max_samples 100 \
--predict_with_generate
# LLaMA
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
--stage sft \
--model_name_or_path ./Llama-7b-hf \
--do_predict \
--dataset mm \
--template default \
--finetuning_type lora \
--checkpoint_dir output-1 \
--output_dir predict_output \
--per_device_eval_batch_size 8 \
--max_samples 100 \
--predict_with_generate
Evaluation (BLEU and ROUGE_CHINESE)
# LLaMA-2
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
--stage sft \
--model_name_or_path ./Llama-2-7b-chat-hf \
--do_eval \
--dataset mm \
--template llama2 \
--finetuning_type lora \
--checkpoint_dir output \
--output_dir eval_output \
--per_device_eval_batch_size 8 \
--max_samples 100 \
--predict_with_generate
# LLaMA
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
--stage sft \
--model_name_or_path ./Llama-7b-hf \
--do_eval \
--dataset mm \
--template default \
--finetuning_type lora \
--checkpoint_dir output-1 \
--output_dir eval_output \
--per_device_eval_batch_size 8 \
--max_samples 100 \
--predict_with_generate
For 4/8-bit evaluation, it is recommended to use --per_device_eval_batch_size=1 and --max_target_length 128.
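To recompute these metrics offline, a minimal sketch is shown below. It assumes the evaluation run wrote a JSONL file with predict and label fields into eval_output/ (the file name and field names are assumptions; adjust them to your actual output) and uses jieba, rouge-chinese, and NLTK:

```python
import json

import jieba
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_chinese import Rouge

rouge = Rouge()
smooth = SmoothingFunction().method3
bleu, rouge_l = [], []

# Hypothetical predictions file produced by the evaluation run above.
with open("eval_output/generated_predictions.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        hyp = list(jieba.cut(rec["predict"]))
        ref = list(jieba.cut(rec["label"]))
        bleu.append(sentence_bleu([ref], hyp, smoothing_function=smooth))
        rouge_l.append(rouge.get_scores(" ".join(hyp), " ".join(ref))[0]["rouge-l"]["f"])

print(f"BLEU-4:  {sum(bleu) / len(bleu):.4f}")
print(f"ROUGE-L: {sum(rouge_l) / len(rouge_l):.4f}")
```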
5. Gradio Deployment
Gradio deployment commands
Model export
# LLaMA-2
python src/export_model.py \
--model_name_or_path ./Llama-2-7b-chat-hf \
--template llama2 \
--finetuning_type lora \
--checkpoint_dir output \
--output_dir output_export
# LLaMA
python src/export_model.py \
--model_name_or_path ./Llama-7b-hf \
--template default \
--finetuning_type lora \
--checkpoint_dir output-1 \
--output_dir output_export
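Before launching the Gradio app, you can optionally sanity-check the exported weights with plain transformers. A minimal sketch follows; it does not apply the chat template, so treat the output only as a rough smoke test, and the generation settings are arbitrary:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "output_export"  # directory produced by src/export_model.py above
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto")

prompt = "最近总是失眠,应该怎么办?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Print only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```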
Run
cd Gradio
python app.py
6. ChatGPT-Next-Web Deployment
Next deployment commands
Start the API service
# LLaMA-2
python src/api_demo.py \
--model_name_or_path ./Llama-2-7b-chat-hf \
--checkpoint_dir output \
--finetuning_type lora \
--template llama2
# LLaMA
python src/api_demo.py \
--model_name_or_path ./Llama-7b-hf \
--checkpoint_dir output-1 \
--finetuning_type lora \
--template default
Download and run Next
- Modify the configuration: install and open ChatGPT-Next-Web, open Settings, and change the API endpoint to http://127.0.0.1:8000/ (i.e., your API address); you can then start chatting with the deployed model.
💡 Tips & Experience
1. The CareGPT models did not extend or retrain the tokenizer for Chinese segmentation, but the results are still promising;
2. The full LLM training pipeline includes pre-training, supervised fine-tuning, reward modeling, and reinforcement learning; in most cases supervised fine-tuning alone already meets the need;
3. If computing power is sufficient, it is recommended to train on both medical data and general corpus data, so the model gains medical knowledge while keeping its general abilities (such as instruction following);
4. Do not expect a single medical LLM to meet every need; a more reasonable setup may be a continuously updated knowledge base plus a fine-tuned medical LLM (e.g., ChatLaw);
5. The BLOOMZ model series was trained on the PILE corpus, which includes a variety of medical text such as PubMed Central and PubMed Abstracts; these valuable texts greatly enrich BLOOMZ's medical knowledge, which is why many open-source projects prefer BLOOMZ as the base model for medical fine-tuning;
6. (2023.08.26) ChatGPT was trained on top of a code-pretrained GPT, so would fine-tuning downstream tasks on CodeLLaMA give better results than fine-tuning on LLaMA-1/2?
7. Our recent work, together with many recently published studies, shows that in the LLM era data quality > quantity, as in: Less is More! SJTU Qiyuan && Caspian | fine-tuning with only 200 samples outperforms MiniGPT-4; super-large-scale SFT data may weaken or even destroy the ICL, CoT, and other abilities of LLMs on downstream tasks;
8. For vertical-domain models, perhaps we should focus more on the PT stage rather than collecting millions of SFT samples; our suggestion is large-scale pre-training + small-scale supervised fine-tuning = a very strong LLM;
9. A well pre-trained medical LLM has not yet been released by the open-source community, and we look forward to someone filling this gap;
10. Pre-training injects knowledge, while supervised fine-tuning only activates domain abilities (without adding knowledge)? Should the knowledge in pre-training and in supervised fine-tuning echo each other? Would the knowledge gained from continued pre-training on tens of GB of corpus be drowned out by the knowledge of a model pre-trained on trillions of tokens?
11. Large-scale secondary pre-training must be mixed with other types of data: (1) once language-model training is complete, the role of each parameter region is largely fixed, and injecting a large amount of knowledge absent from pre-training can cause large parameter shifts and a loss of overall language ability; (2) large-scale secondary pre-training requires mixing in roughly 5-10 times the original pre-training data and training on the combined mixture;
12. The instruction fine-tuning stage should not run for too many epochs: (1) training many epochs on a small dataset may alter key language regions and break the model; (2) for instruction fine-tuning aimed at improving a specific task, general instruction data or pre-training data should be mixed in so the regions critical to the model's language ability are not shifted too much;
13. Training data must be strictly screened for noise: (1) even a small amount of contiguous noise in the pre-training data, such as repeated words or non-word sequences, can shift specific dimensions and cause large fluctuations in the model's overall PPL; (2) if supervised fine-tuning contains many instruction segments that do not match the original LLM, this can likewise shift specific dimensions and sharply degrade overall performance;
14. Fine-tuning a large model on mixed data for multiple abilities shows high resource conflict and low resource gain, so mixing different data for fine-tuning requires some engineering skill;
15. In general, there is a non-negligible gap between LoRA and full fine-tuning (e.g., LoRA can be 4-6% worse than full fine-tuning);
16. For 7B-scale models, full-parameter fine-tuning is recommended first; models with 13B or more parameters can use methods such as LoRA and QLoRA;
17. Even after quantization, large-parameter models retain their abilities well;
18. Although LLM training (or any model trained on GPUs) is inherently somewhat random, results across multiple runs are still very consistent;
19. If GPU memory is limited, QLoRA offers a cost-effective compromise: it saves 33% memory at the cost of a 39% increase in runtime;
20. When fine-tuning LLMs, the choice of optimizer is not a major factor; whether AdamW, SGD with a scheduler, or AdamW with a scheduler, the impact on the result is minimal;
21. Although Adam is often considered memory-hungry because it introduces two extra parameters for each model parameter, this does not significantly affect the peak memory demand of an LLM, because most memory is allocated to large matrix multiplications rather than to the extra optimizer states;
22. For a static dataset, iterating over it for many epochs rarely helps; it usually leads to overfitting and degrades the result;
23. If you use LoRA, apply it across all layers, not only to the Key and Value matrices, to maximize model performance;
24. Choosing the LoRA rank and a suitable α value is crucial; a small tip: try setting α to twice the rank (see the sketch after this list);
25. A single GPU with 14GB of memory can fine-tune a model of up to 7 billion parameters within a few hours; however, with a static dataset, turning an LLM into an "all-rounder" that excels on every baseline task is nearly impossible, and solving this requires diverse data sources or techniques beyond LoRA;
26. According to NeurIPS workshop recommendations (as of December 18, 2023), the recommended fine-tuning bases are Mistral-7B for English models below 10B, Yi-6B for Chinese models below 10B, and Qwen-14B and Yi-34B for models above 10B.
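Items 23 and 24 can be written directly as a configuration for the peft library; below is a minimal sketch for a LLaMA-style architecture (the module names assume the LLaMA family, and α is set to twice the rank as suggested above):

```python
from peft import LoraConfig

rank = 16
lora_config = LoraConfig(
    r=rank,
    lora_alpha=rank * 2,  # tip: alpha ≈ 2 × rank
    lora_dropout=0.05,
    # Cover all linear projections, not only the query/value matrices.
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```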
[!IMPORTANT] You are welcome to contribute new experience via Issues!
The methodology for items 11-13 comes from: Changing Only 1 Weight Makes a 130-Billion-Parameter Large Language Model Completely Lose Its Language Ability! Latest research from Fudan University's Natural Language Processing Laboratory.
The methodology for item 14 comes from: How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition.
The methodology for items 17-25 comes from: a Chinese interpretation of LLM optimization with Low-Rank Adaptation (LoRA).
🧰 Model Open Source
Stage | Weight Description | Download Link | Features | Base Model | Fine-tuning Method | Dataset |
---|---|---|---|---|---|---|
🌟Supervised Fine-tuning | Multi-turn dialogue data trained based on LLaMA2-7b-Chat | ⚙️CareLlama2-7b-chat-sft-multi、🧰CareLlama2-7b-multi | Excellent multi-turn dialogue capability | LLaMA2-7b-Chat | QLoRA | mm |
Supervised Fine-tuning | Rich and efficient doctor-patient dialogue data trained based on LLaMA2-7b-Chat | ⚙️CareLlama2-7b-chat-sft-med | Excellent patient disease diagnosis capability | LLaMA2-7b-Chat | QLoRA | hm |
Supervised Fine-tuning | Mixed data trained based on LLaMA-7b | ⚙️CareLlama1-7b-merge | Better medical dialogue capability | LLaMA-7b | LoRA | mm,hm |
Supervised Fine-tuning | Mixed data trained based on LLaMA2-7b-Chat | ⚙️CareLlama2-7b-merge、🧰CareLlama2-7b-merge-mix | Better medical dialogue capability | LLaMA2-7b-Chat | QLoRA | mm,hm |
DPO | | ⚙️CareLlama2-7b-merge-dpo | | | | rlhf |
Supervised Fine-tuning | More mixed data trained based on LLaMA2-7b-Chat | ⚙️CareLlama2-7b-super、🧰CareLlama2-7b-super-mix | Better medical dialogue capability | LLaMA2-7b-Chat | QLoRA | mm,ls,ks,mc,ms,qz,hm |
Supervised Fine-tuning | Multi-turn dialogue data trained based on Baichuan-13B-Chat | ⚙️Baichuan-13B-Chat-sft-multi | Excellent multi-turn dialogue capability | Baichuan-13B-Chat | QLoRA | mm |
Supervised Fine-tuning | Mixed dialogue data trained based on Baichuan-13B-Chat | ⚙️Baichuan-13B-Chat-sft-merge | Better doctor-patient dialogue capability | Baichuan-13B-Chat | QLoRA | mm,hm |
Supervised Fine-tuning | Mixed dialogue data trained based on Baichuan-13B-Chat | ⚙️Baichuan-13B-Chat-sft-super、🧰Baichuan-13B-Chat-sft-super-mix | Better doctor-patient dialogue capability | Baichuan-13B-Chat | QLoRA | mm,ls,ks,mc,ms,qz,hm |
🌟Supervised Fine-tuning | Multi-turn dialogue data trained based on QWen-7B | 🧰carellm | Excellent multi-turn dialogue capability | QWen-7B | QLoRA | mm |
Supervised Fine-tuning | Multi-turn dialogue data trained based on QWen-14B-Chat | ⚙️careqwen-14B-Chat-sft-multi | Excellent multi-turn dialogue capability | QWen-14B-Chat | QLoRA | mm |
Supervised Fine-tuning | Multi-turn dialogue data trained based on InternLM-20B-Chat | ⚙️careinternlm-20B-Chat-sft-multi、🧰careinternlm-20B-Chat-sft-multi-mix | Excellent multi-turn dialogue capability | InternLM-20B-Chat | QLoRA | mm |
🌟Supervised Fine-tuning | Multi-turn dialogue data trained based on Baichuan2-13B-Chat | ⚙️Baichuan2-13B-Chat-sft-multi、🧰Baichuan2-13B-Chat-sft-multi-mix | Excellent multi-turn dialogue capability | Baichuan2-13B-Chat | QLoRA | mm |
Usage:
- Download the corresponding base model;
- If it is LLaMA, convert it to HF format; if it is LLaMA-2 downloaded in HF format, no conversion is needed;
- Download the weights you want to load;
- Start using our model according to inference configuration;
💯 Model Evaluation
Medical LLM scores on the CMB leaderboard:
Model | Institution | Score |
---|---|---|
ShuKunGPT | ShuKun Technology | 64.44 |
GPT-4 | OpenAI | 58.37 |
Baichuan2-53B | Baichuan Intelligence | 45.69 |
ChatGLM2-6B | Zhipu AI | 44.91 |
Baichuan-13B-chat | Baichuan Intelligence | 41.63 |
IvyGPT (Baichuan2-13B+10W) | Macao Polytechnic University | 38.54 |
ChatGPT | OpenAI | 38.09 |
IvyGPT (Baichuan-13B+10W) | Macao Polytechnic University | 34.60 |
ChatGLM3-6B | Zhipu AI | 33.76 |
HuatuoGPT (BLOOMZ) | The Chinese University of Hong Kong, Shenzhen | 31.38 |
IvyGPT (Qwen-7B+PT-WiNGPT-3.2B+10W) | Macao Polytechnic University | 28.26 |
MedicalGPT | - | 26.45 |
ChatMed-Consult | East China Normal University | 21.71 |
Bentsao | Harbin Institute of Technology | 21.25 |
ChatGLM-Med | Harbin Institute of Technology | 20.67 |
IvyGPT (LLaMA-2-7B+220W) | Macao Polytechnic University | 18.55 |
DoctorGLM | ShanghaiTech University | 7.63 |
BianQue-2 | East China Normal University | 7.26 |
Model | Non-hallucination Rate |
---|---|
ERNIE-Bot | 69.33% |
Baichuan2-53B | 68.22% |
ChatGLM-Pro | 61.33% |
GPT-4-0613 | 53.11% |
QWen-14B-Chat | 46.89% |
Baichuan2-13B-Chat | 42.44% |
Baichuan2-7B-Chat | 40.67% |
GPT3.5-turbo-0613 | 39.33% |
ChatGLM2-6B | 34.89% |
Baichuan2-13B-base | 33.78% |
Baichuan-13B-Chat | 31.33% |
Baichuan-13B-base | 25.33% |
Baichuan2-7B-base | 25.33% |
Baichuan-7B-base | 22.22% |
Source: 2310.03368.pdf
📳Demo
View more demos
🍰Disclaimer
The resources related to this project are for academic research only; commercial use is strictly prohibited. When using parts that involve third-party code, please strictly follow the corresponding open-source licenses. Content generated by the models is affected by factors such as model computation, randomness, and quantization precision loss, and this project cannot guarantee its accuracy. Even if the model output conforms to medical facts, it must not be used as a basis for actual medical diagnosis. This project assumes no legal liability for any content output by the models, nor is it liable for any losses that may arise from using the related resources and output results.
🥂Citation
- CareGPT (formerly CareLlama) is a branch of MPU's medical large language model IvyGPT; its purpose is to explore research on medical data and on the training and deployment of medical LLMs.
- This work was completed by Rongsheng Wang, Ruizhe Zhou, and Haoming Chen, master's students in the Faculty of Applied Sciences at Macao Polytechnic University, supervised by Associate Professor Tao Tan and Associate Professor Yapeng Wang.
- Our work (IvyGPT) has been cited by the CMB paper, which has been submitted to NAACL: https://openreview.net/pdf?id=rHDSaubv25
If you use the models, data, or code from this project, please cite:
@misc{wang2023caregpt,
title={CareGPT: Medical LLM, Open Source Driven for a Healthy Future},
author={Rongsheng Wang and Ruizhe Zhou and Haoming Chen and Yapeng Wang and Tao Tan},
year={2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/WangRongsheng/CareGPT}},
}
@article{wang2023ivygpt,
title={IvyGPT: InteractiVe Chinese pathwaY language model in medical domain},
author={Wang, Rongsheng and Duan, Yaofei and Lam, ChanTong and Chen, Jiexi and Xu, Jiangsheng and Chen, Haoming and Liu, Xiaohong and Pang, Patrick Cheong-Iao and Tan, Tao},
journal={arXiv preprint arXiv:2307.10512},
year={2023}
}
@Misc{llama-factory,
title = {LLaMA Factory},
author = {hiyouga},
howpublished = {\url{https://github.com/hiyouga/LLaMA-Factory}},
year = {2023}
}
- 机器之心 SOTA: this project has been included in the Synced (机器之心) SOTA model library.
- Backdrop Build V2: this project took part in the Backdrop Build V2 event.
- AGI For Better: this project took part in the event jointly organized by Baichuan Intelligence and Amazon Web Services.
🔔License
This repository is licensed under the MIT License; please refer to the license for the terms.