Scientific Large Language Models (Sci-LLMs)
This repository collects papers on scientific large language models, particularly in the domains of biology and chemistry.
😎 You are welcome to recommend missing papers through Adding Issues or Pull Requests.
🔔 News
- 💥 [2024/07] We have updated our survey paper by incorporating the latest related works. Please refer to the revised version on arXiv.
- 💥 [2024/01] Our survey paper 'Scientific Large Language Models: A Survey on Biological & Chemical Domains' has been released on arXiv.
In this survey, we focus on scientific languages (i.e., textual, molecular, protein and genomic languages), as well as their combination (i.e., multimodal language).
🌟 Contents
- Scientific Large Language Models (Sci-LLMs)
📖 Textual Scientific Large Language Models (Text-Sci-LLMs)
Medical
- 2019.04 ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission, arXiv, Code
- 2022.02 GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records, arXiv, Model
- 2022.12 BioMedLM, .stanford.edu, huggingface
- 2023.05 A Study of Generative Large Language Model for Medical Research and Healthcare (GatorTronGPT), arXiv, Code
- 2023.11 MEDITRON-70B: Scaling Medical Pretraining for Large Language Models, arXiv, Code
- 2024.03 Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks (Meerkat), arXiv
- 2023.06 ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation, arXiv
- 2023.10 Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model, arXiv, Code
- 2023.03 ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge, arXiv, Code
- 2023.04 HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge, arXiv, Code
- 2023.05 HuatuoGPT, towards Taming Language Model to Be a Doctor, arXiv, Code
- 2023.04 Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data, arXiv, Code
- 2023.08 Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue, arXiv, Code
- 2023.04 PMC-LLaMA: Towards Building Open-source Language Models for Medicine, arXiv, Code
- 2023.09 CPLLM: Clinical Prediction with Large Language Models, arXiv, Code
- 2023.05 Towards Expert-Level Medical Question Answering with Large Language Models (Med-PaLM 2), Google Research, arXiv
- 2023.05 Clinical Camel: An Open Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding, arXiv, Code
- 2023.04 DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task, arXiv, Code
- 2023.10 BianQue: Balancing the Questioning and Suggestion Ability of Health LLMs with Multi-turn Health Conversations Polished by ChatGPT, arXiv, Code
- 2024.01 Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain, arXiv
- 2024.02 Me LLaMA: Foundation Large Language Models for Medical Applications, arXiv, Code
- 2024.02 BiMediX: Bilingual Medical Mixture of Experts LLM, arXiv, Code, Hugging Face
Biology
- 2019.04 BioELMo: Probing Biomedical Embeddings from Language Models, arXiv, Code
- 2019.05 BioBERT: a pre-trained biomedical language representation model for biomedical text mining, arXiv, Code
- 2019.07 Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets, arXiv, Code
- 2020.10 BioMegatron: Larger Biomedical Domain Language Model, arXiv, Code
- 2020.10 Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing, arXiv, Hugging Face
- 2021.06 BioM-Transformers: Building Large Biomedical Language Models with BERT, ALBERT and ELECTRA, ACL Anthology, Code
- 2022.03 LinkBERT: Pretraining Language Models with Document Links, arXiv, Code
- 2023.03 BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining, arXiv, Code
- 2023.08 BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine, arXiv, Code
- 2023.09 BioinspiredLLM: Conversational Large Language Model for the Mechanics of Biological and Bio-Inspired Materials, arXiv
- 2024.02 BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains, arXiv, Code
Chemistry
- 2021.06 Automated Chemical Reaction Extraction from Scientific Literature, Journal of Chemical Information and Modeling, Code
- 2021.09 MatSciBERT: A materials domain language model for text mining and information extraction, npj Computational Materials, Code
- 2022.09 A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing, npj Computational Materials, Hugging Face
- 2024.01 ChemDFM: Dialogue Foundation Model for Chemistry, arXiv, Model
- 2024.02 ChemLLM: A Chemical Large Language Model, arXiv, Model
- 2024.02 LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset, arXiv, Page, Model, Dataset
- 2024.02 PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry, arXiv
Comprehensive
- 2019.09 SciBERT: A Pretrained Language Model for Scientific Text, arXiv, Code
- 2023.05 The Diminishing Returns of Masked Language Models to Science, arXiv, Hugging Face
- 2023.08 DARWIN Series: Domain Specific Large Language Models for Natural Science, arXiv, Code
- 2024.01 SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning, arXiv, GitHub
- 2024.03 Uni-SMART: Universal Science Multimodal Analysis and Research Transformer, arXiv
- 2024.05 INDUS: Effective and Efficient Language Models for Scientific Applications, arXiv
Datasets and Benchmarks
- The MIMIC dataset, 2016.05. mimic-code, Data Descriptor: MIMIC-III, a freely accessible critical care database, Scientific Data
- eICU-CRD, 2019.04. The eICU Collaborative Research Database, a freely available multi-center database for critical care research, Scientific Data
- cMedQA2, 2018.11. Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection, IEEE Access