Awesome Scientific Language Models

A curated list of pre-trained language models in scientific domains (e.g., mathematics, physics, chemistry, materials science, biology, medicine, geoscience), covering different model sizes (from 100M to 100B parameters) and modalities (e.g., language, graph, vision, table, molecule, protein, genome, climate time series).

This repository accompanies our survey paper, "A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery", and will be continuously updated.

NOTE 1: To avoid ambiguity, when we state the number of parameters in a model, "Base" refers to 110M (i.e., BERT-Base), and "Large" refers to 340M (i.e., BERT-Large). All other sizes are written explicitly.

NOTE 2: In each subsection, papers are sorted chronologically. If a paper has a preprint version (e.g., on arXiv or bioRxiv), its publication date is taken from the preprint service. Otherwise, its publication date is taken from the conference proceedings or journal.

NOTE 3: We appreciate contributions. To suggest a paper, feel free to reach out to yuz9@illinois.edu or submit a pull request. For format consistency, we will include a paper once (1) it has a version with author names AND (2) its GitHub and/or Hugging Face links are available.

General
- Language
- Language + Graph
Mathematics
Physics
- Language
Chemistry and Materials Science
Biology and Medicine
Geography, Geology, and Environmental Science

General

<h3 id="general-language">Language</h3>

(SciBERT) SciBERT: A Pretrained Language Model for Scientific Text EMNLP 2019
[Paper] [GitHub] [Model (Base)]
(SciGPT2) Explaining Relationships between Scientific Documents ACL 2021
[Paper] [GitHub] [Model (117M)]
(CATTS) TLDR: Extreme Summarization of Scientific Documents EMNLP 2020 Findings
[Paper] [GitHub] [Model (406M)]
(SciNewsBERT) SciClops: Detecting and Contextualizing Scientific Claims for Assisting Manual Fact-Checking CIKM 2021
[Paper] [Model (Base)]
(ScholarBERT) The Diminishing Returns of Masked Language Models to Science ACL 2023 Findings
[Paper] [Model (Large)] [Model (770M)]
(AcademicRoBERTa) A Japanese Masked Language Model for Academic Domain COLING 2022 Workshop
[Paper] [GitHub] [Model (125M)]
(Galactica) Galactica: A Large Language Model for Science arXiv 2022
[Paper] [Model (125M)] [Model (1.3B)] [Model (6.7B)] [Model (30B)] [Model (120B)]
(DARWIN) DARWIN Series: Domain Specific Large Language Models for Natural Science arXiv 2023
[Paper] [GitHub] [Model (7B)]
(FORGE) FORGE: Pre-training Open Foundation Models for Science SC 2023
[Paper] [GitHub] [Model (1.4B, General)] [Model (1.4B, Biology/Medicine)] [Model (1.4B, Chemistry)] [Model (1.4B, Engineering)] [Model (1.4B, Materials Science)] [Model (1.4B, Physics)] [Model (1.4B, Social Science/Art)] [Model (13B, General)] [Model (22B, General)]
(SciGLM) SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning arXiv 2024
[Paper] [GitHub] [Model (6B)]

<h3 id="general-language-graph">Language + Graph</h3>

(SPECTER) SPECTER: Document-level Representation Learning using Citation-informed Transformers ACL 2020
[Paper] [GitHub] [Model (Base)]
(OAG-BERT) OAG-BERT: Towards a Unified Backbone Language Model for Academic Knowledge Services KDD 2022
[Paper] [GitHub]
(ASPIRE) Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity NAACL 2022
[Paper] [GitHub] [Model (Base)]
(SciNCL) Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings EMNLP 2022
[Paper] [GitHub] [Model (Base)]
(SPECTER 2.0) SciRepEval: A Multi-Format Benchmark for Scientific Document Representations EMNLP 2023
[Paper] [GitHub] [Model (113M)]
(SciPatton) Patton: Language Model Pretraining on Text-Rich Networks ACL 2023
[Paper] [GitHub]
(SciMult) Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding EMNLP 2023 Findings
[Paper] [GitHub] [Model (138M)]

Mathematics

<h3 id="mathematics-language">Language</h3>

(GenBERT) Injecting Numerical Reasoning Skills into Language Models ACL 2020
[Paper] [GitHub]
(MathBERT) MathBERT: A Pre-trained Language Model for General NLP Tasks in Mathematics Education arXiv 2021
[Paper] [GitHub] [Model (Base)]
(MWP-BERT) MWP-BERT: Numeracy-Augmented Pre-training for Math Word Problem Solving NAACL 2022 Findings
[Paper] [GitHub] [Model (Base)]
(BERT-TD) Seeking Patterns, Not just Memorizing Procedures: Contrastive Learning for Solving Math Word Problems ACL 2022 Findings
[Paper] [GitHub]
(GSM8K-GPT) Training Verifiers to Solve Math Word Problems arXiv 2021
[Paper] [GitHub]
(DeductReasoner) Learning to Reason Deductively: Math Word Problem Solving as Complex Relation Extraction ACL 2022
[Paper] [GitHub] [Model (125M)]
(NaturalProver) NaturalProver: Grounded Mathematical Proof Generation with Language Models NeurIPS 2022
[Paper] [GitHub]
(Minerva) Solving Quantitative Reasoning Problems with Language Models NeurIPS 2022
[Paper]
(Bhaskara) Lila: A Unified Benchmark for Mathematical Reasoning EMNLP 2022
[Paper] [GitHub] [Model (2.7B)]
(WizardMath) WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct arXiv 2023
[Paper] [GitHub] [Model (7B)] [Model (13B)] [Model (70B)]
(MAmmoTH) MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning ICLR 2024
[Paper] [GitHub] [Model (7B, LLaMA-2)] [Model (7B, Mistral)] [Model (13B, LLaMA-2)] [[Model (70B,