Awesome Scientific Language Models
A curated list of pre-trained language models in scientific domains (e.g., mathematics, physics, chemistry, materials science, biology, medicine, geoscience), covering different model sizes (from 100M to 100B parameters) and modalities (e.g., language, graph, vision, table, molecule, protein, genome, climate time series).
The repository is part of our survey paper A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery and will be continuously updated.
NOTE 1: To avoid ambiguity, when we talk about the number of parameters in a model, "Base" refers to 110M (i.e., BERT-Base), and "Large" refers to 340M (i.e., BERT-Large). Other numbers will be written explicitly.
NOTE 2: In each subsection, papers are sorted chronologically. If a paper has a preprint (e.g., arXiv or bioRxiv) version, its publication date is taken from the preprint service. Otherwise, its publication date is taken from the conference proceedings or journal.
NOTE 3: We appreciate contributions. If you have any suggested papers, feel free to reach out to yuz9@illinois.edu or submit a pull request. For format consistency, we will include a paper after (1) it has a version with author names AND (2) its GitHub and/or Hugging Face links are available.
Contents
- General
- Mathematics
- Physics
- Chemistry and Materials Science
- Biology and Medicine
- Geography, Geology, and Environmental Science
General
Language
- (SciBERT) SciBERT: A Pretrained Language Model for Scientific Text
  EMNLP 2019
  [Paper] [GitHub] [Model (Base)]
- (SciGPT2) Explaining Relationships between Scientific Documents
  ACL 2021
  [Paper] [GitHub] [Model (117M)]
- (CATTS) TLDR: Extreme Summarization of Scientific Documents
  EMNLP 2020 Findings
  [Paper] [GitHub] [Model (406M)]
- (SciNewsBERT) SciClops: Detecting and Contextualizing Scientific Claims for Assisting Manual Fact-Checking
  CIKM 2021
  [Paper] [Model (Base)]
- (ScholarBERT) The Diminishing Returns of Masked Language Models to Science
  ACL 2023 Findings
  [Paper] [Model (Large)] [Model (770M)]
- (AcademicRoBERTa) A Japanese Masked Language Model for Academic Domain
  COLING 2022 Workshop
  [Paper] [GitHub] [Model (125M)]
- (Galactica) Galactica: A Large Language Model for Science
  arXiv 2022
  [Paper] [Model (125M)] [Model (1.3B)] [Model (6.7B)] [Model (30B)] [Model (120B)]
- (DARWIN) DARWIN Series: Domain Specific Large Language Models for Natural Science
  arXiv 2023
  [Paper] [GitHub] [Model (7B)]
- (FORGE) FORGE: Pre-training Open Foundation Models for Science
  SC 2023
  [Paper] [GitHub] [Model (1.4B, General)] [Model (1.4B, Biology/Medicine)] [Model (1.4B, Chemistry)] [Model (1.4B, Engineering)] [Model (1.4B, Materials Science)] [Model (1.4B, Physics)] [Model (1.4B, Social Science/Art)] [Model (13B, General)] [Model (22B, General)]
- (SciGLM) SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning
  arXiv 2024
  [Paper] [GitHub] [Model (6B)]
Language + Graph
- (SPECTER) SPECTER: Document-level Representation Learning using Citation-informed Transformers
  ACL 2020
  [Paper] [GitHub] [Model (Base)]
- (OAG-BERT) OAG-BERT: Towards a Unified Backbone Language Model for Academic Knowledge Services
  KDD 2022
  [Paper] [GitHub]
- (ASPIRE) Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity
  NAACL 2022
  [Paper] [GitHub] [Model (Base)]
- (SciNCL) Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings
  EMNLP 2022
  [Paper] [GitHub] [Model (Base)]
- (SPECTER 2.0) SciRepEval: A Multi-Format Benchmark for Scientific Document Representations
  EMNLP 2023
  [Paper] [GitHub] [Model (113M)]
- (SciPatton) Patton: Language Model Pretraining on Text-Rich Networks
  ACL 2023
  [Paper] [GitHub]
- (SciMult) Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding
  EMNLP 2023 Findings
  [Paper] [GitHub] [Model (138M)]
Mathematics
Language
- (GenBERT) Injecting Numerical Reasoning Skills into Language Models
  ACL 2020
  [Paper] [GitHub]
- (MathBERT) MathBERT: A Pre-trained Language Model for General NLP Tasks in Mathematics Education
  arXiv 2021
  [Paper] [GitHub] [Model (Base)]
- (MWP-BERT) MWP-BERT: Numeracy-Augmented Pre-training for Math Word Problem Solving
  NAACL 2022 Findings
  [Paper] [GitHub] [Model (Base)]
- (BERT-TD) Seeking Patterns, Not just Memorizing Procedures: Contrastive Learning for Solving Math Word Problems
  ACL 2022 Findings
  [Paper] [GitHub]
- (GSM8K-GPT) Training Verifiers to Solve Math Word Problems
  arXiv 2021
  [Paper] [GitHub]
- (DeductReasoner) Learning to Reason Deductively: Math Word Problem Solving as Complex Relation Extraction
  ACL 2022
  [Paper] [GitHub] [Model (125M)]
- (NaturalProver) NaturalProver: Grounded Mathematical Proof Generation with Language Models
  NeurIPS 2022
  [Paper] [GitHub]
- (Minerva) Solving Quantitative Reasoning Problems with Language Models
  NeurIPS 2022
  [Paper]
- (Bhaskara) Lila: A Unified Benchmark for Mathematical Reasoning
  EMNLP 2022
  [Paper] [GitHub] [Model (2.7B)]
- (WizardMath) WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
  arXiv 2023
  [Paper] [GitHub] [Model (7B)] [Model (13B)] [Model (70B)]
- (MAmmoTH) MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
  ICLR 2024
  [Paper] [GitHub] [Model (7B, LLaMA-2)] [Model (7B, Mistral)] [Model (13B, LLaMA-2)] [[Model (70B,