awesome-pretrained-models-for-information-retrieval
A curated list of awesome papers related to pre-trained models for information retrieval (a.k.a. pre-training for IR). If I missed any papers, feel free to open a PR to include them! Any feedback and contributions are welcome!
Pre-training for IR
Survey Papers
- Pre-training Methods in Information Retrieval. Yixing Fan, Xiaohui Xie et.al. FnTIR 2022
- Dense Text Retrieval based on Pretrained Language Models: A Survey. Wayne Xin Zhao, Jing Liu et.al. Arxiv 2022
- Pretrained Transformers for Text Ranking: BERT and Beyond. Jimmy Lin et.al. M&C 2021
- Semantic Models for the First-stage Retrieval: A Comprehensive Review. Jiafeng Guo et.al. TOIS 2021
- A Deep Look into Neural Ranking Models for Information Retrieval. Jiafeng Guo et.al. IPM 2020
First Stage Retrieval
Sparse Retrieval
Neural term re-weighting
- Learning to Reweight Terms with Distributed Representations. Guoqing Zheng, Jamie Callan. SIGIR 2015. (DeepTR)
- Context-Aware Term Weighting For First Stage Passage Retrieval. Zhuyun Dai et.al. SIGIR 2020 short. [code] (DeepCT; a minimal term-weighting sketch follows this list)
- Context-Aware Document Term Weighting for Ad-Hoc Search. Zhuyun Dai et.al. WWW 2020. [code] (HDCT)
- Learning Term Discrimination. Jibril Frej et.al. SIGIR 2020. (IDF-reweighting)
- COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List. Luyu Gao et.al. NAACL 2021. [code] (COIL)
- Learning Passage Impacts for Inverted Indexes. Antonio Mallia et.al. SIGIR 2021 short. [code] (DeepImpact)
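
The papers in this subsection share one core idea: a contextual encoder predicts a scalar importance for each term, which replaces raw term frequency in the inverted index. Below is a minimal, illustrative sketch in that spirit (not the DeepCT authors' code); the regression head `term_weight_head` is an assumption and would need to be trained against relevance-derived target weights as in the papers.

```python
# Illustrative DeepCT-style contextualized term weighting (not the authors' code):
# a linear head maps each BERT token embedding to a scalar importance score.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
term_weight_head = torch.nn.Linear(encoder.config.hidden_size, 1)  # hypothetical head; trained in practice

def term_weights(passage: str) -> dict:
    inputs = tokenizer(passage, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state              # [1, seq_len, hidden]
        scores = term_weight_head(hidden).squeeze(-1).squeeze(0)  # [seq_len]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    weights = {}
    for tok, score in zip(tokens, scores.tolist()):
        if tok in tokenizer.all_special_tokens:
            continue
        # Keep the highest predicted weight per token; these values replace tf at indexing time.
        weights[tok] = max(weights.get(tok, 0.0), score)
    return weights

print(term_weights("context-aware term weighting for first-stage passage retrieval"))
```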
Query or document expansion
- Document Expansion by Query Prediction. Rodrigo Nogueira et.al. Arxiv 2019. [doc2query code, docTTTTTquery code] (doc2query, docTTTTTquery; an expansion sketch follows this list)
- Generation-Augmented Retrieval for Open-Domain Question Answering. Yuning Mao et.al. ACL 2021. [code] (query expansion with BART)
- Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation. Jeong et.al. Arxiv 2021. [code] (unsupervised document expansion)
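
All of these expansion papers follow the same recipe: generate likely queries (or expansion terms) with a pre-trained generator and append them to the document before building a standard sparse index. A minimal sketch, assuming the public docTTTTTquery T5 checkpoint released by its authors; the sampling settings below are illustrative.

```python
# Illustrative doc2query/docTTTTTquery-style expansion: a seq2seq model predicts
# queries the document could answer; the predictions are appended to the text
# before sparse (e.g. BM25) indexing.
from transformers import T5ForConditionalGeneration, T5Tokenizer

name = "castorini/doc2query-t5-base-msmarco"  # public docTTTTTquery checkpoint
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

def expand(document: str, num_queries: int = 3) -> str:
    input_ids = tokenizer(document, return_tensors="pt", truncation=True).input_ids
    outputs = model.generate(input_ids, max_length=64, do_sample=True, top_k=10,
                             num_return_sequences=num_queries)
    queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    return document + " " + " ".join(queries)

print(expand("The Manhattan Project was a research effort that produced the first nuclear weapons."))
```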
Sparse representation learning
- SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval. Yang Bai, Xiaoguang Li et.al. Arxiv 2020. (SparTerm: term importance distribution from MLM + binary term gating)
- Contextualized Sparse Representations for Real-Time Open-Domain Question Answering. Jinhyuk Lee, Minjoon Seo et.al. ACL 2020. [code] (SPARC, sparse vectors)
- SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking (and SPLADE v2). Thibault Formal et.al. SIGIR 2021. [code] (SPLADE; a sparse-representation sketch follows this list)
- Ultra-High Dimensional Sparse Representations with Binarization for Efficient Text Retrieval. Kyoung-Rok Jang et.al. EMNLP 2021. (UHD)
- Efficient Passage Retrieval with Hashing for Open-domain Question Answering. Ikuya Yamada et.al. ACL 2021. [code] (BPR, convert embedding vector to binary codes)
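
Most of the entries above learn a high-dimensional, mostly-zero vector over the vocabulary so that retrieval can still use an inverted index. A minimal SPLADE-flavoured sketch is given below (illustrative only; the real models add a FLOPS/sparsity regularizer and fine-tune on relevance data).

```python
# Illustrative SPLADE-style sparse representation: MLM logits pass through
# log(1 + ReLU(.)) and are max-pooled over positions, giving one non-negative
# weight per vocabulary term; scoring is a dot product of two sparse vectors.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def sparse_rep(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = mlm(**inputs).logits                                    # [1, seq_len, vocab]
    return torch.log1p(torch.relu(logits)).max(dim=1).values.squeeze(0)  # [vocab]

query = sparse_rep("what causes tides")
doc = sparse_rep("tides are caused by the gravitational pull of the moon and the sun")
print("score:", torch.dot(query, doc).item(), "| active terms in doc:", int((doc > 0).sum()))
```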
Dense Retrieval
Hard negative sampling
- Dense Passage Retrieval for Open-Domain Question Answering. Vladimir Karpukhin, Barlas Oguz et.al. EMNLP 2020. [code] (DPR, in-batch negatives; a minimal loss sketch follows this list)
- RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. Jingtao Zhan et.al. Arxiv 2020. [code] (RepBERT)
- Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. Lee Xiong, Chenyan Xiong et.al. ICLR 2021. [code] (ANCE, refreshes the ANN index during training)
- RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. Yingqi Qu et.al. NAACL 2021. (RocketQA: cross-batch negatives, denoised hard negatives, and data augmentation)
- Optimizing Dense Retrieval Model Training with Hard Negatives. Jingtao Zhan et.al. SIGIR 2021. [code] (ADORE & STAR, query-side fine-tuning built on pre-trained document encoders)
- Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. Sebastian Hofstätter et.al. SIGIR 2021. [code] (TAS-Balanced, samples from query clusters and distills from a BERT ensemble)
- PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval. Ruiyang Ren et.al. ACL Findings 2021. [code] (PAIR)
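
A recurring ingredient in this subsection is the contrastive loss with in-batch (plus cross-batch or mined hard) negatives: each query is trained to score its own positive passage above every other passage in the batch. A minimal sketch of the in-batch variant, with random tensors standing in for the query and passage encoders:

```python
# Illustrative DPR-style in-batch negative loss: the query/passage similarity
# matrix is trained with cross-entropy whose targets lie on the diagonal, so
# every other passage in the batch acts as a negative.
import torch
import torch.nn.functional as F

def in_batch_negative_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor) -> torch.Tensor:
    # query_emb, passage_emb: [batch, dim]; row i of passage_emb is the positive for query i.
    scores = query_emb @ passage_emb.T                      # [batch, batch] dot products
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)

# Toy usage: random embeddings stand in for the outputs of the two BERT encoders.
q = torch.randn(8, 768, requires_grad=True)
p = torch.randn(8, 768, requires_grad=True)
loss = in_batch_negative_loss(q, p)
loss.backward()
print(loss.item())
```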
Late interaction and multi-vector representation
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Omar Khattab et.al. SIGIR 2020. [code] (ColBERT; a MaxSim scoring sketch follows this list)
- Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. Samuel Humeau, Kurt Shuster et.al. ICLR 2020. [code] (Poly-encoders)
- Sparse, Dense, and Attentional Representations for Text Retrieval. Yi Luan, Jacob Eisenstein et.al. TACL 2021. (ME-BERT, multi-vectors)
- Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval. Hongyin Tang, Xingwu Sun et.al. ACL 2021.
- Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index. Minjoon Seo, Jinhyuk Lee et.al. ACL 2019. [code] (DENSPI)
- Learning Dense Representations of Phrases at Scale. Jinhyuk Lee, Danqi Chen et.al. ACL 2021. [code] (DensePhrases)
- Multi-View Document Representation Learning for Open-Domain Dense Retrieval. Shunyu Zhang et.al. ACL 2022. (MVR)
- Multivariate Representation Learning for Information Retrieval. Hamed Zamani et.al. SIGIR 2023. (Learn multivariate distributions)
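
Late-interaction models such as ColBERT keep one embedding per token and defer query-document interaction to a cheap MaxSim step at scoring time. A minimal sketch of the scoring function (illustrative; random normalized vectors stand in for encoder outputs):

```python
# Illustrative ColBERT-style late interaction (MaxSim): each query token takes its
# maximum similarity over all document tokens, and the maxima are summed.
import torch
import torch.nn.functional as F

def late_interaction_score(query_tok_emb: torch.Tensor, doc_tok_emb: torch.Tensor) -> torch.Tensor:
    # query_tok_emb: [q_len, dim], doc_tok_emb: [d_len, dim], both L2-normalized.
    sim = query_tok_emb @ doc_tok_emb.T      # [q_len, d_len] cosine similarities
    return sim.max(dim=1).values.sum()       # MaxSim per query token, then sum

# Toy usage with random normalized token embeddings standing in for BERT outputs.
q = F.normalize(torch.randn(5, 128), dim=-1)
d = F.normalize(torch.randn(40, 128), dim=-1)
print(late_interaction_score(q, d).item())
```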
Knowledge distillation
- Distilling Knowledge from Reader to Retriever for Question Answering. Gautier Izacard, Edouard Grave. ICLR 2021. [unofficial code] (distills the reader's cross-attention scores to the retriever)
- Distilling Knowledge for Fast Retrieval-based Chat-bots. Amir Vakili Tahami et.al. SIGIR 2020. [code] (Distill from cross-encoders to bi-encoders)
- Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. Sebastian Hofstätter et.al. Arxiv 2020. [code] (Margin-MSE distillation from a BERT ensemble; a minimal loss sketch follows this list)
- Distilling Dense Representations for Ranking using Tightly-Coupled Teachers. Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin. Arxiv 2020. [code] (TCT-ColBERT: distill from ColBERT)
- Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. Sebastian Hofstätter et.al. SIGIR 2021. [code] (TAS-Balanced, samples from query clusters and distills from a BERT ensemble)
- RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking. Ruiyang Ren, Yingqi Qu et.al. EMNLP 2021. [code] (RocketQAv2, joint learning by distillation)
- Curriculum Contrastive Context Denoising for Few-shot Conversational Dense Retrieval. Kelong Mao et.al. SIGIR 2022.
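
A common objective across these distillation papers is to let an expensive cross-encoder (or ensemble) teacher supervise a cheap bi-encoder student; the Margin-MSE loss of Hofstätter et al. matches the student's positive-minus-negative score margin to the teacher's. A minimal sketch of that loss, with toy scores in place of real model outputs:

```python
# Illustrative Margin-MSE distillation loss (cross-encoder teacher -> bi-encoder
# student): the student's score margin between a positive and a negative passage
# is regressed onto the teacher's margin.
import torch
import torch.nn.functional as F

def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    # All inputs: [batch] relevance scores for (query, passage+) and (query, passage-).
    return F.mse_loss(student_pos - student_neg, teacher_pos - teacher_neg)

# Toy usage: fixed teacher margins as targets, student scores carry gradients.
teacher_pos, teacher_neg = torch.tensor([8.1, 5.3]), torch.tensor([1.2, 4.0])
student_pos = torch.tensor([0.7, 0.4], requires_grad=True)
student_neg = torch.tensor([0.2, 0.5], requires_grad=True)
loss = margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg)
loss.backward()
print(loss.item())
```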