awesome-pretrained-models-for-information-retrieval
A curated list of awesome papers related to pre-trained models for information retrieval (a.k.a. pre-training for IR). If I missed any papers, feel free to open a PR to include them! Any feedback and contributions are welcome!
Pre-training for IR
Survey Papers
- Pre-training Methods in Information Retrieval. Yixing Fan, Xiaohui Xie et.al. FnTIR 2022
- Dense Text Retrieval based on Pretrained Language Models: A Survey. Wayne Xin Zhao, Jing Liu et.al. Arxiv 2022
- Pretrained Transformers for Text Ranking: BERT and Beyond. Jimmy Lin et.al. M&C 2021
- Semantic Models for the First-stage Retrieval: A Comprehensive Review. Jiafeng Guo et.al. TOIS 2021
- A Deep Look into Neural Ranking Models for Information Retrieval. Jiafeng Guo et.al. IPM 2020
First Stage Retrieval
Sparse Retrieval
Neural term re-weighting
- Learning to Reweight Terms with Distributed Representations. Guoqing Zheng, Jamie Callan. SIGIR 2015. (DeepTR)
- Context-Aware Term Weighting For First Stage Passage Retrieval. Zhuyun Dai et.al. SIGIR 2020 short. [code] (DeepCT; a minimal term-weighting sketch follows this list)
- Context-Aware Document Term Weighting for Ad-Hoc Search. Zhuyun Dai et.al. WWW 2020. [code] (HDCT)
- Learning Term Discrimination. Jibril Frej et.al. SIGIR 2020. (IDF-reweighting)
- COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List. Luyu Gao et.al. NAACL 2021. [code] (COIL)
- Learning Passage Impacts for Inverted Indexes. Antonio Mallia et.al. SIGIR 2021 short. [code] (DeepImpact)
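
The papers in this subsection share one core idea: a contextual encoder predicts a scalar importance for each term, which replaces raw term frequency in the inverted index. Below is a minimal, illustrative sketch in that spirit (not the DeepCT authors' code); the regression head `term_weight_head` is an assumption and would need to be trained against relevance-derived target weights as in the papers.

```python
# Illustrative DeepCT-style contextualized term weighting (not the authors' code):
# a linear head maps each BERT token embedding to a scalar importance score.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
term_weight_head = torch.nn.Linear(encoder.config.hidden_size, 1)  # hypothetical head; trained in practice

def term_weights(passage: str) -> dict:
    inputs = tokenizer(passage, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state              # [1, seq_len, hidden]
        scores = term_weight_head(hidden).squeeze(-1).squeeze(0)  # [seq_len]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    weights = {}
    for tok, score in zip(tokens, scores.tolist()):
        if tok in tokenizer.all_special_tokens:
            continue
        # Keep the highest predicted weight per token; these values replace tf at indexing time.
        weights[tok] = max(weights.get(tok, 0.0), score)
    return weights

print(term_weights("context-aware term weighting for first-stage passage retrieval"))
```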
Query or document expansion
- Document Expansion by Query Prediction. Rodrigo Nogueira et.al. Arxiv 2019. [doc2query code, docTTTTTquery code] (doc2query, docTTTTTquery; an expansion sketch follows this list)
- Generation-Augmented Retrieval for Open-Domain Question Answering. Yuning Mao et.al. ACL 2021. [code] (query expansion with BART)
- Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation. Jeong et.al. Arxiv 2021. [code] (unsupervised document expansion)
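
All of these expansion papers follow the same recipe: generate likely queries (or expansion terms) with a pre-trained generator and append them to the document before building a standard sparse index. A minimal sketch, assuming the public docTTTTTquery T5 checkpoint released by its authors; the sampling settings below are illustrative.

```python
# Illustrative doc2query/docTTTTTquery-style expansion: a seq2seq model predicts
# queries the document could answer; the predictions are appended to the text
# before sparse (e.g. BM25) indexing.
from transformers import T5ForConditionalGeneration, T5Tokenizer

name = "castorini/doc2query-t5-base-msmarco"  # public docTTTTTquery checkpoint
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

def expand(document: str, num_queries: int = 3) -> str:
    input_ids = tokenizer(document, return_tensors="pt", truncation=True).input_ids
    outputs = model.generate(input_ids, max_length=64, do_sample=True, top_k=10,
                             num_return_sequences=num_queries)
    queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    return document + " " + " ".join(queries)

print(expand("The Manhattan Project was a research effort that produced the first nuclear weapons."))
```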
Sparse representation learning
- SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval. Yang Bai, Xiaoguang Li et.al. Arxiv 2020. (SparTerm: term importance distribution from MLM + binary term gating)
- Contextualized Sparse Representations for Real-Time Open-Domain Question Answering. Jinhyuk Lee, Minjoon Seo et.al. ACL 2020. [code] (SPARC, sparse vectors)
- SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking (and SPLADE v2). Thibault Formal et.al. SIGIR 2021. [code] (SPLADE; a sparse-representation sketch follows this list)
- Ultra-High Dimensional Sparse Representations with Binarization for Efficient Text Retrieval. Kyoung-Rok Jang et.al. EMNLP 2021. (UHD)
- Efficient Passage Retrieval with Hashing for Open-domain Question Answering. Ikuya Yamada et.al. ACL 2021. [code] (BPR, convert embedding vector to binary codes)
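
Most of the entries above learn a high-dimensional, mostly-zero vector over the vocabulary so that retrieval can still use an inverted index. A minimal SPLADE-flavoured sketch is given below (illustrative only; the real models add a FLOPS/sparsity regularizer and fine-tune on relevance data).

```python
# Illustrative SPLADE-style sparse representation: MLM logits pass through
# log(1 + ReLU(.)) and are max-pooled over positions, giving one non-negative
# weight per vocabulary term; scoring is a dot product of two sparse vectors.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def sparse_rep(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = mlm(**inputs).logits                                    # [1, seq_len, vocab]
    return torch.log1p(torch.relu(logits)).max(dim=1).values.squeeze(0)  # [vocab]

query = sparse_rep("what causes tides")
doc = sparse_rep("tides are caused by the gravitational pull of the moon and the sun")
print("score:", torch.dot(query, doc).item(), "| active terms in doc:", int((doc > 0).sum()))
```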
Dense Retrieval
Hard negative sampling
- Dense Passage Retrieval for Open-Domain Question Answering. Vladimir Karpukhin, Barlas Oguz et.al. EMNLP 2020. [code] (DPR, in-batch negatives; a minimal loss sketch follows this list)
- RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. Jingtao Zhan et.al. Arxiv 2020. [code] (RepBERT)
- Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. Lee Xiong, Chenyan Xiong et.al. ICLR 2021. [code] (ANCE, refreshes the ANN index during training)
- RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. Yingqi Qu et.al. NAACL 2021. (RocketQA: cross-batch negatives, denoised hard negatives, and data augmentation)
- Optimizing Dense Retrieval Model Training with Hard Negatives. Jingtao Zhan et.al. SIGIR 2021. [code] (ADORE & STAR, query-side fine-tuning built on pre-trained document encoders)
- Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. Sebastian Hofstätter et.al. SIGIR 2021. [code] (TAS-Balanced, samples from query clusters and distills from a BERT ensemble)
- PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval. Ruiyang Ren et.al. ACL Findings 2021. [code] (PAIR)
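
A recurring ingredient in this subsection is the contrastive loss with in-batch (plus cross-batch or mined hard) negatives: each query is trained to score its own positive passage above every other passage in the batch. A minimal sketch of the in-batch variant, with random tensors standing in for the query and passage encoders:

```python
# Illustrative DPR-style in-batch negative loss: the query/passage similarity
# matrix is trained with cross-entropy whose targets lie on the diagonal, so
# every other passage in the batch acts as a negative.
import torch
import torch.nn.functional as F

def in_batch_negative_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor) -> torch.Tensor:
    # query_emb, passage_emb: [batch, dim]; row i of passage_emb is the positive for query i.
    scores = query_emb @ passage_emb.T                      # [batch, batch] dot products
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)

# Toy usage: random embeddings stand in for the outputs of the two BERT encoders.
q = torch.randn(8, 768, requires_grad=True)
p = torch.randn(8, 768, requires_grad=True)
loss = in_batch_negative_loss(q, p)
loss.backward()
print(loss.item())
```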
Late interaction and multi-vector representation
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Omar Khattab et.al. SIGIR 2020. [code] (ColBERT; a MaxSim scoring sketch follows this list)
- Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. Samuel Humeau, Kurt Shuster et.al. ICLR 2020. [code] (Poly-encoders)
- Sparse, Dense, and Attentional Representations for Text Retrieval. Yi Luan, Jacob Eisenstein et.al. TACL 2021. (ME-BERT, multi-vectors)
- Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval. Hongyin Tang, Xingwu Sun et.al. ACL 2021.
- Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index. Minjoon Seo, Jinhyuk Lee et.al. ACL 2019. [code] (DENSPI)
- Learning Dense Representations of Phrases at Scale. Jinhyuk Lee, Danqi Chen et.al. ACL 2021. [code] (DensePhrases)
- Multi-View Document Representation Learning for Open-Domain Dense Retrieval. Shunyu Zhang et.al. ACL 2022. (MVR)
- Multivariate Representation Learning for Information Retrieval. Hamed Zamani et.al. SIGIR 2023. (Learn multivariate distributions)
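
Late-interaction models such as ColBERT keep one embedding per token and defer query-document interaction to a cheap MaxSim step at scoring time. A minimal sketch of the scoring function (illustrative; random normalized vectors stand in for encoder outputs):

```python
# Illustrative ColBERT-style late interaction (MaxSim): each query token takes its
# maximum similarity over all document tokens, and the maxima are summed.
import torch
import torch.nn.functional as F

def late_interaction_score(query_tok_emb: torch.Tensor, doc_tok_emb: torch.Tensor) -> torch.Tensor:
    # query_tok_emb: [q_len, dim], doc_tok_emb: [d_len, dim], both L2-normalized.
    sim = query_tok_emb @ doc_tok_emb.T      # [q_len, d_len] cosine similarities
    return sim.max(dim=1).values.sum()       # MaxSim per query token, then sum

# Toy usage with random normalized token embeddings standing in for BERT outputs.
q = F.normalize(torch.randn(5, 128), dim=-1)
d = F.normalize(torch.randn(40, 128), dim=-1)
print(late_interaction_score(q, d).item())
```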
Knowledge distillation
- Distilling Knowledge from Reader to Retriever for Question Answering. Gautier Izacard, Edouard Grave. ICLR 2021. [unofficial code] (distills the reader's cross-attention scores to the retriever)
- Distilling Knowledge for Fast Retrieval-based Chat-bots. Amir Vakili Tahami et.al. SIGIR 2020. [code] (Distill from cross-encoders to bi-encoders)
- Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. Sebastian Hofstätter et.al. Arxiv 2020. [code] (Margin-MSE distillation from a BERT ensemble; a minimal loss sketch follows this list)
- Distilling Dense Representations for Ranking using Tightly-Coupled Teachers. Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin. Arxiv 2020. [code] (TCT-ColBERT: distill from ColBERT)
- Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. Sebastian Hofstätter et.al. SIGIR 2021. [code] (TAS-Balanced, samples from query clusters and distills from a BERT ensemble)
- RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking. Ruiyang Ren, Yingqi Qu et.al. EMNLP 2021. [code] (RocketQAv2, joint learning by distillation)
- Curriculum Contrastive Context Denoising for Few-shot Conversational Dense Retrieval. Kelong Mao et.al. SIGIR 2022.
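
A common objective across these distillation papers is to let an expensive cross-encoder (or ensemble) teacher supervise a cheap bi-encoder student; the Margin-MSE loss of Hofstätter et al. matches the student's positive-minus-negative score margin to the teacher's. A minimal sketch of that loss, with toy scores in place of real model outputs:

```python
# Illustrative Margin-MSE distillation loss (cross-encoder teacher -> bi-encoder
# student): the student's score margin between a positive and a negative passage
# is regressed onto the teacher's margin.
import torch
import torch.nn.functional as F

def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    # All inputs: [batch] relevance scores for (query, passage+) and (query, passage-).
    return F.mse_loss(student_pos - student_neg, teacher_pos - teacher_neg)

# Toy usage: fixed teacher margins as targets, student scores carry gradients.
teacher_pos, teacher_neg = torch.tensor([8.1, 5.3]), torch.tensor([1.2, 4.0])
student_pos = torch.tensor([0.7, 0.4], requires_grad=True)
student_neg = torch.tensor([0.2, 0.5], requires_grad=True)
loss = margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg)
loss.backward()
print(loss.item())
```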