CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks
Abstract
A generative retrieval model named CorpusBrain achieves superior performance in knowledge-intensive language tasks by encoding all corpus information into its parameters, eliminating the need for traditional indexing and enabling end-to-end optimization.
Knowledge-intensive language tasks (KILT) usually require a large body of information to provide correct answers. A popular paradigm for solving this problem is to combine a search system with a machine reader, where the former retrieves supporting evidence and the latter examines it to produce answers. Recently, the reader component has witnessed significant advances with the help of large-scale pre-trained generative models. Meanwhile, most existing solutions for the search component rely on the traditional "index-retrieve-then-rank" pipeline, which suffers from a large memory footprint and is difficult to optimize end to end. Inspired by recent efforts in constructing model-based IR models, we propose to replace the traditional multi-step search pipeline with a novel single-step generative model, which can dramatically simplify the search process and be optimized in an end-to-end manner. We show that a strong generative retrieval model can be learned with a set of adequately designed pre-training tasks, and can be adapted to improve a variety of downstream KILT tasks with further fine-tuning. We name the pre-trained generative retrieval model CorpusBrain, as all information about the corpus is encoded in its parameters without the need to construct an additional index. Empirical results show that CorpusBrain can significantly outperform strong baselines for the retrieval task on the KILT benchmark and establish new state-of-the-art downstream performance. We also show that CorpusBrain works well under zero- and low-resource settings.
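To make the "single-step generative" idea concrete, here is a minimal sketch of how a retriever might emit a document identifier token by token instead of consulting an index. It is an illustration under stated assumptions, not CorpusBrain's implementation: the abstract does not specify the decoding procedure, so the prefix trie over identifiers, the `<eos>` marker, the greedy search, and the `toy_score` function (a stand-in for a trained sequence-to-sequence model) are all hypothetical.

```python
from typing import Callable, List, Sequence

END = "<eos>"  # hypothetical end-of-identifier marker


def build_trie(identifiers: Sequence[str]) -> dict:
    """Build a prefix trie over whitespace-tokenized document identifiers."""
    root: dict = {}
    for ident in identifiers:
        node = root
        for tok in ident.split() + [END]:
            node = node.setdefault(tok, {})
    return root


def allowed_next(trie: dict, prefix: List[str]) -> List[str]:
    """Return the tokens that can legally follow `prefix` in the trie."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []
    return list(node.keys())


def toy_score(query: str, prefix: List[str], token: str) -> float:
    """Placeholder for a learned model: 1.0 if the token occurs in the query."""
    return 1.0 if token.lower() in query.lower().split() else 0.0


def generate_identifier(
    query: str,
    trie: dict,
    score: Callable[[str, List[str], str], float] = toy_score,
    max_len: int = 16,
) -> str:
    """Greedily decode an identifier, restricted to paths present in the trie."""
    prefix: List[str] = []
    for _ in range(max_len):
        candidates = allowed_next(trie, prefix)
        if not candidates:
            break
        best = max(candidates, key=lambda t: score(query, prefix, t))
        if best == END:
            break
        prefix.append(best)
    return " ".join(prefix)


# Hypothetical corpus of three document identifiers.
corpus_ids = ["Albert Einstein", "Albert Camus", "Marie Curie"]
trie = build_trie(corpus_ids)
result = generate_identifier("marie curie physics", trie)
```

In this toy setup the only corpus-specific structure is the set of valid identifiers used to constrain decoding; everything else would live in the parameters of the model that `toy_score` stands in for. Greedy decoding can commit to a wrong first token, so a practical system would more plausibly use beam search.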