18. August 2025
Interleaving for Retrieval Augmented Generation
Hello friends. This is part 1 of a two-part series covering retriever interleaving and summary-LLM selection. In this part, our goal is to decide which retriever to use when confronted with multiple candidate engines or configurations.
This is an important article, because when a summary is present, fewer people click on results. That leaves us less data to use when iterating on retrieval engines and configurations. Since everyone is showing summaries these days, I outline a pragmatic approach to choosing a search retriever from two or three candidates when using RAG, and show how to use interleaving to iterate over configurations with one or more LLMs having a stake in the choice.
Furthermore, I will show that Bing and Brave absolutely TROUNCE Google in a showdown for use as context in a summary, that you need multiple pages of results to succeed, and that outcomes can be measured effectively to drive retriever decisions. In part 2, I will outline how to select an LLM using the results we gather here.
First we ask the question: what kinds of search results work best for an LLM during RAG? The first thing that jumps to most minds is relevance. But the answer is more complex.
When exploring this question, comparing outcomes of retrieval engine configurations one at a time with A/B testing poses significant challenges. A/B tests contrast aggregate metrics across separate traffic buckets, so they never reveal a true comparative preference on the same query; and unless the same queries are issued often enough by multiple people in each bucket, the data is too sparse to draw trustworthy conclusions.
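Interleaving sidesteps both problems: every query serves a single blended list drawn from both candidate retrievers, and whichever retriever supplied the results that earn clicks (or, in our RAG setting, the LLM's citations) gets the credit. As a concrete illustration, here is a minimal sketch of team-draft interleaving, one standard interleaving scheme; the function names and the two-retriever setup are my own assumptions rather than any particular library's API.

```python
import random

def team_draft_interleave(results_a, results_b, k=10):
    """Merge two ranked result lists via team-draft interleaving.

    Each round, the team with fewer picks so far (ties broken by a
    coin flip) contributes its highest-ranked result not already
    shown. Returns the merged list and a parallel list of team
    labels, kept so that clicks or LLM citations can later be
    credited back to the retriever that supplied each result.
    """
    interleaved, teams, seen = [], [], set()
    queues = {"A": list(results_a), "B": list(results_b)}
    picks = {"A": 0, "B": 0}

    while len(interleaved) < k and (queues["A"] or queues["B"]):
        if not queues["A"]:
            team = "B"
        elif not queues["B"]:
            team = "A"
        elif picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = random.choice(["A", "B"])

        # Skip results the other team already placed in the list.
        while queues[team] and queues[team][0] in seen:
            queues[team].pop(0)
        if queues[team]:
            doc = queues[team].pop(0)
            interleaved.append(doc)
            teams.append(team)
            seen.add(doc)
            picks[team] += 1

    return interleaved, teams

def credit_wins(teams, engaged_positions):
    """Tally which retriever supplied the results that were clicked
    or cited by the LLM for a single query."""
    tally = {"A": 0, "B": 0}
    for pos in engaged_positions:
        tally[teams[pos]] += 1
    return tally
```

Aggregated over many queries, the per-query winners from credit_wins form a paired comparison on identical queries, so even a simple sign test can tell you whether one retriever is reliably preferred, with far fewer impressions than an A/B test needs.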