Open Engineer

What is text retrieval?

Imagine you're in a library with millions of books, and you're looking for one specific line from a play by Shakespeare. You could spend hours, days, or even years searching. Now, imagine you could find that line within milliseconds. That's what text retrieval does for us; it's the linchpin of any search engine and the unsung hero of our modern, information-saturated era.

Need to search medical records? Engineering docs for your entire company? A collection of recipes? Or every single paragraph in English Wikipedia? Need to do it at scale, accurately, and in milliseconds? Text retrieval is the right tool for the job.

More recently, the topic of text retrieval has received a lot of attention in the domain of AI assistants and Q&A chatbots. Consider the scenario where you'd like to use an LLM like GPT-4 to answer questions about a large corpus of text, such as a large collection of scientific papers or thousands of recent news articles. Since this entire corpus won't be able to fit within the limited context window of an LLM, you'd need a fast way to retrieve only the most relevant articles. Once again, this is where text retrieval comes in.

A Very Basic Definition

Text retrieval is the task of taking a query and retrieving the most relevant documents from a corpus of text. A query typically refers to a short string of text, such as a question or a search term, while a document refers to a longer piece of text, such as a paragraph or article, and a corpus refers to a collection of documents. In the case of a search engine like Google, the query is the search term, the documents are web pages, and the corpus is the entire internet.

A Very Basic Example

At the most rudimentary level, we can implement a basic text retrieval system using a simple keyword-matching algorithm. If we define our corpus as a list of short phrases (our documents), we can then retrieve relevant documents by finding all phrases that contain a matching substring (our query). Here's a simple Python example for illustration:

Python

# define our corpus of documents
corpus = [
	"the cat in the hat",
	"the lion, the witch and the wardrobe",
	"green eggs and ham"
]

# define our query
query = "hat"

# perform text retrieval by finding all documents that contain the query
matching_docs = [document for document in corpus if query in document.split()] print(matching_docs)

But, as you may imagine, this is a very naive way to uncover the most relevant matches. In reality, text retrieval is not just a simple Ctrl+F on a text document. When we move into the realm of terabytes of data, millions of users, and complex queries, things get interesting. You start considering relevance, ranking, context, and a myriad of other factors that affect the results.