Machine learning has given us semantic search, a huge improvement in what search can do. However, the black-box nature of these tools can make them unsuited to some domains, such as academic literature review. How can those domains take advantage of better search while still producing reproducible results?
Full-text search has been the gold standard for decades, with powerful support built into databases like Postgres. It uses heuristics for better matching (such as treating plural and singular forms as equivalent) and indexing to keep search fast over large datasets. This remained the state of the art for almost all apps until the arrival of general-purpose ML-powered search.
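To make that concrete, here is a minimal sketch of what Postgres full-text search looks like from application code. The `papers` table, its columns, and the use of the psycopg2 driver are illustrative assumptions, not something the original setup prescribes.

```python
import psycopg2

# Hypothetical connection and schema: a `papers` table with title and abstract columns.
conn = psycopg2.connect("dbname=papers_db")

def full_text_search(query: str, limit: int = 20):
    """Rank papers against a query using Postgres's built-in full-text search."""
    sql = """
        SELECT id, title,
               ts_rank(to_tsvector('english', title || ' ' || abstract),
                       plainto_tsquery('english', %s)) AS rank
        FROM papers
        WHERE to_tsvector('english', title || ' ' || abstract)
              @@ plainto_tsquery('english', %s)
        ORDER BY rank DESC
        LIMIT %s;
    """
    with conn.cursor() as cur:
        cur.execute(sql, (query, query, limit))
        return cur.fetchall()

# plainto_tsquery normalizes word forms ("papers" matches "paper"), which is the
# pluralization-style matching mentioned above. A GIN index keeps this fast on
# large corpora:
#   CREATE INDEX papers_fts_idx ON papers
#     USING GIN (to_tsvector('english', title || ' ' || abstract));
```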
The idea of mathematically representing words by their underlying concepts goes back as far as the 1950s. Semantic search, as this approach is now known, focuses on searching for ideas rather than words. Major milestones include Word2vec (2013) and Google’s BERT (2018).
Semantic search works by turning the user’s search query from text into a high-dimensional vector, basically just several hundred numbers in an array. This words-to-vector conversion is known as an embedding, and the conceptual search space the vector implies is called the latent space. To handle the nuanced and messy semantics of natural language, we need a more powerful tool. This is where large language models come into play.
Rather than processing each word independently, the text is split into chunks and fed as a sequence into the LLM, allowing the model to perceive the richer semantics woven into our languages. The result is still an embedding, but it’s one which better captures the meaning of the input.
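As a rough illustration of text being turned into a sequence, here is how a tokenizer splits a sentence into the pieces a model actually consumes. The article doesn’t name a specific tokenizer or chunking scheme; using OpenAI’s tiktoken library here is just one concrete example.

```python
import tiktoken

# cl100k_base is the encoding used by recent OpenAI models (an example choice).
enc = tiktoken.get_encoding("cl100k_base")

text = "Semantic search focuses on ideas rather than words."
token_ids = enc.encode(text)                    # the sequence the model consumes
pieces = [enc.decode([t]) for t in token_ids]   # the text fragment behind each id

print(len(token_ids), "tokens")
print(pieces)
```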
Despite being relatively new and incredibly powerful, high-quality embeddings have quickly become a commodity. For example, using OpenAI’s embeddings API you can easily add Google-quality search to your app. Simply feed each document in your search corpus into the API to generate an embedding for that document, and do the same for any search query your user provides. Then calculate the similarity between the query embedding and each document embedding, and return the most similar documents, ranked, as your search results.
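Here is a minimal sketch of that pipeline, assuming the OpenAI Python client (v1+); the model name is one example, and the tiny in-memory corpus stands in for a real document store.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

documents = [
    "Effects of sleep deprivation on working memory",
    "A meta-analysis of mindfulness interventions for anxiety",
    "Deep learning approaches to protein folding",
]

def embed(texts: list[str]) -> np.ndarray:
    """Turn each text into an embedding vector via the embeddings API."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

doc_vectors = embed(documents)                                   # one vector per document
query_vector = embed(["how does lack of sleep affect memory?"])[0]

# Cosine similarity between the query and every document embedding.
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)

# Return the documents ranked by similarity, highest first.
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```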
The nature of most embedding models (and of efficient dense vector search algorithms) means that, if you’re not careful, the same query can produce different results between runs. It’s a black box. End users don’t understand how the results are ranked, and even ML researchers consider the process a “gray box”: we know the models work, but we can’t fully explain how or why.
When academics use search as part of a systematic review or meta-analysis, they need to be able to explain how they arrived at the papers they included. They need to know the search was comprehensive: that they haven’t missed any important papers. And others need to be able to reproduce their work.
A general-purpose language model can be prompted to enumerate the variations of each search term, and those variations can be fed into a regular full-text search. This method is transparent and understandable, because the intermediate step is directly legible to humans. The downside is that it’s slow, and the search results are generally still much worse than those of a vector search.
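A hedged sketch of that approach might look like the following, again assuming the OpenAI client and the hypothetical `papers` table from the earlier example. The prompt, the model name, and the tsquery construction are illustrative choices, not a prescribed recipe.

```python
from openai import OpenAI
import psycopg2

client = OpenAI()
conn = psycopg2.connect("dbname=papers_db")  # hypothetical database from the earlier sketch

def expand_term(term: str) -> list[str]:
    """Ask a general-purpose model for synonyms and common variants of a term."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"List synonyms and common variants of the search term "
                       f"'{term}', one per line, with no extra text.",
        }],
    )
    variants = [line.strip() for line in resp.choices[0].message.content.splitlines()]
    return [term] + [v for v in variants if v]

def transparent_search(term: str, limit: int = 20):
    """Full-text search over the expanded terms. The variant list itself can be
    logged and published, which is what makes the method auditable."""
    variants = expand_term(term)
    # OR the variants together, ANDing the words inside multi-word variants,
    # e.g. "(sleep & deprivation) | (insomnia)". Real input would need sanitizing.
    tsquery = " | ".join("(" + v.replace(" ", " & ") + ")" for v in variants)
    sql = """
        SELECT id, title
        FROM papers
        WHERE to_tsvector('english', title || ' ' || abstract)
              @@ to_tsquery('english', %s)
        LIMIT %s;
    """
    with conn.cursor() as cur:
        cur.execute(sql, (tsquery, limit))
        return variants, cur.fetchall()
```

Because the expanded term list is plain text, it can be reported alongside the results, which is exactly the kind of intermediate step a systematic review can cite and others can rerun.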
This article was originally published on The Elicit Blog.