Keyword-based approaches employed by traditional search engines lack the retrieval precision needed to identify relevant evidence in the corpus of COVID-19 scientific literature. To overcome this challenge, Google has launched the COVID-19 Research Explorer, an interface on top of the COVID-19 Open Research Dataset (CORD-19).
The COVID-19 Research Explorer leverages semantic search. Semantic search models aim to learn relationships implicitly and to model word order and semantics at their core. The COVID-19 Research Explorer builds on this capability to capture not only term overlap between a query and a document, but also whether the meaning of a phrase is relevant to the intent behind the user's query.
The semantic search technology that powers the COVID-19 Research Explorer is Bidirectional Encoder Representations from Transformers (BERT), which was recently deployed to improve the retrieval precision of Google Search. However, the language of biomedical literature differs markedly from the kinds of queries submitted to Google.com. Therefore, to train the BERT models, a large synthetic corpus of questions and relevant documents in the biomedical domain was built.
Specifically, a large number of general-domain question-answer pairs were used to train an encoder-decoder model. This neural architecture was trained to translate from an answer passage to a question (or query) about that passage. Next, queries were generated for passages from every document in CORD-19. Finally, these synthetic query-passage pairs were used as supervision to train the neural retrieval model.
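The data-generation step above can be sketched roughly as follows. This is a minimal illustration, not Google's pipeline: `split_into_passages`, `build_synthetic_pairs`, and `toy_question_generator` are hypothetical names, the fixed-size word window used to cut documents into passages is an assumption, and the toy generator merely stands in for the trained encoder-decoder model.

```python
from typing import Callable, Iterable, List, Tuple


def split_into_passages(document: str, max_words: int = 50) -> List[str]:
    """Cut a document into consecutive passages of at most max_words words.

    Assumption: the source does not say how passages are delimited;
    a fixed word window is used here purely for illustration.
    """
    words = document.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def build_synthetic_pairs(
    documents: Iterable[str],
    generate_question: Callable[[str], str],
    max_words: int = 50,
) -> List[Tuple[str, str]]:
    """Return (synthetic query, passage) pairs for every passage in the corpus."""
    pairs = []
    for document in documents:
        for passage in split_into_passages(document, max_words):
            pairs.append((generate_question(passage), passage))
    return pairs


def toy_question_generator(passage: str) -> str:
    """Hypothetical placeholder for the trained passage-to-question model."""
    first_words = " ".join(passage.split()[:5])
    return f"what is known about {first_words}?"


if __name__ == "__main__":
    corpus = ["coronavirus spike protein binds the ace2 receptor on host cells"]
    for query, passage in build_synthetic_pairs(corpus, toy_question_generator, max_words=5):
        print(query, "->", passage)
```

In a real setting, the resulting pairs would serve as positive training examples for the retrieval model, with `generate_question` replaced by the actual question-generation model.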
However, neural retrieval models learn generalizations; they sometimes overgeneralize concepts and meaning and then match on those generalizations. Therefore, a hybrid term-neural retrieval model was built. It encodes both the query and the documents as vectors and treats retrieval as a search for the document vectors most similar to the query vector. To make this work at scale, the term-based and neural vectors were combined using a trade-off parameter.
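One plausible reading of this hybrid scoring is sketched below, under stated assumptions: the term side is represented by sparse tf-idf vectors, the neural side by dense vectors, and the two dot products are summed with the term score scaled by the trade-off parameter. The function names, the smoothed idf formula, and the hand-written two-dimensional "embeddings" are all hypothetical; a real system would obtain the dense vectors from the trained BERT encoder.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, num_docs):
    """Sparse tf-idf weights for one tokenized text, as {term: weight}."""
    counts = Counter(tokens)
    return {
        # Smoothed idf (an assumed choice) so unseen terms do not divide by zero.
        term: tf * (math.log((1 + num_docs) / (1 + doc_freq.get(term, 0))) + 1)
        for term, tf in counts.items()
    }


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(weight * b[term] for term, weight in a.items() if term in b)


def dense_dot(a, b):
    """Dot product of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank(query_tokens, query_dense, docs, doc_dense, trade_off):
    """Rank document ids by neural score plus trade_off times term score."""
    num_docs = len(docs)
    doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))
    q_sparse = tfidf_vector(query_tokens, doc_freq, num_docs)
    scores = {}
    for doc_id, tokens in docs.items():
        d_sparse = tfidf_vector(tokens, doc_freq, num_docs)
        scores[doc_id] = (
            dense_dot(query_dense, doc_dense[doc_id])
            + trade_off * sparse_dot(q_sparse, d_sparse)
        )
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: the dense vectors are made up to mimic a neural model that
# overgeneralizes and prefers the wrong document for this query.
DOCS = {
    "d1": "ace2 receptor binding of the spike protein".split(),
    "d2": "hospital staffing during the pandemic".split(),
}
DOC_DENSE = {"d1": [0.9, 0.1], "d2": [0.2, 0.8]}
QUERY_TOKENS = ["ace2", "binding"]
QUERY_DENSE = [0.4, 0.6]

if __name__ == "__main__":
    print(rank(QUERY_TOKENS, QUERY_DENSE, DOCS, DOC_DENSE, trade_off=0.0))
    print(rank(QUERY_TOKENS, QUERY_DENSE, DOCS, DOC_DENSE, trade_off=0.5))
```

With the trade-off set to zero, only the dense similarity counts and the toy neural vectors rank the unrelated document first; a positive trade-off lets the exact term matches pull the relevant document back to the top. Summing the two dot products is equivalent to concatenating each dense vector with its sparse vector, provided the trade-off weight is applied to the sparse part on one side only (for example, the query), which is why a single nearest-neighbor search can serve both signals.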
The COVID-19 Research Explorer is freely available to the research community as an open alpha, with several usability enhancements planned for the near future.
Click here to read the original article published by the Google AI Blog.