Quizlet’s community-curated catalog of study sets, 300 million and growing, covers a wide range of academic subjects. On one hand, such a large and varied catalog lets Quizlet users master any subject under the sun. On the other hand, it creates interesting information retrieval problems. Dustin Stansbury, a data scientist at Quizlet, describes how the support team at Quizlet organized the content into intuitive academic subjects using a hierarchical subject taxonomy. He also explains how that hierarchical structure was used to assign a subject label to every study set.
The first step was to define the organizational structure of the academic subjects that make up Quizlet’s study set catalog. After some research, the support team at Quizlet settled on organizing the academic subjects into a hierarchical taxonomy tree. This structure was chosen because it is intuitive and because it corresponds to graph data structures that support efficient search and information retrieval.
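As a rough illustration of why a tree maps well onto a graph data structure, the sketch below stores a small taxonomy as child-to-parent links and walks it in both directions. The subject names and the helper functions path_to_root and children_of are hypothetical and are not taken from Quizlet's actual taxonomy or code.

```python
# Hypothetical taxonomy: each subject maps to its parent (None marks the root).
PARENT = {
    "All Subjects": None,
    "Science": "All Subjects",
    "Math": "All Subjects",
    "Biology": "Science",
    "Chemistry": "Science",
    "Cell Biology": "Biology",
    "Algebra": "Math",
}


def path_to_root(subject):
    """Return the subject followed by each of its ancestors, ending at the root."""
    path = []
    while subject is not None:
        path.append(subject)
        subject = PARENT[subject]
    return path


def children_of(subject):
    """Return the direct children of a subject, sorted by name."""
    return sorted(child for child, parent in PARENT.items() if parent == subject)


print(path_to_root("Cell Biology"))
print(children_of("Science"))
```

Storing only parent links keeps the lookup from any subject to its ancestors proportional to the depth of the tree. A production system would likely also index children directly rather than scanning the whole mapping.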
To define the specific subjects that make up Quizlet’s taxonomy graph, definitions were compiled from various online resources such as Wikipedia. In addition, the content on Quizlet was mined to identify topics that occur often on the platform but may not be present in the standard definitions.
The next step was to label every Quizlet study set with the most relevant subject, given the content of the set. With the subject taxonomy in place, the support team used machine learning to build an automatic labeling system. This approach requires training a computer algorithm to learn patterns like “study sets that contain the word mitochondria are often assigned the subject label Science”. The support team at Quizlet used both unstructured and structured prediction to train the models.
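To make the idea of learning word-to-subject patterns concrete, here is a minimal sketch of a naive Bayes text classifier written with the standard library only. The article summary does not say which algorithm Quizlet used, so the choice of naive Bayes, the tiny training set, and the function names train and predict are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled study sets: (text, subject label).
TRAINING = [
    ("mitochondria cell membrane nucleus", "Science"),
    ("photosynthesis chlorophyll cell", "Science"),
    ("equation variable polynomial", "Math"),
    ("fraction equation integer", "Math"),
]


def train(examples):
    """Count words per label and examples per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts


def predict(text, word_counts, label_counts):
    """Return the label with the highest add-one-smoothed log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(n / total)  # log prior for this label
        for word in text.split():
            score += math.log((counts[word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(predict("mitochondria membrane", *model))
```

With this toy data, a set containing "mitochondria" scores higher under Science simply because that word was only ever seen with Science-labeled examples, which is the kind of pattern the paragraph above describes.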
For unstructured prediction, a flat strategy was employed because it is intuitive and simple. Further, because the taxonomy is inherently hierarchical, information about any parent nodes could be obtained for free once a node was predicted. For the structured prediction approach, the hierarchical structure of the subject taxonomy was used to break the general classification task into multiple sub-problems, each dedicated to a local subsection of the taxonomy tree.
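The difference between the two strategies can be sketched as follows. The flat version makes one decision over all leaf subjects and then reads the parents off the tree, while the structured version makes one local decision per level on the way down. The taxonomy, the keyword lists that stand in for trained classifiers, and all function names here are hypothetical; a real system would use learned models at each decision point.

```python
# Hypothetical taxonomy (child -> parent) and keyword lists standing in for
# trained classifiers; a real system would learn these from labeled sets.
PARENT = {"Science": None, "Biology": "Science", "Chemistry": "Science",
          "Genetics": "Biology", "Cell Biology": "Biology"}
KEYWORDS = {
    "Biology": {"cell", "gene", "mitochondria", "dna"},
    "Chemistry": {"acid", "molecule", "bond"},
    "Genetics": {"gene", "dna", "allele"},
    "Cell Biology": {"cell", "mitochondria", "membrane"},
}


def children(node):
    """Return the direct children of a node in the taxonomy."""
    return [c for c, p in PARENT.items() if p == node]


def score(label, words):
    """Stand-in for a classifier score: number of keyword hits."""
    return len(KEYWORDS.get(label, set()) & words)


def ancestors(label):
    """Walk parent pointers, so a predicted node yields its parents for free."""
    out = []
    while PARENT[label] is not None:
        label = PARENT[label]
        out.append(label)
    return out


def predict_flat(text):
    """Flat strategy: a single decision over all leaf subjects."""
    words = set(text.lower().split())
    leaves = [n for n in PARENT if not children(n)]
    return max(leaves, key=lambda leaf: score(leaf, words))


def predict_top_down(text, root="Science"):
    """Structured strategy: one local decision per level of the tree."""
    words = set(text.lower().split())
    node, path = root, [root]
    while children(node):
        node = max(children(node), key=lambda c: score(c, words))
        path.append(node)
    return path


leaf = predict_flat("mitochondria cell membrane")
print(leaf, ancestors(leaf))
print(predict_top_down("mitochondria cell membrane"))
```

In the top-down version each local decision only has to separate a handful of sibling subjects, but an early mistake near the root cannot be corrected further down. The flat version avoids that at the cost of one much larger decision.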
Finally, the support team used the framework to train models that predict a study set’s label within the hierarchical subject taxonomy, and to quantify how well a given model would meet the application’s needs in the real world.
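The summary does not say which metrics were used to quantify model quality. As one possible illustration, the sketch below scores held-out predictions in two ways: exact-match accuracy, and a more lenient accuracy that only checks whether the predicted top-level subject is right. The taxonomy, the held-out pairs, and the function names are hypothetical.

```python
# Hypothetical taxonomy (child -> parent) and held-out (true, predicted) pairs.
PARENT = {"Science": None, "Math": None, "Biology": "Science",
          "Chemistry": "Science", "Algebra": "Math"}
HELD_OUT = [("Biology", "Biology"), ("Chemistry", "Biology"),
            ("Algebra", "Algebra"), ("Biology", "Algebra")]


def top_level(label):
    """Follow parent pointers up to the top-level subject."""
    while PARENT[label] is not None:
        label = PARENT[label]
    return label


def exact_accuracy(pairs):
    """Fraction of examples whose predicted label matches exactly."""
    return sum(t == p for t, p in pairs) / len(pairs)


def top_level_accuracy(pairs):
    """Fraction of examples whose predicted top-level subject is correct."""
    return sum(top_level(t) == top_level(p) for t, p in pairs) / len(pairs)


print(exact_accuracy(HELD_OUT))
print(top_level_accuracy(HELD_OUT))
```

Reporting both numbers separates near misses, such as Chemistry predicted as Biology, from errors that land in an entirely different branch of the tree.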