Science and Research Content

Microsoft Introduces Multi-Modal Topic Inferencing in Video Indexer

Categorizing content by topic is an intuitive, deductive exercise for text, but it works poorly for video: topics are rarely stated explicitly and are therefore hard to deduce from the footage alone. To overcome this challenge, Microsoft has introduced multi-modal topic inferencing in Video Indexer, an application that builds on media AI technologies to simplify the extraction of insights from videos.

Multi-modality is a key ingredient for recognizing high-level concepts in videos. The new capability, multi-modal topic inferencing, therefore lets Video Indexer index media content using a cross-channel model that automatically infers topics. The model does so by projecting the video's concepts onto three ontologies: IPTC, Wikipedia, and the Video Indexer hierarchical topic ontology. Let us take an in-depth look at how Video Indexer leverages the three ontologies, starting with a rough sketch of the cross-channel idea itself.
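The pipeline itself is proprietary, but the sketch below (plain Python, with entirely hypothetical names and mock data) illustrates the cross-channel idea: per-modality signals are fused into one set of video concepts, which is then projected onto a keyword-indexed ontology.

```python
from dataclasses import dataclass


@dataclass
class VideoSignals:
    """Per-modality signals extracted from a video (all names hypothetical)."""
    transcript_terms: set   # spoken words from the transcript
    ocr_terms: set          # visual text picked up by OCR
    celebrity_ids: set      # faces recognized in the footage


def fuse(signals: VideoSignals) -> set:
    """Fuse the three channels into a single set of video concepts."""
    return signals.transcript_terms | signals.ocr_terms | signals.celebrity_ids


def project(concepts: set, ontology: dict) -> list:
    """Return ontology topics whose keyword sets overlap the video concepts."""
    return [topic for topic, keywords in ontology.items() if concepts & keywords]


# Usage with mock data: project the fused concepts onto one ontology.
signals = VideoSignals({"serve", "wimbledon"}, {"tennis"}, {"Roger_Federer"})
mock_ontology = {"sport/tennis": {"tennis", "wimbledon"}, "economy": {"stocks"}}
print(project(fuse(signals), mock_ontology))  # ['sport/tennis']
```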

Video Indexer applies two models to extract topics. The first is a deep neural network, trained on a large proprietary data set, that scores and ranks candidate topics directly from raw text; for instance, the video's transcript is mapped onto the Video Indexer ontology and the IPTC ontology. The Video Indexer ontology is a proprietary hierarchical ontology with over 20,000 entries and a maximum depth of three layers. The hierarchically structured IPTC ontology is popular among media companies; it supports inferred topics (implicit concepts) by using knowledge graphs to detect clusters of related concepts. The model draws on three signals: the transcription (spoken words), optical character recognition (OCR) content (visual text), and celebrities recognized by the Video Indexer facial recognition model. Together, these signals capture the video's concepts from different angles, much as a human viewer would.
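The deep ranker is not public, so as a minimal stand-in the sketch below shows only the input/output shape under stated assumptions: the three text signals are fused into one token multiset and topics are ranked by simple keyword overlap. The ontology entries here are mock data, not real IPTC or Video Indexer entries.

```python
from collections import Counter


def score_topics(transcript: str, ocr_text: str, celebrities: list,
                 ontology: dict, top_k: int = 5) -> list:
    """Rank ontology topics against the fused text signals of one video."""
    # Fuse the three signals into one token multiset, mirroring how the
    # model combines transcript, OCR, and face-recognition outputs.
    tokens = Counter((transcript + " " + ocr_text).lower().split())
    for name in celebrities:
        tokens[name.lower()] += 1

    # Score each topic by how often its keywords appear in the signals.
    scores = {topic: sum(tokens[k.lower()] for k in keywords)
              for topic, keywords in ontology.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(topic, score) for topic, score in ranked[:top_k] if score > 0]


# Mock ontology entries (hypothetical; real entries would come from IPTC or
# the proprietary Video Indexer ontology).
ontology = {"sport/tennis": ["tennis", "wimbledon", "serve"],
            "economy/markets": ["stocks", "nasdaq", "earnings"]}
print(score_topics("a strong serve at Wimbledon", "TENNIS LIVE",
                   ["Roger Federer"], ontology))  # [('sport/tennis', 3.0... )]
```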

The second model applies spectral graph algorithms to the named entities mentioned in the video, drawing on the Wikipedia ontology and its 1.7 million well-edited categories. The input signals are largely the IDs of the celebrities recognized in the video, and the Entity Linking Intelligence Service (ELIS) is used to extract the entities mentioned in the text. This structured data is then used to derive the topics.
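ELIS is a Microsoft Cognitive Services API and its client code is not reproduced here; the hypothetical dictionary-based linker below only illustrates the shape of the data this second model consumes: text mentions resolved to Wikipedia identifiers, merged with the celebrity IDs.

```python
# Hypothetical surface-form -> Wikipedia page table; a real linker such as
# ELIS resolves mentions with far more context than substring matching.
LINK_TABLE = {
    "seattle": "Seattle",
    "microsoft": "Microsoft",
    "satya nadella": "Satya_Nadella",
}


def link_entities(text: str) -> list:
    """Resolve known surface forms in the text to Wikipedia identifiers."""
    lowered = text.lower()
    return [page for surface, page in LINK_TABLE.items() if surface in lowered]


# Merge linked entities with celebrity IDs from the face-recognition model;
# the combined set is the input to the spectral graph step.
celebrity_ids = {"Satya_Nadella"}
entities = link_entities("Microsoft is headquartered near Seattle.")
signals = set(entities) | celebrity_ids
print(signals)  # {'Seattle', 'Microsoft', 'Satya_Nadella'}
```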

Once the named entities are identified, a graph is built from the similarity of the entities' Wikipedia pages, and the entities are clustered to capture the distinct concepts within the video. Finally, two representative examples are selected per cluster, and the categories of their Wikipedia pages are ranked by their posterior probability of being a good topic. With the help of the Wikipedia knowledge graph, Video Indexer understands the relations within media files and provides an accurate and efficient solution.
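The following sketch illustrates this step under stated assumptions: a precomputed entity-similarity matrix stands in for the Wikipedia-derived similarities, an off-the-shelf spectral method does the clustering, and the categories are mock data. The final ranking by posterior probability is proprietary and is not reproduced; the sketch simply collects the examples' categories as candidate topics.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

entities = ["Roger_Federer", "Wimbledon", "Microsoft", "Satya_Nadella"]
# Mock pairwise similarities (in Video Indexer these would be derived from
# the entities' Wikipedia pages); symmetric, with 1.0 on the diagonal.
S = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.8],
              [0.1, 0.1, 0.8, 1.0]])

# Cluster the similarity graph so that each cluster captures one concept.
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(S)

# Hypothetical Wikipedia categories for each entity.
categories = {"Roger_Federer": ["Tennis_players"],
              "Wimbledon": ["Tennis_tournaments"],
              "Microsoft": ["Software_companies"],
              "Satya_Nadella": ["Business_executives"]}

for c in sorted(set(labels)):
    members = [i for i, lbl in enumerate(labels) if lbl == c]
    # Pick up to two examples: the members most similar to their own cluster.
    examples = sorted(members, key=lambda i: -S[i, members].sum())[:2]
    # The real system ranks these categories by posterior probability of
    # being a good topic; here we simply collect them per cluster.
    topics = [cat for i in examples for cat in categories[entities[i]]]
    print(c, [entities[i] for i in examples], topics)
```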
