Science and Research Content

MIT Researchers Introduce a Novel Lightweight Multi-Scale Attention for On-Device Semantic Segmentation

The goal of semantic segmentation, a fundamental problem in computer vision, is to assign a class label to every pixel in an input image. It is useful in many real-world settings, including autonomous driving, medical image processing, and computational photography. As a dense prediction task, semantic segmentation requires high-resolution inputs and a strong capability for extracting contextual information, so model architectures that work well for image classification cannot simply be transferred to it.

Machine learning models face a formidable challenge when asked to classify each of the millions of pixels in a high-resolution image. Recently, a new kind of model called the vision transformer has proven highly effective at this task. Transformers were originally designed for natural language processing: they tokenize the words in a sentence and compute an attention map that captures how those words relate to one another.

The attention map enhances the model’s ability to comprehend context. A vision transformer applies the same idea: it slices an image into patches of pixels and encodes each small patch as a token. To generate the attention map, the model uses a similarity function that learns the direct interaction between every pair of tokens. This gives the model a “global receptive field,” allowing it to perceive all the important details in the image. But because a high-resolution image may contain millions of pixels divided into thousands of patches, the attention map quickly grows very large, and the computation required to process an image climbs quadratically as resolution increases.
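To make that scaling concrete, here is a minimal sketch (ours, in PyTorch, not code from the paper) of standard softmax attention over n patch tokens. The explicit n × n attention map is what drives the quadratic cost.

```python
import torch

def softmax_attention(q, k, v):
    """Standard softmax self-attention.

    q, k, v: tensors of shape (n, d), where n is the number of
    patch tokens and d is the embedding dimension. The explicit
    n x n attention map makes memory and compute scale
    quadratically in n.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (n, n) attention map
    attn = scores.softmax(dim=-1)                # each row sums to 1
    return attn @ v                              # (n, d)

# Doubling image resolution quadruples the token count n,
# so the (n, n) map and its compute grow roughly 16x.
n, d = 1024, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
print(softmax_attention(q, k, v).shape)  # torch.Size([1024, 64])
```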

In their new model family, dubbed EfficientViT, the MIT team simplified the construction of the attention map by replacing the nonlinear similarity function with a linear one. This substitution allows the order of operations to be rearranged so that far fewer calculations are required, without sacrificing functionality or the global receptive field. With their approach, the processing time needed to make a prediction scales linearly with the number of pixels in the input image.
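The reordering rests on the associativity of matrix multiplication. The sketch below (our illustration, not the authors’ code) uses a ReLU-based linear similarity to show the idea: because the map is now linear in the queries and keys, K^T V (a small d × d matrix) can be computed first, replacing the n × n attention map and making the cost linear in the token count n.

```python
import torch

def relu_linear_attention(q, k, v, eps=1e-6):
    """ReLU-kernel linear attention (a sketch of the reordering idea).

    With the similarity ReLU(q) . ReLU(k), associativity gives

        (ReLU(Q) ReLU(K)^T) V  ==  ReLU(Q) (ReLU(K)^T V),

    so we compute the (d, d) matrix K^T V first instead of the
    (n, n) attention map. Cost drops from O(n^2 d) to O(n d^2),
    i.e., linear in the number of tokens n.
    """
    q, k = torch.relu(q), torch.relu(k)
    kv = k.transpose(-2, -1) @ v  # (d, d), independent of n
    # Row-wise normalizer, playing the role of the softmax denominator.
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # (n, 1)
    return (q @ kv) / (z + eps)

n, d = 1024, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
print(relu_linear_attention(q, k, v).shape)  # torch.Size([1024, 64])
```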

The new EfficientViT models perform semantic segmentation locally, on the device. EfficientViT is built around a novel lightweight multi-scale attention module that provides a hardware-efficient global receptive field together with multi-scale learning. The module was designed to deliver these two essential capabilities while minimizing the need for inefficient hardware operations.
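As a rough illustration only: the simplified module below assumes that “multi-scale” means aggregating nearby tokens with small depthwise convolutions at several kernel sizes before applying linear attention, so everything reduces to hardware-friendly convolutions and matrix multiplications. It is a hypothetical sketch, not the authors’ exact module.

```python
import torch
import torch.nn as nn

class LiteMultiScaleAttention(nn.Module):
    """Hypothetical simplification of a lightweight multi-scale
    attention block (not the authors' exact design). Small depthwise
    convolutions aggregate nearby tokens at two extra scales, then
    ReLU linear attention gives each scale a global receptive field.
    """

    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Conv2d(dim, dim * 3, 1, bias=False)
        # Identity keeps the original tokens; 3x3 and 5x5 depthwise
        # convolutions supply two coarser token scales cheaply.
        self.scales = nn.ModuleList([
            nn.Identity(),
            nn.Conv2d(dim * 3, dim * 3, 3, padding=1, groups=dim * 3),
            nn.Conv2d(dim * 3, dim * 3, 5, padding=2, groups=dim * 3),
        ])
        self.proj = nn.Conv2d(dim * len(self.scales), dim, 1)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        qkv = self.to_qkv(x)
        outs = []
        for scale in self.scales:
            q, k, v = scale(qkv).chunk(3, dim=1)
            # Flatten the spatial grid into a token axis: (B, N, C).
            q = torch.relu(q.flatten(2).transpose(1, 2))
            k = torch.relu(k.flatten(2).transpose(1, 2))
            v = v.flatten(2).transpose(1, 2)
            kv = k.transpose(1, 2) @ v         # (B, C, C)
            z = q @ k.sum(dim=1, keepdim=True).transpose(1, 2) + 1e-6
            out = (q @ kv) / z                 # linear attention per scale
            outs.append(out.transpose(1, 2).reshape(b, c, h, w))
        return self.proj(torch.cat(outs, dim=1))

x = torch.randn(1, 64, 32, 32)
print(LiteMultiScaleAttention(64)(x).shape)    # torch.Size([1, 64, 32, 32])
```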

Click here to read the original article published by Marktechpost Media.
