Panasonic HD develops “SparseVLM” technology that doubles the processing speed of Vision-Language Models


Panasonic R&D Company of America (PRDCA) and Panasonic Holdings Corporation (Panasonic HD), in collaboration with researchers from Peking University, Fudan University, the University of California, Berkeley, and Shanghai Jiao Tong University, have developed “SparseVLM,” a technology that speeds up Vision-Language Models (VLMs): AI models that can understand and process both visual data, such as images and videos, and text data.

In recent years, VLMs have seen rapid development. These models process visual and textual information simultaneously and can answer questions about visual content. However, handling large amounts of data, especially high-resolution images and long videos, leads to longer inference times and higher computational cost for the AI model.

This research has been accepted for presentation at the 42nd International Conference on Machine Learning (ICML 2025), one of the premier conferences for AI and machine learning research. The conference will take place in Vancouver, Canada, from July 13 to 19, 2025.


Panasonic HD and PRDCA are developing highly efficient generative AI in collaboration with the universities that led this research. VLMs, which process visual and textual information simultaneously, have attracted attention in recent years. They incorporate large language models (LLMs) to leverage the reasoning and recognition capabilities of LLMs. Because such VLMs integrate visual tokens extracted from images or videos with text tokens and feed them into the LLM, the amount of information the LLM must process grows, especially with high-resolution images and long videos. Visual tokens that are not needed to generate an answer must still be processed, leading to longer inference times and higher computational complexity.
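To see why token counts matter, the following sketch estimates how a VLM's input sequence grows with image resolution. The patch size, prompt length, and resolutions are illustrative assumptions in the style of common ViT-based VLMs, not figures from this research.

```python
# Minimal sketch: how a VLM's input sequence grows with image resolution.
# PATCH and TEXT_TOKENS are illustrative assumptions (ViT-style patching),
# not settings from SparseVLM or Panasonic's models.

PATCH = 14          # pixels per patch side (hypothetical)
TEXT_TOKENS = 32    # tokens in a typical question prompt (hypothetical)

def visual_token_count(height: int, width: int) -> int:
    """Each non-overlapping PATCH x PATCH region becomes one visual token."""
    return (height // PATCH) * (width // PATCH)

for res in (336, 672, 1344):
    v = visual_token_count(res, res)
    print(f"{res}x{res} image -> {v} visual tokens, "
          f"{v + TEXT_TOKENS} total tokens fed to the LLM")

# Doubling the resolution quadruples the visual tokens, and since
# self-attention cost grows roughly quadratically with sequence length,
# inference cost grows far faster than the image itself.
```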

Several approaches have been proposed to speed up VLMs by exploiting the redundancy of visual tokens. However, these existing methods typically select visual tokens based on the image alone and perform sparsification without considering their relevance to the input text prompt. Consequently, they remain inefficient because they still process visual tokens irrelevant to the prompt, leaving room for improvement. SparseVLM addresses this by taking the text prompt into account when selecting which visual tokens to keep.
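The idea behind prompt-aware sparsification can be illustrated with a short sketch that scores each visual token by the attention it receives from the text tokens and keeps only the most relevant ones. This is a minimal illustration of text-guided pruning under assumed shapes and a simple scoring rule; it is not SparseVLM's exact algorithm.

```python
import torch

def prune_visual_tokens(visual: torch.Tensor, text: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the visual tokens most attended to by the text prompt.

    visual: (num_visual, dim) visual token embeddings
    text:   (num_text, dim) text token embeddings
    A hypothetical helper illustrating prompt-conditioned pruning.
    """
    # Cross-attention weights from each text token to each visual token.
    attn = torch.softmax(text @ visual.T / visual.shape[-1] ** 0.5, dim=-1)
    # A visual token's relevance = total attention it receives from the text.
    relevance = attn.sum(dim=0)                        # (num_visual,)
    k = max(1, int(keep_ratio * visual.shape[0]))
    keep = relevance.topk(k).indices.sort().values     # keep original order
    return visual[keep]

# Example: prune 576 visual tokens to 288 using a 32-token prompt.
visual = torch.randn(576, 1024)
text = torch.randn(32, 1024)
print(prune_visual_tokens(visual, text).shape)  # torch.Size([288, 1024])
```

In a scheme like this, tokens irrelevant to the question are dropped before the LLM processes them, which is the kind of saving the article describes.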

Source: Panasonic