A number of very successful machine learning models have emerged in recent years, including large language models (LLMs), image classifiers, and reinforcement learning agents. But each of these algorithms is only useful for a limited range of problems. That is hardly what we want as we push toward the ultimate goal of artificial general intelligence. Much like our own brains, these algorithms will need to handle any type of task we throw at them before that goal can be achieved.
Only time will tell what such a solution will look like, but it will probably be fundamentally different from the algorithms we use today. In the meantime, researchers and developers are increasingly building multimodal models, like LLMs that can also interpret visual information, to create more comprehensive and capable artificial intelligence systems.
An overview of the system’s architecture (📷: P. Vasu et al.)
But simply splicing modalities together is not going to improve the technology enough to meet our needs. Take vision language models (VLMs), for instance. To be useful for more practical applications — especially where fine details like text need to be understood — the algorithms must process higher-resolution images. But that increases the computational resources required, which in turn increases both latency and operational costs.
Apple researchers have just announced the release of a new algorithm called FastVLM, which attempts to achieve an optimized trade-off between latency, model size, and accuracy. The result is a VLM that can process high-resolution images, yet is capable of running with minimal computational resources. FastVLM can even run at high speeds on mobile devices like smartphones.
In particular, FastVLM tackles the inefficient processing of high-resolution images by popular vision encoders like Vision Transformers (ViTs). ViTs break an image into many small patch tokens and then apply stacked self-attention layers, and because the cost of self-attention grows quadratically with the number of tokens, processing quickly becomes expensive at larger resolutions. This bottleneck makes it difficult to deploy VLMs for real-world, latency-sensitive applications.
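To see why resolution is the pain point, here is a minimal back-of-the-envelope sketch (not taken from the paper) that counts ViT patch tokens at a few resolutions and the resulting number of pairwise attention scores per layer. The 14-pixel patch size is an assumption, typical of CLIP-style ViT-L/14 encoders.

```python
# Rough illustration only: token count grows with the square of the image
# side, and self-attention cost grows with the square of the token count.

def vit_token_count(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT produces for a square image."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side

for size in (336, 768, 1152):
    tokens = vit_token_count(size)
    # Self-attention compares every token with every other token,
    # so its per-layer cost scales roughly with tokens**2.
    print(f"{size}x{size}px -> {tokens:5d} tokens, "
          f"~{tokens**2:,} pairwise attention scores per layer")
```

Going from 336×336 to 1152×1152 pixels multiplies the token count by more than ten, and the attention work by over a hundred, which is roughly the cliff that high-resolution VLMs fall off.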
FastVLM reduces latency (📷: P. Vasu et al.)
To overcome this, the team introduced a new hybrid vision encoder called FastViTHD. This encoder combines convolutional and transformer-based approaches to drastically reduce the number of visual tokens generated, while also slashing the encoding time. Unlike other techniques that rely on token pruning or image tiling, FastVLM achieves this efficiency by smartly scaling the input image resolution and adapting its processing pipeline accordingly.
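The general idea behind a hybrid encoder can be sketched in a few lines of PyTorch: convolutional stages shrink the image aggressively before any self-attention runs, so the transformer only ever sees a small grid of visual tokens. This is an illustration of the concept under assumed layer sizes, not the actual FastViTHD architecture.

```python
import torch
import torch.nn as nn

class HybridVisionEncoder(nn.Module):
    """Toy hybrid encoder: conv downsampling stem + small transformer."""
    def __init__(self, dim: int = 512, num_layers: int = 4):
        super().__init__()
        # Convolutional stem: each stride-2 stage halves the spatial size,
        # so five stages turn a 1024x1024 image into a 32x32 feature map.
        stages, in_ch = [], 3
        for out_ch in (32, 64, 128, 256, dim):
            stages += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                       nn.GELU()]
            in_ch = out_ch
        self.stem = nn.Sequential(*stages)
        # Self-attention now runs on 32*32 = 1024 tokens instead of the
        # several thousand patches a 14- or 16-pixel tokenizer would emit
        # at this resolution.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.stem(images)                  # (B, dim, H/32, W/32)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, num_tokens, dim)
        return self.transformer(tokens)            # visual tokens for the LLM

encoder = HybridVisionEncoder()
out = encoder(torch.randn(1, 3, 1024, 1024))
print(out.shape)  # torch.Size([1, 1024, 512])
```

Because the convolutions do the heavy lifting of shrinking the image, raising the input resolution adds relatively cheap convolutional work rather than quadratically more attention, which is the trade-off FastVLM exploits.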
Performance benchmarks show impressive results. FastVLM achieves a 3.2x improvement in time-to-first-token compared to previous models in similar setups. When compared specifically to models like LLaVA-OneVision operating at high resolutions (e.g., 1152×1152), FastVLM matches their accuracy on critical benchmarks such as SeedBench and MMMU while being 85 times faster and using a vision encoder that is 3.4 times smaller.
In an era where deploying AI models on mobile and edge devices is increasingly important, FastVLM offers a compelling look at what is possible when efficiency and accuracy are designed into the algorithm from the ground up. It signals a promising direction for the future of multimodal AI — one where smarter architectures enable broader capabilities without compromising on performance or accessibility.