A number of very successful machine learning models have emerged in recent years, including large language models (LLMs), image classifiers, and reinforcement learning agents. But each of these algorithms is only useful for a limited range of problems. That is hardly what we want as we push toward the ultimate goal of artificial general intelligence. Much like our own brains, these algorithms will need to handle any type of task we throw at them before that goal can be achieved.
Only time will tell what such a solution will look like, but it will probably be fundamentally different from the algorithms we use today. In the meantime, researchers and developers are increasingly building multimodal models, like LLMs that can also interpret visual information, to create more comprehensive and capable artificial intelligence systems.
An overview of the system’s architecture (📷: P. Vasu et al.)
But simply splicing modalities together is not going to improve the technology enough to meet our needs. Take vision language models (VLMs), for instance. To be useful for more practical applications — especially where fine details like text need to be understood — the algorithms must process higher-resolution images. But that increases the computational resources required, which in turn increases both latency and operational costs.
Apple researchers have just announced the release of a new algorithm called FastVLM, which attempts to achieve an optimized trade-off between latency, model size, and accuracy. The result is a VLM that can process high-resolution images, yet is capable of running with minimal computational resources. FastVLM can even run at high speeds on mobile devices like smartphones.
In particular, FastVLM tackles the inefficient processing of high-resolution images by popular vision encoders like Vision Transformers (ViTs). ViTs break an image into many small patch tokens and then apply stacked self-attention layers, and because the cost of self-attention grows quadratically with the number of tokens, processing quickly becomes expensive at larger resolutions. This bottleneck makes it difficult to deploy VLMs for real-world, latency-sensitive applications.
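To see why resolution is the pain point, here is a minimal back-of-the-envelope sketch (not taken from the paper) that counts ViT patch tokens at a few resolutions and the resulting number of pairwise attention scores per layer. The 14-pixel patch size is an assumption, typical of CLIP-style ViT-L/14 encoders.

```python
# Rough illustration only: token count grows with the square of the image
# side, and self-attention cost grows with the square of the token count.

def vit_token_count(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens a ViT produces for a square image."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side

for size in (336, 768, 1152):
    tokens = vit_token_count(size)
    # Self-attention compares every token with every other token,
    # so its per-layer cost scales roughly with tokens**2.
    print(f"{size}x{size}px -> {tokens:5d} tokens, "
          f"~{tokens**2:,} pairwise attention scores per layer")
```

Going from 336×336 to 1152×1152 pixels multiplies the token count by more than ten, and the attention work by over a hundred, which is roughly the cliff that high-resolution VLMs fall off.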
FastVLM reduces latency (📷: P. Vasu et al.)
To overcome this, the team introduced a new hybrid vision encoder called FastViTHD. This encoder combines convolutional and transformer-based approaches to drastically reduce the number of visual tokens generated, while also slashing the encoding time. Unlike other techniques that rely on token pruning or image tiling, FastVLM achieves this efficiency by smartly scaling the input image resolution and adapting its processing pipeline accordingly.
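The general idea behind a hybrid encoder can be sketched in a few lines of PyTorch: convolutional stages shrink the image aggressively before any self-attention runs, so the transformer only ever sees a small grid of visual tokens. This is an illustration of the concept under assumed layer sizes, not the actual FastViTHD architecture.

```python
import torch
import torch.nn as nn

class HybridVisionEncoder(nn.Module):
    """Toy hybrid encoder: conv downsampling stem + small transformer."""
    def __init__(self, dim: int = 512, num_layers: int = 4):
        super().__init__()
        # Convolutional stem: each stride-2 stage halves the spatial size,
        # so five stages turn a 1024x1024 image into a 32x32 feature map.
        stages, in_ch = [], 3
        for out_ch in (32, 64, 128, 256, dim):
            stages += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                       nn.GELU()]
            in_ch = out_ch
        self.stem = nn.Sequential(*stages)
        # Self-attention now runs on 32*32 = 1024 tokens instead of the
        # several thousand patches a 14- or 16-pixel tokenizer would emit
        # at this resolution.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.stem(images)                  # (B, dim, H/32, W/32)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, num_tokens, dim)
        return self.transformer(tokens)            # visual tokens for the LLM

encoder = HybridVisionEncoder()
out = encoder(torch.randn(1, 3, 1024, 1024))
print(out.shape)  # torch.Size([1, 1024, 512])
```

Because the convolutions do the heavy lifting of shrinking the image, raising the input resolution adds relatively cheap convolutional work rather than quadratically more attention, which is the trade-off FastVLM exploits.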
Performance benchmarks show impressive results. FastVLM achieves a 3.2x improvement in time-to-first-token compared to previous models in similar setups. When compared specifically to models like LLaVA-OneVision operating at high resolutions (e.g., 1152×1152), FastVLM matches their accuracy on critical benchmarks such as SeedBench and MMMU while being 85 times faster and using a vision encoder that is 3.4 times smaller.
In an era where deploying AI models on mobile and edge devices is increasingly important, FastVLM offers a compelling look at what is possible when efficiency and accuracy are designed into the algorithm from the ground up. It signals a promising direction for the future of multimodal AI — one where smarter architectures enable broader capabilities without compromising on performance or accessibility.