With tens of billions of devices currently in operation, it is staggering to think how much data Internet of Things (IoT) hardware is collecting day in and day out. These systems carry out just about any task you can conceive of, from monitoring agricultural operations to tracking wildlife and managing smart city infrastructure. It is common for IoT sensors to be organized into very large, distributed networks with many thousands of nodes. All of that data has to be analyzed to be of any use, so in most cases it is transmitted to powerful cloud computing systems.
This arrangement works reasonably well, but it is far from ideal. Centralized processing comes with some downsides, like high hardware, energy, and communications costs. Remote processing also introduces latency into the system, which hinders the development of real-time applications. For reasons such as these, a much better solution would be to run the processing algorithms directly on the IoT hardware, right at the point where the data is collected (or at least very near to it, on edge hardware).
A high-level overview of the proposed system (📷: E. Mensah et al.)
Of course this is not as easy as flipping a switch. The algorithms are often very computationally expensive, which is why the work is being offloaded in the first place. Tiny microcontrollers and nearby low-power edge devices simply do not have the resources needed to handle these big jobs. Engineers at the University of Washington, however, have developed a new algorithm that they believe could help us make the shift toward processing sensor data at or near the point of collection. Their novel approach was designed to make deep learning (even multi-modal models) more efficient, reliable, and usable for high-resolution ecological monitoring and other edge-based applications.
The system’s architecture builds on the MobileViTV2 model, enhanced with Mixture of Experts (MoE) transformer blocks to optimize computational efficiency while maintaining high performance. The integration of MoE allows the model to selectively route different data patches to specialized computational “experts,” enabling sparse, conditional computation. To enhance adaptability, the routing mechanism uses clustering techniques, such as Agglomerative Hierarchical Clustering, to initialize expert selection based on patterns in the data. This clustering ensures that patches with similar features are processed efficiently while maintaining high accuracy.
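To make the routing idea more concrete, here is a minimal PyTorch-style sketch of a patch-level MoE layer whose router is initialized from Agglomerative Hierarchical Clustering centroids. The class names, hard top-1 routing, and expert sizes are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of cluster-initialized, patch-level MoE routing (assumed details).
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import AgglomerativeClustering


class PatchMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int, hidden: int):
        super().__init__()
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    @torch.no_grad()
    def init_router_from_clusters(self, sample_patches: torch.Tensor) -> None:
        # Cluster representative patch embeddings and copy the cluster centroids
        # into the router weights, so similar patches start out routed together.
        labels = torch.as_tensor(
            AgglomerativeClustering(n_clusters=self.router.out_features)
            .fit_predict(sample_patches.cpu().numpy())
        )
        for k in range(self.router.out_features):
            centroid = sample_patches[labels == k].mean(dim=0)
            self.router.weight[k].copy_(F.normalize(centroid, dim=0))

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (num_patches, dim). Each patch is sent only to its top-1 expert,
        # which is what makes the computation sparse and conditional.
        scores = self.router(patches)          # (num_patches, num_experts)
        expert_idx = scores.argmax(dim=-1)     # hard top-1 routing
        out = torch.zeros_like(patches)
        for k, expert in enumerate(self.experts):
            mask = expert_idx == k
            if mask.any():
                out[mask] = expert(patches[mask])
        return out


# Example usage with random stand-in data:
moe = PatchMoE(dim=256, num_experts=4, hidden=512)
moe.init_router_from_clusters(torch.randn(1024, 256))  # representative patch embeddings
output = moe(torch.randn(196, 256))                     # one image's worth of patches
```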
Training stability was another key consideration, as MoE routing can be challenging with smaller datasets or diverse inputs. The model addresses this through pre-training optimizations, such as initializing the router with centroids derived from representative data patches. These centroids are refined iteratively using an efficient algorithm that selects the most relevant features, ensuring computational feasibility and improved routing precision. The architecture also incorporates lightweight adjustments to the Multi-Layer Perceptron modules within the experts, including low-rank factorization and correction terms, to balance efficiency and accuracy.
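The article does not spell out the exact form of the low-rank factorization or correction terms, so the sketch below is only a plausible reading: an expert MLP whose first projection is factorized through a small rank, with a learnable per-feature offset standing in for the correction term.

```python
# Hedged sketch of a lightweight expert MLP; the rank and the form of the
# correction term are assumptions for illustration only.
import torch
import torch.nn as nn


class LowRankExpertMLP(nn.Module):
    def __init__(self, dim: int, hidden: int, rank: int = 16):
        super().__init__()
        # Factorize the dim -> hidden projection as dim -> rank -> hidden,
        # costing rank * (dim + hidden) parameters instead of dim * hidden.
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, hidden, bias=False)
        # Simple learnable per-feature correction term (an assumption) to help
        # recover accuracy lost to the low-rank approximation.
        self.correction = nn.Parameter(torch.zeros(hidden))
        self.act = nn.GELU()
        self.out = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.up(self.down(x)) + self.correction
        return self.out(self.act(h))
```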
Sample expert groupings from the final transformer layer (📷: E. Mensah et al.)
To evaluate the system, the team tested its ability to perform fine-grained bird species classification. The training process began by pre-training the MobileViTV2-0.5 model on the iNaturalist ’21 birds dataset. During this process, the final classification head was replaced with a randomly initialized 60-class output layer. That enabled the model to learn general features of bird species before being fine-tuned with the MoE setup for the specific task of species discrimination.
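A head swap of this kind is straightforward to reproduce with common tooling. The snippet below uses the timm library as an assumed stand-in (the article does not name the framework), loading a MobileViTV2-0.5 backbone and replacing its classifier with a fresh 60-class layer.

```python
import timm

# Load a MobileViTV2-0.5 backbone. The ImageNet weights here are only a stand-in;
# the article describes pre-training on the iNaturalist '21 birds dataset.
model = timm.create_model("mobilevitv2_050", pretrained=True)

# Swap the final classification head for a randomly initialized 60-class output
# layer, then fine-tune with the MoE blocks for species discrimination.
model.reset_classifier(num_classes=60)
```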
The evaluation demonstrated that the MoE-enhanced model maintained semantic class groupings during fine-tuning and achieved promising results despite a reduced parameter count. Expert routing, particularly at the final transformer layer, was shown to handle patches effectively, minimizing compute and memory requirements. However, performance scaling was limited by the small amount of training data, indicating the need for larger datasets or better strategies for handling sparse data. Experiments also revealed that increasing the batch size without scaling the data accordingly reduced generalization, while routing techniques and modifications that mitigate background effects could improve accuracy.
The evaluation highlighted the potential of this approach to deliver computational efficiency and adaptability in edge machine learning tasks. Accordingly, these algorithms could be deployed on resource-constrained devices like Raspberry Pis or even mobile platforms powered by solar energy in the future.