After several years in the making, the new Raspberry Pi 5 is out. It is at least twice as fast as its predecessor, making this single-board computer suitable for some serious on-device ML workloads, and it costs $80 in its 8GB RAM configuration. We benchmarked the training and finetuning of 50 popular vision models to directly assess the readiness of the Raspberry Pi 5 for Federated Learning workloads. In Federated Learning, participating devices that observe the world through cameras and other sensors can use that data to train and improve their ML model. This training (or finetuning) must take place on-device, motivating the use of versatile platforms such as Raspberry Pi boards. Is the Raspberry Pi 5 ready for Federated Vision?
Robot vacuum cleaners, we all love them. Sort of. The truth is that they aren't very smart: they bump into furniture, they get stuck at the boundaries of different surfaces, and they fail at low-level vision tasks. To make matters worse, they make the same errors again and again, since they often do not learn from their experiences. The best way to ship a world-class vision system for robot vacuum cleaners is to train it through Federated Learning: one model collaboratively trained by thousands of vacuum cleaners, each using its own data. Try it with Flower.
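To make the idea concrete, below is a minimal sketch of what an on-device Flower client could look like, using Flower's `NumPyClient` interface. The model, the dummy data, and the training loop are placeholders for whatever vision pipeline runs on the device, and exact API details may differ across Flower versions.

```python
from collections import OrderedDict

import flwr as fl
import timm
import torch

# Placeholder model: swap in the vision architecture you actually want to train.
model = timm.create_model("resnet50", pretrained=True, num_classes=10)

# Dummy on-device dataset (replace with real camera/sensor data).
images = torch.randn(64, 3, 224, 224)
labels = torch.randint(0, 10, (64,))
trainloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(images, labels), batch_size=16
)

def train(net, loader, epochs=1):
    """Plain local SGD training loop."""
    opt = torch.optim.SGD(net.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    net.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(net(x), y).backward()
            opt.step()

class FlowerClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        # Send the current model weights to the server as NumPy arrays.
        return [v.cpu().numpy() for v in model.state_dict().values()]

    def set_parameters(self, parameters):
        # Load the weights received from the server into the local model.
        state_dict = OrderedDict(
            {k: torch.tensor(v) for k, v in zip(model.state_dict().keys(), parameters)}
        )
        model.load_state_dict(state_dict, strict=True)

    def fit(self, parameters, config):
        self.set_parameters(parameters)
        train(model, trainloader, epochs=1)
        return self.get_parameters(config={}), len(trainloader.dataset), {}

    def evaluate(self, parameters, config):
        self.set_parameters(parameters)
        # Evaluation skipped in this sketch; report a dummy loss.
        return 0.0, len(trainloader.dataset), {}

# Connect this client to a running Flower server (address is just an example).
fl.client.start_numpy_client(server_address="192.168.1.2:8080", client=FlowerClient())
```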
We begin the benchmark (code to reproduce the results here: federated-embedded-vision) by selecting several vision models from both the CNN and Transformer eras. The new Raspberry Pi 5 is consistently 2x faster than its predecessor. The results report the seconds/batch needed to train (i.e. no frozen parameters) each architecture on 224x224 images (the standard ImageNet size) with batch size 16. Lower is better. To add more context to the results shown below: FasterViT-1, a 53M-parameter model that achieves over 83% Top-1 accuracy on ImageNet, can be trained on 1k images in approximately 12 minutes on the Raspberry Pi 5. Taking into account the scale at which cross-device FL operates (easily involving hundreds or thousands of devices in parallel, Kairouz et al. 2019), this rate of training is reasonable.
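For reference, per-batch timings of this kind can be gathered with a loop like the one below. This is a simplified sketch, not the exact benchmark script from the repository: it creates a timm model, runs a few forward+backward passes on random 224x224 batches of size 16, and averages the wall-clock time. The model name is just an example.

```python
import time

import timm
import torch

def seconds_per_batch(model_name: str, batch_size: int = 16, n_batches: int = 10) -> float:
    """Average wall-clock time of a full forward+backward pass on random 224x224 images."""
    model = timm.create_model(model_name, pretrained=False, num_classes=10)
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()

    images = torch.randn(batch_size, 3, 224, 224)
    labels = torch.randint(0, 10, (batch_size,))

    # One warmup batch so lazy initialisation doesn't skew the timing.
    optimizer.zero_grad()
    criterion(model(images), labels).backward()
    optimizer.step()

    start = time.perf_counter()
    for _ in range(n_batches):
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    return (time.perf_counter() - start) / n_batches

print(f"resnet50: {seconds_per_batch('resnet50'):.2f} s/batch")
```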
Drones, or UAVs more generically, are often equipped with one or more cameras and other sensors. The use cases for ML-powered applications running on-device are many: following you while hiking, helping farmers monitor their crops, tracking animal behaviour in wildlife reserves, monitoring city traffic and dynamics, etc. Curating a large dataset for each use case is not feasible. The solution? Federated self-supervised learning on images/videos (see how Rehman et al., ECCV'22 did it), with each device using its own data. This on-device training can take place while the drone is docked/recharging. Afterwards, a variety of downstream tasks like those previously mentioned could be designed, for which curating a dataset becomes much more feasible.
We also benchmarked many more models on the new Raspberry Pi 5 than the dozen shown earlier, keeping the same experimental protocol. Below we present the results in a format similar to a popular plotting style for comparing vision architectures, showing ImageNet Top-1 accuracy versus seconds/batch (or SPB for short) on the x-axis instead. When constructing this chart we extracted accuracy and model metrics from either huggingface/timm or the papers where these architectures were proposed. Expectedly, most models in the bottom-left region of the scatter plot (i.e. < 77% Top-1 accuracy and < 15 SPB) correspond to architectures relying on convolutional layers. If you hover over the plot, you'll see all the statistics associated with each model. FasterViT_1 and FasterViT_2 seem to offer the best tradeoff between accuracy and training cost.
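If you want to build a similar accuracy-vs-SPB chart for your own candidate models, the sketch below shows the general recipe: pull parameter counts from timm and plot Top-1 accuracy against measured seconds/batch. The three entries and their numbers are illustrative placeholders only, not our benchmark results.

```python
import matplotlib.pyplot as plt
import timm

# (model_name, imagenet_top1, seconds_per_batch) -- placeholder values for illustration;
# real values come from your benchmark runs and from timm/paper-reported accuracies.
results = [
    ("resnet50", 80.0, 5.0),
    ("convnext_tiny", 82.0, 9.0),
    ("vit_base_patch16_224", 81.0, 20.0),
]

names, top1, spb, params = [], [], [], []
for name, acc, sec in results:
    model = timm.create_model(name, pretrained=False)
    n_params = sum(p.numel() for p in model.parameters()) / 1e6  # in millions
    names.append(name)
    top1.append(acc)
    spb.append(sec)
    params.append(n_params)

# Bubble size proportional to parameter count.
plt.scatter(spb, top1, s=[p * 5 for p in params])
for name, x, y in zip(names, spb, top1):
    plt.annotate(name, (x, y))
plt.xlabel("seconds/batch (Raspberry Pi 5)")
plt.ylabel("ImageNet Top-1 accuracy (%)")
plt.show()
```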
For a future where we coexist with autonomous robots, their presence should go unnoticed for the most part. This means autonomous navigation must become a solved problem. How are we going to achieve this level of autonomy under all circumstances (e.g. lighting conditions, environments)? How are we going to do so without centralizing terabytes of data containing sensitive information (e.g. faces, recordings of private spaces)? Solution: take the training to where the data is. Make use of all the footage robots already process during navigation to continue training the model. This challenge can be solved with Federated Learning.
Sometimes we might want to start FL from a pre-trained model (e.g. on ImageNet-22K) and freeze some of its parameters. The following results are obtained after benchmarking the same 50 models as above, under the exact same conditions, but freezing all parameters except those in the classification head. Because some models have more complex classifiers than others, on-device finetuning doesn't result in a perfectly even reduction along the seconds/batch axis. On-device finetuning on the Raspberry Pi 5 is between 1.5x and 3.4x faster than training the entire model. Naturally, it also brings other benefits in the form of lower memory peaks and power consumption. We leave that analysis for a follow-up post.
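Freezing everything except the classification head is straightforward with timm, since models expose their head via get_classifier(). A minimal sketch (the model name is just an example):

```python
import timm
import torch

model = timm.create_model("resnet50", pretrained=True)

# Freeze every parameter...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze only the classification head.
for param in model.get_classifier().parameters():
    param.requires_grad = True

# Pass only the trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)
print(f"Trainable params: {sum(p.numel() for p in trainable) / 1e6:.2f}M")
```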
We hope this blogpost inspired you to run your very own Federated Learning workloads for vision models. Start by getting a Raspberry Pi 5 and, while you wait for it to be delivered, check out the federated-embedded-vision repository and run FL with any of the models presented in the results above. There you'll also find the steps to reproduce all the results in this blogpost.