Leveraging Flower and Docker for Device Heterogeneity Management in Federated Learning#
Introduction#
In this example, we tackle device heterogeneity in federated learning, arising from differences in memory and CPU capabilities across devices. This diversity affects training efficiency and inclusivity. Our strategy includes simulating this heterogeneity by setting CPU and memory limits in a Docker setup, using a custom Docker compose generator script. This approach creates a varied training environment and enables us to develop strategies to manage these disparities effectively.
Handling Device Heterogeneity#
System Metrics Access:
Effective management of device heterogeneity begins with monitoring system metrics of each container. We integrate the following services to achieve this:
cAdvisor: Collects comprehensive metrics from each Docker container.
Prometheus: Configured via `prometheus.yml`, it scrapes data from cAdvisor at scheduled intervals and serves as a robust time-series database. Users can access the Prometheus UI at http://localhost:9090 to create and run PromQL queries, allowing for detailed insight into container performance (a small query example follows below).
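Beyond the UI, Prometheus also exposes an HTTP API, which can be handy for scripted checks. The sketch below is an illustration rather than part of this example's code; it assumes the `requests` package and uses a standard cAdvisor metric.

```python
# Hedged sketch: query Prometheus' HTTP API for a cAdvisor metric.
# Assumes Prometheus is reachable at localhost:9090 as described above.
import requests

# Per-container CPU usage rate (cores used) over the last minute
query = "rate(container_cpu_usage_seconds_total[1m])"

response = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": query},
    timeout=10,
)
response.raise_for_status()

for series in response.json()["data"]["result"]:
    container = series["metric"].get("name", "<unnamed>")
    timestamp, value = series["value"]  # [timestamp, value]
    print(f"{container}: {value} cores")
```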
Mitigating Heterogeneity:
In this basic use case, we address device heterogeneity by establishing rules tailored to each container’s system capabilities. This involves modifying training parameters, such as batch sizes and learning rates, based on each device’s memory capacity and CPU availability. These settings are specified in the `client_configs` array in the `helpers/generate_docker_compose.py` script. For example:

```python
client_configs = [
    {"mem_limit": "3g", "batch_size": 32, "cpus": 4, "learning_rate": 0.001},
    {"mem_limit": "6g", "batch_size": 256, "cpus": 1, "learning_rate": 0.05},
    {"mem_limit": "4g", "batch_size": 64, "cpus": 3, "learning_rate": 0.02},
    {"mem_limit": "5g", "batch_size": 128, "cpus": 2.5, "learning_rate": 0.09},
]
```
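To make the mapping concrete, here is a minimal sketch of how a generator could turn such an array into per-client Compose services. It assumes Compose v2-style `mem_limit`/`cpus` keys and hypothetical `--batch_size`/`--learning_rate` client flags; the actual `helpers/generate_docker_compose.py` may be structured differently.

```python
# Illustrative generator sketch, not the example's exact script.
import yaml  # requires PyYAML

client_configs = [
    {"mem_limit": "3g", "batch_size": 32, "cpus": 4, "learning_rate": 0.001},
    {"mem_limit": "6g", "batch_size": 256, "cpus": 1, "learning_rate": 0.05},
]

services = {"server": {"build": "."}}
for i, cfg in enumerate(client_configs, start=1):
    services[f"client{i}"] = {
        "build": ".",
        # Hypothetical flags; the real client.py arguments may differ
        "command": [
            "python", "client.py",
            "--batch_size", str(cfg["batch_size"]),
            "--learning_rate", str(cfg["learning_rate"]),
        ],
        # Compose v2-style resource caps that simulate a weaker device
        "mem_limit": cfg["mem_limit"],
        "cpus": cfg["cpus"],
        "depends_on": ["server"],
    }

with open("docker-compose.yml", "w") as f:
    yaml.safe_dump({"version": "2.4", "services": services}, f, sort_keys=False)
```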
Prerequisites#
Docker must be installed and the Docker daemon running on your server. If you don’t already have Docker installed, you can get installation instructions for your specific Linux distribution or macOS from Docker. Besides Docker, the only extra requirement is having Python installed. You don’t need to create a new environment for this example since all dependencies will be installed inside Docker containers automatically.
Running the Example#
Running this example is straightforward. For a step-by-step walkthrough with additional detail, refer to the next section.
```bash
# Generate docker compose file
python helpers/generate_docker_compose.py  # by default, configures 2 clients for 100 rounds

# Build docker images
docker-compose build

# Launch everything
docker-compose up
```
In your favourite browser, go to http://localhost:3000 to see the Grafana dashboard showing system-level and application-level metrics.
To stop all containers, open a new terminal and `cd` into this directory, then run `docker-compose down`. Alternatively, you can press `Ctrl+C` in the same terminal and then run `docker-compose down` to ensure everything is terminated.
Running the Example (detailed)#
Step 1: Configure Docker Compose#
Execute the following command to run the `helpers/generate_docker_compose.py` script. This script creates the docker-compose configuration needed to set up the environment.

```bash
python helpers/generate_docker_compose.py
```
Within the script, specify the number of clients (`total_clients`) and resource limitations for each client in the `client_configs` array. You can adjust the number of rounds by passing `--num_rounds` to the above command.
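As a rough orientation, the script's option handling could look like the sketch below; only `--num_rounds` is documented here, so the parser details and defaults are assumptions.

```python
# Hypothetical option handling for the generator; the real script may differ.
import argparse

total_clients = 2  # edit inside the script to change the number of clients

parser = argparse.ArgumentParser(description="Generate docker-compose.yml")
parser.add_argument(
    "--num_rounds", type=int, default=100,
    help="number of federated learning rounds",
)
args = parser.parse_args()

print(f"Configuring {total_clients} clients for {args.num_rounds} rounds")
```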
Step 2: Build and Launch Containers#
Execute Initialization Script:
To build the Docker images and start the containers, use the following command:
```bash
# This is the only command you need to execute to run the entire example
docker-compose up
```
If you make any changes to the Dockerfile or other configuration files, you should rebuild the images to reflect these changes. This can be done by adding the `--build` flag to the command:

```bash
docker-compose up --build
```

The `--build` flag instructs Docker Compose to rebuild the images before starting the containers, ensuring that any code or configuration changes are included.

To stop all services, you have two options:
Run `docker-compose down` in another terminal if you are in the same directory. This command will stop and remove the containers, networks, and volumes created by `docker-compose up`.
Press `Ctrl+C` once in the terminal where `docker-compose up` is running. This will stop the containers but won’t remove them or the networks and volumes they use.
Services Startup:
Several services will automatically launch as defined in your `docker-compose.yml` file:
Monitoring Services: Prometheus for metrics collection, cAdvisor for container monitoring, and Grafana for data visualization.
Flower Federated Learning Environment: The Flower server and client containers are initialized and start running.
After launching the services, verify that all Docker containers are running correctly by executing the `docker ps` command. Here’s an example output:

```
➜  ~ docker ps
CONTAINER ID   IMAGE                               COMMAND                  CREATED          STATUS          PORTS                                                                    NAMES
9f05820eba45   flower-via-docker-compose-client2   "python client.py --…"   50 seconds ago   Up 48 seconds   0.0.0.0:6002->6002/tcp                                                   client2
a0333715d504   flower-via-docker-compose-client1   "python client.py --…"   50 seconds ago   Up 48 seconds   0.0.0.0:6001->6001/tcp                                                   client1
0da2bf735965   flower-via-docker-compose-server    "python server.py --…"   50 seconds ago   Up 48 seconds   0.0.0.0:6000->6000/tcp, 0.0.0.0:8000->8000/tcp, 0.0.0.0:8265->8265/tcp   server
c57ef50657ae   grafana/grafana:latest              "/run.sh --config=/e…"   50 seconds ago   Up 49 seconds   0.0.0.0:3000->3000/tcp                                                   grafana
4f274c2083dc   prom/prometheus:latest              "/bin/prometheus --c…"   50 seconds ago   Up 49 seconds   0.0.0.0:9090->9090/tcp                                                   prometheus
e9f4c9644a1c   gcr.io/cadvisor/cadvisor:v0.47.0    "/usr/bin/cadvisor -…"   50 seconds ago   Up 49 seconds   0.0.0.0:8080->8080/tcp                                                   cadvisor
```
To monitor the resource utilization of your containers in real-time and see the limits imposed in the Docker Compose file, you can use the `docker stats` command. This command provides a live stream of container CPU, memory, and network usage statistics.

```
➜  ~ docker stats
CONTAINER ID   NAME         CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
9f05820eba45   client2      104.44%   1.968GiB / 6GiB       32.80%    148MB / 3.22MB    0B / 284MB        82
a0333715d504   client1      184.69%   1.498GiB / 3GiB       49.92%    149MB / 2.81MB    1.37MB / 284MB    82
0da2bf735965   server       0.12%     218.5MiB / 15.61GiB   1.37%     1.47MB / 2.89MB   2.56MB / 2.81MB   45
c57ef50657ae   grafana      0.24%     96.19MiB / 400MiB     24.05%    18.9kB / 3.79kB   77.8kB / 152kB    20
4f274c2083dc   prometheus   1.14%     52.73MiB / 500MiB     10.55%    6.79MB / 211kB    1.02MB / 1.31MB   15
e9f4c9644a1c   cadvisor     7.31%     32.14MiB / 500MiB     6.43%     139kB / 6.66MB    500kB / 0B        18
```
Automated Grafana Configuration:
Grafana is configured to load pre-defined data sources and dashboards for immediate monitoring, facilitated by provisioning files. The provisioning files include `prometheus-datasource.yml` for data sources, located in the `./config/provisioning/datasources` directory, and `dashboard_index.json` for dashboards, in the `./config/provisioning/dashboards` directory. The `grafana.ini` file is also tailored to enhance the user experience:
Admin Credentials: We provide default admin credentials in the `grafana.ini` configuration, which simplifies access by eliminating the need for users to go through the initial login process.
Default Dashboard Path: A default dashboard path is set in `grafana.ini` to ensure that the dashboard with all the necessary panels is rendered when Grafana is accessed.
These files and settings are directly mounted into the Grafana container via Docker Compose volume mappings. This setup guarantees that upon startup, Grafana is pre-configured for monitoring, requiring no additional manual setup.
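For reference, a data source provisioning file of this kind generally follows Grafana's provisioning schema. The snippet below generates a minimal example; the values are assumptions and the repository's own `prometheus-datasource.yml` may differ.

```python
# Illustrative only: emit a minimal Grafana datasource provisioning file.
import yaml  # requires PyYAML

datasource = {
    "apiVersion": 1,
    "datasources": [
        {
            "name": "Prometheus",
            "type": "prometheus",
            "access": "proxy",
            # Inside the Compose network, Grafana reaches Prometheus by service name
            "url": "http://prometheus:9090",
            "isDefault": True,
        }
    ],
}

with open("prometheus-datasource.yml", "w") as f:
    yaml.safe_dump(datasource, f, sort_keys=False)
```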
Begin Training Process:
The federated learning training automatically begins once all client containers are successfully connected to the Flower server. This synchronizes the learning process across all participating clients.
By following these steps, you will have a fully functional federated learning environment with device heterogeneity and monitoring capabilities.
Model Training and Dataset Integration#
Data Pipeline with FLWR-Datasets#
We have integrated `flwr-datasets` into our data pipeline, which is managed within the `load_data.py` file in the `helpers/` directory. This script facilitates standardized access to datasets across the federated network and incorporates a `data_sampling_percentage` argument. This argument allows users to specify the percentage of the dataset to be used for training and evaluation, accommodating devices with lower memory capabilities to prevent Out-of-Memory (OOM) errors.
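The sketch below shows one way such a helper can be written with `flwr-datasets`; the function signature and split sizes are assumptions, not the exact contents of `helpers/load_data.py`.

```python
# Hedged sketch of a load_data-style helper built on flwr-datasets.
from flwr_datasets import FederatedDataset


def load_data(client_id: int, total_clients: int, data_sampling_percentage: float = 0.5):
    # Partition CIFAR-10 across all clients
    fds = FederatedDataset(dataset="cifar10", partitioners={"train": total_clients})
    partition = fds.load_partition(client_id, "train")

    # Keep only a fraction of the partition to accommodate low-memory devices
    n_samples = int(len(partition) * data_sampling_percentage)
    partition = partition.shuffle(seed=42).select(range(n_samples))

    # Split the subsampled partition into train and evaluation sets
    split = partition.train_test_split(test_size=0.2, seed=42)
    return split["train"], split["test"]
```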
Model Selection and Dataset#
For the federated learning system, we have selected the MobileNet model due to its efficiency in image classification tasks. The model is trained and evaluated on the CIFAR-10 dataset. The combination of MobileNet and CIFAR-10 is ideal for demonstrating the capabilities of our federated learning solution in a heterogeneous device environment.
MobileNet: A streamlined architecture for mobile and embedded devices that balances performance and computational cost.
CIFAR-10 Dataset: A standard benchmark dataset for image classification, containing various object classes that pose a comprehensive challenge for the learning model.
By integrating these components, our framework is well-prepared to handle the intricacies of training over a distributed network with varying device capabilities and data availability.
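As an illustration of this pairing, a client could build its model roughly as follows; this sketch assumes a TensorFlow/Keras client and may not match the example's exact code.

```python
# Hedged sketch: MobileNet trained from scratch on 32x32x3 CIFAR-10 inputs.
import tensorflow as tf


def build_model(learning_rate: float = 0.001) -> tf.keras.Model:
    model = tf.keras.applications.MobileNet(
        input_shape=(32, 32, 3), weights=None, classes=10
    )
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```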
Visualizing with Grafana#
Access Grafana Dashboard#
Visit http://localhost:3000 to enter Grafana. The automated setup ensures that you’re greeted with a series of pre-configured dashboards, including the default screen with a comprehensive set of graphs. These dashboards are ready for immediate monitoring and can be customized to suit your specific requirements.
Dashboard Configuration#
The `dashboard_index.json` file, located in the `./config/provisioning/dashboards` directory, serves as the backbone of our Grafana dashboard’s configuration. It defines the structure and settings of the dashboard panels, which are rendered when you access Grafana. This JSON file contains the specifications for various panels such as model accuracy, CPU usage, memory utilization, and network traffic. Each panel’s configuration includes the data source, queries, visualization type, and other display settings like thresholds and colors.

For instance, in our project setup, the `dashboard_index.json` configures a panel to display the model’s accuracy over time using a time-series graph, and another panel to show the CPU usage across clients using a graph that plots data points as they are received. This file is fundamental for creating a customized and informative dashboard that provides a snapshot of the federated learning system’s health and performance metrics.

By modifying the `dashboard_index.json` file, users can tailor the Grafana dashboard to include additional metrics or change the appearance and behavior of existing panels to better fit their monitoring requirements.
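For example, a panel could be added programmatically rather than through the Grafana UI. The snippet below follows the general Grafana dashboard JSON schema; the panel fields and metric are illustrative, and the actual layout of `dashboard_index.json` may differ.

```python
# Illustrative only: append a hypothetical panel to dashboard_index.json.
import json

path = "./config/provisioning/dashboards/dashboard_index.json"

with open(path) as f:
    dashboard = json.load(f)

# Add a time-series panel for per-container network receive rate (cAdvisor metric)
dashboard.setdefault("panels", []).append(
    {
        "title": "Client Network Receive Rate",
        "type": "timeseries",
        "datasource": "Prometheus",
        "targets": [
            {"expr": "rate(container_network_receive_bytes_total[1m])", "refId": "A"}
        ],
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
    }
)

with open(path, "w") as f:
    json.dump(dashboard, f, indent=2)
```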
Grafana Default Dashboard#
Below is the default Grafana dashboard that users will see upon accessing Grafana:
This comprehensive dashboard provides insights into various system metrics across client-server containers. It includes visualizations such as:
Application Metrics: The “Model Accuracy” graph shows an upward trend as rounds of training progress, which is a positive indicator of the model learning and improving over time. Conversely, the “Model Loss” graph trends downward, suggesting that the model is becoming more precise and making fewer mistakes as it trains.
CPU Usage: The sharp spikes in the red graph, representing “client1”, indicate peak CPU usage, which is considerably higher than that of “client2” (blue graph). This difference is due to “client1” being allocated more computing resources (up to 4 CPU cores) compared to “client2”, which is limited to just 1 CPU core, hence the more subdued CPU usage pattern.
Memory Utilization: Both clients show similar memory usage, reflected in their nearly identical memory-usage lines. This allows for a straightforward comparison of how each client manages memory under similar conditions.
Network Traffic: Monitor incoming and outgoing network traffic to each client, which is crucial for understanding data exchange volumes during federated learning cycles.
Together, these metrics paint a detailed picture of the federated learning operation, showcasing resource usage and model performance. Such insights are invaluable for system optimization, ensuring balanced load distribution and efficient model training.
Comprehensive Monitoring System Integration#
Capturing Container Metrics with cAdvisor#
cAdvisor is seamlessly integrated into our monitoring setup to capture a variety of system and container metrics, such as CPU, memory, and network usage. These metrics are vital for analyzing the performance and resource consumption of the containers in the federated learning environment.
Custom Metrics: Setup and Monitoring via Prometheus#
In addition to the standard metrics captured by cAdvisor, we have implemented a process to track custom metrics, such as the model’s accuracy and loss, within Grafana, using Prometheus as the backbone for metric collection.
Prometheus Client Installation:
We began by installing the `prometheus_client` library in our Python environment, enabling us to define and expose custom metrics that Prometheus can scrape.
Defining Metrics in Server Script:
Within our `server.py` script, we have established two key Prometheus Gauge metrics, specifically tailored for monitoring our federated learning model: `model_accuracy` and `model_loss`. These custom gauges capture the most recent values of the model’s accuracy and loss, which are essential for evaluating the model’s performance. The gauges are defined as follows:

```python
from prometheus_client import Gauge

accuracy_gauge = Gauge('model_accuracy', 'Current accuracy of the global model')
loss_gauge = Gauge('model_loss', 'Current loss of the global model')
```
Exposing Metrics via HTTP Endpoint:
We leveraged the `start_http_server` function from the `prometheus_client` library to launch an HTTP server on port 8000. This server provides the `/metrics` endpoint, where the custom metrics are accessible for Prometheus scraping. The function is called at the end of the `main` method in `server.py`:

```python
start_http_server(8000)
```
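To show how these pieces can fit together, here is a hedged sketch of a `main` that exposes the metrics endpoint and starts the Flower server. The class name `FedCustom`, the constructor arguments, the ordering of calls, and the port 6000 (suggested by the `docker ps` output above) are assumptions, not the example's exact code.

```python
# Hedged sketch of server.py's entry point; details may differ from the example.
import flwr as fl
from prometheus_client import Gauge, start_http_server

from strategy import FedCustom  # hypothetical strategy class name

accuracy_gauge = Gauge('model_accuracy', 'Current accuracy of the global model')
loss_gauge = Gauge('model_loss', 'Current loss of the global model')


def main():
    # Expose the /metrics endpoint on port 8000 for Prometheus to scrape
    start_http_server(8000)

    # Hand the gauges to the strategy so aggregate_evaluate can update them
    strategy = FedCustom(accuracy_gauge=accuracy_gauge, loss_gauge=loss_gauge)

    fl.server.start_server(
        server_address="0.0.0.0:6000",  # port assumed from the docker ps output
        config=fl.server.ServerConfig(num_rounds=100),
        strategy=strategy,
    )


if __name__ == "__main__":
    main()
```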
Updating Metrics Recording Strategy:
The core of our metrics tracking lies in the `strategy.py` file, particularly within the `aggregate_evaluate` method. This method is crucial as it’s where the federated learning model’s accuracy and loss values are computed after each round of training with the aggregated data from all clients.

```python
self.accuracy_gauge.set(accuracy_aggregated)
self.loss_gauge.set(loss_aggregated)
```
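In context, such a strategy can be written as a `FedAvg` subclass along the following lines; the class name, constructor, and the example-weighted accuracy aggregation are assumptions rather than the exact contents of `strategy.py`, and the sketch assumes each client reports an "accuracy" entry in its evaluate metrics.

```python
# Hedged sketch of a metrics-recording strategy; not the example's exact code.
from typing import Dict, List, Optional, Tuple, Union

import flwr as fl
from flwr.common import EvaluateRes, Scalar
from flwr.server.client_proxy import ClientProxy


class FedCustom(fl.server.strategy.FedAvg):
    def __init__(self, accuracy_gauge=None, loss_gauge=None, **kwargs):
        super().__init__(**kwargs)
        self.accuracy_gauge = accuracy_gauge
        self.loss_gauge = loss_gauge

    def aggregate_evaluate(
        self,
        server_round: int,
        results: List[Tuple[ClientProxy, EvaluateRes]],
        failures: List[Union[Tuple[ClientProxy, EvaluateRes], BaseException]],
    ) -> Tuple[Optional[float], Dict[str, Scalar]]:
        # Let FedAvg compute the example-weighted average loss
        loss_aggregated, _ = super().aggregate_evaluate(server_round, results, failures)

        # Example-weighted average of the accuracies reported by the clients
        examples = [res.num_examples for _, res in results]
        weighted_acc = [res.metrics["accuracy"] * res.num_examples for _, res in results]
        accuracy_aggregated = sum(weighted_acc) / sum(examples) if examples else 0.0

        # Push the latest values to the Prometheus gauges
        self.accuracy_gauge.set(accuracy_aggregated)
        if loss_aggregated is not None:
            self.loss_gauge.set(loss_aggregated)

        return loss_aggregated, {"accuracy": accuracy_aggregated}
```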
Configuring Prometheus Scraping:
In the `prometheus.yml` file, under `scrape_configs`, we configured a new job to scrape the custom metrics from the HTTP server. This setup includes the job’s name, the scraping interval, and the target server’s URL.
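As a rough illustration of the shape of that configuration, the snippet below writes a minimal `prometheus.yml` containing such a job; the job names, interval, and target addresses are assumptions rather than the example's actual file.

```python
# Illustrative only: emit a minimal prometheus.yml with a job that scrapes
# the custom-metrics endpoint exposed by server.py on port 8000.
minimal_config = """\
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]

  - job_name: "flower-server"
    scrape_interval: 5s
    static_configs:
      - targets: ["server:8000"]
"""

with open("prometheus.yml", "w") as f:
    f.write(minimal_config)
```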
Visualizing the Monitoring Architecture#
The image below depicts the Prometheus scraping process as it is configured in our monitoring setup. Within this architecture:
The “Prometheus server” is the central component that retrieves and stores metrics.
“cAdvisor” and the “HTTP server” we set up to expose our custom metrics are represented as “Prometheus targets” in the diagram. cAdvisor captures container metrics, while the HTTP server serves our custom `model_accuracy` and `model_loss` metrics at the `/metrics` endpoint. These targets are periodically scraped by the Prometheus server, aggregating data from both system-level and custom performance metrics.
The aggregated data is then made available to the “Prometheus web UI” and “Grafana,” as shown, enabling detailed visualization and analysis through the Grafana dashboard.
By incorporating these steps, we have enriched our monitoring capabilities to not only include system-level metrics but also critical performance indicators of our federated learning model. This approach is pivotal for understanding and improving the learning process. Similarly, you can apply this methodology to track any other metric that you find interesting or relevant to your specific needs. This flexibility allows for a comprehensive and customized monitoring environment, tailored to the unique aspects and requirements of your federated learning system.
Additional Resources#
Grafana Tutorials: Explore a variety of tutorials on Grafana at Grafana Tutorials.
Prometheus Overview: Learn more about Prometheus at their official documentation.
cAdvisor Guide: For information on monitoring Docker containers with cAdvisor, see this Prometheus guide.
Conclusion#
This project serves as a foundational example of managing device heterogeneity within the federated learning context, employing the Flower framework alongside Docker, Prometheus, and Grafana. It’s designed to be a starting point for users to explore and further adapt to the complexities of device heterogeneity in federated learning environments.