DepthFL: Depthwise Federated Learning for Heterogeneous Clients

Note: If you use this baseline in your work, please remember to cite the original authors of the paper as well as the Flower paper.

Paper: openreview.net/forum?id=pf8RIZTMU58

Authors: Minjae Kim, Sangyoon Yu, Suhyun Kim, Soo-Mook Moon

Abstract: Federated learning is for training a global model without collecting private local data from clients. As they repeatedly need to upload locally-updated weights or gradients instead, clients require both computation and communication resources enough to participate in learning, but in reality their resources are heterogeneous. To enable resource-constrained clients to train smaller local models, width scaling techniques have been used, which reduces the channels of a global model. Unfortunately, width scaling suffers from heterogeneity of local models when averaging them, leading to a lower accuracy than when simply excluding resource-constrained clients from training. This paper proposes a new approach based on depth scaling called DepthFL. DepthFL defines local models of different depths by pruning the deepest layers off the global model, and allocates them to clients depending on their available resources. Since many clients do not have enough resources to train deep local models, this would make deep layers partially-trained with insufficient data, unlike shallow layers that are fully trained. DepthFL alleviates this problem by mutual self-distillation of knowledge among the classifiers of various depths within a local model. Our experiments show that depth-scaled local models build a global model better than width-scaled ones, and that self-distillation is highly effective in training data-insufficient deep layers.

About this baseline

What’s implemented: The code in this directory replicates the CIFAR100 experiments in DepthFL: Depthwise Federated Learning for Heterogeneous Clients (Kim et al., 2023), the paper that proposed the DepthFL algorithm. Concretely, it replicates the CIFAR100 results in Tables 2, 3, and 4.

Datasets: CIFAR100 from PyTorch’s Torchvision

Hardware Setup: These experiments were run on a server with Nvidia 3090 GPUs. Any machine with 1x 8GB GPU or more would be able to run it in a reasonable amount of time. With the default settings, clients make use of 1.3GB of VRAM. Lower num_gpus in client_resources to train more clients in parallel on your GPU(s).

Contributors: Minjae Kim
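
The Hardware Setup note above relates num_gpus to the number of clients that train in parallel. The following is a rough sketch of that arithmetic, not part of the baseline: the function names are hypothetical, the 1.3GB figure is the per-client VRAM quoted above, and the estimate ignores framework overhead and the num_cpus limit, both of which can reduce the real number.

```python
import math


def clients_per_gpu(num_gpus: float) -> int:
    """Clients that can share one GPU when each reserves `num_gpus` of it."""
    # The small epsilon guards against floating-point artifacts in 1 / num_gpus.
    return math.floor(1.0 / num_gpus + 1e-9)


def smallest_safe_num_gpus(gpu_mem_gb: float, per_client_gb: float = 1.3) -> float:
    """Smallest num_gpus fraction for which co-located clients still fit in VRAM."""
    fit = math.floor(gpu_mem_gb / per_client_gb)
    if fit < 1:
        raise ValueError("GPU memory is too small for a single client")
    return 1.0 / fit


# Default setting from this baseline: num_gpus = 0.5
print(clients_per_gpu(0.5))

# Hypothetical 8GB GPU: how low could num_gpus go before clients run out of VRAM?
fraction = smallest_safe_num_gpus(8.0)
print(fraction, clients_per_gpu(fraction))
```

In practice it is safer to leave some headroom above this estimate, since each process also allocates memory that is not counted in the per-client figure.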

Experimental Setup

Task: Image Classification

Model: ResNet18

Dataset: This baseline only includes the CIFAR100 dataset. By default it is partitioned across 100 clients following an IID distribution. The settings are as follows:

Dataset	#classes	#partitions	partitioning method
CIFAR100	100	100	IID or Non-IID
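
As an illustration of the default IID setting in the table above, the sketch below shuffles sample indices and splits them evenly across clients. It is a minimal stand-in rather than the baseline's own partitioning code: the function name and seed are hypothetical, and 50,000 is the size of the CIFAR100 training set.

```python
import random


def iid_partition(num_samples: int, num_clients: int, seed: int = 0) -> list[list[int]]:
    """Shuffle sample indices and split them into equally sized client shards."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    shard = num_samples // num_clients  # any remainder samples are dropped
    return [indices[i * shard:(i + 1) * shard] for i in range(num_clients)]


# 50,000 training images split across the 100 clients used in this baseline.
partitions = iid_partition(50_000, 100)
print(len(partitions), len(partitions[0]))
```

Because the shuffle is uniform, every client's shard has roughly the same class distribution, which is what distinguishes this setting from the Non-IID option.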

Training Hyperparameters: The following table shows the main hyperparameters for this baseline with their default value (i.e. the value used if you run python -m depthfl.main directly)

Description	Default Value
total clients	100
local epoch	5
batch size	50
number of rounds	1000
participation ratio	10%
learning rate	0.1
learning rate decay	0.998
client resources	{'num_cpus': 1.0, 'num_gpus': 0.5}
data partition	IID
optimizer	SGD with dynamic regularization
alpha	0.1
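
To give a feel for the defaults in the table above, this small sketch computes how many clients are sampled per round and how the learning rate evolves. It assumes the decay factor is applied multiplicatively once per round, starting from round 0; that schedule and both function names are assumptions for illustration, not taken from the baseline's code.

```python
def clients_per_round(total_clients: int = 100, participation: float = 0.10) -> int:
    """Number of clients sampled each round under the given participation ratio."""
    return max(1, round(total_clients * participation))


def learning_rate_at(round_idx: int, base_lr: float = 0.1, decay: float = 0.998) -> float:
    """Learning rate after `round_idx` multiplicative decay steps (assumed schedule)."""
    return base_lr * decay ** round_idx


print(clients_per_round())
for r in (0, 250, 500, 1000):
    print(r, round(learning_rate_at(r), 5))
```

Under this assumption the learning rate falls by roughly an order of magnitude over the default 1000 rounds.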

Environment Setup

To construct the Python environment follow these steps:

# Set python version
pyenv install 3.10.6
pyenv local 3.10.6

# Tell poetry to use python 3.10
poetry env use 3.10.6

# Install the base Poetry environment
poetry install

# Activate the environment
poetry shell

Running the Experiments

To run DepthFL, first ensure you have activated your Poetry environment (execute poetry shell from this directory), then:

# this will run using the default settings in the `conf/config.yaml`
python -m depthfl.main  # 'accuracy' : accuracy of the ensemble model, 'accuracy_single' : accuracy of each classifier.

# you can override settings directly from the command line
python -m depthfl.main exclusive_learning=true model_size=1 # exclusive learning - 100% (a)
python -m depthfl.main exclusive_learning=true model_size=4 # exclusive learning - 25% (d)
python -m depthfl.main fit_config.feddyn=false fit_config.kd=false # DepthFL (FedAvg)
python -m depthfl.main fit_config.feddyn=false fit_config.kd=false fit_config.extended=false # InclusiveFL
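
The DepthFL runs above rely on the idea described in the abstract: local models are the global model with its deepest layers pruned, so shallow blocks receive updates from more clients than deep ones. A minimal sketch of aggregating such updates follows. It assumes each block is averaged only over the clients that actually hold it, uses plain lists in place of tensors, and weights clients equally; the function and block names are hypothetical.

```python
def aggregate_by_depth(client_weights: list[dict[str, list[float]]]) -> dict[str, list[float]]:
    """Average every block over only the clients whose local model contains it."""
    sums: dict[str, list[float]] = {}
    counts: dict[str, int] = {}
    for weights in client_weights:
        for name, values in weights.items():
            if name not in sums:
                sums[name] = [0.0] * len(values)
                counts[name] = 0
            sums[name] = [s + v for s, v in zip(sums[name], values)]
            counts[name] += 1
    return {name: [s / counts[name] for s in total] for name, total in sums.items()}


# A shallow client holds only the first block; a deep client holds both.
shallow = {"block1": [1.0, 1.0]}
deep = {"block1": [3.0, 3.0], "block2": [5.0]}
global_model = aggregate_by_depth([shallow, deep])
print(global_model)
```

Here block2 is updated by a single client, which illustrates why the deepest layers see less data and why the paper adds self-distillation to compensate.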

To run using HeteroFL:

# since sBN (static batch normalization) takes too long, the global model is tested only every 50 rounds
python -m depthfl.main --config-name="heterofl" # HeteroFL
python -m depthfl.main --config-name="heterofl" exclusive_learning=true model_size=1 # exclusive learning - 100% (a)

Stateful clients comment

FedDyn requires stateful clients that keep their prev_grads between rounds. Because flwr does not yet officially support stateful clients, this baseline uses a temporary workaround: each client loads its prev_grads from disk when it is created and writes it back to disk after training. The per-client state files are stored in the prev_grads folder, and the contents of that folder are reset whenever the strategy is instantiated (for both FedDyn and HeteroFL).
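
The load-then-save pattern described above can be sketched as follows. This is an illustrative stand-in, not the baseline's implementation: the helper names, the client_<cid>.pkl file naming, and the use of pickle are all assumptions, and a plain list stands in for the real prev_grads tensors.

```python
import pickle
import tempfile
from pathlib import Path


def state_path(state_dir: str, cid: int) -> Path:
    """Location of one client's state file (hypothetical naming scheme)."""
    return Path(state_dir) / f"client_{cid}.pkl"


def load_prev_grads(state_dir: str, cid: int, default: list[float]) -> list[float]:
    """Return the stored state for a client, or `default` if none exists yet."""
    path = state_path(state_dir, cid)
    if not path.exists():
        return list(default)
    with path.open("rb") as f:
        return pickle.load(f)


def save_prev_grads(state_dir: str, cid: int, prev_grads: list[float]) -> None:
    """Persist a client's state so the next instantiation can pick it up."""
    Path(state_dir).mkdir(parents=True, exist_ok=True)
    with state_path(state_dir, cid).open("wb") as f:
        pickle.dump(prev_grads, f)


def reset_state(state_dir: str) -> None:
    """Delete all stored client state, as done when a strategy is created."""
    for path in Path(state_dir).glob("client_*.pkl"):
        path.unlink()


with tempfile.TemporaryDirectory() as tmp:
    state = load_prev_grads(tmp, cid=7, default=[0.0, 0.0])  # first round: no file yet
    state = [g + 0.5 for g in state]                          # stand-in for local training
    save_prev_grads(tmp, 7, state)
    restored = load_prev_grads(tmp, cid=7, default=[0.0, 0.0])
    print(restored)
```

Keeping the state on disk means a client object can be created and destroyed every round while its FedDyn state survives, at the cost of extra I/O per round.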

Expected Results

The following commands run DepthFL (FedDyn / FedAvg), InclusiveFL, and HeteroFL to replicate the results of Tables 2, 3, and 4 in the DepthFL paper. Some results appear in more than one table because they come from the same experiment.

# table 2 (HeteroFL row)
python -m depthfl.main --config-name="heterofl" 
python -m depthfl.main --config-name="heterofl" --multirun exclusive_learning=true model.scale=false model_size=1,2,3,4 

# table 2 (DepthFL(FedAvg) row)
python -m depthfl.main fit_config.feddyn=false fit_config.kd=false 
python -m depthfl.main --multirun fit_config.feddyn=false fit_config.kd=false  exclusive_learning=true model_size=1,2,3,4

# table 2 (DepthFL row)
python -m depthfl.main
python -m depthfl.main --multirun exclusive_learning=true model_size=1,2,3,4

Table 2

The 100% (a), 75% (b), 50% (c), and 25% (d) cases are exclusive learning scenarios. 100% (a) exclusive learning means that the global model and every local model are equal to the smallest local model, and 100% of the clients participate in learning. Likewise, 25% (d) exclusive learning means that the global model and every local model are equal to the largest local model, and only 25% of the clients participate in learning.

Scaling Method	Dataset	Global Model	100% (a)	75% (b)	50% (c)	25% (d)
HeteroFL	CIFAR100	57.61	64.39	66.08	62.03	51.99
DepthFL (FedAvg)	CIFAR100	72.67	67.08	70.78	68.41	59.17
DepthFL	CIFAR100	76.06	69.68	73.21	70.29	60.32
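
The participation percentages in the exclusive-learning columns of Table 2 can be reproduced with a short sketch. It assumes the 100 clients are split evenly into four resource tiers, where a client's tier is the largest model_size it can train, and that exclusive learning with a given model_size admits only clients whose tier is at least that size. The even split and the names below are assumptions made for illustration.

```python
def participating_clients(client_tiers: list[int], model_size: int) -> list[int]:
    """IDs of clients able to train a model of the given size."""
    return [cid for cid, tier in enumerate(client_tiers) if tier >= model_size]


# Assumed even split: tiers 1..4, 25 clients each.
client_tiers = [1 + cid % 4 for cid in range(100)]

shares = {}
for model_size, label in zip((1, 2, 3, 4), "abcd"):
    share = len(participating_clients(client_tiers, model_size)) / len(client_tiers)
    shares[label] = share
    print(f"model_size={model_size} ({label}): {share:.0%} of clients participate")
```

The trade-off is visible directly: a larger shared model has more capacity but is trained on data from fewer clients.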

# table 3 (Width Scaling - Duplicate results from table 2)
python -m depthfl.main --config-name="heterofl" 
python -m depthfl.main --config-name="heterofl" --multirun exclusive_learning=true model.scale=false model_size=1,2,3,4 

# table 3 (Depth Scaling : Exclusive Learning, DepthFL(FedAvg) rows - Duplicate results from table 2)
python -m depthfl.main fit_config.feddyn=false fit_config.kd=false 
python -m depthfl.main --multirun fit_config.feddyn=false fit_config.kd=false  exclusive_learning=true model_size=1,2,3,4

# table 3 (Depth Scaling - InclusiveFL row)
python -m depthfl.main fit_config.feddyn=false fit_config.kd=false fit_config.extended=false

Table 3

Accuracy of global sub-models compared to exclusive learning on CIFAR-100.

Method	Algorithm	Classifier 1/4	Classifier 2/4	Classifier 3/4	Classifier 4/4
Width Scaling	Exclusive Learning	64.39	66.08	62.03	51.99
Width Scaling	HeteroFL	51.08	55.89	58.29	57.61
Depth Scaling	Exclusive Learning	67.08	68.00	66.19	56.78
Depth Scaling	InclusiveFL	47.61	53.88	59.48	60.46
Depth Scaling	DepthFL (FedAvg)	66.18	67.56	67.97	68.01

# table 4
python -m depthfl.main --multirun fit_config.kd=true,false dataset_config.iid=true,false

Table 4

Accuracy of the global model with/without self distillation on CIFAR-100.

Distribution	Dataset	KD	Classifier 1/4	Classifier 2/4	Classifier 3/4	Classifier 4/4	Ensemble
IID	CIFAR100	✗	70.13	69.63	68.92	68.92	74.48
IID	CIFAR100	✓	71.74	73.35	73.57	73.55	76.06
non-IID	CIFAR100	✗	67.94	68.68	68.46	67.78	73.18
non-IID	CIFAR100	✓	70.33	71.88	72.43	72.34	74.92
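
The KD column in Table 4 toggles the mutual self-distillation described in the abstract, where the classifiers at different depths of one local model teach each other. A rough sketch of such a loss for a single sample follows. It assumes every classifier is pulled toward each of the others through a KL divergence averaged over its peers; temperature scaling, loss weighting, and gradient handling are omitted, and all names are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two discrete distributions; q is clamped to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, 1e-12)) for pi, qi in zip(p, q) if pi > 0)


def mutual_distillation_loss(all_logits: list[list[float]]) -> float:
    """Sum over classifiers of the mean KL from each peer (teacher) to it (student)."""
    probs = [softmax(logits) for logits in all_logits]
    n = len(probs)
    if n < 2:
        return 0.0  # a single classifier has no peers to learn from
    total = 0.0
    for student in range(n):
        for teacher in range(n):
            if student != teacher:
                total += kl_divergence(probs[teacher], probs[student])
    return total / (n - 1)


# Three classifiers of one local model scoring the same sample over three classes.
agree = [[2.0, 0.5, 0.1]] * 3
disagree = [[2.0, 0.5, 0.1], [0.1, 2.0, 0.5], [0.5, 0.1, 2.0]]
print(mutual_distillation_loss(agree), mutual_distillation_loss(disagree))
```

The loss is zero when all classifiers agree and grows as their predictions diverge, so adding it to the usual cross-entropy nudges the data-starved deep classifiers toward the better-trained shallow ones.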