FedStar: Federated Self-training for Semi-supervised Audio Recognition#

Note: If you use this baseline in your work, please remember to cite the original authors of the paper as well as the Flower paper.

Paper: dl.acm.org/doi/10.1145/3520128

Authors: Vasileios Tsouvalas, Aaqib Saeed, Tanir Özcelebi

Abstract: Federated Learning is a distributed machine learning paradigm dealing with decentralized and personal datasets. Since data reside on devices such as smartphones and virtual assistants, labeling is entrusted to the clients or labels are extracted in an automated way. Specifically, in the case of audio data, acquiring semantic annotations can be prohibitively expensive and time-consuming. As a result, an abundance of audio data remains unlabeled and unexploited on users’ devices. Most existing federated learning approaches focus on supervised learning without harnessing the unlabeled data. In this work, we study the problem of semi-supervised learning of audio models via self-training in conjunction with federated learning. We propose FedSTAR to exploit large-scale on-device unlabeled data to improve the generalization of audio recognition models. We further demonstrate that self-supervised pre-trained models can accelerate the training of on-device models, significantly improving convergence within fewer training rounds. We conduct experiments on diverse public audio classification datasets and investigate the performance of our models under varying percentages of labeled and unlabeled data. Notably, we show that with as little as 3% labeled data available, FedSTAR on average can improve the recognition rate by 13.28% compared to the fully supervised federated model.

About this baseline#

What’s implemented: The code is structured in such a way that all experiments for ambient context and speech commands can be derived.

Datasets: Ambient Context, Speech Commands

Hardware Setup: These experiments were run on a linux server with 56 CPU threads with 325 GB Ram with A10 GPU in it. Any machine with 16 CPU cores and 32 GB memory would be able to run experiments with small number of clients in a reasonable amount of time. For context, a machine with 24 cores and a RTX3090Ti ran the Speech Commands experiment in Table 3 with 10 clients in 1h. For this experiment 30GB of RAM was used and clients required ~1.4GB of VRAM each. The same experiment but with the Ambient Context dataset too 13minutes.

Contributors: Raj Parekh GitHub, Mail

Environment Setup#

# Set python version
pyenv local 3.10.6
# Tell poetry to use python 3.10
poetry env use 3.10.6
# Now install the environment
poetry install
# Start the shell to activate your environment.
poetry shell

Next, you’ll need to download the datasets. In the case of SpeechCommands, some preprocessing is also required:

# Make the shell script executable
chmod +x setup_datasets.sh

# The below script will download the datasets and create a directory structure require to run this experiment.
./setup_datasets.sh

# If you want to run the SpeechCommands experiment, pre-process the dataset
# This will generate a few training example from the _silence_ category
python -m fedstar.dataset_preparation
# Please note the above will make following changes:
#    * Add new files to datasets/speech_commands/Data/Train/_silence_
#    * Add new entries to data_splits/speech_commands/train_split.txt
# Therefore the above command should only be run once. If you want to run it again
# after making modifications to the script, please either revert the changes outlined
# above or erase the dataset and repeat the download + preprocessing as defined in setup_datasets.sh script.

Setting up GPU Memory#

Note: The experiment is designed to run on both GPU and CPU, but runs better on a system with GPU (specially when using the SpeechCommands dataset). If you wish to use GPU, make sure you have installed the CUDA Toolkit. This baseline has been tested with CUDA 12.3. By default, it will run only on the CPU. Please update the value of the list gpu_total_mem with the corresponding memory for each GPU in your machine that you want to expose to the experiment. The variable is in the distribute_gpus function inside the clients.py. Reference is shown below.

# For Eg:- We have a system with two GPUs with 8GB and 4GB VRAM.
#          The modified variable will looks like below.
gpu_free_mem = [8000,4000]

Running the Experiments#

By default, the Ambient Context experiment in Table 3 with 10 clients will be run.

python -m fedstar.server
python -m fedstar.clients

You can change the dataset, number of clients and number of rounds like this:

python -m fedstar.server num_clients=5 dataset_name=speech_commands server.rounds=20
python -m fedstar.clients num_clients=5 dataset_name=speech_commands

To run experiments for Table 4, you should pass a different config file (i.e. that in fedstar/conf/table4.yaml). You can do this as follows:

# by default will run FedStar with Ambient Context and L=3%
python -m fedstar.server --config-name table4
python -m fedstar.clients --config-name table4

To modify the ratio of labelled data do so as follows:

# To use a different L setting
python -m fedstar.server --config-name table4 L=L5 # {L3, L5, L20, L50}
# same for fedstar.clients

To run in supervised mode, pass fedstar=false to any of the commands above (when launching both the server and clients). Naturally, you can also override any other setting, like dataset_name and num_clients if desired.

Expected Results#

This section indicates the commands to execute to obtain the results shown below in Table 3 and Table 4. While both configs fix the number of rounds to 100, in many settings fewer rounds are enough for the model to reach the accuracy shown in the tables. The commands below make use of Hydra’s --multirun to run multiple experiments. This is better suited when using Flower simulations. Here they work fine but, if you encounter any issues, you can always “unroll” the multirun and run one configuration at a time. If you do this, results won’t go into the multirun/ directory, instead to the default outputs/ directory.

Table 3#

Results will be stored in multirun/Table3/<dataset_name>/N_<num_clients>/<date>/<time>. Please note since we are running two Hydra processes, both server and client will generate a log and therefore respective subdirectories in multirun/. This is a small compromise of not using Flower simulation.

# For Ambient Context
python -m fedstar.server --multirun num_clients=5,10,15,30
python -m fedstar.clients --multirun num_clients=5,10,15,30

# For SpeechCommands
python -m fedstar.server --multirun num_clients=5,10,15,30 dataset_name=speech_commands
python -m fedstar.clients --multirun num_clients=5,10,15,30 dataset_name=speech_commands

Clients	Speech Commands		Ambient Context
	Actual	Implementation	Actual	Implementation
N=5	96.93	97.15	71.88	72.60
N=10	96.78	96.42	68.01	68.43
N=15	96.33	96.43	66.86	66.28
N=30	94.62	95.37	65.14	59.45

Table 4#

Following the logic presented for obtaining Table 3 results, the larger Table 4 set of results can be obtained by running the --multirun commands shown below.

# Generate supervised results for Ambient Context (note this will run 4x4=16 experiments)
python -m fedstar.server --config-name table4 --multirun num_clients=5,10,15,30 L=L3,L5,L20,L50 fedstar=false
python -m fedstar.clients --config-name table4 --multirun num_clients=5,10,15,30 L=L3,L5,L20,L50 fedstar=false

# Generate supervised results for Speech Commands (note this will run 4x4=16 experiments)
python -m fedstar.server --config-name table4 --multirun num_clients=5,10,15,30 L=L3,L5,L20,L50 dataset_name=speech_commands fedstar=false
python -m fedstar.clients --config-name table4 --multirun num_clients=5,10,15,30 L=L3,L5,L20,L50 dataset_name=speech_commands fedstar=false

# Generate FedStar results for Ambient Context
python -m fedstar.server --config-name table4 --multirun num_clients=5,10,15,30 L=L3,L5,L20,L50
python -m fedstar.clients --config-name table4 --multirun num_clients=5,10,15,30 L=L3,L5,L20,L50

# Generate FedStar results for Speech Commands
python -m fedstar.server --config-name table4 --multirun num_clients=5,10,15,30 L=L3,L5,L20,L50 dataset_name=speech_commands
python -m fedstar.clients --config-name table4 --multirun num_clients=5,10,15,30 L=L3,L5,L20,L50 dataset_name=speech_commands

Dataset	Clients	SupervisedFederatedLearning L=3 L=5 L=20 L=50 L=100	FedStar L=3 L=5 L=20 L=50
Ambient Context	5 10 15 30	43.75 45.17 63.40 67.57 75.27 41.75 44.90 56.23 60.82 70.39 43.18 42.75 49.60 59.47 68.41 36.44 38.73 48.91 55.83 67.23	49.60 52.78 66.12 66.71 47.84 52.48 61.98 63.46 49.05 56.05 64.25 64.05 47.34 46.51 60.45 55.47
Speech Commands	5 10 15 30	80.83 87.97 92.35 94.66 95.87 54.66 78.90 92.13 93.71 96.55 53.44 64.53 91.83 94.23 96.35 36.41 42.87 83.68 93.18 94.24	87.39 90.11 94.09 94.85 87.15 91.06 94.74 96.15 86.78 91.01 95.21 95.70 80.89 85.62 94.08 94.18

Dataset

Clients

SupervisedFederatedLearning
L=3 L=5 L=20 L=50 L=100

FedStar
L=3 L=5 L=20 L=50

Ambient

Context

5
10
15
30

43.75 45.17 63.40 67.57 75.27
41.75 44.90 56.23 60.82 70.39
43.18 42.75 49.60 59.47 68.41
36.44 38.73 48.91 55.83 67.23

49.60 52.78 66.12 66.71
47.84 52.48 61.98 63.46
49.05 56.05 64.25 64.05
47.34 46.51 60.45 55.47

Speech

Commands

5
10
15
30

80.83 87.97 92.35 94.66 95.87
54.66 78.90 92.13 93.71 96.55
53.44 64.53 91.83 94.23 96.35
36.41 42.87 83.68 93.18 94.24

87.39 90.11 94.09 94.85
87.15 91.06 94.74 96.15
86.78 91.01 95.21 95.70
80.89 85.62 94.08 94.18