Use with PyTorch
Let’s integrate flwr-datasets with PyTorch DataLoaders while keeping your existing PyTorch transforms applied to the data.
Standard setup: download the dataset and choose the partitioning:
from flwr_datasets import FederatedDataset
fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
partition = fds.load_partition(0, "train")
centralized_dataset = fds.load_split("test")
Determine the names of the features (alternatively, you can check them directly on the Hugging Face website). The names can vary, e.g., “img” or “image”, “label” or “labels”:
partition.features
In the case of CIFAR10, you should see the following output:
{'img': Image(decode=True, id=None),
 'label': ClassLabel(names=['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog',
                            'frog', 'horse', 'ship', 'truck'], id=None)}
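If your dataset uses different feature names, you can either use those keys directly in the rest of the code or rename the columns to match. A minimal sketch, assuming a hypothetical dataset that exposes “image” and “fine_label” instead:

partition = partition.rename_column("image", "img")  # hypothetical original name
partition = partition.rename_column("fine_label", "label")  # hypothetical original name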
Apply transforms and create the DataLoader. We will use Dataset.with_transform(). It works on the fly, meaning the transforms you specify are applied only when you access the data, which is also how transforms work in the PyTorch ecosystem. The last detail to note is that this function operates on batches of data (even if you select a single element, it is represented as a batch). That is why we iterate over all the samples in the batch and apply the transforms to each:
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
transforms = ToTensor()
def apply_transforms(batch):
    batch["img"] = [transforms(img) for img in batch["img"]]
    return batch
partition_torch = partition.with_transform(apply_transforms)
# You can verify that everything works by inspecting partition_torch[0]
dataloader = DataLoader(partition_torch, batch_size=64)
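Expanding on that check, here is a minimal sketch; the expected shape assumes the CIFAR10 setup above, where ToTensor yields a 3x32x32 tensor:

sample = partition_torch[0]  # the transform is applied lazily, on access
print(sample["img"].shape)   # expected: torch.Size([3, 32, 32])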
Alternatively, you can use the map() function. Note that this operation is applied immediately (in contrast to set_transform and with_transform, which are lazy). Remember that map will modify an existing feature if the key in the dictionary you return is already present, and append a new feature if it did not exist before. Below, we modify the "img" feature of our dataset:
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
transforms = ToTensor()
partition_torch = partition.map(
    lambda img: {"img": transforms(img)}, input_columns="img"
).with_format("torch")
dataloader = DataLoader(partition_torch, batch_size=64)
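The with_format("torch") call makes the dataset return PyTorch tensors when indexed, so the transformed images arrive ready for the training loop without further conversion.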
We advise you to keep the ToTensor() transform (especially if you used it in your PyTorch code) because it swaps the dimensions from (H x W x C) to (C x H x W), which is the order expected by models with convolutional layers.
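If your PyTorch pipeline also normalizes the images, you can compose the transforms before passing them to apply_transforms. A minimal sketch; the statistics below are the commonly cited CIFAR10 channel means and standard deviations, an assumption you should replace with values appropriate for your dataset:

from torchvision.transforms import Compose, Normalize, ToTensor

# Commonly cited CIFAR10 channel means/stds (assumed values; adjust for your data).
transforms = Compose([
    ToTensor(),
    Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])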
If you want to divide the dataset into train and test splits, you can use train_test_split() at any point before passing the dataset to the DataLoader:
partition_train_test = partition.train_test_split(test_size=0.2, seed=42)
partition_train = partition_train_test["train"]
partition_test = partition_train_test["test"]
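You can then wrap each split in its own DataLoader; the batch size and shuffling below are illustrative choices, not requirements:

train_dataloader = DataLoader(
    partition_train.with_transform(apply_transforms), batch_size=64, shuffle=True
)
test_dataloader = DataLoader(
    partition_test.with_transform(apply_transforms), batch_size=64
)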
Or you can simply calculate the indices yourself:
partition_len = len(partition)
# Split `partition` 80:20
num_train_examples = int(0.8 * partition_len)
partition_train = partition.select(range(num_train_examples))  # use the first 80%
partition_test = partition.select(range(num_train_examples, partition_len))  # use the last 20%
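Note that select() takes the examples in order, so if the partition happens to be sorted (e.g., by label), the two subsets will not be representative. A minimal safeguard is to shuffle with a fixed seed before slicing:

partition = partition.shuffle(seed=42)  # shuffle before taking contiguous slices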
Finally, one change is needed in the training loop. With a typical PyTorch dataloader, each iteration returns a list:
for batch in all_from_pytorch_dataloader:
images, labels = batch
# Or alternatively:
# images, labels = batch[0], batch[1]
With this dataset, you get a dictionary instead, and you access the data a little differently (by key rather than by index):
for batch in dataloader:
images, labels = batch["img"], batch["label"]