Use with Local Data#
You can partition your local files and Python objects in
Flower Datasets
library using any available Partitioner
.
This guide details how to create a Hugging Face Dataset which is the required type of input for Partitioners. We will cover:
local files: CSV, JSON, image, audio,
in-memory data: dictionary, list, pd.DataFrame, np.ndarray.
General Overview#
An all-in-one dataset preparation (downloading, preprocessing, partitioning) happens using FederatedDataset. However, we will use only the Partitioner here since we use locally accessible data.
The rest of this guide will explain how to create a Dataset from local files and existing (in memory) Python objects.
Local Files#
CSV#
from datasets import load_dataset
from flwr_datasets.partitioner import ChosenPartitioner
# Single file
data_files = "path-to-my-file.csv"
# Multiple Files
data_files = [ "path-to-my-file-1.csv", "path-to-my-file-2.csv", ...]
dataset = load_dataset("csv", data_files=data_files)
# Divided Dataset
data_files = {
"train": single_train_file_or_list_of_files,
"test": single_test_file_or_list_of_files,
"can-have-more-splits": ...
}
dataset = load_dataset("csv", data_files=data_files)
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
JSON#
from datasets import load_dataset
from flwr_datasets.partitioner import ChosenPartitioner
# Single file
data_files = "path-to-my-file.json"
# Multitple Files
data_files = [ "path-to-my-file-1.json", "path-to-my-file-2.json", ...]
dataset = load_dataset("json", data_files=data_files)
# Divided Dataset
data_files = {
"train": single_train_file_or_list_of_files,
"test": single_test_file_or_list_of_files,
"can-have-more-splits": ...
}
dataset = load_dataset("json", data_files=data_files)
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
Image#
You can create an image dataset in two ways:
give a path the directory
The directory needs to be structured in the following way: dataset-name/split/class/name. For example:
mnist/train/1/unique_name.png
mnist/train/1/unique_name.png
mnist/train/2/unique_name.png
...
mnist/test/1/unique_name.png
mnist/test/1/unique_name.png
mnist/test/2/unique_name.png
Then, the path you can give is ./mnist.
from datasets import load_dataset
from flwr_datasets.partitioner import ChosenPartitioner
# Directly from a directory
dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
create a dataset from a CSV/JSON file and cast the path column to Image.
from datasets import Image, load_dataset
from flwr_datasets.partitioner import ChosenPartitioner
dataset = load_dataset(...)
dataset = dataset.cast_column("path", Image())
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
Audio#
Analogously to the image datasets, there are two methods here:
give a path to the directory
from datasets import load_dataset
from flwr_datasets.partitioner import ChosenPartitioner
dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
create a dataset from a CSV/JSON file and cast the path column to Audio.
from datasets import Audio, load_dataset
from flwr_datasets.partitioner import ChosenPartitioner
dataset = load_dataset(...)
dataset = dataset.cast_column("path", Audio())
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
In-Memory#
From dictionary#
from datasets import Dataset
from flwr_datasets.partitioner import ChosenPartitioner
data = {"features": [1, 2, 3], "labels": [0, 0, 1]}
dataset = Dataset.from_dict(data)
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
From list#
from datasets import Dataset
from flwr_datasets.partitioner import ChosenPartitioner
my_list = [
{"features": 1, "labels": 0},
{"features": 2, "labels": 0},
{"features": 3, "labels": 1}
]
dataset = Dataset.from_list(my_list)
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
From pd.DataFrame#
from datasets import Dataset
from flwr_datasets.partitioner import ChosenPartitioner
data = {"features": [1, 2, 3], "labels": [0, 0, 1]}
df = pd.DataFrame(data)
dataset = Dataset.from_pandas(df)
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
From np.ndarray#
The np.ndarray will be first transformed to pd.DataFrame
from datasets import Dataset
from flwr_datasets.partitioner import ChosenPartitioner
data = np.array([[1, 2, 3], [0, 0, 1]]).T
# You can add the column names by passing columns=["features", "labels"]
df = pd.DataFrame(data)
dataset = Dataset.from_pandas(df)
partitioner = ChosenPartitioner(...)
partitioner.dataset = dataset
partition = partitioner.load_partition(partition_id=0)
Partitioner Details#
Partitioning is triggered automatically during the first load_partition
call.
You do not need to call any “do_partitioning” method.
Partitioner abstraction is designed to allow for a single dataset assignment.
partitioner.dataset = your_dataset
If you need to do the same partitioning on a different dataset, create a new Partitioner for that, e.g.:
from flwr_datasets.partitioner import IidPartitioner
iid_partitioner_for_mnist = IidPartitioner(num_partitions=10)
iid_partitioner_for_mnist.dataset = mnist_dataset
iid_partitioner_for_cifar = IidPartitioner(num_partitions=10)
iid_partitioner_for_cifar.dataset = cifar_dataset
More Resources#
If you are looking for more details or you have not found the format you are looking for, please visit the HuggingFace Datasets docs. This guide is based on the following ones: