Merge branch 'main' of github.com:adap/flower
* 'main' of github.com:adap/flower:
  Fix default contiguous value in IidPartitioner (#2406)
  Update FDS docs index (#2337)
  Add TensorFlow integration tests with FDS (#2350)
  Add paths specification to CI triggers for FDS (#2399)
  Add FDS how-to guides (#2332)
  Add Flower Datasets tests as GitHub workflow (#2345)
  Fix the reference API documentation (#2397)
  Add FDS tutorial docs (#2375)
  Update tutorial-series-what-is-federated-learning.ipynb (#2396)
  Fix missing square in proximal term of FedProx baseline (#2210)
tanertopal committed Sep 22, 2023
2 parents 0f5cc06 + b63b775 commit e90c5df
Showing 21 changed files with 990 additions and 36 deletions.
46 changes: 46 additions & 0 deletions .github/workflows/datasets.yml
@@ -0,0 +1,46 @@
name: Datasets

on:
  push:
    branches:
      - main
    paths:
      - "datasets/**"
  pull_request:
    branches:
      - main
    paths:
      - "datasets/**"

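# Cancel an in-progress run when a new commit is pushed to the same branch/PR.
# Pushes to main use the unique github.run_id as the group, so runs on main
# are never cancelled.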
concurrency:
  group: ${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_id || github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

defaults:
  run:
    working-directory: datasets

jobs:
  test_core:
    runs-on: ubuntu-22.04
    strategy:
      matrix:
        # The latest Python versions cached in the host image can be found here:
        # https://github.com/actions/runner-images/blob/main/images/linux/Ubuntu2204-Readme.md#python
        # In case of a mismatch, the job has to download and install Python.
        # Note: Due to a bug in actions/setup-python we have to put 3.10 in
        # quotes as it will otherwise assume 3.1
        python: [3.8, 3.9, '3.10']

    name: Python ${{ matrix.python }}

    steps:
      - uses: actions/checkout@v4
      - name: Bootstrap
        uses: ./.github/actions/bootstrap
        with:
          python-version: ${{ matrix.python }}
      - name: Install dependencies (mandatory only)
        run: python -m poetry install --all-extras
      - name: Test (formatting + unit tests)
        run: ./dev/test.sh
11 changes: 9 additions & 2 deletions baselines/fedprox/README.md
@@ -59,6 +59,13 @@ The following table shows the main hyperparameters for this baseline with their
To construct the Python environment, simply run:

```bash
# Set directory to use python 3.10 (install with `pyenv install <version>` if you don't have it)
pyenv local 3.10.12

# Tell poetry to use python3.10
poetry env use 3.10.12

# Install
poetry install
```

@@ -97,6 +104,6 @@ python -m fedprox.main --multirun mu=0.0,2.0 stragglers_fraction=0.0,0.5,0.9 '+r
python -m fedprox.main --config-name fedavg --multirun stragglers_fraction=0.0,0.5,0.9 '+repeat_num=range(5)'
```

-The above commands would generate results that you can plot and would look like:
+The above commands would generate results that you can plot and would look like the plot shown below. This plot was generated using the jupyter notebook in the `docs/` directory of this baseline after running the `--multirun` commands above.

-![](docs/FedProx_mnist.png)
+![](_static/FedProx_mnist.png)
Binary file modified baselines/fedprox/_static/FedProx_mnist.png
357 changes: 357 additions & 0 deletions baselines/fedprox/docs/viz_and_plot_results.ipynb

Large diffs are not rendered by default.

7 changes: 4 additions & 3 deletions baselines/fedprox/fedprox/models.py
@@ -55,8 +55,9 @@ class LogisticRegression(nn.Module):
As described in the Li et al., 2020 paper :
-    [Federated Optimization in Heterogeneous Networks]
-    (https://arxiv.org/pdf/1812.06127.pdf)
+    [Federated Optimization in Heterogeneous Networks] (
+    https://arxiv.org/pdf/1812.06127.pdf)
"""

def __init__(self, num_classes: int) -> None:
@@ -153,7 +154,7 @@ def _train_one_epoch( # pylint: disable=too-many-arguments
optimizer.zero_grad()
proximal_term = 0.0
for local_weights, global_weights in zip(net.parameters(), global_params):
-        proximal_term += (local_weights - global_weights).norm(2)
+        proximal_term += torch.square((local_weights - global_weights).norm(2))
loss = criterion(net(images), labels) + (proximal_mu / 2) * proximal_term
loss.backward()
optimizer.step()
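
For reference, the proximal term fixed by this hunk comes from the FedProx local objective in Li et al., 2020:

$$h_k(w; w^t) = F_k(w) + \frac{\mu}{2} \lVert w - w^t \rVert^2$$

The L2 norm must be squared; the previous code added the unsquared norm, which under-penalizes drift from the global weights.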
2 changes: 2 additions & 0 deletions baselines/fedprox/pyproject.toml
@@ -41,6 +41,8 @@ python = ">=3.10.0, <3.11.0"
flwr = { extras = ["simulation"], version = "1.5.0" }
hydra-core = "1.3.2"
matplotlib = "3.7.1"
+jupyter = "^1.0.0"
+pandas = "^2.0.3"
torch = { url = "https://download.pytorch.org/whl/cu117/torch-2.0.1%2Bcu117-cp310-cp310-linux_x86_64.whl"}
torchvision = { url = "https://download.pytorch.org/whl/cu117/torchvision-0.15.2%2Bcu117-cp310-cp310-linux_x86_64.whl"}

4 changes: 4 additions & 0 deletions datasets/dev/test.sh
@@ -2,6 +2,10 @@
set -e
cd "$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"/../

+# Append path to PYTHONPATH that makes flwr_tool.init_py_check discoverable
+PARENT_DIR=$(dirname "$(pwd)") # Go one dir up from flower/datasets
+export PYTHONPATH="${PYTHONPATH}:${PARENT_DIR}/src/py"

echo "=== test.sh ==="

echo "- Start Python checks"
16 changes: 16 additions & 0 deletions datasets/doc/source/how-to-disable-enable-progress-bar.rst
@@ -0,0 +1,16 @@
Disable/Enable Progress Bar
===========================

You will see a progress bar by default when you download a dataset or apply a map function. Here is how you control
this behavior.

Disable::

from datasets.utils.logging import disable_progress_bar
disable_progress_bar()

Enable::

from datasets.utils.logging import enable_progress_bar
enable_progress_bar()

46 changes: 46 additions & 0 deletions datasets/doc/source/how-to-install-flwr-datasets.rst
@@ -0,0 +1,46 @@
Installation
============

Python Version
--------------

Flower Datasets requires `Python 3.8 <https://docs.python.org/3.8/>`_ or above.
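
If you are unsure which Python version your environment uses, here is a quick check (a convenience snippet, not part of Flower Datasets):

.. code-block:: python

  import sys
  assert sys.version_info >= (3, 8), sys.version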


Install stable release (pip)
----------------------------

Stable releases are available on `PyPI <https://pypi.org/project/flwr_datasets/>`_:

.. code-block:: bash

  python -m pip install flwr-datasets

For vision datasets (e.g. MNIST, CIFAR10) ``flwr-datasets`` should be installed with the ``vision`` extra:

.. code-block:: bash

  python -m pip install flwr_datasets[vision]

For audio datasets (e.g. Speech Command) ``flwr-datasets`` should be installed with the ``audio`` extra:

.. code-block:: bash

  python -m pip install flwr_datasets[audio]

Verify installation
-------------------

The following command can be used to verify whether Flower Datasets was successfully installed:

.. code-block:: bash

  python -c "import flwr_datasets;print(flwr_datasets.__version__)"

If everything worked, it should print the version of Flower Datasets to the command line:

.. code-block:: none

  0.0.1
61 changes: 61 additions & 0 deletions datasets/doc/source/how-to-use-with-numpy.rst
@@ -0,0 +1,61 @@
Use with NumPy
==============

Let's integrate ``flwr-datasets`` with NumPy.

Prepare the desired partitioning::

from flwr_datasets import FederatedDataset

fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
partition = fds.load_partition(0, "train")
centralized_dataset = fds.load_full("test")

Transform to NumPy::

partition_np = partition.with_format("numpy")
X_train, y_train = partition_np["img"], partition_np["label"]

That's all. Let's check the dimensions and data types of our ``X_train`` and ``y_train``::

print(f"The shape of X_train is: {X_train.shape}, dtype: {X_train.dtype}.")
print(f"The shape of y_train is: {y_train.shape}, dtype: {y_train.dtype}.")

You should see::

The shape of X_train is: (500, 32, 32, 3), dtype: uint8.
The shape of y_train is: (500,), dtype: int64.

Note that the ``X_train`` values are of type ``uint8``. This is not a problem for a TensorFlow model when passing the
data as input, but it is a good reminder to normalize the data - with global normalization, per-channel normalization,
or by simply rescaling the values to the [0, 1] range::

X_train = (X_train - X_train.mean()) / X_train.std() # Global normalization

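The other two options mentioned above could look as follows (a quick sketch; pick one approach only)::

    # Per-channel normalization: per-RGB-channel statistics over all images
    means = X_train.mean(axis=(0, 1, 2), keepdims=True)
    stds = X_train.std(axis=(0, 1, 2), keepdims=True)
    X_train = (X_train - means) / stds

    # Or simply rescale to the [0, 1] range
    X_train = X_train / 255.0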

CNN Keras model
---------------
Here's a quick example of how you can use that data with a simple CNN model::

import tensorflow as tf
from tensorflow.keras import datasets, layers, models

model = models.Sequential([
layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
layers.MaxPooling2D(2, 2),
layers.Conv2D(64, (3, 3), activation='relu'),
layers.MaxPooling2D(2, 2),
layers.Conv2D(64, (3, 3), activation='relu'),
layers.Flatten(),
layers.Dense(64, activation='relu'),
layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, batch_size=64)

You should see about 98% accuracy on the training data at the end of the training.

Note that we used ``"sparse_categorical_crossentropy"``. Make sure to keep it that way if you don't want to one-hot-encode
the labels.
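
If you prefer one-hot-encoded labels instead, here is a minimal sketch of the alternative (reusing ``model``, ``X_train``, and ``y_train`` from above)::

    # Convert integer labels to one-hot vectors and switch the loss accordingly
    y_train_onehot = tf.keras.utils.to_categorical(y_train, num_classes=10)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(X_train, y_train_onehot, epochs=20, batch_size=64)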
67 changes: 67 additions & 0 deletions datasets/doc/source/how-to-use-with-pytorch.rst
@@ -0,0 +1,67 @@
Use with PyTorch
================

Let's integrate ``flwr-datasets`` with PyTorch DataLoaders and keep your PyTorch transforms applied to the data.

Standard setup - download the dataset, choose the partitioning::

from flwr_datasets import FederatedDataset

fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
partition = fds.load_partition(0, "train")
centralized_dataset = fds.load_full("test")

Determine the names of the features (you can alternatively check them directly on the Hugging Face website). The names can
vary, e.g. "img" or "image", "label" or "labels"::

partition.features

In the case of CIFAR10, you should see the following output:

.. code-block:: none

  {'img': Image(decode=True, id=None),
   'label': ClassLabel(names=['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog',
                              'frog', 'horse', 'ship', 'truck'], id=None)}

Apply the transforms and create the DataLoader. We will use the `map() <https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Dataset.map>`_
function. Please note that ``map`` will modify an existing feature if the key in the dictionary you return is already present,
and append a new feature if it did not exist before. Below, we modify the "img" feature of our dataset::

from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor

transforms = ToTensor()
partition_torch = partition.map(
lambda img: {"img": transforms(img)}, input_columns="img"
).with_format("torch")
dataloader = DataLoader(partition_torch, batch_size=64)

We advise you to keep the
`ToTensor() <https://pytorch.org/vision/stable/generated/torchvision.transforms.ToTensor.html>`_ transform (especially if
you used it in your PyTorch code) because it swaps the dimensions from (H x W x C) to (C x H x W). This order is
expected by a model with a convolutional layer.

If you want to divide the dataset, you can use (at any point before passing the dataset to the DataLoader)::

partition_train_test = partition.train_test_split(test_size=0.2)
partition_train = partition_train_test["train"]
partition_test = partition_train_test["test"]

Or you can simply calculate the indices yourself::

partition_len = len(partition)
partition_train = partition[:int(0.8 * partition_len)]
partition_test = partition[int(0.8 * partition_len):]

During the training loop, you need to make one change. With a typical dataloader, you get a list returned for each iteration::

for batch in all_from_pytorch_dataloader:
images, labels = batch
# Or alternatively:
# images, labels = batch[0], batch[1]

With this dataset, you get a dictionary instead, and you access the data a little differently (by keys, not by index)::

for batch in dataloader:
images, labels = batch["img"], batch["label"]
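
Putting it together, a minimal training-loop sketch (assuming a ``model``, ``criterion``, and ``optimizer`` defined as in typical PyTorch code)::

    for batch in dataloader:
        # Access the batch by keys instead of by index
        images, labels = batch["img"], batch["label"]
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()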

74 changes: 74 additions & 0 deletions datasets/doc/source/how-to-use-with-tensorflow.rst
@@ -0,0 +1,74 @@
Use with TensorFlow
===================

Let's integrate ``flwr-datasets`` with TensorFlow. We show three ways to convert the data into the formats
that TensorFlow models expect. Please note that, especially for smaller datasets, the performance of the
following methods is very similar. We recommend choosing the method you are most comfortable with.

NumPy
-----
The first way is to transform the data into NumPy arrays. It's a straightforward option that is commonly used. Feel free to
follow the :doc:`how-to-use-with-numpy` tutorial, especially if you are a beginner.

.. _tensorflow-dataset:

TensorFlow Dataset
------------------
Work with the ``TensorFlow Dataset`` abstraction.

Standard setup::

from flwr_datasets import FederatedDataset

fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
partition = fds.load_partition(0, "train")
centralized_dataset = fds.load_full("test")

Transformation to the TensorFlow Dataset::

tf_dataset = partition.to_tf_dataset(columns="img", label_cols="label", batch_size=64,
shuffle=True)
# Assuming you have defined your model and compiled it
model.fit(tf_dataset, epochs=20)

TensorFlow Tensors
------------------
Change the data type to TensorFlow Tensors (note that this is not the TensorFlow Dataset).

Standard setup::

from flwr_datasets import FederatedDataset

fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})
partition = fds.load_partition(0, "train")
centralized_dataset = fds.load_full("test")

Transformation to TensorFlow Tensors::

data_tf = partition.with_format("tf")
# Assuming you have defined your model and compiled it
model.fit(data_tf["img"], data_tf["label"], epochs=20, batch_size=64)

CNN Keras Model
---------------
Here's a quick example of how you can use that data with a simple CNN model (it assumes you created the TensorFlow
dataset as in the section above, see :ref:`TensorFlow Dataset <tensorflow-dataset>`)::

import tensorflow as tf
from tensorflow.keras import datasets, layers, models

model = models.Sequential([
layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
layers.MaxPooling2D(2, 2),
layers.Conv2D(64, (3, 3), activation='relu'),
layers.MaxPooling2D(2, 2),
layers.Conv2D(64, (3, 3), activation='relu'),
layers.Flatten(),
layers.Dense(64, activation='relu'),
layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
model.fit(tf_dataset, epochs=20)
