This example runs a credit card fraud detection algorithm on the Swarm Learning platform. It uses Keras and TensorFlow.
This example uses a subset of the data from [1]. The subset is balanced: it contains an equal (50:50) number of fraud and non-fraud cases.
This example uses four training batches and one test batch. These files are in the examples/fraud-detection/data-and-scratch/app-data directory.
NOTE: For the data license associated with this dataset, see /examples/fraud-detection/data-and-scratch/app-data/data_license.md.
The ML program, after conversion to Swarm Learning, is in the examples/fraud-detection/model directory and is called fraud-detection.py.
This example shows Swarm training of the credit card fraud detection model using four ML nodes. The SWOP node automatically spawns the ML nodes along with their SL nodes, and all nodes run on a single host. The SWCI node initiates Swarm training, and one SN node running on the same host orchestrates it. This example also shows how to mount private data, a private scratch area, and a shared model into the ML nodes for Swarm training. For more information, see the profile files and task definition files under examples/fraud-detection/swop and examples/fraud-detection/swci.
The following image illustrates a cluster setup that uses only one host:
- This example uses one SN node. The Docker container representing this node is named SN1. SN1 is also the Sentinel node. SN1 runs on host 172.1.1.1.
- The SWOP node automatically spawns four ML nodes and their SL nodes during training and removes them after training. This example uses one SWOP node that connects to the SN node. The Docker container representing this SWOP node is named SWOP1. SWOP1 runs on host 172.1.1.1.
- The SWCI node (SWCI1), which runs on host 172.1.1.1, initiates training.
- This example assumes that the License Server is already running on host 172.1.1.1. All Swarm nodes connect to the License Server on its default port, 5814.
- On host-1 (172.1.1.1), navigate to the swarm-learning folder (that is, the parent of the examples directory).
cd swarm-learning
- On host-1, create a temporary workspace directory and copy the fraud-detection example and the gen-cert utility into it.
mkdir workspace
cp -r examples/fraud-detection workspace/
cp -r examples/utils/gen-cert workspace/fraud-detection/
- This example uses a separate private data-and-scratch directory for each user (ML node). Create a directory for each user and copy the data-and-scratch directory into it (the last command moves it instead of copying). Running this example creates a scratch directory for each user and saves the trained Swarm model in that directory at the end of training.
mkdir workspace/fraud-detection/user1 workspace/fraud-detection/user2
mkdir workspace/fraud-detection/user3 workspace/fraud-detection/user4
cp -r workspace/fraud-detection/data-and-scratch workspace/fraud-detection/user1/
cp -r workspace/fraud-detection/data-and-scratch workspace/fraud-detection/user2/
cp -r workspace/fraud-detection/data-and-scratch workspace/fraud-detection/user3/
mv workspace/fraud-detection/data-and-scratch workspace/fraud-detection/user4/
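The mkdir, cp, and mv commands above follow one pattern, so they can also be written as a loop. The sketch below is a hypothetical POSIX shell helper (make_user_dirs is not part of Swarm Learning) that copies data-and-scratch into user1 to user(N-1) and moves the original into userN, matching the commands above.

```shell
# Hypothetical helper: give each ML node (user1..userN) its own private copy
# of data-and-scratch, mirroring the mkdir/cp/mv commands above.
# $1: example directory, for example workspace/fraud-detection
# $2: number of users (ML nodes); 4 in this example
make_user_dirs() {
  base=$1
  n=$2
  i=1
  # Copy data-and-scratch into user1 .. user(N-1).
  while [ "$i" -lt "$n" ]; do
    mkdir -p "$base/user$i"
    cp -r "$base/data-and-scratch" "$base/user$i/"
    i=$((i + 1))
  done
  # Move the original into the last user's directory, as the mv command does.
  mkdir -p "$base/user$n"
  mv "$base/data-and-scratch" "$base/user$n/"
}

# Usage, from the swarm-learning directory:
#   make_user_dirs workspace/fraud-detection 4
```

A loop keeps the user count in one place if you change the number of ML nodes, although the SWOP profile and task definition files would also need matching changes.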
- Run the gen-cert utility to generate certificates for each Swarm component, using the command gen-cert -e <EXAMPLE-NAME> -i <HOST-INDEX>.
./workspace/fraud-detection/gen-cert -e fraud-detection -i 1
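Before starting any node, it can help to confirm that the certificate files expected by the run commands in this example are present. This is a minimal sketch: check_certs is a hypothetical helper, and its file list is taken only from the --key, --cert, and --capath options used for SN1, SWOP1, and SWCI1 below. gen-cert may produce other files as well; those are not checked.

```shell
# Hypothetical helper: check for the certificate files referenced by the
# run-sn, run-swop, and run-swci commands in this example.
# $1: certificate directory, for example workspace/fraud-detection/cert
check_certs() {
  dir=$1
  missing=0
  for f in sn-1-key.pem sn-1-cert.pem swop-1-key.pem swop-1-cert.pem \
           swci-1-key.pem swci-1-cert.pem; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  # The CA path is passed as a directory (--capath).
  if [ ! -d "$dir/ca/capath" ]; then
    echo "missing: $dir/ca/capath" >&2
    missing=1
  fi
  return "$missing"
}

# Usage, from the swarm-learning directory:
#   check_certs workspace/fraud-detection/cert
```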
- Search and replace all occurrences of the <CURRENT-PATH> tag in the swarm_fd_task.yaml and swop1_profile.yaml files with $(pwd), the path of the current (swarm-learning) directory.
sed -i "s+<CURRENT-PATH>+$(pwd)+g" workspace/fraud-detection/swop/swop*_profile.yaml workspace/fraud-detection/swci/taskdefs/swarm_fd_task.yaml
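A leftover <CURRENT-PATH> tag is easy to miss, so a quick grep after the sed command can confirm the substitution. check_placeholders below is a hypothetical helper, not part of Swarm Learning; it relies only on grep -l, which prints the names of files that contain a match.

```shell
# Hypothetical helper: print the files that still contain <CURRENT-PATH>.
# Returns 0 when no file contains the tag, 1 when at least one does,
# and 2 when grep reports an error (for example, a missing file).
check_placeholders() {
  grep -l '<CURRENT-PATH>' "$@"
  case $? in
    0) echo "<CURRENT-PATH> is still present in the files listed above" >&2
       return 1 ;;
    1) return 0 ;;
    *) return 2 ;;
  esac
}

# Usage, from the swarm-learning directory:
#   check_placeholders workspace/fraud-detection/swop/swop*_profile.yaml \
#     workspace/fraud-detection/swci/taskdefs/swarm_fd_task.yaml
```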
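- Create a Docker volume and copy the Swarm Learning wheel file into it. A temporary helper container mounts the volume so that docker cp can place the file in it.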
docker volume create sl-cli-lib
docker container create --name helper -v sl-cli-lib:/data hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sn:1.0.0
docker cp -L lib/swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl helper:/data
docker rm helper
- Create a Docker network for the SN, SWOP, SWCI, SL, and user containers running on the same host.
docker network create host-1-net
- Run the Swarm Network node (SN1), which is the Sentinel node.
./scripts/bin/run-sn -d --rm --name=sn1 \
--network=host-1-net --host-ip=sn1 --sentinel \
--key=workspace/fraud-detection/cert/sn-1-key.pem \
--cert=workspace/fraud-detection/cert/sn-1-cert.pem \
--capath=workspace/fraud-detection/cert/ca/capath \
--apls-ip=172.1.1.1
Use the docker logs command (for example, docker logs sn1) to monitor the Sentinel SN node, and wait for the node to finish initializing. The Sentinel node is ready when the following message appears in the log output:
swarm.blCnt : INFO : Starting SWARM-API-SERVER on port: 30304
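Instead of re-running docker logs by hand, you can poll the log until the ready message appears. This sketch assumes the container name sn1 from the run-sn command above; wait_for_pattern and the 600-second timeout in the usage line are illustrative choices, not Swarm Learning defaults. docker logs may write part of a container's output to stderr, so the helper merges both streams before searching. The same approach could watch an ML node's log for the end-of-training message, using a container name taken from docker ps.

```shell
# Hypothetical helper: re-run a log command until its output contains a
# fixed string, or give up after roughly the given number of seconds.
# $1: fixed string to look for; $2: timeout in seconds; remaining args: command
wait_for_pattern() {
  pattern=$1
  timeout=$2
  shift 2
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # Merge stderr into stdout, then search for the string literally (-F).
    if "$@" 2>&1 | grep -qF -- "$pattern"; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "timed out after ${timeout}s waiting for: $pattern" >&2
  return 1
}

# Usage (assumes the container name sn1 used above):
#   wait_for_pattern "Starting SWARM-API-SERVER on port: 30304" 600 docker logs sn1
```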
- Run the Swarm Operator node (SWOP1).
NOTE: If your environment requires it, modify the IP address and proxy settings in the profile files under the workspace/fraud-detection/swop folder.
./scripts/bin/run-swop -d --rm --name=swop1 \
--network=host-1-net --usr-dir=workspace/fraud-detection/swop \
--profile-file-name=swop1_profile.yaml \
--key=workspace/fraud-detection/cert/swop-1-key.pem \
--cert=workspace/fraud-detection/cert/swop-1-cert.pem \
--capath=workspace/fraud-detection/cert/ca/capath \
-e http_proxy= -e https_proxy= --apls-ip=172.1.1.1
- Run the SWCI node (SWCI1). It creates, finalizes, and assigns the following tasks to the task framework for sequential execution:
  - user_env_tf_build_task: Builds a TensorFlow-based Docker image for the ML nodes to run model training.
  - swarm_fd_task: Creates containers from the ML image and mounts the model and data paths to run Swarm training.
NOTE: If your environment requires it, modify the SN IP address in the workspace/fraud-detection/swci/swci-init file.
./scripts/bin/run-swci -ti --rm --name=swci1 \
--network=host-1-net --usr-dir=workspace/fraud-detection/swci \
--init-script-name=swci-init \
--key=workspace/fraud-detection/cert/swci-1-key.pem \
--cert=workspace/fraud-detection/cert/swci-1-cert.pem \
--capath=workspace/fraud-detection/cert/ca/capath \
-e http_proxy= -e https_proxy= --apls-ip=172.1.1.1
- The four ML nodes and their SL nodes start automatically when the run task (swarm_fd_task) is assigned and executed. Open a new terminal on host-1 and monitor the Docker logs of the ML nodes. Swarm training ends with the following log message:
SwarmCallback : INFO : All peers and Swarm training rounds finished. Final Swarm model was loaded.
The final Swarm model is saved inside each user’s private scratch directory, workspace/fraud-detection/user<id>/data-and-scratch/scratch, on host-1. All dynamically spawned SL and ML nodes exit after Swarm training. The SN and SWOP nodes continue to run.
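After training, you can confirm that every user's scratch directory received output. This section does not name the saved model file, so the hypothetical check_scratch helper below only checks that each user<id>/data-and-scratch/scratch directory exists and is not empty.

```shell
# Hypothetical helper: report whether each user<id> scratch directory
# contains anything. It does not inspect file names or formats.
# $1: example directory, for example workspace/fraud-detection
# $2: number of users (ML nodes); 4 in this example
check_scratch() {
  base=$1
  n=$2
  status=0
  i=1
  while [ "$i" -le "$n" ]; do
    dir="$base/user$i/data-and-scratch/scratch"
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir")" ]; then
      echo "user$i: scratch directory is not empty ($dir)"
    else
      echo "user$i: nothing found in $dir" >&2
      status=1
    fi
    i=$((i + 1))
  done
  return "$status"
}

# Usage, from the swarm-learning directory:
#   check_scratch workspace/fraud-detection 4
```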
- To clean up, run the scripts/bin/stop-swarm script on host-1 to stop and remove the container nodes of the previous run. If you need the container logs, back them up before running stop-swarm: the nodes were started with --rm, so their logs are removed together with the containers. Then, if required, remove the Docker network (host-1-net) and Docker volume (sl-cli-lib), and delete the workspace directory.
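The clean-up step can be collected into one function. This is a minimal sketch, assuming it runs on host-1 from the swarm-learning directory and that only sn1 and swop1 are still running, since the SL and ML nodes have already exited. The function names and the DRY_RUN switch are hypothetical additions; with DRY_RUN=1 the function only prints the commands it would run. It saves the logs before calling stop-swarm for the --rm reason given above.

```shell
# Hypothetical switch: with DRY_RUN set, print each command instead of running it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper for the clean-up step of this single-host example.
# $1: directory for the saved logs (defaults to swarm-logs)
cleanup_fraud_detection() {
  logdir=${1:-swarm-logs}
  # Save logs first: sn1 and swop1 were started with --rm, so their logs
  # are removed together with the containers.
  for c in sn1 swop1; do
    if [ -n "${DRY_RUN:-}" ]; then
      printf 'docker logs %s > %s/%s.log\n' "$c" "$logdir" "$c"
    else
      mkdir -p "$logdir"
      docker logs "$c" > "$logdir/$c.log" 2>&1
    fi
  done
  run ./scripts/bin/stop-swarm
  run docker network rm host-1-net
  run docker volume rm sl-cli-lib
  run rm -rf workspace
}

# Usage, from the swarm-learning directory on host-1:
#   DRY_RUN=1 cleanup_fraud_detection   # preview the commands
#   cleanup_fraud_detection             # run them
```

Previewing with DRY_RUN=1 first is worthwhile because the last command deletes the whole workspace directory, including the trained models in each user's scratch directory.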