HFL-GLM Model Training with a Sample Local Dataset (iai_glm)

An end-to-end tutorial for how to train an existing model with a synthetic dataset on a local machine.

To help you get started, we provide a tutorial based on synthetic data with pre-built configuration files. In this tutorial, you will train a generalized linear model (iai_glm) using data from two datasets. The datasets, model, and data configuration are provided for you.

The sample notebook (integrateai_HFL.ipynb) contains runnable code snippets for exploring the SDK and should be used in parallel with this tutorial. This documentation provides supplementary and conceptual information to expand on the code demonstration.

Before you get started, complete the environment setup for your local machine.

Open the integrateai_HFL.ipynb notebook to test the code as you walk through this exercise.

Understanding Models

model_config = {
    "experiment_name": "test_synthetic_tabular",
    "experiment_description": "test_synthetic_tabular",
    "strategy": {
        "name": "FedAvg",  # Name of the federated learning strategy
        "params": {},
    },
    "model": {  # Parameters specific to the model type
        "params": {
            "input_size": 15,
            "output_activation": "sigmoid",
        },
    },
    "balance_train_datasets": False,  # Performs undersampling on the dataset
    "ml_task": {
        "type": "logistic",
        "params": {
            "alpha": 0,
            "l1_ratio": 0,
        },
    },
    "optimizer": {
        "name": "SGD",
        "params": {
            "learning_rate": 0.05,
            "momentum": 0.9,
        },
    },
    "differential_privacy_params": {
        "epsilon": 4,
        "max_grad_norm": 7,
    },
}

integrate.ai has a standard model class available for Generalized Linear Models (iai_glm). These standard models are defined using JSON configuration files during session creation.

The model configuration (model_config) is a JSON object that contains the model parameters for the session. There are five main properties with specific key-value pairs used to configure the model:

  • strategy - Select one of the available federated learning strategies from the strategy library.

  • model - Defines the specific parameters required for the model type. For iai_glm, output_activation should be set to the inverse link function. For example, sigmoid for the logit link and exp for the log link. Currently supported values are sigmoid, exp, and tanh.

  • ml_task - Defines the federated learning task and associated parameters. The following types are supported: logistic, normal, poisson, gamma, tweedie, and inverseGaussian. The tweedie task has an additional power parameter that controls the underlying target distribution. Choose the ml_task type based on the type of the target variable; for example, logistic for a binary target and poisson for counts.

  • optimizer - Defines the parameters for the PyTorch optimizer.

  • differential_privacy_params - Defines the differential privacy parameters. See Differential Privacy for more information.

The example in the notebook is a model provided by integrate.ai. For this tutorial, you do not need to change any of the values.
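To make the output_activation setting concrete, the pairing between an ml_task type and its inverse link function follows standard GLM theory. The sketch below is illustrative only; the string values match the supported activations listed above, and the sigmoid function is written out to show what the model applies to its linear predictor:

```python
import math

# Standard GLM pairings: ml_task type -> inverse link (output_activation).
inverse_links = {
    "logistic": "sigmoid",  # logit link; sigmoid is its inverse
    "poisson": "exp",       # log link; exp is its inverse
}

def sigmoid(z: float) -> float:
    """Inverse of the logit link: maps a linear predictor to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# A linear predictor of 0 corresponds to a probability of 0.5.
print(sigmoid(0.0))  # 0.5
```

This is why the sample config pairs "type": "logistic" with "output_activation": "sigmoid": the activation undoes the link to produce predictions on the scale of the target.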

data_config = {
    "predictors": ["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11", "x12", "x13", "x14"],
    "target": "y",
}

The data configuration (data_config) is a JSON object that specifies the predictor and target columns used to describe the input data. The same structure is used for both GLM and FNN models.
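One constraint worth checking before you start a session: input_size in the model configuration must equal the number of predictor columns in the data configuration. A minimal sketch of that check, using the values from the configs above:

```python
# data_config from the tutorial: 15 predictors (x0 .. x14) and one target.
data_config = {
    "predictors": [f"x{i}" for i in range(15)],
    "target": "y",
}

# input_size from model_config["model"]["params"]["input_size"].
input_size = 15

# The model's input layer must match the number of predictor columns.
assert len(data_config["predictors"]) == input_size
print(len(data_config["predictors"]))  # 15
```

A mismatch here will surface as a shape error at training time, so it is cheaper to catch it when building the configs.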

Once you have created or updated the model and data, the next step is to create a training session to begin working with the model and datasets.

Create and start the training session

session = client.create_fl_session(
    name="Testing notebook",
    description="I am testing session creation through a notebook",
    min_num_clients=2,
    num_rounds=2,
    package_name="iai_glm",
    model_config=model_config,
    data_config=data_config,
).start()

session.id

Federated learning models created in integrate.ai are trained through sessions. You define the parameters required to train a federated model, including data and model configurations, in a session. Create a session each time you want to train a new model.

The code sample demonstrates creating and starting a session with two training datasets (specified as min_num_clients) and two rounds (num_rounds). It returns a session ID that you can use to track and reference your session.

The package_name specifies the federated learning model package. In this example it is iai_glm; other packages are also supported. See Model packages for more information.

Join clients to the session

import subprocess

data_path = "~/Downloads/synthetic"

# Join dataset one (silo0)
client_1 = subprocess.Popen(
    f"iai client train --token {IAI_TOKEN} --session {session.id} --train-path {data_path}/train_silo0.parquet --test-path {data_path}/test.parquet --batch-size 1024 --client-name client-1 --remove-after-complete",
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

# Join dataset two (silo1)
client_2 = subprocess.Popen(
    f"iai client train --token {IAI_TOKEN} --session {session.id} --train-path {data_path}/train_silo1.parquet --test-path {data_path}/test.parquet --batch-size 1024 --client-name client-2 --remove-after-complete",
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

The next step is to join the session with the sample data. This example has data for two datasets simulating two clients, as specified with the min_num_clients argument. Therefore, to run this example, you will call subprocess.Popen twice to connect each dataset to the session as a separate client.

The session begins training once the minimum number of clients have joined the session.

Each client runs as a separate Docker container to simulate distributed data silos.

  • data_path is the path to the sample data on your local machine

  • IAI_TOKEN is your access token

  • session.id is the ID returned by the previous step

  • train-path is the path to and name of the sample dataset file
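Because the two Popen calls differ only in the training file and the client name, you can factor the command string into a small helper. This is a convenience sketch, not part of the SDK; the flags match the CLI invocation shown above, and the token and session ID come from the earlier steps:

```python
def build_train_command(token: str, session_id: str, data_path: str,
                        silo: str, client_name: str) -> str:
    """Assemble the `iai client train` command for one client."""
    return (
        f"iai client train --token {token} --session {session_id} "
        f"--train-path {data_path}/train_{silo}.parquet "
        f"--test-path {data_path}/test.parquet "
        f"--batch-size 1024 --client-name {client_name} "
        f"--remove-after-complete"
    )

# Example: the command for the first client (placeholder token/session).
cmd = build_train_command("TOKEN", "SESSION", "~/Downloads/synthetic",
                          "silo0", "client-1")
print(cmd)
```

Keeping the command in one place makes it easier to add a third silo or change the batch size without editing two nearly identical strings.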

Poll for session results

import time

current_round = None
current_status = None
while client_1.poll() is None or client_2.poll() is None:
    output1 = client_1.stdout.readline().decode("utf-8").strip()
    output2 = client_2.stdout.readline().decode("utf-8").strip()
    if output1:
        print("silo1: ", output1)
    if output2:
        print("silo2: ", output2)

    # poll for status and round
    if current_status != session.status:
        print("Session status: ", session.status)
        current_status = session.status
    if current_round != session.round and session.round > 0:
        print("Session round: ", session.round)
        current_round = session.round
    time.sleep(1)

output1, error1 = client_1.communicate()
output2, error2 = client_2.communicate()

print(
    "client_1 finished with return code: %d\noutput: %s\n  %s"
    % (client_1.returncode, output1.decode("utf-8"), error1.decode("utf-8"))
)
print(
    "client_2 finished with return code: %d\noutput: %s\n  %s"
    % (client_2.returncode, output2.decode("utf-8"), error2.decode("utf-8"))
)

Depending on the type of session and the size of the datasets, sessions may take some time to run. In the sample notebook and this tutorial, we poll the server to determine the session status.

You can log information about the session during this time. In this example, we log each client's output along with the current session status and round.

Another popular option is to log the session.metrics().as_dict() to view the in-progress training metrics.
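For example, you might pull the most recent loss out of that dictionary once at least one round has completed. The dictionary shape below is illustrative only; the exact structure returned by session.metrics().as_dict() may differ, so treat this as a sketch of the pattern rather than the real payload:

```python
# Illustrative metrics structure; the real as_dict() layout may differ.
metrics = {
    "rounds": [
        {"round": 1, "test_loss": 0.693},
        {"round": 2, "test_loss": 0.541},
    ]
}

def latest_loss(metrics: dict) -> float:
    """Return the test loss from the most recent round."""
    return max(metrics["rounds"], key=lambda r: r["round"])["test_loss"]

print(latest_loss(metrics))  # 0.541
```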

Session Complete

Congratulations, you have your first federated model! You can test it by making predictions. For more information, see Making Predictions.

Back to HFL model overview