Generalized Linear Models for VFL (VFL-GLM)

The AWS task runner sample notebook (integrateai_taskrunner_AWS.ipynb) provides sample code for running the SDK, and should be used in parallel with this tutorial. This documentation provides supplementary and conceptual information.

Prerequisites

  1. Open the integrateai_taskrunner_AWS.ipynb notebook to test the code as you walk through this tutorial.

  2. Make sure you have downloaded the sample files for VFL: https://s3.ca-central-1.amazonaws.com/public.s3.integrate.ai/integrate_ai_examples/vfl.zip and uploaded them to your S3 bucket.

  3. Define the dataset paths as shown in the sample notebook.

Note: By default the task runner creates a bucket for you to upload data into (e.g. s3://{aws_taskrunner_profile}-{aws_taskrunner_name}.integrate.ai ). Only the default S3 bucket and other buckets ending in *integrate.ai are supported. If you are not using the default bucket created by the task runner when it was provisioned, ensure that your data is hosted in an S3 bucket with a URL ending in *integrate.ai. Otherwise, the data will not be accessible to the task runner.

aws_taskrunner_profile = "<myworkspace>" # This is your workspace name
aws_taskrunner_name = "<mytaskrunner>" # Task runner name - must match what was supplied in UI to create task runner

base_aws_bucket = f'{aws_taskrunner_profile}-{aws_taskrunner_name}.integrate.ai'

# Example datapaths. Make sure that the data you want to work with exists in the base_aws_bucket for your task runner.
# Sample PRL/VFL datapaths
active_train_path = f's3://{base_aws_bucket}/vfl/active_train.parquet'
active_test_path = f's3://{base_aws_bucket}/vfl/active_test.parquet'
passive_train_path = f's3://{base_aws_bucket}/vfl/passive_train.parquet'
passive_test_path = f's3://{base_aws_bucket}/vfl/passive_test.parquet'

# Where to store the trained model
aws_storage_path = f's3://{base_aws_bucket}/model'

# Where to store VFL predictions - must be the full path and file name
vfl_predict_active_storage_path = f's3://{base_aws_bucket}/vfl_predict/active_predictions.csv'
vfl_predict_passive_storage_path = f's3://{base_aws_bucket}/vfl_predict/passive_predictions.csv'
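
Because only buckets ending in integrate.ai are accessible to the task runner, it can help to check each path before starting a session. The following is a minimal sketch of such a check, not part of the SDK: the helper name is_supported_s3_path and the example paths are hypothetical, and the rule is reduced to a simple suffix test on the bucket name.

```python
from urllib.parse import urlparse

# Assumption: the rule from the note above, expressed as a plain suffix check.
SUPPORTED_BUCKET_SUFFIX = "integrate.ai"


def is_supported_s3_path(path: str) -> bool:
    """Return True if path is an s3:// URL whose bucket name ends in integrate.ai."""
    parsed = urlparse(path)
    return parsed.scheme == "s3" and parsed.netloc.endswith(SUPPORTED_BUCKET_SUFFIX)


# Hypothetical example values, not real buckets.
example_paths = [
    "s3://myworkspace-mytaskrunner.integrate.ai/vfl/active_train.parquet",
    "s3://some-other-bucket/vfl/active_train.parquet",
]
for p in example_paths:
    verdict = "ok" if is_supported_s3_path(p) else "not accessible to the task runner"
    print(p, "->", verdict)
```

In the notebook, the same function could be applied to active_train_path, passive_train_path, and the other paths defined above.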

Run a PRL session to align the datasets for VFL-GLM

Before you run a VFL session, you must first run a PRL session to align the datasets.

For more information, see Private Record Linkage (PRL) Sessions.

Review the sample data and model configuration for VFL-GLM

Model Config for VFL-GLM

integrate.ai has a model class available for Generalized Linear Models, called iai_glm. This model is defined using a JSON configuration file during session creation (model_config).

model_config = {
    "strategy": {"name": "VflGlm", "params": {}},
    "model": {
        "passive_client": {"params": {"input_size": 7, "output_activation": "sigmoid"}},
        "active_client": {"params": {"input_size": 8, "output_activation": "sigmoid"}},
    },
    "ml_task": {
        "type": "logistic",
        "params": {},
    },
    "optimizer": {"name": "SGD", "params": {"learning_rate": 0.2, "momentum": 0.0}},
    "seed": 23,  # for reproducibility
}

You can adjust the following model_config parameters as needed:

  • strategy - The aggregation strategy for federated training. Use the same value as the sample config.

  • model - The model structure for participating clients, with a separate dictionary for each client, keyed by the client name.

    • input_size: the number of features in the input data for the corresponding client. NOTE: if a client provides only the target and no features, omit that client from this field.

    • output_activation: the function to apply to the output of the linear term. This is equivalent to the inverse of the link function in the context of GLM. Should be the same for all clients. Choose from the following:

      • None: Do not apply any function. This is the default value when not specified. Equivalent to the identity link function.

      • "exp": Apply the exponential function. Equivalent to the log link function.

      • "sigmoid": Apply the sigmoid function. Equivalent to the logit link function.

  • ml_task type - The type of the machine learning task, which determines the assumed distribution of the target variable. Choose from the following:

    • logistic: for targets that follow a Bernoulli distribution

    • normal: for targets that follow a Gaussian distribution

    • poisson: for targets that follow a Poisson distribution

    • gamma: for targets that follow a Gamma distribution

    • inverseGaussian: for targets that follow an inverse Gaussian distribution

    • tweedie: for targets that follow a Tweedie distribution. Specify the power of the Tweedie distribution under the params field. For example, to run a Tweedie regression with power 1.5:
      "ml_task": {"type": "tweedie", "params": {"power": 1.5}},

  • optimizer - The optimizer to use for the training process. All PyTorch optimizers listed at https://pytorch.org/docs/stable/optim.html#algorithms are supported, except LBFGS. Set name to the corresponding optimizer name, for example SGD or Adam. Specify extra parameters for the optimizer under the params field.

  • seed - Integer to be used as random seed, for reproducibility.
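
To make the relationship between output_activation and the link function concrete, the sketch below applies each of the three options to a linear term. It is illustrative only: it uses the standard textbook inverse link functions and is not the iai_glm implementation. The names INVERSE_LINKS and predict_mean are hypothetical.

```python
import math

# Standard inverse link functions, keyed by the output_activation value.
# Illustration only; this is not how iai_glm is implemented internally.
INVERSE_LINKS = {
    None: lambda z: z,                                # identity link
    "exp": math.exp,                                  # log link
    "sigmoid": lambda z: 1.0 / (1.0 + math.exp(-z)),  # logit link
}


def predict_mean(linear_term: float, output_activation=None) -> float:
    """Map the linear term (weights . features + bias) to the predicted mean."""
    return INVERSE_LINKS[output_activation](linear_term)


for name in (None, "exp", "sigmoid"):
    print(name, predict_mean(0.0, name))
```

For a linear term of 0, the identity gives 0.0, the exponential gives 1.0, and the sigmoid gives 0.5, which is why sigmoid pairs naturally with the logistic task type used in the sample config.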

Data Config for VFL-GLM

The data_config specifies the structure of the input data for participating clients, with a separate dictionary for each client, keyed by the client name.

data_config = {
    "passive_client": {
        "label_client": False,
        "predictors": ["x1", "x3", "x5", "x7", "x9", "x11", "x13"],
        "target": None,
    },
    "active_client": {
        "label_client": True,
        "predictors": ["x0", "x2", "x4", "x6", "x8", "x10", "x12", "x14"],
        "target": "y",
    },
}

Each client dictionary contains the following fields:

  • label_client: binary flag (True/False), indicating whether the corresponding client is the one that provides the label / target.

  • predictors: list of strings or integers, specifying the columns to use as features / predictors. If elements are strings, they represent the column names in the dataset. If elements are integers, they represent the column indices in the dataset (use only when you are confident about the column order in the data). If the corresponding client provides only the target and no features, set predictors to None.

  • target: a single string or integer, specifying the column to use as the label / target. Strings represent column names, and integers represent column indices (use only when you are confident about the column order in the data). If the corresponding client is not the label client, set target as None.
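
The model_config and data_config have to agree with each other: each client's input_size should equal the number of predictors it lists, and target should be set only for the label client. The following is a rough sketch of a pre-flight check for those rules. The function check_vfl_glm_configs is hypothetical and not part of the SDK, and it assumes that exactly one client is the label client.

```python
def check_vfl_glm_configs(model_config, data_config):
    """Return a list of problems; an empty list means the two configs agree."""
    problems = []
    model_clients = model_config.get("model", {})

    # Assumption: exactly one client supplies the target.
    label_clients = [n for n, c in data_config.items() if c.get("label_client")]
    if len(label_clients) != 1:
        problems.append(f"expected exactly one label client, found {len(label_clients)}")

    for name, cfg in data_config.items():
        has_target = cfg.get("target") is not None
        if bool(cfg.get("label_client")) != has_target:
            problems.append(f"{name}: target must be set only when label_client is True")

        predictors = cfg.get("predictors")
        if predictors is None:
            # Target-only clients are left out of the model section.
            if name in model_clients:
                problems.append(f"{name}: no predictors, so it should not appear under model")
            continue
        input_size = model_clients.get(name, {}).get("params", {}).get("input_size")
        if input_size != len(predictors):
            problems.append(f"{name}: input_size={input_size}, but {len(predictors)} predictors listed")

    # output_activation should be the same for all clients.
    activations = {c.get("params", {}).get("output_activation") for c in model_clients.values()}
    if len(activations) > 1:
        problems.append(f"output_activation differs across clients: {sorted(map(repr, activations))}")
    return problems


# The sample configs from this tutorial, reduced to the fields the check reads.
model_config = {
    "model": {
        "passive_client": {"params": {"input_size": 7, "output_activation": "sigmoid"}},
        "active_client": {"params": {"input_size": 8, "output_activation": "sigmoid"}},
    },
}
data_config = {
    "passive_client": {
        "label_client": False,
        "predictors": ["x1", "x3", "x5", "x7", "x9", "x11", "x13"],
        "target": None,
    },
    "active_client": {
        "label_client": True,
        "predictors": ["x0", "x2", "x4", "x6", "x8", "x10", "x12", "x14"],
        "target": "y",
    },
}
print(check_vfl_glm_configs(model_config, data_config))
```

For the sample configs the list of problems is empty. Changing input_size for passive_client to 9, for example, would produce one entry describing the mismatch.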

For more information, see the scikit-learn documentation for Generalized Linear Models, or the Generalized linear model article on Wikipedia.

The notebook also provides a sample data schema. For the purposes of testing VFL-GLM, use the sample schema as shown.

Create a VFL-GLM training session

Federated learning models created in integrate.ai are trained through sessions. You define the parameters required to train a federated model, including data and model configurations, in a session. Create a session each time you want to train a new model.

The code sample demonstrates creating and starting a session with two training clients (two datasets) and two (2) rounds. It returns a session ID that you can use to track and reference your session.

vfl_train_session = client.create_vfl_session(
    name="Testing notebook - VFL GLM Train",
    description="I am testing VFL GLM training session creation with a task runner through a notebook",
    prl_session_id=prl_session.id,
    vfl_mode='train',
    min_num_clients=2,
    num_rounds=2,
    package_name="iai_glm",
    data_config=data_config,
    model_config=model_config
).start()

vfl_train_session.id 

Create and start a task group to run the training session

The next step is to join the session with the sample data. This example has data for two datasets simulating two clients, as specified with the min_num_clients argument. The session begins training once the minimum number of clients have joined the session.

Each client must have a unique name that matches the name specified in the data_config. For example, active_client and passive_client.

# Create and start a task group with one task for each of the clients joining the session

vfl_task_group_context = (
    SessionTaskGroup(vfl_train_session)
    .add_task(iai_tb_aws.vfl_train(
        train_path=active_train_path,
        test_path=active_test_path,
        batch_size=1024,
        client_name="active_client"))
    .add_task(iai_tb_aws.vfl_train(
        train_path=passive_train_path,
        test_path=passive_test_path,
        batch_size=1024,
        client_name="passive_client"))
    .start()
)

Poll for Session Results

Sessions may take some time to run, depending on the compute environment. The sample notebook and this tutorial poll the server to determine the session status.

import json

# Check the status of the tasks
for i in vfl_task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))

# Monitor the task logs
vfl_task_group_context.monitor_task_logs()

# Wait for the session to complete
vfl_task_group_context.wait(60*8, 2)
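
As a rough illustration of what polling with a timeout involves, the sketch below repeatedly checks a status until it reaches a final state or a deadline passes. This is a generic pattern, not the SDK's wait implementation: poll_until, the status strings, and the simulated status source are all hypothetical, and this page does not state how the arguments to wait are interpreted.

```python
import time


def poll_until(get_status, timeout_s, interval_s, done_states=("completed", "failed")):
    """Call get_status() every interval_s seconds until it returns a final state.

    Raises TimeoutError if no final state is seen within timeout_s seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in done_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status still {status!r} after {timeout_s} s")
        time.sleep(interval_s)


# Simulated status source standing in for a real session (hypothetical values).
fake_statuses = iter(["pending", "running", "running", "completed"])
final_status = poll_until(lambda: next(fake_statuses), timeout_s=5, interval_s=0.01)
print(final_status)
```

Using a monotonic clock for the deadline keeps the timeout reliable even if the system clock is adjusted while the session runs.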

View the training metrics

After the session completes successfully, you can view the training metrics and start making predictions.

  1. Retrieve the model metrics as a dictionary with as_dict().

  2. Plot the metrics.

metrics = vfl_train_session.metrics().as_dict()
metrics
fig = vfl_train_session.metrics().plot()

VFL-GLM Prediction

After you have completed a successful PRL session and a VFL-GLM train session, you can use those two sessions to create a VFL-GLM prediction session.

In the example, the sessions are run in sequence, so the session IDs for the PRL and VFL train sessions are readily available to use in the predict session. If you instead run a PRL session and want to reuse it later in a different VFL session, save the session ID (prl_session.id). You can then provide the session ID directly in the predict session setup instead of relying on the variable. All three sessions must use the same client names and data in order to run successfully.
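
One simple way to keep session IDs available across notebook restarts is to write them to a small JSON file. The sketch below shows the idea with standard-library calls only. The helper names, the file name, and the placeholder IDs are hypothetical; in practice the values would come from prl_session.id and vfl_train_session.id.

```python
import json
import tempfile
from pathlib import Path


def save_session_ids(path, **session_ids):
    """Write the given session IDs to a JSON file."""
    Path(path).write_text(json.dumps(session_ids, indent=2))


def load_session_ids(path):
    """Read session IDs back from a JSON file written by save_session_ids."""
    return json.loads(Path(path).read_text())


# A temporary directory keeps the example self-contained; a real workflow
# would use a persistent location.
with tempfile.TemporaryDirectory() as tmp:
    ids_file = Path(tmp) / "vfl_glm_sessions.json"
    save_session_ids(
        ids_file,
        prl_session_id="<prl-session-id>",             # placeholder, e.g. prl_session.id
        training_session_id="<vfl-train-session-id>",  # placeholder, e.g. vfl_train_session.id
    )
    restored = load_session_ids(ids_file)

print(restored["prl_session_id"])
```

The restored values can then be passed as the prl_session_id and training_session_id arguments when creating the predict session.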

# Create and start a VFL-GLM prediction session

vfl_predict_session = client.create_vfl_session(
    name="Testing notebook - VFL-GLM Predict",
    description="I am testing VFL-GLM prediction session creation with an AWS task runner through a notebook",
    prl_session_id=prl_session.id,
    training_session_id=vfl_train_session.id,
    vfl_mode="predict",
    data_config=data_config,
).start()

vfl_predict_session.id 
# Create and start a task group with one task for each of the clients joining the session

vfl_predict_task_group_context = (
    SessionTaskGroup(vfl_predict_session)
    .add_task(iai_tb_aws.vfl_predict(
        client_name="active_client",
        dataset_path=active_test_path,
        raw_output=True,
        batch_size=1024,
        storage_path=vfl_predict_active_storage_path))
    .add_task(iai_tb_aws.vfl_predict(
        client_name="passive_client",
        dataset_path=passive_test_path,
        raw_output=True,
        batch_size=1024,
        storage_path=vfl_predict_passive_storage_path))
    .start()
)
# Check the status of the tasks
for i in vfl_predict_task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))
vfl_predict_task_group_context.monitor_task_logs()
# Wait for the tasks to complete (success = True)
vfl_predict_task_group_context.wait(60*5, 2)

View the predictions

Now you can view the predictions and evaluate the performance.

import pandas as pd

# Retrieve the metrics
metrics = vfl_predict_session.metrics().as_dict()
metrics

# Retrieve the pre-signed URLs for the prediction results
presigned_result_urls = vfl_predict_session.prediction_result()

df_pred = pd.read_csv(presigned_result_urls.get(vfl_predict_active_storage_path))

df_pred.head()
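
Once the predictions are loaded, evaluating them against known labels is a matter of standard metrics. The layout of the prediction file is not described on this page, so the sketch below works on plain Python lists rather than on df_pred. It assumes the predictions are probabilities (as the sigmoid output activation in the sample config suggests), and the names y_true and y_prob and their values are made up for illustration.

```python
import math


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of rows where the thresholded probability matches the label."""
    hits = sum(int(p >= threshold) == int(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


def log_loss(y_true, y_prob, eps=1e-12):
    """Mean negative Bernoulli log-likelihood, with clipping to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4]

print("accuracy:", binary_accuracy(y_true, y_prob))
print("log loss:", round(log_loss(y_true, y_prob), 4))
```

With real data, the two lists would be taken from the label column of the active client's test set and the matching prediction column of the results file.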

For more information about VFL predict mode, see VFL Prediction Session Example.
