Generalized Linear Models for VFL (VFL-GLM)

The AWS task runner sample notebook (integrateai_demo_AWS.ipynb) provides sample code for running the SDK, and should be used in parallel with this tutorial. This documentation provides supplementary and conceptual information.

Prerequisites

  1. Open the integrateai_demo_AWS.ipynb notebook to test the code as you walk through this tutorial.

  2. Make sure you have downloaded the sample files: https://s3.ca-central-1.amazonaws.com/public.s3.integrate.ai/integrate_ai_examples/vfl.zip and uploaded them to your S3 bucket.

  3. Register the datasets with the paths shown in the sample notebook.

# Example dataset names. Make sure that the datasets you want to work with are registered in your workspace.

consumer_train_name = 'consumer_train'
consumer_test_name = 'consumer_test'
acme_data = 'acme_train'

Run a PRL session to determine overlap

Before you run a VFL session, you must first run a PRL session to determine the overlap between the datasets.

For more information, see Private Record Linkage (PRL) Sessions.

Review the sample data and model configuration for VFL-GLM

Model Config for VFL-GLM

integrate.ai has a model class available for Generalized Linear Models, called iai_glm. This model is defined using a JSON configuration file during session creation (model_config).

model_config = {                                       
    "strategy": {"name": "VflGlm", "params": {}},
    "model": {
        acme_data: {"params": {"input_size": len(vendor_features), "output_activation": "exp"}},
        consumer_train_name: {"params": {"input_size": len(consumer_features), "output_activation": "exp"}},
    },
    "ml_task": { 
        "type": "regression",
        "loss_function": "mse",
        "params": {},
    },
    "optimizer": {"name": "Adam", "params": {"learning_rate": 0.08}},
    "seed": 23,  # for reproducibility
    "influence_score": {"enable": True, "params": {}},
}

You can adjust the following model_config parameters as needed:

  • strategy - The aggregation strategy for federated training. Use the same value as the sample config.

  • model - The model structure for participating clients, with a separate dictionary for each client, keyed by the client name.

    • input_size: the number of features in the input data for the corresponding client. NOTE: if a client only provides the target with no features, it should be excluded from this field.

    • output_activation: the function to apply to the output of the linear term. This is equivalent to the inverse of the link function in the context of GLM. Should be the same for all clients. Choose from the following:

      • None: Do not apply any function. This is the default value when not specified. Equivalent to the identity link function.

      • "exp": Apply the exponential function. Equivalent to the log link function.

      • "sigmoid": Apply the sigmoid function. Equivalent to the logit link function.

  • ml_task - The parameters of the machine learning task. Choose from the following:

    • type: regression or classification

    • loss_function

      • logistic: for targets that follow a Bernoulli distribution

      • mse: for targets that follow a Gaussian distribution

      • poisson: for targets that follow a Poisson distribution

      • gamma: for targets that follow a Gamma distribution

      • inverseGaussian: for targets that follow an inverse Gaussian distribution

      • tweedie: for targets that follow a Tweedie distribution. The power of the Tweedie distribution can be specified under the params field. For example, to conduct a Tweedie regression with power 1.5, specify as follows:
        "ml_task": {"type": "regression", "loss_function": "tweedie", "params": {"power": 1.5}},

    • params: any additional parameters for the selected type and loss function, as shown in the tweedie example.

  • optimizer - The optimizer to use for the training process. All PyTorch optimizers listed in the torch.optim documentation are supported, except LBFGS. Set name to the corresponding optimizer name, for example SGD or Adam. Specify any extra optimizer parameters under the params field.

  • seed - Integer to be used as random seed, for reproducibility.

  • influence_score - Controls whether or not the influence score is calculated for the session. By default this is False. Set to True to enable this feature.
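
The relationship between output_activation and the link function can be illustrated with a minimal sketch in plain Python. It is not part of the SDK: the dictionaries and the predict_mean helper are hypothetical names used only for illustration, and the value of the linear term is made up.

```python
import math

# Inverse link functions, keyed by the "output_activation" values described above.
# Each one maps the linear term (eta) to the predicted mean.
ACTIVATIONS = {
    None: lambda eta: eta,                                 # identity link
    "exp": lambda eta: math.exp(eta),                      # log link
    "sigmoid": lambda eta: 1.0 / (1.0 + math.exp(-eta)),   # logit link
}

# The matching link functions, which map a mean back to the linear term.
LINKS = {
    None: lambda mu: mu,
    "exp": lambda mu: math.log(mu),
    "sigmoid": lambda mu: math.log(mu / (1.0 - mu)),
}


def predict_mean(eta, output_activation=None):
    """Apply the chosen activation to a linear term (hypothetical helper)."""
    return ACTIVATIONS[output_activation](eta)


eta = 0.5  # hypothetical value of the linear term
for name in (None, "exp", "sigmoid"):
    mu = predict_mean(eta, name)
    # Applying the link function to the mean recovers the linear term.
    assert abs(LINKS[name](mu) - eta) < 1e-9
    print(name, round(mu, 4))
```

The round-trip check in the loop is what "inverse of the link function" means in practice: the activation and the link undo each other.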

Data Config for VFL-GLM

The data_config specifies the structure of the input data for participating clients, with a separate dictionary for each client, keyed by the client name.

#Specify features:

consumer_features = ['building_age_norm',  
                     'region_code_AB', 'region_code_BC', 'region_code_NS', 'region_code_ON', 'region_code_QC',
                     'building_type_X', 'building_type_Y', 'building_type_Z', 
                     'industry_code_A', 'industry_code_B', 'industry_code_C']

vendor_features = ['fire_norm', 'crime_norm']

#Specify the data configuration 

data_config = {                                  
        acme_data: {
            "label_client": False,
            "predictors": vendor_features,
            "target": None,
        },
        consumer_train_name: {
            "label_client": True,
            "predictors": consumer_features,
            "target": target,
        },
    }

Each client dictionary contains the following fields:

  • label_client: binary flag (True/False), indicating whether the corresponding client is the one that provides the label / target.

  • predictors: list of strings or integers, specifying the columns to use as features / predictors. If elements are strings, they represent the column names in the dataset. If elements are integers, they represent the column indices in the dataset (use only when you are confident about the column order in the data). If the corresponding client does not provide any feature, but only the target, set predictors as None.

  • target: a single string or integer, specifying the column to use as the label / target. Strings represent column names, and integers represent column indices (use only when you are confident about the column order in the data). If the corresponding client is not the label client, set target as None.
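
The field rules above can be checked before you create a session. The following is a minimal sketch, not an SDK feature: check_data_config and build_model_section are hypothetical helpers, the target column name "y" is a placeholder, and the requirement of exactly one label client is an assumption based on the description of label_client.

```python
def check_data_config(data_config):
    """Raise ValueError if the config breaks the field rules described above."""
    label_clients = [name for name, cfg in data_config.items() if cfg["label_client"]]
    # Assumption: exactly one client provides the label / target.
    if len(label_clients) != 1:
        raise ValueError(f"expected exactly one label client, found {len(label_clients)}")
    for name, cfg in data_config.items():
        if cfg["label_client"] and cfg["target"] is None:
            raise ValueError(f"{name}: the label client must set a target")
        if not cfg["label_client"] and cfg["target"] is not None:
            raise ValueError(f"{name}: only the label client may set a target")


def build_model_section(data_config, output_activation=None):
    """Derive the "model" part of model_config from data_config."""
    model = {}
    for name, cfg in data_config.items():
        if cfg["predictors"] is None:
            continue  # clients that only provide the target are excluded
        model[name] = {
            "params": {
                "input_size": len(cfg["predictors"]),
                "output_activation": output_activation,
            }
        }
    return model


example_config = {
    "acme_train": {
        "label_client": False,
        "predictors": ["fire_norm", "crime_norm"],
        "target": None,
    },
    "consumer_train": {
        "label_client": True,
        "predictors": ["building_age_norm"],
        "target": "y",  # placeholder column name
    },
}

check_data_config(example_config)
model_section = build_model_section(example_config, "exp")
print(model_section)
```

Deriving input_size from the predictors list keeps the model and data configurations in sync when you add or remove features.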

For more information, see the scikit-learn documentation for Generalized Linear Models, or Generalized linear model on Wikipedia.

Create a VFL-GLM training session

Federated learning models created in integrate.ai are trained through sessions. You define the parameters required to train a federated model, including data and model configurations, in a session. Create a session each time you want to train a new model.

The code sample demonstrates creating and starting a session with two training clients (two datasets) and 15 rounds. It returns a session ID that you can use to track and reference your session.

# Set up the task builder 

from integrate_ai_sdk.taskgroup.taskbuilder.integrate_ai import IntegrateAiTaskBuilder
from integrate_ai_sdk.taskgroup.base import SessionTaskGroup

# Create one task builder per client. Set task_runner_id to the ID of your task runner.
iai_tb_aws_consumer = IntegrateAiTaskBuilder(client=client, task_runner_id="")
iai_tb_aws_provider = IntegrateAiTaskBuilder(client=client, task_runner_id="")

Each client must have a unique name that matches the name specified in the data_config. For example, consumer_train_name and acme_data.
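
A quick way to catch a naming mismatch before starting the task group is to compare the client names you plan to pass to the tasks against the keys of data_config. This is a minimal sketch with a hypothetical helper, check_client_names, which is not part of the SDK.

```python
def check_client_names(task_client_names, data_config):
    """Raise ValueError if task client names are duplicated or do not match data_config."""
    names = list(task_client_names)
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"duplicate client names: {duplicates}")
    missing = sorted(set(names) - set(data_config))   # used in tasks, absent from data_config
    unused = sorted(set(data_config) - set(names))    # in data_config, but no task uses them
    if missing or unused:
        raise ValueError(f"not in data_config: {missing}; no task for: {unused}")


# Example with the dataset names used in this tutorial.
example_data_config = {"acme_train": {}, "consumer_train": {}}
check_client_names(["consumer_train", "acme_train"], example_data_config)
print("client names match data_config")
```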

# Create and start a VFL training session for consumer + vendor

fl_train_session1 = client.create_vfl_session(
    name="Evaluation - VFL model Train 1",
    description="I am training a federated model with Acme data and the consumer data",
    prl_session_id=prl_session1.id,
    vfl_mode='train',
    min_num_clients=2,
    num_rounds=15,
    package_name="iai_glm",
    data_config=data_config,
    model_config=model_config
).start()

vfl_task_group_context = (SessionTaskGroup(fl_train_session1)\
    .add_task(iai_tb_aws_consumer.vfl_train(train_dataset_name=consumer_train_name,
                                    test_dataset_name=consumer_test_name,
                                    client_name=consumer_train_name,
                                    batch_size=4000
                                    ))\
    .add_task(iai_tb_aws_provider.vfl_train(train_dataset_name=acme_data,
                                    test_dataset_name=acme_data,
                                    client_name=acme_data,
                                    batch_size=4000
                                    ))\
    .start())

fl_train_session1.id    #Prints the session ID for reference

Sessions may take some time to run depending on the compute environment. You can check the session status in the workspace, on the Sessions page. Wait for the status to change to Completed.
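
If you prefer to wait in code rather than watch the Sessions page, a generic polling loop is one option. The sketch below is an illustration only: wait_for_status is a hypothetical helper, and get_status is a function that you supply to return the current status string, because this tutorial does not show how the SDK exposes the status. The intermediate status names in the example are made up.

```python
import time


def wait_for_status(get_status, target="Completed", poll_seconds=30,
                    timeout_seconds=3600, sleep=time.sleep):
    """Call get_status() repeatedly until it returns target, or raise TimeoutError."""
    waited = 0
    while True:
        status = get_status()
        if status == target:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"status is still {status!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Example with a stand-in status source; the first two status names are hypothetical.
fake_statuses = iter(["Created", "Running", "Completed"])
result = wait_for_status(lambda: next(fake_statuses), poll_seconds=0,
                         sleep=lambda seconds: None)
print(result)
```

Passing the sleep function as a parameter makes the loop easy to test without real delays.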

View the training metrics

After the session completes successfully, you can view the training metrics and start making predictions.

  1. Retrieve the model metrics as_dict.

  2. Plot the metrics.

metrics = fl_train_session1.metrics().as_dict()
metrics
fig = fl_train_session1.metrics().plot()

VFL-GLM Prediction

After you have completed a successful PRL session and a VFL-GLM train session, you can use those two sessions to create a VFL-GLM prediction session.

In the example, the sessions are run in sequence, so the session IDs for the PRL and VFL train sessions are readily available to use in the predict session. If instead you run a PRL session and want to reuse the session later in a different VFL session, make sure that you save the session ID (prl_session.id). Then you can provide the session ID directly in the predict session setup instead of relying on the variable. The 3 sessions must use the same client_name and datasets in order to run successfully.
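
One simple way to keep session IDs for later reuse is to write them to a small JSON file. This is a minimal sketch using only the Python standard library. The helper names, the file name, and the ID values are placeholders, not values produced by the SDK.

```python
import json
import tempfile
from pathlib import Path


def save_session_ids(path, prl_session_id, training_session_id):
    """Write the two session IDs to a JSON file."""
    payload = {
        "prl_session_id": prl_session_id,
        "training_session_id": training_session_id,
    }
    Path(path).write_text(json.dumps(payload, indent=2))


def load_session_ids(path):
    """Read the session IDs back as a dictionary."""
    return json.loads(Path(path).read_text())


# A temporary directory keeps the example self-contained; use a durable location in practice.
with tempfile.TemporaryDirectory() as tmp_dir:
    ids_file = Path(tmp_dir) / "vfl_session_ids.json"
    save_session_ids(ids_file, "prl-1234", "train-5678")  # placeholder IDs
    loaded = load_session_ids(ids_file)

print(loaded["prl_session_id"], loaded["training_session_id"])
```

You can then pass the loaded values as prl_session_id and training_session_id when you create the predict session.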

To create a VFL prediction session:

  1. Specify the PRL session ID (prl_session_id) from your previous successful PRL session.

  2. Specify the VFL train session ID (training_session_id) from your previous successful VFL train session.

  3. Set the vfl_mode to predict.

# Create and start a VFL-GLM prediction session

vfl_predict_session = client.create_vfl_session(
    name="Testing notebook - VFL Predict",
    description="I am testing VFL Predict session creation with an AWS task runner through a notebook",
    prl_session_id=prl_session1.id,
    training_session_id=fl_train_session1.id,
    vfl_mode="predict",
    data_config=data_config
).start()

vfl_predict_session.id  # Prints the session ID for reference

Create and start VFL Predict session

Create and start a task group with one task for each of the clients joining the session.


vfl_predict_task_group_context = (SessionTaskGroup(vfl_predict_session)\
.add_task(iai_tb_aws_consumer.vfl_predict(
        client_name=consumer_train_name, 
        dataset_name=consumer_train_name,
        raw_output=True,
        batch_size=1024))\
.add_task(iai_tb_aws_provider.vfl_predict(
        client_name=acme_data,
        dataset_name=acme_data,
        batch_size=1024,
        raw_output=True))\
).start()

The following parameters are set for each client task:

  • client_name - must be the same as the client name specified in the PRL and VFL train sessions

  • dataset_name - the name of a registered dataset

  • batch_size - the number of records to process per batch (1024 in the example)

  • raw_output (bool, optional) - whether the raw model output should be saved. Defaults to False, in which case a transformation corresponding to the ML task is applied.

Sessions may take some time to run depending on the compute environment. You can check the session status in the workspace, on the Sessions page. Wait for the status to change to Completed.

View VFL predictions

After the predict session completes successfully, you can view the predictions from the Active party and evaluate the performance.

import pandas as pd

# Retrieve the metrics
metrics = vfl_predict_session.metrics().as_dict()
metrics

# Retrieve the presigned URLs for the prediction results
presigned_result_urls = vfl_predict_session.prediction_result()

df_pred = pd.read_csv(presigned_result_urls.get(vfl_predict_active_storage_path))

df_pred.head()
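
To evaluate the predictions, you can compare them with known target values. The sketch below uses plain Python and made-up numbers. It assumes that you have already extracted the true targets and the predictions as two sequences of equal length, for example from columns of df_pred; the column names depend on your output and are not shown here. Both function names are hypothetical.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mean_poisson_deviance(y_true, y_pred):
    """Mean Poisson deviance; predictions must be strictly positive."""
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        # The y * log(y / mu) term is taken as 0 when y == 0.
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (log_term - (y - mu))
    return total / len(y_true)


# Made-up values for illustration only.
y_true = [0, 2, 1, 3]
y_pred = [0.5, 1.5, 1.0, 2.5]

print("MSE:", mean_squared_error(y_true, y_pred))
print("Mean Poisson deviance:", mean_poisson_deviance(y_true, y_pred))
```

Choose the metric that matches the loss_function used in training, for example the squared error for mse or the Poisson deviance for poisson.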
