Generalized Linear Models for VFL (VFL-GLM)

Run a PRL session to determine overlap

Before you run a VFL session, you must first run a PRL session to determine the overlap between the datasets.

For more information, see Private Record Linkage (PRL) Sessions.

Review the sample data and model configuration for VFL-GLM

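The following variables hold example dataset names. Make sure that the datasets you want to work with are registered in your workspace. The later samples also use an acme_data variable as the provider's client and dataset name; it is not defined on this page, so define it before you run them.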

consumer_train_name = 'consumer_train'
consumer_test_name = 'consumer_test'
provider_train_name = 'acme_train' 

Model Config for VFL-GLM

integrate.ai has a model class available for Generalized Linear Models, called iai_glm. This model is defined using a JSON configuration file during session creation (model_config).

model_config = {                                       
    "strategy": {"name": "VflGlm", "params": {}},
    "model": {
        acme_data: {"params": {"input_size": len(vendor_features), "output_activation": "exp"}},
        consumer_train_name: {"params": {"input_size": len(consumer_features), "output_activation": "exp"}},
    },
    "ml_task": { 
        "type": "regression",
        "loss_function": "mse",
        "params": {},
    },
    "optimizer": {"name": "Adam", "params": {"learning_rate": 0.08}},
    "seed": 23,  # for reproducibility
    "influence_score": {"enable": True, "params": {}},
}

You can adjust the following model_config parameters as needed:

strategy - The aggregation strategy for federated training. Use the same value as the sample config.
model - The model structure for participating clients, with a separate dictionary for each client, keyed by the client name.
- input_size: the number of features in the input data for the corresponding client. NOTE: if a client only provides the target with no features, it should be excluded from this field.
- output_activation: the function to apply to the output of the linear term. This is equivalent to the inverse of the link function in the context of GLM. Use the same function for all clients. Choose from the following:
  - None: Do not apply any function. This is the default value when not specified. Equivalent to the identity link function.
  - "exp": Apply the exponential function. Equivalent to the log link function.
  - "sigmoid": Apply the sigmoid function. Equivalent to the logit link function.
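
To make the link-function relationship concrete, here is a minimal sketch in plain Python. It is not the iai_glm implementation; the helper name apply_output_activation is hypothetical and only illustrates how each output_activation value maps a linear term to a predicted mean.

```python
import math

def apply_output_activation(linear_term, output_activation=None):
    """Map a linear term to a predicted mean (the inverse link function).

    Hypothetical helper for illustration only; not part of the SDK.
    """
    if output_activation is None:
        return linear_term                            # identity link
    if output_activation == "exp":
        return math.exp(linear_term)                  # inverse of the log link
    if output_activation == "sigmoid":
        return 1.0 / (1.0 + math.exp(-linear_term))   # inverse of the logit link
    raise ValueError(f"unsupported output_activation: {output_activation!r}")

eta = 0.0  # example linear term
print(apply_output_activation(eta))             # 0.0
print(apply_output_activation(eta, "exp"))      # 1.0
print(apply_output_activation(eta, "sigmoid"))  # 0.5
```

With "exp" the prediction is always positive, which is why it is the usual choice for count or amount targets; with "sigmoid" the prediction stays between 0 and 1.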

ml_task - The parameters of the machine learning task. Choose from the following:
- type: regression or classification
- loss_function
  - logistic: for targets that follow a Bernoulli distribution
  - mse: for targets that follow a Gaussian distribution
  - poisson: for targets that follow a Poisson distribution
  - gamma: for targets that follow a Gamma distribution
  - inverseGaussian: for targets that follow an inverse Gaussian distribution
  - tweedie: for targets that follow a Tweedie distribution. Specify the power of the Tweedie distribution under the params field. For example, to run a Tweedie regression with power 1.5:
    "ml_task": {"type": "regression", "loss_function": "tweedie", "params": {"power": 1.5}},
- params: any additional parameters for the selected type and loss function, as shown in the tweedie example.
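
To show what the power parameter controls, the sketch below computes the textbook Tweedie unit deviance for a power between 1 and 2. This is an illustration under stated assumptions, not necessarily the exact loss the platform computes; the function name tweedie_unit_deviance is hypothetical.

```python
def tweedie_unit_deviance(y, mu, power=1.5):
    """Textbook Tweedie unit deviance for 1 < power < 2.

    y is the observed target (>= 0) and mu is the predicted mean (> 0).
    Illustrative only; the platform's internal loss may differ.
    """
    if not 1.0 < power < 2.0:
        raise ValueError("this sketch only covers 1 < power < 2")
    if y < 0 or mu <= 0:
        raise ValueError("requires y >= 0 and mu > 0")
    p = power
    return 2.0 * (
        y ** (2.0 - p) / ((1.0 - p) * (2.0 - p))
        - y * mu ** (1.0 - p) / (1.0 - p)
        + mu ** (2.0 - p) / (2.0 - p)
    )

# The deviance is zero when the prediction matches the target
# and grows as the prediction moves away from it.
print(tweedie_unit_deviance(1.0, 1.0))  # 0.0
print(tweedie_unit_deviance(4.0, 1.0))
print(tweedie_unit_deviance(0.0, 1.0))
```

A power closer to 1 behaves more like a Poisson loss and a power closer to 2 more like a Gamma loss, so 1.5 is a common middle ground for targets with many zeros and a skewed positive tail.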

optimizer - The optimizer to use for the training process. All PyTorch optimizers (torch.optim) are supported except LBFGS. Set name to the optimizer's class name, for example SGD or Adam, and pass any extra optimizer parameters under params.
seed - Integer to be used as random seed, for reproducibility.
influence_score - Controls whether or not the influence score is calculated for the session. By default this is False. Set to True to enable this feature.

Data Config for VFL-GLM

The data_config specifies the structure of the input data for participating clients, with a separate dictionary for each client, keyed by the client name.

# Specify the features

consumer_features = ['building_age_norm',  
                     'region_code_AB', 'region_code_BC', 'region_code_NS', 'region_code_ON', 'region_code_QC',
                     'building_type_X', 'building_type_Y', 'building_type_Z', 
                     'industry_code_A', 'industry_code_B', 'industry_code_C']

vendor_features = ['fire_norm', 'crime_norm']

# Specify the data configuration

data_config = {
    acme_data: {
        "label_client": False,
        "predictors": vendor_features,
        "target": None,
    },
    consumer_train_name: {
        "label_client": True,
        "predictors": consumer_features,
        "target": target,
    },
}

Each client dictionary contains the following fields:

label_client: Boolean flag (True/False) indicating whether the corresponding client is the one that provides the label (target).
predictors: list of strings or integers, specifying the columns to use as features / predictors. If elements are strings, they represent the column names in the dataset. If elements are integers, they represent the column indices in the dataset (use only when you are confident about the column order in the data). If the corresponding client does not provide any feature, but only the target, set predictors as None.
target: a single string or integer, specifying the column to use as the label / target. Strings represent column names, and integers represent column indices (use only when you are confident about the column order in the data). If the corresponding client is not the label client, set target as None.
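
Because model_config and data_config are keyed by the same client names, it can help to cross-check them before creating a session. The following is a minimal sketch in plain Python, not an SDK feature: check_vfl_configs is a hypothetical helper, the rule of exactly one label client is an assumption drawn from the field descriptions above, and the client names and "target_column" are placeholder values.

```python
def check_vfl_configs(model_config, data_config):
    """Return a list of problems found; an empty list means the configs agree.

    Hypothetical helper for illustration only; not part of the SDK.
    """
    problems = []
    models = model_config["model"]

    # Assumption: a session has exactly one client that provides the target.
    label_clients = [name for name, cfg in data_config.items() if cfg["label_client"]]
    if len(label_clients) != 1:
        problems.append(f"expected exactly one label client, found {len(label_clients)}")

    for name, cfg in data_config.items():
        # Only the label client may name a target column.
        if cfg["label_client"] and cfg["target"] is None:
            problems.append(f"{name}: label client has no target")
        if not cfg["label_client"] and cfg["target"] is not None:
            problems.append(f"{name}: non-label client must set target to None")

        predictors = cfg["predictors"]
        if predictors is None:
            # A target-only client should not appear under "model".
            if name in models:
                problems.append(f"{name}: has no predictors but is listed under model")
        elif name not in models:
            problems.append(f"{name}: missing from model_config['model']")
        elif models[name]["params"]["input_size"] != len(predictors):
            problems.append(f"{name}: input_size does not match len(predictors)")
    return problems

# Placeholder configuration mirroring the structure of the samples above.
consumer_features = ["building_age_norm", "region_code_ON"]
vendor_features = ["fire_norm", "crime_norm"]
data_config = {
    "acme_data": {"label_client": False, "predictors": vendor_features, "target": None},
    "consumer_train": {"label_client": True, "predictors": consumer_features,
                       "target": "target_column"},
}
model_config = {"model": {
    "acme_data": {"params": {"input_size": len(vendor_features), "output_activation": "exp"}},
    "consumer_train": {"params": {"input_size": len(consumer_features), "output_activation": "exp"}},
}}
print(check_vfl_configs(model_config, data_config))  # []
```

Deriving input_size from len(...) of the same feature list used in predictors, as the sample configs do, avoids the most common mismatch.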

For more information, see the scikit-learn documentation for Generalized Linear Models, or Generalized linear model on Wikipedia.

Create a VFL-GLM training session

Federated learning models created in integrate.ai are trained through sessions. You define the parameters required to train a federated model, including data and model configurations, in a session. Create a session each time you want to train a new model.

The code sample demonstrates creating and starting a session with two training clients (two datasets) and 15 rounds. It returns a session ID that you can use to track and reference your session.

# Set up the task builder 

from integrate_ai_sdk.taskgroup.taskbuilder.integrate_ai import IntegrateAiTaskBuilder
from integrate_ai_sdk.taskgroup.base import SessionTaskGroup

iai_tb_aws_consumer = IntegrateAiTaskBuilder(client=client, task_runner_id="")
iai_tb_aws_provider = IntegrateAiTaskBuilder(client=client, task_runner_id="")

Each client must have a unique name that matches the name specified in the data_config. For example, consumer_train_name and acme_data.

# Create and start a VFL training session for consumer + vendor

fl_train_session = client.create_vfl_session(
    name="Evaluation - VFL model Train 1",
    description="I am training a federated model with Acme data and the consumer data",
    prl_session_id=prl_session.id,
    vfl_mode='train',
    min_num_clients=2,                  # Must be greater than 1
    num_rounds=15,
    package_name="iai_glm",
    data_config=data_config,
    model_config=model_config
).start()

vfl_task_group_context = (SessionTaskGroup(fl_train_session)\
    .add_task(iai_tb_aws_consumer.vfl_train(train_dataset_name=consumer_train_name,
                                    test_dataset_name=consumer_test_name,
                                    client_name=consumer_train_name,
                                    batch_size=4000
                                    ))\
    .add_task(iai_tb_aws_provider.vfl_train(train_dataset_name=acme_data,
                                    test_dataset_name=acme_data,
                                    client_name=acme_data,
                                    batch_size=4000
                                    ))\
    .start())

fl_train_session.id    #Prints the session ID for reference

Sessions may take some time to run depending on the compute environment. You can check the session status in the workspace, on the Sessions page. Wait for the status to update to Completed.

View the training metrics

After the session completes successfully, you can view the training metrics and start making predictions.

Retrieve the model metrics as_dict.
Plot the metrics.

# Use the training session created above (fl_train_session)
metrics = fl_train_session.metrics().as_dict()
metrics

fig = fl_train_session.metrics().plot()

VFL-GLM Prediction

After you have completed a successful PRL session and a VFL-GLM train session, you can use those two sessions to create a VFL-GLM prediction session.

In the example, the sessions are run in sequence, so the session IDs for the PRL and VFL train sessions are readily available to use in the predict session. If instead you run a PRL session and want to reuse the session later in a different VFL session, make sure that you save the session ID (prl_session.id). Then you can provide the session ID directly in the predict session setup instead of relying on the variable. The 3 sessions must use the same client_name and datasets in order to run successfully.
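
One simple way to keep session IDs for later reuse is to write them to a small JSON file. The sketch below uses only the Python standard library; the helper names, the file name, and the example ID strings are placeholders, and in practice you would pass prl_session.id and fl_train_session.id.

```python
import json
import tempfile
from pathlib import Path

def save_session_ids(path, prl_session_id, training_session_id):
    """Write both session IDs to a JSON file (hypothetical helper)."""
    payload = {
        "prl_session_id": prl_session_id,
        "training_session_id": training_session_id,
    }
    Path(path).write_text(json.dumps(payload, indent=2))

def load_session_ids(path):
    """Read the session IDs back as a dictionary (hypothetical helper)."""
    return json.loads(Path(path).read_text())

# A temporary directory keeps the example self-contained;
# use a durable location for real sessions.
with tempfile.TemporaryDirectory() as tmp:
    ids_file = Path(tmp) / "vfl_session_ids.json"
    save_session_ids(ids_file, "example-prl-id", "example-train-id")
    restored = load_session_ids(ids_file)

print(restored["prl_session_id"])       # example-prl-id
print(restored["training_session_id"])  # example-train-id
```

The restored values can then be passed as prl_session_id and training_session_id when you create the predict session.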

To create a VFL prediction session:

Specify the PRL session ID (prl_session_id) from your previous successful PRL session.
Specify the VFL train session ID (training_session_id) from your previous successful VFL train session.
Set the vfl_mode to predict.

# Create and start a VFL-GLM prediction session

vfl_predict_session = client.create_vfl_session(
    name="Testing notebook - VFL Predict",
    description="I am testing VFL Predict session creation with an AWS task runner through a notebook",
    prl_session_id=prl_session.id,
    training_session_id=fl_train_session.id,
    vfl_mode="predict",
    data_config=data_config
).start()

vfl_predict_session.id  # Prints the session ID for reference

Create and start VFL Predict session

Create and start a task group with one task for each of the clients joining the session.

# Create the task builder
from integrate_ai_sdk.taskgroup.taskbuilder.integrate_ai import IntegrateAiTaskBuilder
from integrate_ai_sdk.taskgroup.base import SessionTaskGroup

iai_tb_aws_consumer = IntegrateAiTaskBuilder(client=client, task_runner_id="")
iai_tb_aws_provider = IntegrateAiTaskBuilder(client=client, task_runner_id="")

# Create and start the task group
vfl_predict_task_group_context = (SessionTaskGroup(vfl_predict_session)\
.add_task(iai_tb_aws_consumer.vfl_predict(
        client_name=consumer_train_name, 
        dataset_name=consumer_train_name,
        raw_output=True,
        batch_size=1024))\
.add_task(iai_tb_aws_provider.vfl_predict(
        client_name=acme_data,
        dataset_name=provider_train_name,
        batch_size=1024,
        raw_output=True))\
).start()

Each client task takes the following parameters:

client_name - must be the same as the client_name specified in the PRL and VFL train sessions
dataset_name - the name of a registered dataset
batch_size - set to 1024 by default, but can be increased for larger batches
raw_output (bool, optional) - whether the raw model output should be saved. Defaults to False, in which case a transformation corresponding to the ml_task is applied.

Sessions may take some time to run depending on the compute environment. You can check the session status in the workspace, on the Sessions page. Wait for the status to update to Completed.

View VFL predictions

After the predict session completes successfully, you can view the predictions from the Active party and evaluate the performance.

# Retrieve the metrics
metrics = vfl_predict_session.metrics().as_dict()
metrics

import pandas as pd
presigned_result_urls = vfl_predict_session.prediction_result()

df_pred = pd.read_csv(presigned_result_urls.get(vfl_predict_active_storage_path))

df_pred.head()
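
To evaluate a regression model once you have the predictions and the true target values side by side, you can compute a few standard error metrics. The sketch below is plain Python and makes no assumption about the column names in df_pred, which this page does not show; regression_report is a hypothetical helper that accepts any two equal-length sequences of numbers.

```python
import math

def regression_report(y_true, y_pred):
    """Return MSE, RMSE, and MAE for two equal-length sequences.

    Hypothetical helper for illustration only; not part of the SDK.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}

# Placeholder values; replace them with the target and prediction
# columns from your own data, for example converted with .tolist().
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
print(regression_report(y_true, y_pred))
```

If the predict tasks ran with raw_output=True, apply the inverse link function to the raw values before comparing them with the targets.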
