Gradient Boosted Models (HFL-GBM)

Gradient boosting is a machine learning algorithm for building predictive models that helps minimize the model's bias error. The gradient boosting model provided by integrate.ai is an HFL model that uses the sklearn implementation of HistGradientBoostingClassifier for classification tasks and HistGradientBoostingRegressor for regression tasks.
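For context, the sketch below shows the underlying scikit-learn estimator for the classification task, configured locally with the same kinds of parameters used later in this tutorial. This is an illustrative, non-federated example only; the federated workflow uses the integrate.ai SDK as described below.

from sklearn.ensemble import HistGradientBoostingClassifier

# Local, single-dataset illustration of the estimator that HFL-GBM is built on
local_clf = HistGradientBoostingClassifier(
    max_depth=4,
    learning_rate=0.05,
    max_bins=128,
    max_iter=100,
)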

The GBM sample notebook (integrateai_hfl_gbm.ipynb) provides sample code for running the SDK, and should be used in parallel with this tutorial. You can download the sample notebook here.

Sample HFL GBM Model Configuration and Data Schema

integrate.ai has a model class available for Gradient Boosted Models, called iai_gbm. This model is defined using a JSON configuration file during session creation.

The strategy for GBM is HistGradientBoosting.

# Model configuration

model_config = {
    "strategy": {
        "name": "HistGradientBoosting",
        "params": {}
    },
    "model": {
        "params": {
            "max_depth": 4,
            "learning_rate": 0.05,
            "random_state": 23,
            "max_bins": 128,
            "sketch_relative_accuracy": 0.001,
            "max_iter": 100,
        }
    },
    "ml_task": {"type": "classification", "params": {}},
    "save_best_model": {
        "metric": None,
        "mode": "min"
    },
}

# Data Schema

data_schema = {
    "predictors": ["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11", "x12", "x13", "x14"],
    "target": "y",
}

You can adjust the following parameters as needed:

  • max_depth - Used to control the size of the trees.

  • learning_rate - (shrinkage) Used as a multiplicative factor for the leaf values. Set this to one (1) for no shrinkage.

  • max_bins - The number of bins used to bin the data. Using fewer bins acts as a form of regularization. It is generally recommended to use as many bins as possible.

  • sketch_relative_accuracy - Determines the precision of the sketch technique used to approximate global feature distributions, which are used to find the best split for tree nodes.

  • max_iter - The maximum number of iterations of the boosting process, i.e. the maximum number of trees for binary classification. For multiclass classification, n_classes trees per iteration are built. The default is 100.

Set the machine learning task type to either classification or regression.

Specify any parameters associated with the task type in the params section.

The save_best_model option allows you to set the metric and mode used for model saving. By default, the metric is None, which saves the model from the previous round, and the mode is min.
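For example, a regression session would use the same configuration structure with only the ml_task type changed; the sketch below is illustrative and keeps the default save_best_model settings.

regression_model_config = {
    "strategy": {"name": "HistGradientBoosting", "params": {}},
    "model": {
        "params": {
            "max_depth": 4,
            "learning_rate": 0.05,
            "random_state": 23,
            "max_bins": 128,
            "sketch_relative_accuracy": 0.001,
            "max_iter": 100,
        }
    },
    # Only the task type differs from the classification example above
    "ml_task": {"type": "regression", "params": {}},
    "save_best_model": {"metric": None, "mode": "min"},
}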

The notebook also provides a sample data schema. For the purposes of testing GBM, use the sample schema as shown.

Configure task runner and data paths

In this example we are using an AWS task runner to run the training session. You must specify the workspace name and the task runner name.

Upload the sample data to your S3 bucket and update the HFL datapaths shown to match your upload location.

aws_taskrunner_profile = "<workspace>" # This is your workspace name
aws_taskrunner_name = "<taskrunner>" # Task runner name - must match what was supplied in UI to create task runner

base_aws_bucket = f'{aws_taskrunner_profile}-{aws_taskrunner_name}.integrate.ai'

# Example datapaths. Make sure that the data you want to work with exists in the base_aws_bucket for your task runner.

train_path1 = f's3://{base_aws_bucket}/synthetic/train_silo0.parquet'
test_path1 = f's3://{base_aws_bucket}/synthetic/test.parquet'
train_path2 = f's3://{base_aws_bucket}/synthetic/train_silo1.parquet'
test_path2 = f's3://{base_aws_bucket}/synthetic/test.parquet'

#Where to store the trained model
aws_storage_path = f's3://{base_aws_bucket}/model'

base_aws_bucket #Prints the base_aws_bucket name for reference

Create an HFL-GBM training session

Federated learning models created in integrate.ai are trained through sessions. You define the parameters required to train a federated model, including data and model configurations, in a session.

Create a session each time you want to train a new model.

The code sample demonstrates creating and starting a session with two training clients (two datasets) and ten rounds (or trees). It returns a session ID that you can use to track and reference your session.

# Create a session

gbm_session = client.create_fl_session(
    name="HFL session testing GBM",
    description="I am testing GBM session creation through a notebook",
    min_num_clients=2,
    num_rounds=10,
    package_name="iai_gbm",
    model_config=model_config,
    data_config=data_schema,
).start()

gbm_session.id      # Prints the session ID, for reference

Join the training session

The next step is to join the session with the sample data. This example has data for two datasets simulating two clients, as specified with the min_num_clients argument. The session begins training once the minimum number of clients have joined the session.

# Add clients to the training session

gbm_task_group_context = (
    SessionTaskGroup(gbm_session)
    .add_task(iai_tb_aws.hfl(train_path=train_path1, test_path=test_path1))
    .add_task(iai_tb_aws.hfl(train_path=train_path2, test_path=test_path2))
    .start()
)

  • gbm_session is the session created in the previous step

  • train_path is the path to and name of the sample dataset file

  • test_path is the path to and name of the sample test file

  • batch_size is the size of the batch of data

  • client_name is the unique name for each client

Check the task group status

Sessions may take some time to run depending on the compute environment. In the sample notebook and in this tutorial, we poll the server to determine the session status.

import json

# Print the status of each task in the task group
for i in gbm_task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))

# Monitor the task logs for the session
gbm_task_group_context.monitor_task_logs()

A session is complete when True is returned from the wait function:

gbm_task_group_context.wait(60*5, 2)
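For example, you can branch on the return value before retrieving metrics. This is a minimal sketch that assumes the two arguments are a timeout in seconds and a polling interval, as in the call above.

# Wait up to five minutes, polling periodically, then check the result
session_complete = gbm_task_group_context.wait(60*5, 2)
if session_complete:
    print("Session complete")
else:
    print("Session did not complete within the timeout")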

View the HFL-GBM training metrics

After the session completes successfully, you can view the training metrics and start making predictions.

  1. Retrieve the model metrics with as_dict().

  2. Plot the metrics.

Retrieve the model metrics

gbm_session.metrics().as_dict()

#Example metrics output:

{'session_id': '9fb054bc24',
 'federated_metrics': [{'loss': 0.6876291940808297},
  {'loss': 0.6825978879332543},
  {'loss': 0.6780059585869312},
  {'loss': 0.6737175708711147},
  {'loss': 0.6697578155398369},
  {'loss': 0.6658972384035587},
  {'loss': 0.6623568458259106},
  {'loss': 0.6589517279565335},
  {'loss': 0.6556690569519996},
  {'loss': 0.6526266353726387},
  {'loss': 0.6526266353726387}],
 'rounds': [{'user info': {'test_loss': 0.6817054933875072,
    'test_roc_auc': 0.6868823702288674,
    'test_accuracy': 0.7061688311688312,
    'test_num_examples': 8008},
   'user info': {'test_accuracy': 0.5720720720720721,
    'test_num_examples': 7992,
    'test_roc_auc': 0.6637941733389123,
    'test_loss': 0.6935647668830733}},
  {'user info': {'test_accuracy': 0.5754504504504504,
    'test_roc_auc': 0.6740578481919338,
    'test_num_examples': 7992,
    'test_loss': 0.6884922753070576},
   'user info': {'test_loss': 0.6767152831608197,
...
   'user info': {'test_loss': 0.6578156923815811,
    'test_num_examples': 7992,
    'test_roc_auc': 0.7210704078520924,
    'test_accuracy': 0.6552802802802803}}],
 'latest_global_model_federated_loss': 0.6526266353726387}
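Because the metrics are returned as a plain dictionary, you can also work with them directly. For example, based on the structure shown above, the following sketch collects the federated loss for each round:

metrics = gbm_session.metrics().as_dict()

# Collect the federated loss reported for each round
federated_losses = [round_metrics["loss"] for round_metrics in metrics["federated_metrics"]]
print(federated_losses)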

Plot the metrics

fig = gbm_session.metrics().plot()

Trained model parameters are accessible from the completed session

Model parameters can be retrieved using the model’s as_sklearn method.

model = gbm_session.model().as_sklearn()

model
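Because as_sklearn returns a standard scikit-learn estimator, you can persist it like any other scikit-learn model. The sketch below uses joblib, which is a common choice for saving scikit-learn models but is not part of the integrate.ai SDK; the file name is illustrative.

import joblib

# Save the trained model locally and reload it later for predictions
joblib.dump(model, "hfl_gbm_model.joblib")
model = joblib.load("hfl_gbm_model.joblib")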

Working with your existing data

Now that you have retrieved the trained model, you can use it to make predictions on your own data.

Below are example methods you could call on the model.

import pandas as pd

# data_path is the local folder that contains your test data
test_data = pd.read_parquet(f"{data_path}/test.parquet")

test_data.head()

Separate the target and predictor columns

Y = test_data["y"]

X = test_data[["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11", "x12", "x13", "x14"]]

Run model predictions

model.predict(X)

Result: array([0, 1, 0, ..., 0, 0, 1])

from sklearn.metrics import roc_auc_score

y_hat = model.predict_proba(X)

roc_auc_score(Y, y_hat[:, 1])

Result: 0.7082738332738332
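You can compute other metrics reported during training, such as accuracy, in the same way. A short sketch, assuming the same X and Y defined above:

from sklearn.metrics import accuracy_score

# Compare hard predictions against the true labels
accuracy_score(Y, model.predict(X))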

Note: When the training sample sizes are small, this model is more likely to overfit the training data.

Back to HFL model types