Gradient Boosted Models (HFL-GBM)

Gradient boosting is a machine learning algorithm for building predictive models that helps minimize the model's bias error. The gradient boosting model provided by integrate.ai is an HFL model that uses the sklearn implementation of HistGradientBoostingClassifier for classification tasks and HistGradientBoostingRegressor for regression tasks.
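For context, the sketch below shows the underlying scikit-learn estimator for the classification task, configured locally with the same kinds of parameters used later in this tutorial. This is an illustrative, non-federated example only; the federated workflow uses the integrate.ai SDK as described below.

from sklearn.ensemble import HistGradientBoostingClassifier

# Local, single-dataset illustration of the estimator that HFL-GBM is built on
local_clf = HistGradientBoostingClassifier(
    max_depth=4,
    learning_rate=0.05,
    max_bins=128,
    max_iter=100,
)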

The GBM sample notebook (integrateai_hfl_gbm.ipynb) provides sample code for running the SDK, and should be used in parallel with this tutorial. You can download the sample notebook here.

Sample HFL GBM Model Configuration and Data Schema

integrate.ai has a model class available for Gradient Boosted Models, called iai_gbm. This model is defined using a JSON configuration file during session creation.

The strategy for GBM is HistGradientBoosting.

# Model configuration

model_config = {
    "strategy": {
        "name": "HistGradientBoosting",
        "params": {}
    },
    "model": {
        "params": {
            "max_depth": 4,
            "learning_rate": 0.05,
            "random_state": 23,
            "max_bins": 128,
            "sketch_relative_accuracy": 0.001,
            "max_iter": 100,
        }
    },
    "ml_task": {"type": "classification", "params": {}},
    "save_best_model": {
        "metric": None,
        "mode": "min"
    },
}

# Data Schema

data_schema = {
    "predictors": ["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11", "x12", "x13", "x14"],
    "target": "y",
}

You can adjust the following parameters as needed:

  • max_depth - Used to control the size of the trees.

  • learning_rate - (shrinkage) Used as a multiplicative factor for the leaf values. Set this to one (1) for no shrinkage.

  • max_bins - The number of bins used to bin the data. Using fewer bins acts as a form of regularization. It is generally recommended to use as many bins as possible.

  • sketch_relative_accuracy - Determines the precision of the sketch technique used to approximate global feature distributions, which are used to find the best split for tree nodes.

  • max_iter - The maximum number of iterations of the boosting process, i.e. the maximum number of trees for binary classification. For multiclass classification, n_classes trees per iteration are built. The default is 100.

Set the machine learning task type to either classification or regression.

Specify any parameters associated with the task type in the params section.

The save_best_model option allows you to set the metric and mode used for model saving. By default, the metric is None, which saves the model from the previous round, and the mode is min.
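For example, a regression session would use the same configuration structure with only the ml_task type changed; the sketch below is illustrative and keeps the default save_best_model settings.

regression_model_config = {
    "strategy": {"name": "HistGradientBoosting", "params": {}},
    "model": {
        "params": {
            "max_depth": 4,
            "learning_rate": 0.05,
            "random_state": 23,
            "max_bins": 128,
            "sketch_relative_accuracy": 0.001,
            "max_iter": 100,
        }
    },
    # Only the task type differs from the classification example above
    "ml_task": {"type": "regression", "params": {}},
    "save_best_model": {"metric": None, "mode": "min"},
}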

The notebook also provides a sample data schema. For the purposes of testing GBM, use the sample schema as shown.

Configure task runner and data paths

In this example we are using an AWS task runner to run the training session. You must specify the workspace name and the task runner name.

Upload the sample data to your S3 bucket and update the HFL datapaths shown to match your upload location.

aws_taskrunner_profile = "<workspace>" # This is your workspace name
aws_taskrunner_name = "<taskrunner>" # Task runner name - must match what was supplied in UI to create task runner

base_aws_bucket = f'{aws_taskrunner_profile}-{aws_taskrunner_name}.integrate.ai'

# Example datapaths. Make sure that the data you want to work with exists in the base_aws_bucket for your task runner.

train_path1 = f's3://{base_aws_bucket}/synthetic/train_silo0.parquet'
test_path1 = f's3://{base_aws_bucket}/synthetic/test.parquet'
train_path2 = f's3://{base_aws_bucket}/synthetic/train_silo1.parquet'
test_path2 = f's3://{base_aws_bucket}/synthetic/test.parquet'

#Where to store the trained model
aws_storage_path = f's3://{base_aws_bucket}/model'

base_aws_bucket #Prints the base_aws_bucket name for reference

Create an HFL-GBM training session

Federated learning models created in integrate.ai are trained through sessions. You define the parameters required to train a federated model, including data and model configurations, in a session.

Create a session each time you want to train a new model.

The code sample demonstrates creating and starting a session with two training clients (two datasets) and ten rounds (or trees). It returns a session ID that you can use to track and reference your session.

# Create a session

gbm_session = client.create_fl_session(
    name="HFL session testing GBM",
    description="I am testing GBM session creation through a notebook",
    min_num_clients=2,
    num_rounds=10,
    package_name="iai_gbm",
    model_config=model_config,
    data_config=data_schema,
).start()

gbm_session.id      # Prints the session ID, for reference

Join the training session

The next step is to join the session with the sample data. This example has data for two datasets simulating two clients, as specified with the min_num_clients argument. The session begins training once the minimum number of clients have joined the session.

# Add clients to the training session

gbm_task_group_context = (
    SessionTaskGroup(gbm_session)
    .add_task(iai_tb_aws.hfl(train_path=train_path1, test_path=test_path1))
    .add_task(iai_tb_aws.hfl(train_path=train_path2, test_path=test_path2))
    .start()
)

  • gbm_session is the session created in the previous step

  • train_path is the path to and name of the sample dataset file

  • test_path is the path to and name of the sample test file

  • batch_size is the size of the batch of data

  • client_name is the unique name for each client

Check the task group status

Sessions may take some time to run depending on the compute environment. In the sample notebook and in this tutorial, we poll the server to determine the session status.

import json

# Print the status of each task in the task group
for i in gbm_task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))

# Monitor the task logs for the session
gbm_task_group_context.monitor_task_logs()

A session is complete when True is returned from the wait function:

gbm_task_group_context.wait(60*5, 2)
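For example, you can branch on the return value before retrieving metrics. This is a minimal sketch that assumes the two arguments are a timeout in seconds and a polling interval, as in the call above.

# Wait up to five minutes, polling periodically, then check the result
session_complete = gbm_task_group_context.wait(60*5, 2)
if session_complete:
    print("Session complete")
else:
    print("Session did not complete within the timeout")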

View the HFL-GBM training metrics

After the session completes successfully, you can view the training metrics and start making predictions.

  1. Retrieve the model metrics with as_dict().

  2. Plot the metrics.

Retrieve the model metrics

gbm_session.metrics().as_dict()

#Example metrics output:

{'session_id': '9fb054bc24',
 'federated_metrics': [{'loss': 0.6876291940808297},
  {'loss': 0.6825978879332543},
  {'loss': 0.6780059585869312},
  {'loss': 0.6737175708711147},
  {'loss': 0.6697578155398369},
  {'loss': 0.6658972384035587},
  {'loss': 0.6623568458259106},
  {'loss': 0.6589517279565335},
  {'loss': 0.6556690569519996},
  {'loss': 0.6526266353726387},
  {'loss': 0.6526266353726387}],
 'rounds': [{'user info': {'test_loss': 0.6817054933875072,
    'test_roc_auc': 0.6868823702288674,
    'test_accuracy': 0.7061688311688312,
    'test_num_examples': 8008},
   'user info': {'test_accuracy': 0.5720720720720721,
    'test_num_examples': 7992,
    'test_roc_auc': 0.6637941733389123,
    'test_loss': 0.6935647668830733}},
  {'user info': {'test_accuracy': 0.5754504504504504,
    'test_roc_auc': 0.6740578481919338,
    'test_num_examples': 7992,
    'test_loss': 0.6884922753070576},
   'user info': {'test_loss': 0.6767152831608197,
...
   'user info': {'test_loss': 0.6578156923815811,
    'test_num_examples': 7992,
    'test_roc_auc': 0.7210704078520924,
    'test_accuracy': 0.6552802802802803}}],
 'latest_global_model_federated_loss': 0.6526266353726387}
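Because the metrics are returned as a plain dictionary, you can also work with them directly. For example, based on the structure shown above, the following sketch collects the federated loss for each round:

metrics = gbm_session.metrics().as_dict()

# Collect the federated loss reported for each round
federated_losses = [round_metrics["loss"] for round_metrics in metrics["federated_metrics"]]
print(federated_losses)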

Plot the metrics

fig = gbm_session.metrics().plot()

Trained model parameters are accessible from the completed session

Model parameters can be retrieved using the model’s as_sklearn method.

model = gbm_session.model().as_sklearn()

model
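Because as_sklearn returns a standard scikit-learn estimator, you can persist it like any other scikit-learn model. The sketch below uses joblib, which is a common choice for saving scikit-learn models but is not part of the integrate.ai SDK; the file name is illustrative.

import joblib

# Save the trained model locally and reload it later for predictions
joblib.dump(model, "hfl_gbm_model.joblib")
model = joblib.load("hfl_gbm_model.joblib")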

Working with your existing data

Now that you have retrieved the trained model, you can use it to make predictions on your own data.

Below are example methods you could call on the model.

import pandas as pd

# data_path is the local folder that contains your test data
test_data = pd.read_parquet(f"{data_path}/test.parquet")

test_data.head()

Separate the target and predictor columns

Y = test_data["y"]

X = test_data[["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11", "x12", "x13", "x14"]]

Run model predictions

model.predict(X)

Result: array([0, 1, 0, ..., 0, 0, 1])

from sklearn.metrics import roc_auc_score

y_hat = model.predict_proba(X)

roc_auc_score(Y, y_hat[:, 1])

Result: 0.7082738332738332
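You can compute other metrics reported during training, such as accuracy, in the same way. A short sketch, assuming the same X and Y defined above:

from sklearn.metrics import accuracy_score

# Compare hard predictions against the true labels
accuracy_score(Y, model.predict(X))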

Note: When the training sample sizes are small, this model is more likely to overfit the training data.

Back to HFL model types