Welcome to integrate.ai

integrate.ai is the safe data collaboration platform that enables data science leaders to activate data silos by moving models, not data.

Unlock quality data at scale

Collaborate within and across organizations, jurisdictions, and environments without moving data.

Reduce collaboration effort

Efficiently collaborate without copying, centralizing, or transferring data directly.

Protect privacy & trust

Avoid compliance hurdles through a trusted platform with baked-in security, safety, and privacy.

Developer tools for federated learning and analytics

Machine learning and analytics on distributed data require complex infrastructure and tooling that drain valuable developer resources. integrate.ai abstracts away the complexity, making it easier than ever to integrate federated learning and analytics into your product.

Train deep learning models, GLMs, CNNs, LSTMs, Transformers, FFNNs, decision trees, and more.
Calculate statistics and generate visualizations to understand datasets individually and in aggregate.
Run data cleaning and generate new features on decentralized data to prepare for model training.

System Overview

integrate.ai provides a system that removes the infrastructure and administration challenges of distributed machine learning and lets you integrate it directly into your existing workflow. With privacy-enhancing technologies such as federated learning and differential privacy, integrate.ai allows you to train models and run analytics across distributed data without compromising data privacy.

Our platform consists of the following components:

An SDK with a REST API that exposes a full suite of data science tools for privacy-safe analytics and AI across your intelligence network. It also supports notebook-based ad hoc discovery and exploratory analysis.
A Task Server that automatically orchestrates federated analytics and AI tasks across connected data nodes.
Task runners that apply governance controls, make data nodes discoverable, and execute analytics and AI tasks.

QUICKSTART

This guide walks you through setting up integrate.ai in three steps:

Set up your Workspace
Set up your Cloud Environment
Register datasets

Set up your Workspace

A workspace represents a private customer-controlled environment within integrate.ai. Depending on their roles, users in a workspace can register data products and run evaluation jobs on that data.

Note: Keep track of your workspace-name: it is the text located after the “login-” component of your workspace URL (e.g., login-<workspace-name>.integrateai.net). Reach out to your Customer Success Manager if you need to confirm your workspace-name.
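
As a minimal sketch of the rule in the note above, the following Python helper pulls the workspace name out of a workspace URL. The function name `workspace_name_from_url` and the example workspace `acme` are hypothetical; only the `login-<workspace-name>.integrateai.net` pattern comes from this guide.

```python
from urllib.parse import urlparse

LOGIN_PREFIX = "login-"
DOMAIN_SUFFIX = ".integrateai.net"


def workspace_name_from_url(url: str) -> str:
    """Return the text between 'login-' and '.integrateai.net' in a workspace URL."""
    # urlparse only fills in the hostname when a scheme is present, so add one if missing.
    if "://" not in url:
        url = "https://" + url
    host = urlparse(url).hostname or ""
    if not (host.startswith(LOGIN_PREFIX) and host.endswith(DOMAIN_SUFFIX)):
        raise ValueError(f"not a workspace login URL: {url!r}")
    name = host[len(LOGIN_PREFIX):-len(DOMAIN_SUFFIX)]
    if not name:
        raise ValueError("workspace name is empty")
    return name


# 'acme' is a made-up workspace used only for illustration.
print(workspace_name_from_url("https://login-acme.integrateai.net"))
```

If the helper rejects your URL, confirm the workspace name with your Customer Success Manager rather than guessing it.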

Activate Workspace Administrator:

To set up your Administrator account, first provide your integrate.ai Customer Success Manager with the full name and work email address of your Admin user. As Admin you can add and remove users, and have access to all controls and functionalities of the integrate.ai platform.

Once a Customer Success Manager has created an account for you, you will receive an invitation email from support@mail.integrateai.net to activate your account.

Follow the instructions provided in the email to complete activation.

Invite additional users:

The Administrator can decide which users can access their workspace and specify their roles.

The current user roles are: Administrator, Data Custodian, and Model Builder. These roles are defined below.

Data Custodian: only has the ability to register datasets to the workspace.
Model Builder: has the ability to register datasets, and run data science and evaluation jobs on these datasets in the workspace.
Administrator: has full access to all platform capabilities within the workspace, including the ability to add, remove, or modify users and their role; add or remove datasets; and run data science and evaluation jobs on the registered datasets.
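
The role definitions above can be summarized as a small permission table. This is only an illustrative sketch: the `Permission` names and the `can` helper are hypothetical, and the platform itself enforces the real access rules.

```python
from enum import Enum, auto


class Permission(Enum):
    """Hypothetical capability names derived from the role descriptions."""
    REGISTER_DATASETS = auto()
    RUN_JOBS = auto()
    REMOVE_DATASETS = auto()
    MANAGE_USERS = auto()


ROLE_PERMISSIONS = {
    "Data Custodian": {Permission.REGISTER_DATASETS},
    "Model Builder": {Permission.REGISTER_DATASETS, Permission.RUN_JOBS},
    # Administrators have every capability within the workspace.
    "Administrator": set(Permission),
}


def can(role: str, permission: Permission) -> bool:
    """Return True if the role includes the permission; unknown roles get nothing."""
    return permission in ROLE_PERMISSIONS.get(role, set())


print(can("Model Builder", Permission.RUN_JOBS))
print(can("Data Custodian", Permission.RUN_JOBS))
```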

Follow these steps to invite additional users to your workspace:

In the side navigation bar, click Settings -> Members -> Invite Members.
Enter the details of the user and select the appropriate role.
Click Invite member at the bottom of the popup. An email is sent to the member to activate their account.

Set up your Cloud Environment

It's now time to set up your cloud environment and register a task runner. Once a task runner has been provisioned, your data product can easily be shared without having to move it out of your existing cloud environment.

Follow the steps listed in the IT Administrator workflow section to set up your appropriate cloud environment.

Register a task runner

You can now register a task runner in your workspace so that you can register a dataset with it.

To do so, log in to the workspace with your account and complete the following steps:

In the side navigation bar, click Settings -> Task Runners -> Register.
In the pop-up, select the appropriate cloud that matches your infrastructure:
- AWS Task Runner
- Azure Task Runner

Register datasets

Registering your data allows you and other partners to collaborate on data science tasks.

Prepare your data product(s)

Before registering your data, you must prepare your data product to enable evaluation jobs.

Your data product(s) must be engineered to meet the criteria laid out in Data Requirements.

Note: If your data is already engineered out-of-the-box, skip to Register your data product(s) below. However, if you need to do additional processing, it would be most useful for the Data Consumer to keep the original columns, and append the feature-engineered column with this nomenclature: <original_column_name>_processed.

Once the above criteria are met, your data is ready for registration.
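
The column naming convention from the note above could look like the following sketch. It uses plain Python rows for brevity; the helper name `add_processed_column`, the `income` column, and the sample values are all hypothetical.

```python
from __future__ import annotations

from typing import Callable


def add_processed_column(rows: list[dict], column: str, transform: Callable) -> list[dict]:
    """Return new rows that keep `column` and add `<column>_processed` next to it."""
    processed_name = f"{column}_processed"
    return [{**row, processed_name: transform(row[column])} for row in rows]


# Made-up sample data: the original string column is kept, the cleaned value is appended.
rows = [{"id": 1, "income": "52,000"}, {"id": 2, "income": "48,500"}]
cleaned = add_processed_column(rows, "income", lambda v: int(v.replace(",", "")))
print(cleaned[0])  # {'id': 1, 'income': '52,000', 'income_processed': 52000}
```

Keeping the original column alongside the processed one lets the Data Consumer see what the raw values looked like before feature engineering.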

Register your data product(s)

Next, register your data through your workspace by selecting a task runner, specifying the dataset URI, and uploading associated metadata.

AWS cloud environment
Azure cloud environment

After registration, your dataset is available on the Library page.

Data Provider Quickstart

Once your data product(s) are registered, your final step is to complete evaluation template notebooks and connect a data partner, so that they can conduct an evaluation of your data.

Complete evaluation template

Data Consumers perform data evaluation with integrate.ai to understand how valuable a Data Provider's data product could be to their data science task(s) or business problem(s).

There are two core questions Data Consumers typically seek to answer when evaluating a model or data product:

How much data from a data product is usable in reference to my internal data?
How relevant or useful is the data product to my data science task, use case, or project?

integrate.ai’s evaluation templates enable you to provide a simple, guided method for your customers (i.e., Data Consumers) to answer these evaluation questions, by performing data science jobs on your data product(s).

Creating evaluation templates

You can collaborate with your Customer Success Manager to build out evaluation templates on your data products.

Follow these steps to start building out your templates:

Provide the following inputs to your Customer Success Manager:
- A definition of the data products (number of datasets, dictionaries, and supporting material)
- List of target variables that the data products can be used for
- Data product target variable mapping (see example table below)
  Data product target variable mapping example:

	Feature 1	Feature 2	Feature N
Target 1	X	X
Target 2	X	X
Target N			X

integrate.ai will provide you with pre-built evaluation templates in the form of a Jupyter notebook, using the information above.
Configure the template for your data product by following the instructions in the template.
Upload these notebooks as part of your data product registration step to make them available to Data Consumers.
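
One way to hand over the target variable mapping is as a simple dictionary. The sketch below encodes the example mapping table, reading each X in the column where it appears; the names `TARGET_FEATURES`, `features_for`, and `targets_for` are hypothetical and not part of any integrate.ai template.

```python
from __future__ import annotations

# Keys are targets; values are the features marked with an X in the example table.
TARGET_FEATURES: dict[str, list[str]] = {
    "Target 1": ["Feature 1", "Feature 2"],
    "Target 2": ["Feature 1", "Feature 2"],
    "Target N": ["Feature N"],
}


def features_for(target: str) -> list[str]:
    """Features that can be used to model the given target."""
    return TARGET_FEATURES.get(target, [])


def targets_for(feature: str) -> list[str]:
    """Targets that the given feature is mapped to."""
    return [t for t, feats in TARGET_FEATURES.items() if feature in feats]


print(features_for("Target 1"))
print(targets_for("Feature N"))
```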

Data connection

Follow the steps in Data Connection to connect a data partner to view and use your data product.

DATA CONNECTION

Connect a partner to your data

As a data provider, you register your datasets in your workspace to make them available to your own company. You can also connect a partner or data consumer to your data to allow them to evaluate it in secure, private, federated learning sessions.

To connect a partner:

In your Library, click a dataset name to select it.
On the upper right of the dataset description, click Connect Partner.
On the Connect data pop-up, fill in the following information:

a. Email - address of the partner you want to connect with.

b. Workspace name - the name of your partner's workspace. Contact your Customer Success Manager to confirm this value.

c. Message - the connecting message is auto-generated. You can customize it or add information.
Click Connect to send the email.

The data consumer clicks a link in the email they receive to view the connected dataset in their Library.

Viewing data connections

A dataset with a connection has an icon on its description in the Library.

Click the dataset to see a list of connections in the dataset description (bottom of the right panel).

Note: Connected data consumers cannot see any other information about the connection. They cannot see other connections to the same dataset.

Disconnecting data

As the data provider, you retain full control over who is connected to your data.

To disconnect a data consumer:

Click the dataset in the Library.
Locate the connection at the bottom of the dataset description.
Click the trash can icon to disconnect the consumer from the data.
Confirm that you want to disconnect.

DEPLOYMENT

integrate.ai provides a safe data collaboration platform that enables data science leaders to activate data silos by moving models, not data. It democratizes use of the cloud by taking care of all the computational aspects of deployment, from container orchestration and deployment to security and failovers. By harnessing the power of cloud computing and AI, we enable data scientists and R&D professionals with little or no computational and machine learning training to analyze and make sense of their data in the fastest and safest way possible.

The image below illustrates the high-level system architecture and how the managed environments interact.

The task runners use the serverless capabilities of cloud environments (such as AWS Batch and Fargate). This greatly reduces the administration burden and ensures that resource cost is only incurred while task runners are in use.

IT Administrator Workflow

The initial setup for your workspace must be performed by an AWS or Microsoft Azure Administrator. This Administrator must have the rights necessary to configure the permissions and access needed by the cloud-based task runners.

Follow the steps for either AWS configuration or Azure configuration to generate the roles and policies or service principal information.
Provide the cloud environment details to the integrate.ai Workspace Administrator.

a. For AWS, provide the provisioner and runtime role ARNs.

b. For Azure, provide the Resource group, Service principal ID, Service principal secret, Subscription ID, and Tenant ID.

Alternatively, the IT Admin may choose to take on the Workspace Administrator role in the workspace and configure one or more task runners for their users.
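
Before handing the cloud environment details to the Workspace Administrator, it can help to sanity-check them. The sketch below is an assumption-laden illustration: the class names, field names, and the ARN pattern (standard `aws` partition only) are hypothetical and not an integrate.ai API.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field, fields

# Assumed shape of an IAM role ARN in the standard "aws" partition.
ROLE_ARN = re.compile(r"^arn:aws:iam::[0-9]{12}:role/[\w+=,.@/-]+$")


@dataclass(frozen=True)
class AwsTaskRunnerDetails:
    provisioner_role_arn: str
    runtime_role_arn: str

    def __post_init__(self):
        for f in fields(self):
            value = getattr(self, f.name)
            if not ROLE_ARN.match(value):
                raise ValueError(f"{f.name} is not an IAM role ARN: {value!r}")


@dataclass(frozen=True)
class AzureTaskRunnerDetails:
    resource_group: str
    service_principal_id: str
    # repr=False keeps the secret out of logs and printed output.
    service_principal_secret: str = field(repr=False)
    subscription_id: str
    tenant_id: str

    def __post_init__(self):
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError(f"missing Azure details: {', '.join(missing)}")
```

The account ID `123456789012` used in examples such as `arn:aws:iam::123456789012:role/iai-taskrunner-provisioner` is a placeholder; substitute the ARNs copied from your own account.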

AWS configuration for task runners

Before you get started using integrate.ai in your cloud for training sessions, there are a few configuration steps that must be completed. You must grant integrate.ai permission to deploy task runner infrastructure in your cloud, by creating a limited permission Role in AWS for the provisioner and for the runtime agent. This is a one-time process - once created, you can use these roles for any task runners in your environment.

This section walks you through the required configuration.

Create a provisioner role and policy.
Create a runtime role and policy.

Create AWS Provisioner Policy

{
    "Version":"2012-10-17",
    "Statement":[
        {
            "Sid":"IAM",
            "Effect":"Allow",
            "Action":[
                "iam:CreateInstanceProfile",
                "iam:RemoveRoleFromInstanceProfile",
                "iam:DeleteInstanceProfile",
                "iam:AddRoleToInstanceProfile"
            ],
            "Resource":"arn:aws:iam::*:instance-profile/iai-*"
        },
        {
            "Sid":"IAMPass",
            "Effect":"Allow",
            "Action":[
                "iam:PassRole"
            ],
            "Resource":"arn:aws:iam::*:role/iai-*"
        },
        {
            "Sid":"IAMRead",
            "Effect":"Allow",
            "Action":[
                "iam:GetInstanceProfile",
                "iam:GetPolicy",
                "iam:GetRole",
                "iam:GetPolicyVersion",
                "iam:ListRolePolicies",
                "iam:ListAttachedRolePolicies",
                "iam:ListPolicyVersions",
                "iam:ListInstanceProfilesForRole"
            ],
            "Resource":"*"
        },
        {
            "Sid":"Batch",
            "Effect":"Allow",
            "Action":[
                "batch:RegisterJobDefinition",
                "batch:DeregisterJobDefinition",
                "batch:CreateComputeEnvironment",
                "batch:UpdateComputeEnvironment",
                "batch:DeleteComputeEnvironment",
                "batch:CreateJobQueue",
                "batch:UpdateJobQueue",
                "batch:DeleteJobQueue"
            ],
            "Resource":[
                "arn:aws:batch:*:*:compute-environment/iai-*",
                "arn:aws:batch:*:*:job-definition/iai-*",
                "arn:aws:batch:*:*:job-queue/iai-*",
                "arn:aws:batch:*:*:job-definition-revision/iai-*",
                "arn:aws:batch:*:*:scheduling-policy/*"
            ]
        },
        {
            "Sid":"BatchRead",
            "Effect":"Allow",
            "Action":[
                "batch:DescribeComputeEnvironments",
                "batch:DescribeJobDefinitions",
                "batch:DescribeJobQueues"
            ],
            "Resource":"*"
        },
        {
            "Sid":"CW",
            "Effect":"Allow",
            "Action":[
                "logs:ListTagsLogGroup",
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:DeleteLogGroup",
                "logs:TagResource",
                "logs:PutLogEvents",
                "logs:PutRetentionPolicy"
            ],
            "Resource":[
                "arn:aws:logs:*:*:log-group:iai-*"
            ]
        },
        {
            "Sid":"CWRead",
            "Effect":"Allow",
            "Action":[
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams"
            ],
            "Resource":"*"
        },
        {
            "Sid":"ECSFargate",
            "Effect":"Allow",
            "Action":[
                "ecs:CreateCluster",
                "ecs:DescribeClusters",
                "ecs:DeleteCluster",
                "ecs:UpdateCluster",
                "ecs:RegisterTaskDefinition",
                "ecs:TagResource"
            ],
            "Resource":[
                "arn:aws:ecs:*:*:cluster/iai-*",
                "arn:aws:ecs:*:*:task-definition/iai-*"
            ]
        },
        {
            "Sid":"ECSFargateRead",
            "Effect":"Allow",
            "Action":[
                "ecs:DescribeTaskDefinition",
                "ecs:DeregisterTaskDefinition"
            ],
            "Resource":"*"
        },
        {
            "Sid":"Kms",
            "Effect":"Allow",
            "Action":[
                "kms:ListAliases",
                "kms:CreateKey",
                "kms:CreateAlias",
                "kms:DescribeKey",
                "kms:GetKeyPolicy",
                "kms:GetKeyRotationStatus",
                "kms:ListResourceTags",
                "kms:ScheduleKeyDeletion",
                "kms:CreateGrant",
                "kms:ListGrants",
                "kms:RevokeGrant",
                "kms:TagResource",
                "kms:UntagResource"
            ],
            "Resource":"*"
        },
        {
            "Sid":"KmsDeleteAlias",
            "Effect":"Allow",
            "Action":[
                "kms:DeleteAlias"
            ],
            "Resource":[
                "arn:aws:kms:*:*:alias/iai/*",
                "arn:aws:kms:*:*:key/*"
            ]
        },
        {
            "Sid":"S3Create",
            "Effect":"Allow",
            "Action":[
                "s3:CreateBucket",
                "s3:DeleteBucket",
                "s3:DeleteBucketPolicy",
                "s3:PutBucketVersioning",
                "s3:PutBucketPublicAccessBlock",
                "s3:PutEncryptionConfiguration"
            ],
            "Resource":[
                "arn:aws:s3:::*integrate.ai",
                "arn:aws:s3:::*integrate.ai/*"
            ]
        },
        {
            "Sid":"S3Read",
            "Effect":"Allow",
            "Action":[
                "s3:GetBucketCors",
                "s3:GetBucketPolicy",
                "s3:PutBucketPolicy",
                "s3:GetBucketWebsite",
                "s3:GetBucketVersioning",
                "s3:GetLifecycleConfiguration",
                "s3:GetAccelerateConfiguration",
                "s3:GetBucketRequestPayment",
                "s3:GetBucketLogging",
                "s3:GetBucketPublicAccessBlock",
                "s3:GetBucketAcl",
                "s3:GetBucketObjectLockConfiguration",
                "s3:GetReplicationConfiguration",
                "s3:GetBucketTagging",
                "s3:GetEncryptionConfiguration",
                "s3:ListBucket"
            ],
            "Resource":[
                "arn:aws:s3:::*integrate.ai",
                "arn:aws:s3:::*integrate.ai/*"
            ]
        },
        {
            "Sid":"VPCDescribe",
            "Effect":"Allow",
            "Action":[
                "ec2:DescribeVpcs",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcAttribute",
                "ec2:DescribeVpcClassicLinkDnsSupport",
                "ec2:DescribeVpcClassicLink",
                "ec2:DescribeInternetGateways",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSecurityGroupRules",
                "ec2:DescribeRouteTables",
                "ec2:DescribeNetworkAcls",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeNatGateways",
                "ec2:DescribeAddresses",
                "ec2:DescribeAccountAttributes",
                "ec2:DescribeAvailabilityZones"
            ],
            "Resource":"*"
        },
        {
            "Sid":"VPCCreate",
            "Effect":"Allow",
            "Action":[
                "ec2:CreateVpc",
                "ec2:CreateTags",
                "ec2:AllocateAddress",
                "ec2:ReleaseAddress",
                "ec2:CreateSubnet",
                "ec2:ModifySubnetAttribute",
                "ec2:RevokeSecurityGroupEgress",
                "ec2:RevokeSecurityGroupIngress",
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:AuthorizeSecurityGroupEgress",
                "ec2:CreateRouteTable",
                "ec2:CreateRoute",
                "ec2:CreateInternetGateway",
                "ec2:AttachInternetGateway",
                "ec2:AssociateRouteTable",
                "ec2:ModifyVpcAttribute",
                "ec2:CreateSecurityGroup",
                "ec2:CreateNatGateway"
            ],
            "Resource":"*"
        },
        {
            "Sid":"VPCDelete",
            "Effect":"Allow",
            "Action":[
                "ec2:DeleteSubnet",
                "ec2:DisassociateRouteTable",
                "ec2:DeleteSecurityGroup",
                "ec2:DeleteRoute",
                "ec2:DeleteNatGateway",
                "ec2:DeleteRouteTable",
                "ec2:DisassociateAddress",
                "ec2:DetachInternetGateway",
                "ec2:DeleteInternetGateway",
                "ec2:DeleteVpc"
            ],
            "Resource":"*"
        }
    ]
}

This policy lists all of the required permissions for integrate.ai to create the necessary infrastructure.

The provisioner creates the following components and performs the required related tasks:

Configures AWS Batch infrastructure, including creating roles and policies
Configures AWS Fargate infrastructure, including creating roles and policies
Creates a VPC and compute environment - all compute runs in the VPC created for the task runner
Creates an S3 bucket that is encrypted with a customer key created by the provisioner
Pulls the required client and server container images from an integrate.ai ECR (elastic container repository)

Note: AWS imposes default maximums for some infrastructure components, as noted below. If you exceed these limits, provisioning a new task runner may fail. Request an increase to the limit through your AWS console.

VPCs - maximum 5 by default
EC2 NAT gateways - maximum 5 by default

To create the provisioner policy:

In the AWS Console, go to IAM, select Policies, and click Create policy.
On the Specify permissions page, click the JSON tab.
Paste in the Provisioner JSON policy provided by integrate.ai (shown above).
Click Next.
Name the policy iai-provisioner-policy and click Create policy.
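
If you prefer scripting to the console, the same policy could be created with the AWS SDK. The following is a minimal sketch under stated assumptions: `create_provisioner_policy` is a hypothetical helper, `iam_client` is expected to be a boto3 IAM client, and the caller needs permission to create IAM policies. The console steps above remain the documented path.

```python
import json


def create_provisioner_policy(iam_client, policy_document: dict,
                              name: str = "iai-provisioner-policy") -> str:
    """Create the IAM policy from a parsed JSON document and return its ARN."""
    response = iam_client.create_policy(
        PolicyName=name,
        # IAM expects the policy document as a JSON string, not a dict.
        PolicyDocument=json.dumps(policy_document),
    )
    return response["Policy"]["Arn"]


# Example call (requires boto3 and AWS credentials; the file name is hypothetical):
#   import boto3
#   with open("iai-provisioner-policy.json") as f:
#       arn = create_provisioner_policy(boto3.client("iam"), json.load(f))
```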

Create AWS Provisioner Role

# AWS Provisioner Custom trust policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::919740802015:root"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ],
            "Condition": {
                "ArnLike": {
                    "aws:PrincipalArn": "arn:aws:iam::919740802015:role/iai-taskrunner-provision-*"
                }
            }
        }
    ]
}

You must grant integrate.ai access to deploy task runner infrastructure in your cloud, by creating a limited permission role in AWS for the provisioner.
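
The `ArnLike` condition in the trust policy limits which principals in the integrate.ai account may assume the role. The sketch below approximates that matching with Python's `fnmatchcase`, comparing each colon-delimited ARN component separately. It is an illustration only: `arn_like` is a hypothetical helper, the role names in the examples are made up, and AWS's own evaluation is authoritative.

```python
from fnmatch import fnmatchcase

# Pattern copied from the provisioner trust policy in this guide.
TRUSTED_PATTERN = "arn:aws:iam::919740802015:role/iai-taskrunner-provision-*"


def arn_like(arn: str, pattern: str) -> bool:
    """Approximate ArnLike: case-sensitive wildcard match per ARN component."""
    arn_parts = arn.split(":", 5)
    pattern_parts = pattern.split(":", 5)
    if len(arn_parts) != 6 or len(pattern_parts) != 6:
        return False
    return all(fnmatchcase(a, p) for a, p in zip(arn_parts, pattern_parts))


# Made-up role names, used only to show which shapes match the pattern.
print(arn_like("arn:aws:iam::919740802015:role/iai-taskrunner-provision-example", TRUSTED_PATTERN))
print(arn_like("arn:aws:iam::919740802015:role/some-other-role", TRUSTED_PATTERN))
```

The first call matches and the second does not, which is why a role with a different name in the same account cannot assume the provisioner role.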

To create the provisioner role:

In the left navigation bar of the console, under IAM, select Roles, and click Create role.
In Step 1 - Select Trusted Entity, click Custom trust policy.
Paste in the custom trust policy provided by integrate.ai (shown above).
Click Next.
In Step 2 - Add permissions, search for and select the policy you created (iai-provisioner-policy).
(Optional) If your environment requires a permission boundary, attach it to the role.
Click Next.
Provide the following Role name: iai-taskrunner-provisioner. Important: Do not edit or change this name.
Click Create role.

Copy and save the ARN for the provisioner role. Provide the ARN to any Workspace Administrator or Model Builder who will be creating task runners.
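
The role creation steps above could also be scripted. This sketch assumes a boto3 IAM client is passed in as `iam_client`; the helper name `create_role_with_policy` and its parameters are hypothetical, and the optional permissions boundary mirrors the optional step in the list.

```python
import json


def create_role_with_policy(iam_client, role_name: str, trust_policy: dict,
                            policy_arn: str, permissions_boundary_arn: str = None) -> str:
    """Create a role with the given trust policy, attach one policy, return the role ARN."""
    kwargs = {
        "RoleName": role_name,
        "AssumeRolePolicyDocument": json.dumps(trust_policy),
    }
    if permissions_boundary_arn:
        kwargs["PermissionsBoundary"] = permissions_boundary_arn
    role = iam_client.create_role(**kwargs)
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    # This ARN is the value to share with whoever creates task runners.
    return role["Role"]["Arn"]


# Example call (requires boto3 and AWS credentials; the policy ARN is a placeholder):
#   import boto3
#   arn = create_role_with_policy(boto3.client("iam"), "iai-taskrunner-provisioner",
#                                 trust_policy, "arn:aws:iam::123456789012:policy/iai-provisioner-policy")
```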

Create AWS Runtime policy

# AWS Runtime policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowBatchDescribeJobs",
            "Effect": "Allow",
            "Action": [
                "batch:DescribeJobs",
                "batch:TagResource"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowBatchAccess",
            "Effect": "Allow",
            "Action": [
                "batch:TerminateJob",
                "batch:SubmitJob",
                "batch:TagResource",
                "batch:CancelJob"
            ],
            "Resource": [
                "arn:aws:batch:*:*:job-definition/iai-fl-client-batch-job-*",
                "arn:aws:batch:*:*:job-queue/iai-fl-client-batch-job-queue-*",
                "arn:aws:batch:*:*:job/*"
            ]
        },
        {
            "Sid": "AllowECSUpdateAccess",
            "Effect": "Allow",
            "Action": [
                "ecs:DescribeContainerInstances",
                "ecs:DescribeTasks",
                "ecs:ListTasks",
                "ecs:UpdateContainerAgent",
                "ecs:StartTask",
                "ecs:StopTask",
                "ecs:RunTask"
            ],
            "Resource": [
                "arn:aws:ecs:*:*:cluster/iai-fl-server-ecs-cluster-*",
                "arn:aws:ecs:*:*:task/iai-fl-server-ecs-cluster-*",
                "arn:aws:ecs:*:*:task-definition/iai-fl-server-fargate-job-*"
            ]
        },
        {
            "Sid": "AllowECSReadAccess",
            "Effect": "Allow",
            "Action": [
                "ecs:DescribeTaskDefinition"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Sid": "AllowPassRole",
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::*:role/iai-fl-server-fargate-task-role-*-*",
                "arn:aws:iam::*:role/iai-fl-server-fargate-execution-role-*-*"
            ]
        },
        {
            "Sid": "AllowSSMAccessForTokens",
            "Effect": "Allow",
            "Action": [
                "ssm:PutParameter",
                "ssm:DescribeParameters",
                "ssm:GetParameters",
                "ssm:GetParameter",
                "ssm:DeleteParameter",
                "ssm:DeleteParameters"
            ],
            "Resource": [
                "arn:aws:ssm:*:*:parameter/fl-server-*-token",
                "arn:aws:ssm:*:*:parameter/fl-client-*-token"
            ]
        },
        {
            "Sid": "AllowS3Access",
            "Effect": "Allow",
            "Action": [
                "s3:*Object",
                "s3:ListBucket",
                "s3:GetObjectAcl",
                "s3:GetObjectVersion",
                "s3:ListBucketVersions",
                "s3:GetEncryptionConfiguration"
            ],
            "Resource": [
                "arn:aws:s3:::*.integrate.ai",
                "arn:aws:s3:::*.integrate.ai/*"
            ]
        },
        {
            "Sid": "DenyS3BucketReadAccess",
            "Effect": "Deny",
            "Action": [
                "s3:*Object",
                "s3:ListBucket",
                "s3:GetEncryptionConfiguration",
                "s3:GetObjectAcl",
                "s3:ListBucketVersions",
                "s3:PutObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::*.integrate.ai/taskrunner",
                "arn:aws:s3:::*.integrate.ai/taskrunner/*"
            ]
        },
        {
            "Sid": "AllowKMSUsage",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:DescribeKey",
                "kms:Encrypt",
                "kms:GenerateDataKey"
            ],
            "Resource": "*",
            "Condition": {
                "ForAnyValue:StringLike": {
                    "kms:ResourceAliases": "alias/iai/*"
                }
            }
        },
        {
            "Sid": "AllowLogGroupAndStreamAccess",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:CreateLogGroup",
                "logs:Describe*",
                "logs:List*",
                "logs:StartQuery",
                "logs:StopQuery",
                "logs:TestMetricFilter",
                "logs:FilterLogEvents",
                "logs:Get*"
            ],
            "Resource": [
                "arn:aws:logs:*:*:log-group:iai-fl-server-fargate-log-group-*:log-stream:*",
                "arn:aws:logs:*:*:log-group:/aws/batch/job:log-stream:iai-fl-client-batch-job-*",
                "arn:aws:logs:*:*:log-group:iai-proxy-log*:log-stream:*"
            ]
        },
        {
            "Sid": "AllowTaskInfo",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeInstances"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage"
            ],
            "Resource": "arn:aws:ecr:*:*:repository/edge*"
        }
    ]
}

You must grant integrate.ai access to run the task runner in your cloud environment by creating a limited permission role in AWS for the runtime agent.

Important: This policy provides the specific permissions for integrate.ai to call the task runner and dispatch tasks, such as training a machine learning model. You may review this policy to comply with your organization's policies. Contact your Customer Success Manager if this policy requires modification.
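
When reviewing the policy against your organization's rules, a quick summary of each statement can help reviewers focus on the broadest grants. The sketch below is a hypothetical review aid, not an integrate.ai tool: `summarize_policy` flags statements that apply to any resource and lists actions containing wildcards.

```python
from __future__ import annotations


def _as_list(value) -> list:
    """IAM allows a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)


def summarize_policy(policy: dict) -> list[dict]:
    """Return one review row per statement in an IAM policy document."""
    rows = []
    for index, statement in enumerate(policy["Statement"]):
        actions = _as_list(statement["Action"])
        resources = _as_list(statement["Resource"])
        rows.append({
            # Sid is optional; statements without one get a positional label.
            "sid": statement.get("Sid", f"statement-{index}"),
            "effect": statement["Effect"],
            "any_resource": "*" in resources,
            "wildcard_actions": [a for a in actions if "*" in a],
        })
    return rows


# Example use (the file name is hypothetical):
#   import json
#   with open("iai-runtime-policy.json") as f:
#       for row in summarize_policy(json.load(f)):
#           print(row)
```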

To create the runtime policy:

In the AWS Console, go to IAM, select Policies, and click Create policy.
On the Specify permissions page, click the JSON tab.
Paste in the Runtime JSON policy provided by integrate.ai (shown above).
Click Next.
Name the policy iai-runtime-policy and click Create policy.

Create AWS Runtime role

# AWS Runtime custom trust policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::919740802015:root"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ],
            "Condition": {
                "ArnLike": {
                    "aws:PrincipalArn": [
                        "arn:aws:iam::919740802015:role/IAI-API-*"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "ecs-tasks.amazonaws.com",
                    "batch.amazonaws.com",
                    "ec2.amazonaws.com"
                ]
            },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ],
            "Condition": {}
        }
    ]
}

To create the runtime role:

In the left navigation bar of the console, under IAM, select Roles, and click Create role.
Select Custom trust policy.
Paste in the custom trust relationship JSON provided by integrate.ai (shown above).
Click Next.
On the Add permissions page, search for and select the policy you just created (iai-runtime-policy).
Search for and add the following AWS policies: AmazonEC2ContainerServiceforEC2Role, AWSBatchServiceRole, AmazonECSTaskExecutionRolePolicy.
Click Next.
Provide the following Role name: iai-taskrunner-runtime. Important: Do not edit or change this name.
Click Create role.

Copy and save the ARN for the runtime role. Provide the ARN to any Workspace Administrator or Model Builder who will be creating task runners.
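
After creating the runtime role, you may want to confirm that all four policies from the steps above are attached. The sketch below makes several assumptions: the helper names are hypothetical, `iam_client` is expected to be a boto3 IAM client, the custom policy uses the default IAM path, and the three AWS managed policies are assumed to live under the `service-role/` path (verify the ARNs in your IAM console). Pagination is ignored.

```python
from __future__ import annotations

import re

AWS_MANAGED_POLICIES = (
    "AmazonEC2ContainerServiceforEC2Role",
    "AWSBatchServiceRole",
    "AmazonECSTaskExecutionRolePolicy",
)


def runtime_policy_arns(account_id: str, custom_policy_name: str = "iai-runtime-policy") -> list[str]:
    """Build the ARNs of the custom runtime policy and the three AWS managed policies."""
    if not re.fullmatch(r"[0-9]{12}", account_id):
        raise ValueError(f"AWS account IDs have 12 digits: {account_id!r}")
    custom = f"arn:aws:iam::{account_id}:policy/{custom_policy_name}"
    # Assumption: these managed policies are published under the service-role/ path.
    managed = [f"arn:aws:iam::aws:policy/service-role/{name}" for name in AWS_MANAGED_POLICIES]
    return [custom, *managed]


def missing_policies(iam_client, role_name: str, expected_arns: list[str]) -> list[str]:
    """Return the expected policy ARNs that are not attached to the role."""
    attached = iam_client.list_attached_role_policies(RoleName=role_name)["AttachedPolicies"]
    attached_arns = {p["PolicyArn"] for p in attached}
    return [arn for arn in expected_arns if arn not in attached_arns]


# Example call (requires boto3 and AWS credentials; the account ID is a placeholder):
#   import boto3
#   print(missing_policies(boto3.client("iam"), "iai-taskrunner-runtime",
#                          runtime_policy_arns("123456789012")))
```

An empty list means every expected policy is attached to the role.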

Role and policy configuration is now complete.


Azure configuration for task runners

Before you get started using integrate.ai in your Microsoft Azure cloud for training sessions, there are a few configuration steps that must be completed:

You must grant integrate.ai permission to deploy task runner infrastructure in your cloud, by creating the following:

A dedicated resource group
A limited permission provisioner service principal, used to provision a task runner.
A limited permission runtime service principal to execute Azure tasks using the task runner.

This is a one-time process - once created, you can use this infrastructure for any task runners in your environment.

This section walks you through the required configuration.

Create a resource group.
Create a custom provisioner role.
Create a provisioner service principal.
Create a custom runtime role.
Create a runtime service principal and role.

Create Azure Resource Group

You must grant integrate.ai access to deploy task runner infrastructure in your cloud, by creating a dedicated resource group. You must provide the resource group name as part of the task runner creation process.

In order to provide all the necessary permissions, the user who creates the resource group and provisioner service principal must be an Azure AD Administrator.

Log in to your Azure portal and create a dedicated resource group for use with integrate.ai. For more information about resource groups, see the Microsoft documentation.

Create Custom Provisioner Role for Azure

# Azure Provisioner Role 

{
    "properties": {
      "roleName": "iai-provisoner",
      "description": "Permission policies to provision a task runner",
      "assignableScopes": [
        "/"
      ],
      "permissions": [
        {
          "actions": [
            "Microsoft.Resources/subscriptions/resourceGroups/read",
            "Microsoft.Storage/storageAccounts/read",
            "Microsoft.Storage/storageAccounts/write",
            "Microsoft.Storage/storageAccounts/delete",
            "Microsoft.Storage/storageAccounts/fileServices/read",
            "Microsoft.Storage/storageAccounts/fileServices/shares/read",
            "Microsoft.Storage/storageAccounts/blobServices/containers/write",
            "Microsoft.Storage/storageAccounts/blobServices/containers/read",
            "Microsoft.Storage/storageAccounts/blobServices/containers/delete",
            "Microsoft.Storage/storageAccounts/blobServices/write",
            "Microsoft.Storage/storageAccounts/blobServices/read",
            "Microsoft.Storage/storageAccounts/listKeys/action",
            "Microsoft.ClassicStorage/storageAccounts/listKeys/action",
            "Microsoft.OperationalInsights/workspaces/read",
            "Microsoft.OperationalInsights/workspaces/write",
            "Microsoft.OperationalInsights/workspaces/delete",
            "Microsoft.OperationalInsights/workspaces/sharedKeys/read",
            "Microsoft.OperationalInsights/workspaces/sharedKeys/action",
            "Microsoft.OperationalInsights/workspaces/tables/write",
            "Microsoft.OperationalInsights/workspaces/tables/read",
            "Microsoft.OperationalInsights/workspaces/tables/delete",
            "Microsoft.OperationalInsights/workspaces/tables/query/read",
            "Microsoft.Insights/dataCollectionRules/read",
            "Microsoft.Insights/dataCollectionRules/write",
            "Microsoft.Insights/DataCollectionRules/Delete",
            "Microsoft.Insights/dataCollectionEndpoints/read",
            "Microsoft.Insights/dataCollectionEndpoints/write",
            "Microsoft.Insights/dataCollectionEndpoints/delete"
          ],
          "notActions": [
          ],
          "dataActions": [
            "Microsoft.Storage/storageAccounts/fileServices/fileshares/files/read",
            "Microsoft.Storage/storageAccounts/fileServices/fileshares/files/write",
            "Microsoft.Storage/storageAccounts/fileServices/fileshares/files/delete",
            "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/delete",
            "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read",
            "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/write",
            "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/move/action",
            "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/add/action",
            "Microsoft.Storage/storageAccounts/fileServices/readFileBackupSemantics/action",
            "Microsoft.OperationalInsights/workspaces/tables/data/read"

          ],
          "notDataActions": []
        }
      ]
    }
  }

This role lists all of the required permissions for integrate.ai to create the necessary infrastructure.

Use the provided JSON example to create a custom provisioner role in your resource group.

In the Azure portal, select the Resource Groups service and select the resource group created in Step 1.
Select the Access Control (IAM) section and click Add Custom Role from the drop down.
Select the JSON tab. Copy and paste the permissions code block shown in the code panel. Update the code block to use the correct resource group in assignableScopes:

"assignableScopes": ["/subscriptions/<SUBSCRIPTION ID>/resourceGroups/<RESOURCE GROUP NAME>"]

Click Review + Create.
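
The assignableScopes update in the steps above can also be applied to the JSON before it is pasted into the portal. This is a minimal sketch under stated assumptions: the function name is hypothetical, the role dictionary is a trimmed stand-in for the full role JSON, and the subscription ID and resource group name are placeholders.

```python
import json


def scope_role_to_resource_group(role: dict, subscription_id: str, resource_group: str) -> dict:
    """Return a copy of a role definition scoped to a single resource group."""
    scoped = json.loads(json.dumps(role))  # deep copy via a JSON round trip
    scoped["properties"]["assignableScopes"] = [
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
    ]
    return scoped


# Trimmed stand-in for the role JSON; the IDs below are placeholders.
role = {"properties": {"roleName": "example-role", "assignableScopes": ["/"], "permissions": []}}
scoped = scope_role_to_resource_group(role, "00000000-0000-0000-0000-000000000000", "my-iai-rg")
print(json.dumps(scoped["properties"]["assignableScopes"]))
```

The original dictionary is left untouched, so the same template can be reused for another resource group.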

Create Provisioner Service Principal

# Example CLI output:

Creating custom role assignment under scope '/subscriptions/<your subscription ID>/resourcegroups/<your resource group>'
The output includes credentials that you must protect. Be sure that you do not include these credentials in your code or check the credentials into your source control. For more information, see https://aka.ms/azadsp-cli
{
  "appId": "<client ID>",
  "displayName": "azure-cli-2023-04-13-14-57-09",
  "password": "<secret>",
  "tenant": "<tenant ID>"
}

If it is not already available, install the Azure CLI. For more information, see the Azure CLI documentation.
At the command prompt, type:

az ad sp create-for-rbac --name taskrunner-provisioner-sp --role "iai-provisioner-role" --scopes="/subscriptions/<your subscription ID>/resourcegroups/<your resource group>"

Make sure that you specify the correct subscription ID and resource group name, and that the --role value matches the roleName of the custom provisioner role you created.
Copy the output and save it. This information is required to connect a new task runner.

Note: The output includes credentials that you must protect. Be sure that you do not include these credentials in your code or check the credentials into your source control. For more information, see Microsoft documentation.
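
The saved CLI output maps onto the fields requested later when an Azure task runner is registered. The sketch below shows one way to read that output, assuming it has the shape shown in the example CLI output (optional message lines followed by a JSON object). The function names are hypothetical and the sample values are placeholders.

```python
import json


def parse_sp_output(cli_output: str) -> dict:
    """Parse the JSON object printed by `az ad sp create-for-rbac`.

    The CLI can print informational lines first, so parsing starts at the
    first line that begins with "{".
    """
    lines = cli_output.splitlines()
    start = next(i for i, line in enumerate(lines) if line.lstrip().startswith("{"))
    return json.loads("\n".join(lines[start:]))


def to_registration_fields(sp: dict) -> dict:
    """Map CLI output keys to the task runner registration form labels."""
    return {
        "Service principal ID": sp["appId"],
        "Service principal secret": sp["password"],
        "Tenant ID": sp["tenant"],
    }


# Placeholder values only; never hard-code real credentials.
sample = """Creating custom role assignment under scope '/subscriptions/<id>/resourcegroups/<rg>'
{
  "appId": "<client ID>",
  "displayName": "azure-cli-2023-04-13-14-57-09",
  "password": "<secret>",
  "tenant": "<tenant ID>"
}"""
fields = to_registration_fields(parse_sp_output(sample))
print(sorted(fields))
```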

Assign the Reader role to the provisioner service principal. The user who creates the resource group and provisioner service principal must be an Azure AD Administrator.

The provisioner creates the following components and performs the required related tasks:

Configures infrastructure
Pulls the required client and server container images from an integrate.ai ECR (Elastic Container Registry).

Create Custom Runtime Role for Azure

{
  "properties": {
    "roleName": "iai-runtime",
    "description": "Permission policies to run a task under the task runner",
    "assignableScopes": [
      "/"
    ],
    "permissions": [
      {
        "actions": [
          "Microsoft.Network/virtualNetworks/subnets/join/action",
          "Microsoft.ContainerInstance/register/action",
          "Microsoft.ContainerInstance/containerGroups/read",
          "Microsoft.ContainerInstance/containerGroups/write",
          "Microsoft.ContainerInstance/containerGroups/delete",
          "Microsoft.ContainerInstance/containerGroups/containers/exec/action",
          "Microsoft.ContainerInstance/containerGroups/containers/attach/action",
          "Microsoft.ContainerInstance/containerGroups/containers/buildlogs/read",
          "Microsoft.ContainerInstance/containerGroups/containers/logs/read",
          "Microsoft.ContainerInstance/containerGroups/detectors/read",
          "Microsoft.ContainerInstance/containerGroups/outboundNetworkDependenciesEndpoints/read",
          "Microsoft.ContainerInstance/containerGroups/providers/Microsoft.Insights/diagnosticSettings/read",
          "Microsoft.ContainerInstance/containerGroups/providers/Microsoft.Insights/diagnosticSettings/write",
          "Microsoft.ContainerInstance/containerGroups/providers/Microsoft.Insights/metricDefinitions/read",
          "Microsoft.ContainerInstance/containerGroups/operationResults/read",
          "Microsoft.ContainerInstance/containerGroupProfiles/read",
          "Microsoft.ContainerInstance/containerGroupProfiles/write",
          "Microsoft.ContainerInstance/containerGroupProfiles/delete",
          "Microsoft.ContainerInstance/containerGroupProfiles/revisions/read",
          "Microsoft.ContainerInstance/containerGroupProfiles/revisions/deregister/action",
          "Microsoft.ContainerInstance/containerScaleSets/read",
          "Microsoft.ContainerInstance/containerScaleSets/write",
          "Microsoft.ContainerInstance/containerScaleSets/delete",
          "Microsoft.ContainerInstance/operations/read",
          "Microsoft.ContainerInstance/serviceassociationlinks/delete",
          "Microsoft.ContainerInstance/locations/validateDeleteVirtualNetworkOrSubnets/action",
          "Microsoft.ContainerInstance/locations/deleteVirtualNetworkOrSubnets/action",
          "Microsoft.ContainerInstance/locations/*/read",
          "Microsoft.ContainerRegistry/locations/operationResults/read",
          "Microsoft.ContainerRegistry/registries/read",
          "Microsoft.ContainerRegistry/registries/write",
          "microsoft.OperationalInsights/locations/operationStatuses/read",
          "Microsoft.OperationalInsights/workspaces/tables/write",
          "Microsoft.OperationalInsights/workspaces/tables/read",
          "Microsoft.OperationalInsights/workspaces/tables/delete",
          "Microsoft.OperationalInsights/workspaces/write",
          "Microsoft.OperationalInsights/workspaces/read",
          "Microsoft.OperationalInsights/workspaces/delete",
          "Microsoft.OperationalInsights/workspaces/sharedkeys/action",
          "Microsoft.OperationalInsights/workspaces/listKeys/action",
          "Microsoft.OperationalInsights/workspaces/regenerateSharedKey/action",
          "Microsoft.OperationalInsights/workspaces/search/action",
          "Microsoft.OperationalInsights/workspaces/purge/action",
          "Microsoft.OperationalInsights/workspaces/analytics/query/action",
          "Microsoft.OperationalInsights/workspaces/analytics/query/schema/read",
          "Microsoft.OperationalInsights/workspaces/api/query/action",
          "Microsoft.OperationalInsights/workspaces/api/query/schema/read",
          "Microsoft.OperationalInsights/workspaces/availableservicetiers/read",
          "Microsoft.OperationalInsights/workspaces/features/clientGroups/members/read",
          "Microsoft.OperationalInsights/workspaces/configurationscopes/read",
          "Microsoft.OperationalInsights/workspaces/configurationscopes/write",
          "Microsoft.OperationalInsights/workspaces/configurationscopes/delete",
          "Microsoft.OperationalInsights/workspaces/features/generateMap/read",
          "Microsoft.OperationalInsights/workspaces/intelligencepacks/read",
          "Microsoft.OperationalInsights/workspaces/intelligencepacks/enable/action",
          "Microsoft.OperationalInsights/workspaces/intelligencepacks/disable/action",
          "Microsoft.OperationalInsights/workspaces/linkedservices/read",
          "Microsoft.OperationalInsights/workspaces/linkedservices/write",
          "Microsoft.OperationalInsights/workspaces/linkedservices/delete",
          "Microsoft.OperationalInsights/workspaces/features/machineGroups/read",
          "microsoft.operationalinsights/workspaces/features/machineGroups/read",
          "Microsoft.OperationalInsights/workspaces/managementgroups/read",
          "Microsoft.OperationalInsights/workspaces/metricDefinitions/read",
          "Microsoft.OperationalInsights/workspaces/notificationsettings/read",
          "Microsoft.OperationalInsights/workspaces/notificationsettings/write",
          "Microsoft.OperationalInsights/workspaces/notificationsettings/delete",
          "Microsoft.OperationalInsights/workspaces/restoreLogs/write",
          "microsoft.operationalinsights/workspaces/restoreLogs/write",
          "Microsoft.OperationalInsights/workspaces/rules/read",
          "microsoft.operationalinsights/workspaces/rules/read",
          "Microsoft.OperationalInsights/workspaces/savedSearches/read",
          "Microsoft.OperationalInsights/workspaces/savedSearches/write",
          "Microsoft.OperationalInsights/workspaces/savedSearches/delete",
          "Microsoft.OperationalInsights/workspaces/scopedprivatelinkproxies/read",
          "Microsoft.OperationalInsights/workspaces/scopedprivatelinkproxies/write",
          "Microsoft.OperationalInsights/workspaces/scopedprivatelinkproxies/delete",
          "Microsoft.OperationalInsights/workspaces/searchJobs/write",
          "Microsoft.OperationalInsights/workspaces/schema/read",
          "Microsoft.OperationalInsights/workspaces/features/serverGroups/members/read",
          "Microsoft.OperationalInsights/workspaces/tables/query/read",
          "Microsoft.OperationalInsights/workspaces/providers/Microsoft.Insights/logDefinitions/read",
          "Microsoft.OperationalInsights/workspaces/usages/read",
          "Microsoft.OperationalInsights/workspaces/views/read",
          "Microsoft.OperationalInsights/workspaces/views/delete",
          "Microsoft.OperationalInsights/workspaces/views/write",
          "Microsoft.OperationalInsights/workspaces/listKeys/read",
          "Microsoft.OperationalInsights/workspaces/operations/read",
          "Microsoft.OperationalInsights/workspaces/upgradetranslationfailures/read",
          "Microsoft.OperationalInsights/workspaces/search/read",
          "Microsoft.OperationalInsights/workspaces/providers/Microsoft.Insights/diagnosticSettings/Read",
          "Microsoft.OperationalInsights/workspaces/providers/Microsoft.Insights/diagnosticSettings/Write",
          "Microsoft.OperationalInsights/workspaces/query/read",
          "Microsoft.Storage/storageAccounts/blobServices/containers/delete",
          "Microsoft.Storage/storageAccounts/blobServices/containers/read",
          "Microsoft.Storage/storageAccounts/blobServices/containers/write",
          "Microsoft.Storage/storageAccounts/blobServices/generateUserDelegationKey/action"
        ],
        "notActions": [
          "Microsoft.Authorization/*/Delete",
          "Microsoft.Authorization/*/Write",
          "Microsoft.Authorization/elevateAccess/Action"
        ],
        "dataActions": [
          "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/delete",
          "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read",
          "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/write",
          "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/deleteBlobVersion/action",
          "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/permanentDelete/action",
          "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/filter/action",
          "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/manageOwnership/action",
          "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/modifyPermissions/action",
          "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/runAsSuperUser/action",
          "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/immutableStorage/runAsSuperUser/action",
          "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/tags/read",
          "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/tags/write",
          "Microsoft.Storage/storageAccounts/fileServices/readFileBackupSemantics/action",
          "Microsoft.Storage/storageAccounts/fileServices/writeFileBackupSemantics/action",
          "Microsoft.Storage/storageAccounts/fileServices/takeOwnership/action",
          "Microsoft.Storage/storageAccounts/fileServices/fileshares/files/read",
          "Microsoft.Storage/storageAccounts/fileServices/fileshares/files/write",
          "Microsoft.Storage/storageAccounts/fileServices/fileshares/files/delete",
          "Microsoft.Storage/storageAccounts/queueServices/queues/messages/read",
          "Microsoft.Storage/storageAccounts/queueServices/queues/messages/write",
          "Microsoft.Storage/storageAccounts/queueServices/queues/messages/delete",
          "Microsoft.Storage/storageAccounts/tableServices/tables/entities/read",
          "Microsoft.Storage/storageAccounts/tableServices/tables/entities/write",
          "Microsoft.Storage/storageAccounts/tableServices/tables/entities/delete"
        ],
        "notDataActions": []
      }
    ]
  }
}

You must grant integrate.ai access to run the task runner in your cloud, by creating a limited permission role for the runtime.

To create the runtime role:

In the Azure portal, select the Resource Groups service and select the resource group created in Step 1.
Select the Access Control (IAM) section and click Add Custom Role from the drop down.
Select the JSON tab. Copy and paste the permissions code block shown in the code panel. Update the code block to use the correct resource group in assignableScopes:

"assignableScopes": ["/subscriptions/<SUBSCRIPTION ID>/resourceGroups/<RESOURCE GROUP NAME>"]

Once completed, click Review + Create.
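
The runtime role JSON lists a few actions more than once with different capitalization (for example, the workspaces/rules/read entry). The sketch below flags such entries so the list is easier to review before you paste it. It is illustrative only: the helper name is hypothetical, and it assumes that action strings differing only in case refer to the same operation.

```python
from collections import defaultdict


def find_case_duplicates(actions: list[str]) -> dict[str, list[str]]:
    """Group action strings that are identical apart from letter case."""
    groups = defaultdict(list)
    for action in actions:
        groups[action.lower()].append(action)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


# A few entries copied from the "actions" list of the runtime role.
actions = [
    "Microsoft.OperationalInsights/workspaces/rules/read",
    "microsoft.operationalinsights/workspaces/rules/read",
    "Microsoft.OperationalInsights/workspaces/schema/read",
]
for key, variants in find_case_duplicates(actions).items():
    print(key, len(variants))
```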

Create Runtime Service Principal

In the Azure portal, select the App Registration service and click New Registration.
Create an app with a name that follows your naming policies, for example: iai-taskrunner-runtime-app.
Once the app is registered, generate a secret for it under the Certificates & Secrets tab.

You will need the application ID and secret when creating a task runner.

Role Assignments for Runtime Service Principal

In the Azure portal, select the Resource Groups service and click on the Resource Group created in Step 1.
Select the Access Control (IAM) section and click Add Role Assignment from the drop down.
Assign the following Job function Roles to the Runtime Service Principal iai-taskrunner-runtime-app:
- Monitoring Metrics Publisher
- Reader
- Storage Blob Data Contributor
- iai-taskrunner-runtime (Custom Role created above)

Bind to an existing Azure Virtual Network

Optional configuration: You can bind to an existing virtual network by creating a network and providing the subnet_id in the Advanced settings section of the Azure task runner registration.

Create a Virtual Network in your resource group.
In the network properties click "Subnets" and add a subnet. The subnet delegation should be Microsoft.ContainerInstance/containerGroups.
Record the subnet ID and provide it to the user who will create the task runner.
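
A subnet ID is a full Azure resource ID, not just the subnet name. The following sketch shows the expected shape so the recorded value can be checked before it is handed over. The helper names are hypothetical and every identifier in the example is a placeholder.

```python
import re

# Resource ID layout for a subnet inside a virtual network.
_SUBNET_ID = re.compile(
    r"^/subscriptions/[^/]+/resourceGroups/[^/]+"
    r"/providers/Microsoft\.Network/virtualNetworks/[^/]+/subnets/[^/]+$",
    re.IGNORECASE,
)


def build_subnet_id(subscription_id: str, resource_group: str, vnet: str, subnet: str) -> str:
    """Assemble the resource ID of a subnet from its parts."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Network/virtualNetworks/{vnet}/subnets/{subnet}"
    )


def is_subnet_id(value: str) -> bool:
    """Return True if value has the shape of a subnet resource ID."""
    return _SUBNET_ID.match(value.strip()) is not None


subnet_id = build_subnet_id(
    "00000000-0000-0000-0000-000000000000", "my-iai-rg", "iai-vnet", "iai-subnet"
)
print(subnet_id)
print(is_subnet_id("iai-subnet"))
```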

Note: You can bind to a network in a resource group different from the one used to create a task runner.

To do so:

The network resource group must be listed in the assignableScopes section of the runtime role.
The task runner service principal must be assigned the runtime role on the virtual network in the resource group.

This completes the Azure resource configuration for task runners.

On-prem Task Runner (Server)

On-premises task runners consist of two parts:

A task runner registered as an On-premise Server through the Task Runner page in the integrate.ai workspace.
An on-premise service that is installed in a compatible environment, such as a virtual machine (VM) or bare metal machine.

You install the on-prem task runner agent through the integrate.ai command line tool (IAI CLI). In addition to the agent, a dedicated on-prem ECS cluster is created in the integrate.ai infrastructure to maintain the status of the on-prem agents, tasks, and logs.

Prerequisites for an on-prem task runner:

Any VM or compatible environment that allows root user access
Root user access - required to install the agent

Register an on-prem task runner

# Example configuration for notebooks:
container_path = "/data"
train_path1 = f"{container_path}/train_silo0.parquet"
test_path1 = f"{container_path}/test.parquet"

train_path2_aws = f"{container_path}/train_silo1.parquet"
test_path2_aws = f"{container_path}/test.parquet"

active_train_path = f"{container_path}/active_train.parquet"
active_test_path = f"{container_path}/active_test.parquet"
passive_train_path_aws = f"{container_path}/passive_train.parquet"
passive_test_path_aws = f"{container_path}/passive_test.parquet"

# storage path for models
aws_storage_path = f'{container_path}/model'

vfl_predict_active_storage_path = f'{container_path}/vfl_predict/active_predictions.csv'
vfl_predict_passive_storage_path_aws = f'{container_path}/vfl_predict/passive_predictions.csv'

Task runners simplify the process of running training sessions on your data.

Step 1: To register an on-prem task runner:

Log in to your integrate.ai workspace.
In the left navigation bar, click Settings.
Under Workspace, click Task Runners.
Click Register to start registering a new task runner.
Select Server under the On-premises section.
Follow the instructions provided.
- Task runner name - must be unique
- Storage path - enter the default storage path location on the VM. For example: /data. Note that this must be a path location and not simply a folder name.
Click Register. Wait for the installation to complete.

Step 2: Install the task runner agent:

Agent installation must be done as the root user, so run sudo su before you begin. Then create a Python virtual environment (venv) on the VM: python3 -m venv /home/{installation dir}
Install the IAI CLI tool: pip install integrate-ai

Step 3: Register the VM instance as an agent for the on-prem task runner created in Step 1:

Register the on-prem node with the task runner using the following command: iai onprem_node install
When prompted, provide your IAI_TOKEN.
When prompted, provide the name of the task runner you created in Step 1.
Wait for registration to complete.

Step 4: Create and run a session using the on-prem task runner:

Modify any dataset paths to match the storage location path from Step 1.
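
One way to keep notebook paths aligned with the registered storage path is to build them from a single value. This is a minimal sketch, assuming the storage path registered in Step 1 is /data and that it is the same location the notebook refers to as container_path. The helper name is hypothetical.

```python
import posixpath


def dataset_path(storage_path: str, filename: str) -> str:
    """Join a dataset file name onto the task runner storage path.

    The storage path must be an absolute path such as /data, not a bare
    folder name such as data.
    """
    if not posixpath.isabs(storage_path):
        raise ValueError(f"storage path must be absolute, got {storage_path!r}")
    return posixpath.join(storage_path, filename)


container_path = "/data"  # assumed to match the storage path from Step 1
train_path1 = dataset_path(container_path, "train_silo0.parquet")
test_path1 = dataset_path(container_path, "test.parquet")
print(train_path1)
print(test_path1)
```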

Removing an on-prem task runner

To remove a VM task runner agent instance, run the following command: iai onprem_node uninstall

USING INTEGRATE.AI

Workspace Administrator Workflow

Workspace Administrators have full control over the entire workspace, from adding and removing users and assigning roles, to controlling administrative and billing information. There must always be at least one user with this role to manage the workspace.

Invite users to the workspace.
Register an AWS task runner or Azure task runner for the data custodians and model builders to use to register datasets and perform model training.
If the enterprise IT landscape requires ingress and egress exceptions for firewalls, provide the IP addresses from the task runners to the IT Administrator so that the necessary ports can be opened for the white-listed IPs. See Whitelisting task runner IPs for details.

Register an AWS Task Runner

Task runners simplify the process of running training sessions on your data.

Note: before attempting to register a task runner, ensure you have completed the AWS configuration for task runners.

To register an AWS task runner:

Log in to your integrate.ai workspace.
In the left navigation bar, click Settings.
Under Workspace, click Task Runners.
Click Register to start registering a new task runner.
Select the service provider - Amazon Web Services.
Provide the following information:
- Task runner name - must be unique
- Provisioner role ARN - the ARN created by the IT Administrator.
- Runtime role ARN - the ARN created by the IT Administrator.
- Region - select the AWS region to run in from the dropdown
- Storage Path - by default the task runner creates a bucket for you to upload data into (e.g. s3://{aws_taskrunner_profile}-{aws_taskrunner_name}.integrate.ai). Only the default S3 bucket and other buckets ending in *integrate.ai are supported. If you are not using the default bucket created by the task runner when it was provisioned, ensure that your data is hosted in an S3 bucket with a URL ending in *integrate.ai. Otherwise, the data will not be accessible to the task runner.
- vCPU - the number of virtual CPUs the task runner is allowed to use. The default is 8.
- Memory size (GB) - the amount of memory to allocate to the task runners. The default is 32GB.
Click Save. Wait for the status to change to Online.
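
The storage path rules above can be expressed as a short check. This is a sketch under stated assumptions: the helper names are hypothetical, the default bucket name follows the s3://{aws_taskrunner_profile}-{aws_taskrunner_name}.integrate.ai pattern quoted above, and "ending in *integrate.ai" is read literally as a suffix test on the bucket name.

```python
from urllib.parse import urlparse


def default_bucket_uri(profile: str, task_runner_name: str) -> str:
    """Build the default bucket URI from the pattern given in the Storage Path field."""
    return f"s3://{profile}-{task_runner_name}.integrate.ai"


def is_supported_bucket(uri: str) -> bool:
    """Return True for s3:// URIs whose bucket name ends in integrate.ai."""
    parsed = urlparse(uri)
    return parsed.scheme == "s3" and parsed.netloc.endswith("integrate.ai")


print(default_bucket_uri("mine", "slim"))
print(is_supported_bucket("s3://mine-slim.integrate.ai/example_data/train_silo0.parquet"))
print(is_supported_bucket("s3://my-private-bucket/train_silo0.parquet"))
```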

Optional Advanced settings for AWS task runners

There are several options in the Advanced settings section of the form that enable you to have fine-grain control over the task runner.

Container registry URL - Provide the URL to the S3 bucket containing the integrate.ai client image.
The format is: s3://<bucket URL>/<image name>.
Use an existing VPC - Provide the following information for your existing VPC configuration.
- Existing VPC ID
- Existing VPC public subnets
- Existing VPC private subnets
- Existing client security group
- Existing server security group
Create a new VPC in a different CIDR block - Provide the following information to create a new VPC in a specified CIDR block.
- Custom VPC CIDR
- Custom private subnet CIDR
- Custom public subnet CIDR
- Custom CIDR newbits
Use existing KMS keys - Provide the following information to use your own KMS keys instead of those generated by integrate.ai for the task runner.
- KMS data ID
- KMS secret ID
Use golden AMI - Provide the AMI ID for the golden AMI.
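
The relationship between a VPC CIDR and a newbits value can be previewed with the Python standard library. This sketch assumes that newbits means the number of extra prefix bits added to the VPC CIDR to form each subnet, which is the usual meaning of the term. How the task runner actually allocates subnets is not documented here, and the function name and CIDR values are illustrative.

```python
import ipaddress


def split_vpc_cidr(vpc_cidr: str, newbits: int, count: int) -> list[str]:
    """Return the first `count` subnets made by extending the prefix by `newbits`."""
    network = ipaddress.ip_network(vpc_cidr)
    subnets = network.subnets(prefixlen_diff=newbits)
    return [str(next(subnets)) for _ in range(count)]


# A /16 VPC with newbits=4 yields /20 subnets.
print(split_vpc_cidr("10.0.0.0/16", 4, 2))
```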

After successfully creating a task runner, you can use it to perform training tasks. You can reuse the task runner; there is no need to create a new one for each task.

Register a dataset (AWS)

Log in to your integrate.ai workspace.
Click *Library* in the left navigation bar.
On the *Datasets* tab, click *Register dataset*.
Select a task runner to manage tasks related to your dataset. Note: If no task runners exist, ask your Workspace Administrator or a Model Builder to create one.
Click *Next*.
On the *Dataset details and privacy controls* page, type a name and description for the dataset.
Specify the URI of the dataset, using the `s3://` format. Ensure that the prepared Parquet or CSV file(s) is located in the S3 bucket that your Task Runner has access to.
(Optional) If you have metadata to associate with the dataset, upload it in the *Attachments* section.
Click *Connect*.

Your dataset is now registered and can be used in a notebook.

Use an AWS task runner and a registered dataset in a notebook

# eda_session and iai_tb_aws are assumed to be defined earlier in the notebook
eda_task_group_context = (
    SessionTaskGroup(eda_session)
    # client 1 uses a registered dataset
    .add_task(iai_tb_aws.eda(dataset_name="silo0_aws"))
    # client 2 uses a non-registered dataset, so the path is also required
    .add_task(iai_tb_aws.eda(dataset_name="dataset_two", dataset_path=passive_train_path))
    .start()
)

After you have registered your dataset with a task runner, you only need to specify the dataset name in the task to be done.

In this example code for creating a task group, each task starts a client. The first task starts a client that uses a registered dataset. The second task starts a client with a non-registered dataset; in this case, you must specify both the dataset name and path.

Sample AWS notebook and data

The integrateai_taskrunner_AWS.ipynb notebook provides examples of how to use an AWS task runner for different federated learning tasks.

You must install the integrate.ai SDK locally in order to run the notebook examples. For instructions, see Installing the SDK. Note that you do not need to install the integrate.ai Client or Server, as the task runner manages the client and server operations for you in your cloud environment.

Download the notebooks here, and the sample data here.

AWS Task Runner Test Script

This tool is intended to sanity test an existing task runner. Locate the script in the SDK folder:

integrate_ai_sdk/src/integrate_ai_sdk/scripts/iai_test_taskrunner.py

Task runner test script Usage

$ python -m venv iai-test-venv
$ source iai-test-venv/bin/activate
$ export IAI_TOKEN=<your token>
$ pip install integrate-ai
$ iai sdk install
$ python iai_test_taskrunner.py <the taskrunner name to test>

The code example on the right shows the commands to run the steps described below.

Copy data to the task runner bucket. For example: `$ AWS_PROFILE=<your AWS profile> aws s3 sync s3://public.s3.integrate.ai/integrate_ai_examples/synthetic/ s3://<task runner bucket>/example_data/`
Create a token in your integrate.ai workspace. Set the value to the `IAI_TOKEN` environment variable.
Create a virtual environment and install the `integrate-ai` CLI tool and SDK. Note: Python 3.9 or 3.10 is required.
Specify the name of the task runner to test and run the test script.
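
The two environment requirements in these steps (a supported Python version and the IAI_TOKEN variable) can be checked before running the script. The following is a minimal sketch; the function is hypothetical and not part of the integrate.ai CLI or SDK.

```python
import os
import sys

# Python versions named in the steps above.
SUPPORTED_VERSIONS = {(3, 9), (3, 10)}


def check_prerequisites(version=None, environ=None) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    version = sys.version_info[:2] if version is None else tuple(version[:2])
    environ = os.environ if environ is None else environ
    problems = []
    if version not in SUPPORTED_VERSIONS:
        problems.append(f"Python {version[0]}.{version[1]} found; 3.9 or 3.10 is required")
    if not environ.get("IAI_TOKEN"):
        problems.append("IAI_TOKEN is not set")
    return problems


for problem in check_prerequisites():
    print(problem)
```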

Example output for a workspace "mine" with task runner "slim"

Expecting data sets:
s3://mine-slim.integrate.ai/example_data/train_silo0.parquet
s3://mine-slim.integrate.ai/example_data/train_silo1.parquet
If these datasets do not exist yet, copy them from the examples:
$ AWS_PROFILE=<profile to access bucket for taskrunnner slim> aws s3 sync s3://public.s3.integrate.ai/integrate_ai_examples/synthetic/ s3://mine-slim.integrate.ai/example_data/

Initializing EDA session
session_id=$9d85ca4369
Waiting for completion
Check https://login-mine.integrateai.net/sessions/9d85ca4369/details for details

Session finished with status: Completed

The example on the right shows the output of a successful task runner test using the testing script.
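
If the test script is run from automation, the session ID and final status can be pulled out of its output. This sketch assumes the output keeps the format shown in the example (a session_id= line and a "Session finished with status:" line); the parsing function is hypothetical.

```python
import re


def parse_test_output(output: str) -> dict:
    """Extract the session ID and final status from the test script output."""
    session = re.search(r"session_id=\$?(\w+)", output)
    status = re.search(r"Session finished with status:\s*(\w+)", output)
    return {
        "session_id": session.group(1) if session else None,
        "status": status.group(1) if status else None,
    }


# Lines copied from the example output.
sample_output = """Initializing EDA session
session_id=$9d85ca4369
Waiting for completion

Session finished with status: Completed
"""
result = parse_test_output(sample_output)
print(result["session_id"], result["status"])
```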

Register an Azure Task Runner

Task runners simplify the process of running training sessions on your data.

Note: before attempting to register a task runner, ensure you have completed the Azure configuration for task runners.

To register an Azure task runner:

Log in to your integrate.ai workspace.
In the left navigation bar, click Settings.
Under Workspace, click Task Runners.
Click Register to start registering a new task runner.
Select the service provider - Microsoft Azure.
Provide the following information:
- Task runner name - must be unique
- Region - select from the list
- Resource group - must be an existing dedicated resource group
- Service principal ID - this is the appId from the Azure CLI output of creating a service principal.
- Service principal secret - this is the password from the Azure CLI output.
- Runtime Service principal ID - this is the application ID of the App Registration created.
- Runtime Service principal secret - this is the secret generated for the application.
- Subscription ID - the ID of your Microsoft Azure subscription. Can be found on the Azure dashboard.
- Tenant ID - this is the tenant value from the Azure CLI output of creating a service principal.
- vCPU - the number of virtual CPUs the task runner is allowed to use. The default is 4.
- Memory size (MB) - the amount of memory to allocate to the task runners. The default is 8192MB.
- Storage Account - Displays the default storage account name: <taskrunnername>storage.
- Advanced settings - Container registry URL - (Optional) Provide the URL to the Azure container registry containing the integrate.ai container images. The format is: <container registry>.azurecr.io/<image name>. For example: iairepo.azurecr.io/edge/fl-client. More information about container registries is available here.
- Advanced settings - Subnet ID - (Optional) Provide the subnet ID for your Azure Virtual Network.
Click Save. Wait for the status to change to Online.
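
Because the default storage account name is derived from the task runner name, the name you choose affects whether that default is a valid Azure storage account name (3 to 24 characters, lowercase letters and digits only). The sketch below previews the result. It assumes the name is used exactly as typed, following the <taskrunnername>storage pattern; whether the platform normalizes the name is not stated here. The helper names are hypothetical.

```python
import re

# Azure storage account names: 3-24 characters, lowercase letters and digits only.
_VALID_ACCOUNT = re.compile(r"[a-z0-9]{3,24}")


def default_storage_account(task_runner_name: str) -> str:
    """Apply the <taskrunnername>storage pattern to a task runner name."""
    return f"{task_runner_name}storage"


def is_valid_storage_account(name: str) -> bool:
    """Check a name against the Azure storage account naming rules."""
    return _VALID_ACCOUNT.fullmatch(name) is not None


for runner in ("slim", "my-runner"):
    account = default_storage_account(runner)
    print(account, is_valid_storage_account(account))
```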

After successfully creating a task runner, you can use it to perform training tasks. You can reuse the task runner; there is no need to create a new one for each task.

Register a dataset (Azure)

Log in to your integrate.ai workspace.
Click *Library* in the left navigation bar.
On the *Datasets* tab, click *Register dataset*.
Select a task runner to manage tasks related to your dataset. Note: If no task runners exist, ask your Workspace Administrator or a Model Builder to create one.
Click *Next*.
On the *Dataset details and privacy controls* page, type a name and description for the dataset.
Specify the URI of the dataset, using the `azure://` format. Ensure that the prepared Parquet or CSV file(s) is located in the Azure Blob that your Task Runner has access to.
(Optional) If you have metadata to associate with the dataset, upload it in the *Attachments* section.
Click *Connect*.
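
A dataset URI can be checked against the stated rules before registration: the s3:// format for AWS task runners, the azure:// format for Azure task runners, and a Parquet or CSV file. This is a minimal sketch that assumes the URI points at a single file. The function name and the example URIs are hypothetical, and the check does not verify that the task runner can actually reach the file.

```python
import posixpath
from urllib.parse import urlparse

# URI scheme expected for each provider, as stated in the registration steps.
SCHEMES = {"aws": "s3", "azure": "azure"}
EXTENSIONS = {".parquet", ".csv"}


def check_dataset_uri(uri: str, provider: str) -> list[str]:
    """Return a list of problems with a dataset URI; empty means it looks fine."""
    parsed = urlparse(uri)
    problems = []
    if parsed.scheme != SCHEMES[provider]:
        problems.append(f"expected a {SCHEMES[provider]}:// URI")
    if posixpath.splitext(parsed.path)[1].lower() not in EXTENSIONS:
        problems.append("expected a .parquet or .csv file")
    return problems


print(check_dataset_uri("azure://example-container/train_silo0.parquet", "azure"))
print(check_dataset_uri("s3://mine-slim.integrate.ai/example_data/train_silo0.txt", "azure"))
```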

Your dataset is now registered and can be used in a notebook.

Back to Quickstart

Use an Azure task runner and registered dataset in a notebook

eda_task_group_context = (
    SessionTaskGroup(eda_session)
    # Client 1 uses a registered dataset
    .add_task(iai_tb_azure.eda(dataset_name="silo0_az"))
    # Client 2 uses a non-registered dataset, so the path is also required
    .add_task(iai_tb_azure.eda(dataset_name="dataset_two", dataset_path=passive_train_path))
    .start()
)

After you have registered your dataset with a task runner, you only need to specify the dataset name in the task to be done.

In this example code for creating a task group, each task starts a client. The first task is starting a client that uses a registered dataset. The second task is starting a client with a non-registered dataset; in this case, you must specify both dataset name and path.

Sample Azure notebook and data

The integrateai_taskrunner_Azure.ipynb notebook provides examples of how to use an Azure task runner for different federated learning tasks.

Download the notebooks here, and the sample data here.

Data Requirements

To run a session, the data must be prepared according to the following standards:

The data configuration dictated in the Session configuration
The data requirements given below for running a modelling pipeline with integrate.ai

integrate.ai supports

Private Record Linkage (PRL)
- Discover the overlapping population between your data and your partner's data without exposing individual records
Exploratory Data Analysis (EDA)
- Generate summary statistics on a dataset to learn about the properties and distributions of the dataset being evaluated
Horizontal federated learning (HFL)
- Train models across different siloed datasets that hold different samples over the same set of features, without transferring data between each silo
Vertical federated learning (VFL)
- Train models across different siloed datasets that hold different features belonging to the same set of samples, without transferring data between each silo

Data format requirements

Data should be in Apache Parquet format (preferred) or .csv, and can be hosted locally, in an S3 bucket, or in an Azure blob. Note: .csv is only recommended for data files less than 1GB.
Proper row groups must be used when exporting the data to Parquet files. Depending on the tool used to write the Parquet file, specify either row_group_size as the number of records per group, or block_size as the physical storage size per group (recommended value is 100MB).

a. The recommended value for row_group_size is 500k records. Large datasets are processed in moderate-sized partitions (row groups) by default. Each partition should be large enough to contain sufficient information about the whole dataset, but small enough to be processed efficiently with regard to memory and computation.
The matching columns, that is, the columns used by the PRL matching algorithm to link records between datasets, must be specified. For information about handling duplicated values in these matching columns, see the section Configure PRL session.
Column names must be consistent across datasets. All column names (predictors and targets) must contain only letters, numbers, dashes (-), and underscores (_), and must start with a letter. You can select which columns you want to use in a specific training session.
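The column-name rule above can be checked before a dataset is registered. The following is a minimal sketch, not part of the integrate.ai SDK: the helper names are hypothetical, and it assumes that "letters" means ASCII letters.

```python
import re

# Letters, digits, dash, underscore; the first character must be a letter.
# Assumption: only ASCII letters are allowed.
VALID_COLUMN = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def invalid_column_names(columns):
    """Return the column names that break the naming rule."""
    return [name for name in columns if not VALID_COLUMN.fullmatch(name)]


def missing_columns(required, dataset_columns):
    """Return the required columns that a dataset does not provide."""
    return sorted(set(required) - set(dataset_columns))


print(invalid_column_names(["age", "marital_status", "2nd_claim", "net income"]))
# prints ['2nd_claim', 'net income']
```

Running both checks on every participating dataset helps confirm that the columns named in a session exist everywhere with the same spelling.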

Feature engineering requirements

To be used in predictive modelling sessions in the platform (for example, GLMs, GBMs, and NNs), data must be fully feature engineered. Note that registered datasets must always include the original columns with their original names, as well as the feature-engineered columns.

The requirements for the processed columns that will be used as input for modelling pipelines are as follows

All processed columns must be numerical, with the exception of ID columns used for PRL.
Processed columns must not contain NULL values; missing values must be imputed.
Each row must correspond to one observation.
Processed columns that are continuous variables should be standardized to have mean = 0 and std = 1. Standardization is highly recommended for GLM and NN, but is not required for GBM.
For categorical variables, features must be encoded (e.g., by one-hot-encoding). For example, if there is a column marital_status, and the values are married and divorced, there should be three columns in the dataset: the original column marital_status, marital_status_married, and marital_status_divorced.
Feature engineering must be consistent across the data products. For example, if the datasets contain categorical values, such as country, these values must be encoded the same way across all the datasets. For the country example, this means that the same country value translates to the same numerical values across all datasets that will participate in the training.
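Two of the requirements above, standardization and consistent categorical encoding, can be sketched in plain Python. This is an illustration only: the function names are hypothetical, the population standard deviation is an assumed choice, and in practice you would likely use your usual data-preparation library.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Scale a continuous column to mean 0 and standard deviation 1."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (assumed choice)
    if std == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / std for v in values]


def one_hot(column, values, categories):
    """Encode a categorical column against a fixed, shared category list."""
    unknown = set(values) - set(categories)
    if unknown:
        raise ValueError(f"unexpected categories: {sorted(unknown)}")
    return {
        f"{column}_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }


# The same category list must be used for every participating dataset.
encoded = one_hot(
    "marital_status",
    ["married", "divorced", "married"],
    ["married", "divorced"],
)
print(encoded)
```

Passing the category list explicitly, rather than inferring it from each dataset, is what keeps the encoding identical across silos.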

Sample Weighting

By default, each sample has an equal weight when computing the loss and other subsequent quantities (such as the gradients). To assign different weights to samples, you can assign prior weights.

Conceptually, prior weights allow information about the known credibility of each observation to be incorporated in the model. For example, if modeling claims frequency, one observation might relate to one month's exposure, and another to one year's exposure. There is more information and less variability in the observation relating to the longer exposure period, and this can be incorporated in the model by defining the prior weight to be the exposure of each observation. In this way observations with higher exposure are deemed to have lower variance, and the model will consequently be more influenced by these observations.

Sample weight usage

# HFL sample weight example

data_config = {
    "predictors": ...,
    "target": ...,
    "sample_weight": "exposure"
}

# VFL sample weight example

data_config = {
    "active": {
        "predictors": ...,
        "target": ...,
        "sample_weight": "exposure"
    },
    "passive": {
        ...
    }
}

Sample weights are required to be strictly positive. No other preprocessing is necessary. In both HFL and VFL, the weights will be automatically normalised to sum to 1 within each batch, to ensure consistent scaling of the loss across batches.
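The effect of per-batch normalization can be illustrated with a short sketch. It is not the platform's implementation: the function names are hypothetical, and it only shows the arithmetic of rescaling strictly positive weights so that they sum to 1 within one batch.

```python
def normalize_batch_weights(weights):
    """Rescale one batch of sample weights so that they sum to 1."""
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("sample weights must be strictly positive")
    total = sum(weights)
    return [w / total for w in weights]


def weighted_loss(per_sample_losses, weights):
    """Weighted average of per-sample losses for a single batch."""
    normalized = normalize_batch_weights(weights)
    return sum(loss * w for loss, w in zip(per_sample_losses, normalized))


# One month's exposure and one year's exposure, both expressed in years.
exposure = [1 / 12, 1.0]
print(normalize_batch_weights(exposure))
```

Because only the relative sizes matter after normalization, the weights column can be supplied in any positive unit.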

For VFL, the sample weights can only be provided by the active party because they are the ones computing the loss and other byproducts (e.g., gradients).

To use sample weights, you must add the parameter sample_weight to the data configuration.

In the code examples provided, the sample weights column is named exposure.

Back to Quickstart

Installing the SDK

Pre-requisites

To run the integrate.ai SDK samples and build models, ensure that your environment is configured properly.

Required software:

Python 3.9
Pip 22.2 (or later)

Installing the SDK

Generate an access token

To install the SDK and other components locally, you must generate an access token through the workspace UI.

Log in to your workspace.
On the getting started page, click Generate.
Copy the access token and save it to a secure location.

Treat your API tokens like passwords and keep them secret. When working with the API, use the token as an environment variable instead of hardcoding it into your programs. In this documentation, the token is referenced as <IAI_TOKEN>.
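One way to follow this advice in Python is to read the token from the environment at run time. The sketch below assumes the variable is named IAI_TOKEN, which mirrors the placeholder used in this documentation but is not mandated by the platform; the helper name is hypothetical.

```python
import os


def get_iai_token(env=None):
    """Read the access token from the IAI_TOKEN environment variable."""
    env = os.environ if env is None else env
    token = env.get("IAI_TOKEN", "").strip()
    if not token:
        raise RuntimeError("IAI_TOKEN is not set; export it before running")
    return token
```

Failing early with a clear message avoids sending an empty token to the API and keeps the secret out of notebooks and source control.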

Install integrate.ai packages

At a command prompt on your machine, run the following command to install the CLI management tool:

pip install integrate-ai

Install the SDK package using the access token you generated.

iai sdk install --token <IAI_TOKEN>

Updating your SDK

Typically, each time integrate.ai releases a software update, you must update your local SDK installation. To check the SDK version, run iai sdk version.

You can compare the version against the feature release version in the Release Notes to ensure that your environment is up to date.

The command to update the SDK is the same as you used to install it initially.

iai sdk install --token <IAI_TOKEN>

For Developers

For those who are interested in integrating the system with their own product, or writing software that makes use of integrate.ai capabilities, there are several options, including a full-featured SDK and a RESTful API. Refer to the API Documentation for details of the functions and operations available through the SDK.

For local development, you can install the integrate.ai client and server using the same access token you used to install the SDK. The client and server are Docker container images. Alternatively, you can use task runners and only run the SDK locally.

iai client pull --token <IAI_TOKEN>
iai server pull --token <IAI_TOKEN>

Updating your local environment

When a new version of an integrate.ai package is released, follow these steps to update your local development environment.

Note: If you are using task runners, you only need to update your SDK.

Check version numbers

From a command prompt or terminal, use the pip show integrate-ai command to see the version of the integrate-ai command line interface (CLI).

To check the SDK version, run iai sdk version.

You can compare the version against the feature release version in the Release Notes to ensure that your environment is up to date.

Get the latest packages

Run the following command to update the CLI and all dependent packages: pip install -U integrate-ai

Update the SDK: iai sdk install --token <IAI_TOKEN>

Update the client: iai client pull --token <IAI_TOKEN>

Update the server: iai server pull --token <IAI_TOKEN>

Setting up SSO

Okta SSO for integrate.ai

This section describes the process for configuring an Okta Application to enable single-sign-on for the integrate.ai platform.

Create the Okta application and invite users.
[Optional] Create groups.
Add policy rules for the app.
Send the information to integrate.ai.
Create an event handler.

Step 1: Create the Okta Application

Log into Okta Admin Console.
In the left navigation bar, click on the Applications drop down and select Applications.
Click Create App Integration and select OIDC - OpenID Connect.
Select Web Application and click Next.
Provide the following details:
- App Integration Name - integrate.ai
- Grant type - In the section "Client acting on behalf of user", select Authorization Code and Refresh Token.
- Sign-in redirect URIs - Provide the following URI (replace <workspace> with your workspace name): https://iai-api-<workspace>-prod.integrateai.net/login/callback
- Sign-out redirect URIs - Provide the following URI (replace <workspace> with your workspace name): https://iai-api-<workspace>-prod.integrateai.net/logout/callback
- Assignments - Select "Skip group assignment for now". This information depends on the requirements of your organization and can be updated later.
Note: For any values not specified, leave the default setting as is.
Click Save.
If you want to support Login initiated by "Either Okta or App", set the following values:
- Login flow - Redirect to app to initiate login (OIDC Compliant)
- Initiate login URI - https://login-<workspace>.integrateai.net/login (This is the URL that takes you to your integrate.ai workspace)
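To avoid typing errors when filling in these fields, the three workspace-specific URIs can be generated from the workspace name. This is a small convenience sketch; the function name is hypothetical and the URL patterns are copied from the steps above.

```python
def okta_uris(workspace):
    """Build the workspace-specific URIs used in the Okta application."""
    if not workspace or not workspace.strip():
        raise ValueError("workspace name is required")
    workspace = workspace.strip()
    api_base = f"https://iai-api-{workspace}-prod.integrateai.net"
    return {
        "sign_in_redirect": f"{api_base}/login/callback",
        "sign_out_redirect": f"{api_base}/logout/callback",
        "initiate_login": f"https://login-{workspace}.integrateai.net/login",
    }


for field, uri in okta_uris("abc").items():
    print(f"{field}: {uri}")
```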

Invite Users to the App

In the left navigation bar, select Directory > People > Add person.
The following fields are required:
- First name
- Last name
- Username (email)
Optional - add the user to a group [Step 2]. By default, users are assigned to the Data Custodian Role in the integrate.ai workspace.
Click Save or Save and Add Another.
Select the user and click Assign Application.
Select the application created above.
Click Save.

Step 2: Create Groups [Optional]

In your integrate.ai workspace, users are assigned Roles that govern their permissions and actions within the workspace. Create groups in the Okta app that map your users to the correct roles in the integrate.ai workspace.

Note: This step is optional. It facilitates user management, but users can also be assigned roles within the workspace directly by an administrator.

Create Groups

In the left navigation bar, select Directory -> Click Groups -> Click Add Group.
Add the following groups
- Administrator - can see and perform all actions in the workspace and can manage users.
- Model Builder - can build and train models, add datasets, and perform other machine learning tasks, but cannot manage users.
- Data Custodian - can manage datasets and run sessions, but cannot manage users.

Add users to Groups

Under Directory, select People.
Select the appropriate users and add them into the groups created above.

Send Group information to integrate.ai

In the left navigation bar, select Applications.
Select the app you created in Step 1 -> Click the Sign On tab.
In the OpenID Connect ID Token section, click Edit.
Set the Groups claim type to Filter.
Create a Groups claim filter that contains the three groups created above. This filter information is sent to integrate.ai automatically via the Okta API.
Click Save.

Step 3: Add Policy Rules

Policy rules enable you to control logout time and give integrate.ai permission to get basic user information.

In the left navigation bar, select Security -> Click API.
Select the default Authorization server.
Click Access Policies -> Add New Access Policy -> Add rule.
Update any information to define the policy.
Specify the following scopes: openid, profile, email, and groups.
Click Create Rule.
Update the policy and assign it to your client application.
Click Update Policy.

Step 4: Send information to integrate.ai

The application is now configured for use with integrate.ai. Provide the following information to your integrate.ai Customer Success Engineer:

Client ID
Client secret
Metadata_URL (This follows the format of https://<YOUR OKTA DOMAIN>/.well-known/openid-configuration)

Step 5: Create an Event Handler

The event handler enables the integrate.ai platform to deactivate API tokens for users that are deactivated in Okta. The event handler requires an integrate.ai Okta token, which will be provided to you by your Customer Success Engineer.

In the left navigation bar, click Workflow -> Select Event Hooks.
Click Create Event Hook.
Provide the following details:

Field	Value
Name	Update user state
URL	`https://iai-api-<workspace>-prod.integrateai.net/user/update-okta-user`
Authentication field	Authorization
Authentication secret	Bearer `<IAI_TOKEN>`
Subscribe to events	User activated, User suspended, User deactivated, User reactivated, User unsuspended, User deleted

Click Save & Continue.
Click Verify to validate that the event hook is active.

You should now be able to log in to your integrate.ai workspace using Okta SSO.

HFL MODEL TRAINING

Horizontal Federated Learning (HFL) is used in scenarios where datasets share the same feature space (same type of columns) but differ in samples (different rows).

integrate.ai supports the following HFL model types:

FFNet - Feed forward neural network
Linear inference
Gradient Boosted Models (GBM)

Vertical Federated Learning (VFL) is also supported. Click here to learn more.

FFNet Model Training with a Sample Local Dataset (iai_ffnet)

To help you get started with federated learning in the integrate.ai system, we've provided a tutorial based on synthetic data with pre-built configuration files that you can run using a task runner. In this example, you will train a federated feedforward neural network (iai_ffn) using data from two datasets. The datasets, model, and data configuration are provided for you.

The sample notebook (integrateai_taskrunner_AWS.ipynb) contains runnable code snippets for exploring the SDK and should be used in parallel with this example. This documentation provides supplementary and conceptual information to expand on the code demonstration.

Tip: make sure you have installed the SDK and set up a task runner.

Open the integrateai_taskrunner_AWS.ipynb notebook to test the code as you walk through this exercise.

Understanding Models

model_config = {
    "experiment_name": "test_synthetic_tabular",
    "experiment_description": "test_synthetic_tabular",
    "strategy": {
        "name": "FedAvg",  # Name of the federated learning strategy
        "params": {},
    },
    "model": {  # Parameters specific to the model type
        "params": {
            "input_size": 15,
            "hidden_layer_sizes": [6, 6, 6],
            "output_size": 2,
        },
    },
    "balance_train_datasets": False,  # When True, performs undersampling on the dataset
    "ml_task": {  # Specifies the machine learning task
        "type": "classification",
        "params": {
            "loss_weights": None,
        },
    },
    "optimizer": {
        "name": "SGD",  # Name of the PyTorch optimizer used
        "params": {
            "learning_rate": 0.2,
            "momentum": 0.0,
        },
    },
    "differential_privacy_params": {  # Defines the differential privacy parameters
        "epsilon": 4,
        "max_grad_norm": 7,
    },
    "save_best_model": {
        "metric": "loss",  # To disable this and save the model from the last round, set to None
        "mode": "min",
    },
}

integrate.ai has several standard model classes available, including:

Feedforward Neural Nets (iai_ffn) - uses the same activation for each hidden layer.
Generalized Linear Models (iai_glm) - uses a linear feedforward layer.
Gradient Boosted Models (iai_gbm) - uses the sklearn implementation of HistGradientBoostingModels.
Linear Inference Models (iai_linear_inference) - performs statistical inference on model coefficients for linear and logistic regression.
Principal Component Analysis (iai_pca) - performs multivariate linear transformation which calculates the principal components based on results from the principal component analysis.

Model configuration

These standard models are defined using JSON configuration files during session creation. The model configuration (model_config) is a JSON object that contains the model parameters for the session.

There are five main properties with specific key-value pairs used to configure the model:

strategy - Select one of the available federated learning strategies from the strategy library.
model - Defines the specific parameters required for the model type.
ml_task - Defines the machine learning task type (for example, classification) and associated parameters.
optimizer - Defines the parameters for the PyTorch optimizer.
differential_privacy_params - Defines the differential privacy parameters. See Differential Privacy for more information.
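A quick pre-flight check can confirm that a configuration contains the five properties listed above before a session is created. This is a hypothetical helper, not an SDK function. It uses the key names as spelled in the sample FFNet configuration, and other packages (for example, iai_gbm) may use a different subset.

```python
# Keys as they appear in the sample FFNet model_config.
MAIN_PROPERTIES = (
    "strategy",
    "model",
    "ml_task",
    "optimizer",
    "differential_privacy_params",
)


def missing_properties(model_config):
    """Return the main properties that are absent from model_config."""
    return [key for key in MAIN_PROPERTIES if key not in model_config]


partial_config = {"strategy": {"name": "FedAvg", "params": {}}}
print(missing_properties(partial_config))
# prints ['model', 'ml_task', 'optimizer', 'differential_privacy_params']
```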

The example in the notebook is a model provided by integrate.ai. For this tutorial, you do not need to change any of the values.

Data configuration

data_config = {
    "predictors": ["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11", "x12", "x13", "x14"],
    "target": "y",
}

The data configuration is a JSON object in which you specify the predictor and target columns that describe the input data. The structure is the same for both GLM and FFN.
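In the sample configurations, the model's input_size (15) equals the number of predictor columns. Assuming that relationship is required, which this documentation implies but does not state outright, a small hypothetical check can catch mismatches early:

```python
def check_input_size(model_config, data_config):
    """Raise if input_size differs from the number of predictor columns."""
    expected = model_config["model"]["params"]["input_size"]
    actual = len(data_config["predictors"])
    if expected != actual:
        raise ValueError(
            f"input_size is {expected} but {actual} predictors are listed"
        )
    return actual


sample_model_config = {"model": {"params": {"input_size": 15}}}
sample_data_config = {"predictors": [f"x{i}" for i in range(15)], "target": "y"}
print(check_input_size(sample_model_config, sample_data_config))  # prints 15
```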

Once you have created or updated the model and data configurations, the next step is to create a training session to begin working with the model and datasets.

Specify paths to datasets

aws_taskrunner_profile = "<workspace name>" # This is your workspace name. For example, if your login URL is `https://login-abc.integrateai.net/`, then your workspace name is `abc`.
aws_taskrunner_name = "<taskrunner>" # Task runner name - must match what was supplied in UI to create task runner. Example: `abcrunner`

#By default, an S3 bucket is created when you create a new task runner. This example assumes you have uploaded the files to the default bucket.
base_aws_bucket = f'{aws_taskrunner_profile}-{aws_taskrunner_name}.integrate.ai'

# Example datapaths. Make sure that the data you want to work with exists in the base_aws_bucket for your task runner.
# HFL datapaths
train_path1 = f's3://{base_aws_bucket}/synthetic/train_silo0.parquet'
test_path1 = f's3://{base_aws_bucket}/synthetic/test.parquet'
train_path2 = f's3://{base_aws_bucket}/synthetic/train_silo1.parquet'
test_path2 = f's3://{base_aws_bucket}/synthetic/test.parquet'

Create and start the training session

training_session = client.create_fl_session(
    name="Testing notebook",
    description="I am testing session creation through a notebook",
    min_num_clients=2,
    num_rounds=2,
    package_name="iai_ffnet",
    model_config=model_config,
    data_config=data_config,
).start()

training_session.id

Federated learning models created in integrate.ai are trained through sessions. You define the parameters required to train a federated model, including data and model configurations, in a session.

Create a new session each time you want to train a new model.

The code sample demonstrates creating and starting a session with two training datasets (specified as min_num_clients) and two rounds (num_rounds). It returns a session ID that you can use to track and reference your session.

The package_name specifies the federated learning model package. In the example, it is iai_ffnet; however, other packages are supported. See Model packages for more information.

Join clients to the session

task_group = (
    SessionTaskGroup(training_session)
    .add_task(iai_tb_aws.hfl(train_path=train_path1, test_path=test_path1, use_gpu=False))
    .add_task(iai_tb_aws.hfl(train_path=train_path2, test_path=test_path2, use_gpu=False))
)
task_group_context = task_group.start()

The next step is to join the session with the sample data. This example has data for two datasets simulating two clients, as specified with the min_num_clients argument. Therefore, to run this example, you add two client tasks to the taskbuilder.

The session begins training after the minimum number of clients have joined the session.

Poll for session results

import json

# Monitor the submitted tasks

for i in task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))

task_group_context.monitor_task_logs()

# Wait for the tasks to complete (success = True)

task_group_context.wait(60*5, 2)

Depending on the type of session and the size of the datasets, sessions may take some time to run. In the sample notebook and this tutorial, we poll the server to determine the session status.

You can log information about the session during this time. In this example, we are logging the current round and the clients that have joined the session.

Another popular option is to log the session.metrics().as_dict() to view the in-progress training metrics.
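If you prefer to control the polling yourself, a generic wait loop is enough. The sketch below is not the SDK's implementation. It assumes that the two arguments in wait(60*5, 2) are a timeout and a poll interval in seconds, and the is_done callback is a placeholder for whatever status check you use.

```python
import time


def wait_until(is_done, timeout_s=300, poll_interval_s=2,
               sleep=time.sleep, clock=time.monotonic):
    """Poll is_done() until it returns True or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


# Example with a stand-in status check that succeeds on the third poll.
polls = {"count": 0}


def finished():
    polls["count"] += 1
    return polls["count"] >= 3


print(wait_until(finished, timeout_s=5, poll_interval_s=0))  # prints True
```

Inside the loop you can also log intermediate information, such as the current round, before sleeping again.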

HFL FFNET Session Complete

Congratulations, you have your first federated model! You can test it by retrieving the metrics and making predictions.

To retrieve the session metrics:

training_session.metrics().as_dict()

To plot the session metrics:

fig = training_session.metrics().plot()

Back to HFL model types

Linear Inference Sessions

An overview and example of a linear inference model for performing tasks such as GWAS in HFL.

The built-in model package iai_linear_inference trains a bundle of linear models for the target of interest against a specified list of predictors. It obtains the coefficients and variance estimates, and also calculates the p-values from the corresponding hypothesis tests. Linear inference is particularly useful for genome-wide association studies (GWAS), to identify genomic variants that are statistically associated with a risk for a disease or a particular trait.

This is a horizontal federated learning (HFL) model package.

The integrateai_taskrunner_AWS.ipynb notebook, available in the integrate.ai sample repository, demonstrates this model package using the sample data that is available for download here.

Follow the instructions in the Installing the SDK section to prepare a local test environment for this tutorial.

Overview of the iai_linear_inference package

# Example model_config for a binary target

model_config_logit = {
    "strategy": {"name": "LogitRegInference", "params": {}},
    "seed": 23,  # for reproducibility
}

There are two strategies available in the package:

LogitRegInference - for use when the target of interest is binary
LinearRegInference - for use when the target is continuous

# Example data_config

data_config_logit = {
    "target": "y",
    "shared_predictors": ["x1", "x2"],
    "chunked_predictors": ["x0", "x3", "x10", "x11"]
}

The data_config dictionary should include the following three fields.

target: the target column of interest
shared_predictors: predictor columns that should be included in all linear models. For example, confounding factors such as age and gender in GWAS.
chunked_predictors: predictor columns that should be included in the linear models one at a time. For example, gene expressions in GWAS.

Note: The columns in all the fields can be specified as either names/strings or indices/integers.

With this example data configuration, the session trains four logistic regression models with y as the target, and x1, x2 plus any one of x0, x3, x10, x11 as predictors.
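The way the bundle is formed can be made explicit with a few lines of Python. This hypothetical helper only enumerates the predictor sets implied by the data configuration; it does not train anything.

```python
def predictor_sets(data_config):
    """List the predictors of each model in the bundle, one per chunk."""
    shared = list(data_config["shared_predictors"])
    return [shared + [chunk] for chunk in data_config["chunked_predictors"]]


example_config = {
    "target": "y",
    "shared_predictors": ["x1", "x2"],
    "chunked_predictors": ["x0", "x3", "x10", "x11"],
}

for predictors in predictor_sets(example_config):
    print(example_config["target"], "~", " + ".join(predictors))
```

Each printed line corresponds to one of the four logistic regression models in the example.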

Create a linear inference training session

train_path1 = f's3://{base_aws_bucket}/synthetic/train_silo0.parquet'
test_path1 = f's3://{base_aws_bucket}/synthetic/test.parquet'
train_path2 = f's3://{base_aws_bucket}/synthetic/train_silo1.parquet'
test_path2 = f's3://{base_aws_bucket}/synthetic/test.parquet'

Specify the file paths.

# Example training session

training_session_logit = client.create_fl_session(
    name="Testing linear inference session",
    description="I am testing linear inference session creation using a task runner through a notebook",
    min_num_clients=2,
    num_rounds=5,
    package_name="iai_linear_inference",
    model_config=model_config_logit,
    data_config=data_config_logit
).start()

training_session_logit.id

For this example, there are two (2) training clients and the model is trained over five (5) rounds.

Argument (type)	Description
name (str)	Name to set for the session
description (str)	Description to set for the session
min_num_clients (int)	Number of clients required to connect before the session can begin
num_rounds (int)	Number of rounds of federated model training to perform
package_name (str)	Name of the model package to be used in the session
model_config (dict)	Contains the model configuration to be used for the session
data_config (dict)	Contains the data configuration to be used for the session

Start the linear inference training session

#Create a task group
#If you are using a registered dataset, use the following pattern to add a client task:
#.add_task(iai_tb_aws.hfl(train_dataset_name="train_name", test_dataset_name="test_name"))

task_group_context = (
    SessionTaskGroup(training_session_logit)
    .add_task(iai_tb_aws.hfl(train_path=train_path1, test_path=test_path1))
    .add_task(iai_tb_aws.hfl(train_path=train_path2, test_path=test_path2)).start()
)

This example demonstrates starting a training session by specifying the train and test paths directly instead of using registered dataset names.

Poll for linear inference session status


for i in task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))
task_group_context.monitor_task_logs()

task_group_context.wait(60*5, 2)

Use a polling function, such as the example provided, to wait for the session to complete.

View training metrics and model details

# Example output

{'session_id': '3cdf4be992',
 'federated_metrics': [{'loss': 0.6931747794151306},
  {'loss': 0.6766608953475952},
  {'loss': 0.6766080856323242},
  {'loss': 0.6766077876091003},
  {'loss': 0.6766077876091003}],
 'client_metrics': [{'user@integrate.ai:dedbb7e9be2046e3a49b28b0131c4b97': {'test_loss': 0.6931748060977674,
    'test_accuracy': 0.4995,
    'test_roc_auc': 0.5,
    'test_num_examples': 4000},
   'user@integrate.ai:339d50e453f244ed9cb2662ab2d3bb66': {'test_loss': 0.6931748060977674,
    'test_accuracy': 0.4995,
    'test_roc_auc': 0.5,
    'test_num_examples': 4000}},
  {'user@integrate.ai:dedbb7e9be2046e3a49b28b0131c4b97': {'test_num_examples': 4000,
    'test_loss': 0.6766608866775886,
    'test_roc_auc': 0.5996664746664747,
    'test_accuracy': 0.57625},
   'user@integrate.ai:339d50e453f244ed9cb2662ab2d3bb66': {'test_num_examples': 4000,
    'test_loss': 0.6766608866775886,
    'test_accuracy': 0.57625,
    'test_roc_auc': 0.5996664746664747}},
  {'user@integrate.ai:339d50e453f244ed9cb2662ab2d3bb66': {'test_loss': 0.6766080602706078,
    'test_accuracy': 0.5761875,
    'test_num_examples': 4000,
...
   'user@integrate.ai:339d50e453f244ed9cb2662ab2d3bb66': {'test_accuracy': 0.5761875,
    'test_roc_auc': 0.5996632246632246,
    'test_num_examples': 4000,
    'test_loss': 0.6766078165060236}}],
 'latest_global_model_federated_loss': 0.6766077876091003}

Once the session is complete, you can view the training metrics and model details such as the model coefficients and p-values. In this example, because a bundle of models is trained, the metrics are the average values across all the models.

training_session_logit.metrics().as_dict()
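The returned dictionary can be inspected with ordinary Python. As a minimal sketch, the helpers below pull the per-round federated loss out of a dictionary shaped like the example output and check whether the last two rounds differ by less than a tolerance. The function names and the tolerance are illustrative assumptions, and the loss values are copied from the example output.

```python
def federated_losses(metrics):
    """Return the federated loss of each round, in order."""
    return [round_metrics["loss"] for round_metrics in metrics["federated_metrics"]]


def has_converged(losses, tol=1e-6):
    """True when the last two rounds differ by no more than tol."""
    return len(losses) >= 2 and abs(losses[-1] - losses[-2]) <= tol


example_metrics = {
    "session_id": "3cdf4be992",
    "federated_metrics": [
        {"loss": 0.6931747794151306},
        {"loss": 0.6766608953475952},
        {"loss": 0.6766080856323242},
        {"loss": 0.6766077876091003},
        {"loss": 0.6766077876091003},
    ],
}

losses = federated_losses(example_metrics)
print(len(losses), has_converged(losses))  # prints 5 True
```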

Plot the metrics

training_session_logit.metrics().plot()

Example output:

Retrieve the trained model

# Example of retrieving p-values

pv = model_logit.p_values()
pv

#Example p-value output:

x0      112.350396
x3      82.436540
x10     0.999893
x11     27.525280
dtype: float64

The LinearInferenceModel object can be retrieved using the model's as_pytorch method. The relevant information, such as p-values, can be accessed directly from the model object.

model_logit = training_session_logit.model().as_pytorch()

Retrieve the p-values

The .p_values() function returns the p-values of the chunked predictors.

Summary information

The .summary method fetches the coefficient, standard error, and p-value of the model corresponding to the specified predictor.

#Example of fetching summary

summary_x0 = model_logit.summary("x0")
summary_x0

Example summary output:

Making predictions from a linear inference session

from torch.utils.data import DataLoader
from integrate_ai_sdk.packages.LinearInference.dataset import ChunkedTabularDataset

ds = ChunkedTabularDataset(path=f"{data_path}/test.parquet", **data_config_logit)
dl = DataLoader(ds, batch_size=len(ds), shuffle=False)
x, _ = next(iter(dl))
y_pred = model_logit(x)
y_pred

You can also make predictions with the resulting bundle of models when the data is loaded by the ChunkedTabularDataset from the iai_linear_inference package. The predictions will be of shape (n_samples, n_chunked_predictors) where each column corresponds to one model from the bundle.

#Example prediction output:

tensor([[0.3801, 0.3548, 0.4598, 0.4809],
        [0.4787, 0.3761, 0.4392, 0.3579],
        [0.5151, 0.3497, 0.4837, 0.5054],
        ...,
        [0.7062, 0.6533, 0.6516, 0.6717],
        [0.3114, 0.3322, 0.4257, 0.4461],
        [0.4358, 0.4912, 0.4897, 0.5110]], dtype=torch.float64)


Back to HFL model types

Gradient Boosted Models (HFL-GBM)

Gradient boosting is a machine learning algorithm for building predictive models that helps minimize the bias error of the model. The gradient boosting model provided by integrate.ai is an HFL model that uses the sklearn implementation of HistGradientBoostingClassifier for classification tasks and HistGradientBoostingRegressor for regression tasks.

The GBM sample notebook (integrateai_hfl_gbm.ipynb) provides sample code for running the SDK, and should be used in parallel with this tutorial. You can download the sample notebook here.

Sample HFL GBM Model Configuration and Data Schema

# Model configuration

model_config = {
    "strategy": {
        "name": "HistGradientBoosting",
        "params": {}
    },
    "model": {
        "params": {
            "max_depth": 4, 
            "learning_rate": 0.05,
            "random_state": 23, 
            "max_bins": 128,
            "sketch_relative_accuracy": 0.001,
            "max_iter": 100,
        }
    },

    "ml_task": {"type": "classification", "params": {}}, 

    "save_best_model": {
        "metric": None,
        "mode": "min"
    },
}

# Data Schema

data_schema = {
    "predictors": ["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11", "x12", "x13", "x14"],
    "target": "y",
}

integrate.ai has a model class available for Gradient Boosted Models, called iai_gbm. This model is defined using a JSON configuration file during session creation.

The strategy for GBM is HistGradientBoosting.

You can adjust the following parameters as needed:

max_depth - Used to control the size of the trees.
learning_rate - (shrinkage) Used as a multiplicative factor for the leaf values. Set this to one (1) for no shrinkage.
max_bins - The number of bins used to bin the data. Using fewer bins acts as a form of regularization. It is generally recommended to use as many bins as possible.
sketch_relative_accuracy - Determines the precision of the sketch technique used to approximate global feature distributions, which are used to find the best split for tree nodes.
max_iter - The maximum number of iterations of the boosting process, i.e. the maximum number of trees for binary classification. For multiclass classification, n_classes trees per iteration are built. The default is 100.
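
To make the role of learning_rate concrete, the following sketch accumulates a raw boosted score from per-tree leaf values. It is an illustration only, not the integrate.ai or scikit-learn implementation; the function name boosted_raw_prediction and the sample leaf values are hypothetical.

```python
def boosted_raw_prediction(baseline, leaf_values, learning_rate):
    """Add each tree's leaf value, scaled by the learning rate, to a baseline score."""
    raw = baseline
    for value in leaf_values:
        # Shrinkage: every tree contributes only a fraction of its leaf value.
        raw += learning_rate * value
    return raw


# Hypothetical leaf values that one sample receives from three trees.
leaf_values = [0.8, 0.5, -0.2]

no_shrinkage = boosted_raw_prediction(0.0, leaf_values, learning_rate=1.0)
with_shrinkage = boosted_raw_prediction(0.0, leaf_values, learning_rate=0.05)
print(no_shrinkage, with_shrinkage)
```

With a small learning rate each tree moves the score only slightly, which is why lower values usually need a larger max_iter.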

Set the machine learning task type to either classification or regression.

Specify any parameters associated with the task type in the params section.

The save_best_model option allows you to set the metric and mode for model saving. By default, metric is set to None, which saves the model from the previous round, and mode is set to min.
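
As a rough illustration of how a metric and mode pair could drive model selection, the sketch below picks a round from a list of per-round metrics. This is not the SDK's selection logic. The helper select_round and the sample history are hypothetical, and the sketch assumes that a metric of None means "keep the most recent round".

```python
def select_round(history, metric=None, mode="min"):
    """Return the index of the round whose model would be kept."""
    if metric is None:
        # Assumption: with no metric configured, the latest round is kept.
        return len(history) - 1
    values = [round_metrics[metric] for round_metrics in history]
    best = min(values) if mode == "min" else max(values)
    return values.index(best)


# Hypothetical per-round evaluation metrics.
history = [
    {"loss": 0.69, "roc_auc": 0.66},
    {"loss": 0.67, "roc_auc": 0.70},
    {"loss": 0.68, "roc_auc": 0.69},
]

print(select_round(history))                     # no metric configured
print(select_round(history, "loss", "min"))      # lowest loss
print(select_round(history, "roc_auc", "max"))   # highest ROC AUC
```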

The notebook also provides a sample data schema. For the purposes of testing GBM, use the sample schema as shown.

Configure task runner and data paths

aws_taskrunner_profile = "<workspace>" # This is your workspace name
aws_taskrunner_name = "<taskrunner>" # Task runner name - must match what was supplied in UI to create task runner

base_aws_bucket = f'{aws_taskrunner_profile}-{aws_taskrunner_name}.integrate.ai'

# Example datapaths. Make sure that the data you want to work with exists in the base_aws_bucket for your task runner.
# HFL datapaths
train_path1 = f's3://{base_aws_bucket}/synthetic/train_silo0.parquet'
test_path1 = f's3://{base_aws_bucket}/synthetic/test.parquet'
train_path2 = f's3://{base_aws_bucket}/synthetic/train_silo1.parquet'
test_path2 = f's3://{base_aws_bucket}/synthetic/test.parquet'

#Where to store the trained model
aws_storage_path = f's3://{base_aws_bucket}/model'

base_aws_bucket #Prints the base_aws_bucket name for reference

In this example we are using an AWS task runner to run the training session. You must specify the workspace name and the task runner name.

Upload the sample data to your S3 bucket and update the HFL datapaths shown to match your upload location.

Create an HFL-GBM training session

# Create a session

gbm_session = client.create_fl_session(
    name="HFL session testing GBM",
    description="I am testing GBM session creation through a notebook",
    min_num_clients=2,
    num_rounds=10,
    package_name="iai_gbm",
    model_config=model_config,
    data_config=data_schema,
).start()

gbm_session.id

Federated learning models created in integrate.ai are trained through sessions. You define the parameters required to train a federated model, including data and model configurations, in a session.

Create a session each time you want to train a new model.

The code sample demonstrates creating and starting a session with two training clients (two datasets) and ten rounds (or trees). It returns a session ID that you can use to track and reference your session.

Join the training session

# Add clients to the training session

gbm_task_group_context = (
        SessionTaskGroup(gbm_session) \
        .add_task(iai_tb_aws.hfl(train_path=train_path1, test_path=test_path1))\
        .add_task(iai_tb_aws.hfl(train_path=train_path2, test_path=test_path2))\
        .start()
)

The next step is to join the session with the sample data. This example has data for two datasets simulating two clients, as specified with the min_num_clients argument. The session begins training once the minimum number of clients have joined the session.

gbm_session.id is the ID returned by the previous step
train_path is the path to and name of the sample dataset file
test_path is the path to and name of the sample test file
batch_size is the size of the batch of data
client_name is the unique name for each client

Check the task group status

for i in gbm_task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))

gbm_task_group_context.monitor_task_logs()

Sessions may take some time to run depending on the compute environment. In the sample notebook and this tutorial, we poll the server to determine the session status.

A session is complete when True is returned from the wait function:

gbm_task_group_context.wait(60*5, 2)
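
The polling pattern can be sketched in plain Python. The helper wait_until below is hypothetical and is not part of the SDK. It assumes that the two arguments passed to wait in the example are a timeout in seconds and a polling interval in seconds, which the text above does not confirm.

```python
import time


def wait_until(is_done, timeout_seconds, poll_interval_seconds, sleep=time.sleep):
    """Poll is_done() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            # Timed out before the condition became true.
            return False
        sleep(poll_interval_seconds)


# Hypothetical status source that reports completion on the third poll.
polls = iter([False, False, True])
finished = wait_until(lambda: next(polls), timeout_seconds=5, poll_interval_seconds=0.01)
print(finished)
```

Returning False on timeout, instead of raising, lets a notebook cell decide whether to poll again or inspect the task logs.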

View the HFL-GBM training metrics

After the session completes successfully, you can view the training metrics and start making predictions.

1. Retrieve the model metrics as_dict.
2. Plot the metrics.

Retrieve the model metrics

gbm_session.metrics().as_dict()

#Example metrics output:

{'session_id': '9fb054bc24',
 'federated_metrics': [{'loss': 0.6876291940808297},
  {'loss': 0.6825978879332543},
  {'loss': 0.6780059585869312},
  {'loss': 0.6737175708711147},
  {'loss': 0.6697578155398369},
  {'loss': 0.6658972384035587},
  {'loss': 0.6623568458259106},
  {'loss': 0.6589517279565335},
  {'loss': 0.6556690569519996},
  {'loss': 0.6526266353726387},
  {'loss': 0.6526266353726387}],
 'rounds': [{'user info': {'test_loss': 0.6817054933875072,
    'test_roc_auc': 0.6868823702288674,
    'test_accuracy': 0.7061688311688312,
    'test_num_examples': 8008},
   'user info': {'test_accuracy': 0.5720720720720721,
    'test_num_examples': 7992,
    'test_roc_auc': 0.6637941733389123,
    'test_loss': 0.6935647668830733}},
  {'user info': {'test_accuracy': 0.5754504504504504,
    'test_roc_auc': 0.6740578481919338,
    'test_num_examples': 7992,
    'test_loss': 0.6884922753070576},
   'user info': {'test_loss': 0.6767152831608197,
...
   'user info': {'test_loss': 0.6578156923815811,
    'test_num_examples': 7992,
    'test_roc_auc': 0.7210704078520924,
    'test_accuracy': 0.6552802802802803}}],
 'latest_global_model_federated_loss': 0.6526266353726387}
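
If you want to inspect the loss curve without plotting, the dictionary can be processed directly. The sketch below copies the first three federated_metrics entries from the example output above and assumes only the key names shown there.

```python
# First three rounds copied from the example metrics output.
metrics = {
    "session_id": "9fb054bc24",
    "federated_metrics": [
        {"loss": 0.6876291940808297},
        {"loss": 0.6825978879332543},
        {"loss": 0.6780059585869312},
    ],
}

# One loss value per round, in round order.
losses = [entry["loss"] for entry in metrics["federated_metrics"]]

# Improvement between consecutive rounds (positive means the loss went down).
improvements = [prev - curr for prev, curr in zip(losses, losses[1:])]
is_non_increasing = all(delta >= 0 for delta in improvements)

print(losses)
print(improvements)
print(is_non_increasing)
```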

Plot the metrics

fig = gbm_session.metrics().plot()

Example:

Trained model parameters are accessible from the completed session

Model parameters can be retrieved using the model's as_sklearn method.

model = gbm_session.model().as_sklearn()

model

Working with your existing data

Now that you have downloaded the trained model, you can use it to make predictions on your own data.

Below are example methods you could call on the model.

test_data = pd.read_parquet(f"{data_path}/test.parquet")

test_data.head()

Split the test data into features (X) and target (Y)

Y = test_data["y"]

X = test_data[["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11", "x12", "x13", "x14"]]

Run model predictions

model.predict(X)

Result: array([0, 1, 0, ..., 0, 0, 1])

from sklearn.metrics import roc_auc_score

y_hat = model.predict_proba(X)

roc_auc_score(Y, y_hat[:, 1])

Result: 0.7082738332738332
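
For intuition about what the ROC AUC value measures, the following sketch computes it as the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as half. It is a slow pairwise illustration rather than the scikit-learn implementation, and the function name pairwise_auc and the sample data are hypothetical.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the probability that a positive outranks a negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities for the positive class.
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Three of the four positive-negative pairs are ordered correctly here, so the sketch yields 0.75.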

Note: When the training sample sizes are small, this model is more likely to overfit the training data.

Back to HFL model types

VFL MODEL TRAINING

Vertical Federated Learning

integrate.ai supports the following model types:

Private Record Linkage (PRL) - also known as overlap
SplitNN
VFL Prediction
VFL Gradient Boosted Models (GBM)
VFL Generalized Linear Models (GLM)

Horizontal Federated Learning (HFL) is also supported. Click here to learn more.

Private Record Linkage (PRL) sessions

Private record linkage sessions create intersection and alignment among datasets to prepare them for vertical federated learning (VFL).

In a vertical federated learning process, two or more parties collaboratively train a model using datasets that share a set of overlapping features. These datasets generally each contain distinct data with some overlap. This overlap is used to define the intersection of the sets. Private record linkage (PRL) uses the intersection to create alignment between the sets so that a shared model can be trained.

Overlapping records are determined privately through a PRL session, which combines Private Set Intersection with Private Record Alignment.

For example, in data sharing between a hospital (party B, the Active party) and a medical imaging centre (party A, the Passive party), only a subset of the hospital patients will exist in the imaging centre's data. The hospital can run a PRL session to determine the target subset for model training.

PRL Session Overview

In PRL, two parties submit paths to their datasets so that they can be aligned to perform a machine learning task.

ID columns (id_columns) are used to produce a hash that is sent to the server for comparison. The secret for this hash is shared between the clients and the server has no knowledge of it. This comparison is the Private Set Intersection (PSI) part of PRL.
Once compared, the server orchestrates the data alignment because it knows which indices of each dataset are in common. This is the Private Record Alignment (PRA) part of PRL.
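
The two steps above can be illustrated with a keyed hash. The sketch below is a simplified stand-in and not the integrate.ai protocol: it assumes HMAC-SHA256 as the hash, unique IDs on both sides, and hypothetical names (keyed_hashes, align, the sample IDs and secret).

```python
import hashlib
import hmac


def keyed_hashes(ids, secret):
    """Client side: hash each ID with the shared secret; position i matches ids[i]."""
    return [
        hmac.new(secret, str(record_id).encode("utf-8"), hashlib.sha256).hexdigest()
        for record_id in ids
    ]


def align(hashes_a, hashes_b):
    """Server side: match digests and return aligned row indices for each party."""
    position_b = {digest: idx for idx, digest in enumerate(hashes_b)}
    pairs = [
        (idx_a, position_b[digest])
        for idx_a, digest in enumerate(hashes_a)
        if digest in position_b
    ]
    return [a for a, _ in pairs], [b for _, b in pairs]


secret = b"shared-between-clients-only"  # hypothetical; never sent to the server
active_ids = ["p1", "p2", "p3", "p4"]
passive_ids = ["p3", "p9", "p1"]

active_rows, passive_rows = align(
    keyed_hashes(active_ids, secret), keyed_hashes(passive_ids, secret)
)
print(active_rows, passive_rows)
```

The server in this sketch only ever sees digests and row positions, so it can say which rows line up without learning the underlying IDs.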

For information about privacy when performing PRL, see PRL Privacy for VFL.

Configure PRL Session

# Example data config

prl_data_config = {
    "clients": {
        "active_client": {
            "id_columns": ["id"],
            "is_id_unique": True,   # by default, IDs are assumed to be unique
        },
        "passive_client": {
            "id_columns": ["id"],
            "is_id_unique": False,  # set to False if there are duplicated IDs in the dataset
        },
    }
}

This example uses an AWS task runner to run the session. Ensure that you have installed the SDK locally, created a task runner, and registered a dataset. Use the integrateai_taskrunner_AWS.ipynb notebook to follow along and test the examples shown below by filling in your own variables.

To create the session, specify the data_config that contains the client names and columns to use as identifiers to link the datasets. For example: prl_data_config.

Important: The client names are referenced for the compute on the PRL session and for any sessions that use the PRL session downstream. For example, when running EDA in Intersect mode or for any VFL session, the client names must be the same as specified for the PRL session.

PRL data config

Specify a prl_data_config that indicates the columns to use as identifiers when linking the datasets to each other. The number of items in the config specifies the number of expected clients. In this example, there are two items and therefore two clients submitting data. Their datasets are linked by the "id" column in any provided datasets.

Working with data that contains duplicate IDs

To enable support for data with duplicated IDs:

Specify "is_id_unique" for each dataset (client) in the data config, as shown in the example.
By default, is_id_unique is set to True for all datasets.
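
Before writing the config, each party can check locally whether its ID columns contain duplicates. The sketch below is one way to do that with the standard library; the helper ids_are_unique and the sample rows are hypothetical, and only the config keys come from the example above.

```python
from collections import Counter


def ids_are_unique(rows, id_columns):
    """Return True when no combination of ID column values repeats."""
    counts = Counter(tuple(row[col] for col in id_columns) for row in rows)
    return all(n == 1 for n in counts.values())


# Hypothetical records; the passive dataset repeats id 2.
active_rows = [{"id": 1}, {"id": 2}, {"id": 3}]
passive_rows = [{"id": 2}, {"id": 2}, {"id": 5}]

prl_data_config = {
    "clients": {
        "active_client": {
            "id_columns": ["id"],
            "is_id_unique": ids_are_unique(active_rows, ["id"]),
        },
        "passive_client": {
            "id_columns": ["id"],
            "is_id_unique": ids_are_unique(passive_rows, ["id"]),
        },
    }
}
print(prl_data_config)
```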

Create the PRL session

# Create session example

prl_session = client.create_prl_session(
    name="Testing notebook - PRL",
    description="I am testing PRL session creation with a task runner through a notebook",
    data_config=prl_data_config
).start()

prl_session.id

Create a task builder and task group for PRL

task_group = (SessionTaskGroup(prl_session)\
    .add_task(iai_tb_aws.prl(train_dataset_name="active_train", test_dataset_name="active_test", client_name="active_client"))\
    .add_task(iai_tb_aws.prl(train_dataset_name="passive_train", test_dataset_name="passive_test", client_name="passive_client"))
)

task_group_context = task_group.start()

Create a task group with one task for each of the clients joining the session. In this example, the datasets are registered and referenced by their names, instead of by specifying the name and path.

Monitor submitted jobs

Next, you can check the status of the tasks.

for i in task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))

task_group_context.monitor_task_logs()

Submitted tasks are in the pending state until the clients join and the session is started. Once started, the status changes to running.

# Wait for the tasks to complete (success = True)

task_group_context.wait(60*5, 2)

View the overlap statistics

When the session is complete, you can view the overlap statistics for the datasets.

metrics = prl_session.metrics().as_dict()
metrics

Example result:

{'session_id': '07d0f8358d',
 'federated_metrics': [],
 'client_metrics': {'passive_client': {'train': {'n_records': 14400,
    'n_overlapped_records': 12963,
    'frac_overlapped': 0.9},
   'test': {'n_records': 3600,
    'n_overlapped_records': 3245,
    'frac_overlapped': 0.9}},
  'active_client': {'train': {'n_records': 14400,
    'n_overlapped_records': 12963,
    'frac_overlapped': 0.9},
   'test': {'n_records': 3600,
    'n_overlapped_records': 3245,
    'frac_overlapped': 0.9}}}}
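
The counts in the example are consistent with frac_overlapped being the overlapped record count divided by the total record count, reported to one decimal place. That reading is an inference from the numbers rather than a documented definition, and the helper overlap_fraction below is hypothetical.

```python
def overlap_fraction(n_overlapped_records, n_records):
    """Share of a dataset's records that were matched in the other dataset."""
    if n_records == 0:
        raise ValueError("n_records must be greater than zero")
    return n_overlapped_records / n_records


# Counts copied from the example result above.
train_frac = overlap_fraction(12963, 14400)
test_frac = overlap_fraction(3245, 3600)
print(round(train_frac, 1), round(test_frac, 1))
```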

To run a VFL training session on the linked datasets, see VFL SplitNN Model Training.

To perform exploratory data analysis on the intersection, see EDA Intersect.

Back to VFL Model Training

VFL SplitNN Model Training

In a vertical federated learning (VFL) process, two or more parties collaboratively train a model using datasets that share a set of overlapping features. Each party has partial information about the overlapped subjects in the dataset. Therefore, before running a VFL training session, a private record linkage (PRL) session is performed to find the intersection and create alignment between datasets.

There are two types of parties participating in the training:

The Active Party owns the labels, and may or may not also contribute data.
The Passive Party contributes only data.

VFL Session Overview

A hospital may have patient blood tests and outcome information on cancer, but imaging data is owned by an imaging centre. They want to collaboratively train a model for cancer diagnosis based on the imaging data and blood test data. The hospital (active party) would own the outcome and patient blood tests, and the imaging centre (passive party) would own the imaging data.

A simplified model of the process is shown below.

integrate.ai VFL Flow

The following diagram outlines the training flow in the integrate.ai implementation of VFL.

VFL Training Session Example

Use the integrateai_taskrunner_AWS.ipynb notebook to follow along and test the examples shown below by filling in your own variables as required. You can download sample notebooks here.

The notebook demonstrates a sequential workflow for the PRL session and the VFL train and predict sessions.

Complete the Environment Setup

model_config = {
    "strategy": {"name": "SplitNN", "params": {}},
    "model": {
        "feature_models": {
            "passive_client": {"params": {"input_size": 7, "hidden_layer_sizes": [6], "output_size": 5}},
            "active_client": {"params": {"input_size": 8, "hidden_layer_sizes": [6], "output_size": 5}},
        },
        "label_model": {"params": {"hidden_layer_sizes": [5], "output_size": 2}},
    },
    "ml_task": {
        "type": "classification",
        "params": {
            "loss_weights": None,
        },
    },
    "optimizer": {"name": "SGD", "params": {"learning_rate": 0.2, "momentum": 0.0}},
    "seed": 23,  # for reproducibility
}

data_config = {
        "passive_client": {
            "label_client": False,
            "predictors": ["x1", "x3", "x5", "x7", "x9", "x11", "x13"],
            "target": None,
        },
        "active_client": {
            "label_client": True,
            "predictors": ["x0", "x2", "x4", "x6", "x8", "x10", "x12", "x14"],
            "target": "y",
        },
    }

To create a VFL train session, specify the prl_session_id indicating the session you just ran to link the datasets together. The vfl_mode needs to be set to train.

Ensure that you have run a PRL session to obtain the aligned dataset. The PRL session ID is required for the VFL training session.
Create a model_config and a data_config for the VFL session.

Parameters:

strategy: Specify the name and parameters. For VFL, the strategy is SplitNN.
model: Specify the feature_models and label_model.
- feature_models refers to the part of the model that transforms the raw input features into intermediate encoded columns (usually hosted by both parties).
- label_model refers to the part of the model that connects the intermediate encoded columns to the target variable (usually hosted by the active party).
ml_task: Specify the type of machine learning task, and any associated parameters. Options are classification or regression.
optimizer: Specify any optimizer supported by PyTorch.
seed: Specify a number.
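
In the sample configuration, each feature model's input_size equals the number of predictors listed for that client in the data_config. Assuming that relationship is required (the text does not state it explicitly), a quick local check such as the hypothetical check_input_sizes below can catch mismatches before a session is created.

```python
# Relevant parts of the sample model_config and data_config.
model_config = {
    "model": {
        "feature_models": {
            "passive_client": {"params": {"input_size": 7, "hidden_layer_sizes": [6], "output_size": 5}},
            "active_client": {"params": {"input_size": 8, "hidden_layer_sizes": [6], "output_size": 5}},
        },
    },
}
data_config = {
    "passive_client": {"predictors": ["x1", "x3", "x5", "x7", "x9", "x11", "x13"]},
    "active_client": {"predictors": ["x0", "x2", "x4", "x6", "x8", "x10", "x12", "x14"]},
}


def check_input_sizes(model_config, data_config):
    """Return a list of mismatches between predictor counts and input_size."""
    problems = []
    feature_models = model_config["model"]["feature_models"]
    for client, spec in data_config.items():
        if client not in feature_models:
            problems.append(f"{client}: no feature model defined")
            continue
        expected = len(spec["predictors"])
        actual = feature_models[client]["params"]["input_size"]
        if expected != actual:
            problems.append(f"{client}: input_size {actual} != {expected} predictors")
    return problems


problems = check_input_sizes(model_config, data_config)
print(problems)
```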

Create and start a VFL training session

vfl_train_session = client.create_vfl_session(
    name="Testing notebook - VFL Train",
    description="I am testing VFL Train session creation with a task runner through a notebook",
    prl_session_id=prl_session.id,
    vfl_mode='train',
    min_num_clients=2,
    num_rounds=2,
    package_name="iai_ffnet",
    data_config=data_config,
    model_config=model_config
).start()

vfl_train_session.id

Specify the PRL session ID of a successful PRL session that used the same active and passive client names.

Ensure that the vfl_mode is set to train.

Parameters:

Set up the task builder and task group for VFL training

storage_path = f"{aws_storage_path}/vfl/{vfl_train_session.id}"

vfl_task_group_context = (SessionTaskGroup(vfl_train_session)\
    .add_task(iai_tb_aws.vfl_train(train_dataset_name="active_train", test_dataset_name="active_test", 
                                    batch_size=16384 * 32,
                                    eval_batch_size=5000000, 
                                    job_timeout_seconds=28800, 
                                    client_name="active_client", 
                                    storage_path = storage_path))\
    .add_task(iai_tb_aws.vfl_train(train_dataset_name="passive_train", test_dataset_name="passive_test", 
                                    batch_size=16384 * 32,
                                    eval_batch_size=5000000,
                                    job_timeout_seconds=28800,
                                    client_name="passive_client", 
                                    storage_path = storage_path))\
    .start())

Create a task in the task group for each client. The number of client tasks in the task group must match the number of clients specified in the data_config used to create the session.

The following parameters are required for each client task:

train_dataset_name OR train_path
test_dataset_name OR test_path
batch_size: specify a value for batching the train dataset during training. The default value of 1024 is meant for use only with small datasets (~100MB).
eval_batch_size: specify a value for batching the test dataset during training with large datasets.
job_timeout_seconds: specify a value that corresponds to the amount of time the training session is estimated to take. For large datasets, the default value of 7200 must be increased.
client_name: must match the client_name specified in the PRL session used to determine the overlap
storage_path: model storage location
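
To get a feel for how batch_size affects a training pass, the sketch below counts the batches needed to cover a dataset once. The record count is taken from the PRL example earlier in this document; the helper batches_per_pass is hypothetical and says nothing about how the SDK schedules batches internally.

```python
import math


def batches_per_pass(n_records, batch_size):
    """Number of batches needed to cover every record once."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(n_records / batch_size)


n_overlapped_train = 12963  # overlapped training records from the PRL example

small = batches_per_pass(n_overlapped_train, 1024)        # default batch size
large = batches_per_pass(n_overlapped_train, 16384 * 32)  # value used in the example above
print(small, large)
```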

If you are using a registered dataset, use the following example pattern to add a client task:

.add_task(iai_tb_aws.vfl_train(train_dataset_name="train_name", test_dataset_name="test_name", batch_size=1024, client_name="active_client", storage_path = storage_path))

Monitor submitted VFL training jobs

# Check the status of the tasks
for i in vfl_task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))

vfl_task_group_context.monitor_task_logs()

# Wait for the tasks to complete (success = True)
vfl_task_group_context.wait(60*5, 2)

Each task in the task group kicks off a job in AWS Batch. You can monitor the jobs through the notebook, as shown.

View the VFL training metrics

# Example results

{'session_id': '498beb7e6a',
 'federated_metrics': [{'loss': 0.6927943530912943},
  {'loss': 0.6925891094472265},
  {'loss': 0.6921983339753467},
  {'loss': 0.6920029462394067},
  {'loss': 0.6915351291650617}],
 'client_metrics': [{'user@integrate.ai:79704ac8c1a7416aa381288cbab16e6a': {'test_roc_auc': 0.5286237121001411,
    'test_num_examples': 3245,
    'test_loss': 0.6927943530912943,
    'test_accuracy': 0.5010785824345146}},
  {'user@integrate.ai:79704ac8c1a7416aa381288cbab16e6a': {'test_num_examples': 3245,
    'test_accuracy': 0.537442218798151,
    'test_roc_auc': 0.5730010669487545,
    'test_loss': 0.6925891094472265}},
  {'user@integrate.ai:79704ac8c1a7416aa381288cbab16e6a': {'test_accuracy': 0.550693374422188,
    'test_roc_auc': 0.6073282812853845,
    'test_loss': 0.6921983339753467,
    'test_num_examples': 3245}},
  {'user@integrate.ai:79704ac8c1a7416aa381288cbab16e6a': {'test_loss': 0.6920029462394067,
    'test_roc_auc': 0.6330078151716465,
    'test_accuracy': 0.5106317411402157,
    'test_num_examples': 3245}},
  {'user@integrate.ai:79704ac8c1a7416aa381288cbab16e6a': {'test_roc_auc': 0.6495852274713467,
    'test_loss': 0.6915351291650617,
    'test_accuracy': 0.5232665639445301,
    'test_num_examples': 3245}}]}

Once the session completes successfully, you can view the training metrics.

metrics = vfl_train_session.metrics().as_dict()
metrics

Plot the VFL training metrics.

fig = vfl_train_session.metrics().plot()

Example of plotted training metrics

Back to VFL Model Training overview.

VFL Prediction Session Example

This example continues the workflow in the previous sections: PRL Session Example and VFL Training Session Example.

Create and start a VFL prediction session

# Example configuration of a VFL predict session

vfl_predict_session = client.create_vfl_session(
    name="Testing notebook - VFL Predict",
    description="I am testing VFL Predict session creation with an AWS task runner through a notebook",
    prl_session_id=prl_session.id,
    training_session_id=vfl_train_session.id,
    vfl_mode="predict",
    data_config=data_config
).start()

vfl_predict_session.id

To create a VFL prediction session, specify the PRL session ID (prl_session_id) and the VFL train session ID (training_session_id) from your previous successful PRL and VFL sessions.

Set the vfl_mode to predict.

Specify the full path for the storage location for your predictions, including the file name.

Note: by default the task runner creates a bucket for you to upload data into (e.g. s3://{aws_taskrunner_profile}-{aws_taskrunner_name}.integrate.ai). Only the default S3 bucket and other buckets ending in *integrate.ai are supported. If you are not using the default bucket created by the task runner when it was provisioned, ensure that your data is hosted in an S3 bucket with a URL ending in *integrate.ai. Otherwise, the data will not be accessible to the task runner.
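
A path can be checked against this naming rule before it is submitted. The sketch below uses only the standard library; the helper bucket_is_supported and the sample workspace and task runner names are hypothetical, and the check covers only the suffix rule described in the note, not access permissions.

```python
from urllib.parse import urlparse


def bucket_is_supported(s3_url, suffix="integrate.ai"):
    """Return True when the URL is an s3:// path whose bucket name ends with the suffix."""
    parsed = urlparse(s3_url)
    return parsed.scheme == "s3" and parsed.netloc.endswith(suffix)


# Hypothetical workspace and task runner names.
default_bucket_path = "s3://myworkspace-mytaskrunner.integrate.ai/vfl/active_test.parquet"
other_bucket_path = "s3://my-company-data/vfl/active_test.parquet"

print(bucket_is_supported(default_bucket_path))
print(bucket_is_supported(other_bucket_path))
```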

#Where to store VFL predictions - must be full path and file name
vfl_predict_active_storage_path = f's3://{base_aws_bucket}/vfl_predict/active_predictions'
vfl_predict_passive_storage_path = f's3://{base_aws_bucket}/vfl_predict/passive_predictions'

Create and start a task group for the prediction session

storage_path = f"{vfl_predict_active_storage_path}_{vfl_predict_session.id}.csv"

vfl_predict_task_group_context = (SessionTaskGroup(vfl_predict_session)\
.add_task(iai_tb_aws.vfl_predict(
        client_name="active_client", 
        dataset_path=active_test_path, 
        raw_output=True,
        batch_size=1024, 
        storage_path = storage_path))\
.add_task(iai_tb_aws.vfl_predict(
        client_name="passive_client",
        dataset_path=passive_test_path,
        batch_size=1024,
        raw_output=True,
        storage_path = storage_path))\
.start())

Create a task in the task group for each client. The number of client tasks in the task group must match the number of clients specified in the data_config used to create the PRL and VFL train sessions.

The following parameters are required for each client task:

client_name - must be the same as the client name specified in the PRL and VFL train sessions
dataset_path - the name of a registered dataset
batch_size - set to a default value
raw_output - (bool, optional) whether the raw model output should be saved. Defaults to False, in which case a transformation corresponding to the ML task is applied.
storage_path - must be the full path and file name
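
The documentation does not say which transformation is applied when raw_output is False. For a classification task, a common choice is a softmax over the raw scores, so the sketch below assumes that purely for illustration; the softmax helper and the sample scores are hypothetical.

```python
import math


def softmax(raw_scores):
    """Convert raw scores to probabilities that sum to one."""
    shift = max(raw_scores)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]


raw_scores = [0.2, -0.4]  # hypothetical raw model output for one record, two classes
probabilities = softmax(raw_scores)
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```

Saving the raw output keeps this choice open, which is useful if you want to apply your own threshold or calibration afterwards.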

Monitor submitted VFL prediction jobs

# Example of monitoring tasks

for i in vfl_predict_task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))

vfl_predict_task_group_context.monitor_task_logs()

# Wait for the tasks to complete (success = True)

vfl_predict_task_group_context.wait(60*5, 2)

Each task in the task group kicks off a job in AWS Batch. You can monitor the jobs through the notebook, as shown.

View VFL predictions

# Retrieve the metrics

metrics = vfl_predict_session.metrics().as_dict()
metrics

After the predict session completes successfully, you can view the predictions from the Active party and evaluate the performance.

presigned_result_urls = vfl_predict_session.prediction_result()

print(vfl_predict_active_storage_path)
df_pred = pd.read_csv(presigned_result_urls.get(storage_path))

df_pred.head()

Example output:

Back to VFL Model Training

Gradient Boosted Models for VFL

Gradient boosting is a machine learning algorithm for building predictive models that helps minimize the bias error of the model. The gradient boosting model for VFL provided by integrate.ai uses the sklearn implementation of HistGradientBoostingClassifier for classifier tasks and HistGradientBoostingRegressor for regression tasks.

The VFL-GBM sample notebook (integrateai_taskrunner_vfl_gbm.ipynb) provides sample code for running the SDK, and should be used in parallel with this tutorial. This documentation provides supplementary and conceptual information.

Prerequisites

Open the integrateai_taskrunner_vfl_gbm.ipynb notebook to test the code as you walk through this tutorial.
Download the sample dataset to use with this tutorial. The sample files are:
- active_train.parquet - training data for the active party
- active_test.parquet - test data for the active party, used when joining a session
- passive_train.parquet - training data for the passive party
- passive_test.parquet - test data for the passive party, used when joining a session

Run a PRL session to align the datasets for VFL-GBM

Before you run a VFL session, you must first run a PRL session to align the datasets. The sample notebook provides two examples of running a PRL session with different match thresholds.

For more information, see Private Record Linkage (PRL) Sessions.

Review the sample Model Configuration for GBM

model_config = {
    "strategy": {"name": "VflGbm"},
    "model": {
        "params": {
            "max_depth": 4,
            "learning_rate": 0.05,
            "random_state": 23,
            "max_bins": 255,
        }
    },
    "ml_task": {"type": 'classification', "params": {}},
}

data_config = {
    "passive_client": {
        "label_client": False,
        "predictors": ["x1", "x3", "x5", "x7", "x9", "x11", "x13"],
        "target": None,
    },
    "active_client": {
        "label_client": True,
        "predictors": ["x0", "x2", "x4", "x6", "x8", "x10", "x12", "x14"],
        "target": "y",
    },
}

integrate.ai has a model class available for Gradient Boosted Models, called iai_gbm. This model is defined using a JSON configuration file during session creation.

The strategy for VFL-GBM is VflGbm. Note that this is different from the strategy name for an HFL GBM session, which is HistGradientBoosting.

You can adjust the following parameters as needed:

max_depth - Used to control the size of the trees.
learning_rate - (shrinkage) Used as a multiplicative factor for the leaf values. Set this to one (1) for no shrinkage.
random_state - Seeds the pseudo-random number generator that controls the subsampling in the binning process, and the train/validation data split if early stopping is enabled. Pass an int for reproducible output across multiple function calls. Accepts an int, a RandomState instance, or None. The default is None.
max_bins - The number of bins used to bin the data. Using fewer bins acts as a form of regularization. It is generally recommended to use as many bins as possible.
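
To illustrate why fewer bins act as regularization, the sketch below maps a feature to integer bin indices using quantile-style thresholds. It is a simplified illustration and not the binning used by scikit-learn or integrate.ai; the helpers quantile_thresholds and to_bin and the sample values are hypothetical.

```python
from bisect import bisect_right


def quantile_thresholds(values, max_bins):
    """Pick up to max_bins - 1 cut points that split the sorted values evenly."""
    ordered = sorted(values)
    n = len(ordered)
    # Duplicated cut points are dropped, so fewer bins may result.
    cuts = {ordered[(i * n) // max_bins] for i in range(1, max_bins)}
    return sorted(cuts)


def to_bin(value, thresholds):
    """Index of the bin that a value falls into."""
    return bisect_right(thresholds, value)


feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # hypothetical feature values

fine = [to_bin(v, quantile_thresholds(feature, 4)) for v in feature]
coarse = [to_bin(v, quantile_thresholds(feature, 2)) for v in feature]
print(fine)
print(coarse)
```

With two bins the trees can only split this feature at a single point, while four bins allow three candidate split points, so a smaller max_bins limits how closely the model can fit the training data.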

For more information, see the scikit-learn documentation for HistGradientBoostingClassifier.

Set the machine learning task type to either classification or regression.

Specify any parameters associated with the task type in the params section.

The notebook also provides a sample data schema. For the purposes of testing VFL-GBM, use the sample schema as shown.

Create a VFL-GBM training session

vfl_train_session = client.create_vfl_session(
    name="Testing notebook - VFL-GBM training",
    description="I am testing VFL GBM training session creation through a notebook",
    prl_session_id=prl_session.id,
    vfl_mode="train",
    min_num_clients=2,
    num_rounds=5,
    package_name="iai_ffnet",
    data_config=data_config,
    model_config=model_config,
).start()

vfl_train_session.id

Federated learning models created in integrate.ai are trained through sessions. You define the parameters required to train a federated model, including data and model configurations, in a session.

Create a session each time you want to train a new model.

The code sample demonstrates creating and starting a session with two training clients (two datasets) and five rounds. It returns a session ID that you can use to track and reference your session.

Create and start a task group to run the training session

# Create and start a task group with one task for each of the clients joining the session

vfl_task_group_context = (SessionTaskGroup(vfl_train_session)\
    .add_task(iai_tb_aws.vfl_train(train_path=active_train_path, 
                                    test_path=active_test_path, 
                                    batch_size=1024,  
                                    client_name="active_client", 
                                    storage_path=aws_storage_path))\
    .add_task(iai_tb_aws.vfl_train(train_path=passive_train_path, 
                                    test_path=passive_test_path, 
                                    batch_size=1024, 
                                    client_name="passive_client", 
                                    storage_path=aws_storage_path))\
    .start())

Each client must have a unique name that matches the name specified in the data_config. For example, active_client and passive_client.

Poll for Session Results

# Check the status of the tasks

for i in vfl_task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))

vfl_task_group_context.monitor_task_logs()

vfl_task_group_context.wait(60*5, 2)

Sessions may take some time to run depending on the compute environment. In the sample notebook and this tutorial, we poll the server to determine the session status.

View the VFL-GBM training metrics

metrics = vfl_train_session.metrics().as_dict()
metrics

fig = vfl_train_session.metrics().plot()

After the session completes successfully, you can view the training metrics and start making predictions.

1. Retrieve the model metrics as_dict.
2. Plot the metrics.

VFL-GBM Prediction

# Create and start a VFL predict session

vfl_predict_session = client.create_vfl_session(
    name="Testing notebook - VFL Predict",
    description="I am testing VFL Predict session creation with an AWS task runner through a notebook",
    prl_session_id=prl_session.id,
    training_session_id=vfl_train_session.id,
    vfl_mode="predict",
    data_config=data_config,
).start()

vfl_predict_session.id

After you have completed a successful PRL and VFL train session, you can use those sessions to create a VFL prediction session.

For more information about VFL predict mode, see VFL Prediction Session Example.

Back to VFL Model Training

Generalized Linear Models for VFL (VFL-GLM)

The AWS task runner sample notebook (integrateai_taskrunner_AWS.ipynb) provides sample code for running the SDK, and should be used in parallel with this tutorial. This documentation provides supplementary and conceptual information.

Prerequisites

Open the integrateai_taskrunner_AWS.ipynb notebook to test the code as you walk through this tutorial.
Make sure you have downloaded the sample files for VFL: https://s3.ca-central-1.amazonaws.com/public.s3.integrate.ai/integrate_ai_examples/vfl.zip and uploaded them to your S3 bucket.
Define the dataset paths as shown in the sample notebook.

Note: By default the task runner creates a bucket for you to upload data into (e.g. s3://{aws_taskrunner_profile}-{aws_taskrunner_name}.integrate.ai). Only the default S3 bucket and other buckets ending in *integrate.ai are supported. If you are not using the default bucket created by the task runner when it was provisioned, ensure that your data is hosted in an S3 bucket with a URL ending in *integrate.ai. Otherwise, the data will not be accessible to the task runner.
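The bucket rule in the note can be checked before a task is submitted. This is a minimal sketch using only the Python standard library; the helper name is_supported_bucket_path is hypothetical, and it checks only the URL suffix, not whether the bucket exists or is accessible to the task runner.

```python
from urllib.parse import urlparse

def is_supported_bucket_path(s3_url: str) -> bool:
    """Return True if the S3 URL points at a bucket whose name ends in 'integrate.ai'."""
    parsed = urlparse(s3_url)
    # For s3://bucket/key URLs, the bucket name is parsed as the network location
    return parsed.scheme == "s3" and parsed.netloc.endswith("integrate.ai")

print(is_supported_bucket_path("s3://myworkspace-mytaskrunner.integrate.ai/vfl/active_train.parquet"))
print(is_supported_bucket_path("s3://some-other-bucket/vfl/active_train.parquet"))
```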

aws_taskrunner_profile = "<myworkspace>" # This is your workspace name
aws_taskrunner_name = "<mytaskrunner>" # Task runner name - must match what was supplied in UI to create task runner

base_aws_bucket = f'{aws_taskrunner_profile}-{aws_taskrunner_name}.integrate.ai'

# Example datapaths. Make sure that the data you want to work with exists in the base_aws_bucket for your task runner.
# Sample PRL/VFL datapaths
active_train_path = f's3://{base_aws_bucket}/vfl/active_train.parquet'
active_test_path = f's3://{base_aws_bucket}/vfl/active_test.parquet'
passive_train_path = f's3://{base_aws_bucket}/vfl/passive_train.parquet'
passive_test_path = f's3://{base_aws_bucket}/vfl/passive_test.parquet'

#Where to store the trained model
aws_storage_path = f's3://{base_aws_bucket}/model'

#Where to store VFL predictions - must be full path and file name
vfl_predict_active_storage_path = f's3://{base_aws_bucket}/vfl_predict/active_predictions.csv'
vfl_predict_passive_storage_path = f's3://{base_aws_bucket}/vfl_predict/passive_predictions.csv'

Run a PRL session to align the datasets for VFL-GLM

Before you run a VFL session, you must first run a PRL session to align the datasets.

For more information, see Private Record Linkage (PRL) Sessions.

Review the sample Model Configuration for VFL-GLM

model_config = {
    "strategy": {"name": "VflGlm", "params": {}},
    "model": {
        "passive_client": {"params": {"input_size": 7, "output_activation": "sigmoid"}},
        "active_client": {"params": {"input_size": 8, "output_activation": "sigmoid"}},
    },
    "ml_task": {
        "type": "logistic",
        "params": {},
    },
    "optimizer": {"name": "SGD", "params": {"learning_rate": 0.2, "momentum": 0.0}},
    "seed": 23,  # for reproducibility
}

data_config = {
        "passive_client": {
            "label_client": False,
            "predictors": ["x1", "x3", "x5", "x7", "x9", "x11", "x13"],
            "target": None,
        },
        "active_client": {
            "label_client": True,
            "predictors": ["x0", "x2", "x4", "x6", "x8", "x10", "x12", "x14"],
            "target": "y",
        },
    }

integrate.ai has a model class available for Generalized Linear Models, called iai_glm. This model is defined using a JSON configuration file during session creation (model_config).

You can adjust the following model_config parameters as needed:

ml_task type - You can choose between the following tasks: logistic, normal, poisson, gamma, inverseGaussian, and tweedie. Tweedie has an additional power parameter to control the underlying target distribution.
output_activation - You can choose between the following activation functions: sigmoid, tanh, and exp.
optimizer - All PyTorch optimizers are supported.
input_size - The number of features in the input data of each client.
seed - For reproducibility.

For more information, see the scikit-learn documentation for Generalized Linear Models, or Generalized linear model on Wikipedia.
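To make the roles of input_size and output_activation concrete, the sketch below evaluates a GLM whose features are split between two clients. It is a conceptual illustration in plain Python, not the iai_glm implementation: the weights, inputs, bias, and function names are all made up. It relies only on the fact that a linear predictor over all features can be written as the sum of per-client partial predictors.

```python
import math

# Inverse link functions corresponding to the documented output_activation options
ACTIVATIONS = {
    "sigmoid": lambda z: 1.0 / (1.0 + math.exp(-z)),
    "tanh": math.tanh,
    "exp": math.exp,
}

def partial_predictor(weights, features):
    """Linear predictor contribution from one client's own features."""
    if len(weights) != len(features):
        raise ValueError("input_size mismatch between weights and features")
    return sum(w * x for w, x in zip(weights, features))

def glm_predict(partials, bias, output_activation):
    """Sum the per-client partial predictors and apply the output activation."""
    return ACTIVATIONS[output_activation](sum(partials) + bias)

# Made-up values: 8 active-client features and 7 passive-client features
active_part = partial_predictor([0.1] * 8, [1.0] * 8)    # approximately 0.8
passive_part = partial_predictor([-0.2] * 7, [1.0] * 7)  # approximately -1.4
prob = glm_predict([active_part, passive_part], bias=0.6, output_activation="sigmoid")
print(round(prob, 3))
```

The length check mirrors why input_size must equal the number of predictors listed for each client in the data_config.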

The notebook also provides a sample data schema. For the purposes of testing VFL-GLM, use the sample schema as shown.

Create a VFL-GLM training session

vfl_train_session = client.create_vfl_session(
    name="Testing notebook - VFL GLM Train",
    description="I am testing VFL GLM training session creation with a task runner through a notebook",
    prl_session_id=prl_session.id,
    vfl_mode='train',
    min_num_clients=2,
    num_rounds=2,
    package_name="iai_glm",
    data_config=data_config,
    model_config=model_config
).start()

vfl_train_session.id

Federated learning models created in integrate.ai are trained through sessions. You define the parameters required to train a federated model, including data and model configurations, in a session. Create a session each time you want to train a new model.

The code sample demonstrates creating and starting a session with two training clients (two datasets) and two (2) rounds. It returns a session ID that you can use to track and reference your session.

Create and start a task group to run the training session

# Create and start a task group with one task for each of the clients joining the session

vfl_task_group_context = (SessionTaskGroup(vfl_train_session)\
    .add_task(iai_tb_aws.vfl_train(train_path=active_train_path, 
                                    test_path=active_test_path, 
                                    batch_size=1024,
                                    client_name="active_client", 
                                    storage_path=aws_storage_path))\
    .add_task(iai_tb_aws.vfl_train(train_path=passive_train_path, 
                                    test_path=passive_test_path, 
                                    batch_size=1024, 
                                    client_name="passive_client", 
                                    storage_path=aws_storage_path))\
    .start())

Each client must have a unique name that matches the name specified in the data_config. For example, active_client and passive_client.

Poll for Session Results

# Check the status of the tasks
for i in vfl_task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))
vfl_task_group_context.monitor_task_logs()

# Wait for the session to complete
vfl_task_group_context.wait(60*8, 2)

Sessions may take some time to run depending on the compute environment. In the sample notebook and this tutorial, we poll the server to determine the session status.

View the training metrics

metrics = vfl_train_session.metrics().as_dict()
metrics

fig = vfl_train_session.metrics().plot()

After the session completes successfully, you can view the training metrics and start making predictions.

Retrieve the model metrics as_dict.
Plot the metrics.

VFL-GLM Prediction

# Create and start a VFL-GLM prediction session

vfl_predict_session = client.create_vfl_session(
    name="Testing notebook - VFL-GLM Predict",
    description="I am testing VFL-GLM prediction session creation with an AWS task runner through a notebook",
    prl_session_id=prl_session.id,
    training_session_id=vfl_train_session.id,
    vfl_mode="predict",
    data_config=data_config,
).start()

vfl_predict_session.id

# Create and start a task group with one task for each of the clients joining the session

vfl_predict_task_group_context = (SessionTaskGroup(vfl_predict_session)\
.add_task(iai_tb_aws.vfl_predict(
        client_name="active_client", 
        dataset_path=active_test_path, 
        raw_output=True,
        batch_size=1024, 
        storage_path=vfl_predict_active_storage_path))\
.add_task(iai_tb_aws.vfl_predict(
        client_name="passive_client",
        dataset_path=passive_test_path,
        batch_size=1024,
        raw_output=True,
        storage_path=vfl_predict_passive_storage_path))\
.start())

After you have completed a successful PRL session and a VFL-GLM train session, you can use those two sessions to create a VFL-GLM prediction session.

In the example, the sessions are run in sequence, so the session IDs for the PRL and VFL train sessions are readily available to use in the predict session. If you instead run a PRL session and want to reuse it later in a different VFL session, make sure that you save the session ID (prl_session.id). You can then provide the session ID directly in the predict session setup instead of relying on the variable. All three sessions must use the same client_name values and data in order to run successfully.
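One way to keep session IDs available across notebook restarts is to write them to a small JSON file. This is a sketch using only the standard library; the file name, helper names, and placeholder IDs are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def save_session_ids(path, **session_ids):
    """Write the given session IDs to a JSON file."""
    Path(path).write_text(json.dumps(session_ids, indent=2))

def load_session_ids(path):
    """Read session IDs back from the JSON file."""
    return json.loads(Path(path).read_text())

# Placeholder IDs; in a notebook these would come from prl_session.id and vfl_train_session.id
ids_file = Path(tempfile.gettempdir()) / "iai_session_ids.json"
save_session_ids(
    ids_file,
    prl_session_id="<PRL_SESSION_ID>",
    training_session_id="<VFL_TRAIN_SESSION_ID>",
)

restored = load_session_ids(ids_file)
print(restored["prl_session_id"])
```

The restored values can then be passed as prl_session_id and training_session_id when creating the predict session.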

# Check the status of the tasks
for i in vfl_predict_task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))
vfl_predict_task_group_context.monitor_task_logs()

# Wait for the tasks to complete (success = True)
vfl_predict_task_group_context.wait(60*5, 2)

View the prediction results

# Retrieve the metrics
metrics = vfl_predict_session.metrics().as_dict()
metrics

import pandas as pd
presigned_result_urls = vfl_predict_session.prediction_result()

df_pred = pd.read_csv(presigned_result_urls.get(vfl_predict_active_storage_path))

df_pred.head()

Now you can view the predictions and evaluate the performance.

For more information about VFL predict mode, see VFL Prediction Session Example.

Back to VFL Model Training

DATA ANALYSIS

EDA in Individual Mode

The Exploratory Data Analysis (EDA) feature for horizontal federated learning (HFL) enables you to access summary statistics about a group of datasets without needing access to the data itself. This allows you to get a basic understanding of the dataset when you don't have access to the data or you are not allowed to do any computations on the data.

EDA is an important pre-step for federated modelling and a simple form of federated analytics. The feature has a built-in differential privacy setting. Differential privacy (DP) noise is dynamically added to each histogram that is generated for each feature in a participating dataset. The added privacy protection causes slight noise in the end result.
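To illustrate why the results deviate slightly from the true values, the sketch below adds noise to histogram counts. The documentation does not state which mechanism integrate.ai uses; the Laplace mechanism, the epsilon value, and the synthetic ages here are assumptions chosen for illustration.

```python
import numpy as np

def noisy_histogram(values, bins, epsilon, rng):
    """Histogram counts with Laplace noise of scale 1/epsilon added to each bin."""
    counts, edges = np.histogram(values, bins=bins)
    # Adding or removing one record changes one bin count by 1, so the sensitivity is 1
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    return counts + noise, edges

rng = np.random.default_rng(seed=0)
ages = rng.integers(18, 90, size=1000)  # synthetic data

true_counts, _ = np.histogram(ages, bins=6)
dp_counts, _ = noisy_histogram(ages, bins=6, epsilon=1.0, rng=rng)
print(true_counts)
print(np.round(dp_counts, 1))
```

A smaller epsilon gives stronger privacy and noisier counts, so statistics derived from the histogram drift further from the true values.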

The sample notebook (integrate_ai_api.ipynb) provides runnable code examples for exploring the API, including the EDA feature, and should be used in parallel with this tutorial. This documentation provides supplementary and conceptual information to expand on the code demonstration.

API Reference for EDA

The core API module that contains the EDA-specific functionality is integrate_ai_sdk.api.eda. This module includes a core object called EdaResults, which contains the results of an EDA session.

If you are comfortable working with the integrate.ai SDK and API, see the API Documentation for details.

This tutorial assumes that you have correctly configured your environment for working with integrate.ai, as described in Using integrate.ai.

Configure an EDA Session

To begin exploratory data analysis, you must first create a session.

The eda_data_config file is a configuration file that maps the name of one or more datasets to the columns to be pulled. Dataset names and column names are specified as key-value pairs in the file.

For each pair, the keys are dataset names that are expected for the EDA analysis. The values are a list of corresponding columns. The list of columns can be specified as column names (strings), column indices (integers), or a blank list to retrieve all columns from that particular dataset.

If a dataset name is not included in the configuration file, all columns from that dataset are used by default.

For example:

To retrieve all columns for a submitted dataset named dataset_one:

eda_data_config = {"dataset_one": []}

To retrieve the first column and the column x2 for a submitted dataset named dataset_one:

eda_data_config = {"dataset_one": [1,"x2"]}

To retrieve the first column and the column x2 for a submitted dataset named dataset_one and all columns in a dataset named dataset_two:

eda_data_config = {"dataset_one": [1,"x2"],"dataset_two": []}

Specifying the number of datasets

You can manually set the number of datasets that are expected to be submitted for an EDA session by specifying a value for num_datasets.

If num_datasets is not specified, the number is inferred from the number of datasets provided in eda_data_config.
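The configuration rules above can be summarized in a short sketch. The helper expected_num_datasets is hypothetical and only mirrors the documented behaviour (lists of column names or indices, with the dataset count inferred when num_datasets is omitted); it does not call the SDK.

```python
def expected_num_datasets(eda_data_config, num_datasets=None):
    """Use num_datasets when given; otherwise count the datasets named in the config."""
    for name, columns in eda_data_config.items():
        if not isinstance(columns, list):
            raise TypeError(f"Columns for {name!r} must be a list")
        if not all(isinstance(c, (str, int)) for c in columns):
            raise TypeError(f"Columns for {name!r} must be names (str) or indices (int)")
    return len(eda_data_config) if num_datasets is None else num_datasets

eda_data_config = {"dataset_one": [1, "x2"], "dataset_two": []}
print(expected_num_datasets(eda_data_config))                  # inferred from the config
print(expected_num_datasets(eda_data_config, num_datasets=3))  # set explicitly
```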

Create and start an EDA session

Create the EDA individual session

eda_data_config = {"dataset_one": [1,"x5","x7"], "dataset_two": ["x1","x10","x11"]}

eda_session = client.create_eda_session(
    name="EDA Individual Session",
    description="Testing EDA individual mode through a notebook",
    data_config=eda_data_config,
    eda_mode="individual",               #Generates histograms on single nodes
    single_client_2d_pairs = None,      #Optional - only required to generate 2D histograms
).start()

eda_session.id

Session parameters:

eda_mode = One of {'individual','intersect'}. Defaults to 'individual'.
single_client_2d_pairs (Dict, optional): a data_config like dict with the column names to use to generate 2d-histograms when both columns belong to same dataset. This option only considers the single node 2d-histograms and is valid for both 'intersect' and 'individual' mode. Defaults to None.

Note that, unlike other sessions, you do not need to specify a model_config file for EDA. This is because there are no editable parameters.

For more information, see the create_eda_session() definition in the API documentation.

The code sample demonstrates creating and starting an EDA session to perform privacy-preserving data analysis on two datasets, named dataset_one and dataset_two. It returns an EDA session ID that you can use to track and reference your session.

The eda_data_config used here specifies that the first column (1), column x5, and column x7 will be analyzed for dataset_one and columns x1, x10, and x11 will be analyzed for dataset_two.

Since the num_datasets argument is not provided to client.create_eda_session(), the number of datasets is inferred as two from the eda_data_config.

Analyze the EDA results

The results object is a dataset collection comprised of multiple datasets that can be retrieved by name. Each dataset is comprised of columns that can be retrieved by either column name or by index.

You can perform the same base analysis functions at the collection, dataset, or column level.

results = eda_session.results()

Example output:

EDA Session collection of datasets: ['dataset_two', 'dataset_one']

Describe

Use the .describe() method to retrieve a standard set of descriptive statistics.

If a statistical function is invalid for a column (for example, mean requires a continuous column and x10 is categorical), or the column from one dataset is not present in the other (for example, here x5 is in dataset_one, but not dataset_two), then the result is NaN.

results.describe()

results["dataset_one"].describe()

Statistics

For categorical columns, you can use other statistics for further exploration. For example, unique_count, mode, and uniques.

results["dataset_two"][["x10", "x11"]].uniques()

You can call functions such as .mean(), .median(), and .std() individually.

results["dataset_one"].mean()

results["dataset_one"]["x1"].mean()

Histograms

You can create histogram plots using the .plot_hist() function.

saved_dataset_two_hist_plots = results["dataset_two"].plot_hist()

single_hist = results["dataset_two"]["x1"].plot_hist()

EDA in Intersect Mode

The Exploratory Data Analysis (EDA) Intersect mode enables you to access summary statistics about the intersection of a group of datasets without needing access to the data itself. It is used primarily with VFL use cases to understand statistics about the overlapped records. For example, you may be interested in "what is the distribution of age among the intersection" (the intersect mode), which could have a very different answer from "what is the distribution of age among the whole population" (the individual mode).
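The difference between the two modes can be seen on a toy example. The records below are made up, and the overlap is computed with a plain set intersection purely for illustration; in the platform, the overlap comes from a PRL session and the raw records are not shared.

```python
from statistics import mean

# Made-up records keyed by a shared ID; only some IDs appear in both datasets
dataset_a = {"id1": 25, "id2": 40, "id3": 61, "id4": 33}     # id -> age
dataset_b = {"id2": "gold", "id3": "silver", "id9": "gold"}  # id -> plan

overlap_ids = dataset_a.keys() & dataset_b.keys()

# Individual mode: statistics over every record in dataset_a
mean_age_individual = mean(dataset_a.values())
# Intersect mode: statistics over the overlapping records only
mean_age_intersect = mean(dataset_a[i] for i in overlap_ids)

print(mean_age_individual, mean_age_intersect)
```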

With EDA Intersect, you can:

obtain descriptive statistics, as in EDA Individual mode, but on the intersection (i.e., overlapping indices)
use bivariate operations such as groupby or correlation
compute the correlation matrix
plot a 2D histogram based on the joint distribution of two features from different clients

EDA is an important pre-step for federated modelling and a simple form of federated analytics. This feature has a built-in mechanism to achieve differential privacy. Proper noise is dynamically added to each histogram that is generated for each feature in a participating dataset. This adds extra protection on the raw data by making the final results differentially private, but at the same time it means the final results will deviate slightly from the true values.

API Reference

The core API module that contains the EDA-specific functionality is integrate_ai_sdk.api.eda. This module includes a core object called EdaResults, which contains the results of an EDA session. If you are comfortable working with the integrate.ai SDK and API, see the API Documentation for details.

Configure an EDA Intersect Session

Example data paths in S3

aws_taskrunner_profile = "<myworkspace>" # This is your workspace name
aws_taskrunner_name = "<mytaskrunner>" # Task runner name - must match what was supplied in UI to create task runner

base_aws_bucket = f'{aws_taskrunner_profile}-{aws_taskrunner_name}.integrate.ai'

active_train_path = f's3://{base_aws_bucket}/vfl/active_train.parquet'
active_test_path = f's3://{base_aws_bucket}/vfl/active_test.parquet'
passive_train_path = f's3://{base_aws_bucket}/vfl/passive_train.parquet'
passive_test_path = f's3://{base_aws_bucket}/vfl/passive_test.parquet'

#Where to store the trained model
aws_storage_path = f's3://{base_aws_bucket}/model'

eda_data_config = {
    "active": {"columns": []},
    "passive": {"columns": []},
}

prl_session_id = "<PRL_SESSION_ID>"

This example uses an AWS task runner to run the session using data in S3 buckets. Ensure that you have completed the AWS configuration for task runners and that a task runner exists. See Create an AWS task runner for details.

To begin exploratory data analysis, you must first create a session. To configure the session, specify the following:

EDA data configuration (eda_data_config)
prl_session_id for the PRL session you want to work with

The eda_data_config specifies the names of the datasets used to generate the intersection in the PRL session in the format dataset_name : columns. If columns is empty ([]), then EDA is performed on all columns.

You must also specify the session ID of a previous successful PRL session that used the same datasets listed in the eda_data_config.

Paired Columns

single_client_2d_pairs = {"active_client": ['x0', 'x2'], "passive_client": ['x1']}

cross_client_2d_pairs = {"active_client": ['x0', 'x2'], "passive_client": ['x1']}

To find the correlation (or perform any other binary operation) between two specific columns in two datasets, you must specify those columns as paired columns.

To set which pairs you are interested in, specify their names in a dictionary like eda_data_config.

For example: {"passive_client": ['x1', 'x5'], "active_client": ['x0', 'x2']}

will generate 2D histograms for these pairs of columns:

(x0, x1), (x0, x5), (x2, x1), (x2, x5), (x0, x2), (x1, x5)
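The pair list above can be reproduced with a short enumeration. This sketch only mirrors the example: it lists every cross-dataset pair followed by every same-dataset pair. The helper name is hypothetical, and the sketch does not model how the platform divides the work between single_client_2d_pairs and cross_client_2d_pairs.

```python
from itertools import combinations, product

def enumerate_2d_pairs(pairs_config):
    """List cross-dataset column pairs first, then same-dataset pairs."""
    datasets = list(pairs_config.values())
    cross = [p for a, b in combinations(datasets, 2) for p in product(a, b)]
    within = [p for cols in datasets for p in combinations(cols, 2)]
    return cross + within

config = {"active_client": ["x0", "x2"], "passive_client": ["x1", "x5"]}
print(enumerate_2d_pairs(config))
```

With two columns per dataset this yields four cross-dataset pairs and two same-dataset pairs, so the number of 2D histograms grows quickly as more columns are listed.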

Create an EDA intersect session

Create the EDA intersect session

eda_session = client.create_eda_session(
    name="EDA Intersect Session",
    description="Testing EDA Intersect mode through a notebook",
    data_config=eda_data_config,
    eda_mode="intersect",               #Generates histograms on an overlap of two distinct datasets
    prl_session_id=prl_session_id,      #Required for intersect mode
    single_client_2d_pairs = None,      #Optional - only required to generate single node 2D histograms (both columns in same dataset)
    cross_client_2d_pairs = None       #Optional - only required to generate 2D histograms (between two datasets)
).start()

eda_session.id

Session parameters:

eda_mode = One of {'individual','intersect'}. Defaults to 'individual'.
prl_session_id (str, optional): Session ID of associated PRL session. Required for eda_mode = 'intersect'.
single_client_2d_pairs (Dict, optional): a data_config like dict with the column names to use to generate 2d-histograms when both columns belong to same dataset. This option only considers the single node 2d-histograms and is valid for both 'intersect' and 'individual' mode. Defaults to None.
cross_client_2d_pairs (Dict, optional): a data_config like dict with the column names to use to generate cross-dataset 2d-histograms. This option only works for 'intersect' mode. Defaults to None.

Note that, unlike other sessions, you do not need to specify a model_config file for EDA. This is because there are no editable parameters.

For more information, see the create_eda_session() definition in the API documentation.

Create and run the task group

# Create a task group with one task for each of the clients joining the session. If you have registered your dataset(s), specify only the `dataset_name` (no `dataset_path`).

task_group = (
    SessionTaskGroup(eda_session)
    .add_task(iai_tb_aws.eda(dataset_name="active_train", dataset_path=active_train_path))
    .add_task(iai_tb_aws.eda(dataset_name="passive_train", dataset_path=passive_train_path))
)

# Start the task group and keep the returned context for monitoring
eda_task_group_context = task_group.start()

Create a task in the task group for each client. The number of tasks in the task group must match the number of clients specified in the data_config used to create the session.

Monitor submitted EDA Intersect jobs

for i in eda_task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))

eda_task_group_context.monitor_task_logs()

# Wait for the tasks to complete (success = True)

eda_task_group_context.wait(60*5, 2)

Submitted tasks are in the Pending state until the clients join and the session is started. Once started, the status changes to Running.

When the session completes successfully, "True" is returned. Otherwise, an error message appears.

Analyze the results

To retrieve the results of the session:

results = eda_session.results()

Example output:

EDA Session collection of datasets: ['active_client', 'passive_client']

Describe

You can use the .describe() function to review the results.

results.describe()

Statistics

For categorical columns, you can use other statistics for further exploration. For example, unique_count, mode, and uniques.

results["active_client"][["x10", "x11"]].uniques()

Mean

You can call functions such as .mean(), .median(), and .std() individually.

results["active_client"].mean()

Histograms

You can create histogram plots using the .plot_hist() function.

saved_active_client_hist_plots = results["active_client"].plot_hist()

single_hist = results["active_client"]["x10"].plot_hist()

2D Histograms

You can also plot 2D-histograms of specified paired columns.

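# Select each client's dataset from the results collection by name
active_client = results["active_client"]
passive_client = results["passive_client"]

fig = results.plot_hist(active_client['x0'], passive_client['x1'])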

Correlation

You can perform binary calculations, such as finding the correlation, on columns that were specified as paired columns (cross_client_2d_pairs).

results.corr(active_client['x0'], passive_client['x1'])

Addition, subtraction, division

Addition example. Change the operator to try subtraction, division, etc.

op_res = active_client['x0'] + passive_client['x1']
fig = op_res.plot_hist()

GroupBy

groupby_result = results.groupby(active_client['x0'])[passive_client['x5']].mean()
print(groupby_result)

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique that helps rule out features that are not important to the model. In other words, it generates a transformation to apply on a dataset to simplify the number of features used while retaining enough information in the features to represent the original dataset.

The main use of PCA is to help researchers uncover the internal pattern underlying the data. This pattern is then helpful mainly in two applications:

Helps discover new groups and clusters among patients
More efficient modelling: achieving the same or even better predictive power with less input and simpler models.

The iai_pca session outputs a transformation matrix that is generated in a distributed manner. This transformation can be applied to each dataset of interest.

The underlying technology of this session type is FedSVD (federated singular value decomposition), which can be used in other dimensionality reduction methods.
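For intuition about the transformation that a PCA session produces, the sketch below computes PCA through a singular value decomposition on a single, centralized array. It is not FedSVD and not the iai_pca implementation; the function name and the random data are assumptions for illustration only.

```python
import numpy as np

def pca_svd(X, n_components):
    """Return principal axes (as rows), eigenvalues, and the feature means of X."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal axes; singular values give the eigenvalues
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigenvalues = (S ** 2) / (X.shape[0] - 1)
    return Vt[:n_components], eigenvalues[:n_components], mean

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # synthetic data: 200 rows, 10 features

axes, eigenvalues, mean = pca_svd(X, n_components=3)
X_pc = (X - mean) @ axes.T  # project onto the first 3 principal axes
print(axes.shape, X_pc.shape)
```

The projection step is the "transformation applied to each dataset of interest": every dataset is centered and multiplied by the same axes matrix.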

For a real-world example of using a PCA session, see the pca_face_decomposition.ipynb notebook available in the sample repository. Sample data for running this notebook is available in a zip file here.

Configure a PCA Session

model_config = {
    "experiment_name": "test_pca",
    "experiment_description": "test_pca",
    "strategy": {"name": "FedPCA", "params": {}},
    "model": {
        "params": {
            "n_features": 10,   # optional
            "n_components": 3,
            "whiten": False
        }
    },
}

data_config = {
    "predictors": []
}

This tutorial assumes that you have correctly configured your environment for working with integrate.ai, as described in Installing the SDK and Using integrate.ai. It uses a locally run Docker container for the client. integrate.ai manages the server component.

This example describes how to train an HFL session with the built-in package iai_pca, which performs the principal component analysis on the raw input data, and outputs the desired number of principal axes in the feature space along with the corresponding eigenvalues. The resulting model can be used to transform data points from the raw input space to the principal component space.

Sample model config and data config for PCA

In the model_config, use the strategy FedPCA with the iai_pca package.

There are three parameters that can be configured:

n_features: the number of input features. Optional. If not specified, it is inferred based on the data config.
n_components: the number of principal components to keep. Can be an integer or a fraction in the open interval (0, 1). In the latter case, the number of components is selected such that the fraction of variance explained is greater than the value specified by n_components.
whiten: When True (False by default) the principal axes will be modified to ensure uncorrelated outputs with unit component-wise variances. Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.

The example performs PCA on the 10 input features and outputs the first 3 principal axes.
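The fractional form of n_components can be illustrated with a small sketch. It follows the rule described above (keep the smallest number of components whose cumulative explained variance exceeds the fraction); whether iai_pca resolves ties in exactly this way is an assumption, and the eigenvalues are made up.

```python
import numpy as np

def select_n_components(eigenvalues, n_components):
    """Resolve n_components given as an integer or as a variance fraction in (0, 1)."""
    if isinstance(n_components, int):
        return n_components
    ratios = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    # Smallest k whose cumulative explained-variance ratio exceeds the fraction
    return int(np.searchsorted(ratios, n_components, side="right") + 1)

eigenvalues = np.array([5.0, 3.0, 1.0, 1.0])  # cumulative ratios: 0.5, 0.8, 0.9, 1.0
print(select_n_components(eigenvalues, 3))
print(select_n_components(eigenvalues, 0.85))
```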

Create and start a PCA training session

pca_session = client.create_fl_session(
    name="Testing PCA ",
    description="Testing a PCA session through a notebook",
    min_num_clients=2,
    num_rounds=2,
    package_name="iai_pca",
    model_config=model_config,
    data_config=data_config,
).start()

pca_session.id

Note that the num_rounds parameter is ignored when iai_pca is used, because PCA sessions determine the number of rounds required automatically.

Make sure that the sample data you downloaded is either saved to your ~/Downloads directory, or you update the data_path below to point to the sample data.

data_path = "~/Downloads/synthetic"

Create a task group for the two clients.

from integrate_ai_sdk.taskgroup.taskbuilder import local
from integrate_ai_sdk.taskgroup.base import SessionTaskGroup

tb = local.local_docker(
    client,
    docker_login=False,
)

task_group_context = (
    SessionTaskGroup(pca_session)
    .add_task(tb.hfl(train_path=train_path1, test_path=test_path, client_name="client1"))\
    .add_task(tb.hfl(train_path=train_path2, test_path=test_path, client_name="client2"))\
    .start()
)

Wait for the session to complete successfully. This code will return True when it is complete.

task_group_context.wait(60 * 5)

View PCA session results

Note that the reverse process of mapping data points from the principal component space back to the original feature space can be treated as a multivariate regression task (i.e., reconstruct the raw features). We log some regression metrics (e.g., loss/mse and r2) for PCA sessions. They can be pulled as follows.

pca_session.metrics().as_dict()

The PCA results can be retrieved as follows. It is a standard PyTorch model object, and the .state_dict() method can be used to show all stored parameters, including the principal axes matrix and eigenvalues.

pca_transformation = pca_session.model().as_pytorch()
pca_transformation.state_dict()

We can also transform data points from the original feature space to the principal component space by directly calling the model object on the data tensors.

import pandas as pd

test_data = pd.read_parquet(f"{data_path}/test.parquet")
test_data.head()

import torch

test_data_tensor = torch.tensor(test_data[data_schema["predictors"]].values)
test_data_pc = pca_transformation(test_data_tensor)
test_data_pc.shape

DATA VALUATION

Data Consumers perform data evaluation with integrate.ai to understand how valuable a Data Provider's data product could be to their data science tasks or business problems.

There are two core questions Data Consumers typically seek to answer when evaluating a model or data product:

How much data from a data product is usable in reference to my internal data?
How relevant or useful is the data product to my data science task, use case, or project?

Our data valuation features help data scientists understand the value of a dataset as it relates to its contribution to a particular model.

integrate.ai provides features to address data valuation at two different levels:

Dataset Influence Score that quantifies and ranks the influence each dataset has on a model.
Feature Importance Score that quantifies and ranks features across contributing datasets, based on their influence on a global model.

We use the Shapley values from game theory and their related extensions to determine the scores.

Where they can be applied

The calculation of these scores is built into our vertical federated modeling sessions. Influence results are calculated during every round of the federated training session and are displayed with the results of the session to show how these scores trend over the training rounds. Feature importance is only evaluated once, at the end of the training.

The Data Influence Score and Feature Importance Score are applicable to regression/classification tasks for VFL. This applies to the following model types:

Feed forward neural nets (SplitNN)
Generalized linear models (GLM)

The integrateai_taskrunner_AWS.ipynb notebook contains an example of calculating and plotting the scores for both dataset influence and feature importance for a SplitNN VFL session. The example plots shown in this section are generated using this notebook and the sample datasets provided by integrate.ai.

Note: A completed successful PRL session (prl_session_id) with the appropriate overlap is required input for the VFL session. For more information about PRL sessions, see Private Record Linkage (PRL) sessions. For an overview of VFL SplitNN, see VFL SplitNN Model Training.

Dataset influence

Dataset influence measures the impact of an individual dataset (that is, a data product) on the performance of a machine learning model trained using the integrate.ai platform. By running dataset influence, a Data Consumer can understand to what extent a Data Provider's data product contributed to the performance of a machine learning model they trained. This feature can be used in conjunction with other evaluation jobs to help Data Consumers answer the question of data product relevance to their project.

This score is based on multi-round Shapley values. It can be shown as a percentage of overall influence of clients on the final model (shown in bar plot below) or as an absolute score (shown in line graph below). The absolute score can be plotted on a per round basis, to see how the influence stabilizes as training progresses.

Calculating Influence Scores

# Dataset influence scores are enabled by default.
    "influence_score": {
        "enable": True,
        "params": {
            "rel_thresh": 0.01,
            "max_eval_len": None
        }
    },

This feature is enabled by default.

To view influence scores, add the sample code shown to your VFL model_config.

Parameters:

rel_thresh is the relative threshold on the performance metric that determines whether to stop influence evaluation early. The influence score is calculated by breaking down the total model improvement over a round and assigning portions of it to the different datasets. If the total model improvement is very small, it is not necessary to calculate the influence, and you can stop evaluation early to speed up training. For example, the default of 0.01 means that influence evaluation stops when the model performance metric changes by less than 1% in a particular round.
max_eval_len is the maximum length of sub-combinations to consider. It determines whether to apply an approximation of the full Shapley score calculation. It takes integer values from 0 to num_datasets. The smaller the value, the faster the calculation but the less reliable the scores. The default is None, in which case no approximation is performed.
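The two parameters can be pictured with the following sketch. The helper names are hypothetical, and the logic is only one reading of the parameter descriptions above, not the SDK's internal code.

```python
from itertools import combinations


def skip_influence_eval(prev_metric, curr_metric, rel_thresh=0.01):
    """Return True when the relative metric change in a round is below rel_thresh."""
    if prev_metric == 0:
        return False  # relative change is undefined, so evaluate to be safe
    rel_change = abs(curr_metric - prev_metric) / abs(prev_metric)
    return rel_change < rel_thresh


def sub_combinations(datasets, max_eval_len=None):
    """List dataset sub-combinations up to max_eval_len (None means all sizes)."""
    limit = len(datasets) if max_eval_len is None else min(max_eval_len, len(datasets))
    combos = []
    for size in range(limit + 1):
        combos.extend(combinations(datasets, size))
    return combos
```

With three datasets, the full calculation visits 8 sub-combinations, while max_eval_len=1 visits only 4, which shows why a smaller value is faster but less complete.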

Plotting Influence Scores

Add the following sample code to plot overall influence of clients on the final model.

vfl_train_influence_scores = vfl_train_session.influence_scores(all_metrics=False)
_ = vfl_train_influence_scores.plot()

This produces a bar plot.

Add the following sample code to plot the per round influence of clients on the final model.

_ = vfl_train_influence_scores.plot(per_round=True)

This produces a line graph.

Feature Importance

Data Consumers can use feature importance to understand how specific features in a Data Provider's data product impacted the performance of a machine learning model they trained. They can also compare features to the features in their own data. Note that if the data product consists of only one feature, we recommend using Dataset Influence instead of Feature Importance. Feature importance can be used in conjunction with other evaluation jobs to help Data Consumers answer the question of data product relevance to their project.

This score is based on the "SHapley Additive exPlanations" (SHAP) method and is single-round.

Use the SDK to plot this score as a beeswarm or a bar plot. The beeswarm plot shows the individual SHAP value of each data point for a feature. The bar plot shows the overall importance of each feature: the larger the magnitude (absolute value), the more important the feature. The overall importance of a feature is the mean of the absolute SHAP values across samples.
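The overall importance described above can be reproduced by hand from per-sample SHAP values. The following is a minimal sketch with made-up numbers; the function name and input layout are assumptions for illustration, not part of the SDK.

```python
def overall_importance(shap_values):
    """Mean absolute SHAP value per feature.

    `shap_values` maps a feature name to the list of per-sample SHAP values.
    """
    return {
        feature: sum(abs(v) for v in values) / len(values)
        for feature, values in shap_values.items()
    }


# Made-up SHAP values for three samples and two features.
example = {"age": [0.2, -0.4, 0.3], "income": [-0.05, 0.1, 0.0]}
importance = overall_importance(example)

# Features ordered from most to least important, as in a bar plot.
ranked = sorted(importance, key=importance.get, reverse=True)
```

Positive and negative SHAP values do not cancel out here because the absolute value is taken before averaging.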

This feature is disabled by default.

Calculating Feature Importance Scores

# Feature importance scores are disabled by default. To enable, set "enable" to True.
    "feature_importance_score": {
        "enable": True,
        "params": {
            "max_evals": None,
            "subsample": None,
            "random_seed": 23,
        }
    },

To enable feature importance, add the sample code shown to your VFL model_config.

At a high level, feature importance is a type of sensitivity analysis. For one given real sample, we generate a set of “fake samples” by perturbing a subset of feature values, then check how the model output changes accordingly. The parameters determine the setting for this sensitivity analysis.

Parameters:

max_evals is the number of “fake samples” to generate for each real sample. The default is 10 * (2 * total_n_features + 1), which grows with the number of features. You can cap it by setting a fixed integer (for example, 500), but then the reliability of the importance scores decreases as the number of features increases, because there are fewer fake samples per feature.
subsample controls how many real samples to consider from the test dataset. By default it considers all the samples. You can set a fraction to represent the percentage to select, or an integer to represent the actual number of samples to select.
random_seed controls the randomness of generating fake samples. This is separate from the random_seed set for the overall model training in the model_config.
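To get a feel for the cost of the analysis, the sketch below evaluates the default max_evals formula and resolves a subsample setting into a sample count. The helper names are hypothetical, and the rounding of fractional subsamples is an assumption.

```python
def default_max_evals(total_n_features):
    """Default number of fake samples per real sample, per the formula above."""
    return 10 * (2 * total_n_features + 1)


def resolve_subsample(subsample, n_test_samples):
    """Number of real samples to use: None = all, float = fraction, int = count."""
    if subsample is None:
        return n_test_samples
    if isinstance(subsample, float):
        # Assumption: a fraction is rounded to the nearest whole sample.
        return round(n_test_samples * subsample)
    return min(subsample, n_test_samples)


# Total model evaluations for 20 features and a 25% subsample of 1000 test rows.
total_evaluations = default_max_evals(20) * resolve_subsample(0.25, 1000)
```

Under these assumptions, the number of model evaluations is the product of the two values, so reducing either one shortens the calculation.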

Plotting Feature Importance Scores

To display the calculated feature importance scores, add the following sample code.

vfl_feature_importance = vfl_train_session.feature_importance_scores()
vfl_feature_importance.as_dict()["feature_importance_scores"]

To plot the importance scores for the top 15 features, add the following sample code.

_ = vfl_feature_importance.plot(topk=15)

This produces a bar plot.

To generate the SHAP value box plots, add the following sample code.

_ = vfl_feature_importance.plot(style="box", topk=15)

This produces two box plots.

To list the feature importance score details, add the following sample code.

vfl_feature_importance.as_dict()["feature_importance_details"][0]["active_client"]

REFERENCE

Model Packages

Feed Forward Neural Nets (iai_ffnet)

Feedforward neural nets are a type of neural network in which information flows through the nodes in a single direction.

Examples of use cases include:

Classification tasks like image recognition or churn conversion prediction.
Regression tasks like forecasting revenues and expenses, or determining the relationship between drug dosage and blood pressure.

HFL FFNet

The iai_ffnet model is a feedforward neural network for horizontal federated learning (HFL) that uses the same activation for each hidden layer.

This model only supports classification and regression. Custom loss functions are not supported.

Privacy

DP-SGD (differentially private stochastic gradient descent) is applied as an additional privacy-enhancing technology. The basic idea of this approach is to modify the gradients used in stochastic gradient descent (SGD), which lies at the core of almost all deep learning algorithms.
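The gradient modification can be sketched as follows: each per-example gradient is clipped to a maximum norm, the clipped gradients are summed, Gaussian noise is added, and the result is averaged. This is the textbook form of DP-SGD written with plain Python lists; the function and parameter names are illustrative and do not reflect the platform's configuration keys.

```python
import math
import random


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng=random):
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    summed = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down only if its norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            summed[i] += g * scale
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(per_example_grads) for x in noisy]
```

Clipping bounds how much any single example can influence the update, and the noise masks what remains of that influence.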

VFL SplitNN

integrate.ai also supports the SplitNN model for vertical federated learning (VFL). In this model, neural networks are trained with data across multiple clients. A PRL (private record linkage) session is required for all datasets involved. There are two types of sessions: train and predict. To make predictions, the PRL session ID and the corresponding training session ID are required.

For more information, see PRL Session and VFL SplitNN Model Training.

Generalized Linear Models (GLMs)

This model class supports a variety of regression models. Examples include linear regression, logistic regression, Poisson regression, Gamma regression, Tweedie regression, and inverse Gaussian regression models. We also support regularizing the model coefficients with the elastic net penalty.

Examples of use cases include [1]:

Agriculture / weather modeling: number of rain events per year, amount of rainfall per event, total rainfall per year
Risk modeling / insurance policy pricing: number of claim events / policyholder per year, cost per event, total cost per policyholder per year
Predictive maintenance: number of production interruption events per year, duration of interruption, total interruption time per year

The iai_glm model trains generalized linear models by treating them as a special case of single-layer neural nets with particular output activation functions.
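The "single layer plus output activation" view can be made concrete with a short sketch. The activations below are the standard inverse link functions for each family; whether iai_glm uses exactly these is an assumption, and the names are illustrative.

```python
import math

# Output activation (inverse link function) for a few GLM families.
OUTPUT_ACTIVATIONS = {
    "linear": lambda z: z,                             # identity link
    "logistic": lambda z: 1.0 / (1.0 + math.exp(-z)),  # logit link
    "poisson": lambda z: math.exp(z),                  # log link
}


def glm_predict(features, weights, bias, family):
    """Single linear layer followed by the family's output activation."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return OUTPUT_ACTIVATIONS[family](z)
```

Only the output activation, together with the matching loss, changes between families; the linear layer is the same in every case.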

Privacy

References [1]: https://scikit-learn.org/stable/modules/linear_model.html#generalized-linear-regression

Strategy Library

Reference guide for available training strategies.

FedAvg

Federated Averaging strategy implemented based on https://arxiv.org/abs/1602.05629

Parameters

fraction_fit (float, optional): Fraction of clients used during training. Defaults to 0.1.
fraction_eval (float, optional): Fraction of clients used during validation. Defaults to 0.1.
min_fit_clients (int, optional): Minimum number of clients used during training. Defaults to 1.
min_eval_clients (int, optional): Minimum number of clients used during validation. Defaults to 1.
min_available_clients (int, optional): Minimum number of total clients in the system. Defaults to 1.
eval_fn (Callable[[Weights], Optional[Tuple[float, float]]], optional): Function used for validation. Defaults to None.
on_fit_config_fn (Callable[[int], Dict[str, Scalar]], optional): Function used to configure training. Defaults to None.
on_evaluate_config_fn (Callable[[int], Dict[str, Scalar]], optional): Function used to configure validation. Defaults to None.
accept_failures (bool, optional): Whether to accept rounds containing failures. Defaults to True.
initial_parameters (Weights, optional): Initial global model parameters.
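The core of Federated Averaging is a weighted mean of client model weights. The sketch below shows that step, plus one plausible way that fraction_fit and min_fit_clients could combine to pick the number of clients per round. The way the two parameters interact is an assumption, and weights are simplified to flat lists of floats.

```python
def num_fit_clients(n_available, fraction_fit=0.1, min_fit_clients=1):
    """Clients sampled per round: a fraction of those available, with a floor."""
    return max(int(n_available * fraction_fit), min_fit_clients)


def fed_avg(client_updates):
    """Average of client weights, weighted by each client's number of examples.

    `client_updates` is a list of (weights, num_examples) pairs, where
    `weights` is a flat list of floats.
    """
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    return [
        sum(w[i] * n for w, n in client_updates) / total
        for i in range(dim)
    ]
```

For example, fed_avg([([1.0, 2.0], 10), ([3.0, 4.0], 30)]) gives more weight to the second client because it holds three times as many examples.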

FedAvgM

Federated Averaging with Momentum (FedAvgM) strategy https://arxiv.org/pdf/1909.06335.pdf

Parameters

Uses the same parameters as FedAvg as well as the following:

initial_parameters (Weights, optional): Initial global model parameters.
server_learning_rate (float): Server-side learning rate used in server-side optimization. Defaults to 1.0, which is equivalent to vanilla FedAvg.
server_momentum (float): Server-side momentum factor used for FedAvgM. Defaults to 0.0.
nesterov (bool): Enables Nesterov momentum. Defaults to False.
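The server-side momentum update can be sketched as follows. It treats the difference between the current global weights and the client average as a pseudo-gradient. This is a simplified reading of the FedAvgM paper on flat lists of floats, not the platform's implementation.

```python
def fedavgm_step(global_w, averaged_w, velocity, server_learning_rate=1.0,
                 server_momentum=0.0, nesterov=False):
    """One server-side momentum update; returns (new_global_w, new_velocity)."""
    # Pseudo-gradient: how far the client average moved from the global model.
    delta = [g - a for g, a in zip(global_w, averaged_w)]
    new_velocity = [server_momentum * v + d for v, d in zip(velocity, delta)]
    if nesterov:
        step = [d + server_momentum * v for d, v in zip(delta, new_velocity)]
    else:
        step = new_velocity
    new_w = [g - server_learning_rate * s for g, s in zip(global_w, step)]
    return new_w, new_velocity
```

With the default values (learning rate 1.0, momentum 0.0), the new global weights equal the client average, which matches plain FedAvg.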

FedAdam

Adaptive Federated Optimization using Adam (FedAdam) https://arxiv.org/abs/2003.00295

Parameters

Uses the same parameters as FedAvg as well as the following:

initial_parameters (Weights, optional): Initial global model parameters.
eta (float, optional): Server-side learning rate. Defaults to 1e-1.
beta_1 (float, optional): Momentum parameter. Defaults to 0.9.
beta_2 (float, optional): Second moment parameter. Defaults to 0.99.
tau (float, optional): Controls the degree of adaptability for the algorithm. Defaults to 1e-3.

FedAdagrad

Adaptive Federated Optimization using Adagrad (FedAdagrad) strategy https://arxiv.org/abs/2003.00295

Parameters

Uses the same parameters as FedAvg as well as the following:

initial_parameters (Weights, optional): Initial global model parameters.
eta (float, optional): Server-side learning rate. Defaults to 1e-1.
beta_1 (float, optional): Momentum parameter. Defaults to 0.0. Note that typical AdaGrad does not use momentum, so beta_1 is usually kept at 0.0.
tau (float, optional): Controls the degree of adaptability for the algorithm. Defaults to 1e-3. Smaller tau means higher degree of adaptability of server-side learning rate.

FedYogi

Federated learning strategy using Yogi on server-side https://arxiv.org/abs/2003.00295v5

Parameters

initial_parameters (Weights, optional): Initial global model parameters.
eta (float, optional): Server-side learning rate. Defaults to 1e-1.
beta_1 (float, optional): Momentum parameter. Defaults to 0.9.
beta_2 (float, optional): Second moment parameter. Defaults to 0.99.
tau (float, optional): Controls the degree of adaptability for the algorithm. Defaults to 1e-3.
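FedAdam, FedAdagrad, and FedYogi share the same server update and differ only in how the second-moment estimate is maintained. The sketch below follows the update rules from the Adaptive Federated Optimization paper on flat lists of floats. The function name and the `method` switch are illustrative, not SDK names.

```python
import math


def _sign(d):
    """Sign of d as -1, 0, or 1."""
    return (d > 0) - (d < 0)


def adaptive_server_step(global_w, averaged_w, m, v, method="adam",
                         eta=1e-1, beta_1=0.9, beta_2=0.99, tau=1e-3):
    """One FedAdagrad / FedYogi / FedAdam server update."""
    new_w, new_m, new_v = [], [], []
    for x, avg, m_i, v_i in zip(global_w, averaged_w, m, v):
        delta = avg - x  # average client movement for this coordinate
        m_i = beta_1 * m_i + (1 - beta_1) * delta
        sq = delta * delta
        if method == "adagrad":
            v_i = v_i + sq
        elif method == "yogi":
            v_i = v_i - (1 - beta_2) * sq * _sign(v_i - sq)
        else:  # "adam"
            v_i = beta_2 * v_i + (1 - beta_2) * sq
        # tau keeps the denominator away from zero and limits adaptivity.
        new_w.append(x + eta * m_i / (math.sqrt(v_i) + tau))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v
```

A smaller tau lets the per-coordinate scaling vary more, which is what the parameter description means by a higher degree of adaptability.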

Evaluation metrics

Metrics are calculated for each round of training.

When the session is complete, you can see a set of metrics for all rounds of training, as well as metrics for the final model.

Retrieve Metrics for a Session

Use the SessionMetrics class of the API to store and retrieve metrics for a session. You can retrieve the model performance metrics as a dictionary (Dict), or plot them. See the API Class Reference for details.

Typical usage example:

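from integrate_ai_sdk.api import connect  # import path is an assumption; confirm against your SDK version

client = connect("token")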

already_trained_session_id = "<sessionID>"

session = client.fl_session(already_trained_session_id)

# retrieve the metrics for the session as a dictionary
metrics = session.metrics.as_dict()

Authenticate and connect to the integrate.ai client.
Provide the ID of the session whose metrics you want to retrieve as already_trained_session_id.
Access the SessionMetrics instance through session.metrics and call as_dict() to retrieve the metrics.

Available Metrics

The Federated Loss value for the latest round of model training is reported as the global_model_federated_loss (float) attribute for an instance of SessionMetrics.

This is a model level metric reported for each round of training. It is a weighted average loss across different clients, weighted by the number of examples/samples from each silo.
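The weighting described above amounts to the following calculation. This is a minimal sketch with a hypothetical helper name and made-up numbers.

```python
def federated_loss(client_losses):
    """Average of client losses weighted by each client's number of samples.

    `client_losses` is a list of (loss, num_samples) pairs, one per silo.
    """
    total_samples = sum(n for _, n in client_losses)
    return sum(loss * n for loss, n in client_losses) / total_samples


# Two silos: a loss of 0.5 on 100 samples and a loss of 0.2 on 300 samples.
example_loss = federated_loss([(0.5, 100), (0.2, 300)])
```

The larger silo pulls the result toward its own loss, so the federated loss here is closer to 0.2 than to 0.5.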

See the metrics by machine learning task in the following table:

Classification and Logistic	Regression and Normal, Tweedie (power = 0)	Poisson, Gamma, Tweedie (power > 0), Inverse Gaussian
Loss (cross-entropy)	Loss (mean squared error)	Loss (unit deviance)
ROC_AUC	R2 score	R2 score
Accuracy
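For reference, the unit deviance named in the third column has a standard closed form for each distribution. The sketch below shows the textbook definitions for the Poisson and Gamma cases. It is an assumption that the platform computes its loss in exactly this form, and the function names are illustrative.

```python
import math


def poisson_unit_deviance(y, mu):
    """2 * (y * ln(y / mu) - (y - mu)), with the log term taken as 0 when y == 0."""
    log_term = y * math.log(y / mu) if y > 0 else 0.0
    return 2.0 * (log_term - (y - mu))


def gamma_unit_deviance(y, mu):
    """2 * ((y - mu) / mu - ln(y / mu)); requires y > 0 and mu > 0."""
    return 2.0 * ((y - mu) / mu - math.log(y / mu))


def mean_deviance_loss(targets, predictions, unit_deviance):
    """Average unit deviance over all samples."""
    pairs = list(zip(targets, predictions))
    return sum(unit_deviance(y, mu) for y, mu in pairs) / len(pairs)
```

In both cases the deviance is zero when the prediction equals the target and grows as the two diverge.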

PRIVACY & SECURITY

Differential privacy

What is differential privacy?

Differential privacy is a technique for providing a provable measure of how “private” a data set can be. This is achieved by adding a certain amount of noise when responding to a query on the data. A balance needs to be struck between adding too much noise (making the computation less useful), and too little (reducing the privacy of the underlying data).

The technique introduces the concept of a privacy-loss parameter, typically represented by ε (epsilon), which controls how much noise is added for each invocation of a computation on the data: the smaller the value of ε, the more noise is added and the stronger the privacy guarantee. A related concept is the privacy budget, which can be chosen by the data curator.

This privacy budget represents the measurable amount of information exposed by the computation before the level of privacy is deemed unacceptable.

The benefit of this approach is that by providing a quantifiable measure, there can be a guarantee about how “private” the release of information is. However, relating the actual privacy to the computation in question can be difficult: how private is private enough? What does ε need to be? These remain open questions for practitioners applying DP to an application.
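The trade-off between noise and usefulness can be seen in a small example. The sketch below applies the Laplace mechanism, a classic differential privacy technique, to a counting query. It is a generic illustration and does not describe how integrate.ai applies DP to model training.

```python
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng=random):
    """Counting query with the Laplace mechanism.

    Adding or removing one record changes a count by at most 1, so the
    noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

With epsilon = 1.0 the answer is typically off by about one, while epsilon = 0.1 gives answers that are typically off by about ten, which is more private but less useful.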

How is Differential Privacy used in integrate.ai?

Users can add Differential Privacy to any model built in integrate.ai. DP parameters can be specified during session creation, within the model configuration.

When is the right time to use Differential Privacy?

Overall, differential privacy can be best applied when there is a clear ability to select the correct ε for the computation, and/or it is acceptable to specify a large enough privacy loss budget to satisfy the computation needs.

PRL Privacy for VFL

Private record linkage is a secure method for determining the overlap between datasets.

In vertical federated learning (VFL), the datasets shared by any two parties must have some overlap and alignment in order to be used for machine learning tasks. There are typically two main problems:

The identifiers in the two datasets do not fully overlap.
The rows of the filtered, overlapped records for the datasets are not in the same order.

To resolve these differences while maintaining privacy, integrate.ai applies private record linkage (PRL), which consists of two steps: determining the overlap (or intersection) and aligning the rows.

Private Set Intersection

First, Private Set Intersection (PSI) is used to determine the overlap without storing the raw data centrally, or exposing it to any party. PSI is a privacy-preserving technique that is considered a Secure multiparty computation (SMPC) technology. This type of technology uses cryptographic techniques to enable multiple parties to perform operations on disparate datasets without revealing the underlying data to any party.

Additionally, integrate.ai does not store the raw data on a server. Instead, the parties submit the paths to their datasets. integrate.ai uses the ID columns of the datasets to produce a hash locally that is sent to the server for comparison. The secret for this hash is shared only through Diffie–Hellman key exchange between the clients - the server has no knowledge of it.
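The flow described above can be sketched in a heavily simplified form. Here a keyed hash (HMAC-SHA256) stands in for the platform's hashing scheme, and a hard-coded byte string stands in for the secret that the clients would agree on through Diffie-Hellman key exchange. Both choices are assumptions for illustration only.

```python
import hashlib
import hmac


def hash_ids(ids, shared_secret):
    """Keyed hash of each ID, computed locally by a client."""
    return [
        hmac.new(shared_secret, str(i).encode("utf-8"), hashlib.sha256).hexdigest()
        for i in ids
    ]


def server_intersection(hashes_a, hashes_b):
    """The server only sees hashes and returns the ones both clients sent."""
    return set(hashes_a) & set(hashes_b)


# Stand-in for a secret known only to the clients, never to the server.
secret = b"example-secret-known-only-to-clients"
client_a = hash_ids(["u1", "u2", "u3"], secret)
client_b = hash_ids(["u3", "u4", "u2"], secret)
common = server_intersection(client_a, client_b)
```

Because the server does not know the secret, it cannot recompute the hash of a guessed ID, yet identical IDs still produce identical hashes on both clients.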

Private Record Alignment

Once the ID columns are compared, the server knows which dataset indices are common between the two sets and can align the rows. This step is the private record alignment portion of PRL. It enables machine learning tasks to be performed on the datasets.
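The alignment step can be pictured as building pairs of row indices that refer to the same record. The sketch below assumes that each dataset's hashed IDs are unique and uses placeholder strings in place of real hashes; the function name is hypothetical.

```python
def align_rows(hashes_a, hashes_b):
    """Return row-index pairs (i, j) such that hashes_a[i] == hashes_b[j].

    Assumes hashed IDs are unique within each dataset.
    """
    position_in_b = {h: j for j, h in enumerate(hashes_b)}
    return [
        (i, position_in_b[h])
        for i, h in enumerate(hashes_a)
        if h in position_in_b
    ]


# Placeholder hashes: rows 1 and 2 of A match rows 2 and 0 of B.
pairs = align_rows(["h1", "h2", "h3"], ["h3", "h4", "h2"])
```

Each client can then reorder or filter its own rows according to these index pairs, so that row k refers to the same record on every side.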

For more information about running PRL sessions, see PRL Sessions.

Release Notes

08 February 2025: Release 9.17.0

This release introduces the following new features and improvements:

Allow customer to use an existing Azure virtual network
Allow customer to specify an arbitrary storage path for Azure task runners
Allow customer to use an Azure task runner with a different Azure subscription, in EU region
Allow customer to use a custom Azure provisioner role instead of the default Contributor role
Enable Private link functionality for Azure
Allow customer to configure an ACR for use with their Azure task runner
PCA scalability improvements (using dask backend)
Allow customer to add prior sample weights for GLM and FFNet
Added a test script for AWS task runners that can be run from the command line. Details are available here.
Updated the metrics plot in the SDK to adaptively adjust the number of rows based on the number of metrics to show
Enabled the display of parquet metadata in client log output in the UI for debugging
Added UI visualization to show the underlying task infrastructure information, which was previously only available through the AWS Console

Updating from earlier versions

Required: Update your SDK to the latest version.

Resolved Issues

Fixed an issue where the Azure client failed with no logs in UI
Passive non-intersection are now only considered when hide_intersection=True
Fixed the dataset plot of EDA in the SDK
Updated the description of attachment functionality
Fixed rate limiting for flask limitation
Fixed an issue where the server exited with code 0 while the log showed failures

18 December 2024: Release 9.16.0

This release introduces the following new features and improvements:

UI design updates for improved usability.
Added the ability to see all task runner deployment logs in the UI.
Added the ability to generate single client 2D EDA histograms (both columns in same dataset) as well as cross-client 2D EDA histograms.
Verified support for Azure private links.
Added OIDC-based SSO IDP integration. Contact your integrate.ai Customer Support Engineer for more information.

Updating from earlier versions

Required: Update your SDK to the latest version.

Known Issues

Sessions run with a VFL data config that contains predictors set to [] for the active client may fail if one or more columns are of incompatible data type (for example: id column which is of string data type). The predictor supports only float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
VFL-GBM prediction results are not stored correctly and therefore metrics cannot be retrieved for these sessions on AWS and Azure.
Test data is optional for PRL, but this needs to be consistent across all the clients. That is, if you do not specify the test data for one client, then do not specify it for any clients. Likewise, if you provide the test data for one client, you must provide it for each client joining the session.
Storage paths for task runners: only the default S3 bucket and other buckets ending in *integrate.ai are supported as storage paths for task runners. If you are not using the default bucket created by task runner when it was provisioned, ensure that your data is hosted in an S3 bucket with a URL ending in *integrate.ai. Otherwise, the data will not be accessible to the task runner.
- The bucket must also be located in the same AWS region as the task runner.
Starting many sessions simultaneously may trigger a throttling exception from AWS APIs. The error appears as: "An error occurred (ThrottlingException) when calling..." in the logs.

Resolved Issues

Resolved the issue where, on the Members > Admin page, not all users listed were Administrators.
Corrected several error messages for improved usability.

21 Oct 2024: Release 9.15.0

This release introduces the following new features and improvements:

Added support for AWS UK regions.
Added the ability to configure an AWS task runner to use a specific ECR. Contact Customer Success for details.
Added the ability to run Azure tasks in an Azure Private Network. See the documentation for details.
Added the ability for users to specify their own provisioner and runtime principal for Azure task runners.
Improved the user experience with creating an Azure task runner by minimizing the task runner permission requirements.
Added the option to install a task runner on-premises in a VM or dedicated server. See the documentation for On-prem task runner (Server) for details.
Improved log readability for ease of use.
Improved the graph legibility in the UI and corrected some metric displays.
The task type is now indicated in the task selector drop-down on the session details page.
Removed the requirement for user-specified timeouts. Timeout values are now managed by the system.
Updated the workspace homepage.
Updated the UI for the Register task runner workflow.

Migration from earlier versions

Required: Update the policies for the iai-taskrunner-provisioner and iai-taskrunner-runtime. Follow the AWS configuration for task runners documentation.
Required: You must specify the provisioner and runtime principals when creating an Azure task runner. See the documentation for details.
Recommended: Upgrading your SDK is recommended but not mandatory with this release.

Known Issues

Sessions run with a VFL data config that contains predictors set to [] for the active client may fail if one or more columns are of incompatible data type (for example: id column which is of string data type). The predictor supports only float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
VFL-GBM prediction results are not stored correctly and therefore metrics cannot be retrieved for these sessions on AWS and Azure.
Test data is optional for PRL, but this needs to be consistent across all the clients. That is, if you do not specify the test data for one client, then do not specify it for any clients. Likewise, if you provide the test data for one client, you must provide it for each client joining the session.
Storage paths for task runners: only the default S3 bucket and other buckets ending in *integrate.ai are supported as storage paths for task runners. If you are not using the default bucket created by task runner when it was provisioned, ensure that your data is hosted in an S3 bucket with a URL ending in *integrate.ai. Otherwise, the data will not be accessible to the task runner.
- The bucket must also be located in the same AWS region as the task runner.
Starting many sessions simultaneously may trigger a throttling exception from AWS APIs. The error appears as: "An error occurred (ThrottlingException) when calling..." in the logs.
On the Members > Admin page, not all users listed are Administrators. The listed user roles are correct.

Resolved Issues

Fixed the issue causing session timestamps in the UI to not adjust correctly for user timezones.
Fixed an issue that caused 1D EDA histograms to not display correctly.
Fixed an issue that allowed a user to set invalid parameters for vCPU and memory when creating an Azure task runner.
Fixed an issue that caused client tasks to be marked as succeeded when the task had an error.
Fixed an issue with the session metrics showing in the wrong format in the session details.
Corrected the permission descriptions for user roles. Note that only Administrators can see these descriptions.
Fixed an issue that caused task runner creation to fail when the AWS region us-west-1 was selected.
Added support for handling data with duplicated IDs in PRL sessions. See the documentation for details.

10 Sept 2024: Release 9.14.0

This release introduces the following new features and improvements:

Dataset connections are now synchronized. Any change you make to the dataset or the connected task runner is automatically reflected in the partner’s workspace.
HFL Scalability improvements.
2D EDA histograms are now supported at scale on Azure.
The Tweedie Distribution is now supported for both HFL and VFL GLM sessions.
The SDK is now available as a wheel (.whl) package.
Server logs are now available in the workspace UI for all sessions.
Added support for West and North Europe regions for Azure task runners.
The start time for sessions is now displayed on the Session Details page in the UI.
The name of the task runner used for a task is now displayed with the detailed session log information.
By default, all sessions except VFL/HFL GBM now use dask as the backend instead of pandas to improve scalability.

Migration from earlier versions

Update your SDK
Update the custom trust policy for the iai-taskrunner-provisioner. Follow the AWS configuration for task runners documentation.

Known Issues

If an existing VPC is required, additional information must be provided to integrate.ai for use with task runners. Contact Customer Success for more information.
Sessions run with a VFL data config that contains predictors set to [] for the active client may fail if one or more columns are of incompatible data type (for example: id column which is of string data type). The predictor supports only float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
VFL-GBM prediction results are not stored correctly and therefore metrics cannot be retrieved for these sessions on AWS and Azure.
For PRL sessions, exact matching is enabled by default. Therefore, if there are duplicate IDs in the dataset, the session will fail with an error message. Workaround: if you require fuzzy matching, set the match_threshold=0.99 when creating the session. This forces the use of fuzzy matching and gives effectively the same result as the exact matching. Note that exact matching currently requires approximately 4x less time to complete.
Test data is optional for PRL, but this needs to be consistent across all the clients. That is, if you do not specify the test data for one client, then do not specify it for any clients. Likewise, if you provide the test data for one client, you must provide it for each client joining the session.
Storage paths for task runners: only the default S3 bucket and other buckets ending in *integrate.ai are supported as storage paths for task runners. If you are not using the default bucket created by task runner when it was provisioned, ensure that your data is hosted in an S3 bucket with a URL ending in *integrate.ai. Otherwise, the data will not be accessible to the task runner.
- The bucket must also be located in the same AWS region as the task runner.
Starting many sessions simultaneously may trigger a throttling exception from AWS APIs. The error appears as: An error occurred (ThrottlingException) when calling... in the logs.
Session timestamps in the UI do not adjust correctly for user timezones. The timestamp in the database in UTC is correct.
On the Members > Admin page, not all users listed are Administrators. The listed user roles are correct.

Resolved Issues

Fixed the issue that was causing VFL with empty predictors (None) to fail
Fixed an issue where charts for VFL-GLM sessions running regression models showed num_examples as default instead of loss.
Fixed an issue with 2D histograms for EDA Intersect.
Fixed an issue that was preventing the use of client names with special characters in PRL sessions.
Fixed an issue with metrics plots for VFL training sessions.
Fixed an issue that was causing Azure logs to be truncated in the UI.
Activation emails have been updated to a new standardized layout.

14 August 2024: Release 9.13.0

This release introduces the following new features:

Users can now invite a partner to evaluate their dataset using the Dataset Connection feature.
- Datasets stored in either AWS or Azure are supported
- Any attachments on the dataset are shared
Connections are not currently synchronized. If a change is made by the owner, it is not propagated to the partner.
The Feature Importance calculation is now enabled by default.
Scalability improvements
- PRL execution time reduced
- 1D EDA Intersect memory usage and execution time reduced
- VFL-GLM memory consumption more efficient by a factor of 4x and execution time reduced by a factor of 30x
- 2D EDA Intersect now supports large datasets

Migration from earlier versions

Required changes:

Update your SDK. This is a required update for this release due to breaking changes in the SDK.
Workspaces must be updated to support connected datasets.
If you did not update your AWS configuration before or after the 9.12.0 release, follow the AWS configuration for task runners documentation to re-create the iai-taskrunner-provisioner role and iai-taskrunner-runtime role.

Update the custom trust policy for the iai-taskrunner-role.
Attach any required permission boundaries to the roles.
The documentation describes how to create the policies, custom trust policies, and roles required for integrate.ai task runners to run in your AWS environment.

If you are required to use an existing VPC or AMI, additional information must be provided to integrate.ai for use with task runners. Contact Customer Support for details.

Known Issues

Sessions run with a VFL data config that contains predictors set to [] for the active client may fail if one or more columns are of incompatible data type (for example: id column which is of string data type). The predictor supports only float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
VFL-GBM prediction results are not stored correctly and therefore metrics cannot be retrieved for these sessions on AWS and Azure.
For PRL sessions, exact matching is enabled by default. Therefore, if there are duplicate IDs in the dataset, the session will fail with an error message. Workaround: if you require fuzzy matching, set the match_threshold=0.99 when creating the session. This forces the use of fuzzy matching and gives effectively the same result as the exact matching. Note that exact matching currently requires approximately 4x less time to complete.
For PRL sessions, row_number is a reserved key in the dask backend. Therefore if "id_columns": ["row_number"] is used in a PRL session, the session fails.
Test data is optional for PRL, but this needs to be consistent across all the clients. That is, if you do not specify the test data for one client, then do not specify it for any clients. Likewise, if you provide the test data for one client, you must provide it for each client joining the session.
Storage paths for task runners: only the default S3 bucket and other buckets ending in *integrate.ai are supported as storage paths for task runners. If you are not using the default bucket created by task runner when it was provisioned, ensure that your data is hosted in an S3 bucket with a URL ending in *integrate.ai. Otherwise, the data will not be accessible to the task runner.
- The bucket must also be located in the same AWS region as the task runner.
Proxy task runner mode has been disabled in this release. This means that task runners are not proxied through a load balancer and will therefore not have a static IP address.
Starting many sessions simultaneously may trigger a throttling exception from AWS APIs. The error appears as: An error occurred (ThrottlingException) when calling... in the logs.
Session timestamps in the UI do not adjust correctly for user timezones. The timestamp in the database in UTC is correct.
On the Members > Admin page, not all users listed are Administrators. The listed user roles are correct.
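One way to catch the data type issue in the first item above before starting a session is to check column types up front. The following is a minimal sketch and not part of the integrate.ai SDK: the function name find_unsupported_columns and the column-to-dtype mapping it takes are illustrative assumptions, and the supported list is copied from the note above.

```python
# Data types the note above lists as supported for predictors.
SUPPORTED_PREDICTOR_DTYPES = frozenset({
    "float64", "float32", "float16", "complex64", "complex128",
    "int64", "int32", "int16", "int8", "uint8", "bool",
})


def find_unsupported_columns(column_dtypes):
    """Return the names of columns whose dtype is not in the supported set.

    column_dtypes maps column name -> dtype name (hypothetical input; with
    pandas it could be built as {c: str(t) for c, t in df.dtypes.items()}).
    """
    return sorted(
        name for name, dtype in column_dtypes.items()
        if dtype not in SUPPORTED_PREDICTOR_DTYPES
    )


# Example: a string id column is reported; numeric columns are not.
example = {"id": "object", "age": "int64", "income": "float32"}
unsupported = find_unsupported_columns(example)
print(unsupported)
```

If any columns are reported, one option may be to name the predictors explicitly instead of leaving predictors set to [].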

11 July 2024: Release 9.12.0

This release introduces the following new features:

You can now invite a partner to evaluate your dataset using the Dataset Connection feature.
- Currently, only datasets stored in AWS are available for Dataset Connection
- Attachments are currently not shared
Overall improvements to session and model training to handle larger datasets
- Can support PRL sessions with 20M rows and 35% overlap with 40 features
- PRL and VFL in Azure are supported only for small datasets (pandas only)

Migration from earlier versions

Required changes:

Update your SDK. This is a required update for this release due to breaking changes in the SDK.
Update the custom trust policy for the iai-taskrunner-role.
If you did not update your AWS configuration before or after the 9.11.0 release, follow the AWS configuration for task runners documentation to re-create the iai-taskrunner-provisioner role and iai-taskrunner-runtime role.

Attach any required permission boundaries to the roles
The documentation describes how to create the policies, custom trust policies, and roles required for integrate.ai task runners to run in your AWS environment.

As in the previous release, if an existing VPC is required, additional information must be provided to integrate.ai for use with task runners.

Notice of Upcoming Changes:

The ability to specify VPC information through the UI will be added in a future release.

Known issues

For PRL sessions, exact matching is enabled by default. Therefore, if there are duplicate IDs in the dataset, the session will fail with the error message: CLKs must be unique for exact matching. Workaround: if you require fuzzy matching, set match_threshold=0.99 when creating the session. This forces the use of fuzzy matching and gives effectively the same result as exact matching. Note that exact matching currently completes approximately 4x faster than fuzzy matching.
For PRL sessions, row_number is a reserved key in the dask backend. Therefore if "id_columns": ["row_number"] is used in a PRL session, the session fails.
EDA in Intersect mode: The hide_intersection functionality has been disabled in this release. The hide_intersection option has been set to False by default.
Storage paths for task runners: only the default S3 bucket and other buckets ending in *integrate.ai are supported as storage paths for task runners. If you are not using the default bucket created by the task runner when it was provisioned, ensure that your data is hosted in an S3 bucket with a URL ending in *integrate.ai. Otherwise, the data will not be accessible to the task runner.
VFL GBM prediction results are not stored correctly and therefore metrics cannot be retrieved for these sessions.
On the Members > Admin page, not all users listed are Administrators. The listed user roles are correct.
Proxy task runner mode has been disabled in this release. This means that task runners are not proxied through a load balancer and will therefore not have a static IP address.
Starting many sessions simultaneously may trigger a throttling exception from AWS APIs. The error appears as: An error occurred (ThrottlingException) when calling... in the logs.
Custom model package names may not contain uppercase letters or special characters other than underscore.
Test data is optional for PRL, but this needs to be consistent across all the clients. That is, if you do not specify the test data for one client, then do not specify it for any clients. Likewise, if you provide the test data for one client, you must provide it for each client joining the session.
Session timestamps in the UI do not adjust correctly for user timezones. The timestamp in the database in UTC is correct.
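Because exact matching in PRL sessions fails when IDs are duplicated, it can help to look for duplicates before creating a session. The sketch below is illustrative only and is not an SDK function: find_duplicate_ids is a hypothetical helper that works on any list of ID values read from the dataset.

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return the ID values that occur more than once, in first-seen order."""
    counts = Counter(ids)
    return [value for value, n in counts.items() if n > 1]


ids = ["a1", "b2", "a1", "c3", "b2"]
duplicates = find_duplicate_ids(ids)
print(duplicates)
```

If the list is not empty, either deduplicate the IDs or apply the match_threshold=0.99 workaround described above.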

14 June 2024: Release 9.11.0

This release introduces the following new features:

Revised task runner provisioner and runtime roles/policies for use in enterprise environments.
Ability to use an existing customer VPC when necessary.
Ability to attach Python notebooks (*.ipynb) to datasets and to download them.

Migration from earlier versions

Required changes:

Follow the AWS configuration for task runners documentation to re-create the iai-taskrunner-provisioner role and iai-taskrunner-runtime role.

Attach any required permission boundaries to the roles
The documentation describes how to create the policies, custom trust policies, and roles required for integrate.ai task runners to run in your AWS environment.

With this release, if an existing VPC is required, additional information must be provided to integrate.ai for use with task runners.

VPC ID
Public subnet IDs
Private subnet IDs
Name of the task runner to create
AWS Region for the task runner
Required memory and vCPU values

Notice of Upcoming Changes:

Release 9.12.0 will also include an update to the custom trust policy for the iai-taskrunner-role. Customers will be required to update the role in their AWS account.
The ability to specify VPC information through the UI will be added in a future release.

Known issues:

EDA in Intersect mode: The hide_intersection functionality has been disabled in this release. The hide_intersection option has been set to False by default. Contact your integrate.ai CSE for more information.
For PRL sessions, row_number is a reserved key in the dask backend. Therefore if "id_columns": ["row_number"] is used in a PRL session, the session fails.
Storage paths for task runners: only the default S3 bucket and other buckets ending in *integrate.ai are supported as storage paths for task runners. If you are not using the default bucket created by the task runner when it was provisioned, ensure that your data is hosted in an S3 bucket with a URL ending in *integrate.ai. Otherwise, the data will not be accessible to the task runner.
In some instances, task logs may not appear in the UI. The logs exist but are not displayed.
VFL GBM prediction results are not stored correctly and therefore metrics cannot be retrieved for these sessions.
On the Members > Admin page, not all users listed are Administrators. The listed user roles are correct.
Proxy task runner mode has been disabled in this release. This means that task runners are not proxied through a load balancer and will therefore not have a static IP address. The documentation for how to whitelist a task runner has been removed.
Starting many sessions simultaneously may trigger a throttling exception from AWS APIs. The error appears as: An error occurred (ThrottlingException) when calling... in the logs.
Custom model package names may not contain uppercase letters or special characters other than underscore.
Test data is optional for PRL, but this needs to be consistent across all the clients. That is, if you do not specify the test data for one client, then do not specify it for any clients. Likewise, if you provide the test data for one client, you must provide it for each client joining the session.
Session timestamps in the UI do not adjust correctly for user timezones. The timestamp in the database in UTC is correct.
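Since the throttling exception noted above is triggered by starting many sessions at the same moment, one possible mitigation is to spread the start calls out over time. The following is a minimal sketch under stated assumptions: start_staggered and the starter callables are hypothetical, each starter would wrap whatever call starts a single session, and a suitable delay is not specified by integrate.ai.

```python
import time


def start_staggered(starters, delay_seconds=5.0, sleep=time.sleep):
    """Call each starter in order, pausing between calls.

    starters is a list of zero-argument callables (hypothetical; each one
    would wrap the call that starts a single session).
    Returns the list of results in the same order.
    """
    results = []
    for index, start in enumerate(starters):
        if index > 0:
            sleep(delay_seconds)  # Spread the calls out over time.
        results.append(start())
    return results
```

The sleep parameter is injectable so that the pause can be replaced in tests or swapped for a different waiting strategy.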

Fixes:

With certain datasets, the feature importance evaluation failed due to a casting error.
In some instances, task logs did not appear in the UI even though the logs existed.
Dataset descriptions did not maintain new lines.
Error logs with request IDs did not appear in CloudWatch/Elasticsearch.

15 May 2024: Release 9.10.0

This release introduces the following new features:

Ability to switch to a dask backend for improved speed when processing large datasets in PRL sessions (tested with up to 13M rows). A description of how to enable this mode and what configuration settings are available is provided in the documentation here.
You can now use a registered dataset name in any of the following session types instead of specifying file name and path:
- EDA
- PRL
- VFL (SplitNN, GLM, GBM) train and predict
- HFL
The Library and dataset details pages in the UI have been updated. You can now see the results of the latest EDA sessions run with each dataset in the UI.
You can now see the version of the product and the date it was released in the navigation bar on the Settings page in the workspace UI.
Documentation has been added for the Data Valuation features, Dataset Influence and Feature Importance.

Known issues:

EDA in Intersect mode: The hide_intersection functionality has been disabled in this release. The hide_intersection option has been set to False by default.
For PRL sessions, row_number is a reserved key in the dask backend. Therefore if "id_columns": ["row_number"] is used in a PRL session, the session fails.
Storage paths for task runners: only the default S3 bucket and other buckets ending in *integrate.ai are supported as storage paths for task runners. If you are not using the default bucket created by the task runner when it was provisioned, ensure that your data is hosted in an S3 bucket with a URL ending in *integrate.ai. Otherwise, the data will not be accessible to the task runner.
In some instances, task logs may not appear in the UI. The logs exist but are not displayed.
VFL GBM prediction results are not stored correctly and therefore metrics cannot be retrieved for these sessions.
On the Members > Admin page, not all users listed are Administrators. The listed user roles are correct.
Proxy task runner mode has been disabled in this release. This means that task runners are not proxied through a load balancer and will therefore not have a static IP address. The documentation for how to whitelist a task runner has been removed.
Starting many sessions simultaneously may trigger a throttling exception from AWS APIs. The error appears as: An error occurred (ThrottlingException) when calling... in the logs.
Custom model package names may not contain uppercase letters or special characters other than underscore.
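The naming restriction on custom model packages can be checked before upload. This is a sketch only: is_valid_package_name is a hypothetical helper, and allowing digits is an assumption, since the note above only rules out uppercase letters and special characters other than underscore.

```python
import re

# Assumption: lowercase letters, digits, and underscores are allowed.
# The note only rules out uppercase letters and other special characters.
_PACKAGE_NAME = re.compile(r"[a-z0-9_]+")


def is_valid_package_name(name):
    """Return True if name has only lowercase letters, digits, and underscores."""
    return _PACKAGE_NAME.fullmatch(name) is not None


for candidate in ["my_model", "MyModel", "my-model"]:
    print(candidate, is_valid_package_name(candidate))
```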

Fixes:

Dataset names can now include spaces and special characters.
The output of a VFL prediction session is now saved under the folder named by the predict session ID itself instead of under the training session ID.
Fixed an issue that was causing the Terms of Service to be blank/empty.
The Credit Usage system has been deprecated and removed from the product.
Adding attachments to a dataset is now fully supported.
When deleting a task runner, the task runner no longer appears and disappears from the list in the UI more than once before the deletion is complete.

21 April 2024: Release 9.9.0

This release introduces the following new features:

As a Data Custodian user who has registered datasets, you can now see sessions that have used your datasets.
Session and task runner detail pages have been expanded to full page width for better readability.
A record of deleted task runners is now logged in CloudWatch for auditing purposes.
Session and task logs are now accessible through the session details page in the UI.
Python 3.9 is now the minimum supported version for use with the SDK.
The Credit usage tracking system is being deprecated. Related UI components will be removed in the next release.
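For scripts that depend on the SDK, the Python 3.9 minimum mentioned above can be checked at startup. This is a minimal sketch; python_is_supported is a hypothetical helper and is not provided by the SDK.

```python
import sys

MINIMUM_PYTHON = (3, 9)  # Minimum version stated in the release note above.


def python_is_supported(version_info=sys.version_info):
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= MINIMUM_PYTHON


if not python_is_supported():
    print("Python 3.9 or later is required for the SDK.")
```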

Known issues

Adding attachments to a dataset is not fully supported.
- When updating a dataset, existing attachments are not displayed on the Update page.
- Attempting to add a new attachment removes all existing attachments.
When deleting a task runner, the task runner may appear and disappear from the list in the UI more than once before the deletion is complete. A task runner is fully deleted after the destroying step has been completed and the entry is permanently removed from the list.
On the Members > Admin page, not all users listed are Administrators. The listed user roles are correct.

Fixes:

Logging has been improved across the platform.
A number of inaccuracies in the documentation have been corrected.
The allow/disallow Custom Models toggle now works correctly when set by a Data Custodian user.

19 Mar 2024: Enterprise-Networking Ready Task Runners

This release introduces the following new features:

Task runners are now proxied through a load balancer with a fixed IP address to enable customers to allow ingress/egress through firewalls
Added dataset registration for HFL datasets
Removed the requirement to agree to the Terms of Service in the UI before being able to use the product
Added an embedded SDK package for use in enterprise environments
Enabled debug log levels in the task logs for improved troubleshooting

Known issues:

You must update existing task runners after this release is deployed. Click the Edit button for a task runner and make any change, such as increasing the amount of memory, then click Save. This will force the task runner to update and pick up the latest changes.
Dataset registration for EDA and HFL sessions only
- VFL sessions will be added in a future release.
- This means that the VFL sessions do not support using only a registered dataset name. You must provide both a name and path for each dataset.
Renewal term for credits may appear to be in the past when in Discovery mode

Fixes:

Fixed the region limitation for task runners
Fixed the state issues with destroyed task runners so that they are properly removed from the system and the UI
Fixed an issue with SplitNN models that occurred when more than 2 clients joined the session with hide_intersection set to True.
Fixed the VFL prediction download paths
Standardized type faces and other UI improvements

12 Feb 2024: Role-based Access Control (RBAC)

This release introduces the following new features:

Role-based access control with three new built-in roles:
- Administrator - responsible for all aspects of managing the workspace
- Model builder - responsible for running sessions and analyzing results
- Data custodian - responsible for registering and maintaining dataset listings
Added task level logs to the session details in the UI for ease of troubleshooting
Added the version number of the release (and therefore the task runners/client/server) on the Session Details page
Added example URLs in the UI to improve usability

Known issues:

Task runners can only be created in the ca-central-1 region of AWS.
- Will be fixed in next release
Dataset registration for EDA sessions only
- HFL sessions will be added in next release
Destroyed task runners remain in the list in the UI
- Will be fixed in next release
Renewal term for credits may appear to be in the past when in Discovery mode

Fixes:

Fixed the GRPC timeout errors
Fixed VFL-GLM for linear regression
Corrected the runtime role policy for task runners
Numerous UI improvements and corrections, including modal behaviour

19 Dec 2023: PCA, VFL-GBM, and Dataset Registration

This release provides several new features:

Principal Component Analysis (PCA) - you can now run PCA sessions using integrate.ai. An overview is available here.
VFL-GBM - you can now run Gradient Boosted Modeling in VFL sessions.
Dataset registration for EDA - You can now register a dataset with a task runner for exploratory data analysis. This simplifies the use of datasets as you no longer need to specify the path to the file in the task group configuration. Instead, you can provide only the registered dataset name, and the task runner will locate the dataset. An overview is available here for AWS and here for Azure.

Version: 9.6.6

28 Aug 2023: Session Credit Usage

This release provides users with the ability to see their credit usage in their workspace UI. Each training or analytic session uses a certain number of credits from the user's allotment. This usage can now be monitored through a graph, with daily details. Users can also request additional credit when needed.

Version: 9.6.2

14 Aug 2023: Azure Task Runners

This release expands the Control Plane system architecture to include Microsoft Azure Task Runners.

Task runners simplify the process of creating an environment to run federated learning tasks. They use the serverless capabilities of cloud environments, which greatly reduces the administration burden and ensures that resource cost is only incurred while task runners are in use.

For more information about task runners and control plane capabilities, see Using integrate.ai.

A tutorial for using Azure task runners is available here.

Version: 9.6.1

14 July 2023: AWS Task Runners

This release introduces the Control Plane system architecture and AWS Task Runners.

Task runners simplify the process of creating an environment to run federated learning tasks. They use the serverless capabilities of cloud environments (such as AWS Batch and Fargate), which greatly reduces the administration burden and ensures that resource cost is only incurred while task runners are in use.

For more information about task runners and control plane capabilities, see Using integrate.ai.

Version: 9.6.0

17 May 2023: 2D Histograms for EDA Intersect & Linear Inference

This release introduces two new features:

The ability to generate 2D histograms for EDA sessions in Intersect mode. This feature requires the addition of a paired_cols parameter. For more information, see the Intersect Mode tutorial.
A new model package for linear inference. This package is particularly useful for GWAS training. For more information, see Linear Inference Sessions.

New in this release is also the addition of a single release version number to describe the release package. This release is version: 9.5.0.

27 April 2023: PRL, VFL, and EDA Intersect

This release introduces the following new features:

The ability to perform private record linkage (PRL) on two datasets. A guide is available here.
The ability to perform exploratory data analysis (EDA) in Intersect mode, using a PRL session result. A guide is available here.
The ability to perform Vertical Federated Learning (VFL) in both training and prediction mode. A guide is available here.

Note: this release does not support Python 3.11 due to a known issue in that Python release.

Versions:

SDK: 0.9.2
Client: 2.4.2
Server: 2.6.0
CLI Tool: 0.0.46

30 Jan 2023: User Authentication

This release added two new features:

The ability to train Gradient Boosted HFL Models. A guide is available here.
The ability to generate scoped (non-admin) tokens for users. A guide is available here.

Bug fixes:

Fixed an issue where clients could be disconnected from the server when training large models

Versions:

SDK: 0.5.36
Client: 2.0.18
Server: 2.2.19
CLI Tool: 0.0.38

08 Dec 2022: Integration with AWS Fargate

This release introduces the ability to run an IAI training server on AWS Fargate through the integrate.ai SDK. With an integrate.ai training server running on Fargate, your data in S3 buckets, and clients running on AWS Batch, you can use the SDK to manage and run fully remote training sessions. A guide is available here.

Versions:

SDK: 0.5.13
Client: 2.0.11
CLI Tool: 0.0.33

02 Nov 2022: Integration with AWS Batch

This release introduces the ability to run AWS Batch jobs through the integrate.ai SDK. Building on the previous release, which added support for data hosted in S3 buckets, you can now start and run a remote training session on remote data with jobs managed by AWS Batch. A guide is available here.

Features:

Added the ability to run the iai client through AWS Batch
Added the ability for the iai client to retrieve a token through the IAI_TOKEN environment variable
Added a version command for the iai client: iai client version

Note: Docker must be running for the version command to return a value.

Added support for a new session "pending" status
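When the iai client is launched from a script, the IAI_TOKEN variable mentioned above can be placed in the child process environment. The sketch below is illustrative: env_with_token is a hypothetical helper, the token value is a placeholder, and in AWS Batch the variable would more likely be set in the job configuration.

```python
import os


def env_with_token(token, base_env=None):
    """Return a copy of the environment with IAI_TOKEN set to token."""
    if not token:
        raise ValueError("token must be a non-empty string")
    env = dict(os.environ if base_env is None else base_env)
    env["IAI_TOKEN"] = token
    return env


# Placeholder token for illustration only; pass the result as the env argument
# of subprocess.run when launching the iai client from a script.
env = env_with_token("<your-token>")
print("IAI_TOKEN" in env)
```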

BREAKING CHANGE:

Session status mapping in the SDK has been updated as follows:

created -> Created
started -> Running
pending -> Pending
failed -> Failed
succeeded -> Completed
canceled -> Canceled
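Code that compares session status against the old strings needs to be updated after this change. As a minimal sketch, the mapping above can be expressed as a lookup table; STATUS_MAP and to_new_status are hypothetical names and are not part of the SDK.

```python
# Mapping from the previous SDK status values to the new ones,
# copied from the list above.
STATUS_MAP = {
    "created": "Created",
    "started": "Running",
    "pending": "Pending",
    "failed": "Failed",
    "succeeded": "Completed",
    "canceled": "Canceled",
}


def to_new_status(old_status):
    """Translate an old status string; unknown values are returned unchanged."""
    return STATUS_MAP.get(old_status, old_status)
```

Note that two values change by more than capitalization: started becomes Running and succeeded becomes Completed.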

Bug fixes:

Fixed an issue with small batch sizes

Versions:

SDK: 0.3.31
Client: 1.0.15
CLI Tool: 0.0.31

06 Oct 2022: Exploratory Data Analysis & S3 Support

Features:

Exploratory Data Analysis (EDA) - integrate.ai now supports the ability to generate histograms for each feature of a dataset. Use the results of the EDA session to calculate summary statistics for both continuous and categorical variables. See more about the feature here.
This feature has Differential Privacy applied automatically to each histogram to add noise and reduce privacy leakage. The Differential Privacy settings are dynamic and applied to best suit each dataset individually, to ensure privacy protection without excessive noise.
S3 data path support - load data from an S3 bucket for the iai client hfl and iai client eda commands. You can use S3 URLs as the data_path provided that your AWS CLI environment is properly configured. Read more on how to configure this integration here.
Client logging via iai client log command - this new feature in the integrate-ai CLI package allows a user to access session logs from clients, to be used as a tool to help debug failed sessions. Access this using the iai client log command.

Versions:

SDK: 0.3.20
Client: 1.0.8
CLI Tool: 0.0.21

14 Sept 2022: Infrastructure upgrades for session abstraction

SDK Version: 0.3.5
Client Version: 1.0.2
CLI Tool Version: 0.0.21

API Documentation

The latest integrate.ai API documentation is always available online:

API Documentation