Private Record Linkage (PRL) sessions¶
Private record linkage sessions create intersection and alignment among datasets to prepare them for vertical federated learning (VFL).
In a vertical federated learning process, two or more parties collaboratively train a model using datasets that share a set of overlapping features. These datasets generally each contain distinct data with some overlap. This overlap is used to define the intersection of the sets. Private record linkage (PRL) uses the intersection to create alignment between the sets so that a shared model can be trained.
Overlapping records are determined privately through a PRL session, which combines Private Set Intersection with Private Record Alignment.
For example, in data sharing between a hospital (party B, the Active party) and a medical imaging centre (party A, the Passive party), only a subset of the hospital patients will exist in the imaging centre’s data. The hospital can run a PRL session to determine the target subset for model training.
PRL Session Overview¶
In PRL, two parties submit paths to their datasets so that they can be aligned to perform a machine learning task.
ID columns (id_columns) are used to produce a hash that is sent to the server for comparison. The secret for this hash is shared between the clients and the server has no knowledge of it. This comparison is the Private Set Intersection (PSI) part of PRL.
Once compared, the server orchestrates the data alignment because it knows which indices of each dataset are in common. This is the Private Record Alignment (PRA) part of PRL.
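Conceptually, the PSI step can be sketched as follows. This is an illustrative simplification only, not the actual PRL protocol: production PSI relies on stronger cryptographic constructions, but a keyed hash captures the core idea that the server compares opaque values and learns which row indices match, without ever seeing raw IDs or the clients' shared secret. All names and values below are made up for illustration.

```python
import hashlib
import hmac

def hash_ids(ids, shared_secret):
    """Map each keyed-hash digest back to the row index it came from (illustrative)."""
    return {
        hmac.new(shared_secret, str(v).encode(), hashlib.sha256).hexdigest(): idx
        for idx, v in enumerate(ids)
    }

# Both clients hash their ID columns with the same secret, known only to them.
secret = b"shared-between-clients-only"
active = hash_ids(["a1", "a2", "a3", "a4"], secret)
passive = hash_ids(["a2", "a4", "a5"], secret)

# The server sees only the digests, so it can intersect them
# without learning the secret or the raw IDs.
common_hashes = active.keys() & passive.keys()
active_indices = sorted(active[h] for h in common_hashes)    # [1, 3]
passive_indices = sorted(passive[h] for h in common_hashes)  # [0, 1]
```

Knowing the matching indices on each side is exactly what the server needs to orchestrate the alignment step.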
For information about privacy when performing PRL, see PRL Privacy for VFL.
Configure PRL Session¶
This example uses an AWS task runner to run the session. Ensure that you have installed the SDK locally, created a task runner, and registered a dataset. Use the integrateai_taskrunner_AWS.ipynb notebook to follow along and test the examples shown below by filling in your own variables.
To create the session, specify the data_config that contains the client names and the columns to use as identifiers to link the datasets. For example: prl_data_config.
Important: The client names are used for compute in the PRL session and in any sessions that use the PRL session downstream. For example, when running EDA in Intersect mode or any VFL session, the client names must match those specified for the PRL session.
PRL data config
Specify a prl_data_config
that indicates the columns to use as identifiers when linking the datasets to each other. The number of items in the config specifies the number of expected clients. In this example, there are two items and therefore two clients submitting data. Their datasets are linked by the “id” column in any provided datasets.
Working with data that contains duplicate IDs
To enable support for data with duplicated IDs, specify "is_id_unique" for each dataset (client) in the data config, as shown in the example. By default, is_id_unique is set to True for all datasets.
# Example data config
prl_data_config = {
    "clients": {
        "active_client": {
            "id_columns": ["id"],
            "is_id_unique": True,  # by default, IDs are assumed to be unique
        },
        "passive_client": {
            "id_columns": ["id"],
            "is_id_unique": False,  # set to False if the dataset contains duplicated IDs
        },
    }
}
Create the PRL session¶
# Create session example
prl_session = client.create_prl_session(
name="Testing notebook - PRL",
description="I am testing PRL session creation with a task runner through a notebook",
data_config=prl_data_config
).start()
prl_session.id
Create a task builder and task group for PRL¶
Create a task group with one task for each client joining the session. In this example, the datasets are registered and referenced by name, instead of by specifying a name and path.
task_group = (
    SessionTaskGroup(prl_session)
    .add_task(iai_tb_aws.prl(train_dataset_name="active_train", test_dataset_name="active_test", client_name="active_client"))
    .add_task(iai_tb_aws.prl(train_dataset_name="passive_train", test_dataset_name="passive_test", client_name="passive_client"))
)
task_group_context = task_group.start()
Monitor submitted jobs¶
Next, you can check the status of the tasks.
for i in task_group_context.contexts.values():
print(json.dumps(i.status(), indent=4))
task_group_context.monitor_task_logs()
Submitted tasks are in the pending state until the clients join and the session is started. Once started, the status changes to running.
# Wait up to 5 minutes, polling every 2 seconds, for the tasks to complete (success = True)
task_group_context.wait(60*5, 2)
View the overlap statistics¶
When the session is complete, you can view the overlap statistics for the datasets.
metrics = prl_session.metrics().as_dict()
metrics
Example result:
{'session_id': '07d0f8358d',
'federated_metrics': [],
'client_metrics': {'passive_client': {'train': {'n_records': 14400,
'n_overlapped_records': 12963,
'frac_overlapped': 0.9},
'test': {'n_records': 3600,
'n_overlapped_records': 3245,
'frac_overlapped': 0.9}},
'active_client': {'train': {'n_records': 14400,
'n_overlapped_records': 12963,
'frac_overlapped': 0.9},
'test': {'n_records': 3600,
'n_overlapped_records': 3245,
'frac_overlapped': 0.9}}}}
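For a quick per-client view of the overlap, a small helper (hypothetical, not part of the SDK) can walk the client_metrics dictionary shown above and recompute the overlap fraction from the raw counts:

```python
# Illustrative helper: summarize overlap fractions from the metrics dict.
# The metrics structure below mirrors the example output above.
def overlap_summary(metrics):
    summary = {}
    for client, splits in metrics["client_metrics"].items():
        for split, stats in splits.items():
            summary[f"{client}/{split}"] = (
                stats["n_overlapped_records"] / stats["n_records"]
            )
    return summary

example = {
    "client_metrics": {
        "active_client": {
            "train": {"n_records": 14400, "n_overlapped_records": 12963},
            "test": {"n_records": 3600, "n_overlapped_records": 3245},
        }
    }
}
overlap_summary(example)
# {'active_client/train': 0.9002083333333333, 'active_client/test': 0.9013888888888889}
```

A low overlap fraction is worth investigating before training, since only overlapped records contribute to the VFL model.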
To run a VFL training session on the linked datasets, see VFL SplitNN Model Training.
To perform exploratory data analysis on the intersection, see EDA Intersect.