EDA in Intersect Mode¶
The Exploratory Data Analysis (EDA) Intersect mode enables you to access summary statistics about the intersection of a group of datasets without needing access to the data itself. It is used primarily with VFL use cases to understand statistics about the overlapped records. For example, you may be interested in “what is the distribution of age among the intersection” (the intersection mode), which could be a very different answer from “what is the distribution of age among the whole population” (the individual mode).
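The difference between the two modes can be illustrated with a toy example. The sketch below uses hypothetical record IDs and ages held in plain Python dictionaries; it only shows the concept, since the real session never exposes raw records in this way.

```python
from statistics import mean

# Hypothetical toy data: record ID -> age held by one client.
client_a_ages = {"id1": 25, "id2": 40, "id3": 61, "id4": 33}
# Hypothetical record IDs that a second client also holds.
client_b_ids = {"id2", "id3", "id5"}

# Individual mode: statistics over everything one client holds.
individual_mean = mean(client_a_ages.values())

# Intersect mode: statistics over records present in both datasets only.
overlap_ids = client_a_ages.keys() & client_b_ids
intersect_mean = mean(client_a_ages[i] for i in overlap_ids)

print(f"individual: {individual_mean}, intersect: {intersect_mean}")
```

Here the overlapping records are older on average than the full population of the first client, so the two modes give different answers to the same question.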
With EDA Intersect, you can:
obtain descriptive statistics, as in EDA Individual mode, but on the intersection (i.e., overlapping indices)
use bivariate operations such as groupby or correlation
compute the correlation matrix
plot a 2D histogram based on the joint distribution of two features from different clients
EDA is an important preliminary step for federated modelling and a simple form of federated analytics. This feature has a built-in mechanism to achieve differential privacy: calibrated noise is dynamically added to each histogram that is generated for each feature in a participating dataset. This adds extra protection for the raw data by making the final results differentially private, but it also means that the final results deviate slightly from the true values.
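To give a feel for how noise on histogram counts works, here is a minimal sketch that uses the Laplace mechanism. This is an assumption made for illustration: the documentation does not state which noise mechanism or privacy budget the product uses, and the function names, the `epsilon` value, and the counts below are all hypothetical.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def privatize_histogram(counts, epsilon, rng):
    """Add Laplace noise with scale 1/epsilon to every bin count."""
    # Sensitivity 1: adding or removing one record changes one bin by at most 1.
    scale = 1.0 / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]

rng = random.Random(0)
true_counts = [12, 30, 45, 9]  # hypothetical age histogram
noisy_counts = privatize_histogram(true_counts, epsilon=1.0, rng=rng)
print(noisy_counts)
```

A smaller `epsilon` gives stronger privacy and larger noise, which is why statistics derived from the noisy histograms differ slightly from the true values.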
API Reference
The core API module that contains the EDA-specific functionality is integrate_ai_sdk.api.eda. This module includes a core object called EdaResults, which contains the results of an EDA session. If you are comfortable working with the integrate.ai SDK and API, see the API Documentation for details.
Configure an EDA Intersect Session¶
This example uses an AWS task runner to run the session using data in S3 buckets. Ensure that you have completed the AWS configuration for task runners and that a task runner exists. See Register an AWS task runner for details.
To begin exploratory data analysis, you must first create a session. To configure the session, specify the following:
EDA data configuration (eda_data_config)
prl_session_id for the PRL session you want to work with
The eda_data_config specifies the names of the datasets used to generate the intersection in the PRL session, in the format dataset_name : columns. If columns is empty ([]), then EDA is performed on all columns.
Example data paths in S3
aws_taskrunner_profile = "<myworkspace>" # This is your workspace name
aws_taskrunner_name = "<mytaskrunner>" # Task runner name - must match what was supplied in UI to create task runner
base_aws_bucket = f'{aws_taskrunner_profile}-{aws_taskrunner_name}.integrate.ai'
active_train_path = f's3://{base_aws_bucket}/vfl/active_train.parquet'
active_test_path = f's3://{base_aws_bucket}/vfl/active_test.parquet'
passive_train_path = f's3://{base_aws_bucket}/vfl/passive_train.parquet'
passive_test_path = f's3://{base_aws_bucket}/vfl/passive_test.parquet'
#Where to store the trained model
aws_storage_path = f's3://{base_aws_bucket}/model'
eda_data_config = {
    "active": {"columns": []},
    "passive": {"columns": []},
}
prl_session_id = "<PRL_SESSION_ID>"
You must also specify the session ID of a previous successful PRL session that used the same datasets listed in the eda_data_config.
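To restrict the analysis to specific columns, list them instead of leaving columns empty. The sketch below builds a configuration of that shape with a small helper; `build_eda_data_config` is a hypothetical convenience function written for this example, not part of the SDK, and the column names are placeholders.

```python
def build_eda_data_config(columns_by_dataset):
    """Map each dataset name to {"columns": [...]}; an empty list means all columns."""
    return {name: {"columns": list(cols)} for name, cols in columns_by_dataset.items()}

example_config = build_eda_data_config({
    "active": ["x0", "x2"],  # analyze only these columns
    "passive": [],           # analyze every column
})
print(example_config)
```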
Paired Columns¶
To find the correlation (or apply any other binary operation) between two specific columns from two datasets, you must specify those columns as paired columns.
To set which pairs you are interested in, specify their names in a dictionary structured like eda_data_config.
single_client_2d_pairs = {"active_client": ['x0', 'x2'], "passive_client": ['x1']}
cross_client_2d_pairs = {"active_client": ['x0', 'x2'], "passive_client": ['x1']}
For example: {"passive_client": ['x1', 'x5'], "active_client": ['x0', 'x2']} will generate 2D histograms for these pairs of columns:
(x0, x1), (x0, x5), (x2, x1), (x2, x5), (x0, x2), (x1, x5)
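The list of pairs above can be reproduced with a short enumeration: every cross-dataset combination, followed by every pair of columns within the same dataset. This is only a sketch of the combinatorics; `enumerate_2d_pairs` is a hypothetical helper, and the SDK's own logic and ordering may differ.

```python
from itertools import combinations, product

def enumerate_2d_pairs(pairs_config):
    """List cross-dataset pairs first, then pairs within each dataset."""
    datasets = list(pairs_config.values())
    pairs = []
    # One column from each of two different datasets.
    for left, right in combinations(datasets, 2):
        pairs.extend(product(left, right))
    # Two distinct columns from the same dataset.
    for cols in datasets:
        pairs.extend(combinations(cols, 2))
    return pairs

config = {"active_client": ["x0", "x2"], "passive_client": ["x1", "x5"]}
all_pairs = enumerate_2d_pairs(config)
print(all_pairs)
```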
Create an EDA intersect session¶
The code sample demonstrates creating and starting an EDA session to perform privacy-preserving data analysis on the intersection of two distinct datasets. It returns an EDA session ID that you can use to track and reference your session.
Create the EDA intersect session
eda_session = client.create_eda_session(
name="EDA Intersect Session",
description="Testing EDA Intersect mode through a notebook",
data_config=eda_data_config,
eda_mode="intersect", #Generates histograms on an overlap of two distinct datasets
prl_session_id=prl_session_id, #Required for intersect mode
single_client_2d_pairs = None, #Optional - only required to generate single node 2D histograms (both columns in same dataset)
cross_client_2d_pairs = None #Optional - only required to generate 2D histograms (between two datasets)
).start()
eda_session.id
Session parameters:
eda_mode: One of {‘individual’, ‘intersect’}. Defaults to ‘individual’.
prl_session_id (str, optional): Session ID of the associated PRL session. Required for eda_mode = ‘intersect’.
single_client_2d_pairs (Dict, optional): A data_config-like dict with the column names to use to generate 2D histograms when both columns belong to the same dataset. This option only considers single-node 2D histograms and is valid for both ‘intersect’ and ‘individual’ mode. Defaults to None.
cross_client_2d_pairs (Dict, optional): A data_config-like dict with the column names to use to generate cross-dataset 2D histograms. This option only works in ‘intersect’ mode. Defaults to None.
Note that, unlike other sessions, you do not need to specify a model_config file for EDA. This is because there are no editable parameters.
For more information, see the create_eda_session() definition in the API documentation.
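The parameter rules listed above can be checked locally before you create a session. The function below is a hypothetical pre-flight check written for this example; it is not part of the SDK and only encodes the constraints stated in the parameter descriptions.

```python
def check_eda_session_args(eda_mode="individual", prl_session_id=None,
                           single_client_2d_pairs=None, cross_client_2d_pairs=None):
    """Raise ValueError if the arguments contradict the documented rules."""
    if eda_mode not in {"individual", "intersect"}:
        raise ValueError("eda_mode must be 'individual' or 'intersect'")
    if eda_mode == "intersect" and not prl_session_id:
        raise ValueError("prl_session_id is required for intersect mode")
    if eda_mode != "intersect" and cross_client_2d_pairs is not None:
        raise ValueError("cross_client_2d_pairs only works in intersect mode")
    # single_client_2d_pairs is valid in both modes, so it needs no mode check.
    return True

ok = check_eda_session_args(eda_mode="intersect", prl_session_id="<PRL_SESSION_ID>")
print(ok)
```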
Create and run the task group¶
Create a task in the task group for each client. The number of tasks in the task group must match the number of clients specified in the data_config used to create the session.
# Create a task group with one task for each of the clients joining the session. If you have registered your dataset(s), specify only the `dataset_name` (no `dataset_path`).
eda_task_group = (
    SessionTaskGroup(eda_session)
    .add_task(iai_tb_aws.eda(dataset_name="active_train", dataset_path=active_train_path))
    .add_task(iai_tb_aws.eda(dataset_name="passive_train", dataset_path=passive_train_path))
)
# Start the task group and keep the returned context for monitoring
eda_task_group_context = eda_task_group.start()
Monitor submitted EDA Intersect jobs¶
Submitted tasks are in the Pending state until the clients join and the session is started. Once started, the status changes to Running.
import json

# Print the current status of each submitted task
for i in eda_task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))
eda_task_group_context.monitor_task_logs()
# Wait for the tasks to complete (success = True)
eda_task_group_context.wait(60*5, 2)
When the session completes successfully, “True” is returned. Otherwise, an error message appears.
Analyze the results¶
To retrieve the results of the session:
results = eda_session.results()
Example output:
EDA Session collection of datasets: ['active_client', 'passive_client']
Describe
You can use the .describe() function to review the results.
results.describe()
Statistics
For categorical columns, you can use other statistics for further exploration. For example, unique_count, mode, and uniques.
results["active_client"][["x10", "x11"]].uniques()
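As a reminder of what these three statistics mean, the sketch below computes them for a small, hypothetical categorical column using only the standard library. It does not use the SDK, and the format in which the SDK returns these values is not shown here.

```python
from collections import Counter

values = ["red", "blue", "red", "green", "red", "blue"]  # hypothetical categorical column
counts = Counter(values)

uniques = sorted(counts)            # distinct values
unique_count = len(counts)          # number of distinct values
mode = counts.most_common(1)[0][0]  # most frequent value

print(uniques, unique_count, mode)
```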
Mean
You can call functions such as .mean(), .median(), and .std() individually.
results["active_client"].mean()
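Because the session works with noisy histograms, summary statistics can be thought of as estimates from binned counts. The sketch below estimates a mean and standard deviation from bin midpoints. It assumes the statistics are derived from histograms, which this page suggests but does not confirm, and `stats_from_histogram` is a hypothetical helper, not an SDK function.

```python
from math import sqrt

def stats_from_histogram(bin_edges, counts):
    """Estimate mean and population standard deviation from bin midpoints."""
    mids = [(lo + hi) / 2 for lo, hi in zip(bin_edges, bin_edges[1:])]
    total = sum(counts)
    mean = sum(m * c for m, c in zip(mids, counts)) / total
    var = sum(c * (m - mean) ** 2 for m, c in zip(mids, counts)) / total
    return mean, sqrt(var)

edges = [0, 10, 20, 30]   # hypothetical bin edges
bin_counts = [2, 4, 2]    # hypothetical counts per bin
est_mean, est_std = stats_from_histogram(edges, bin_counts)
print(est_mean, est_std)
```

Estimates of this kind depend on both the bin width and any noise added to the counts, so small deviations from the exact values are expected.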
Histograms
You can create histogram plots using the .plot_hist() function.
saved_dataset_one_hist_plots = results["active_client"].plot_hist()
single_hist = results["active_client"]["x10"].plot_hist()
2D Histograms
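You can also plot 2D histograms of specified paired columns. The remaining examples assume that active_client and passive_client refer to the per-dataset results, for example active_client = results["active_client"].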
fig = results.plot_hist(active_client['x0'], passive_client['x1'])
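Conceptually, a 2D histogram counts how many overlapping records fall into each combination of bins for the two features. The following standard-library sketch shows that counting on hypothetical values that are already aligned by record; `histogram_2d` and the bin edges are illustrative and not part of the SDK.

```python
from collections import Counter

def histogram_2d(xs, ys, x_edges, y_edges):
    """Count (x, y) pairs per 2D bin; bins are half-open [lo, hi)."""
    def bin_index(value, edges):
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                return i
        return None  # value falls outside the binned range

    counts = Counter()
    for x, y in zip(xs, ys):
        i, j = bin_index(x, x_edges), bin_index(y, y_edges)
        if i is not None and j is not None:
            counts[(i, j)] += 1
    return counts

# Hypothetical values for the same overlapping records, aligned by position.
x0 = [1.0, 2.5, 3.5, 0.5]
x1 = [10, 25, 35, 12]
hist = histogram_2d(x0, x1, x_edges=[0, 2, 4], y_edges=[0, 20, 40])
print(dict(hist))
```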
Correlation
You can perform binary calculations, such as finding the correlation, on columns that you specified as paired columns when you created the session.
results.corr(active_client['x0'], passive_client['x1'])
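One way to picture a correlation computed without raw data is to estimate it from 2D histogram counts and bin midpoints. The sketch below does this with a Pearson correlation; it is an illustration under that assumption, with a hypothetical helper name and hypothetical counts, and it does not describe how the SDK computes its result.

```python
from math import sqrt

def correlation_from_hist2d(counts, x_mids, y_mids):
    """Pearson correlation estimated from 2D bin counts and bin midpoints."""
    n = sum(counts.values())
    mean_x = sum(c * x_mids[i] for (i, j), c in counts.items()) / n
    mean_y = sum(c * y_mids[j] for (i, j), c in counts.items()) / n
    cov = sum(c * (x_mids[i] - mean_x) * (y_mids[j] - mean_y)
              for (i, j), c in counts.items()) / n
    var_x = sum(c * (x_mids[i] - mean_x) ** 2 for (i, j), c in counts.items()) / n
    var_y = sum(c * (y_mids[j] - mean_y) ** 2 for (i, j), c in counts.items()) / n
    return cov / sqrt(var_x * var_y)

# Hypothetical 2D bin counts: keys are (x bin index, y bin index).
bin_counts_2d = {(0, 0): 2, (1, 1): 2}
r = correlation_from_hist2d(bin_counts_2d, x_mids=[1, 3], y_mids=[10, 30])
print(r)
```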
Addition, subtraction, division
Addition example. Change the operator to try subtraction, division, etc.
op_res = active_client['x0']+passive_client['x1']
fig = op_res.plot_hist()
GroupBy
groupby_result = results.groupby(active_client['x0'])[passive_client['x5']].mean()
print(groupby_result)
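The groupby call above groups by a column from one dataset and averages a column from the other over the overlapping records. The sketch below shows the same idea on hypothetical, already aligned values with plain Python; `groupby_mean` is an illustrative helper and not the SDK's implementation.

```python
from collections import defaultdict

def groupby_mean(keys, values):
    """Average `values` within each distinct key."""
    sums = defaultdict(float)
    sizes = defaultdict(int)
    for k, v in zip(keys, values):
        sums[k] += v
        sizes[k] += 1
    return {k: sums[k] / sizes[k] for k in sums}

x0 = ["a", "b", "a", "b"]  # hypothetical grouping column from one dataset
x5 = [1.0, 2.0, 3.0, 6.0]  # hypothetical value column from the other dataset
group_means = groupby_mean(x0, x5)
print(group_means)
```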