EDA in Intersect Mode

The Exploratory Data Analysis (EDA) Intersect mode enables you to access summary statistics about the intersection of a group of datasets without needing access to the data itself. It is used primarily in vertical federated learning (VFL) use cases to understand statistics about the overlapping records. For example, the question “what is the distribution of age among the intersection?” (intersect mode) can have a very different answer from “what is the distribution of age among the whole population?” (individual mode).
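To make the difference between the two modes concrete, here is a minimal plain-Python sketch with made-up records. It is only for intuition: the record IDs, ages, and helper name are hypothetical, and in a real session the overlap comes from the PRL session and the raw data is never pooled like this.

```python
# Toy data: each party holds records keyed by a shared ID (values are made up).
active = {
    "id1": {"age": 25},
    "id2": {"age": 40},
    "id3": {"age": 61},
    "id4": {"age": 33},
}
passive_ids = {"id2", "id3", "id9"}  # IDs held by the other party


def mean_age(records):
    """Average the 'age' field over a dict of records."""
    return sum(r["age"] for r in records.values()) / len(records)


# Individual mode: the statistic covers every record the party holds.
individual_mean = mean_age(active)

# Intersect mode: the statistic covers only IDs present in both datasets.
overlap = {k: v for k, v in active.items() if k in passive_ids}
intersect_mean = mean_age(overlap)

print(individual_mean, intersect_mean)
```

Here the two means differ because only two of the four active records also appear in the passive dataset.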

With EDA Intersect, you can:

  • obtain descriptive statistics, as in EDA Individual mode, but on the intersection (i.e., overlapping indices)

  • use bivariate operations such as groupby or correlation

  • compute the correlation matrix

  • plot a 2D histogram based on the joint distribution of two features from different clients

EDA is an important preliminary step for federated modelling and a simple form of federated analytics. This feature has a built-in differential privacy mechanism: calibrated noise is dynamically added to each histogram that is generated for each feature in a participating dataset. This gives the raw data extra protection by making the final results differentially private, but it also means that the final results deviate slightly from the true values.
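The idea of adding noise to histogram counts can be sketched as follows. This is a generic illustration of the Laplace mechanism, not the mechanism used by this feature, which the text above does not specify. The function names, the epsilon value, and the sensitivity assumption are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_histogram(counts, epsilon, rng=None):
    """Add Laplace noise with scale 1/epsilon to every bin count.

    Assumes each record falls in exactly one bin, so adding or removing
    one record changes a single count by 1 (sensitivity 1).
    """
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]


true_counts = [12, 30, 7, 0, 51]
noisy_counts = privatize_histogram(true_counts, epsilon=1.0, rng=random.Random(0))
print([round(c, 2) for c in noisy_counts])
```

A smaller epsilon gives a larger noise scale, so the reported counts deviate further from the true values in exchange for stronger privacy.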

API Reference

The core API module that contains the EDA-specific functionality is integrate_ai_sdk.api.eda. This module includes a core object called EdaResults, which contains the results of an EDA session. If you are comfortable working with the integrate.ai SDK and API, see the API Documentation for details.

Configure an EDA Intersect Session

This example uses an AWS task runner to run the session using data in S3 buckets. Ensure that you have completed the AWS configuration for task runners and that a task runner exists. See Register an AWS task runner for details.

To begin exploratory data analysis, you must first create a session. To configure the session, specify the following:

  • EDA data configuration (eda_data_config)

  • prl_session_id for the private record linkage (PRL) session you want to work with

The eda_data_config specifies the names of the datasets used to generate the intersection in the PRL session in the format dataset_name : columns. If columns is empty ([]), then EDA is performed on all columns.

Example data paths in S3

aws_taskrunner_profile = "<myworkspace>" # This is your workspace name
aws_taskrunner_name = "<mytaskrunner>" # Task runner name - must match what was supplied in UI to create task runner

base_aws_bucket = f'{aws_taskrunner_profile}-{aws_taskrunner_name}.integrate.ai'

active_train_path = f's3://{base_aws_bucket}/vfl/active_train.parquet'
active_test_path = f's3://{base_aws_bucket}/vfl/active_test.parquet'
passive_train_path = f's3://{base_aws_bucket}/vfl/passive_train.parquet'
passive_test_path = f's3://{base_aws_bucket}/vfl/passive_test.parquet'

# Where to store the trained model
aws_storage_path = f's3://{base_aws_bucket}/model'

eda_data_config = {
    "active": {"columns": []},
    "passive": {"columns": []},
}

prl_session_id = "<PRL_SESSION_ID>"

You must also specify the session ID of a previous successful PRL session that used the same datasets listed in the eda_data_config.
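As a small illustration of the config shape described above, the following sketch restricts one dataset to specific columns and leaves the other on all columns. The column names and the summarize_config helper are hypothetical and are not part of the SDK.

```python
# Hypothetical column names, used only to illustrate the shape of the config.
eda_data_config = {
    "active": {"columns": ["x0", "x2"]},  # restrict EDA to these columns
    "passive": {"columns": []},           # empty list: use all columns
}


def summarize_config(config):
    """Map each dataset name to the columns it selects, or 'all columns'."""
    return {
        name: (entry["columns"] if entry["columns"] else "all columns")
        for name, entry in config.items()
    }


print(summarize_config(eda_data_config))
```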

Paired Columns

To find the correlation (or apply any other binary operation) between two specific columns from two datasets, you must specify those columns as paired columns.

To set the pairs you are interested in, specify the column names in a dictionary keyed by dataset name, similar to eda_data_config.

single_client_2d_pairs = {"active_client": ['x0', 'x2'], "passive_client": ['x1']}

cross_client_2d_pairs = {"active_client": ['x0', 'x2'], "passive_client": ['x1']}

For example: {"passive_client": ['x1', 'x5'], "active_client": ['x0', 'x2']}

will generate 2D histograms for these pairs of columns:

(x0, x1), (x0, x5), (x2, x1), (x2, x5), (x0, x2), (x1, x5)
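The list of pairs above can be reproduced by enumerating the cross-dataset combinations first and then the within-dataset combinations. The sketch below only illustrates that enumeration. The expand_pairs helper is hypothetical and does not show how the SDK builds the pairs internally.

```python
from itertools import combinations, product


def expand_pairs(pairs_config):
    """List cross-dataset pairs first, then within-dataset pairs."""
    datasets = list(pairs_config.values())
    # Every column of one dataset paired with every column of another.
    cross = [p for a, b in combinations(datasets, 2) for p in product(a, b)]
    # Every pair of columns that belong to the same dataset.
    within = [p for cols in datasets for p in combinations(cols, 2)]
    return cross + within


config = {"active_client": ["x0", "x2"], "passive_client": ["x1", "x5"]}
print(expand_pairs(config))
```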

Create an EDA intersect session

The code sample demonstrates creating and starting an EDA session to perform privacy-preserving data analysis on the intersection of two distinct datasets. It returns an EDA session ID that you can use to track and reference your session.

Create the EDA intersect session

eda_session = client.create_eda_session(
    name="EDA Intersect Session",
    description="Testing EDA Intersect mode through a notebook",
    data_config=eda_data_config,
    eda_mode="intersect",            # Generates histograms on the overlap of two distinct datasets
    prl_session_id=prl_session_id,   # Required for intersect mode
    single_client_2d_pairs=None,     # Optional - only required to generate single-node 2D histograms (both columns in the same dataset)
    cross_client_2d_pairs=None,      # Optional - only required to generate 2D histograms between two datasets
).start()

eda_session.id

Session parameters:

  • eda_mode (str, optional): One of {‘individual’, ‘intersect’}. Defaults to ‘individual’.

  • prl_session_id (str, optional): Session ID of associated PRL session. Required for eda_mode = ‘intersect’.

  • single_client_2d_pairs (Dict, optional): a data_config-like dict with the column names used to generate 2D histograms when both columns belong to the same dataset. This option only covers single-node 2D histograms and is valid for both ‘intersect’ and ‘individual’ mode. Defaults to None.

  • cross_client_2d_pairs (Dict, optional): a data_config-like dict with the column names used to generate cross-dataset 2D histograms. This option only works in ‘intersect’ mode. Defaults to None.

Note that, unlike other sessions, you do not need to specify a model_config file for EDA. This is because there are no editable parameters.

For more information, see the create_eda_session() definition in the API documentation.
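The parameter rules listed above can be checked before a session is created. The following is a minimal sketch of such a pre-flight check. The check_eda_session_args function is hypothetical and not part of the SDK, and the SDK may validate these arguments differently.

```python
def check_eda_session_args(eda_mode="individual", prl_session_id=None,
                           single_client_2d_pairs=None, cross_client_2d_pairs=None):
    """Return a list of problems with the arguments (empty if none are found)."""
    problems = []
    if eda_mode not in ("individual", "intersect"):
        problems.append("eda_mode must be 'individual' or 'intersect'")
    if eda_mode == "intersect" and not prl_session_id:
        problems.append("prl_session_id is required for intersect mode")
    if cross_client_2d_pairs is not None and eda_mode != "intersect":
        problems.append("cross_client_2d_pairs only works in intersect mode")
    # single_client_2d_pairs is valid in both modes, so it needs no mode check.
    return problems


print(check_eda_session_args(eda_mode="intersect"))
```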

Create and run the task group

Create a task in the task group for each client. The number of tasks in the task group must match the number of clients specified in the data_config used to create the session.

# Create a task group with one task for each of the clients joining the session.
# If you have registered your dataset(s), specify only the `dataset_name` (no `dataset_path`).

task_group = (
    SessionTaskGroup(eda_session)
    .add_task(iai_tb_aws.eda(dataset_name="active_train", dataset_path=active_train_path))
    .add_task(iai_tb_aws.eda(dataset_name="passive_train", dataset_path=passive_train_path))
)

eda_task_group_context = task_group.start()

Monitor submitted EDA Intersect jobs

Submitted tasks are in the Pending state until the clients join and the session is started. Once started, the status changes to Running.

import json

# Print the current status of each task in the task group
for i in eda_task_group_context.contexts.values():
    print(json.dumps(i.status(), indent=4))

eda_task_group_context.monitor_task_logs()

# Wait for the tasks to complete (success = True)

eda_task_group_context.wait(60*5, 2)

When the session completes successfully, “True” is returned. Otherwise, an error message appears.
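The waiting step can be pictured as a polling loop with a timeout. The sketch below is a generic illustration and not the SDK's implementation. It assumes the two arguments to wait() are a timeout and a polling interval in seconds, which the text above does not confirm. The wait_until and fake_status_check names are hypothetical.

```python
import time


def wait_until(is_done, timeout_s, poll_interval_s):
    """Poll is_done() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


# Stand-in for a real status check: reports completion on the third poll.
polls = {"count": 0}


def fake_status_check():
    polls["count"] += 1
    return polls["count"] >= 3


finished = wait_until(fake_status_check, timeout_s=1.0, poll_interval_s=0.01)
print(finished)
```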

Analyze the results

To retrieve the results of the session:

results = eda_session.results()

Example output:

EDA Session collection of datasets: ['active_client', 'passive_client']

Describe

You can use the .describe() function to review the results.

results.describe()

Example output:

EDA intersect describe output

Statistics

For categorical columns, you can use other statistics for further exploration, such as unique_count, mode, and uniques.

results["active_client"][["x10", "x11"]].uniques()

Example output:

EDA intersect describe output

Mean

You can call functions such as .mean(), .median(), and .std() individually.

results["active_client"].mean()

Example output:

EDA intersect describe output

Histograms

You can create histogram plots using the .plot_hist() function.

saved_dataset_one_hist_plots = results["active_client"].plot_hist()

Example output:

EDA intersect describe output


single_hist = results["active_client"]["x10"].plot_hist()

Example output:

EDA intersect describe output

2D Histograms

You can also plot 2D-histograms of specified paired columns.

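# Assumed definitions: select each dataset's results by name, as in results["active_client"] above
active_client = results["active_client"]
passive_client = results["passive_client"]

fig = results.plot_hist(active_client['x0'], passive_client['x1'])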

Example output:

EDA intersect describe output

Correlation

You can perform binary calculations, such as finding the correlation, on columns that were specified as paired columns.

results.corr(active_client['x0'], passive_client['x1'])

Example output:

EDA intersect describe output
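For intuition about how a correlation can be derived from a joint distribution, the following sketch estimates the Pearson correlation from a 2D histogram by treating each bin's count as sitting at the bin center. This is an illustrative approximation under that assumption, not a description of how results.corr() is computed. The corr_from_hist2d function and the sample counts are hypothetical.

```python
from math import sqrt


def corr_from_hist2d(counts, x_centers, y_centers):
    """Pearson correlation with each bin's mass placed at its center.

    counts[i][j] is the count for x bin i and y bin j.
    """
    n = sum(sum(row) for row in counts)
    mean_x = sum(x_centers[i] * sum(row) for i, row in enumerate(counts)) / n
    mean_y = sum(y_centers[j] * c for row in counts for j, c in enumerate(row)) / n
    cov = var_x = var_y = 0.0
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            dx = x_centers[i] - mean_x
            dy = y_centers[j] - mean_y
            cov += c * dx * dy
            var_x += c * dx * dx
            var_y += c * dy * dy
    return cov / sqrt(var_x * var_y)


# All mass on the diagonal: x and y bins always move together.
diag = [[5, 0, 0], [0, 5, 0], [0, 0, 5]]
print(corr_from_hist2d(diag, [0.5, 1.5, 2.5], [10.0, 20.0, 30.0]))
```

Because the estimate uses bin centers and, in a differentially private setting, noisy counts, it approximates the correlation of the underlying records and does not reproduce it exactly.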

Addition, subtraction, division

The following example adds two columns. Change the operator to try subtraction, division, and so on.

op_res = active_client['x0'] + passive_client['x1']
fig = op_res.plot_hist()

Example output:

EDA intersect describe output

GroupBy

groupby_result = results.groupby(active_client['x0'])[passive_client['x5']].mean()
print(groupby_result)
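A cross-dataset group-by mean can be understood as a weighted average over the joint distribution of the two columns. The sketch below computes per-group means from a 2D histogram whose rows are groups and whose columns are value bins. It is an illustration under that assumption and not the SDK's implementation. The groupby_mean_from_hist2d function and the sample counts are hypothetical.

```python
def groupby_mean_from_hist2d(counts, group_labels, y_centers):
    """Mean of y per group, with each bin's mass placed at its center.

    counts[i][j] is the count for group i and y bin j.
    """
    result = {}
    for label, row in zip(group_labels, counts):
        total = sum(row)
        if total == 0:
            continue  # no overlapping records fell in this group
        result[label] = sum(c * y for c, y in zip(row, y_centers)) / total
    return result


counts = [[2, 2, 0], [0, 1, 3]]
means = groupby_mean_from_hist2d(counts, ["a", "b"], [10.0, 20.0, 30.0])
print(means)
```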