Relevance Analysis¶
Triage and understand the new data relative to your existing book of business. Perform a statistical examination of how the new data aligns with and enriches your existing consumer data to assess its value. The goal is to identify candidate features that show promise for improving model performance.
This evaluation guide implements the main workflows required to evaluate a data product (a small illustrative sketch of the first two metrics follows this list):
Match rate: How much of the Data Provider’s data product can be matched to my internal data?
Fill rate: How much data is missing in the data product?
Univariate analysis: What are the distributions of the different variables in the data product before/after matching with my internal data?
Bivariate analysis: What are the relationships between different variables in the data product and my own features?
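For orientation, the toy sketch below illustrates how the first two metrics are defined on plain, local pandas DataFrames (illustrative only; it is not part of the federated workflow): match rate is the fraction of internal records that also appear in the data product, and fill rate is the fraction of non-null values in a column.
[ ]:
# Toy illustration of match rate and fill rate on local pandas DataFrames.
import pandas as pd

internal = pd.DataFrame({"id": [1, 2, 3, 4]})
product = pd.DataFrame({"id": [2, 3, 5], "feature_a": [0.4, None, 1.2]})

match_rate = internal["id"].isin(product["id"]).mean()  # overlapped records / internal records
fill_rate = product["feature_a"].notna().mean()         # non-null values / total rows

print(f"match rate: {match_rate:.2f}, fill rate of feature_a: {fill_rate:.2f}")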
Table of Contents
Setup and Configuration
Relevance Analysis Example
Match Rate
Fill Rate
EDA with Feature Preprocessing
Feature Preprocessing
EDA Intersect
Univariate Analysis
Bivariate Analysis
Setup and Configuration¶
Set environment variables with your IAI credentials¶
Generate and manage your IAI_TOKEN in your workspace.
[ ]:
from integrate_ai_sdk.api import connect
from integrate_ai_sdk.taskgroup.taskbuilder.integrate_ai import IntegrateAiTaskBuilder
from integrate_ai_sdk.taskgroup.base import SessionTaskGroup
import pandas as pd
import numpy as np
import os
import json
from matplotlib import pyplot as plt
pd.options.display.max_columns = 1000
pd.options.display.max_rows = 1000
IAI_TOKEN = ""
client = connect(token=IAI_TOKEN)
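If you prefer not to paste the token into the notebook, you can instead read it from an environment variable (this assumes you have exported IAI_TOKEN in the environment that runs this notebook):
[ ]:
# Alternative: read the token from an environment variable instead of hardcoding it.
IAI_TOKEN = os.environ.get("IAI_TOKEN", "")
client = connect(token=IAI_TOKEN)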
Read vendor configuration¶
Specify the vendor configuration file name and other parameters.
In this example, a separate configuration file is used to contain a number of parameters that do not need to be adjusted. This file is loaded with the command below.
[2]:
with open('realistic_vendor_v1_config.json', 'r') as file:
    config = json.load(file)
provider_client = config['client_name']
provider_name = config['dataset_name']
vendor_prl_config = config['prl_config']
vendor_raw_features = config['raw_features']
vendor_bivariate_features = config['bivariate_features']
vendor_eda_preprocessing_config = config['eda_preprocessing']
vendor_vfl_preprocessing_config = config['model_preprocessing']
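To see what the loaded configuration contains before adapting this notebook to a different vendor file, a quick sketch such as the following prints each top-level key and the type of its value:
[ ]:
# Inspect the top-level structure of the vendor config that was just loaded.
print(json.dumps({key: type(value).__name__ for key, value in config.items()}, indent=2))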
Specify datasets and features¶
Specify the datasets to be used in the evaluation.
Important: The task runner expects your data to be in the bucket that was created when the task runner was provisioned.
[3]:
consumer_client = 'syn_carrier'
consumer_train_name = 'syn_data_realistic_v1_carrier_train'
consumer_test_name = 'syn_data_realistic_v1_carrier_test'
Create a task builder for each client (party) in the evaluation. For example, carrier and provider.
Note: The task_runner_id can be left empty. The system determines the correct task runner to use.
[4]:
iai_tb_aws_consumer = IntegrateAiTaskBuilder(client=client, task_runner_id="")
iai_tb_aws_provider = IntegrateAiTaskBuilder(client=client, task_runner_id="")
Specify the target variable and the consumer features for the EDA session. Specify the join key for the PRL session.
[5]:
target = 'target'
provider_univariate_features = vendor_raw_features
provider_bivariate_features = vendor_bivariate_features
consumer_bivariate_features = [target]
join_key_consumer = ['id']
Relevance Analysis Example¶
Objective: Test the relevance of a data product via match rate, fill rate, and statistical analyses on both raw and matched records between vendor and carrier datasets.
Match Rate¶
Step 1: Specify the PRL data configuration. The client names are the variable names for the datasets from the “Specify datasets and features” section above.
Make sure that the id_columns entry contains the actual column name(s) that represent the join key.
[6]:
prl_data_config = {
    "clients": {
        consumer_client: {
            "id_columns": join_key_consumer
        },
        provider_client: vendor_prl_config
    }
}
If you have previously run a PRL session and want to use that session instead of a new one, you can load the session by specifying its ID.
[7]:
# Optional
existing_prl_session_id = None
Step 2: Create and start PRL session. You can modify the name and description to help you differentiate it from other sessions.
This example demonstrates an if/else option that allows you to use either a previous PRL session OR to start a new one.
[8]:
if existing_prl_session_id:
    prl_session = client.session(existing_prl_session_id)
    print(existing_prl_session_id)  # Prints the session ID for reference
else:
    prl_session = client.create_prl_session(
        name="Phase 1 eval - PRL",
        description="Phase 1 eval - PRL",
        data_config=prl_data_config
    ).start()

    task_group = (
        SessionTaskGroup(prl_session)
        .add_task(iai_tb_aws_consumer.prl(train_dataset_name=consumer_train_name, test_dataset_name=consumer_test_name, client_name=consumer_client))
        .add_task(iai_tb_aws_provider.prl(train_dataset_name=provider_name, test_dataset_name=provider_name, client_name=provider_client))
    )
    task_group_context = task_group.start()
    print(prl_session.id)  # Prints the session ID for reference
11/18/2025 13:39:12:INFO:Setting SHA256 for PRL
d562a556ae
Wait for the session to complete. Session details are displayed on the Sessions page in your workspace UI. Use the session ID to identify your session.
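If you prefer to block in the notebook rather than watch the workspace UI, you can poll the task group before reading metrics. The sketch below assumes your SDK version exposes a wait helper on the task group context; check the SDK documentation if it does not.
[ ]:
# Poll until the PRL task group finishes before reading metrics.
# Assumes task_group_context.wait(timeout_seconds, poll_interval) is available in your SDK version.
completed = task_group_context.wait(60 * 5, 2)
print("PRL tasks completed:", completed)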
View the match rate¶
[10]:
metrics = prl_session.metrics().as_dict()
n_train = metrics['client_metrics'][consumer_client]['train']['n_overlapped_records']
summary_table = pd.DataFrame(metrics['client_metrics'][consumer_client]).T
summary_table
[10]:
|  | n_records | n_overlapped_records | frac_overlapped |
|---|---|---|---|
| train | 400000.0 | 360868.0 | 0.9 |
| test | 100000.0 | 90271.0 | 0.9 |
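As a sanity check, the overlap fraction can be recomputed directly from the raw counts in the summary table; it should match frac_overlapped.
[ ]:
# Recompute the overlap fraction from the raw counts reported by the PRL session.
summary_table["frac_recomputed"] = summary_table["n_overlapped_records"] / summary_table["n_records"]
summary_table[["n_records", "n_overlapped_records", "frac_overlapped", "frac_recomputed"]]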
Fill Rate¶
Step 1: Specify the data config dictionary.
[11]:
eda_raw_data_config = {provider_client: provider_univariate_features}
existing_raw_eda_individual_session_id = None
existing_raw_eda_intersect_session_id = None
Step 2: Create and start the EDA session. You can edit the name or description of the session.
As in the PRL session example above, you can choose to use an existing session ID OR run a new session.
First we will start an EDA session in individual mode. This mode does not require a PRL session as input.
[12]:
if existing_raw_eda_individual_session_id:
    raw_eda_individual_session = client.session(existing_raw_eda_individual_session_id)
    print("eda individual raw: ", raw_eda_individual_session.id)  # Prints the session ID for reference
else:
    raw_eda_individual_session = client.create_eda_session(
        name="Phase 1 eval - EDA individual raw",
        description="",
        data_config=eda_raw_data_config,
        eda_mode="individual"
    ).start()

    task_group = (
        SessionTaskGroup(raw_eda_individual_session)
        .add_task(iai_tb_aws_provider.eda(dataset_name=provider_name, client_name=provider_client))
    )
    task_group_context = task_group.start()
    print("eda individual raw: ", raw_eda_individual_session.id)  # Prints the session ID for reference
eda individual raw: da442bc597
Next, start an EDA session in intersect mode, using the overlap from the PRL session run earlier.
[13]:
if existing_raw_eda_intersect_session_id:
    raw_eda_intersect_session = client.session(existing_raw_eda_intersect_session_id)
    print("eda intersect raw: ", raw_eda_intersect_session.id)  # Prints the session ID for reference
else:
    raw_eda_intersect_session = client.create_eda_session(
        name="Phase 1 eval - EDA intersect raw",
        description="",
        data_config=eda_raw_data_config,
        eda_mode="intersect",
        prl_session_id=prl_session.id
    ).start()

    task_group = (
        SessionTaskGroup(raw_eda_intersect_session)
        .add_task(iai_tb_aws_provider.eda(dataset_name=provider_name, client_name=provider_client))
    )
    task_group_context = task_group.start()
    print("eda intersect raw: ", raw_eda_intersect_session.id)  # Prints the session ID for reference
eda intersect raw: 0114bfc809
Wait for the session to complete. Session details are displayed on the Sessions page in your workspace UI. Use the session ID to identify your session.
View the vendor feature fill rate¶
fill_rate_marginal is the fill rate (of each feature) for the entire vendor dataset.
fill_rate_matched is the fill rate (of each feature) for the overlapped records between the carrier data and the vendor data.
[15]:
marginal_fill_rate_df = raw_eda_individual_session.results().info()[['Column', 'Non-Null Count', 'Total Count']]
marginal_fill_rate_df['fill_rate'] = marginal_fill_rate_df['Non-Null Count'] / marginal_fill_rate_df['Total Count']
matched_fill_rate_df = raw_eda_intersect_session.results().info()[['Column', 'Non-Null Count', 'Total Count']]
matched_fill_rate_df['fill_rate'] = matched_fill_rate_df['Non-Null Count'] / matched_fill_rate_df['Total Count']
fill_rate_table = pd.merge(marginal_fill_rate_df, matched_fill_rate_df, on='Column', suffixes=('_marginal', '_matched'))
fill_rate_table
[15]:
|  | Column | Non-Null Count_marginal | Total Count_marginal | fill_rate_marginal | Non-Null Count_matched | Total Count_matched | fill_rate_matched |
|---|---|---|---|---|---|---|---|
| 0 | constructions_imputed | 4999992 | 4999992 | 1.0 | 360868 | 360868 | 1.0 |
| 1 | storeys_imputed | 4999992 | 4999992 | 1.0 | 360868 | 360868 | 1.0 |
| 2 | has_basement_imputed | 4999992 | 4999992 | 1.0 | 360868 | 360868 | 1.0 |
| 3 | use_imputed | 4999992 | 4999992 | 1.0 | 360868 | 360868 | 1.0 |
| 4 | floor_level_imputed | 4999992 | 4999992 | 1.0 | 360868 | 360868 | 1.0 |
| 5 | height_imputed | 4999995 | 4999995 | 1.0 | 360866 | 360866 | 1.0 |
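A common follow-up is to flag features whose fill rate falls below a chosen threshold, either across the whole vendor dataset or on the matched subset. The 0.95 cut-off below is only an example value.
[ ]:
# Flag vendor features with a fill rate below a chosen threshold (0.95 is an example cut-off).
fill_rate_threshold = 0.95
low_fill = fill_rate_table[
    (fill_rate_table["fill_rate_marginal"] < fill_rate_threshold)
    | (fill_rate_table["fill_rate_matched"] < fill_rate_threshold)
]
low_fill[["Column", "fill_rate_marginal", "fill_rate_matched"]]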
(Optional) Check the raw vendor feature distributions and decide whether you want to preprocess them in a specific way.
The default vendor feature preprocessing is provided in the vendor config; this step is only needed if you want to inspect the raw distributions first.
[16]:
hist = raw_eda_intersect_session.results()[provider_client].plot_hist()
[17]:
raw_eda_intersect_session.results()[provider_client].describe(categorical=True).T
[17]:
|  | count | unique | top | freq |
|---|---|---|---|---|
| storeys_imputed | 360868 | [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, ... | 2.0 | 124049 |
| constructions_imputed | 360868 | [CF, CW, PF, SF, SM, TF, Unknown] | CW | 162492 |
| has_basement_imputed | 360868 | [Unknown, no, yes] | yes | 194750 |
| use_imputed | 360868 | [N, P, R, Unknown] | R | 259986 |
| floor_level_imputed | 360868 | [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0] | 1.0 | 155417 |
EDA with Feature Preprocessing¶
Feature Preprocessing¶
Specify how you want to preprocess the raw features in the config below.
The default vendor feature preprocessing is provided in the vendor config file; it does not need to be modified unless you want to try a different preprocessing approach.
First, define how the raw features should be preprocessed for the EDA. If you have already created and run a transform session, you can specify its session ID instead of running a new one.
[18]:
fp_data_config = {
    provider_client: vendor_eda_preprocessing_config
}
existing_transform_intersect_session_id = None
Apply feature preprocessing on the overlap between vendor data and carrier data.
[25]:
if existing_transform_intersect_session_id:
    transform_intersect_session = client.session(existing_transform_intersect_session_id)
    print(transform_intersect_session.id)  # Prints the session ID for reference
else:
    transform_intersect_session = client.create_transform_session(
        name="Phase 1 eval - transform intersect",
        description="",
        data_config=fp_data_config,
        transform_mode="intersect",
        prl_session_id=prl_session.id
    ).start()

    task_group = (
        SessionTaskGroup(transform_intersect_session)
        # .add_task(iai_tb_aws_consumer.transform(dataset_name=consumer_train_name, client_name=consumer_client))
        .add_task(iai_tb_aws_provider.transform(dataset_name=provider_name, client_name=provider_client))
    )
    task_group_context = task_group.start()
    print(transform_intersect_session.id)  # Prints the session ID for reference
571b9533b0
Wait for the session to complete. Session details are displayed on the Sessions page in your workspace UI. Use the session ID to identify your session.
EDA Intersect¶
Step 1: Specify the data config and the paired columns to analyze.
To perform bivariate analysis between specific columns from the two datasets, specify the list of columns from both the vendor and the carrier.
2D histograms are generated for all pairs from these two lists so that you can perform bivariate analysis.
Note that the more columns you specify here, the heavier the computation and the longer the session run time, so we recommend including only the columns that require bivariate analysis.
[ ]:
eda_data_config = {
    consumer_client: consumer_bivariate_features,
    provider_client: provider_univariate_features
}
pair_cols = {
    consumer_client: consumer_bivariate_features,
    provider_client: provider_bivariate_features
}
#Optional - load an existing session instead of running a new one.
existing_eda_intersect_session_id = None
Step 2: Create and start an EDA intersect session. You can edit the session name and description. There is no need to edit other parameters in this example.
[28]:
if existing_eda_intersect_session_id:
    eda_session = client.session(existing_eda_intersect_session_id)
    print(eda_session.id)  # Prints the session ID for reference
else:
    eda_session = client.create_eda_session(
        name="Phase 1 eval - EDA intersect",
        description="Phase 1 eval - EDA intersect",
        data_config=eda_data_config,
        eda_mode="intersect",
        transform_session_id=transform_intersect_session.id,  # The transform session created above
        prl_session_id=prl_session.id,
        cross_client_2d_pairs=pair_cols
    ).start()

    task_group = (
        SessionTaskGroup(eda_session)
        .add_task(iai_tb_aws_consumer.eda(dataset_name=consumer_train_name, client_name=consumer_client))
        .add_task(iai_tb_aws_provider.eda(dataset_name=provider_name, client_name=provider_client))
    )
    task_group_context = task_group.start()
    print(eda_session.id)  # Prints the session ID for reference
a852cc0bb2
Wait for the session to complete. Session details are displayed on the Sessions page in your workspace UI. Use the session ID to identify your session.
[29]:
provider_raw_individual_eda = raw_eda_individual_session.results()[provider_client]
provider_raw_intersect_eda = raw_eda_intersect_session.results()[provider_client]
intersect_results = eda_session.results()
provider_eda = intersect_results[provider_client]
consumer_eda = intersect_results[consumer_client]
df_info = intersect_results.info()
Univariate Analysis¶
We can check the vendor feature distributions on the records that overlap with the carrier data, and examine how the distributions change before/after matching and before/after feature preprocessing.
Feature distribution¶
Example EDA job: check the summary statistics of the vendor numerical features for the overlapped records.
[30]:
provider_eda.describe().T
[30]:
|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| height_imputed | 360869.0 | 8.252904 | 2.460404 | 5.0 | 6.020979 | 8.002974 | 9.981829 | 24.0 |
Check the summary statistics of the vendor categorical features for the overlapped records.
[ ]:
provider_eda.describe(categorical=True).T
Example EDA job: Display histograms for all features.
[31]:
hist = provider_eda.plot_hist()
Check the distribution of the target variable after matching.
[32]:
consumer_eda[target].describe()
[32]:
count 3.608670e+05
mean 4.182813e+03
std 1.851324e+05
min 0.000000e+00
25% 3.840202e-57
50% 2.375498e-26
75% 2.441010e-08
max 7.775760e+07
Name: target, dtype: float64
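The summary statistics show that the target is extremely right-skewed (the median is around 2e-26 while the maximum is about 7.8e7), so a histogram with a log-scaled count axis is easier to read. The sketch below assumes consumer_eda[target] exposes the same .counts / .bins interface used for the vendor features later in this notebook.
[ ]:
# Plot the target histogram with a log-scaled y axis to make the heavy skew visible.
# Assumes consumer_eda[target] exposes .counts and .bins like the vendor feature results.
target_counts, target_bins = consumer_eda[target].counts, consumer_eda[target].bins
plt.bar(target_bins[:-1], target_counts, width=np.diff(target_bins), align="edge")
plt.yscale("log")
plt.title("Target distribution (overlapped records)")
plt.show()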
Feature distribution comparison¶
EDA job: Use summary statistics to compare a feature's distribution between the whole dataset and the overlapped portion, with and without preprocessing.
For col, specify the name of any continuous vendor feature that you want to compare.
[33]:
col = 'height_imputed'
df = pd.concat([provider_raw_individual_eda[col].describe(), provider_raw_intersect_eda[col].describe(), provider_eda[col].describe()], axis=1)
df.columns = ['individual (raw)', 'intersect (raw)', 'intersect (preprocessed)']
df
[33]:
|  | individual (raw) | intersect (raw) | intersect (preprocessed) |
|---|---|---|---|
| count | 4.999995e+06 | 360866.000000 | 360869.000000 |
| mean | 8.147706e+00 | 8.257403 | 8.252904 |
| std | 2.448904e+00 | 2.454346 | 2.460404 |
| min | 5.000000e+00 | 5.000000 | 5.000000 |
| 25% | 6.012627e+00 | 6.020976 | 6.020979 |
| 50% | 7.996590e+00 | 8.002972 | 8.002974 |
| 75% | 9.968646e+00 | 9.981819 | 9.981829 |
| max | 2.500000e+01 | 24.000000 | 24.000000 |
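To quantify how much matching and preprocessing shift the distribution, you can compute relative changes against the full-dataset statistics, using the comparison table built above:
[ ]:
# Relative change in mean and standard deviation versus the full vendor dataset.
baseline = df["individual (raw)"]
shift = df.sub(baseline, axis=0).div(baseline, axis=0)[["intersect (raw)", "intersect (preprocessed)"]]
shift.loc[["mean", "std"]]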
EDA job: Compare a feature's full distribution between the whole dataset and the overlapped portion, with and without preprocessing.
For col, specify the name of the feature whose distribution you want to compare.
[34]:
col = 'use_imputed'
height_raw_individual, bins_raw_individual = provider_raw_individual_eda[col].counts, provider_raw_individual_eda[col].bins
height_raw_intersect, bins_raw_intersect = provider_raw_intersect_eda[col].counts, provider_raw_intersect_eda[col].bins
height_intersect, bins_intersect = provider_eda[col].counts, provider_eda[col].bins
if (df_info[df_info['Column']==col].Dtype=='Categorical').all():
    sorted_raw_individual_pairs = sorted(zip(bins_raw_individual, height_raw_individual))
    plot_bins_raw_individual, plot_height_raw_individual = zip(*sorted_raw_individual_pairs)
    sorted_raw_intersect_pairs = sorted(zip(bins_raw_intersect, height_raw_intersect))
    plot_bins_raw_intersect, plot_height_raw_intersect = zip(*sorted_raw_intersect_pairs)
    sorted_intersect_pairs = sorted(zip(bins_intersect, height_intersect))
    plot_bins_intersect, plot_height_intersect = zip(*sorted_intersect_pairs)
else:
    plot_bins_raw_individual = bins_raw_individual[:-1]
    plot_height_raw_individual = height_raw_individual
    plot_bins_raw_intersect = bins_raw_intersect[:-1]
    plot_height_raw_intersect = height_raw_intersect
    plot_bins_intersect = bins_intersect[:-1]
    plot_height_intersect = height_intersect
fig, ax = plt.subplots(1,3, figsize=(15, 4))
ax[0].bar(plot_bins_raw_individual, plot_height_raw_individual)
ax[0].set_title('Whole dataset (raw)')
ax[1].bar(plot_bins_raw_intersect, plot_height_raw_intersect)
ax[1].set_title('Overlapped data (raw)')
ax[2].bar(plot_bins_intersect, plot_height_intersect)
ax[2].set_title('Overlapped data (preprocessed)')
plt.tight_layout()
plt.show()
Bivariate Analysis¶
We can further examine the relationship between the vendor features and the target variable through bivariate analysis, including:
groupby (for categorical vendor features)
correlation (for continuous vendor features; an approximate approach is sketched after the groupby example below)
EDA job: Group By
[40]:
results = client.session("a852cc0bb2").results()  # The EDA intersect session ID printed above; replace with your own, or reuse eda_session.results()
results.info()
[40]:
|  | Dataset Name | Column | Non-Null Count | Total Count | Dtype |
|---|---|---|---|---|---|
| 0 | realistic_provider | storeys_imputed | 360868 | 360868 | Categorical |
| 1 | realistic_provider | use_imputed | 360868 | 360868 | Categorical |
| 2 | realistic_provider | has_basement_imputed | 360867 | 360867 | Categorical |
| 3 | realistic_provider | height_imputed | 360869 | 360869 | Continuous |
| 4 | realistic_provider | constructions_imputed | 360868 | 360868 | Categorical |
| 5 | realistic_provider | floor_level_imputed | 360868 | 360868 | Categorical |
| 6 | syn_carrier | target | 360867 | 360867 | Continuous |
[41]:
provider_cat_cols = list(df_info[(df_info.Dtype=='Categorical') & (df_info['Dataset Name']==provider_client)].Column)
fig, ax = plt.subplots(len(provider_cat_cols), 1, figsize=(10, 5*len(provider_cat_cols)))
for i in range(len(provider_cat_cols)):
    results.groupby(provider_eda[provider_cat_cols[i]])[consumer_eda[target]].mean().plot(kind='bar', ax=ax[i], title=provider_cat_cols[i])
plt.tight_layout()
plt.show()
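The correlation half of the bivariate analysis is not shown above. One way to approximate a Pearson correlation between a continuous vendor feature and the target, without sharing raw records, is to work from the cross-client 2D histogram that the EDA intersect session produces for each pair in cross_client_2d_pairs. The helper below is a sketch of that calculation; the accessor that returns the 2D histogram (counts plus bin edges) is hypothetical and should be replaced with whatever your SDK version actually exposes.
[ ]:
# Approximate a Pearson correlation from a joint (2D) histogram by weighting bin centres with bin counts.
def corr_from_hist2d(counts, x_edges, y_edges):
    counts = np.asarray(counts, dtype=float)                      # shape: (len(x_edges)-1, len(y_edges)-1)
    x = (np.asarray(x_edges[:-1]) + np.asarray(x_edges[1:])) / 2  # bin centres of the x variable
    y = (np.asarray(y_edges[:-1]) + np.asarray(y_edges[1:])) / 2  # bin centres of the y variable
    n = counts.sum()
    mean_x = counts.sum(axis=1) @ x / n
    mean_y = counts.sum(axis=0) @ y / n
    cov = (counts * np.outer(x - mean_x, y - mean_y)).sum() / n
    std_x = np.sqrt(counts.sum(axis=1) @ (x - mean_x) ** 2 / n)
    std_y = np.sqrt(counts.sum(axis=0) @ (y - mean_y) ** 2 / n)
    return cov / (std_x * std_y)

# Hypothetical accessor -- replace with however your SDK exposes the cross-client 2D histogram
# for a (vendor feature, target) pair, for example:
# counts, x_edges, y_edges = results.hist2d('height_imputed', 'target')
# print(corr_from_hist2d(counts, x_edges, y_edges))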