Likelihood Ratio Testing (LRT)
The likelihood ratio test (LRT) is a statistical method used for variable selection. It compares a full model (with all features as predictors) against a reduced model (with all features except one particular feature X) to test whether feature X is significant.
The result, lrt_result, is a pandas Series (a single-column table) with one row per tested feature (the keys in likelihood_ratio_test_group), where each value is the p-value from the LRT for that feature.
If the LRT gives a p-value below 0.05, feature X is considered significant at the 5% level.
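As a minimal sketch of this decision rule, the snippet below computes the test statistic, 2 * (log-likelihood of the full model - log-likelihood of the reduced model), and its chi-square p-value. The helper name lrt_p_value and the log-likelihood values are hypothetical and chosen only for illustration. The df argument is the number of coefficients removed in the reduced model, which is the standard choice of degrees of freedom for this test.

```python
from scipy.stats import chi2


def lrt_p_value(llf_full, llf_reduced, df=1):
    """Return the likelihood ratio statistic and its chi-square p-value.

    llf_full and llf_reduced are log-likelihoods of the full and reduced
    models; df is the number of coefficients dropped in the reduced model.
    """
    stat = 2.0 * (llf_full - llf_reduced)
    return stat, chi2.sf(stat, df)


# Made-up log-likelihoods, for illustration only.
stat, p_single = lrt_p_value(llf_full=-1200.0, llf_reduced=-1204.5, df=1)

# The same drop in log-likelihood is weaker evidence when the removed
# feature is encoded as several columns (here 5), because df is larger.
_, p_grouped = lrt_p_value(llf_full=-1200.0, llf_reduced=-1204.5, df=5)

print(f"statistic={stat:.2f}, p (df=1)={p_single:.4f}, p (df=5)={p_grouped:.4f}")
```

With these made-up numbers the single-column case falls below the 0.05 threshold while the five-column case does not, which is why the degrees of freedom matter for grouped features.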
LRT Prerequisites
To run LRT, you must have the following:
A good full VFL model
VFL model data config
VFL model config
VFL model session metrics
LRT Example
Specify the data_config from the VFL model.
data_config = {
    provider_name: {
        "label_client": False,
        "predictors": provider_features,
        "target": None,
        "backend": {"name": "pandas"},
    },
    consumer_train_name: {
        "label_client": True,
        "predictors": consumer_features,
        "target": target,
        "sample_weight": exposure,
        "backend": {"name": "pandas"},
    },
}
Specify the model_config from the VFL model.
model_config = {
    "strategy": {"name": "VflGlm", "params": {}},
    "model": {
        provider_name: {"params": {"input_size": len(provider_features), "output_activation": "exp"}},
        consumer_train_name: {"params": {"input_size": len(consumer_features), "output_activation": "exp"}},
    },
    "ml_task": {
        "type": "regression",
        "loss_function": "poisson",
        "params": {},
    },
    "lrscheduler": {"name": "onecyclelr", "params": {"max_lr": 0.1}},
    "optimizer": {"name": "Adam", "params": {"learning_rate": 0.05}},
    "init_params": {consumer_train_name: baseline_init},
    "seed": 42,
    "calculate_training_metrics": True,
}
Specify the feature groups to test. Each key is a feature name, and each value lists the model input columns that belong to that feature (in this example, encoded categorical levels or a normalized numeric column).
likelihood_ratio_test_group = {
    "has_basement": ["has_basement_yes"],
    "use": ["use_P", "use_R"],
    "constructions": [
        "constructions_CW",
        "constructions_PF",
        "constructions_SF",
        "constructions_SM",
        "constructions_TF",
    ],
    "floor_level": ["floor_level_norm"],
    "height": ["height_norm"],
    "storeys": ["storeys_norm"],
    "carrier_num_feature_1": ["carrier_num_feature_1_norm"],
    "carrier_num_feature_2": ["carrier_num_feature_2_norm"],
    "carrier_num_feature_3": ["carrier_num_feature_3_norm"],
    "carrier_num_feature_4": ["carrier_num_feature_4_norm"],
    "carrier_cat_feature_1": ["carrier_cat_feature_1_A", "carrier_cat_feature_1_B"],
    "carrier_cat_feature_2": ["carrier_cat_feature_2_A", "carrier_cat_feature_2_B"],
    "carrier_cat_feature_3": ["carrier_cat_feature_3_A", "carrier_cat_feature_3_B"],
}
Run LRT with the batch runner
batch_id, batch_sessions = client.run_vfl_batch(
    name="Testing VFL batch lrt",
    description="batching VFL",
    prl_session_id=prl_session.id,
    vfl_mode="train",
    min_num_clients=2,
    num_rounds=rounds,
    package_name="iai_glm",
    data_config=data_config,
    model_config=model_config,
    data_strategy="lrt",
    group_config=likelihood_ratio_test_group,
    # How many sessions to run in parallel
    capacity=5,
    tasks=[
        iai_tb_aws_consumer.vfl_train(
            train_dataset_name=consumer_train_name,
            test_dataset_name=consumer_test_name,
            client_name=consumer_train_name,
            batch_size=v_batch_size,
        ),
        iai_tb_aws_provider.vfl_train(
            train_dataset_name=provider_name,
            test_dataset_name=provider_name,
            client_name=provider_name,
            batch_size=v_batch_size,
        ),
    ],
)
Calculate the test statistics
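from scipy.stats import chi2
import pandas as pd

# vfl_metrics must hold the metrics of the FULL VFL model, which should already be available.
llf_full = vfl_metrics['train_loglike']
lrt_pval = {}
# Each session in the batch is one reduced LRT model.
# batch_sessions has the format {session_id: session_obj}.
for session_id, session in batch_sessions.items():
    # Training log-likelihood of the reduced model, read from the last client_metrics entry.
    llf_reduced = list(session.metrics().as_dict()['client_metrics'][-1].values())[0]['train_loglike']
    # Likelihood ratio statistic: -2 * (reduced log-likelihood - full log-likelihood).
    LRT = -2 * (llf_reduced - llf_full)
    # Degrees of freedom = number of coefficients removed in the reduced model.
    # 1 is exact for single-column groups; a group with k columns calls for k.
    p_val = chi2.sf(LRT, 1)
    lrt_pval[session_id] = p_val
lrt_result = pd.Series(lrt_pval, name='LRT')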
The lrt_result is a pandas Series, that is, a single-column table with rows corresponding to features (the keys in likelihood_ratio_test_group) and values equal to the LRT p-values.
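Once lrt_result is available, the 0.05 threshold can be applied directly to the Series. The sketch below uses hypothetical p-values, not output from a real run, and the variable names alpha, significant, and not_significant are illustrative.

```python
import pandas as pd

# Hypothetical p-values, for illustration only.
lrt_result = pd.Series(
    {"has_basement": 0.31, "use": 0.004, "height": 0.048, "storeys": 0.72},
    name="LRT",
)

alpha = 0.05  # significance threshold used in this section

# Features whose removal significantly worsens the fit, strongest evidence first.
significant = lrt_result[lrt_result < alpha].sort_values()

# Candidates for removal, weakest evidence (largest p-value) first.
not_significant = lrt_result[lrt_result >= alpha].sort_values(ascending=False)

print("Significant features:", list(significant.index))
print("Not significant:", list(not_significant.index))
```

Sorting the non-significant features by descending p-value gives a natural order in which to consider dropping them from the model.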