Likelihood Ratio Testing (LRT)

LRT is a statistical approach to variable selection. It runs a likelihood ratio test that compares a full model (with all features as predictors) against a reduced model (with all features except one particular feature, X) to test whether X is significant.

The end result, lrt_result, is a pandas Series (a single-column table) with one row per tested feature (the keys in likelihood_ratio_test_group), where each value is the p-value from the LRT for that feature.

If the LRT yields a p-value below 0.05, feature X is considered statistically significant at the 5% level.
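As a minimal sketch of the test itself, the statistic is twice the gap between the full and reduced log-likelihoods, and the p-value comes from a chi-squared distribution whose degrees of freedom equal the number of parameters dropped. The helper name lrt_p_value and the numbers below are illustrative assumptions, not part of the product API.

```python
from scipy.stats import chi2


def lrt_p_value(llf_full: float, llf_reduced: float, df: int = 1) -> float:
    """Return the p-value of a likelihood ratio test.

    llf_full and llf_reduced are the maximized log-likelihoods of the full
    and reduced models; df is the number of parameters dropped from the
    full model to obtain the reduced one.
    """
    # Test statistic: twice the log-likelihood gap. Clamp at zero because
    # optimizer noise can make the reduced model look marginally better.
    stat = max(0.0, 2.0 * (llf_full - llf_reduced))
    # Under the null hypothesis the statistic is asymptotically chi-squared.
    return float(chi2.sf(stat, df))


# Illustrative numbers only (not taken from a real session).
p = lrt_p_value(llf_full=-1000.0, llf_reduced=-1003.0, df=1)
print(f"p-value = {p:.4f}")
```

A larger log-likelihood gap gives a smaller p-value, which is the evidence that the dropped feature matters.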

LRT Prerequisites

To run LRT, you must have the following:

  • A trained full VFL model (with all features as predictors)

  • VFL model data config

  • VFL model config

  • VFL model session metrics

LRT Example

Specify the data_config from the VFL model.

data_config = {
    provider_name: {
        "label_client": False,
        "predictors": provider_features,
        "target": None,
        "backend": {"name": "pandas"},
    },
    consumer_train_name: {
        "label_client": True,
        "predictors": consumer_features,
        "target": target,
        "sample_weight": exposure,
        "backend": {"name": "pandas"},
    },
}

Specify the model_config from the VFL model.

model_config = {
    "strategy": {"name": "VflGlm", "params": {}},
    "model": {
        provider_name: {"params": {"input_size": len(provider_features), "output_activation": "exp"}},
        consumer_train_name: {"params": {"input_size": len(consumer_features), "output_activation": "exp"}},
    },
    "ml_task": {
        "type": "regression",
        "loss_function": "poisson",
        "params": {},
    },
    "lrscheduler": {"name": "onecyclelr", "params": {"max_lr": 0.1}},
    "optimizer": {"name": "Adam", "params": {"learning_rate": 0.05}},
    "init_params": {consumer_train_name: baseline_init},
    "seed": 42,
    "calculate_training_metrics": True
}

Specify the feature groups to test. Each key names a feature, and its value lists the model input columns for that feature (for example, its one-hot encoded columns).

likelihood_ratio_test_group = {
    "has_basement": ["has_basement_yes"],
    "use": ["use_P", "use_R"],
    "constructions": [
        "constructions_CW",
        "constructions_PF",
        "constructions_SF",
        "constructions_SM",
        "constructions_TF",
    ],
    "floor_level": ["floor_level_norm"],
    "height": ["height_norm"],
    "storeys": ["storeys_norm"],
    "carrier_num_feature_1": ["carrier_num_feature_1_norm"],
    "carrier_num_feature_2": ["carrier_num_feature_2_norm"],
    "carrier_num_feature_3": ["carrier_num_feature_3_norm"],
    "carrier_num_feature_4": ["carrier_num_feature_4_norm"],
    "carrier_cat_feature_1": ["carrier_cat_feature_1_A", "carrier_cat_feature_1_B"],
    "carrier_cat_feature_2": ["carrier_cat_feature_2_A", "carrier_cat_feature_2_B"],
    "carrier_cat_feature_3": ["carrier_cat_feature_3_A", "carrier_cat_feature_3_B"],
}

Run LRT with the batch runner.

batch_id, batch_sessions = client.run_vfl_batch(
    name="Testing VFL batch lrt",
    description="batching VFL",
    prl_session_id=prl_session.id,
    vfl_mode='train',
    min_num_clients=2,
    num_rounds=rounds,
    package_name="iai_glm",
    data_config=data_config,
    model_config=model_config,
    data_strategy="lrt",
    group_config=likelihood_ratio_test_group,
    # How many sessions to run in parallel
    capacity=5,
    tasks=[
        iai_tb_aws_consumer.vfl_train(
            train_dataset_name=consumer_train_name,
            test_dataset_name=consumer_test_name,
            client_name=consumer_train_name,
            batch_size=v_batch_size,
        ),
        iai_tb_aws_provider.vfl_train(
            train_dataset_name=provider_name,
            test_dataset_name=provider_name,
            client_name=provider_name,
            batch_size=v_batch_size,
        ),
    ],
)

Calculate the test statistics.

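from scipy.stats import chi2
import pandas as pd

# vfl_metrics must hold the metrics of the FULL VFL model, which should already be available.
llf_full = vfl_metrics["train_loglike"]

lrt_pval = {}

# batch_sessions has the form {session_id: session_obj}, one session per reduced model.
for session_id, session in batch_sessions.items():
    # Training log-likelihood of the reduced model, read from the last entry of client_metrics.
    last_metrics = session.metrics().as_dict()["client_metrics"][-1]
    llf_reduced = list(last_metrics.values())[0]["train_loglike"]
    # Test statistic: twice the log-likelihood gap between the full and reduced models.
    lrt_stat = -2 * (llf_reduced - llf_full)
    # df=1 assumes the reduced model drops a single parameter. For a group that drops
    # several columns, the degrees of freedom should be the number of dropped columns.
    lrt_pval[session_id] = chi2.sf(lrt_stat, 1)

lrt_result = pd.Series(lrt_pval, name="LRT")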

lrt_result is a pandas Series (a single-column table) whose rows correspond to features (the keys in likelihood_ratio_test_group) and whose values are the LRT p-values.
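Once the p-values are in a Series, picking out the significant features is a simple filter. The sketch below uses made-up p-values and an assumed threshold variable alpha purely for illustration; with real results, lrt_result would come from the calculation above.

```python
import pandas as pd

# Made-up p-values for illustration only.
lrt_result = pd.Series(
    {"has_basement": 0.003, "use": 0.21, "height": 0.04},
    name="LRT",
)

alpha = 0.05  # significance threshold (assumed; matches the 0.05 rule above)

# Features whose removal significantly worsens the fit, most significant first.
significant = lrt_result[lrt_result < alpha].sort_values()
# Candidates for removal from the model.
not_significant = lrt_result[lrt_result >= alpha]

print(significant)
print(not_significant)
```

Because every feature is tested separately, running many tests at a fixed threshold raises the chance of a false positive, so a stricter threshold may be worth considering when the group list is long.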