EDA in Individual Mode

The Exploratory Data Analysis (EDA) feature for horizontal federated learning (HFL) enables you to access summary statistics about a group of datasets without needing access to the data itself. This allows you to get a basic understanding of the dataset when you don’t have access to the data or you are not allowed to do any computations on the data.

EDA is an important pre-step for federated modelling and a simple form of federated analytics. The feature has a built-in differential privacy (DP) setting: noise is added dynamically to each histogram that is generated for each feature in a participating dataset. This added privacy protection introduces slight noise into the end result.
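This page does not state which noise mechanism the platform applies. As an illustration only, the sketch below uses the Laplace mechanism, a common way to make histogram counts differentially private. The function names, the bin edges, and the epsilon value are hypothetical and are not part of the integrate.ai SDK.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def histogram_counts(values, bin_edges):
    """Count values per bin; bins are [lo, hi) except the last, which is [lo, hi]."""
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            in_bin = bin_edges[i] <= v < bin_edges[i + 1]
            if in_bin or (i == last and v == bin_edges[-1]):
                counts[i] += 1
                break
    return counts


def noisy_histogram(values, bin_edges, epsilon, rng):
    """Add Laplace noise with scale 1/epsilon to each bin count.

    Adding or removing one record changes exactly one count by 1,
    so the sensitivity of the histogram is 1.
    """
    exact = histogram_counts(values, bin_edges)
    noisy = [c + laplace_noise(1.0 / epsilon, rng) for c in exact]
    return exact, noisy


# Hypothetical data and parameters, chosen only for illustration
rng = random.Random(0)  # fixed seed so the sketch is repeatable
exact, noisy = noisy_histogram(
    [1, 2, 2, 5, 9, 10], [0, 4, 8, 10], epsilon=1.0, rng=rng
)
print(exact)                         # exact counts per bin
print([round(n, 2) for n in noisy])  # perturbed counts per bin
```

A smaller epsilon gives stronger privacy and larger noise, which is why statistics derived from the histograms differ slightly from the exact values.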

The sample notebook (integrate_ai_api.ipynb) provides runnable code examples for exploring the API, including the EDA feature, and should be used in parallel with this tutorial. This documentation provides supplementary and conceptual information to expand on the code demonstration.

API Reference for EDA

The core API module that contains the EDA-specific functionality is integrate_ai_sdk.api.eda. This module includes a core object called EdaResults, which contains the results of an EDA session.

If you are comfortable working with the integrate.ai SDK and API, see the API Documentation for details.

This tutorial assumes that you have correctly configured your environment for working with integrate.ai, as described in Using integrate.ai.

Configure an EDA Session

To begin exploratory data analysis, you must first create a session.

The eda_data_config file is a configuration file that maps the name of one or more datasets to the columns to be pulled. Dataset names and column names are specified as key-value pairs in the file.

In each pair, the key is the name of a dataset that is expected for the EDA session, and the value is a list of the columns to analyze from that dataset. Columns can be specified by name (string) or by index (integer). An empty list retrieves all columns from that dataset.

If a dataset name is not included in the configuration file, all columns from that dataset are used by default.

For example:

To retrieve all columns for a submitted dataset named dataset_one:

eda_data_config = {"dataset_one": []}

To retrieve the first column and the column x2 for a submitted dataset named dataset_one:

eda_data_config = {"dataset_one": [1,"x2"]}

To retrieve the first column and the column x2 for a submitted dataset named dataset_one and all columns in a dataset named dataset_two:

eda_data_config = {"dataset_one": [1,"x2"],"dataset_two": []}
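The shape rules above can be checked locally before a session is created. The following is a minimal sketch; validate_eda_data_config is a hypothetical helper written for this tutorial, not an SDK function, and it checks only the types described in this section.

```python
def validate_eda_data_config(config):
    """Check that a config maps dataset names to lists of column names or indices.

    Hypothetical helper, not part of the integrate.ai SDK.
    """
    if not isinstance(config, dict):
        raise TypeError("eda_data_config must be a dict")
    for dataset_name, columns in config.items():
        if not isinstance(dataset_name, str):
            raise TypeError(f"dataset name {dataset_name!r} must be a string")
        if not isinstance(columns, list):
            raise TypeError(f"columns for {dataset_name!r} must be a list")
        for column in columns:
            # bool is a subclass of int, so reject it explicitly
            if isinstance(column, bool) or not isinstance(column, (str, int)):
                raise TypeError(
                    f"column {column!r} in {dataset_name!r} must be a name or an index"
                )
    return config


# An empty list is valid and means "all columns"
eda_data_config = validate_eda_data_config(
    {"dataset_one": [1, "x2"], "dataset_two": []}
)
```

Passing a bare string instead of a list, for example {"dataset_one": "x2"}, raises a TypeError in this sketch.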

Specifying the number of datasets

You can manually set the number of datasets that are expected to be submitted for an EDA session by specifying a value for num_datasets.

If num_datasets is not specified, the number is inferred from the number of datasets provided in eda_data_config.
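The inference rule can be pictured with the short sketch below. expected_dataset_count is a hypothetical function that mirrors the behaviour described here; it is not the SDK's implementation.

```python
def expected_dataset_count(eda_data_config, num_datasets=None):
    """Return num_datasets if given, otherwise the number of configured datasets.

    Hypothetical helper, not part of the integrate.ai SDK.
    """
    if num_datasets is not None:
        return num_datasets
    return len(eda_data_config)


config = {"dataset_one": [1, "x2"], "dataset_two": []}
print(expected_dataset_count(config))                  # 2
print(expected_dataset_count(config, num_datasets=3))  # 3
```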

Create and start an EDA session

The following code sample creates and starts an EDA session to perform privacy-preserving data analysis on two datasets, named dataset_one and dataset_two. It returns an EDA session ID that you can use to track and reference your session.

eda_data_config = {"dataset_one": [1,"x5","x7"], "dataset_two": ["x1","x10","x11"]}

eda_session = client.create_eda_session(
    name="EDA Individual Session",
    description="Testing EDA individual mode through a notebook",
    data_config=eda_data_config,
    eda_mode="individual",           # Generates histograms on single nodes
    single_client_2d_pairs=None,     # Optional; only required for 2D histograms
).start()

eda_session.id

Session parameters:

  • eda_mode (str, optional): one of {‘individual’, ‘intersect’}. Defaults to ‘individual’.

  • single_client_2d_pairs (Dict, optional): a data_config-like dict with the column names to use to generate 2D histograms when both columns belong to the same dataset. This option only considers single-node 2D histograms and is valid for both ‘intersect’ and ‘individual’ modes. Defaults to None.

Note that, unlike other sessions, you do not need to specify a model_config file for EDA. This is because there are no editable parameters.

The eda_data_config used here specifies that the first column (1), column x5, and column x7 will be analyzed for dataset_one and columns x1, x10, and x11 will be analyzed for dataset_two.

Since the num_datasets argument is not provided to client.create_eda_session(), the number of datasets is inferred as two from the eda_data_config.

For more information, see the create_eda_session() definition in the API documentation.

Analyze the EDA results

The results object is a dataset collection comprised of multiple datasets that can be retrieved by name. Each dataset is comprised of columns that can be retrieved by either column name or by index.

You can perform the same base analysis functions at the collection, dataset, or column level.

results = eda_session.results()

Example output:

EDA Session collection of datasets: ['dataset_two', 'dataset_one']

Describe

Use the .describe() method to retrieve a standard set of descriptive statistics.

If a statistical function is invalid for a column (for example, mean requires a continuous column and x10 is categorical), or the column from one dataset is not present in the other (for example, here x5 is in dataset_one, but not dataset_two), then the result is NaN.

results.describe()

EDA describe

results["dataset_one"].describe()

EDA describe
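The NaN rule described in this section can be illustrated on plain local data. This is a sketch only: column_mean and the sample values are hypothetical, and the real EDA results are derived from aggregated histograms rather than from raw values.

```python
import math


def column_mean(dataset, column):
    """Return the mean of a numeric column, or NaN if it is missing or non-numeric.

    Hypothetical helper, not part of the integrate.ai SDK.
    """
    values = dataset.get(column)
    if not values:
        return math.nan
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if not numeric:
        return math.nan
    return sum(values) / len(values)


# Hypothetical local data, used only to illustrate the rule
dataset_one = {"x5": [1.0, 2.0, 3.0]}
dataset_two = {"x10": ["a", "b", "a"]}

print(column_mean(dataset_one, "x5"))   # 2.0
print(column_mean(dataset_two, "x10"))  # nan (categorical column)
print(column_mean(dataset_two, "x5"))   # nan (column not present)
```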

Statistics

For categorical columns, you can use other statistics for further exploration. For example, unique_count, mode, and uniques.

results["dataset_two"][["x10", "x11"]].uniques()

EDA statistics
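To make the meaning of these three statistics concrete, the sketch below computes them for a small local list using only the Python standard library. categorical_summary and the sample values are hypothetical; the EDA results object computes its statistics from the federated session, not from raw values like these.

```python
from collections import Counter


def categorical_summary(values):
    """Return the distinct values, their count, and the most frequent value.

    Hypothetical helper, not part of the integrate.ai SDK.
    """
    counts = Counter(values)
    return {
        "uniques": sorted(counts),
        "unique_count": len(counts),
        "mode": counts.most_common(1)[0][0] if counts else None,
    }


summary = categorical_summary(["red", "blue", "red", "green", "red"])
print(summary["uniques"])       # ['blue', 'green', 'red']
print(summary["unique_count"])  # 3
print(summary["mode"])          # red
```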

You can call functions such as .mean(), .median(), and .std() individually.

results["dataset_one"].mean()

EDA mean

results["dataset_two"]["x1"].mean()

EDA describe

Histograms

You can create histogram plots using the .plot_hist() function.

saved_dataset_two_hist_plots = results["dataset_two"].plot_hist()

EDA histogram

EDA histogram 2

single_hist = results["dataset_two"]["x1"].plot_hist()

EDA single histogram
