Data Requirements¶
To run a session, the data must be prepared according to the following standards:

- The data configuration specified in the Session configuration
- The data requirements given below for running a modelling pipeline with integrate.ai
integrate.ai supports:

- Private Record Linkage (PRL): discover the overlapping population between your data and your partner’s data without exposing individual records.
- Exploratory Data Analysis (EDA): generate summary statistics on a dataset to learn about its properties and distributions.
- Horizontal federated learning (HFL): train models across different siloed datasets that hold different samples over the same set of features, without transferring data between each silo.
- Vertical federated learning (VFL): train models across different siloed datasets that hold different features belonging to the same set of samples, without transferring data between each silo.
Data format requirements¶
Data should be in Apache Parquet format (preferred) or .csv, and can be hosted locally, in an S3 bucket, or in an Azure blob. Note: .csv is only recommended for data files smaller than 1GB.

Proper row groups must be used when exporting the data to Parquet files. Depending on the tool used to write the Parquet file, you can do so by specifying the row_group_size as the number of records per group (recommended value: 500k records), or by setting the block_size as the physical storage size per group (recommended value: 100MB). Large datasets are processed in moderate-sized partitions (row groups) by default. Each partition should be large enough to contain sufficient information about the whole dataset, but not so large that it cannot be processed efficiently in terms of memory and computation.
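For example, here is a minimal sketch using pyarrow (one common way to control row groups; other Parquet writers expose similar options, and the file names are hypothetical):

# Minimal sketch: writing a DataFrame to Parquet with explicit row groups (pyarrow).
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv("my_dataset.csv")  # hypothetical source file
table = pa.Table.from_pandas(df)

# Recommended: roughly 500k records per row group.
pq.write_table(table, "my_dataset.parquet", row_group_size=500_000)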
The matching columns (the columns used by the matching algorithm between datasets for PRL) must be defined; that is, you must specify which columns should be used for the match. For information about handling duplicated values in these matching columns, see the section Configure PRL session.

Column names must be consistent across datasets. All column names (predictors and targets) must start with a letter and contain only letters, numbers, dashes (-), and underscores (_). You can select which columns you want to use in a specific training session. A quick way to check this naming rule is sketched below.
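As an illustrative check (not part of the integrate.ai API), the naming rule can be expressed as a simple regular expression:

# Hypothetical helper: flag column names that break the naming rule above.
import re

VALID_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_-]*$")

def invalid_column_names(columns):
    """Return the column names that do not follow the naming rule."""
    return [c for c in columns if not VALID_NAME.match(c)]

print(invalid_column_names(["age", "marital_status", "2021_income", "claim amount"]))
# ['2021_income', 'claim amount']  (starts with a digit / contains a space)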
Feature engineering requirements¶
For data to be used in predictive modelling sessions on the platform (e.g., GLMs, GBMs, NNs), it must be fully feature engineered. Note that the registered datasets must always include the original columns with their original names, as well as the feature-engineered columns.

The requirements for the processed columns that will be used as input for modelling pipelines are as follows:
All processed columns must be numerical, with the exception of ID columns used for PRL.
Processed columns must not contain NULL values; that is, missing values must be imputed.
Each row must correspond to one observation.
Continuous variables should be standardized to have mean = 0 and standard deviation = 1. This is strongly recommended for GLMs and NNs, but is not required for GBMs.
Categorical variables must be encoded (for example, by one-hot encoding). For instance, if there is a column marital_status with the values married and divorced, the dataset should contain three columns: the original marital_status column, marital_status_married, and marital_status_divorced (a minimal sketch is shown after this list).
Feature engineering must be consistent across the data products. For example, if the datasets contain categorical values, such as country, these values must be encoded the same way across all the datasets: the same country value must translate to the same numerical values in every dataset that participates in the training.
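As a minimal sketch of the standardization and encoding requirements above, using pandas with hypothetical column names (in practice, apply the same transformations, with the same parameters, to every dataset that participates in training):

# Minimal sketch of feature engineering with pandas (hypothetical columns).
import pandas as pd

df = pd.DataFrame({
    "income": [52000.0, 61000.0, None, 48000.0],
    "marital_status": ["married", "divorced", "married", "divorced"],
})

# Impute missing values (here: the median), so no NULLs remain in processed columns.
df["income_processed"] = df["income"].fillna(df["income"].median())

# Standardize continuous variables to mean = 0, std = 1.
df["income_processed"] = (
    df["income_processed"] - df["income_processed"].mean()
) / df["income_processed"].std()

# One-hot encode categorical variables, keeping the original column as well.
dummies = pd.get_dummies(df["marital_status"], prefix="marital_status").astype(int)
df = pd.concat([df, dummies], axis=1)

print(df.columns.tolist())
# ['income', 'marital_status', 'income_processed',
#  'marital_status_divorced', 'marital_status_married']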
Sample Weighting¶
By default, each sample has an equal weight when computing the loss and other subsequent quantities (such as the gradients). To assign different weights to samples, you can assign prior weights.
Conceptually, prior weights allow information about the known credibility of each observation to be incorporated in the model. For example, if modeling claims frequency, one observation might relate to one month’s exposure, and another to one year’s exposure. There is more information and less variability in the observation relating to the longer exposure period, and this can be incorporated in the model by defining the prior weight to be the exposure of each observation. In this way observations with higher exposure are deemed to have lower variance, and the model will consequently be more influenced by these observations.
Sample weight usage¶
# HFL sample weight example
data_config = {
    "predictors": ...,
    "target": ...,
    "sample_weight": "exposure"
}

# VFL sample weight example
data_config = {
    "active": {
        "predictors": ...,
        "target": ...,
        "sample_weight": "exposure"
    },
    "passive": {
        ...
    }
}
To use sample weights, add the parameter sample_weight to the data configuration, as in the code examples above, where the sample weight column is named exposure.

Sample weights are required to be strictly positive. No other preprocessing is necessary. In both HFL and VFL, the weights are automatically normalised to sum to 1 within each batch, to ensure consistent scaling of the loss across batches.

For VFL, the sample weights can only be provided by the active party, because the active party computes the loss and other byproducts (e.g., gradients).
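The platform performs this normalisation automatically; purely as an illustration of what per-batch normalisation means (nothing you need to implement yourself):

# Illustration only: per-batch weight normalisation (handled by the platform).
import numpy as np

batch_weights = np.array([0.5, 2.0, 1.5])  # strictly positive prior weights
assert (batch_weights > 0).all(), "sample weights must be strictly positive"

normalised = batch_weights / batch_weights.sum()
print(normalised)  # [0.125 0.5   0.375] -> sums to 1 within the batch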
Data Pre-Processing Overview¶
Pre-processing recommendations¶
It is best practice to include data pre-processing information when connecting data to a partner. Below is a recommended template for communicating data pre-processing steps; use it to describe all of the pre-processing performed that goes beyond the standard requirements for the integrate.ai platform, which are listed here.
1. Handling of Nulls/Missing Values
1A) Imputation of Null Values
For every feature in your dataset, describe how missing or null values are handled. Specify whether values are imputed, removed, or replaced with default values, and provide details on the methods or logic used for imputation.
Example: “For the income column, missing values were imputed using the median income of the respective age group. The employment_status column had 5% missing data, which was replaced with ‘Unknown’.”
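A minimal sketch of the imputation described in that example, using pandas with the hypothetical columns age_group, income, and employment_status:

# Hypothetical imputation matching the example above (pandas).
import pandas as pd

df = pd.DataFrame({
    "age_group": ["18-25", "18-25", "26-35", "26-35"],
    "income": [30000.0, None, 55000.0, None],
    "employment_status": ["employed", None, "self-employed", "employed"],
})

# Impute income with the median income of the respective age group.
df["income"] = df.groupby("age_group")["income"].transform(
    lambda s: s.fillna(s.median())
)

# Replace missing employment_status values with 'Unknown'.
df["employment_status"] = df["employment_status"].fillna("Unknown")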
1B) Fill Rate (optional)
If available, provide fill rates for each column before any processing steps are applied.
Example: “Column fill rates before processing:
age: 98%
income: 90%
employment_status: 95%”
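A minimal sketch of computing fill rates with pandas (the file name is hypothetical):

# Minimal sketch: per-column fill rates before any processing (pandas).
import pandas as pd

df = pd.read_parquet("my_dataset.parquet")  # hypothetical raw dataset

fill_rates = df.notna().mean().mul(100).round(1)
print(fill_rates)  # percentage of non-null values per column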
2. Additional Data Pre-Processing
Please describe any data pre-processing steps performed beyond what is already in your data dictionary and beyond the standard requirements outlined by integrate.ai. Include whatever you think would be valuable context to share with your partner. For example:
Additional Normalization Techniques: For instance, using min-max normalization instead of the standard scaler.
Outlier Filtering: Excluding rows with values in a specific range.
Transformations for Abnormal Data Distributions: Applying techniques such as log transformation, Box-Cox transformation, or power transformation to adjust skewed data distributions.
Example: “We applied min-max normalization to numerical fields to scale values between 0 and 1. Additionally, outliers beyond three standard deviations were removed from the dataset to maintain consistency.”
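A minimal sketch of these additional steps using pandas and numpy (the amount column is hypothetical):

# Minimal sketch of the additional pre-processing steps mentioned above.
import numpy as np
import pandas as pd

df = pd.DataFrame({"amount": [10.0, 12.0, 11.0, 250.0, 9.0, 13.0]})

# Min-max normalization: scale values to the [0, 1] range.
df["amount_minmax"] = (df["amount"] - df["amount"].min()) / (
    df["amount"].max() - df["amount"].min()
)

# Outlier filtering: drop rows more than three standard deviations from the mean.
z = (df["amount"] - df["amount"].mean()) / df["amount"].std()
df = df[z.abs() <= 3]

# Transformation for skewed distributions (log1p handles zeros safely).
df["amount_log"] = np.log1p(df["amount"])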