Transform Sessions

Data preprocessing is the process of transforming raw data into a structured format suitable for machine learning models. The integrate.ai platform provides a transform session type that can perform data preprocessing (imputing missing values, standardization, categorical encoding) and basic feature engineering (for example, cardinality reduction) to help you improve model accuracy and efficiency by ensuring the data is clean, consistent, and understandable by algorithms.

Using the Transform Session

The integrate.ai platform provides built-in transformers that are based on preprocessing transformers supported in sklearn.

Available transformers:

Standardization

Standardization of datasets is useful to improve the convergence behaviour of gradient-based machine learning methods. It involves centering each feature to a mean of 0 and scaling it to a variance of 1 (that is, z = (x - mean) / std).

The IAIStandardScaler transform performs both the centering and scaling steps.

  • For GLM and NN models trained with SGD, standardization reduces the risk of exploding or vanishing gradients (which can trigger session failure).

  • For GBM models, standardization is not necessary: tree-based splits are invariant to the scale of the features.

It is possible to disable either centering or scaling by setting the appropriate boolean value in the config.

with_mean: Optional[bool] = True
with_std: Optional[bool] = True
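
For example, a data_config entry that scales a numeric column to unit variance without centering it might look like the following; the column name is a placeholder, and the entry follows the same structure as the full configuration example later in this section:

{
    "type": "IAIStandardScaler",
    "columns": ["num_feature_1"],                       # placeholder column name
    "params": {"with_mean": False, "with_std": True},   # skip centering, keep scaling
}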

Encoding Categorical Values

The one-hot encoder is a one-of-K, or dummy encoding. This type of encoding transforms each categorical feature with n_categories possible values into n_categories binary features, where one category is a one (1), and all others are zero (0). By default, the values each feature can take are inferred automatically from the dataset and can be found in the categories_ attribute.
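
Because the IAI transformers are based on their sklearn counterparts, the behaviour can be illustrated with plain sklearn (a standalone sketch, not platform code; in sklearn versions before 1.2, use sparse instead of sparse_output):

from sklearn.preprocessing import OneHotEncoder
import numpy as np

X = np.array([["a"], ["b"], ["c"], ["a"]])
enc = OneHotEncoder(sparse_output=False)
print(enc.fit_transform(X))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
print(enc.categories_)   # [array(['a', 'b', 'c'], dtype='<U1')]: levels inferred from the data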

The IAIOneHotEncoder creates one binary column for each category level and is typically applied before modelling.

There are a number of parameters you can use to adjust the encoding. They are the same as those available for the sklearn OneHotEncoder implementation.

The most commonly used parameters include the following (see the sketch after the list):

  • drop

  • min_frequency

  • max_categories

  • handle_unknown
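
A minimal sketch of a config entry using some of these parameters (the column name is a placeholder; the parameter semantics mirror sklearn's OneHotEncoder):

{
    "type": "IAIOneHotEncoder",
    "columns": ["cat_feature_1"],                 # placeholder column name
    "params": {
        "drop": "first",                          # drop one level to avoid collinearity
        "min_frequency": 0.05,                    # treat levels rarer than 5% as infrequent
        "handle_unknown": "infrequent_if_exist",  # route unseen levels to the infrequent bucket
    },
}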

Cardinality Reduction

Similar to the IAIOneHotEncoder, the IAICardinalityReducer is based on the cardinality reduction (infrequent-category grouping) capability of the sklearn OneHotEncoder. Instead of creating multiple binary columns, however, it produces a single output column (as the IAIStandardScaler does), and that column remains categorical.

The number of output levels can be controlled by the max_categories parameter. Specify max_categories as a fraction between 0 and 1 to set the threshold for cumulative frequency; levels beyond that threshold are grouped together.
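
For example, a config entry that caps the cardinality of a high-cardinality column might look like this (the column name is a placeholder):

{
    "type": "IAICardinalityReducer",
    "columns": ["high_cardinality_feature"],   # placeholder column name
    "params": {"max_categories": 0.8},         # keep the most frequent levels covering 80% of rows
}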

Imputation of Missing Values

The OHE and StandardScaler transforms handle missing values automatically. The OHE treats missing values as a separate category and puts them in an extra column (for example, a, b, c, missing). The StandardScaler imputes missing values in the raw data with the column mean.

This automatic imputation is useful in many cases. However, consider a feature that contains the number of floors per building (integer values such as 1, 2, 3, 4, 5): when the standard scaler imputes a missing value with the mean, you may end up with a decimal value that does not make sense (e.g., floors=2.5). This does not affect the model, which treats the value as a continuous variable, but if an EDA session is run on the imputed data, the imputed value will be visible and may not match the expected values.
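
The floors example can be reproduced with plain sklearn (a standalone sketch of the same behaviour, not platform code):

from sklearn.impute import SimpleImputer
import numpy as np

floors = np.array([[1.0], [2.0], [np.nan], [4.0], [3.0]])
imp = SimpleImputer(strategy="mean")
print(imp.fit_transform(floors).ravel())
# [1.  2.  2.5 4.  3. ]   <- the missing entry becomes the mean, floors=2.5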

The Simple Imputer provides greater flexibility because it allows you to manually control how missing values are handled. The IAISimpleImputer is based on the sklearn SimpleImputer class.

Typically, imputation is done before other transformations as part of the pipeline. The example below shows two imputations, followed by one-hot encoding and normalization.

Bins Discretizer

The bins discretizer enables you to bin continuous data into intervals.

The bins discretizer implements different binning strategies that can be selected with the strategy parameter. The ‘uniform’ strategy uses constant-width bins. The ‘quantile’ strategy uses quantile values so that the bins in each feature are equally populated. The ‘kmeans’ strategy defines bins based on a k-means clustering procedure performed on each feature independently. The ‘binedges’ strategy allows you to manually specify the cutoff points of your bins as an array, which you pass through the bins argument.
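
As an illustration, a config entry for manual bin edges might look like the following. The transformer type name here is an assumption (check the platform reference for the exact name the SDK exposes); the strategy and bins parameters are the ones described above:

{
    "type": "IAIBinsDiscretizer",   # assumed type name, not confirmed by this document
    "columns": ["num_feature_1"],   # placeholder column name
    "params": {"strategy": "binedges", "bins": [0, 10, 50, 100]},   # manual cutoff points
}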

For more details, see the sklearn K-bins discretization documentation.

Transform Session Data Configuration

The transform session type requires a data_config file that specifies the type and parameters of transformations to be performed.

Example:

transform_data_config = {
    carrier_name: [
        {
            "type": "IAISimpleImputer",
            "columns": ["carrier_cat_feature_1", "carrier_cat_feature_2"],
            "params": {"missing_values": "2", "strategy": "most_frequent"},   # impute categorical variables with the most_frequent category
        },
        {
            "type": "IAISimpleImputer",
            "columns": ["carrier_num_feature_1", "carrier_num_feature_2", "carrier_num_feature_3"],
            "params": {"missing_values": 99.00, "strategy": "mean"},          # impute numerical variables with the mean
        },
        {
            "type": "IAIOneHotEncoder",
            "columns": ["carrier_cat_feature_1", "carrier_cat_feature_2", "carrier_num_feature_1"],
            "params": {"drop": "first", "handle_unknown": "infrequent_if_exist", "max_categories": 0.9},
        },
        {
            "type": "IAIStandardScaler",
            "columns": ["carrier_num_feature_1", "carrier_num_feature_2", "carrier_num_feature_3"],
            "params": {},
        },
    ],
    provider_name: [
        {
            "type": "IAISimpleImputer",
            "columns": ["has_basement", "use", "constructions"],
            "params": {"missing_values": None, "strategy": "most_frequent"},
        },
        {
            "type": "IAISimpleImputer",
            "columns": ["floor_level", "height", "storeys"],
            "params": {"strategy": "median"},
        },
        {
            "type": "IAIOneHotEncoder",
            "columns": ["has_basement", "use", "constructions"],
            "params": {
                "max_categories": 0.9,
                "drop": "first",
                "handle_unknown": "infrequent_if_exist",
            },
        },
        {
            "type": "IAIStandardScaler",
            "columns": ["floor_level", "height", "storeys"],
            "params": {},
        },
    ]
}

Parameters:

  • missing_values: indicates which value(s) should be considered missing. By default, NumPy nan values and float NaN values are treated as missing.

  • strategy: indicates how to impute the missing values. Options are mean, median, most_frequent, and constant. For constant, you must also specify a fill_value.
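
For instance, a config entry that replaces missing values with a fixed constant might look like this; the column name is a placeholder, and fill_value follows the sklearn SimpleImputer parameter of the same name:

{
    "type": "IAISimpleImputer",
    "columns": ["num_feature_1"],   # placeholder column name
    "params": {"missing_values": None, "strategy": "constant", "fill_value": 0},   # replace missing entries with 0
}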

Transform Session Example

There are two transform modes available:

  • individual mode specifies that the transform is performed individually on each client in the transform session

  • intersect mode specifies that the transform operates on the intersection of data obtained from a completed PRL session. In this mode, you must also specify the PRL session ID.

transform_session = client.create_transform_session(
    name="test_transform",
    description="",
    data_config=transform_data_config,
    # transform_mode="individual",   # uncomment to transform each client's data separately
    transform_mode="intersect",      # operate on the PRL intersection
    prl_session_id=prl_session_id,   # required when transform_mode="intersect"
).start()

transform_session.id   # the ID of the started transform session