Principal Component Analysis (PCA)¶
Principal Component Analysis (PCA) is a dimensionality reduction technique. Rather than selecting a subset of the original features, it generates a transformation to apply to a dataset that reduces the number of features used while retaining enough information to represent the original dataset.
The main use of PCA is to help researchers uncover the internal structure underlying the data. This structure is helpful mainly in two applications:
Helps discover new groups and clusters among patients
More efficient modelling: achieving the same or even better predictive power with less input and simpler models.
The iai_pca session outputs a transformation matrix that is generated in a distributed manner. This transformation can be applied to each dataset of interest.
The underlying technology of this session type is FedSVD (federated singular value decomposition), which can also be used in other dimensionality reduction methods.
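For intuition, the following is a minimal, centralized sketch of the decomposition that FedSVD performs in a distributed manner across clients; the toy data and the choice of three components are illustrative, not part of the integrate.ai API:

import torch

# Toy data: 100 samples, 10 features (illustrative only)
X = torch.randn(100, 10)

# PCA operates on mean-centered data
X_centered = X - X.mean(dim=0)

# Singular value decomposition of the centered data matrix
U, S, Vh = torch.linalg.svd(X_centered, full_matrices=False)

components = Vh[:3]                          # first 3 principal axes
eigenvalues = S[:3] ** 2 / (X.shape[0] - 1)  # variance explained by each axis
X_pc = X_centered @ components.T             # projection onto the component space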
For a real-world example of using a PCA session, see the pca_face_decomposition.ipynb notebook available in the sample repository. Sample data for running this notebook is available in a zip file here.
Configure a PCA Session¶
This tutorial assumes that you have correctly configured your environment for working with integrate.ai, as described in Installing the SDK and Using integrate.ai. It uses a locally run Docker container for the client. integrate.ai manages the server component.
This example describes how to train an HFL session with the built-in package iai_pca, which performs principal component analysis on the raw input data and outputs the desired number of principal axes in the feature space along with the corresponding eigenvalues. The resulting model can be used to transform data points from the raw input space to the principal component space.
Sample model config and data config for PCA¶
In the model_config, use the strategy FedPCA with the iai_pca package.
model_config = {
    "experiment_name": "test_pca",
    "experiment_description": "test_pca",
    "strategy": {"name": "FedPCA", "params": {}},
    "model": {
        "params": {
            "n_features": 10,  # optional
            "n_components": 3,
            "whiten": False,
        }
    },
}
data_config = {
    "predictors": []
}
There are three parameters that can be configured:
n_features: the number of input features. Optional; if not specified, it is inferred from the data config.
n_components: the number of principal components to keep. Can be an integer or a fraction in (0, 1). In the latter case, the number of components is selected such that the fraction of variance explained is greater than the value specified by n_components.
whiten: when True (False by default), the principal axes are modified to ensure uncorrelated outputs with unit component-wise variances. Whitening removes some information from the transformed signal (the relative variance scales of the components), but it can sometimes improve the predictive accuracy of downstream estimators by making their data respect certain hard-wired assumptions.
The example performs PCA on the 10 input features and outputs the first 3 principal axes.
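To instead select the number of components by explained variance, n_components can be given as a fraction. A sketch of such a config, with illustrative names and values:

model_config_fraction = {
    "experiment_name": "test_pca_fraction",
    "experiment_description": "test_pca_fraction",
    "strategy": {"name": "FedPCA", "params": {}},
    "model": {
        "params": {
            # Keep as many components as needed to explain at least 95% of the variance
            "n_components": 0.95,
            # Rescale the outputs to unit component-wise variance
            "whiten": True,
        }
    },
}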
Create and start a PCA training session¶
Note that the num_rounds parameter is ignored when iai_pca is used, because PCA sessions determine the number of rounds required automatically.
Make sure that the sample data you downloaded is either saved to your ~/Downloads directory, or update the data_path below to point to the sample data.
data_path = "~/Downloads/synthetic"
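The task group created below expects a train path for each client and a shared test path. A minimal sketch, assuming the sample data uses a per-silo naming scheme; the train file names here are assumptions (only test.parquet is referenced later in this notebook), so adjust them to match your download:

# Per-client train file names are assumptions; adjust to your extracted sample data
train_path1 = f"{data_path}/train_silo0.parquet"
train_path2 = f"{data_path}/train_silo1.parquet"
test_path = f"{data_path}/test.parquet"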
pca_session = client.create_fl_session(
    name="Testing PCA",
    description="Testing a PCA session through a notebook",
    min_num_clients=2,
    num_rounds=2,  # ignored for iai_pca; rounds are determined automatically
    package_name="iai_pca",
    model_config=model_config,
    data_config=data_config,
).start()
pca_session.id
Create a task group for the two clients.
from integrate_ai_sdk.taskgroup.taskbuilder import local
from integrate_ai_sdk.taskgroup.base import SessionTaskGroup

tb = local.local_docker(
    client,
    docker_login=False,
)

task_group_context = (
    SessionTaskGroup(pca_session)
    .add_task(tb.hfl(train_path=train_path1, test_path=test_path, client_name="client1"))
    .add_task(tb.hfl(train_path=train_path2, test_path=test_path, client_name="client2"))
    .start()
)
Wait for the session to complete successfully. This code returns True when the session is complete.
task_group_context.wait(60 * 5)  # wait up to 5 minutes for completion
View PCA session results¶
Note that the reverse process of mapping data points from the principal component space back to the original feature space can be treated as a multivariate regression task (i.e., reconstructing the raw features). We therefore log some regression metrics (e.g., loss/mse and r2) for PCA sessions. They can be retrieved as follows.
pca_session.metrics().as_dict()
The PCA results can be retrieved as follows. The result is a standard PyTorch model object, and the .state_dict() method can be used to show all stored parameters, including the principal axes matrix and the corresponding eigenvalues.
pca_transformation = pca_session.model().as_pytorch()
pca_transformation.state_dict()
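To inspect what the model stores without printing full tensors, you can list the entries of the state dict; the exact parameter names depend on the iai_pca package internals:

# Print each stored parameter's name and shape (names depend on the package)
for name, tensor in pca_transformation.state_dict().items():
    print(name, tuple(tensor.shape))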
We can also transform data points from the original feature space to the principal component space by directly calling the model object on the data tensors.
import pandas as pd
test_data = pd.read_parquet(f"{data_path}/test.parquet")
test_data.head()
import torch

# data_schema is assumed to hold the list of predictor column names; define it
# (or substitute your own column list) if it is not already set in your notebook
test_data_tensor = torch.tensor(test_data[data_schema["predictors"]].values)
test_data_pc = pca_transformation(test_data_tensor)
test_data_pc.shape
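To carry the transformed points into downstream (e.g., pandas-based) modelling, the tensor can be wrapped back into a DataFrame; the pc column names below are illustrative:

# Label the derived features pc0, pc1, ... (names are arbitrary)
pc_columns = [f"pc{i}" for i in range(test_data_pc.shape[1])]
test_data_pc_df = pd.DataFrame(test_data_pc.detach().numpy(), columns=pc_columns)
test_data_pc_df.head()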