Differential Privacy

What is differential privacy?

Differential privacy (DP) is a technique that provides a provable measure of how much a computation on a data set can reveal about any individual record, i.e. how “private” the data set remains. This is achieved by adding a calibrated amount of noise when responding to a query on the data. A balance needs to be struck between adding too much noise (making the computation less useful) and too little (weakening the privacy of the underlying data).
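
For example, a counting query can be released with Laplace noise calibrated to its sensitivity. The sketch below is illustrative only and is not part of integrate.ai; it assumes a sensitivity of 1, which holds for any counting query.

```python
import numpy as np

def noisy_count(data, predicate, epsilon):
    """Release a count under the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one record
    changes the count by at most 1), so noise drawn from
    Laplace(scale=1/epsilon) satisfies epsilon-differential privacy.
    """
    true_count = sum(1 for row in data if predicate(row))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Smaller epsilon -> more noise -> stronger privacy, lower accuracy.
ages = [23, 35, 45, 52, 61, 70]
print(noisy_count(ages, lambda a: a >= 40, epsilon=0.5))
print(noisy_count(ages, lambda a: a >= 40, epsilon=5.0))
```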

The technique introduces a privacy-loss parameter, typically represented by ε (epsilon), which quantifies how much information each invocation of a computation on the data may leak; smaller values of ε require more noise and give stronger privacy. A related concept is the privacy budget, which is chosen by the data curator.

This privacy budget represents the total measurable amount of information that may be exposed by computations before the level of privacy is deemed unacceptable.
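
As a rough illustration (not integrate.ai's implementation), a curator can track cumulative privacy loss under basic sequential composition, where the ε values of successive queries add up against the budget:

```python
class PrivacyAccountant:
    """Track cumulative privacy loss under basic sequential composition.

    Each query spends part of the total budget; once the budget is
    exhausted, further queries are refused.
    """

    def __init__(self, budget):
        self.budget = budget
        self.spent = 0.0

    def spend(self, epsilon):
        if self.spent + epsilon > self.budget:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

accountant = PrivacyAccountant(budget=1.0)
accountant.spend(0.5)   # first query
accountant.spend(0.5)   # second query uses the rest of the budget
# accountant.spend(0.1) would raise: the budget is exhausted
```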

The benefit of this approach is that the quantifiable measure provides a guarantee about how “private” the release of information is. In practice, however, relating that measure to the computation in question can be difficult: how private is private enough? What does ε need to be? These remain open questions for practitioners applying DP to an application.

How is Differential Privacy used in integrate.ai?

Users can add Differential Privacy to any model built in integrate.ai. DP parameters can be specified during session creation, within the model configuration.
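
As a rough, hypothetical illustration (the actual field names and structure are defined by the integrate.ai SDK and may differ), the idea is that DP settings sit alongside the rest of the model configuration supplied at session creation:

```python
# Hypothetical sketch only: the real parameter names are defined by the
# integrate.ai SDK and may differ from what is shown here.
model_config = {
    "model": {"type": "logistic_regression"},
    # Illustrative DP settings: the privacy-loss target and a clipping
    # bound that limits each record's influence before noise is added.
    "differential_privacy": {
        "epsilon": 1.0,
        "max_grad_norm": 1.0,
    },
}
# model_config would then be supplied as part of session creation.
```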

When is the right time to use Differential Privacy?

Overall, differential privacy is best applied when there is a clear way to select an appropriate ε for the computation, or when it is acceptable to specify a privacy-loss budget large enough to satisfy the computation's needs.

PRL Privacy for VFL

Private record linkage is a secure method for determining the overlap between datasets.

In vertical federated learning (VFL), the datasets shared by any two parties must have some overlap and alignment in order to be used for machine learning tasks. There are typically two main problems:

  1. The identifiers in the two datasets do not fully overlap.

  2. The rows of the filtered, overlapping records are not in the same order across the datasets.

To resolve these differences while maintaining privacy, integrate.ai applies private record linkage (PRL), which consists of two steps: determining the overlap (or intersection) and aligning the rows.
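
As a toy illustration of these two problems (not the integrate.ai workflow; the IDs and values below are made up), consider two parties whose ID columns only partially overlap and whose shared records appear in different orders:

```python
import pandas as pd

# Party A holds features and Party B holds labels. The ID columns only
# partially overlap, and the shared IDs appear in different orders.
party_a = pd.DataFrame({"id": ["u1", "u2", "u3", "u5"], "age": [34, 29, 41, 55]})
party_b = pd.DataFrame({"id": ["u5", "u3", "u1", "u4"], "label": [1, 0, 1, 0]})

# PRL must (1) find the intersection {u1, u3, u5} without revealing the
# non-overlapping IDs, and (2) put the shared rows in the same order on
# both sides so that features and labels line up for training.
```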

Private Set Intersection

First, Private Set Intersection (PSI) is used to determine the overlap without storing the raw data centrally or exposing it to any party. PSI is a privacy-preserving technique in the family of secure multiparty computation (SMPC). SMPC uses cryptographic techniques to enable multiple parties to perform operations on disparate datasets without revealing the underlying data to any party.

Additionally, integrate.ai does not store the raw data on a server. Instead, the parties submit the paths to their datasets. integrate.ai uses the ID columns of the datasets to produce a hash locally that is sent to the server for comparison. The secret for this hash is shared only through a Diffie–Hellman key exchange between the clients; the server has no knowledge of it.
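
The sketch below is not integrate.ai's implementation; it is a minimal illustration of the pattern described above, assuming textbook-sized (insecure) Diffie–Hellman parameters: the two clients derive a shared secret, use it to produce keyed hashes of their ID columns, and send only those digests to the server, which computes the intersection.

```python
import hashlib
import hmac
import secrets

# Toy Diffie-Hellman parameters for illustration only; a real deployment
# uses a standardized large prime group or an elliptic curve.
P = 23
G = 5

def dh_keypair():
    """Generate a private exponent and the corresponding public value."""
    priv = secrets.randbelow(P - 2) + 1
    return priv, pow(G, priv, P)

def shared_key(my_priv, their_pub):
    """Derive the shared secret; identical on both clients."""
    return hashlib.sha256(str(pow(their_pub, my_priv, P)).encode()).digest()

def blind_ids(ids, key):
    """Keyed hash of each ID; the server sees only these digests."""
    return {hmac.new(key, i.encode(), hashlib.sha256).hexdigest(): i for i in ids}

# Each client generates a keypair and derives the same shared key.
a_priv, a_pub = dh_keypair()
b_priv, b_pub = dh_keypair()
key_a = shared_key(a_priv, b_pub)
key_b = shared_key(b_priv, a_pub)
assert key_a == key_b  # the server never learns this key

# Clients send only the digests; the server intersects them.
digests_a = blind_ids(["u1", "u2", "u3", "u5"], key_a)
digests_b = blind_ids(["u5", "u3", "u1", "u4"], key_b)
common_digests = digests_a.keys() & digests_b.keys()
print(sorted(digests_a[d] for d in common_digests))  # ['u1', 'u3', 'u5']
```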

Private Record Alignment

Once the ID columns are compared, the server knows which dataset indices are common between the two sets and can align the rows. This step is the private record alignment portion of PRL. It enables machine learning tasks to be performed on the datasets.
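
Continuing the toy example from the PSI sketch (again, an illustration rather than the production implementation), each party reorders its own rows according to the agreed order of the intersection, so that features and labels line up by position:

```python
import pandas as pd

# Suppose the intersection step found that IDs ['u1', 'u3', 'u5'] are
# shared, expressed as row indices on each side.
party_a = pd.DataFrame({"id": ["u1", "u2", "u3", "u5"], "age": [34, 29, 41, 55]})
party_b = pd.DataFrame({"id": ["u5", "u3", "u1", "u4"], "label": [1, 0, 1, 0]})

order_a = [0, 2, 3]  # rows of party_a whose IDs are in the intersection
order_b = [2, 1, 0]  # matching rows of party_b, in the same ID order
features = party_a.iloc[order_a].reset_index(drop=True)
labels = party_b.iloc[order_b].reset_index(drop=True)
assert list(features["id"]) == list(labels["id"])  # rows are now aligned
```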

For more information about running PRL sessions, see PRL Sessions.