Using Integrate.ai¶
After the IT Administrator has deployed the required components to the cloud environment, the Workspace Administrator can log in, add users, and register a task runner.
Workspace Administrator Workflow¶
Workspace Administrators have full control over the entire workspace, from adding and removing users and assigning roles, to controlling administrative and billing information. There must always be at least one user with this role to manage the workspace.
Invite users to the workspace.
Register an AWS task runner or Azure task runner for the data custodians and model builders to use to register datasets and perform model training.
If the enterprise IT landscape requires ingress and egress exceptions for firewalls, or other specific configuration, provide those details during the task runner registration in the Advanced settings.
Register an AWS Task Runner¶
Task runners simplify the process of running training sessions on your data.
Note: Before attempting to register a task runner, ensure that you have completed the AWS configuration for task runners.
To register an AWS task runner:
Log in to your integrate.ai workspace.
In the left navigation bar, click Settings.
Under Workspace, click Task Runners.
Click Register to start registering a new task runner.
Select the service provider - Amazon Web Services.
Provide the following information:
Task runner name - must be unique.
Provisioner role ARN - the ARN created by the IT Administrator.
Runtime role ARN - the ARN created by the IT Administrator.
Region - select the AWS region to run in from the dropdown.
Storage Path - by default the task runner creates a bucket for you to upload data into (e.g. s3://{aws_taskrunner_profile}-{aws_taskrunner_name}.integrate.ai).
Only the default S3 bucket and other buckets ending in *integrate.ai are supported. If you are not using the default bucket created by the task runner when it was provisioned, ensure that your data is hosted in an S3 bucket with a URL ending in *integrate.ai. Otherwise, the data will not be accessible to the task runner.
vCPU - the number of virtual CPUs the task runner is allowed to use. The default is 8.
Memory size (GB) - the amount of memory to allocate to the task runners. The default is 32 GB.
Click Save. Wait for the status to change to Online.
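The field rules above can be checked before you fill in the form. The following is a minimal sketch, not part of the integrate.ai tooling: the function names and the sample values are hypothetical, the role ARN pattern follows the general AWS IAM layout, and the bucket check only tests the documented `integrate.ai` suffix.

```python
import re

# General IAM role ARN layout: arn:<partition>:iam::<12-digit account id>:role/<path/name>
ROLE_ARN_RE = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/[\w+=,.@/-]+$")


def is_role_arn(value: str) -> bool:
    """Return True if value looks like an IAM role ARN."""
    return ROLE_ARN_RE.match(value) is not None


def default_storage_path(profile: str, runner_name: str) -> str:
    """Build the default bucket URI from the pattern shown for Storage Path."""
    return f"s3://{profile}-{runner_name}.integrate.ai"


def is_supported_bucket(uri: str) -> bool:
    """Return True if the bucket name in an s3:// URI ends in 'integrate.ai'."""
    if not uri.startswith("s3://"):
        return False
    bucket = uri[len("s3://"):].split("/", 1)[0]
    return bucket.endswith("integrate.ai")


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(is_role_arn("arn:aws:iam::123456789012:role/iai-provisioner"))
    print(default_storage_path("myprofile", "runner1"))
    print(is_supported_bucket("s3://other-bucket/data.csv"))
```

A check like this only catches formatting mistakes. Whether the roles exist and carry the right permissions is still determined by the AWS configuration completed by the IT Administrator.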
Optional Advanced settings for AWS task runners¶
The Advanced settings section of the form provides several options that give you fine-grained control over the task runner.
Container registry URL - Provide the URL to the S3 bucket containing the integrate.ai client image. The format is: s3://<bucket URL>/<image name>.
Use an existing VPC - Provide the following information for your existing VPC configuration.
Existing VPC ID
Existing VPC public subnets
Existing VPC private subnets
Existing client security group
Existing server security group
Create a new VPC in a different CIDR block - Provide the following information to create a new VPC in a specified CIDR block.
Custom VPC CIDR
Custom private subnet CIDR
Custom public subnet CIDR
Custom CIDR newbits
Use existing KMS keys - Provide the following information to use your own KMS keys instead of those generated by integrate.ai for the task runner.
KMS data ID
KMS secret ID
Use golden AMI - Provide the AMI ID for the golden AMI.
After successfully creating a task runner, you can use it to perform training tasks. You can reuse the task runner; there is no need to create a new one for each task.
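If you plan to use the custom VPC CIDR settings, it can help to work out the subnet ranges before entering them. The sketch below uses Python's standard `ipaddress` module. It assumes that "Custom CIDR newbits" means the number of bits added to the VPC prefix to form each subnet (the convention used by Terraform's `cidrsubnet` function); the source does not confirm this, and the function names and sample ranges are hypothetical.

```python
import ipaddress
from itertools import islice
from typing import List


def plan_subnets(vpc_cidr: str, newbits: int, count: int) -> List[str]:
    """Return the first `count` subnets made by extending the VPC prefix by `newbits`."""
    vpc = ipaddress.ip_network(vpc_cidr, strict=True)
    # islice stops cleanly if fewer than `count` subnets exist.
    return [str(net) for net in islice(vpc.subnets(prefixlen_diff=newbits), count)]


def fits_in_vpc(vpc_cidr: str, subnet_cidr: str) -> bool:
    """Return True if the subnet range lies entirely inside the VPC range."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnet = ipaddress.ip_network(subnet_cidr)
    return subnet.subnet_of(vpc)


if __name__ == "__main__":
    # Example ranges only; use the blocks agreed with your IT Administrator.
    print(plan_subnets("10.0.0.0/16", newbits=8, count=2))
    print(fits_in_vpc("10.0.0.0/16", "192.168.0.0/24"))
```

Checking that the private and public subnet CIDRs fall inside the VPC CIDR before saving avoids a failed provisioning attempt caused by a simple range mismatch.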
Register a dataset (AWS)¶
Register your dataset through the workspace, by following the steps below.
Log in to your integrate.ai workspace.
Click Datasets in the left navigation bar.
On the Datasets page, click Register dataset.
On the Dataset overview page, type a name and description for the dataset.
Select an AWS task runner from the drop-down to manage tasks related to your dataset.
In the Dataset path field, specify the URI of the dataset, using the s3:// format. Ensure that the prepared Parquet or CSV file(s) is located in the S3 bucket that your task runner has access to.
(Optional) Attach a Data Dictionary for the dataset. Upload a CSV or JSON file with the following column headers:
column_name (standard column/feature name)
friendly_name (human-readable name)
description (describe what the feature represents)
is_categorical (true/false - specify if the feature is categorical)
exclude (true/false - specify if the column should be excluded from EDA).
(Optional) If you are going to share the dataset with a partner, upload a Data pre-processing document that describes how the dataset has been prepared before use. A template is provided in the UI.
(Optional) If you have additional metadata to associate with the dataset, upload it in the Attachments section.
Click Register.
Select a task runner to manage tasks related to your dataset.
Note: If no task runners exist, ask your Workspace Administrator or a Model Builder to create one.
Click Next.
On the Dataset details and privacy controls page, type a name and description for the dataset.
Specify the URI of the dataset, using the s3:// format. Ensure that the prepared Parquet or CSV file(s) is located in the S3 bucket that your Task Runner has access to.
(Optional) If you have metadata to associate with the dataset, upload it in the Attachments section.
Click Connect.
Your dataset is now registered and can be used in a notebook.
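The optional Data Dictionary mentioned in the steps above is a file with five column headers. As a rough sketch, a CSV version could be generated as follows. The helper name and the sample rows are hypothetical, and writing the true/false values in lowercase is an assumption based on how the values are listed in the steps.

```python
import csv
import io
from typing import Dict, List

# Column headers listed in the dataset registration steps.
HEADERS = ["column_name", "friendly_name", "description", "is_categorical", "exclude"]


def build_data_dictionary(rows: List[Dict]) -> str:
    """Render data dictionary rows as CSV text with the documented headers."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=HEADERS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        missing = set(HEADERS) - set(row)
        if missing:
            raise ValueError(f"row is missing fields: {sorted(missing)}")
        # Assumption: booleans are written as lowercase true/false.
        writer.writerow({
            **row,
            "is_categorical": str(bool(row["is_categorical"])).lower(),
            "exclude": str(bool(row["exclude"])).lower(),
        })
    return buffer.getvalue()


if __name__ == "__main__":
    # Sample feature descriptions, for illustration only.
    print(build_data_dictionary([
        {"column_name": "age", "friendly_name": "Age",
         "description": "Age in years", "is_categorical": False, "exclude": False},
        {"column_name": "region", "friendly_name": "Region",
         "description": "Sales region code", "is_categorical": True, "exclude": False},
    ]))
```

Saving the returned text to a `.csv` file gives you a file that can be attached in the Data Dictionary step.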
Register an Azure Task Runner¶
Task runners simplify the process of running training sessions on your data.
Note: Before attempting to register a task runner, ensure that you have completed the Azure configuration for task runners.
To register an Azure task runner:
Log in to your integrate.ai workspace.
In the left navigation bar, click Settings.
Under Workspace, click Task Runners.
Click Register to start registering a new task runner.
Select the service provider - Microsoft Azure.
Provide the following information:
Task runner name - must be unique.
Region - select from the list.
Resource group - must be an existing dedicated resource group.
Service principal ID - this is the appId from the Azure CLI output of creating a service principal.
Service principal secret - this is the password from the Azure CLI output.
Runtime Service principal ID - this is the application ID of the App Registration created.
Runtime Service principal secret - this is the secret generated for the application.
Subscription ID - the ID of your Microsoft Azure subscription. Can be found on the Azure dashboard.
Tenant ID - this is the tenantId from the Azure CLI output of creating a service principal.
vCPU - the number of virtual CPUs the task runner is allowed to use. The default and maximum is 4.
Memory size (MB) - the amount of memory to allocate to the task runners. The default and maximum is 16 GB. This amount can be decreased, but not increased.
Storage account - displays the default storage account name: storage.
Storage path - displays the default storage account path: azure://sessionstorage.
Click Save. Wait for the status to change to Online.
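Several of the Azure values above are GUIDs copied from CLI output, and the vCPU and memory fields have documented maximums. A small pre-check along these lines may catch copy-and-paste mistakes before you save the form. This is a sketch only: the dictionary keys are hypothetical names chosen for illustration, and memory is expressed in GB here to match the stated 16 GB limit.

```python
import uuid
from typing import Dict, List

MAX_VCPU = 4        # documented default and maximum
MAX_MEMORY_GB = 16  # documented default and maximum

# Hypothetical key names for the GUID-valued form fields.
GUID_FIELDS = (
    "subscription_id",
    "tenant_id",
    "service_principal_id",
    "runtime_service_principal_id",
)


def is_guid(value: str) -> bool:
    """Return True if value is a GUID in canonical 8-4-4-4-12 form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


def check_azure_settings(settings: Dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for field in GUID_FIELDS:
        if not is_guid(settings.get(field, "")):
            problems.append(f"{field} is not a GUID")
    if not 1 <= settings.get("vcpu", MAX_VCPU) <= MAX_VCPU:
        problems.append(f"vcpu must be between 1 and {MAX_VCPU}")
    if not 0 < settings.get("memory_gb", MAX_MEMORY_GB) <= MAX_MEMORY_GB:
        problems.append(f"memory_gb must be at most {MAX_MEMORY_GB}")
    return problems


if __name__ == "__main__":
    # Placeholder GUIDs, for illustration only.
    print(check_azure_settings({
        "subscription_id": "00000000-0000-0000-0000-000000000001",
        "tenant_id": "not-a-guid",
        "service_principal_id": "00000000-0000-0000-0000-000000000003",
        "runtime_service_principal_id": "00000000-0000-0000-0000-000000000004",
        "vcpu": 8,
        "memory_gb": 16,
    }))
```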
Optional Advanced Settings for Azure task runners¶
The Advanced settings section of the form provides several options that give you fine-grained control over the task runner.
Container registry URL - Provide the URL to the Azure container registry containing the integrate.ai container images. The format is: <container registry>.azurecr.io/<image name>. For example: iairepo.azurecr.io/edge/fl-client.
More information about container registries is available here.
Bind to an existing Azure Virtual Network - Enable this setting to provide the subnet ID for your existing Azure Virtual Network.
Managed identity to access ACR - Managed Identity resource_id with AcrPull built-in role in order to support Private Link Access for ACR.
After successfully creating a task runner, you can use it to perform training tasks. You can reuse the task runner; there is no need to create a new one for each task.
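The container registry URL format described in the Advanced settings can be checked with a short pattern match. The sketch below is illustrative only: the function name is hypothetical, and the pattern assumes an alphanumeric registry name and a lowercase image path without a tag, since the documented format does not show one.

```python
import re
from typing import Tuple

# <container registry>.azurecr.io/<image name>
ACR_IMAGE_RE = re.compile(
    r"^(?P<registry>[A-Za-z0-9]+)\.azurecr\.io/(?P<image>[a-z0-9._/-]+)$"
)


def parse_acr_image(url: str) -> Tuple[str, str]:
    """Split a registry URL into (registry name, image name) or raise ValueError."""
    match = ACR_IMAGE_RE.match(url)
    if match is None:
        raise ValueError(f"not in <container registry>.azurecr.io/<image name> form: {url!r}")
    return match.group("registry"), match.group("image")


if __name__ == "__main__":
    # The example URL given in the Advanced settings description.
    print(parse_acr_image("iairepo.azurecr.io/edge/fl-client"))
```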
Register a dataset (Azure)¶
Register your dataset through the workspace UI, by following the steps below.
Log in to your integrate.ai workspace.
Click Datasets in the left navigation bar.
On the Datasets page, click Register dataset.
On the Dataset overview page, type a name and description for the dataset.
Select an Azure task runner from the drop-down to manage tasks related to your dataset.
In the Dataset path field, specify the URI of the dataset, using the azure:// format. Ensure that the prepared Parquet or CSV file(s) is located in the Azure Blob that your task runner has access to.
(Optional) Attach a Data Dictionary for the dataset. Upload a CSV or JSON file with the following column headers:
column_name (standard column/feature name)
friendly_name (human-readable name)
description (describe what the feature represents)
is_categorical (true/false - specify if the feature is categorical)
exclude (true/false - specify if the column should be excluded from EDA).
(Optional) If you are going to share the dataset with a partner, upload a Data pre-processing document that describes how the dataset has been prepared before use. A template is provided in the UI.
(Optional) If you have additional metadata to associate with the dataset, upload it in the Attachments section.
Click Register.
Your dataset is now registered and can be used in a session.
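Before typing a value into the Dataset path field, it can be useful to confirm that the URI uses the expected scheme and points at a Parquet or CSV file. The following sketch uses only the Python standard library. The function name is hypothetical, and treating the first URI component as the storage root is an assumption based on the default path azure://sessionstorage.

```python
import posixpath
from typing import Dict, Optional
from urllib.parse import urlparse

KNOWN_FORMATS = {".parquet": "parquet", ".csv": "csv"}


def describe_dataset_path(uri: str, expected_scheme: str = "azure") -> Dict[str, Optional[str]]:
    """Break a dataset URI into its parts and report the detected file format."""
    parsed = urlparse(uri)
    if parsed.scheme != expected_scheme:
        raise ValueError(f"expected a {expected_scheme}:// URI, got {uri!r}")
    if not parsed.netloc:
        raise ValueError(f"URI has no storage root: {uri!r}")
    extension = posixpath.splitext(parsed.path)[1].lower()
    return {
        "root": parsed.netloc,
        "path": parsed.path.lstrip("/"),
        # None means the path is a folder or an unrecognized file type.
        "format": KNOWN_FORMATS.get(extension),
    }


if __name__ == "__main__":
    # Hypothetical file location under the default storage path.
    print(describe_dataset_path("azure://sessionstorage/data/train.parquet"))
```

Passing `expected_scheme="s3"` applies the same check to dataset paths for an AWS task runner.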
Register a dataset (Snowflake)¶
Register your dataset through the workspace UI, by following the steps below.
Log in to your integrate.ai workspace.
Click Datasets in the left navigation bar.
On the Datasets page, click Register dataset.
On the Dataset overview page, type a name and description for the dataset.
Select the Snowflake task runner from the drop-down to manage tasks related to your dataset.
Click the Use Snowflake credentials toggle to enable the settings.
Provide the following information from your Snowflake account:
Role - the database user role that can query the data.
Database - the name of the database in Snowflake
Schema - the schema for the database
SQL - drag and drop a SQL file (*.sql) onto the page to upload the query to load the Snowflake data.
(Optional) Attach a Data Dictionary for the dataset. Upload a CSV or JSON file with the following column headers:
column_name (standard column/feature name)
friendly_name (human-readable name)
description (describe what the feature represents)
is_categorical (true/false - specify if the feature is categorical)
exclude (true/false - specify if the column should be excluded from EDA).
(Optional) If you are going to share the dataset with a partner, upload a Data pre-processing document that describes how the dataset has been prepared before use. A template is provided in the UI.
(Optional) If you have additional metadata to associate with the dataset, upload it in the Attachments section.
Click Register.
Your dataset is now registered and can be used in a session.
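The SQL file uploaded in the steps above contains the query that loads the Snowflake data. As a minimal sketch, the query text could be assembled as shown below. The function name, the table name, and the column names are hypothetical, and the use of a fully qualified table name with a trailing semicolon is an assumption; the source does not specify what form the uploaded query must take.

```python
import re
from typing import List, Optional

# Snowflake unquoted identifiers: start with a letter or underscore,
# then letters, digits, underscores, or dollar signs.
UNQUOTED_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")


def build_query(database: str, schema: str, table: str,
                columns: Optional[List[str]] = None) -> str:
    """Build a SELECT statement over a fully qualified table name."""
    for name in (database, schema, table, *(columns or [])):
        if not UNQUOTED_IDENTIFIER.match(name):
            raise ValueError(f"not a valid unquoted identifier: {name!r}")
    column_list = ", ".join(columns) if columns else "*"
    return f"SELECT {column_list}\nFROM {database}.{schema}.{table};\n"


if __name__ == "__main__":
    # Hypothetical database, schema, table, and columns.
    print(build_query("SALES_DB", "PUBLIC", "CUSTOMERS", ["ID", "AGE"]))
```

Saving the returned text to a file such as `query.sql` produces a file that can be dropped onto the SQL field. Rejecting anything other than plain identifiers keeps stray characters out of the generated statement.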