Cluster onboarding

How it works

To perform automated cost optimization and security reporting, CAST AI needs access to your cluster. This section describes the steps required to onboard a cluster to CAST AI and the actions the onboarding script performs in the customer's account.

📘

Before onboarding the cluster, ensure that the CAST AI read-only agent is already running.

Cluster onboarding is performed by an automated script, which you can generate by clicking the Enable CAST AI button or via the API.

The following sections describe the onboarding prerequisites for each supported cloud provider, as well as the actions performed by the onboarding script.

EKS

Prerequisites

  • AWS CLI - A command line tool for working with AWS services using commands in your command-line shell. For more
    information, see Installing AWS CLI.

  • jq – a lightweight command line JSON processor. For more information about the tool click here.

  • IAM permissions – The IAM security principal that you're using must have permissions to work with AWS EKS, AWS IAM, and related resources. Additionally, you should have access to the EKS cluster that you wish to onboard on the CAST AI console.
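The command-line prerequisites can be checked before running the onboarding script. The following is a minimal sketch, assuming the tools are installed under their usual executable names (`aws`, `jq`); the helper `missing_tools` is a hypothetical name and is not part of the onboarding script. It checks only that the executables are on PATH, not that the IAM permissions are in place.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Executable names for the tools listed in the EKS prerequisites (assumed names).
EKS_TOOLS = ["aws", "jq"]


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(EKS_TOOLS)
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All command-line prerequisites found.")
```

The same helper can be reused for the other providers by passing a different list, for example `["gcloud"]` for GKE or `["az", "jq"]` for AKS.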

Example of a least-privilege policy for the administrator account (the permissions needed to run the onboarding script, used once per cluster during onboarding):

{
    "Action": [
        "iam:CreateRole",
        "iam:CreatePolicy",
        "iam:GetPolicy",
        "iam:ListPolicyVersions",
        "iam:PutRolePolicy",
        "iam:AttachRolePolicy",
        "iam:CreateInstanceProfile",
        "iam:GetInstanceProfile",
        "iam:AddRoleToInstanceProfile",
        "iam:UpdateAssumeRolePolicy"
    ]
}
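The fragment above lists only the actions. As a rough sketch, it could be wrapped into a complete IAM policy document as shown below. The `Effect` and `Resource` values are assumptions that do not come from this page; in particular, `"Resource": "*"` is only a placeholder, and a narrower resource scope may be preferable in your account. `build_policy` is a hypothetical helper name.

```python
import json

# Actions copied from the example policy fragment above.
ONBOARDING_ACTIONS = [
    "iam:CreateRole",
    "iam:CreatePolicy",
    "iam:GetPolicy",
    "iam:ListPolicyVersions",
    "iam:PutRolePolicy",
    "iam:AttachRolePolicy",
    "iam:CreateInstanceProfile",
    "iam:GetInstanceProfile",
    "iam:AddRoleToInstanceProfile",
    "iam:UpdateAssumeRolePolicy",
]


def build_policy(actions, resource="*"):
    """Wrap a list of actions into a single-statement IAM policy document."""
    return {
        "Version": "2012-10-17",  # current IAM policy language version
        "Statement": [
            {
                "Effect": "Allow",      # assumption: not stated in the fragment
                "Action": list(actions),
                "Resource": resource,   # assumption: placeholder scope
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy(ONBOARDING_ACTIONS), indent=4))
```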

Actions performed by the onboarding script

The script will perform the following actions:

  • Create the cast-eks-*cluster-name* IAM user (if a cross-account IAM role is selected, an IAM role is created instead) with the required permissions to manage the cluster:

    • AmazonEC2ReadOnlyAccess
    • IAMReadOnlyAccess
    • Manage instances in specified cluster restricted to cluster VPC
    • Manage autoscaling groups in the specified cluster
    • Manage EKS Node Groups in the specified cluster
  • Create the CastEKSPolicy policy, which is used to manage the EKS cluster. The policy contains the following permissions:

    • Create & delete instance profiles
    • Create & manage roles
    • Create & manage EC2 security groups, key pairs, and tags
    • Run EC2 instances
  • Create the following roles:

    • cast-*cluster-name*-eks-####### to manage EKS nodes, with the following AWS managed policies applied:
      • AmazonEKSWorkerNodePolicy
      • AmazonEC2ContainerRegistryReadOnly
      • AmazonEKS_CNI_Policy
  • Modify the aws-auth ConfigMap to map the newly created IAM user to the cluster (skipped when a cross-account IAM role is used).

  • If a cross-account IAM role was not selected, an AWS AccessKeyId and SecretAccessKey are created and printed; these can then be added to the CAST AI console and assigned to the corresponding EKS cluster. The AccessKeyId and SecretAccessKey are used by CAST AI to make programmatic calls to AWS and are stored in CAST AI's secret store, which runs on Google's Secret Manager.

  • If a cross-account IAM role is selected, a Role ARN is printed and sent to the CAST AI console; CAST AI then uses it to assume the role when making programmatic calls to AWS.
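For readers unfamiliar with cross-account roles, the sketch below shows the general shape of an IAM trust policy that lets an external principal assume a role. It is an illustration of the mechanism only, not the policy the onboarding script creates: the principal ARN and external ID are hypothetical placeholders, and `build_trust_policy` is a hypothetical helper name.

```python
import json


def build_trust_policy(trusted_principal_arn, external_id=None):
    """Build a trust policy allowing one principal to call sts:AssumeRole."""
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": trusted_principal_arn},
        "Action": "sts:AssumeRole",
    }
    if external_id is not None:
        # Optional condition commonly used with third-party cross-account access.
        statement["Condition"] = {
            "StringEquals": {"sts:ExternalId": external_id}
        }
    return {"Version": "2012-10-17", "Statement": [statement]}


if __name__ == "__main__":
    # Hypothetical placeholder values, for illustration only.
    policy = build_trust_policy(
        "arn:aws:iam::111122223333:user/example-principal",
        external_id="example-external-id",
    )
    print(json.dumps(policy, indent=4))
```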

📘

Scope of permissions

All write permissions are scoped to a single EKS cluster; CAST AI won't have access to the resources of any other clusters in the AWS account.

Manual credential onboarding

To complete the steps mentioned above manually (without our script), be aware that when you create an Amazon EKS cluster, the IAM entity (a user or role, such as a federated user) that creates the cluster is automatically granted system:masters permissions in the cluster's RBAC configuration in the control plane. To grant additional AWS users or roles the ability to interact with your cluster, you need to edit the aws-auth ConfigMap within Kubernetes. For more information, see Managing users or IAM roles for your cluster.
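As an illustration of the manual step, the sketch below renders one entry for the `mapUsers` section of the aws-auth ConfigMap, which is a YAML list of `userarn`, `username`, and `groups`. The account ID, user name, and group are hypothetical placeholders; this page does not state which Kubernetes group the onboarding script assigns. `map_users_entry` is a hypothetical helper name.

```python
def map_users_entry(user_arn, username, groups):
    """Render one mapUsers list item for the aws-auth ConfigMap as YAML text."""
    lines = [
        f"- userarn: {user_arn}",
        f"  username: {username}",
        "  groups:",
    ]
    lines += [f"    - {group}" for group in groups]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical placeholder values, for illustration only.
    print(
        map_users_entry(
            "arn:aws:iam::111122223333:user/cast-eks-example",
            "cast-eks-example",
            ["example-group"],
        )
    )
```

The rendered text would be appended under `mapUsers` in the ConfigMap's `data` section, for example by editing the ConfigMap with `kubectl edit -n kube-system configmap/aws-auth`.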

Usage of AWS services

CAST AI relies on the agent that runs inside the customer's cluster. The following services are consumed during operation:

  • A portion of EC2 node resources from the customer's cluster. The CAST AI agent uses the Cluster Proportional Vertical Autoscaler to consume the minimum required resources, depending on the size of the cluster
  • A small amount of network traffic to communicate with CAST AI SaaS
  • EC2 instances, their storage, and intra-cluster network traffic to manage Kubernetes cluster and perform autoscaling
  • IAM resources as detailed in the onboarding section

GKE

Prerequisites

  • gcloud - A command line tool for working with GKE services using commands in your command-line shell. For more
    information, see Installing gcloud.

  • IAM permissions – The IAM user that you're using must have:

    • Access to the project where the cluster is created.
    • Permissions to work with IAM, GKE, and compute resources.
  • CAST AI agent – The agent has to be running on the cluster.

Actions performed by the onboarding script

The script creates a new GCP service account with the required roles. The service account will have the following roles:

  • roles/castai.gkeAccess (created by the script) - access to get and update your GKE cluster and manage compute instances.
  • roles/container.developer - access to resources within the Kubernetes cluster.
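A custom role such as castai.gkeAccess can be described in a role definition file of the kind accepted by `gcloud iam roles create --file`. The sketch below builds such a definition as JSON. It is an illustration under stated assumptions, not the script's actual implementation: the title, description, and `GA` stage are assumed values, only a few permissions from the full list later in this section are included for brevity, and `build_role_definition` is a hypothetical helper name.

```python
import json

# Role ID as named in this section.
ROLE_ID = "castai.gkeAccess"

# A small subset of the permissions listed later in this section.
SAMPLE_PERMISSIONS = [
    "compute.instances.create",
    "compute.instances.delete",
    "container.clusters.get",
    "container.clusters.update",
]


def build_role_definition(title, description, permissions, stage="GA"):
    """Build a custom role definition with de-duplicated, sorted permissions."""
    return {
        "title": title,
        "description": description,
        "stage": stage,  # assumption: launch stage is not stated on this page
        "includedPermissions": sorted(set(permissions)),
    }


if __name__ == "__main__":
    definition = build_role_definition(
        "CAST AI GKE access",                       # hypothetical title
        "Custom role used by CAST AI to manage a GKE cluster",  # hypothetical
        SAMPLE_PERMISSIONS,
    )
    print(json.dumps(definition, indent=2))
```

The printed JSON could be saved to a file and passed to `gcloud iam roles create castai.gkeAccess --project=<project-id> --file=<file>`, where the project ID and file name are placeholders.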

The script will perform the following actions:

  • Enables the following GCP services and APIs for the project in which the GKE cluster is running:

    GCP Service / API Group               Description
    serviceusage.googleapis.com           API to list, enable, and disable GCP services
    iam.googleapis.com                    API to manage identity and access control for GCP resources
    cloudresourcemanager.googleapis.com   API to create, read, and update metadata for GCP resource containers
    container.googleapis.com              API to manage GKE
    compute.googleapis.com                API to manage GCP virtual machines

  • Creates a dedicated GCP service account, castai-gke-<cluster-name-hash>, used by CAST AI to request and manage GCP resources on the customer's behalf.

  • Creates a custom role, castai.gkeAccess, with the following permissions:

- compute.addresses.use
- compute.disks.create
- compute.disks.setLabels
- compute.disks.use
- compute.images.useReadOnly
- compute.instanceGroupManagers.get
- compute.instanceGroupManagers.update
- compute.instanceGroups.get
- compute.instanceTemplates.create
- compute.instanceTemplates.delete
- compute.instanceTemplates.get
- compute.instanceTemplates.list
- compute.instances.create
- compute.instances.delete
- compute.instances.get
- compute.instances.list
- compute.instances.setLabels
- compute.instances.setMetadata
- compute.instances.setServiceAccount
- compute.instances.setTags
- compute.instances.start
- compute.instances.stop
- compute.networks.use
- compute.networks.useExternalIp
- compute.subnetworks.get
- compute.subnetworks.use
- compute.subnetworks.useExternalIp
- compute.zones.get
- compute.zones.list
- container.certificateSigningRequests.approve
- container.clusters.get
- container.clusters.update
- container.operations.get
- serviceusage.services.list
  • Attaches the following roles to the castai-gke-<cluster-name-hash> service account:

    Role name                Description
    castai.gkeAccess         CAST AI managed role used for CAST AI add/delete node operations; the full list of permissions is given above
    container.developer      GCP managed role for full access to Kubernetes API objects inside the Kubernetes cluster
    iam.serviceAccountUser   GCP managed role that allows running operations as the service account
  • Installs Kubernetes components required for a successful experience with CAST AI:

$ kubectl get deployments.apps   -n castai-agent
NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
castai-agent                1/1     1            1           15m
castai-agent-cpvpa          1/1     1            1           15m
castai-cluster-controller   2/2     2            2           15m
castai-evictor              0/0     0            0           15m
castai-kvisor               1/1     1            1           15m

$ kubectl get daemonsets.apps -n castai-agent
NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                            AGE
castai-spot-handler    0         0         0       0            0           scheduling.cast.ai/spot=true             15m

A full overview of the hosted components can be found here.
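The `kubectl get deployments` output shown above can also be checked programmatically. The following is a minimal sketch that parses the READY column and reports deployments with fewer ready replicas than desired. It assumes the default table layout shown above; `not_ready` is a hypothetical helper name. A deployment scaled to zero, such as castai-evictor with 0/0, counts as ready.

```python
from typing import List

# Sample copied from the kubectl output above.
SAMPLE_OUTPUT = """\
NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
castai-agent                1/1     1            1           15m
castai-agent-cpvpa          1/1     1            1           15m
castai-cluster-controller   2/2     2            2           15m
castai-evictor              0/0     0            0           15m
castai-kvisor               1/1     1            1           15m
"""


def not_ready(table: str) -> List[str]:
    """Return names of deployments whose ready count is below the desired count."""
    names = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        name, ready = fields[0], fields[1]
        current, desired = (int(part) for part in ready.split("/"))
        if current < desired:
            names.append(name)
    return names


if __name__ == "__main__":
    pending = not_ready(SAMPLE_OUTPUT)
    print("Not ready:", ", ".join(pending) if pending else "none")
```

In practice the table text would come from running `kubectl get deployments.apps -n castai-agent` and capturing its standard output.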

GKE node pools created by CAST AI

After the cluster is onboarded, CAST AI creates two GKE node pools:

  • castpool - used to gather the data required to create CAST AI managed GKE x86 nodes
  • castpool-arm - used to gather the data required to create CAST AI managed GKE ARM64 nodes. castpool-arm is created only if the cluster region supports ARM64 VMs

AKS

Prerequisites

  • az CLI - A command line tool for working with Azure services using commands in your command-line shell. For more
    information, see Installing az CLI.

  • jq – a lightweight command line JSON processor. For more information about the tool, click here.

  • Azure AD permissions – permissions to create App registration.

Actions performed by the onboarding script

The script will perform the following actions:

  • Create the CastAKSRole-*cluster-id* role, used to manage the onboarded AKS cluster, with the following permissions:
ROLE_NAME="CastAKSRole-${CASTAI_CLUSTER_ID:0:8}"
ROLE_DEF='{
   "Name": "'"$ROLE_NAME"'",
   "Description": "CAST.AI role used to manage '"$CLUSTER_NAME"' AKS cluster",
   "IsCustom": true,
   "Actions": [
       "Microsoft.Compute/*/read",
       "Microsoft.Compute/virtualMachines/*",
       "Microsoft.Compute/virtualMachineScaleSets/*",
       "Microsoft.Compute/disks/write",
       "Microsoft.Compute/disks/delete",
       "Microsoft.Compute/disks/beginGetAccess/action",
       "Microsoft.Compute/galleries/write",
       "Microsoft.Compute/galleries/delete",
       "Microsoft.Compute/galleries/images/write",
       "Microsoft.Compute/galleries/images/delete",
       "Microsoft.Compute/galleries/images/versions/write",
       "Microsoft.Compute/galleries/images/versions/delete",
       "Microsoft.Compute/snapshots/write",
       "Microsoft.Compute/snapshots/delete",
       "Microsoft.Network/*/read",
       "Microsoft.Network/networkInterfaces/write",
       "Microsoft.Network/networkInterfaces/delete",
       "Microsoft.Network/networkInterfaces/join/action",
       "Microsoft.Network/networkSecurityGroups/join/action",
       "Microsoft.Network/publicIPAddresses/write",
       "Microsoft.Network/publicIPAddresses/delete",
       "Microsoft.Network/publicIPAddresses/join/action",
       "Microsoft.Network/virtualNetworks/subnets/join/action",
       "Microsoft.Network/virtualNetworks/subnets/write",
       "Microsoft.Network/applicationGateways/backendhealth/action",
       "Microsoft.Network/applicationGateways/backendAddressPools/join/action",
       "Microsoft.Network/applicationSecurityGroups/joinIpConfiguration/action",
       "Microsoft.Network/loadBalancers/backendAddressPools/write",
       "Microsoft.Network/loadBalancers/backendAddressPools/join/action",
       "Microsoft.ContainerService/*/read",
       "Microsoft.ContainerService/managedClusters/start/action",
       "Microsoft.ContainerService/managedClusters/stop/action",
       "Microsoft.ContainerService/managedClusters/runCommand/action",
       "Microsoft.ContainerService/managedClusters/agentPools/*",
       "Microsoft.Resources/*/read",
       "Microsoft.Resources/tags/write",
       "Microsoft.Authorization/locks/read",
       "Microsoft.Authorization/roleAssignments/read",
       "Microsoft.Authorization/roleDefinitions/read",
       "Microsoft.ManagedIdentity/userAssignedIdentities/assign/action"
   ],
   "AssignableScopes": [
       "/subscriptions/'"$SUBSCRIPTION_ID"'/resourceGroups/'"$CLUSTER_GROUP"'",
       "/subscriptions/'"$SUBSCRIPTION_ID"'/resourceGroups/'"$NODE_GROUP"'"
   ]
}'
  • Create the app registration "CAST.AI ${CLUSTER_NAME}-${CASTAI_CLUSTER_ID:0:8}", which uses the role CastAKSRole-${CASTAI_CLUSTER_ID:0:8}.
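The names above are derived from the first eight characters of the CAST AI cluster ID (the shell expansion `${CASTAI_CLUSTER_ID:0:8}`), and the role is assignable only within the cluster's two resource groups. The sketch below mirrors that derivation in Python for illustration. The function names and all sample values are hypothetical; only the name patterns and scope format come from the snippet above.

```python
from typing import List


def short_id(cluster_id: str) -> str:
    """First eight characters of the cluster ID, like ${CASTAI_CLUSTER_ID:0:8}."""
    return cluster_id[:8]


def role_name(cluster_id: str) -> str:
    return f"CastAKSRole-{short_id(cluster_id)}"


def app_registration_name(cluster_name: str, cluster_id: str) -> str:
    return f"CAST.AI {cluster_name}-{short_id(cluster_id)}"


def assignable_scopes(
    subscription_id: str, cluster_group: str, node_group: str
) -> List[str]:
    """Scopes for the cluster resource group and the node resource group."""
    return [
        f"/subscriptions/{subscription_id}/resourceGroups/{group}"
        for group in (cluster_group, node_group)
    ]


if __name__ == "__main__":
    # Hypothetical placeholder values, for illustration only.
    cluster_id = "0a1b2c3d-1111-2222-3333-444455556666"
    print(role_name(cluster_id))
    print(app_registration_name("example-cluster", cluster_id))
    for scope in assignable_scopes(
        "00000000-0000-0000-0000-000000000000", "example-rg", "example-node-rg"
    ):
        print(scope)
```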

📘

All write permissions are scoped to the resource groups in which the cluster is running; CAST AI won't have access to the resources of any other clusters in the Azure subscription.

Kubernetes components required for a successful experience with CAST AI

$ kubectl get deployments.apps   -n castai-agent
NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
castai-agent                1/1     1            1           3h26m
castai-agent-cpvpa          1/1     1            1           3h26m
castai-cluster-controller   2/2     2            2           3h26m
castai-evictor              0/0     0            0           3h26m
$ kubectl get daemonsets.apps -n castai-agent
NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                            AGE
castai-aks-init-data   0         0         0       0            0           provisioner.cast.ai/aks-init-data=true   3h26m
castai-spot-handler    0         0         0       0            0           scheduling.cast.ai/spot=true             3h26m

A full overview of the hosted components can be found here.

Azure Agent Pools created by CAST AI

After the cluster is onboarded, CAST AI creates two AKS agent pools:

  • castpool - used to run the aks-init-data DaemonSet, which gathers the data required to create CAST AI managed AKS nodes. More on the aks-init-data DaemonSet can be found here.
  • castworkers - used as a container for CAST AI managed AKS nodes. Removing this agent pool would remove all nodes created by CAST AI.