Terraform

Will version 4.0 of the CAST AI provider only remove the deprecated fields, with no other major changes that would require us to rework our provider usage?

Yes, that is correct.

Can I import an existing node template into the current Terraform state?

This question is relevant if you made updates in the CAST AI console UI and, when you redeploy with Terraform, you see those updates changed or destroyed.

Importing alone does not solve this. A Terraform state import only connects the Terraform resource to the existing node template: it tells Terraform which object to change when applying, not what the configuration should be. You still need to update the Terraform script and add the fields you changed in the UI.

Our recommendation is to add the changes to the relevant module, run terraform plan, and repeat until the plan shows no changes.
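The "repeat until the plan shows no changes" check can be automated. The following is a minimal sketch, assuming the terraform binary is on your PATH and the working directory is already initialized. It relies on the standard -detailed-exitcode flag of terraform plan, which exits with 0 when there is no diff, 2 when changes are pending, and 1 on error. The helper names plan_status and run_plan are hypothetical and not part of any CAST AI tooling.

```python
import subprocess


def plan_status(exit_code: int) -> str:
    """Interpret the exit code of `terraform plan -detailed-exitcode`."""
    if exit_code == 0:
        return "clean"    # plan succeeded, configuration matches the real state
    if exit_code == 2:
        return "changes"  # plan succeeded, but Terraform still wants to change something
    return "error"        # plan failed (exit code 1 or anything unexpected)


def run_plan(workdir: str = ".") -> str:
    """Run the plan in `workdir` (hypothetical helper) and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
    )
    return plan_status(result.returncode)
```

Calling run_plan() after each edit to the module returns "changes" until the script matches what was configured in the UI, and "clean" once it is safe to apply.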

Alternatively, you can remove the module from the script and remove the resource from the Terraform state. Terraform then stops tracking the node template, and it can be managed only from the UI.

What is the limitation on the API token used for Terraform onboarding? In my cluster managed via Terraform I've used both a full access API key and the agent's API key for the cluster controller and didn't encounter any errors.

The token generated in the UI as part of the onboarding script is created for a specific cluster and a specific onboarding, so it is tied to that cluster. An API key created via the UI is not bound to a cluster. We don’t recommend sharing it between clusters, but it can be done.

If you reuse the onboarding token from the UI, you will run into issues: it expires after 4 hours and is also deleted when you delete the cluster.

For long-term use of clusters managed via Terraform, you need to use an API Token with Full Access created in the CAST AI UI.

How do we disconnect a cluster from CAST AI if we onboarded it using Terraform?

You can destroy the CAST AI resources using Terraform, as this is how the cluster was initially onboarded. Also make sure the delete_nodes_on_disconnect variable is set to true if you want CAST AI-created resources to be deleted when the cluster is disconnected.

What is your recommendation to move from version 3 to 5? I would like to know what impact each component will have.

We recommend migrating from version 3.x.x to 5.x.x in the following steps:

1. Follow the migration steps from version 3.x.x to 4.x.x.
2. Once you've completed the above step, follow the migration steps from version 4.x.x to 5.x.x.
3. Lastly, add the parameters spot_interruption_predictions_enabled and spot_interruption_predictions_type to the castai_node_template resource to enable Spot interruption predictions.

The following fields have moved out of the autoscaler_policies_json property. Each line lists the source field and its target:

nodeConstraints.minCpuCores -> castai_node_template (min_cpu)
nodeConstraints.maxCpuCores -> castai_node_template (max_cpu)
nodeConstraints.minRamMib -> castai_node_template (min_memory)
nodeConstraints.maxRamMib -> castai_node_template (max_memory)
customInstancesEnabled -> castai_node_template (custom_instances_enabled)
spotInstances.enabled -> castai_node_template (spot)
spotInstances.spotDiversityEnabled -> castai_node_template (enable_spot_diversity)
spotInstances.spotDiversityPriceIncreaseLimitPercent -> castai_node_template (spot_diversity_price_increase_limit_percent)
spotInstances.spotInterruptionPredictions -> castai_node_template (spot_interruption_predictions_enabled)
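To check an existing policy document against the mapping listed above, the mapping can be expressed as a lookup table. The following is a minimal sketch, not an official migration tool: the function names are hypothetical, and the exact key casing and nesting of the legacy policy JSON are assumptions taken from the dotted paths in the list. The spotInterruptionPredictions entry is left out because its value shape is not described here.

```python
from typing import Any

# Legacy policy path -> castai_node_template field, as listed above.
# Key casing and nesting of the legacy JSON are assumptions.
FIELD_MAP = {
    "nodeConstraints.minCpuCores": "min_cpu",
    "nodeConstraints.maxCpuCores": "max_cpu",
    "nodeConstraints.minRamMib": "min_memory",
    "nodeConstraints.maxRamMib": "max_memory",
    "customInstancesEnabled": "custom_instances_enabled",
    "spotInstances.enabled": "spot",
    "spotInstances.spotDiversityEnabled": "enable_spot_diversity",
    "spotInstances.spotDiversityPriceIncreaseLimitPercent":
        "spot_diversity_price_increase_limit_percent",
}

_MISSING = object()  # sentinel, so that False and 0 count as real values


def lookup(doc: Any, dotted_path: str) -> Any:
    """Walk a nested dict along a dotted path; return _MISSING if absent."""
    node = doc
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def to_node_template_fields(legacy: dict) -> dict:
    """Collect the legacy values under their new castai_node_template names."""
    fields = {}
    for path, target in FIELD_MAP.items():
        value = lookup(legacy, path)
        if value is not _MISSING:
            fields[target] = value
    return fields


# Illustrative input only; the values are made up.
legacy = {
    "nodeConstraints": {"minCpuCores": 2, "maxCpuCores": 16},
    "spotInstances": {"enabled": True, "spotDiversityEnabled": False},
}
print(to_node_template_fields(legacy))
```

The returned dictionary lists the values you would need to set on the castai_node_template resource; fields absent from the legacy document are simply skipped.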

Why do I get this error when running Terraform for a new cluster that uses the same rebalancing schedule and job?

Error:

│ Error: expected status code 200, received: status=404 body={"message":"Job not found","fieldViolations":[]}
│
│   with module.castai-terraform.castai_rebalancing_job.spots,
│   on .terraform/modules/castai-terraform/castai.tf line 117, in resource "castai_rebalancing_job" "spots":
│  117: resource "castai_rebalancing_job" "spots" {
│
╵
Error: Error returned by 'terraform plan': Failed to run terraform: exit status 1

Currently, each cluster must be created with its own rebalancing schedule and rebalancing job ID. Reusing the same schedule and job for a new cluster is not supported via Terraform at the moment.

What steps should I follow to sequentially migrate my CAST AI installation to Terraform (IaC)?

1. Disable autoscaling in CAST AI.
2. Activate the node pools that are not managed by CAST AI.
3. Provision nodes from the cloud provider for these node pools.
4. Delete all nodes from each node template, handling one node template at a time.
5. Verify that all CAST AI nodes are deleted and that only the cloud provider nodes remain operational.
6. Finally, disconnect CAST AI from the cluster.