GPU

Are there any documents that explain how the cost of GPU is calculated?

GPU cost calculation is still under investigation, and we will share our results soon.

You can find more information about CPU vs. memory cost calculation here.

GPU-optimized instance autoscaling is currently supported only on AWS EKS with Nvidia instances, with more to come soon. Learn more about GPU support here.

Is there a known or best-practice way to keep an extra instance available for even faster pod scheduling? CAST AI picks up new requests quickly, but there's still a 3-minute instance startup time in EC2. I'm looking for a way to always keep one extra GPU node hot with nothing scheduled on it, so that a new GPU pod request can be scheduled right away, avoiding the startup time.

The best way to achieve this is to create a dummy pod that has GPU limits and a low-priority priorityClass. Because its priority is lower, the scheduler preempts (evicts) the dummy pod when a new GPU pod needs the space, and the new pod takes its place. The evicted dummy pod then becomes pending, so CAST AI creates a new node for it, which restores the extra capacity.

Learn more in Kubernetes docs on Pod Priority and Preemption.
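
As an illustration of the dummy-pod approach, here is a minimal sketch in Python that builds the two Kubernetes objects involved and prints them as JSON. The name gpu-placeholder, the priority value of -1, and the pause image tag are assumptions chosen for the example, and nvidia.com/gpu assumes the standard NVIDIA device plugin resource name. Your cluster may also need node selectors or tolerations on the placeholder pod so that it lands on GPU nodes.

```python
import json


def placeholder_manifests(name="gpu-placeholder", gpus=1, priority=-1):
    """Build a low-priority PriorityClass and a Deployment that reserves GPUs.

    All names and values here are illustrative assumptions, not CAST AI defaults.
    """
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        # Lower than the default pod priority (0), so regular pods can preempt it.
        "value": priority,
        "globalDefault": False,
        "description": "Placeholder pods that real GPU workloads may preempt.",
    }
    labels = {"app": name}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "priorityClassName": name,
                    "terminationGracePeriodSeconds": 0,
                    "containers": [
                        {
                            # The pause container does nothing; it only holds the GPU.
                            "name": "pause",
                            "image": "registry.k8s.io/pause:3.9",
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                        }
                    ],
                },
            },
        },
    }
    return [priority_class, deployment]


def as_kubectl_list(manifests):
    """Wrap manifests in a v1 List so they can be applied in one call."""
    return json.dumps(
        {"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2
    )


if __name__ == "__main__":
    print(as_kubectl_list(placeholder_manifests()))
```

If saved as placeholder.py (a hypothetical file name), the output can be piped to kubectl apply -f - to create both objects. Using a Deployment rather than a bare pod means the placeholder is recreated after each eviction, which is what makes the spare capacity come back.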

We run a lot of video rendering jobs using Argo workflows which means a lot of elasticity in pod/node count. It also requires us to be very GPU-heavy. The recommended sizes don't seem to account for that. Are there any future plans to address more GPU-intensive infrastructures?

CAST AI in read/write mode will automatically scale up and down based on the workload. If you have workloads that require GPUs, we can put them in a node template with the GPU families needed to scale correctly.

The Available Savings report is a snapshot in time and a simplified version of what the full autoscaler does, so the sizes it recommends may not reflect how highly elastic, GPU-heavy workloads are actually scaled.

I get this error from CAST AI. How do I deal with it?

Error:

"error": {
  "details": "",
  "message": "finding GPU attached instance image for \"cluster-api-ubuntu-2204-v1-27.*nvda\"",
  "reason": "ReasonUnknown"
},

CAST AI currently doesn't support provisioning GPU nodes with custom images, only with the default Google Cloud ones. If you need a custom image, its name in the project should be based on the GCP Nvidia image names and follow the pattern below:

// for images that have version in name, e.g.:
//     gke-1259-gke2300-cos-101-17162-210-12-v230516-c-nvda
//     gke-1259-gke2300-cos-101-17162-210-12-v230516-c-pre
//     gke-1251-gke500-cos-101-17162-40-1-v220921-pre
//     gke-1251-gke500-cos-101-17162-40-1-v220921-pre-nvda
// for images that don't have version in name, e.g.:
//     gke-12510-gke1200-cos-101-17162-210-18-c-cgpv1-nvda
//     gke-12510-gke1200-cos-101-17162-210-18-c-cgpv1-pre
//     gke-12510-gke1200-cos-101-17162-210-18-c-nvda
//     gke-12510-gke1200-cos-101-17162-210-18-c-pre
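
To sanity-check a custom image name before using it, a rough check like the following Python sketch may help. The regular expression is an assumption inferred only from the example names listed above; it is not CAST AI's actual matching logic, and real GKE image names may vary in ways it does not cover.

```python
import re

# Assumption: pattern inferred from the example names only, not CAST AI's matcher.
GKE_IMAGE = re.compile(
    r"^gke-\d+-gke\d+-cos-\d+(?:-\d+)+"  # GKE version, COS milestone, build numbers
    r"(?:-v\d{6})?"                      # optional version stamp, e.g. v230516
    r"(?:-[a-z0-9]+)*?"                  # optional flags such as c or cgpv1
    r"(?P<pre>-pre)?(?P<nvda>-nvda)?$"   # optional -pre and -nvda suffixes
)


def classify(image_name):
    """Return a rough label for an image name based on the inferred pattern."""
    match = GKE_IMAGE.match(image_name)
    if match is None:
        return "not GKE-style"
    return "GPU image" if match.group("nvda") else "non-GPU image"


if __name__ == "__main__":
    for candidate in (
        "gke-1259-gke2300-cos-101-17162-210-12-v230516-c-nvda",
        "gke-12510-gke1200-cos-101-17162-210-18-c-pre",
        "cluster-api-ubuntu-2204-v1-27-nvda",  # hypothetical custom image name
    ):
        print(candidate, "->", classify(candidate))
```

In this sketch, names ending in -nvda are treated as the GPU variants, which is how the examples above appear to be organized. A name like the cluster-api-ubuntu one from the error message does not fit the GKE-style pattern at all.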

How does CAST AI detect if the Nvidia driver has been deployed?

CAST AI detects the driver either by the name of its DaemonSet or by a label on that DaemonSet. Learn more about GPU driver installation here.