Hosted model deployment
Model deployments
The Model deployments feature allows you to deploy and manage AI models directly in your Cast AI-connected Kubernetes cluster. This guide explains how to set up and use model deployments with Cast AI.
Note
Currently, Cast AI supports deploying a subset of Ollama models. The available models depend on your cluster's region and GPU availability.
Prerequisites
Before deploying models, ensure your cluster meets these requirements:
- The cluster must be connected to Cast AI and running in automated optimization mode (Phase 2)
- GPU drivers must be installed on your cluster with the correct tolerations:
- Follow the GPU driver installation guide to check whether your CSP provides GPU drivers by default or whether you need to install them yourself.
- Your GPU daemonset must include this required toleration:
tolerations:
  - key: "scheduling.cast.ai/node-template"
    operator: "Exists"
You can apply the above toleration to the appropriate daemonset using the following command:
kubectl patch daemonset <daemonset-name> -n <daemonset-namespace> --type=json -p='[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "scheduling.cast.ai/node-template", "operator": "Exists"}]}]'
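To confirm the patch took effect, one way is to inspect the daemonset's pod template for the toleration, for example:
# Print the tolerations defined on the daemonset's pod template
kubectl get daemonset <daemonset-name> -n <daemonset-namespace> -o jsonpath='{.spec.template.spec.tolerations}'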
Setting up a cluster for model deployments
- Navigate to AI Enabler > Model Deployments in the Cast AI console.
- Click Install AI Enabler.
- Select your cluster from the list.
Note
Only eligible clusters will appear in this list.
- Run the provided script in your terminal or cloud shell.
- Wait for the installation to complete. You can verify the installation as shown below.
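One way to verify the installation is to check that the AI Enabler Proxy components (described in the Using deployed models section below) are running in the castai-agent namespace, for example:
# List the AI Enabler Proxy service and the pods in the castai-agent namespace
kubectl get svc castai-ai-optimizer-proxy -n castai-agent
kubectl get pods -n castai-agent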
Deploying a model
Once the AI Enabler is installed, you can deploy models to your cluster:
Note
When you deploy a model, Cast AI automatically enables the Unscheduled pods policy if it is currently disabled. The policy will only affect model deployments since Cast AI activates just the node template for hosted models while keeping all other templates disabled.
- Select your cluster.
- Choose a supported model from the list.
Note about GPU availability
Some models require specific GPU types. For example, 70b models need A100 GPUs. If the required GPU type is not available in your cluster's region, the model won't be shown as an available option.
- Configure the deployment:
- Specify a service name and port for accessing the deployed model within the cluster.
- Select an existing node template or let Cast AI create a custom template with the recommended configuration.
- Click Deploy to start the deployment.
The model will be deployed in the castai-llms namespace. You can monitor the deployment progress on the Model Deployments page. Once the deployment is finished, the model status will change from Deploying to Running.
Alternatively, you can monitor the deployment progress using the following command in your cloud shell or terminal:
kubectl get pod -n castai-llms
Note
Model deployment may take up to 25 minutes.
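If you specified a service name and port during configuration, you can also check the resulting in-cluster service once the pods are running. This sketch assumes the service is created alongside the model in the castai-llms namespace:
# List services created for the deployed models
kubectl get svc -n castai-llms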
Supported models
Cast AI supports various hosted models. The available models depend on several factors:
- Your cluster's region
- Available GPU types in that region
- Model routing capabilities
To get a current list of supported models and their specifications, including pricing, GPU requirements, and routing capability, use the List hosted model specs API endpoint:
GET /ai-optimizer/v1beta/hosted-model-specs
Note
Model routing capability determines whether the model can be used by the AI Enabler's smart routing feature. Models marked as non-routable can still be deployed and used directly, but won't be included in automatic routing decisions.
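As an illustration, a request to this endpoint could look like the example below. It assumes the standard Cast AI API base URL (https://api.cast.ai) and the same X-API-Key header used for the proxy examples later in this guide:
# List hosted model specs, including pricing, GPU requirements, and routing capability
curl 'https://api.cast.ai/ai-optimizer/v1beta/hosted-model-specs' \
  -H 'X-API-Key: {CASTAI_API_KEY}' \
  -H 'Accept: application/json'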
Using deployed models
The deployed models are accessible through the AI Enabler Proxy, which runs in the castai-agent namespace as the castai-ai-optimizer-proxy component. You can access the router in several ways, detailed below.
From within the cluster
Use the following endpoint:
http://castai-ai-optimizer-proxy.castai-agent.svc.cluster.local:443
Example request:
curl http://castai-ai-optimizer-proxy.castai-agent.svc.cluster.local:443/openai/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H 'X-API-Key: {CASTAI_API_KEY}' \
-X POST -d '{
"model": "gpt-4-0125-preview",
"messages": [
{
"role": "user",
"content": "What kind of instance types to use in GCP for running an AI training model?"
}
]
}'
From your local machine
You can access the models by port forwarding to the service, for example:
# Port forward to the proxy service
kubectl port-forward svc/castai-ai-optimizer-proxy 8080:443 -n castai-agent
# Make requests to the forwarded port
curl http://localhost:8080/openai/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H 'X-API-Key: {CASTAI_API_KEY}' \
-X POST -d '{
"model": "gpt-4-0125-preview",
"messages": [
{
"role": "user",
"content": "What kind of instance types to use in GCP for running an AI training model?"
}
]
}'
The deployed models will be automatically included in the AI Enabler's routing decisions along with any other providers you have registered.