Hosted model deployment

Model deployments

The Model deployments feature allows you to deploy and manage AI models directly in your Cast AI-connected Kubernetes cluster. This guide explains how to set up and use model deployments with Cast AI.

📘

Note

Currently, Cast AI supports deploying a subset of Ollama models. The available models depend on your cluster's region and GPU availability.

Prerequisites

Before deploying models, ensure your cluster meets these requirements:

  1. The cluster must be connected to Cast AI and running in automated optimization mode (Phase 2).
  2. GPU drivers must be installed on your cluster.
  3. Your GPU daemonset must include this required toleration:
    tolerations:
    - key: "scheduling.cast.ai/node-template"
      operator: "Exists"
    

You can apply the above toleration to the appropriate daemonset using the following command:

kubectl patch daemonset <daemonset-name> -n <daemonset-namespace> --type=json -p='[{"op": "add", "path": "/spec/template/spec/tolerations", "value": [{"key": "scheduling.cast.ai/node-template", "operator": "Exists"}]}]'
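
To confirm that the toleration is present on the daemonset's pod template, you can inspect it directly (the daemonset name and namespace are placeholders, as above):

kubectl get daemonset <daemonset-name> -n <daemonset-namespace> -o jsonpath='{.spec.template.spec.tolerations}'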

Setting up a cluster for model deployments

  1. Navigate to AI Enabler > Model Deployments in the Cast AI console.
  2. Click Install AI Enabler.
  3. Select your cluster from the list.

📘

Note

Only eligible clusters will appear in this list.

  4. Run the provided script in your terminal or cloud shell.

  5. Wait for the installation to complete.
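
Once the installation completes, you can check that the AI Enabler components are up. For example, the proxy described later in this guide runs in the castai-agent namespace:

kubectl get pods -n castai-agent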

Deploying a model

Once the AI Enabler is installed, you can deploy models to your cluster:

📘

Note

When you deploy a model, Cast AI automatically enables the Unscheduled pods policy if it is currently disabled. The policy will only affect model deployments since Cast AI activates just the node template for hosted models while keeping all other templates disabled.

  1. Select your cluster.
  2. Choose a supported model from the list.

📘

Note about GPU availability

Some models require specific GPU types. For example, 70b models need A100 GPUs. If the required GPU type is not available in your cluster's region, the model won't be shown as an available option.

  3. Configure the deployment:
    1. Specify a service name and port for accessing the deployed model within the cluster.
    2. Select an existing node template or let Cast AI create a custom template with the recommended configuration.

  4. Click Deploy to start the deployment.

The model will be deployed in the castai-llms namespace. You can monitor the deployment progress on the Model Deployments page. Once the deployment is finished, the model status will change from Deploying to Running.

Alternatively, you can monitor the deployment progress using the following command in your cloud shell or terminal:

kubectl get pod -n castai-llms
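
To follow the rollout continuously instead of re-running the command, you can watch the namespace:

kubectl get pod -n castai-llms -w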

📘

Note

Model deployment may take up to 25 minutes.

Supported models

Cast AI supports various hosted models. The available models depend on several factors:

  • Your cluster's region
  • Available GPU types in that region
  • Model routing capabilities

To get a current list of supported models and their specifications, including pricing, GPU requirements, and routing capability, use the List hosted model specs API endpoint:

GET /ai-optimizer/v1beta/hosted-model-specs
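
As a sketch, you can call this endpoint with the same X-API-Key header used elsewhere in this guide (the https://api.cast.ai base URL is an assumption; substitute the API base URL you use with Cast AI):

curl https://api.cast.ai/ai-optimizer/v1beta/hosted-model-specs \
-H 'Accept: application/json' \
-H 'X-API-Key: {CASTAI_API_KEY}'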

📘

Note

Model routing capability determines whether the model can be used by the AI Enabler's smart routing feature. Models marked as non-routable can still be deployed and used directly, but won't be included in automatic routing decisions.

Using deployed models

The deployed models are accessible through the AI Enabler Proxy, which runs in the castai-agent namespace as the castai-ai-optimizer-proxy component. You can access the proxy in several ways, as detailed below.

From within the cluster

Use the following endpoint:

http://castai-ai-optimizer-proxy.castai-agent.svc.cluster.local:443

Example request:

curl http://castai-ai-optimizer-proxy.castai-agent.svc.cluster.local:443/openai/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H 'X-API-Key: {CASTAI_API_KEY}' \
-X POST -d '{
  "model": "gpt-4-0125-preview",
  "messages": [
    {
      "role": "user",
      "content": "What kind of instance types to use in GCP for running an AI training model?"
    }
  ]
}'

From your local machine

You can access the models by port forwarding to the service, for example:

# Port forward to the proxy service
kubectl port-forward svc/castai-ai-optimizer-proxy 8080:443 -n castai-agent

# Make requests to the forwarded port
curl http://localhost:8080/openai/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H 'X-API-Key: {CASTAI_API_KEY}' \
-X POST -d '{
  "model": "gpt-4-0125-preview",
  "messages": [
    {
      "role": "user", 
      "content": "What kind of instance types to use in GCP for running an AI training model?"
    }
  ]
}'

The deployed models will be automatically included in the AI Enabler's routing decisions along with any other providers you have registered.