Getting started
AI Enabler Proxy is a feature that routes requests to different external Large Language Model (LLM) providers, acting as a gateway between your applications and those providers. This guide explains how to configure and use it.
You can run the AI Enabler Proxy in your Kubernetes cluster or use the one on the Cast AI platform. In both cases, the Proxy expects the request to follow the OpenAI API contract described in the OpenAI API Reference documentation. The response will also follow the OpenAI API contract.
The only supported endpoint is `/openai/v1/chat/completions`, which mimics OpenAI's `/v1/chat/completions` endpoint.
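As a quick illustration of that contract, the sketch below parses an abbreviated chat-completion response of the kind the proxy returns (all field values here are made up for the example):

```python
import json

# Abbreviated chat-completion response in the OpenAI format the proxy returns.
# The values are illustrative, not real output.
response_body = """{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "gpt-4",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Use GPU instances."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 4, "total_tokens": 16}
}"""

data = json.loads(response_body)
reply = data["choices"][0]["message"]["content"]   # the assistant's answer
tokens = data["usage"]["total_tokens"]             # token usage for billing
print(reply, tokens)  # -> Use GPU instances. 16
```

Any client code written against the OpenAI response shape works unchanged against the proxy.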
Installation options
Choose from three installation methods based on your infrastructure management approach:
Console installation
Install AI Enabler through the Cast AI console interface:
- Navigate to AI Enabler > Model Deployments in the Cast AI console
- Click Install AI Enabler
- Select your cluster from the list (only eligible clusters will appear)
- Run the provided script in your terminal or cloud shell
- Wait for the installation to complete
Note: Only clusters connected to Cast AI and running in automated optimization mode (Phase 2) will appear in the eligible clusters list.
Terraform installation
AI Enabler can be installed automatically using our official Terraform modules. Set the `install_ai_optimizer` variable to `true` in your configuration.
EKS example
```hcl
module "castai-eks-cluster" {
  source = "castai/eks-cluster/castai"

  aws_account_id     = var.aws_account_id
  aws_cluster_region = var.aws_cluster_region
  aws_cluster_name   = var.aws_cluster_name
  castai_api_token   = var.castai_api_token

  # Enable AI Enabler
  install_ai_optimizer = true

  # Other configuration...
}
```

Available Terraform modules:
Helm installation
For direct installation without using Terraform modules or the console:
```shell
# Add the Cast AI Helm repository
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update

# Install the AI Enabler Proxy
helm upgrade -i castai-ai-optimizer-proxy castai-helm/castai-ai-optimizer-proxy -n castai-agent \
  --set castai.apiKey=<CASTAI_API_TOKEN> \
  --set castai.clusterID=<CASTAI_CLUSTER_ID> \
  --set castai.apiURL=<API_URL> \
  --set createNamespace=true
```

Replace the placeholders with your actual values:

- `<CASTAI_API_TOKEN>`: Your Cast AI API key
- `<CASTAI_CLUSTER_ID>`: Your Cast AI cluster ID
- `<API_URL>`: Cast AI API URL (`https://api.cast.ai` or `https://api.eu.cast.ai`)
Streaming
The API supports both streaming and non-streaming responses.
To enable streaming, add "stream": true to your request body. When streaming is enabled, you'll receive the response as a data stream, following the same format as OpenAI's streaming responses.
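Each streamed event is an SSE `data:` line carrying a chunk in OpenAI's streaming format, terminated by `data: [DONE]`. A minimal client-side sketch of reassembling the text deltas (the helper and sample chunks are illustrative, not part of any SDK):

```python
import json

def stream_content(raw_sse: str):
    """Yield the text deltas from an OpenAI-style SSE stream body."""
    for line in raw_sse.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Two hypothetical chunks followed by the terminator:
sample = (
    'data: {"choices":[{"delta":{"content":"Hel"}}]}\n'
    '\n'
    'data: {"choices":[{"delta":{"content":"lo"}}]}\n'
    'data: [DONE]\n'
)
print("".join(stream_content(sample)))  # -> Hello
```

In practice you would feed the lines from the HTTP response body into such a parser as they arrive, rather than from a string.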
Example request with streaming enabled:
```shell
curl https://llm.cast.ai/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "Authorization: Bearer $CASTAI_API_KEY" \
  -X POST -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "What kind of instance types should I use in GCP for running an AI training model?"
      }
    ],
    "stream": true
  }'
```

Supported providers
You can find the list of supported LLM providers and their supported models here. Cast AI can proxy requests to any provider and model combination from this list once they are registered.
Model quantization and precision
When working with Large Language Models (LLMs), model quantization (the process of reducing model precision to decrease memory usage and increase inference speed) plays an important role in balancing performance and resource utilization.
Understanding model precision
Models can be quantized to different precision levels:
- 16-bit (FP16): Full precision, offering the highest accuracy
- 8-bit (INT8): Reduced precision with good accuracy-performance balance
- 4-bit (INT4): Lowest supported precision, maximizing performance at the cost of some accuracy
When deploying self-hosted models, AI Enabler uses optimized quantization (typically 4-bit or 8-bit) by default to balance performance and resource usage.
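As a rule of thumb, weight memory scales with bytes per parameter: roughly 2 bytes at FP16, 1 at INT8, and 0.5 at INT4. A back-of-the-envelope sketch (it ignores KV cache, activations, and runtime overhead, so real requirements are higher):

```python
# Approximate bytes needed per parameter at each precision level.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gib(num_params: float, precision: str) -> float:
    """Approximate memory (GiB) for the model weights alone at a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30

# An 8-billion-parameter model (e.g. llama3.1:8b) at each precision:
for precision in ("FP16", "INT8", "INT4"):
    print(f"{precision}: ~{weight_memory_gib(8e9, precision):.1f} GiB")
# FP16 needs roughly 4x the memory of INT4 for the same weights.
```

This is why 4-bit and 8-bit quantization lets the same model fit on much smaller (and cheaper) instances.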
Viewing model quantization
The AI Enabler `/ai-optimizer/v1beta/hosted-model-specs` API endpoint returns the quantization format for each model using the GGUF standard (e.g., `Q8_0`, `Q4_K_M`). The API response lets you view the specific quantization being used for any model. For example:
```json
{
  "items": [
    {
      "model": "llama3.1:8b",
      "description": "Llama 3.1 8B is a compact 8 billion parameter model balancing performance and efficiency. It features a 128K token context window, multilingual support, and optimized low-latency inference. Ideal for startups and mobile apps, it handles content generation, summarization, and basic language tasks effectively.",
      "cpu": 6,
      "memoryMib": 16384,
      "provider": "ollama",
      "tokensPerSecond": 50,
      "createTime": "2024-11-08T12:36:45.258213Z",
      "quantization": "Q4_K_M",
      "regions": [
        {
          "name": "us-west1",
          "pricePerHour": "0.24082",
          "cloud": "GCP",
          "instanceType": "n1-standard-8",
          "gpuCount": 1,
          "gpuName": "nvidia-tesla-t4"
        }
      ]
    }
  ]
}
```

Register LLM providers
To enable the AI Enabler Proxy to route your requests to the appropriate LLM provider, you must register the providers you want to use (e.g., OpenAI, Gemini, Groq, Azure).
To register the LLM providers, make a POST request to the relevant Cast AI API endpoint. Below is an example of OpenAI, Azure, Gemini, and VertexAI providers being registered, specifying authentication, available models, and provider-specific parameters.
```shell
curl https://api.cast.ai/v1/llm/providers \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "X-API-Key: $CASTAI_API_KEY" \
  -X POST -d '{
    "providers": [
      {
        "name": "openai-gpt3.5",
        "supportedProvider": "OPENAI",
        "apiKey": "<openai-api-key-1>",
        "models": ["gpt-3.5-turbo-0125"]
      },
      {
        "name": "openai-gpt4+",
        "supportedProvider": "OPENAI",
        "apiKey": "<openai-api-key-2>",
        "models": ["gpt-4o-2024-05-13", "gpt-4-0613"]
      },
      // Azure OpenAI configuration
      {
        "name": "azure-provider",
        "supportedProvider": "AZURE",
        "url": "https://something-azure-openai.openai.azure.com",
        "apiKey": "<azure-api-key>",
        "apiVersion": "2024-02-01",
        "models": ["gpt-3.5-turbo-0125", "gpt-3.5-turbo-0301", "gpt-4o"],
        "isHosted": true
      },
      // Google'"'"'s Gemini API configuration
      {
        "name": "gemini-api-provider",
        "supportedProvider": "GEMINI",
        "apiKey": "<gemini-api-key>",
        "models": ["gemini-1.5-flash", "gemini-1.5-pro"]
      },
      // Google Cloud Vertex AI Gemini configuration
      {
        "name": "vertex-ai-gemini-provider",
        "supportedProvider": "VERTEXAIGEMINI",
        "apiKey": "<gcloud-access-token>",
        "models": ["gemini-1.5-flash", "gemini-1.5-pro"],
        "url": "https://us-central1-aiplatform.googleapis.com/v1/projects/some-project/locations/us-central1",
        "isHosted": true
      }
    ]
  }'
```

The `//` comments above are for illustration only; remove them before sending, since JSON does not allow comments.

- Replace `$CASTAI_API_KEY` with your actual Cast AI API key, and `<api_key>` with the API key for the provider you are registering.
- Modify the `supportedProvider` field to match the provider you are registering.
- Specify the models you want to use for each provider in the `models` array.
- The `isHosted` field specifies whether the LLM Provider is hosted on the user side and should be picked over the non-hosted ones.
Note that you may register a single Provider multiple times. For instance, you can have an OpenAI Provider per OpenAI API Key to limit the models that can be used by each API Key.
Note: The Provider API Keys are not stored on the Cast AI side. They are securely stored in a Secret Vault and accessed only when proxying requests. Cast AI stores only the last 4 characters of each used API Key for reporting purposes.
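Because the registration body is plain JSON, it can be assembled programmatically before sending it with curl or an HTTP client. A sketch (the `provider_entry` helper is illustrative, not part of any Cast AI SDK):

```python
import json

def provider_entry(name, supported_provider, api_key, models,
                   url=None, api_version=None, is_hosted=None):
    """Build one entry for the POST /v1/llm/providers payload.

    Field names mirror the request body shown above; optional fields are
    only included when set.
    """
    entry = {
        "name": name,
        "supportedProvider": supported_provider,
        "apiKey": api_key,
        "models": models,
    }
    if url is not None:
        entry["url"] = url
    if api_version is not None:
        entry["apiVersion"] = api_version
    if is_hosted is not None:
        entry["isHosted"] = is_hosted
    return entry

payload = {"providers": [
    provider_entry("openai-gpt4+", "OPENAI", "<openai-api-key>",
                   ["gpt-4o-2024-05-13"]),
    provider_entry("azure-provider", "AZURE", "<azure-api-key>",
                   ["gpt-4o"],
                   url="https://something-azure-openai.openai.azure.com",
                   api_version="2024-02-01", is_hosted=True),
]}
body = json.dumps(payload)  # valid JSON, ready to send
```

Building the body this way also guarantees it contains no stray comments or trailing commas, which the API would reject.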
Make requests to the Proxy
To start making requests to the AI Enabler Proxy running on the Cast AI platform, follow these steps:
- Generate an API Access Key from your Cast AI account.
- Include the API Access Key in the `X-API-Key` header or the `Authorization` header with the `Bearer` schema when making requests to the Proxy endpoint.
- Make a POST request to the Proxy endpoint with the desired payload:
```shell
curl https://llm.cast.ai/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "Authorization: Bearer $CASTAI_API_KEY" \
  -X POST -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "What kind of instance types should I use in GCP for running an AI training model?"
      }
    ]
  }'
```

```shell
curl https://llm.cast.ai/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "X-API-Key: $CASTAI_API_KEY" \
  -X POST -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "What kind of instance types should I use in GCP for running an AI training model?"
      }
    ]
  }'
```

Modify the request payload as needed, following the OpenAI API Reference documentation.
Note: To attach custom metadata for usage tracking and cost analysis, add tags to your requests. See Tags for details.
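The same request can be built with Python's standard library. This sketch constructs the request without sending it (the API key is a placeholder; swap the commented line in to use the `X-API-Key` header style instead):

```python
import json
import urllib.request

URL = "https://llm.cast.ai/openai/v1/chat/completions"
API_KEY = "<CASTAI_API_KEY>"  # placeholder, not a real key

payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello"}],
}

# Either auth header style works; this sketch uses the Bearer schema.
req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
        # Alternatively: "X-API-Key": API_KEY,
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted here.
```

With the OpenAI SDK or similar clients, pointing the base URL at the proxy achieves the same thing without hand-built requests.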
Supported endpoints
Different tools and integrations may require different base URLs for the AI Enabler Proxy. The default endpoint works with most standard OpenAI SDK implementations and tools like Azure Prompt Flow. Here's a table of known endpoint requirements:
| Tool/Integration | Base URL | Notes |
|---|---|---|
| Default | https://llm.cast.ai/openai/v1/chat/completions | Use for OpenAI SDK, Azure Prompt Flow, and similar tools |
| LangChain | https://llm.cast.ai/openai/v1 | Required for LangChain integration |
| MemGPT | https://llm.cast.ai/openai | Required for MemGPT integration |
If you use a tool or SDK not listed here and encounter connectivity issues, try the default endpoint first. For tools requiring a different endpoint configuration, contact our team on the Slack community channel or Cast AI support.
We regularly update this list as we verify endpoint requirements for different tools and SDKs.
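Since only the base URL differs between integrations, the table above can be captured in a small lookup (the helper name and the fallback-to-default behavior are illustrative):

```python
# Known AI Enabler base URLs per tool/integration, from the table above.
BASE_URLS = {
    "default": "https://llm.cast.ai/openai/v1/chat/completions",
    "langchain": "https://llm.cast.ai/openai/v1",
    "memgpt": "https://llm.cast.ai/openai",
}

def base_url_for(tool: str) -> str:
    """Return the proxy base URL for a tool, falling back to the default."""
    return BASE_URLS.get(tool.lower(), BASE_URLS["default"])

print(base_url_for("LangChain"))  # -> https://llm.cast.ai/openai/v1
```

Unlisted tools fall back to the default endpoint, matching the guidance above to try it first.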
Run the AI Enabler Proxy in-cluster
If you prefer to run the AI Enabler Proxy in your own Kubernetes cluster, follow these steps:
- Install the AI Enabler Proxy using Helm:
```shell
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update

helm upgrade -i castai-ai-optimizer-proxy castai-helm/castai-ai-optimizer-proxy -n castai-agent \
  --set castai.apiKey=$CASTAI_API_TOKEN \
  --set castai.clusterID=$CASTAI_CLUSTER_ID \
  --set castai.apiURL=https://api.cast.ai
```

Replace the following values:
| Value | Description |
|---|---|
| `CASTAI_API_TOKEN` | Your Cast AI API key |
| `CASTAI_CLUSTER_ID` | The ID of your Kubernetes cluster |
The in-cluster proxy supports the following API key configuration:
| Helm value | Required | Description |
|---|---|---|
| `castai.apiKey` | Yes | Used for internal proxy-to-SaaS communication, including sending telemetry and logs to Cast AI. You can also use `castai.apiKeySecretRef` to reference a Kubernetes secret. |
| `castai.apiKeyFallback` | No | Enables auth-less proxy requests. When set, incoming requests without an API key header use this fallback key for authentication. Without it, requests missing an API key return a 401 Unauthorized error. |
- Make requests to the in-cluster proxy endpoint:
```shell
curl http://castai-ai-optimizer-proxy.castai-agent.svc.cluster.local:443/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "X-API-Key: $CASTAI_API_KEY" \
  -X POST -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "What instance types should I use for AI training?"
      }
    ]
  }'
```

Viewing your generative AI savings report
After setting up the AI Enabler Proxy, you can track how many tokens you use and what they cost. The Analytics report becomes available once you make requests through the proxy.
Requirements for the savings report
To see your savings data:
- Ensure you have properly registered your LLM providers
- Make at least a few successful requests through the proxy
- Wait a short time for the data to be processed (usually just a few minutes)
The report will automatically appear in your Cast AI console once there is actual usage data to analyze. This helps ensure the savings calculations are based on real traffic patterns rather than estimates.