Getting started
The AI Enabler Proxy allows you to route requests to the best and cheapest Large Language Model (LLM). This guide provides instructions on how to configure and use the AI Enabler Proxy.
Early Access Feature
This feature is in early access. It may undergo changes based on user feedback and continued development. We recommend testing in non-production environments first and welcome your feedback to help us improve.
AI Enabler Proxy is a feature that allows you to route requests to different Large Language Model (LLM) providers based on complexity and associated cost.
You can run the AI Enabler Proxy in your Kubernetes cluster or use the one on the Cast AI platform. In both cases, the Proxy expects the request to follow the OpenAI API contract described in the OpenAI API Reference documentation. The response will also follow the OpenAI API contract.
The only supported endpoint is /openai/v1/chat/completions, which mimics OpenAI's /v1/chat/completions endpoint.
Streaming
The API fully supports both streaming and non-streaming responses.
To enable streaming, add "stream": true to your request body. When streaming is enabled, you'll receive the response as a data stream, following the same format as OpenAI's streaming responses.
Example request with streaming enabled:
curl https://llm.cast.ai/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "Authorization: Bearer $CASTAI_API_KEY" \
  -X POST -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "What kind of instance types to use in GCP for running an AI training model?"
      }
    ],
    "stream": true
  }'
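When testing streaming from a terminal, note that curl buffers output by default; its -N (--no-buffer) flag prints the streamed chunks as they arrive. This is only a command-line convenience, not an API requirement; the shortened payload below is just an illustration.

# Same endpoint as above, with curl's output buffering disabled so streamed
# chunks are printed as soon as they arrive.
curl -N https://llm.cast.ai/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $CASTAI_API_KEY" \
  -X POST -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'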
Supported providers
You can find the list of supported LLM providers and their supported models here. Cast AI can proxy requests to any provider and model combination from this list once they are registered.
Model quantization and precision
When working with Large Language Models (LLMs), model quantization - the process of reducing model precision to decrease memory usage and increase inference speed - plays an important role in balancing performance and resource utilization.
Understanding model precision
Models can be quantized to different precision levels:
- 16-bit (FP16): Full precision, offering the highest accuracy
- 8-bit (INT8): Reduced precision with good accuracy-performance balance
- 4-bit (INT4): Lowest supported precision, maximizing performance at the cost of some accuracy
The AI Enabler uses different quantization levels for different scenarios:
- Routing/Recommendations: Recommendations are based on full precision (16-bit) model performance to ensure the highest accuracy in model selection.
- Self-hosted deployment: When deploying models, they use optimized quantization (typically 4-bit or 8-bit) by default to balance performance and resource usage.
Viewing model quantization
The AI Enabler /ai-optimizer/v1beta/hosted-model-specs API endpoint returns the quantization format for each model using the GGUF standard (e.g., Q8_0, Q4_K_M). The API response lets you view the specific quantization being used for any model. For example:
{
  "items": [
    {
      "model": "llama3.1:8b",
      "description": "Llama 3.1 8B is a compact 8 billion parameter model balancing performance and efficiency. It features a 128K token context window, multilingual support, and optimized low-latency inference. Ideal for startups and mobile apps, it handles content generation, summarization, and basic language tasks effectively.",
      "cpu": 6,
      "memoryMib": 16384,
      "provider": "ollama",
      "tokensPerSecond": 50,
      "createTime": "2024-11-08T12:36:45.258213Z",
      "routable": true,
      "quantization": "Q4_K_M",
      "regions": [
        {
          "name": "us-west1",
          "pricePerHour": "0.24082",
          "cloud": "GCP",
          "instanceType": "n1-standard-8",
          "gpuCount": 1,
          "gpuName": "nvidia-tesla-t4"
        }
      ]
    }
  ]
}
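To check the quantization of your hosted models from the command line, you can call this endpoint and filter the response with jq. The sketch below assumes the endpoint is exposed on the main Cast AI API host (api.cast.ai) with the path shown above:

# List each hosted model together with its quantization format.
# Assumption: the hosted-model-specs endpoint is served from api.cast.ai.
curl https://api.cast.ai/ai-optimizer/v1beta/hosted-model-specs \
  -H 'Accept: application/json' \
  -H "X-API-Key: $CASTAI_API_KEY" |
  jq -r '.items[] | "\(.model)\t\(.quantization)"'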
Note
When using the routing capabilities, be aware that while recommendations are based on full precision models, the actual deployed models may use lower precision quantization for optimal performance. Consider this difference when evaluating model performance against recommendations.
Register LLM providers
To enable the AI Enabler Proxy to route your requests to the appropriate LLM provider, you must register the providers you want to use (e.g., OpenAI, Gemini, Groq, Azure).
To register the LLM providers, make a POST request to the relevant Cast AI API endpoint. Below is an example of OpenAI, Azure, Gemini, and Vertex AI providers being registered, specifying authentication, available models, and provider-specific parameters.
curl https://api.cast.ai/v1/llm/providers \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "X-API-Key: $CASTAI_API_KEY" \
  -X POST -d '{
    "providers": [
      {
        "name": "openai-gpt3.5",
        "supportedProvider": "OPENAI",
        "apiKey": "<openai-api-key-1>",
        "models": ["gpt-3.5-turbo-0125"]
      },
      {
        "name": "openai-gpt4+",
        "supportedProvider": "OPENAI",
        "apiKey": "<openai-api-key-2>",
        "models": ["gpt-4o-2024-05-13", "gpt-4-0613"]
      },
      {
        "name": "azure-provider",
        "supportedProvider": "AZURE",
        "url": "https://something-azure-openai.openai.azure.com",
        "apiKey": "<azure-api-key>",
        "apiVersion": "2024-02-01",
        "models": ["gpt-3.5-turbo-0125", "gpt-3.5-turbo-0301", "gpt-4o"],
        "isHosted": true
      },
      {
        "name": "gemini-api-provider",
        "supportedProvider": "GEMINI",
        "apiKey": "<gemini-api-key>",
        "models": ["gemini-1.5-flash", "gemini-1.5-pro"]
      },
      {
        "name": "vertex-ai-gemini-provider",
        "supportedProvider": "VERTEXAIGEMINI",
        "apiKey": "<gcloud-access-token>",
        "models": ["gemini-1.5-flash", "gemini-1.5-pro"],
        "url": "https://us-central1-aiplatform.googleapis.com/v1/projects/some-project/locations/us-central1",
        "isHosted": true
      }
    ]
  }'
- Replace $CASTAI_API_KEY with your actual Cast AI API key, and the provider API key placeholders (e.g., <openai-api-key-1>) with the API keys for the providers you are registering.
- Modify the supportedProvider field to match the provider you are registering.
- Specify the models you want to use for each provider in the models array.
- The isHosted field specifies whether the LLM provider is hosted on your side and should be picked over the non-hosted ones.

Note that you may register a single provider multiple times. For instance, you can have one OpenAI provider per OpenAI API key to limit the models that can be used with each key (the example above registers two OpenAI providers with different keys and model lists).
Note
The provider API keys are not stored on the Cast AI side. They are securely stored in a secret vault and accessed only when proxying or routing requests. Cast AI stores only the last 4 characters of each used API key for reporting purposes.
Configure the Proxy
To configure the Proxy's behavior, such as enabling request routing and prompt sharing, follow these steps:
- Make a PUT request to the Cast AI API endpoint for updating proxy settings:
curl https://api.cast.ai/v1/llm/settings \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "X-API-Key: $CASTAI_API_KEY" \
  -X PUT -d '{"promptSharingEnabled": true, "routingEnabled": true, "apiKey": "<cast-ai-api-key>"}'
- Set promptSharingEnabled to true for Cast AI to store the prompts and allow you to provide feedback on prompt categorization and response quality. This feedback is used to improve the Proxy's decision-making.
- Set routingEnabled to true to enable request routing to the registered providers. If set to false, requests can only be proxied to OpenAI; no other provider is supported for proxying.
- (Optional) Set apiKey to the Cast AI API key these settings should apply to. If apiKey is unset, the settings will be organization-wide.
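For example, to apply the same settings to the whole organization, send the request without the apiKey field:

curl https://api.cast.ai/v1/llm/settings \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "X-API-Key: $CASTAI_API_KEY" \
  -X PUT -d '{"promptSharingEnabled": true, "routingEnabled": true}'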
Make requests to the Proxy
To start making requests to the AI Enabler Proxy running on the Cast AI platform, follow these steps:
- Generate an API Access Key from your Cast AI account.
- Include the API Access Key in the X-API-Key header or the Authorization header with the Bearer scheme when making requests to the Proxy endpoint.
- Make a POST request to the Proxy endpoint with the desired payload:
Using the Authorization header:

curl https://llm.cast.ai/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "Authorization: Bearer $CASTAI_API_KEY" \
  -X POST -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "What kind of instance types to use in GCP for running an AI training model?"
      }
    ]
  }'
Or using the X-API-Key header:

curl https://llm.cast.ai/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "X-API-Key: $CASTAI_API_KEY" \
  -X POST -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "What kind of instance types to use in GCP for running an AI training model?"
      }
    ]
  }'
Modify the request payload as needed, following the OpenAI API Reference documentation.
Note
You can specify any model that you've registered, and Cast AI will route the request to the appropriate provider.
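For instance, if you registered the Gemini provider from the example above, the same request can target one of its models and the Proxy will forward it accordingly:

curl https://llm.cast.ai/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "Authorization: Bearer $CASTAI_API_KEY" \
  -X POST -d '{
    "model": "gemini-1.5-flash",
    "messages": [
      {
        "role": "user",
        "content": "What kind of instance types to use in GCP for running an AI training model?"
      }
    ]
  }'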
Supported endpoints
Different tools and integrations may require different base URLs for the AI Enabler Proxy. The default endpoint works with most standard OpenAI SDK implementations and tools like Azure Prompt Flow. Here's a table of known endpoint requirements:
| Tool/Integration | Base URL | Notes |
|---|---|---|
| Default | https://llm.cast.ai/openai/v1/chat/completions | Use for OpenAI SDK, Azure Prompt Flow, and similar tools |
| LangChain | https://llm.cast.ai/openai/v1 | Required for LangChain integration |
| MemGPT | https://llm.cast.ai/openai | Required for MemGPT integration |
If you use a tool or SDK not listed here and encounter connectivity issues, try the default endpoint first. For tools requiring a different endpoint configuration, contact our team on the Slack community channel or Cast AI support.
We regularly update this list as we verify endpoint requirements for different tools and SDKs.
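As a rough sketch, a tool that reads the standard OpenAI environment variables (the official OpenAI SDKs do; whether your specific tool honors them is something to verify) can be pointed at the Proxy by overriding the base URL and API key. Which base URL to use depends on how the tool builds request paths, so check it against the table above:

# Assumption: the tool reads OPENAI_BASE_URL and OPENAI_API_KEY.
# The base URL below suits tools that append /chat/completions themselves;
# adjust it per the table above if your tool expects a different one.
export OPENAI_BASE_URL="https://llm.cast.ai/openai/v1"
export OPENAI_API_KEY="$CASTAI_API_KEY"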
Run the AI Enabler Proxy in-cluster
If you prefer to run the AI Enabler Proxy in your own Kubernetes cluster, follow these steps:
- Install the AI Enabler Proxy using Helm:
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
helm upgrade --install castai-ai-optimizer-proxy castai-helm/castai-ai-optimizer-proxy \
-n castai-agent --create-namespace \
--set castai.apiKey=<CASTAI_API_KEY>,castai.clusterID=<CLUSTER_ID>,castai.apiURL=https://api.cast.ai
Replace <CASTAI_API_KEY> with your actual Cast AI API key and <CLUSTER_ID> with the ID of your Kubernetes cluster.
- Make requests to the in-cluster Proxy endpoint. The requests are the same as before, except that you no longer need to provide an authorization header with the Cast AI API key. If you have a pod running in the same cluster, you can access the Proxy like so:
curl http://castai-ai-optimizer-proxy.castai-agent.svc.cluster.local:443/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -v -X POST -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "How to use golang generics?"
      }
    ]
  }'
Ensure you have registered the providers and adjusted the proxy settings on the Cast AI platform as described in prior sections.
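To sanity-check the deployment before sending traffic, you can confirm that the proxy pod and service are up, and optionally port-forward the service for a quick test from your workstation. This sketch assumes the default namespace and release name used in the Helm command above:

# Check that the proxy pod and service created by the Helm release are running.
kubectl get pods -n castai-agent | grep ai-optimizer-proxy
kubectl get svc -n castai-agent castai-ai-optimizer-proxy

# Optionally forward the service locally and send a test request.
kubectl port-forward -n castai-agent svc/castai-ai-optimizer-proxy 8080:443
# In another terminal:
curl http://localhost:8080/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -X POST -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]}'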
Viewing your generative AI savings report
After setting up the AI Enabler Proxy, you'll want to see how much you're saving by using Cast AI's intelligent routing. The generative AI savings report becomes available once you make requests through the proxy.
Requirements for the savings report
To see your savings data:
- Ensure you have properly registered your LLM providers
- Make at least a few successful requests through the proxy
- Wait a short time for the data to be processed (usually just a few minutes)
The report will automatically appear in your Cast AI console once there is actual usage data to analyze. This helps ensure the savings calculations are based on real traffic patterns rather than estimates.