Getting started
The AI Enabler Proxy allows you to route requests to the best and cheapest Large Language Model (LLM). This guide provides instructions on how to configure and use the AI Enabler Proxy.
Early Access Feature
This feature is in early access. It may undergo changes based on user feedback and continued development. We recommend testing in non-production environments first and welcome your feedback to help us improve.
AI Enabler Proxy is a feature that allows you to route requests to different Large Language Model (LLM) providers based on complexity and associated cost.
You can run the AI Enabler Proxy in your Kubernetes cluster or use the one on the Cast AI platform. In both cases, the Proxy expects the request to follow the OpenAI API contract described in the OpenAI API Reference documentation. The response will also follow the OpenAI API contract.
The only supported endpoint is /openai/v1/chat/completions, which mimics OpenAI's /v1/chat/completions endpoint.
Streaming
The API fully supports both streaming and non-streaming responses.
To enable streaming, add "stream": true to your request body. When streaming is enabled, you'll receive the response as a data stream, following the same format as OpenAI's streaming responses.
Example request with streaming enabled:
curl https://llm.cast.ai/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "Authorization: Bearer $CASTAI_API_KEY" \
  -X POST -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "What kind of instance types to use in GCP for running an AI training model?"
      }
    ],
    "stream": true
  }'
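When testing streaming from a terminal, note that curl buffers output by default; its -N (--no-buffer) flag prints the streamed chunks as they arrive. This is only a command-line convenience, not an API requirement; the shortened payload below is just an illustration.

# Same endpoint as above, with curl's output buffering disabled so streamed
# chunks are printed as soon as they arrive.
curl -N https://llm.cast.ai/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $CASTAI_API_KEY" \
  -X POST -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'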
Supported providers
You can find the list of supported LLM providers and their supported models here. Cast AI can proxy requests to any provider and model combination from this list once they are registered.
Model quantization and precision
When working with Large Language Models (LLMs), model quantization - the process of reducing model precision to decrease memory usage and increase inference speed - plays an important role in balancing performance and resource utilization.
Understanding model precision
Models can be quantized to different precision levels:
- 16-bit (FP16): Full precision, offering the highest accuracy
- 8-bit (INT8): Reduced precision with good accuracy-performance balance
- 4-bit (INT4): Lowest supported precision, maximizing performance at the cost of some accuracy
The AI Enabler uses different quantization levels for different scenarios:
- Routing/Recommendations: Recommendations are based on full precision (16-bit) model performance to ensure the highest accuracy in model selection.
- Self-hosted deployment: When deploying models, they use optimized quantization (typically 4-bit or 8-bit) by default to balance performance and resource usage.
Viewing model quantization
The AI Enabler /ai-optimizer/v1beta/hosted-model-specs API endpoint returns the quantization format for each model using the GGUF standard (e.g., Q8_0, Q4_K_M). The API response lets you view the specific quantization being used for any model. For example:
{
  "items": [
    {
      "model": "llama3.1:8b",
      "description": "Llama 3.1 8B is a compact 8 billion parameter model balancing performance and efficiency. It features a 128K token context window, multilingual support, and optimized low-latency inference. Ideal for startups and mobile apps, it handles content generation, summarization, and basic language tasks effectively.",
      "cpu": 6,
      "memoryMib": 16384,
      "provider": "ollama",
      "tokensPerSecond": 50,
      "createTime": "2024-11-08T12:36:45.258213Z",
      "routable": true,
      "quantization": "Q4_K_M",
      "regions": [
        {
          "name": "us-west1",
          "pricePerHour": "0.24082",
          "cloud": "GCP",
          "instanceType": "n1-standard-8",
          "gpuCount": 1,
          "gpuName": "nvidia-tesla-t4"
        }
      ]
    }
  ]
}
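To check the quantization of your hosted models from the command line, you can call this endpoint and filter the response with jq. The sketch below assumes the endpoint is exposed on the main Cast AI API host (api.cast.ai) with the path shown above:

# List each hosted model together with its quantization format.
# Assumption: the hosted-model-specs endpoint is served from api.cast.ai.
curl https://api.cast.ai/ai-optimizer/v1beta/hosted-model-specs \
  -H 'Accept: application/json' \
  -H "X-API-Key: $CASTAI_API_KEY" |
  jq -r '.items[] | "\(.model)\t\(.quantization)"'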
Note
When using the routing capabilities, be aware that while recommendations are based on full precision models, the actual deployed models may use lower precision quantization for optimal performance. Consider this difference when evaluating model performance against recommendations.
Register LLM providers
To enable the AI Enabler Proxy to route your requests to the appropriate LLM provider, you must register the providers you want to use (e.g., OpenAI, Gemini, Groq, Azure).
To register the LLM providers, make a POST request to the relevant Cast AI API endpoint. Below is an example of OpenAI, Azure, Gemini, and Vertex AI providers being registered, specifying authentication, available models, and provider-specific parameters.
curl https://api.cast.ai/v1/llm/providers \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "X-API-Key: $CASTAI_API_KEY" \
  -X POST -d '{
    "providers": [
      {
        "name": "openai-gpt3.5",
        "supportedProvider": "OPENAI",
        "apiKey": "<openai-api-key-1>",
        "models": ["gpt-3.5-turbo-0125"]
      },
      {
        "name": "openai-gpt4+",
        "supportedProvider": "OPENAI",
        "apiKey": "<openai-api-key-2>",
        "models": ["gpt-4o-2024-05-13", "gpt-4-0613"]
      },
      {
        "name": "azure-provider",
        "supportedProvider": "AZURE",
        "url": "https://something-azure-openai.openai.azure.com",
        "apiKey": "<azure-api-key>",
        "apiVersion": "2024-02-01",
        "models": ["gpt-3.5-turbo-0125", "gpt-3.5-turbo-0301", "gpt-4o"],
        "isHosted": true
      },
      {
        "name": "gemini-api-provider",
        "supportedProvider": "GEMINI",
        "apiKey": "<gemini-api-key>",
        "models": ["gemini-1.5-flash", "gemini-1.5-pro"]
      },
      {
        "name": "vertex-ai-gemini-provider",
        "supportedProvider": "VERTEXAIGEMINI",
        "apiKey": "<gcloud-access-token>",
        "models": ["gemini-1.5-flash", "gemini-1.5-pro"],
        "url": "https://us-central1-aiplatform.googleapis.com/v1/projects/some-project/locations/us-central1",
        "isHosted": true
      }
    ]
  }'
- Replace $CASTAI_API_KEY with your actual Cast AI API key, and the provider API key placeholders (e.g., <openai-api-key-1>) with the API keys for the providers you are registering.
- Modify the supportedProvider field to match the provider you are registering.
- Specify the models you want to use for each provider in the models array.
- The isHosted field specifies whether the LLM provider is hosted on your side and should be picked over the non-hosted ones.

Note that you may register a single provider multiple times. For instance, you can have one OpenAI provider per OpenAI API key to limit the models that can be used with each key (the example above registers two OpenAI providers with different keys and model lists).
Note
The provider API keys are not stored on the Cast AI side. They are securely stored in a secret vault and accessed only when proxying or routing requests. Cast AI stores only the last 4 characters of each used API key for reporting purposes.
Configure the Proxy
To configure the Proxy's behavior, such as enabling request routing and prompt sharing, follow these steps:
- Make a PUT request to the Cast AI API endpoint for updating proxy settings:
curl https://api.cast.ai/v1/llm/settings \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "X-API-Key: $CASTAI_API_KEY" \
  -X PUT -d '{"promptSharingEnabled": true, "routingEnabled": true, "apiKey": "<cast-ai-api-key>"}'
- Set promptSharingEnabled to true for Cast AI to store the prompts and allow you to provide feedback on prompt categorization and response quality. This feedback is used to improve the Proxy's decision-making.
- Set routingEnabled to true to enable request routing to the registered providers. If set to false, requests can only be proxied to OpenAI; no other provider is supported for proxying.
- (Optional) Set apiKey to the Cast AI API key these settings should apply to. If apiKey is unset, the settings will be organization-wide.
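For example, to apply the same settings to the whole organization, send the request without the apiKey field:

curl https://api.cast.ai/v1/llm/settings \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "X-API-Key: $CASTAI_API_KEY" \
  -X PUT -d '{"promptSharingEnabled": true, "routingEnabled": true}'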
Make requests to the Proxy
To start making requests to the AI Enabler Proxy running on the Cast AI platform, follow these steps:
- Generate an API Access Key from your Cast AI account.
- Include the API Access Key in the X-API-Key header or the Authorization header with the Bearer scheme when making requests to the Proxy endpoint.
- Make a POST request to the Proxy endpoint with the desired payload:
Using the Authorization header:

curl https://llm.cast.ai/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "Authorization: Bearer $CASTAI_API_KEY" \
  -X POST -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "What kind of instance types to use in GCP for running an AI training model?"
      }
    ]
  }'
Or using the X-API-Key header:

curl https://llm.cast.ai/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "X-API-Key: $CASTAI_API_KEY" \
  -X POST -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "What kind of instance types to use in GCP for running an AI training model?"
      }
    ]
  }'
Modify the request payload as needed, following the OpenAI API Reference documentation.
Note
You can specify any model that you've registered, and Cast AI will route the request to the appropriate provider.
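For instance, if you registered the Gemini provider from the example above, the same request can target one of its models and the Proxy will forward it accordingly:

curl https://llm.cast.ai/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "Authorization: Bearer $CASTAI_API_KEY" \
  -X POST -d '{
    "model": "gemini-1.5-flash",
    "messages": [
      {
        "role": "user",
        "content": "What kind of instance types to use in GCP for running an AI training model?"
      }
    ]
  }'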
Supported endpoints
Different tools and integrations may require different base URLs for the AI Enabler Proxy. The default endpoint works with most standard OpenAI SDK implementations and tools like Azure Prompt Flow. Here's a table of known endpoint requirements:
| Tool/Integration | Base URL | Notes |
|---|---|---|
| Default | https://llm.cast.ai/openai/v1/chat/completions | Use for OpenAI SDK, Azure Prompt Flow, and similar tools |
| LangChain | https://llm.cast.ai/openai/v1 | Required for LangChain integration |
| MemGPT | https://llm.cast.ai/openai | Required for MemGPT integration |
If you use a tool or SDK not listed here and encounter connectivity issues, try the default endpoint first. For tools requiring a different endpoint configuration, contact our team on the Slack community channel or Cast AI support.
We regularly update this list as we verify endpoint requirements for different tools and SDKs.
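As a rough sketch, a tool that reads the standard OpenAI environment variables (the official OpenAI SDKs do; whether your specific tool honors them is something to verify) can be pointed at the Proxy by overriding the base URL and API key. Which base URL to use depends on how the tool builds request paths, so check it against the table above:

# Assumption: the tool reads OPENAI_BASE_URL and OPENAI_API_KEY.
# The base URL below suits tools that append /chat/completions themselves;
# adjust it per the table above if your tool expects a different one.
export OPENAI_BASE_URL="https://llm.cast.ai/openai/v1"
export OPENAI_API_KEY="$CASTAI_API_KEY"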
Run the AI Enabler Proxy in-cluster
If you prefer to run the AI Enabler Proxy in your own Kubernetes cluster, follow these steps:
- Install the AI Enabler Proxy using Helm:
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
helm upgrade --install castai-ai-optimizer-proxy castai-helm/castai-ai-optimizer-proxy \
-n castai-agent --create-namespace \
--set castai.apiKey=<CASTAI_API_KEY>,castai.clusterID=<CLUSTER_ID>,castai.apiURL=https://api.cast.ai
Replace <CASTAI_API_KEY> with your actual Cast AI API key and <CLUSTER_ID> with the ID of your Kubernetes cluster.
- Make requests to the in-cluster Proxy endpoint. The requests are the same as before, except that you no longer need to provide an authorization header with the Cast AI API key. If you have a pod running in the same cluster, you can access the Proxy like so:
curl http://castai-ai-optimizer-proxy.castai-agent.svc.cluster.local:443/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -v -X POST -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "How to use golang generics?"
      }
    ]
  }'
Ensure you have registered the providers and adjusted the proxy settings on the Cast AI platform as described in prior sections.
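To sanity-check the deployment before sending traffic, you can confirm that the proxy pod and service are up, and optionally port-forward the service for a quick test from your workstation. This sketch assumes the default namespace and release name used in the Helm command above:

# Check that the proxy pod and service created by the Helm release are running.
kubectl get pods -n castai-agent | grep ai-optimizer-proxy
kubectl get svc -n castai-agent castai-ai-optimizer-proxy

# Optionally forward the service locally and send a test request.
kubectl port-forward -n castai-agent svc/castai-ai-optimizer-proxy 8080:443
# In another terminal:
curl http://localhost:8080/openai/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -X POST -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]}'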
Viewing your generative AI savings report
After setting up the AI Enabler Proxy, you'll want to see how much you're saving by using Cast AI's intelligent routing. The generative AI savings report becomes available once you make requests through the proxy.
Requirements for the savings report
To see your savings data:
- Ensure you have properly registered your LLM providers
- Make at least a few successful requests through the proxy
- Wait a short time for the data to be processed (usually just a few minutes)
The report will automatically appear in your Cast AI console once there is actual usage data to analyze. This helps ensure the savings calculations are based on real traffic patterns rather than estimates.