Getting started

The LLM Proxy routes each request to a Large Language Model (LLM) provider chosen based on the request's complexity and the associated cost. This guide explains how to configure and use the LLM Proxy.

You can run the LLM Proxy in your Kubernetes cluster or use the one on the CAST AI platform. In both cases, the Proxy expects the request to follow the OpenAI API contract described in the OpenAI API Reference documentation. The response will also follow the OpenAI API contract.

The only supported endpoint is /openai/v1/chat/completions, which mirrors OpenAI's /v1/chat/completions endpoint. Streaming is not currently supported.
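
Because only one endpoint is supported and streaming is not, a client can reject unsupported requests before sending them. The following is a minimal sketch in Python. The helper name check_supported is hypothetical and not part of any CAST AI SDK, and it assumes that streaming is requested through the stream field of the OpenAI chat completions contract.

```python
SUPPORTED_PATH = "/openai/v1/chat/completions"  # the only endpoint the Proxy exposes


def check_supported(path: str, payload: dict) -> None:
    """Raise ValueError for requests the Proxy cannot serve (hypothetical helper)."""
    if path != SUPPORTED_PATH:
        raise ValueError(f"unsupported endpoint: {path}")
    # "stream" is the OpenAI chat completions flag that requests streaming.
    if payload.get("stream"):
        raise ValueError("streaming is not currently supported by the Proxy")


# A plain, non-streaming chat completion passes without raising.
check_supported(SUPPORTED_PATH, {"model": "gpt-4", "messages": []})
```

Running a check like this on the client side only saves a round trip; the Proxy remains the authority on what it accepts.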

Supported providers

You can find the list of supported LLM providers and their supported models here. CAST AI can proxy requests to any provider and model combination from this list once they are registered.

Register LLM providers

To enable the LLM Proxy to route your requests to the appropriate LLM provider, you must register the providers you want to use (e.g., OpenAI, Gemini, Groq, Azure).

To register LLM providers, make a POST request to the CAST AI API endpoint shown below. The example registers two OpenAI providers, an Azure OpenAI provider, a Google Gemini API provider, and a Google Cloud Vertex AI Gemini provider, specifying authentication, available models, and provider-specific parameters for each.

curl https://api.cast.ai/v1/llm/providers \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -H "X-API-Key: $CASTAI_API_KEY" \
  -X POST -d '{
  "providers": [
    {
      "name": "openai-gpt3.5",
      "supportedProvider": "OPENAI",
      "apiKey": "<openai-api-key-1>",
      "models": ["gpt-3.5-turbo-0125"]
    },
    {
      "name": "openai-gpt4+",
      "supportedProvider": "OPENAI",
      "apiKey": "<openai-api-key-2>",
      "models": ["gpt-4o-2024-05-13", "gpt-4-0613"]
    },
    {
      "name": "azure-provider",
      "supportedProvider": "AZURE",
      "url": "https://something-azure-openai.openai.azure.com",
      "apiKey": "<azure-api-key>",
      "apiVersion": "2024-02-01",
      "models": ["gpt-3.5-turbo-0125", "gpt-3.5-turbo-0301", "gpt-4o"],
      "isHosted": true
    },
    {
      "name": "gemini-api-provider",
      "supportedProvider": "GEMINI",
      "apiKey": "<gemini-api-key>",
      "models": ["gemini-1.5-flash", "gemini-1.5-pro"]
    },
    {
      "name": "vertex-ai-gemini-provider",
      "supportedProvider": "VERTEXAIGEMINI",
      "apiKey": "<gcloud-access-token>",
      "models": ["gemini-1.5-flash", "gemini-1.5-pro"],
      "url": "https://us-central1-aiplatform.googleapis.com/v1/projects/some-project/locations/us-central1",
      "isHosted": true
    }
  ]
}'
  1. Replace $CASTAI_API_KEY with your actual CAST AI API key, and each provider key placeholder (for example, <openai-api-key-1>) with the API key for the provider you are registering.
  2. Set the supportedProvider field to match the provider you are registering.
  3. Specify the models you want to use for each provider in the models array.
  4. The isHosted field specifies whether the LLM provider is hosted on the user side. Hosted providers are picked over non-hosted ones.

Note that you may register a single Provider multiple times. For instance, you can have an OpenAI Provider per OpenAI API Key to limit the models that can be used by each API Key.
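
The registration call can also be scripted. The sketch below builds the same POST request in Python using only the standard library, with one OpenAI provider entry per OpenAI API key. The function names openai_provider and registration_request are hypothetical helpers, the endpoint and field names are copied from the curl example, and the request is built but not sent.

```python
import json
import urllib.request

PROVIDERS_URL = "https://api.cast.ai/v1/llm/providers"  # endpoint from the curl example


def openai_provider(name: str, api_key: str, models: list) -> dict:
    """One entry for the "providers" array, using the field names shown above."""
    return {
        "name": name,
        "supportedProvider": "OPENAI",
        "apiKey": api_key,
        "models": models,
    }


def registration_request(castai_api_key: str, providers: list) -> urllib.request.Request:
    """Build (but do not send) the POST request that registers the providers."""
    body = json.dumps({"providers": providers}).encode("utf-8")
    return urllib.request.Request(
        PROVIDERS_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "X-API-Key": castai_api_key,
        },
    )


# Two registrations of the same provider type, one per OpenAI API key,
# so that each key is limited to its own set of models.
providers = [
    openai_provider("openai-gpt3.5", "<openai-api-key-1>", ["gpt-3.5-turbo-0125"]),
    openai_provider("openai-gpt4+", "<openai-api-key-2>", ["gpt-4o-2024-05-13", "gpt-4-0613"]),
]
request = registration_request("<castai-api-key>", providers)
# To actually send it: urllib.request.urlopen(request)
```

Serializing the payload with json.dumps guarantees that the body is valid JSON, which is easy to get wrong when editing a long inline curl payload by hand.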

📘 Note

The Provider API Keys are not stored on the CAST AI side. They are securely stored in a Secret Vault and accessed only when proxying/routing requests. CAST AI stores only the last 4 characters of each used API Key for reporting purposes.

Configure the Proxy

To configure the Proxy's behavior, such as enabling request routing and prompt sharing, follow these steps:

  1. Make a PUT request to the CAST AI API endpoint for updating proxy settings:
curl https://api.cast.ai/v1/llm/settings \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H "X-API-Key: $CASTAI_API_KEY" \
-X PUT -d '{"promptSharingEnabled": true, "routingEnabled": true, "apiKey": "<cast-ai-api-key>"}'
  2. Set promptSharingEnabled to true to let CAST AI store your prompts so that you can provide feedback on prompt categorization and response quality. This feedback is used to improve the Proxy's decision-making.
  3. Set routingEnabled to true to enable request routing to the registered providers. If it is set to false, requests are only proxied to OpenAI; no other provider is supported for proxying.
  4. (Optional) Set apiKey to the CAST AI API Key that these settings should apply to. If apiKey is unset, the settings apply organization-wide.
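
The settings payload is small enough to build programmatically. The sketch below is one way to do so in Python. The helper names settings_body and settings_request are hypothetical, and the sketch assumes that leaving the apiKey field out of the payload is what "unset" means for organization-wide settings.

```python
import json
import urllib.request
from typing import Optional

SETTINGS_URL = "https://api.cast.ai/v1/llm/settings"  # endpoint from the curl example


def settings_body(prompt_sharing: bool, routing: bool,
                  scoped_api_key: Optional[str] = None) -> dict:
    """Build the settings payload; apiKey is included only when a key is given."""
    body = {"promptSharingEnabled": prompt_sharing, "routingEnabled": routing}
    if scoped_api_key is not None:
        # Assumption: omitting apiKey makes the settings organization-wide.
        body["apiKey"] = scoped_api_key
    return body


def settings_request(castai_api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) the PUT request that updates the proxy settings."""
    return urllib.request.Request(
        SETTINGS_URL,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "X-API-Key": castai_api_key,
        },
    )


org_wide = settings_request("<castai-api-key>", settings_body(True, True))
per_key = settings_request("<castai-api-key>", settings_body(True, True, "<cast-ai-api-key>"))
# To actually send one of them: urllib.request.urlopen(org_wide)
```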

Make requests to the Proxy

To start making requests to the LLM Proxy running on the CAST AI platform, follow these steps:

  1. Generate an API Access Key from your CAST AI account.
  2. Include the API Access Key in the X-API-Key header, or in the Authorization header using the Bearer scheme, when making requests to the Proxy endpoint.
  3. Make a POST request to the Proxy endpoint with the desired payload:
curl https://llm.cast.ai/openai/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H "Authorization: Bearer $CASTAI_API_KEY" \
-X POST -d '{
  "model": "gpt-4",
  "messages": [
    {
      "role": "user",
      "content": "What kind of instance types to use in GCP for running an AI training model?"
    }
  ]
}'

Modify the request payload as needed, following the OpenAI API Reference documentation.
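
Since both the request and the response follow the OpenAI contract, the same call is straightforward to script. The sketch below uses the Python standard library. The function names chat_request, first_reply, and ask are hypothetical, and it assumes the response carries the reply in choices[0].message.content, as OpenAI chat completion responses do.

```python
import json
import urllib.request

PROXY_URL = "https://llm.cast.ai/openai/v1/chat/completions"  # hosted Proxy endpoint


def chat_request(castai_api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a single-message chat completion request."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        PROXY_URL,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            # The X-API-Key header is the documented alternative to Bearer.
            "Authorization": f"Bearer {castai_api_key}",
        },
    )


def first_reply(response_body: dict) -> str:
    """Extract the assistant text from an OpenAI-style chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def ask(castai_api_key: str, model: str, prompt: str) -> str:
    """Send the request and return the first reply (performs a network call)."""
    with urllib.request.urlopen(chat_request(castai_api_key, model, prompt)) as response:
        return first_reply(json.load(response))
```

For example, ask("<castai-api-key>", "gpt-4", "How to use golang generics?") would return the reply text, provided the key is valid and a provider offering the requested model has been registered.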

📘 Note

You can specify any model that you've registered, and CAST AI will route the request to the appropriate provider.

Run the LLM Proxy in-cluster

If you prefer to run the LLM Proxy in your own Kubernetes cluster, follow these steps:

  1. Install the LLM Proxy using Helm:
helm repo add castai-helm https://castai.github.io/helm-charts
helm repo update
helm upgrade --install castai-ai-optimizer-proxy castai-helm/castai-ai-optimizer-proxy \
-n castai-agent --create-namespace \
--set castai.apiKey=<CASTAI_API_KEY>,castai.clusterID=<CLUSTER_ID>,castai.apiURL=https://api.cast.ai

Replace <CASTAI_API_KEY> with your actual CAST AI API key and <CLUSTER_ID> with the ID of your Kubernetes cluster.

  2. Make requests to the in-cluster Proxy endpoint. The requests are the same as for the hosted Proxy, except that you no longer need to provide an authorization header with the CAST AI API Key. From a pod running in the same cluster, you can access the Proxy like so:
curl http://castai-ai-optimizer-proxy.castai-agent.svc.cluster.local:443/openai/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-v -X POST -d '{
  "model": "gpt-4",
  "messages": [
    {
      "role": "user",
      "content": "How to use golang generics?"
    }
  ]
}'

Ensure you have registered the providers and adjusted the proxy settings on the CAST AI platform as described in prior sections.
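
The hosted and in-cluster Proxies differ only in the URL and in whether an API key header is sent, so a client can switch between them with a small helper. The sketch below is one way to express that in Python. The function name proxy_target is hypothetical, and the in-cluster URL assumes the release name and namespace used in the Helm command above.

```python
HOSTED_URL = "https://llm.cast.ai/openai/v1/chat/completions"
IN_CLUSTER_URL = (
    "http://castai-ai-optimizer-proxy.castai-agent.svc.cluster.local:443"
    "/openai/v1/chat/completions"
)


def proxy_target(in_cluster: bool, castai_api_key: str = "") -> tuple:
    """Return (url, headers) for the chosen Proxy deployment (hypothetical helper)."""
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    if in_cluster:
        # The in-cluster Proxy needs no CAST AI API key header.
        return IN_CLUSTER_URL, headers
    if not castai_api_key:
        raise ValueError("the hosted Proxy requires a CAST AI API key")
    headers["X-API-Key"] = castai_api_key
    return HOSTED_URL, headers


url, headers = proxy_target(in_cluster=True)
```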