Rate limits

This page describes rate limits for AI Enabler Model APIs serverless endpoints, including free tier limits, paid plan limits, and best practices for managing API request volume.

Rate limits control how many requests you can make to AI Enabler Model APIs within a given time period. These limits help maintain service stability and ensure fair access for all users.

Cast AI enforces two types of rate limits on serverless inference endpoints:

Requests per minute (RPM) limits the number of API calls you can make each minute.

Tokens per minute (TPM) limits the total number of input and output tokens processed each minute.

If you exceed either limit, the API returns an HTTP status code 429 Too Many Requests. Your application should implement retry logic with exponential backoff to handle rate limit responses gracefully.
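A minimal sketch of such retry logic is shown below. The function and parameter names are illustrative, and `send_request` stands in for whatever zero-argument callable issues your actual API request; adapt it to your HTTP client.

```python
import random
import time

def call_with_backoff(send_request, max_retries=5, base_delay=1.0):
    """Retry a request on HTTP 429 using exponential backoff with jitter.

    `send_request` is any zero-argument callable that returns an object
    with a `status_code` attribute (for example, a requests.Response).
    """
    for attempt in range(max_retries):
        response = send_request()
        if response.status_code != 429:
            return response
        # Back off 1x, 2x, 4x, ... the base delay, with jitter so that
        # many clients do not retry in lockstep.
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError("Rate limited: retry budget exhausted")
```

The jitter term spreads retries out over time, which avoids a burst of synchronized retries from many clients hitting the limit again at the same moment.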

Rate limits by plan

Rate limits vary by pricing plan and model size. Models are grouped into three categories based on parameter count.

| Plan | Model / Category | RPM | TPM |
| --- | --- | --- | --- |
| Free | Qwen 3 Coder Next | 50 | 300,000 |
| Starter | Small (≤8B) | 180 | 120,000 |
| Starter | Medium (8B–35B) | 90 | 60,000 |
| Starter | Large (>35B) | 45 | 30,000 |
| Growth | Small (≤8B) | 600 | 400,000 |
| Growth | Medium (8B–35B) | 300 | 200,000 |
| Growth | Large (>35B) | 150 | 100,000 |
| Enterprise | All | Custom | Custom |

The free tier requires no credit card. You can try all supported models at no cost within the rate limits above.

Model size categories

Rate limits are applied based on model size rather than per individual model. Larger models require more compute resources per request, so they have lower rate limits than smaller models.

| Category | Parameter range | Example models |
| --- | --- | --- |
| Small | ≤8B parameters | Mistral 7B, Gemma 3 4B, Qwen 2.5 Coder 3B |
| Medium | 8B–35B parameters | Qwen 3 32B, Qwen 3 Coder Next, Gemma 3 27B |
| Large | >35B parameters | Llama 3.3 70B, GPT-OSS 120B |

When you make a request to a model, the rate limit for that model's size category is applied. For example, requests to Llama 3.3 70B FP8 count against your Large model limits.
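The category boundaries above can be expressed as a simple lookup. This is an illustrative sketch, not part of any Cast AI SDK; the limit values are taken from the Growth row of the plan table.

```python
def size_category(params_billions: float) -> str:
    """Map a model's parameter count (in billions) to its size category."""
    if params_billions <= 8:
        return "Small"
    if params_billions <= 35:
        return "Medium"
    return "Large"

# Growth-plan limits from the table above, as (RPM, TPM) pairs.
GROWTH_LIMITS = {
    "Small": (600, 400_000),
    "Medium": (300, 200_000),
    "Large": (150, 100_000),
}
```

For example, `GROWTH_LIMITS[size_category(70)]` yields the Large-category limits that apply to Llama 3.3 70B requests.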

Rate limit responses

When you exceed your rate limit, the API returns HTTP status code 429 with a response body similar to the following:

{"error": "gpt-4o-mini model is rate limited until 2026-02-05T15:32:41Z"}

The response includes a Retry-After header indicating how many seconds to wait before retrying:

Retry-After: 5
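A small helper like the following can parse that header from a response's headers mapping. This is an illustrative sketch; note it handles only the delay-in-seconds form of Retry-After, not the HTTP-date form the standard also allows.

```python
def retry_after_seconds(headers, default=1.0):
    """Return the wait time from a Retry-After header, in seconds.

    Falls back to `default` when the header is missing or not a number.
    (The HTTP-date form of Retry-After is not handled in this sketch.)
    """
    value = headers.get("Retry-After")
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        return default
```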
📘 Note: If you have multiple providers configured for a model, AI Enabler automatically attempts fallback to other available providers before returning a rate limit error.

Best practices

Respect the Retry-After header. When you receive a 429 response, wait the number of seconds specified in the Retry-After header before retrying.

Monitor your usage. Track your request volume and token consumption on the AI Enabler Analytics page to understand your usage patterns and plan capacity accordingly.

Use appropriate model sizes. Smaller models have higher rate limits. Choose the smallest model that meets your quality requirements for each use case.
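Beyond reacting to 429 responses, you can throttle on the client side so you rarely hit the limit at all. The sketch below is a hypothetical sliding-window limiter for the RPM cap only; a production client would track TPM as well.

```python
import time
from collections import deque

class RequestRateLimiter:
    """Client-side sliding-window limiter for a requests-per-minute cap.

    Illustrative sketch; names and structure are assumptions, not a
    Cast AI SDK API. The injectable `clock` eases testing.
    """

    def __init__(self, rpm, clock=time.monotonic):
        self.rpm = rpm
        self.clock = clock
        self.timestamps = deque()  # send times within the last 60 s

    def acquire(self):
        """Return seconds to wait before sending (0.0 if a slot is free)."""
        now = self.clock()
        # Drop send times that have aged out of the one-minute window.
        while self.timestamps and now - self.timestamps[0] >= 60:
            self.timestamps.popleft()
        if len(self.timestamps) < self.rpm:
            self.timestamps.append(now)
            return 0.0
        # Window is full: wait until the oldest request ages out.
        return 60 - (now - self.timestamps[0])
```

Calling `acquire()` before each request and sleeping for the returned duration keeps the client under its RPM budget without relying on server-side rejections.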

Upgrading your plan

When you need higher rate limits, you can upgrade your plan:

  1. Navigate to AI Enabler > Settings > Pricing in the Cast AI console
  2. Select your desired plan
  3. Click Upgrade
  4. Enter your payment information

Rate limit increases take effect immediately after upgrading. For Enterprise plans with custom rate limits, contact our sales team.

📘 Note: If you're building a high-volume application and need rate limits beyond what's available in standard plans, reach out to discuss Enterprise options before you hit production.

See also