Rate limits
Rate limits for AI Enabler Model APIs serverless endpoints, including free tier limits, paid plan limits, and best practices for managing API request volume.
Rate limits control how many requests you can make to AI Enabler Model APIs within a given time period. These limits help maintain service stability and ensure fair access for all users.
Cast AI enforces two types of rate limits on serverless inference endpoints:
- Requests per minute (RPM) limits the number of API calls you can make each minute.
- Tokens per minute (TPM) limits the total number of input and output tokens processed each minute.
If you exceed either limit, the API returns an HTTP status code 429 Too Many Requests. Your application should implement retry logic with exponential backoff to handle rate limit responses gracefully.
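As a minimal sketch of that backoff strategy (the helper below is illustrative, not part of the API):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: the delay window doubles each
    attempt (1s, 2s, 4s, ...) up to a cap, with a random draw inside the
    window to avoid synchronized retries across clients."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

A retry loop would sleep for `backoff_delay(attempt)` after each 429 response before resending the request.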
Rate limits by plan
Rate limits vary by pricing plan and model size. Models are grouped into three categories based on parameter count.
| Plan | Model/Category | RPM | TPM |
|---|---|---|---|
| Free | Qwen 3 Coder Next | 50 | 300,000 |
| Starter | Small (≤8B) | 180 | 120,000 |
| Starter | Medium (8B–35B) | 90 | 60,000 |
| Starter | Large (>35B) | 45 | 30,000 |
| Growth | Small (≤8B) | 600 | 400,000 |
| Growth | Medium (8B–35B) | 300 | 200,000 |
| Growth | Large (>35B) | 150 | 100,000 |
| Enterprise | All | Custom | Custom |
The free tier requires no credit card. You can try all supported models at no cost within the rate limits above.
Model size categories
Rate limits are applied based on model size rather than per individual model. Larger models require more compute resources per request, so they have lower rate limits than smaller models.
| Category | Parameter range | Example models |
|---|---|---|
| Small | ≤8B parameters | Mistral 7B, Gemma 3 4B, Qwen 2.5 Coder 3B |
| Medium | 8B–35B parameters | Qwen 3 32B, Qwen 3 Coder Next, Gemma 3 27B |
| Large | >35B parameters | Llama 3.3 70B, GPT-OSS 120B |
When you make a request to a model, the rate limit for that model's size category is applied. For example, requests to Llama 3.3 70B FP8 count against your Large model limits.
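The category mapping can be expressed as a small client-side lookup. This is a hypothetical helper, not part of the API; the thresholds and the Growth-plan figures come from the tables on this page:

```python
# Growth-plan limits per size category, from the rate limits table: (RPM, TPM).
GROWTH_LIMITS = {
    "small": (600, 400_000),
    "medium": (300, 200_000),
    "large": (150, 100_000),
}

def size_category(params_billions: float) -> str:
    """Map a model's parameter count (in billions) to its rate limit category."""
    if params_billions <= 8:
        return "small"
    if params_billions <= 35:
        return "medium"
    return "large"
```

For example, `size_category(70)` returns `"large"`, so a request to Llama 3.3 70B on the Growth plan counts against the 150 RPM / 100,000 TPM budget.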
Rate limit responses
When you exceed your rate limit, the API returns HTTP status code 429 with an error message in the response body, for example:
{"error": "gpt-4o-mini model is rate limited until 2026-02-05T15:32:41Z"}
The response also includes a Retry-After header indicating how many seconds to wait before retrying:
Retry-After: 5
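A client can prefer the Retry-After header when it is present and fall back to exponential backoff when it is not. The sketch below is illustrative; the `send_request` callable stands in for your actual HTTP call:

```python
import time

def call_with_retries(send_request, max_attempts: int = 5):
    """Retry on HTTP 429, waiting Retry-After seconds when the server
    provides the header, otherwise using capped exponential backoff.

    `send_request` is any zero-argument callable returning an object with
    `.status_code` and `.headers`, e.g. a `requests` call wrapped in a lambda.
    """
    for attempt in range(max_attempts):
        response = send_request()
        if response.status_code != 429:
            return response
        retry_after = response.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else min(60.0, 2.0 ** attempt)
        time.sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```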
Note: If you have multiple providers configured for a model, AI Enabler automatically attempts fallback to other available providers before returning a rate limit error.
Best practices
Respect the Retry-After header. When you receive a 429 response, wait the number of seconds specified in the Retry-After header before retrying.
Monitor your usage. Track your request volume and token consumption in the Analytics page for AI Enabler to better understand your usage patterns and plan capacity accordingly.
Use appropriate model sizes. Smaller models have higher rate limits. Choose the smallest model that meets your quality requirements for each use case.
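One way to act on these practices client-side is to pace requests evenly so you never approach your RPM budget in the first place. A minimal sketch (single-process only; the class name and interface are hypothetical):

```python
import threading
import time

class RequestPacer:
    """Spaces outgoing requests evenly so a client stays under an RPM budget."""

    def __init__(self, rpm: int):
        self.interval = 60.0 / rpm          # minimum seconds between requests
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        """Block until the next request fits within the RPM budget."""
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.interval
        if delay:
            time.sleep(delay)
```

With `RequestPacer(rpm=150)` for a Large model on the Growth plan, calling `wait()` before each request keeps you at or below one request every 0.4 seconds.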
Upgrading your plan
When you need higher rate limits, you can upgrade your plan:
1. Navigate to AI Enabler > Settings > Pricing in the Cast AI console.
2. Select your desired plan.
3. Click Upgrade.
4. Enter your payment information.
Rate limit increases take effect immediately after upgrading. For Enterprise plans with custom rate limits, contact our sales team.
Note: If you're building a high-volume application and need rate limits beyond what's available in standard plans, reach out to discuss Enterprise options before you hit production.