# Dashboard token usage
AI tools like IDEs and CLIs make additional requests on your behalf. Learn why the Cast AI dashboard may show higher token usage than the tool itself reports.
The Cast AI Analytics Dashboard tracks token usage for all requests sent to https://llm.cast.ai/openai/v1. Token counts (prompt_tokens, completion_tokens, total_tokens) are read directly from each provider's API response — Cast AI does not recalculate them. For models hosted by Cast AI, these counts come from vLLM's inference engine.
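For example, in an OpenAI-compatible response body the counts sit in a top-level `usage` object. A minimal sketch of reading them (the response below is trimmed and the values are illustrative):

```python
import json

# A trimmed OpenAI-compatible chat completions response (illustrative values).
response_body = """
{
  "choices": [{"message": {"role": "assistant", "content": "Once upon a time..."}}],
  "usage": {"prompt_tokens": 54, "completion_tokens": 5246, "total_tokens": 5300}
}
"""

# The dashboard reads these fields verbatim from the provider response;
# it does not recompute them.
usage = json.loads(response_body)["usage"]
print(usage["prompt_tokens"])      # 54
print(usage["completion_tokens"])  # 5246
print(usage["total_tokens"])       # 5300
```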
IDE and CLI tools — such as OpenCode — make additional requests you may not see in the tool's own status display. The extra requests and tokens typically come from:
- A system prompt injected before your message
- Tool definitions (function schemas)
- Requests the tool makes autonomously, such as generating a conversation title
The dashboard, therefore, shows more requests and higher token usage than the tool reports.
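To see where the extra input tokens come from, compare the payload a bare API call sends with what an IDE-style tool typically sends. This is a sketch: the system prompt and tool schema below are invented placeholders (real ones are far larger), and the ~4-characters-per-token ratio is only a rough rule of thumb for English text.

```python
import json

user_prompt = ("Generate a story. The story should be detailed "
               "and approximately 250 tokens long.")

# A bare API call sends just the user message.
bare_payload = {"messages": [{"role": "user", "content": user_prompt}]}

# An IDE/CLI tool typically also sends a system prompt and tool definitions
# (placeholder contents here; real ones are much larger).
tool_payload = {
    "messages": [
        {"role": "system", "content": "You are a coding agent. " * 50},
        {"role": "user", "content": user_prompt},
    ],
    "tools": [
        {"type": "function",
         "function": {"name": "read_file",
                      "description": "Read a file from disk",
                      "parameters": {"type": "object",
                                     "properties": {"path": {"type": "string"}}}}},
    ],
}

def rough_tokens(payload: dict) -> int:
    # Crude estimate: roughly 4 characters per token for English text.
    return len(json.dumps(payload)) // 4

print(rough_tokens(bare_payload), "vs", rough_tokens(tool_payload))
```

Even with placeholder content, the tool-style payload is several times larger before the model sees a single user word.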
## Example: cURL vs. OpenCode
The following example uses the same prompt sent via cURL and via OpenCode to illustrate the difference.
### cURL request
Replace $YOUR_CAST_AI_KEY with your actual Cast AI API key before running the command.
```shell
curl -k https://llm.cast.ai/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $YOUR_CAST_AI_KEY" \
  -d '{
    "model": "minimax-m2.5",
    "messages": [
      {"role": "user", "content": "Generate a story. The story should be detailed and approximately 250 tokens long."}
    ]
  }'
```

This produces a single request billed at 54 input tokens and 5,246 output tokens. The high output token count is expected: the model initially produced a larger response and self-corrected before returning the final result.
### OpenCode request
Running the same prompt in OpenCode (version 1.3.0):
```shell
opencode
# Switch model to minimax-m2.5
# Type: Generate a story. The story should be detailed and approximately 250 tokens long.
# Submit
```

OpenCode reports 2,339 tokens: 1,973 input and 366 output. However, OpenCode also makes a second request on your behalf in the background to generate a title for the conversation. That request is not shown in the tool's status display but is visible in the dashboard.
### Summary
| Tool | Requests | Input tokens | Output tokens |
|---|---|---|---|
| cURL | 1 | 54 | 5,246 |
| OpenCode | 2 | 2,513 | 559 |
The 2,513 input tokens billed for OpenCode may be surprising given that the user prompt itself was only 54 tokens; the rest comes from OpenCode's system prompt, tool definitions, and the background title request. The Cast AI dashboard shows all requests, including those the tool sends autonomously. Want to see exactly which HTTP requests your tool sends to the LLM? Continue to the next section.
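The dashboard totals in the table can be reconciled with what OpenCode itself displayed: subtracting the reported counts from the billed counts leaves the cost of the background title request. A quick check of the arithmetic:

```python
# Dashboard totals for the OpenCode session (from the table above).
dashboard_input, dashboard_output = 2_513, 559

# What OpenCode itself reported for the visible request.
reported_input, reported_output = 1_973, 366

# The remainder is the background title-generation request.
title_input = dashboard_input - reported_input
title_output = dashboard_output - reported_output
print(title_input, title_output)  # 540 193
```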
## Inspect traffic from your tool
To see exactly what requests your tool sends to the API, use an HTTP/HTTPS proxy. The steps below use mitmproxy on macOS.
### Prerequisites
- macOS (the steps are easily portable to Linux — only the mitmproxy installation and certificate trust commands differ)
- Admin access — required to install a self-signed CA certificate into the system keychain
- OpenCode configured following the OpenCode setup tutorial
### Steps
1. Install mitmproxy

```shell
brew install mitmproxy
```

2. Start mitmproxy

```shell
mitmweb --listen-port 8888
```

The proxy listens on port 8888; mitmweb also opens a web UI at http://localhost:8081 where you can inspect all traffic.
3. Install the CA certificate

mitmproxy generates a CA certificate on first run at ~/.mitmproxy/. Trust it system-wide:

```shell
sudo security add-trusted-cert -d -r trustRoot \
  -k /Library/Keychains/System.keychain \
  ~/.mitmproxy/mitmproxy-ca-cert.pem
```

4. Run OpenCode through the proxy
```shell
HTTP_PROXY=http://localhost:8888 HTTPS_PROXY=http://localhost:8888 \
NODE_TLS_REJECT_UNAUTHORIZED=0 opencode
```

`NODE_TLS_REJECT_UNAUTHORIZED=0` is required because the Node.js AI SDK may not trust the mitmproxy certificate by default.
5. Inspect traffic
Open http://localhost:8081 in your browser. You will see every request the tool makes, including the full request body (system prompt, tool definitions, messages) and the response with token usage.
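If you only care about token usage, you can also attach a small mitmproxy addon script instead of reading flows in the web UI. This is a sketch: the filename `log_usage.py` is an assumption, and it only handles non-streaming responses (a `stream: true` request returns SSE chunks, which will not parse as a single JSON object).

```python
import json

def extract_usage(body: str):
    """Return the `usage` dict from an OpenAI-compatible response body, or None."""
    try:
        return json.loads(body).get("usage")
    except (json.JSONDecodeError, AttributeError):
        return None

# mitmproxy addon hook: called once per completed response.
# Run with: mitmdump -s log_usage.py --listen-port 8888
def response(flow):
    if "llm.cast.ai" not in flow.request.pretty_host:
        return
    usage = extract_usage(flow.response.get_text() or "")
    if usage:
        print(flow.request.path, usage)
```

With OpenCode pointed at the proxy as in step 4, each request the tool makes (including the background title request) prints one line with its token usage.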
6. Remove the CA certificate when done

```shell
sudo security remove-trusted-cert -d ~/.mitmproxy/mitmproxy-ca-cert.pem
```
