Reliability Metrics Reference
castctl flags, auto-sizing behavior, and what gets deployed in your cluster when you enable Reliability Metrics.
Early Access Feature
This feature is in early access. It may undergo changes based on user feedback and continued development. We recommend testing in non-production environments first and welcome your feedback to help us improve.
This page documents the castctl flags available for Reliability Metrics, explains how castctl automatically sizes components for your cluster, and describes what gets provisioned in the cluster during installation. For an overview of the feature and step-by-step installation instructions, see Reliability Metrics.
Public beta — castctl only
In the current public beta, Reliability Metrics is installed and managed exclusively through castctl. Helm chart configuration and GitOps installation methods are in development and will be documented here when available.
castctl cluster connect flags
These flags are used with castctl cluster connect to install and configure Reliability Metrics.
| Flag | Description |
|---|---|
| --reliability-metrics | Enable Reliability Metrics during cluster connect. Required to install the feature. |
| --non-interactive | Skip all interactive prompts. Use this flag in CI/CD pipelines where there is no terminal. |
| --cluster-name <name> | Set the display name for the cluster being connected. |
| --dry-run | Print what would be installed without making any changes to the cluster. Useful for reviewing the install plan before committing. |
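As an illustration of how --dry-run and --non-interactive might be chained in a pipeline step, the sketch below previews the install plan and only then applies it. It uses only the flags listed in the table above. The function name connect_with_preview and the CASTCTL override variable are hypothetical, and the sketch assumes that a failed dry run exits with a non-zero status, which this page does not state.

```shell
#!/bin/sh
# Hypothetical override so the script can be exercised with a stub binary.
CASTCTL="${CASTCTL:-castctl}"

# Preview the install plan, then run the real connect if the preview succeeded.
connect_with_preview() {
  cluster_name="$1"

  # Preview only: --dry-run makes no changes to the cluster.
  "$CASTCTL" cluster connect \
    --non-interactive \
    --cluster-name "$cluster_name" \
    --reliability-metrics \
    --dry-run || return 1

  # Assumption: a failed dry run exits non-zero, so this command is only
  # reached when the preview completed without error.
  "$CASTCTL" cluster connect \
    --non-interactive \
    --cluster-name "$cluster_name" \
    --reliability-metrics
}

# Example (not executed here): connect_with_preview production-us-east
```

In a pipeline, the dry-run output could also be saved as a build artifact so the plan can be reviewed after the fact.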
These flags can be combined with flags for other Cast AI features in a single command:

    castctl cluster connect \
      --non-interactive \
      --cluster-name production-us-east \
      --reliability-metrics \
      --cluster-optimization \
      --workload-autoscaler

castware upgrade flags
Use castctl castware upgrade to upgrade Reliability Metrics and all other Cast AI components in your cluster.
| Flag | Description |
|---|---|
| --non-interactive | Skip confirmation prompts. Use in CI/CD pipelines. |

    castctl castware upgrade --non-interactive

Upgrading with castware preserves your existing configuration and applies any new data migrations automatically when the collector restarts.
How castctl sizes components
When you run castctl cluster connect --reliability-metrics, castctl inspects your cluster before installing anything and automatically determines the right resource allocation for two components: the collector agent that runs on each node, and the metrics database.
You do not need to specify sizing manually. The selections below describe what castctl chooses and why.
Collector agent sizing
The collector agent runs as a DaemonSet — one instance per node — and observes HTTP and gRPC traffic on that node. castctl counts the maximum number of containers running on any single node and selects a sizing profile accordingly.
| Max containers on any node | Selected profile | CPU request | CPU limit | Memory request | Memory limit |
|---|---|---|---|---|---|
| Up to 5 | small | 50m | 200m | 128 MiB | 256 MiB |
| Up to 15 | medium | 100m | 400m | 256 MiB | 512 MiB |
| Up to 30 | large | 200m | 800m | 512 MiB | 1 GiB |
| More than 30 | xlarge | 400m | 1600m | 1 GiB | 2 GiB |
These profiles reflect expected traffic volume per node at each workload density level. A node running many containers typically serves more concurrent requests and needs more resources to observe them accurately.
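The lookup in the table can be expressed as a small shell sketch, which may help when estimating ahead of time which profile a cluster is likely to receive. This is not castctl's implementation. The function names are hypothetical, the thresholds are treated as inclusive upper bounds ("Up to 5" means 5 or fewer), and memory values use Kubernetes quantity notation (Mi, Gi).

```shell
#!/bin/sh
# Map "max containers on any single node" to a profile name,
# using the thresholds from the table (assumed inclusive).
select_collector_profile() {
  max_containers="$1"
  if [ "$max_containers" -le 5 ]; then
    echo "small"
  elif [ "$max_containers" -le 15 ]; then
    echo "medium"
  elif [ "$max_containers" -le 30 ]; then
    echo "large"
  else
    echo "xlarge"
  fi
}

# Print the documented request/limit pairs for a profile name.
collector_resources() {
  case "$1" in
    small)  echo "cpu=50m/200m memory=128Mi/256Mi" ;;
    medium) echo "cpu=100m/400m memory=256Mi/512Mi" ;;
    large)  echo "cpu=200m/800m memory=512Mi/1Gi" ;;
    xlarge) echo "cpu=400m/1600m memory=1Gi/2Gi" ;;
    *)      echo "unknown profile: $1" >&2; return 1 ;;
  esac
}

# Read lines of the form "<node> <container> <container> ..." on stdin
# and print the largest per-node container count.
max_containers_per_node() {
  awk '{ n[$1] += NF - 1 } END { for (k in n) if (n[k] > m) m = n[k]; print m + 0 }'
}

# One possible way to produce that input with kubectl (not executed here):
#   kubectl get pods --all-namespaces --field-selector=status.phase=Running \
#     -o jsonpath='{range .items[*]}{.spec.nodeName}{" "}{.spec.containers[*].name}{"\n"}{end}' \
#     | max_containers_per_node
```

The kubectl query counts regular containers in running pods only. Whether castctl counts containers the same way (for example, whether init containers are included) is not documented here, so treat the result as an estimate.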
Metrics database sizing
castctl also provisions a ClickHouse time-series database in your cluster for temporary metric storage. It sizes the database by summing your cluster's total allocatable CPU across all nodes.
| Total cluster allocatable CPU | CPU request | Memory request | Memory limit |
|---|---|---|---|
| Less than 100 cores | 500m | 1 GiB | 2 GiB |
| Less than 500 cores | 1 core | 2 GiB | 4 GiB |
| Less than 1,000 cores | 2 cores | 4 GiB | 8 GiB |
| Less than 5,000 cores | 2 cores | 8 GiB | 16 GiB |
| Less than 10,000 cores | 4 cores | 16 GiB | 32 GiB |
| 10,000 cores or more | 4 cores | 32 GiB | 64 GiB |
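The tiers above can be read as a chain of "less than" comparisons. The following sketch shows that reading together with one way to total allocatable CPU by hand. The function names are hypothetical, and truncating the total to whole cores is an assumption. castctl's own rounding is not documented here.

```shell
#!/bin/sh
# Print "cpu=<request> memory=<request>/<limit>" for a whole-core total,
# following the tiers in the table.
select_database_size() {
  total_cores="$1"
  if [ "$total_cores" -lt 100 ]; then
    echo "cpu=500m memory=1Gi/2Gi"
  elif [ "$total_cores" -lt 500 ]; then
    echo "cpu=1 memory=2Gi/4Gi"
  elif [ "$total_cores" -lt 1000 ]; then
    echo "cpu=2 memory=4Gi/8Gi"
  elif [ "$total_cores" -lt 5000 ]; then
    echo "cpu=2 memory=8Gi/16Gi"
  elif [ "$total_cores" -lt 10000 ]; then
    echo "cpu=4 memory=16Gi/32Gi"
  else
    echo "cpu=4 memory=32Gi/64Gi"
  fi
}

# Read one Kubernetes CPU quantity per line on stdin ("4" or "3920m")
# and print the sum, truncated to whole cores (assumption).
sum_allocatable_cores() {
  awk '{
    v = $1
    if (v ~ /m$/) { sub(/m$/, "", v); total += v / 1000 }
    else          { total += v }
  } END { printf "%d\n", total }'
}

# One possible way to feed it from a live cluster (not executed here):
#   kubectl get nodes \
#     -o jsonpath='{range .items[*]}{.status.allocatable.cpu}{"\n"}{end}' \
#     | sum_allocatable_cores
```

For example, a cluster with 120 allocatable cores falls into the second tier, so `select_database_size 120` prints the 1-core, 2 GiB request, 4 GiB limit row.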
The default persistent volume size for the metrics database is 10 GiB. Data collected in the cluster is retained for up to 7 days before being purged — it is exported to Cast AI continuously and only needs to be stored locally for a short window.
ClickHouse operator detection
castctl checks whether a ClickHouse operator (Altinity) is already running in your cluster before installing. If one is found, castctl skips the operator installation and reuses the existing one. This prevents conflicts on clusters that already use ClickHouse for other workloads.
If your cluster has restricted Custom Resource Definition permissions and the operator cannot be installed automatically, castctl will surface an error. In this case, install the Altinity ClickHouse Operator CRDs manually first, then re-run castctl cluster connect --reliability-metrics.
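Before running the connect command, it may be useful to check by hand which of these cases applies. The sketch below uses the presence of the Altinity ClickHouseInstallation CRD as a proxy for "an operator is installed" and asks the API server whether the current identity may create CRDs. How castctl itself performs detection is not documented here. The function names and the KUBECTL override variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical override so the checks can be exercised with a stub binary.
KUBECTL="${KUBECTL:-kubectl}"

# Succeeds if the Altinity ClickHouseInstallation CRD is registered.
altinity_crd_present() {
  "$KUBECTL" get crd clickhouseinstallations.clickhouse.altinity.com >/dev/null 2>&1
}

# Succeeds if the current identity is allowed to create CRDs.
can_create_crds() {
  "$KUBECTL" auth can-i create customresourcedefinitions.apiextensions.k8s.io >/dev/null 2>&1
}

# Summarize which situation the cluster is in.
preflight_report() {
  if altinity_crd_present; then
    echo "operator CRD found"
  elif can_create_crds; then
    echo "operator CRD missing; CRD creation permitted"
  else
    echo "operator CRD missing; install CRDs manually first"
  fi
}

# Example (not executed here): preflight_report
```

A registered CRD does not guarantee that an operator pod is running, so a "found" result is a hint rather than a confirmation.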
What gets installed in your cluster
When Reliability Metrics is enabled, castctl installs the following components in the castai-agent namespace:
| Component | Type | Role |
|---|---|---|
| Collector agent | DaemonSet — one pod per node | Observes HTTP and gRPC network traffic and sends measurements to the metrics database |
| Metrics database | StatefulSet (ClickHouse) | Stores and aggregates traffic measurements temporarily before they are exported to Cast AI |
| Exporter | Deployment | Reads aggregated data from the metrics database and streams it to Cast AI continuously |
| ClickHouse Operator | Deployment (cluster-scoped) | Manages the lifecycle of the ClickHouse StatefulSet. Skipped if an operator is already present. |
All components are managed through castctl, including upgrades with castctl castware upgrade. You do not need to interact with them directly under normal operation.
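To confirm what was installed, the workload types from the table can be listed with kubectl. This is a minimal read-only sketch. The function name and the KUBECTL override variable are hypothetical, and the actual resource names in your cluster are not listed on this page.

```shell
#!/bin/sh
# Hypothetical override so the command can be exercised with a stub binary.
KUBECTL="${KUBECTL:-kubectl}"

# List the DaemonSets, StatefulSets, and Deployments in the castai-agent namespace.
list_reliability_workloads() {
  "$KUBECTL" get daemonsets,statefulsets,deployments --namespace castai-agent
}

# Example (not executed here): list_reliability_workloads
```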
Default monitored ports
The collector observes traffic on these ports by default:
| Port | Common use |
|---|---|
| 8080 | HTTP application traffic |
| 8443 | HTTPS application traffic |
| 8090 | Common alternative HTTP port |
| 6379 | Redis |
Services on other ports
Traffic on ports not in this list is not collected. Configuring additional ports requires Helm support, which is coming in a future release. If your services run exclusively on other ports, contact Cast AI support.
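To find out whether any of your services fall into this category, a check along the following lines may help. It is a sketch under assumptions: this page does not state whether the collector matches container ports or Service ports, so the example query uses Service target ports, and named target ports are reported as not covered. The function names are hypothetical.

```shell
#!/bin/sh
# Default monitored ports, copied from the table.
DEFAULT_PORTS="8080 8443 8090 6379"

# Succeeds if the given port is one of the default monitored ports.
is_default_monitored_port() {
  for p in $DEFAULT_PORTS; do
    if [ "$p" = "$1" ]; then
      return 0
    fi
  done
  return 1
}

# Read lines of the form "<namespace>/<service> <port> <port> ..." on stdin
# and print the services that have no port in the default list.
unmonitored_services() {
  while read -r svc ports; do
    covered=no
    for port in $ports; do
      if is_default_monitored_port "$port"; then
        covered=yes
      fi
    done
    if [ "$covered" = no ]; then
      echo "$svc"
    fi
  done
}

# One possible way to produce that input with kubectl (not executed here):
#   kubectl get services --all-namespaces \
#     -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{" "}{.spec.ports[*].targetPort}{"\n"}{end}' \
#     | unmonitored_services
```

Services printed by unmonitored_services are candidates to raise with Cast AI support, since none of their ports are collected by default.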
Coming soon: Helm configuration
Helm chart and GitOps installation support is planned for a future release. When available, it will allow you to configure:
- Additional monitored ports
- Collector resource sizing overrides
- Namespace and workload exclusions
- Metrics database storage class and volume size
- Collector resource autoscaling via Vertical Pod Autoscaler
This page will be updated with the full Helm values reference when Helm installation is released.
