Reliability Metrics Reference

Early Access Feature
This feature is in early access. It may undergo changes based on user feedback and continued development. We recommend testing in non-production environments first and welcome your feedback to help us improve.

This page documents the castctl flags available for Reliability Metrics, explains how castctl automatically sizes components for your cluster, and describes what gets provisioned in the cluster during installation. For an overview of the feature and step-by-step installation instructions, see Reliability Metrics.

Public beta — castctl only

In the current public beta, Reliability Metrics is installed and managed exclusively through castctl. Helm chart configuration and GitOps installation methods are in development and will be documented here when available.

castctl cluster connect flags

These flags are used with castctl cluster connect to install and configure Reliability Metrics.

Flag	Description
`--reliability-metrics`	Enable Reliability Metrics during cluster connect. Required to install the feature.
`--non-interactive`	Skip all interactive prompts. Use this flag in CI/CD pipelines where there is no terminal.
`--cluster-name <name>`	Set the display name for the cluster being connected.
`--dry-run`	Print what would be installed without making any changes to the cluster. Useful for reviewing the install plan before committing.

These flags can be combined with flags for other Cast AI features in a single command:

castctl cluster connect \
  --non-interactive \
  --cluster-name production-us-east \
  --reliability-metrics \
  --cluster-optimization \
  --workload-autoscaler
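
The flags listed above could be combined into a preview-then-apply workflow. The following is a minimal sketch, not an official script: the function name `connect_with_preview` and its `apply` argument are hypothetical, and it assumes `--dry-run` can be used together with `--non-interactive`, which this page does not state explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (the function name is illustrative, not part of castctl).
# Always prints the install plan; connects only when the second argument is "apply".
connect_with_preview() {
  local cluster_name="$1" mode="${2:-preview}"

  # Step 1: show what would be installed without changing the cluster.
  castctl cluster connect \
    --non-interactive \
    --cluster-name "$cluster_name" \
    --reliability-metrics \
    --dry-run || return 1

  # Step 2 (opt-in): run the same command without --dry-run.
  if [ "$mode" = "apply" ]; then
    castctl cluster connect \
      --non-interactive \
      --cluster-name "$cluster_name" \
      --reliability-metrics
  fi
}
```

Calling `connect_with_preview production-us-east` only prints the plan. Adding `apply` as a second argument runs the real connect after the preview succeeds.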

castctl castware upgrade flags

Use castctl castware upgrade to upgrade Reliability Metrics and all other Cast AI components in your cluster.

Flag	Description
`--non-interactive`	Skip confirmation prompts. Use in CI/CD pipelines.

castctl castware upgrade --non-interactive

Upgrading with castctl castware upgrade preserves your existing configuration. Any new data migrations are applied automatically when the collector restarts.

How castctl sizes components

When you run castctl cluster connect --reliability-metrics, castctl inspects your cluster before installing anything and automatically determines the right resource allocation for two components: the collector agent that runs on each node, and the metrics database.

You do not need to specify sizing manually. The selections below describe what castctl chooses and why.

Collector agent sizing

The collector agent runs as a DaemonSet, with one instance per node, and observes HTTP and gRPC traffic on that node. castctl finds the node running the most containers and selects a sizing profile based on that container count.

Max containers on any node	Selected profile	CPU request	CPU limit	Memory request	Memory limit
Up to 5	`small`	50m	200m	128 MiB	256 MiB
Up to 15	`medium`	100m	400m	256 MiB	512 MiB
Up to 30	`large`	200m	800m	512 MiB	1 GiB
More than 30	`xlarge`	400m	1600m	1 GiB	2 GiB

These profiles reflect expected traffic volume per node at each workload density level. A node running many containers typically serves more concurrent requests and needs more resources to observe them accurately.
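
To estimate which profile a cluster would receive, the table above can be sketched as a small lookup. This is an illustration only, not castctl's actual implementation. The function names are hypothetical, "Up to N" is assumed to include N, and only regular containers of scheduled pods are counted. Whether castctl also counts init containers or non-running pods is not documented here.

```shell
#!/usr/bin/env bash
# Illustrative only: mirrors the sizing table above, not castctl's actual code.

# Reads "<nodeName><TAB><space-separated container names>" lines on stdin
# (one line per pod) and prints the highest container count on any node.
max_containers_per_node() {
  awk -F'\t' '
    $1 != "" { count[$1] += split($2, names, " ") }   # skip unscheduled pods
    END {
      max = 0
      for (node in count) if (count[node] > max) max = count[node]
      print max
    }'
}

# Maps a container count to a profile name, treating "Up to N" as inclusive.
select_collector_profile() {
  local max_containers="$1"
  if   [ "$max_containers" -le 5 ];  then echo small
  elif [ "$max_containers" -le 15 ]; then echo medium
  elif [ "$max_containers" -le 30 ]; then echo large
  else                                    echo xlarge
  fi
}

# Example (requires kubectl access to the cluster):
#   kubectl get pods --all-namespaces \
#     -o jsonpath='{range .items[*]}{.spec.nodeName}{"\t"}{.spec.containers[*].name}{"\n"}{end}' \
#     | max_containers_per_node
```

Passing the resulting number to `select_collector_profile` gives the profile name from the table, for example `select_collector_profile 12` for a node with 12 containers.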

Metrics database sizing

castctl also provisions a ClickHouse time-series database in your cluster for temporary metric storage. It sizes the database based on the cluster's total allocatable CPU, summed across all nodes.

Total cluster allocatable CPU	CPU request	Memory request	Memory limit
Less than 100 cores	500m	1 GiB	2 GiB
At least 100, less than 500 cores	1 core	2 GiB	4 GiB
At least 500, less than 1,000 cores	2 cores	4 GiB	8 GiB
At least 1,000, less than 5,000 cores	2 cores	8 GiB	16 GiB
At least 5,000, less than 10,000 cores	4 cores	16 GiB	32 GiB
10,000 cores or more	4 cores	32 GiB	64 GiB

The default persistent volume size for the metrics database is 10 GiB. Data collected in the cluster is retained locally for up to 7 days and then purged. Because the data is exported to Cast AI continuously, it only needs to be stored in the cluster for a short window.
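
The database sizing table can be approximated the same way. The sketch below is an assumption-laden illustration rather than castctl's real logic. The function names are hypothetical, CPU quantities are assumed to be either plain core counts or millicore values ending in `m`, and the tiers are read top to bottom so that each "less than" bound is exclusive.

```shell
#!/usr/bin/env bash
# Illustrative only: mirrors the table above, not castctl's actual code.

# Reads one Kubernetes CPU quantity per line on stdin ("4", "0.5", "3920m")
# and prints the total in millicores.
sum_allocatable_millicores() {
  awk '
    /m$/ { sub(/m$/, ""); total += $1; next }   # value is already in millicores
    NF   { total += $1 * 1000 }                 # whole or fractional cores
    END  { printf "%.0f\n", total }'
}

# Maps total millicores to "<cpu request> <memory request> <memory limit>".
select_database_tier() {
  local millicores="$1"
  if   [ "$millicores" -lt 100000 ];   then echo "500m 1Gi 2Gi"
  elif [ "$millicores" -lt 500000 ];   then echo "1 2Gi 4Gi"
  elif [ "$millicores" -lt 1000000 ];  then echo "2 4Gi 8Gi"
  elif [ "$millicores" -lt 5000000 ];  then echo "2 8Gi 16Gi"
  elif [ "$millicores" -lt 10000000 ]; then echo "4 16Gi 32Gi"
  else                                      echo "4 32Gi 64Gi"
  fi
}

# Example (requires kubectl access to the cluster):
#   kubectl get nodes \
#     -o jsonpath='{range .items[*]}{.status.allocatable.cpu}{"\n"}{end}' \
#     | sum_allocatable_millicores
```

Working in millicores keeps the comparison in integer arithmetic, which avoids floating-point handling in the shell.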

ClickHouse operator detection

castctl checks whether a ClickHouse operator (Altinity) is already running in your cluster before installing. If one is found, castctl skips the operator installation and reuses the existing one. This prevents conflicts on clusters that already use ClickHouse for other workloads.

If your cluster restricts Custom Resource Definition permissions and the operator cannot be installed automatically, castctl reports an error. In this case, install the Altinity ClickHouse Operator CRDs manually first, then re-run castctl cluster connect --reliability-metrics.
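
Before connecting, you may want to check whether the Altinity CRDs already exist and whether your account can create CRDs. The following is a hedged pre-flight sketch. The function name is hypothetical, and castctl's own detection logic is not documented on this page, so it may check something different. Note also that a CRD being present does not prove the operator itself is running.

```shell
#!/usr/bin/env bash
# Illustrative pre-flight check; castctl's detection logic may differ.

# Reads `kubectl get crd -o name` output on stdin and succeeds if the
# Altinity ClickHouseInstallation CRD is listed.
has_altinity_crd() {
  grep -q 'clickhouseinstallations\.clickhouse\.altinity\.com$'
}

# Example (requires kubectl access to the cluster):
#   kubectl get crd -o name | has_altinity_crd && echo "Altinity CRD present"
#   kubectl auth can-i create customresourcedefinitions
```

If the second command answers `no`, automatic operator installation is likely to fail, and the manual CRD installation described above applies.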

What gets installed in your cluster

When Reliability Metrics is enabled, castctl installs the following components in the castai-agent namespace:

Component	Type	Role
Collector agent	DaemonSet — one pod per node	Observes HTTP and gRPC network traffic and sends measurements to the metrics database
Metrics database	StatefulSet (ClickHouse)	Stores and aggregates traffic measurements temporarily before they are exported to Cast AI
Exporter	Deployment	Reads aggregated data from the metrics database and streams it to Cast AI continuously
ClickHouse Operator	Deployment (cluster-scoped)	Manages the lifecycle of the ClickHouse StatefulSet. Skipped if an operator is already present.

All components are managed by castctl and castctl castware upgrade. You do not need to interact with them directly under normal operation.
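
To confirm what was provisioned, the workloads in the namespace can be listed with standard kubectl. This is a sketch with a hypothetical function name. It lists every DaemonSet, StatefulSet, and Deployment in the namespace because the individual resource names are not documented on this page, and a reused ClickHouse operator may live in a different namespace.

```shell
#!/usr/bin/env bash
# Illustrative helper: lists workload objects in the namespace castctl installs into.
# The namespace defaults to castai-agent, as stated above.
list_reliability_components() {
  kubectl get daemonsets,statefulsets,deployments --namespace "${1:-castai-agent}"
}

# Example (requires kubectl access to the cluster):
#   list_reliability_components
```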

Default monitored ports

The collector observes traffic on these ports by default:

Port	Common use
8080	HTTP application traffic
8443	HTTPS application traffic
8090	Common alternative HTTP port
6379	Redis

Services on other ports

Traffic on ports not in this list is not collected. Configuring additional ports requires Helm support, which is coming in a future release. If your services run exclusively on other ports, contact Cast AI support.
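
To get a rough idea of which ports in a cluster fall outside the default list, the declared ports can be compared against the table above. This is a sketch under assumptions: the function names are hypothetical, and it assumes the collector matches on container ports, which this page does not specify.

```shell
#!/usr/bin/env bash
# Illustrative only: the port list is copied from the table above.

# Succeeds if the given port is one of the default monitored ports.
is_default_monitored_port() {
  case "$1" in
    8080|8443|8090|6379) return 0 ;;
    *)                   return 1 ;;
  esac
}

# Reads one port per line on stdin and prints, sorted and de-duplicated,
# the ports that are not in the default list.
unmonitored_ports() {
  local port
  while read -r port; do
    [ -n "$port" ] || continue
    is_default_monitored_port "$port" || echo "$port"
  done | sort -un
}

# Example (requires kubectl access; only ports declared in pod specs are seen):
#   kubectl get pods --all-namespaces \
#     -o jsonpath='{.items[*].spec.containers[*].ports[*].containerPort}' \
#     | tr ' ' '\n' | unmonitored_ports
```

Any port printed by `unmonitored_ports` would not be collected under the current defaults.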

Coming soon: Helm configuration

Helm chart and GitOps installation support is planned for a future release. When available, it will allow you to configure:

Additional monitored ports
Collector resource sizing overrides
Namespace and workload exclusions
Metrics database storage class and volume size
Collector resource autoscaling via Vertical Pod Autoscaler

This page will be updated with the full Helm values reference when Helm installation is released.