Reliability Metrics Reference

castctl flags, auto-sizing behavior, and what gets deployed in your cluster when you enable Reliability Metrics.

Early Access Feature

This feature is in early access. It may undergo changes based on user feedback and continued development. We recommend testing in non-production environments first and welcome your feedback to help us improve.

This page documents the castctl flags available for Reliability Metrics, explains how castctl automatically sizes components for your cluster, and describes what gets provisioned in the cluster during installation. For an overview of the feature and step-by-step installation instructions, see Reliability Metrics.

Public beta — castctl only

In the current public beta, Reliability Metrics is installed and managed exclusively through castctl. Helm chart configuration and GitOps installation methods are in development and will be documented here when available.


castctl cluster connect flags

These flags are used with castctl cluster connect to install and configure Reliability Metrics.

| Flag | Description |
|---|---|
| --reliability-metrics | Enable Reliability Metrics during cluster connect. Required to install the feature. |
| --non-interactive | Skip all interactive prompts. Use this flag in CI/CD pipelines where there is no terminal. |
| --cluster-name <name> | Set the display name for the cluster being connected. |
| --dry-run | Print what would be installed without making any changes to the cluster. Useful for reviewing the install plan before committing. |

These flags can be combined with flags for other Cast AI features in a single command:

castctl cluster connect \
  --non-interactive \
  --cluster-name production-us-east \
  --reliability-metrics \
  --cluster-optimization \
  --workload-autoscaler
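
In a pipeline, one way to use --dry-run is as a review step before the real install. The sketch below is a hypothetical wrapper (connect_with_review is not part of castctl) built only from the flags documented above. It assumes that --dry-run can be combined with --non-interactive and that castctl exits non-zero when the plan cannot be produced; neither is confirmed on this page.

```shell
# Hypothetical helper, not part of castctl.
# Prints the install plan first, then runs the real connect with the same flags.
# Assumption: --dry-run may be combined with --non-interactive.
connect_with_review() {
  cluster_name="$1"

  # Step 1: show what would be installed; stop here if that fails.
  castctl cluster connect \
    --non-interactive \
    --cluster-name "$cluster_name" \
    --reliability-metrics \
    --dry-run || return 1

  # Step 2: perform the actual installation.
  castctl cluster connect \
    --non-interactive \
    --cluster-name "$cluster_name" \
    --reliability-metrics
}
```

Calling connect_with_review production-us-east would run the dry run and then the install for that cluster name.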

castctl castware upgrade flags

Use castctl castware upgrade to upgrade Reliability Metrics and all other Cast AI components in your cluster.

| Flag | Description |
|---|---|
| --non-interactive | Skip confirmation prompts. Use in CI/CD pipelines. |

castctl castware upgrade --non-interactive

Upgrading with castctl castware upgrade preserves your existing configuration. Any new data migrations are applied automatically when the collector restarts.


How castctl sizes components

When you run castctl cluster connect --reliability-metrics, castctl inspects your cluster before installing anything and automatically determines the right resource allocation for two components: the collector agent that runs on each node, and the metrics database.

You do not need to specify sizing manually. The selections below describe what castctl chooses and why.

Collector agent sizing

The collector agent runs as a DaemonSet — one instance per node — and observes HTTP and gRPC traffic on that node. castctl counts the maximum number of containers running on any single node and selects a sizing profile accordingly.

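| Max containers on any node | Selected profile | CPU request | CPU limit | Memory request | Memory limit |
|---|---|---|---|---|---|
| Up to 5 | small | 50m | 200m | 128 MiB | 256 MiB |
| Up to 15 | medium | 100m | 400m | 256 MiB | 512 MiB |
| Up to 30 | large | 200m | 800m | 512 MiB | 1 GiB |
| More than 30 | xlarge | 400m | 1600m | 1 GiB | 2 GiB |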

These profiles reflect expected traffic volume per node at each workload density level. A node running many containers typically serves more concurrent requests and needs more resources to observe them accurately.
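
To predict which profile a cluster will get, the table can be expressed as a small lookup. This is only a sketch that mirrors the thresholds listed above; collector_profile is a hypothetical helper name, and castctl's actual selection logic is not shown on this page.

```shell
# Hypothetical helper that mirrors the collector sizing table.
# Argument: the highest container count found on any single node (whole number).
collector_profile() {
  max_containers="$1"
  if [ "$max_containers" -le 5 ]; then
    echo small
  elif [ "$max_containers" -le 15 ]; then
    echo medium
  elif [ "$max_containers" -le 30 ]; then
    echo large
  else
    echo xlarge
  fi
}
```

For example, collector_profile 12 prints medium, because 12 is above the small threshold of 5 and within the medium threshold of 15.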

Metrics database sizing

castctl also provisions a ClickHouse time-series database in your cluster for temporary metric storage. It sizes the database based on the cluster's total allocatable CPU, summed across all nodes.

| Total cluster allocatable CPU | CPU request | Memory request | Memory limit |
|---|---|---|---|
| Less than 100 cores | 500m | 1 GiB | 2 GiB |
| Less than 500 cores | 1 core | 2 GiB | 4 GiB |
| Less than 1,000 cores | 2 cores | 4 GiB | 8 GiB |
| Less than 5,000 cores | 2 cores | 8 GiB | 16 GiB |
| Less than 10,000 cores | 4 cores | 16 GiB | 32 GiB |
| 10,000 cores or more | 4 cores | 32 GiB | 64 GiB |

The default persistent volume size for the metrics database is 10 GiB. Data collected in the cluster is retained for up to 7 days before being purged. Because it is exported to Cast AI continuously, it only needs to be stored locally for a short window.
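
The database tiers can be sketched the same way. The function below is a hypothetical helper (metrics_db_sizing is not a castctl command) that reproduces the table's thresholds for a whole number of cores. How castctl rounds fractional allocatable CPU is not documented here, so passing an integer is an assumption.

```shell
# Hypothetical helper that mirrors the metrics database sizing table.
# Argument: total allocatable CPU across all nodes, as a whole number of cores.
# Output: CPU request, memory request, and memory limit for that tier.
metrics_db_sizing() {
  cores="$1"
  if [ "$cores" -lt 100 ]; then
    echo "cpu=500m mem_request=1GiB mem_limit=2GiB"
  elif [ "$cores" -lt 500 ]; then
    echo "cpu=1 mem_request=2GiB mem_limit=4GiB"
  elif [ "$cores" -lt 1000 ]; then
    echo "cpu=2 mem_request=4GiB mem_limit=8GiB"
  elif [ "$cores" -lt 5000 ]; then
    echo "cpu=2 mem_request=8GiB mem_limit=16GiB"
  elif [ "$cores" -lt 10000 ]; then
    echo "cpu=4 mem_request=16GiB mem_limit=32GiB"
  else
    echo "cpu=4 mem_request=32GiB mem_limit=64GiB"
  fi
}
```

For example, metrics_db_sizing 640 prints cpu=2 mem_request=4GiB mem_limit=8GiB, the tier for clusters with fewer than 1,000 cores.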

ClickHouse operator detection

castctl checks whether a ClickHouse operator (Altinity) is already running in your cluster before installing. If one is found, castctl skips the operator installation and reuses the existing one. This prevents conflicts on clusters that already use ClickHouse for other workloads.

If your cluster restricts permissions on CustomResourceDefinitions (CRDs) and the operator cannot be installed automatically, castctl reports an error. In this case, install the Altinity ClickHouse Operator CRDs manually first, then re-run castctl cluster connect --reliability-metrics.
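
Before re-running the connect command, it can help to confirm that the Altinity CRDs are actually present. The sketch below is a hypothetical helper (has_altinity_crd is not part of castctl) that looks for the ClickHouseInstallation CRD name used by the Altinity operator in a list of CRD names read from standard input. It only checks that the CRD exists, which is not the same as a running operator, and it is not a description of how castctl performs its own detection.

```shell
# Hypothetical helper: reads CRD names on stdin and succeeds if the
# Altinity ClickHouseInstallation CRD is among them.
# Intended input: the output of `kubectl get crd -o name`.
has_altinity_crd() {
  grep -q 'clickhouseinstallations\.clickhouse\.altinity\.com'
}
```

With cluster access, a typical use would be kubectl get crd -o name | has_altinity_crd && echo "Altinity CRDs found".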


What gets installed in your cluster

When Reliability Metrics is enabled, castctl installs the following components in the castai-agent namespace:

| Component | Type | Role |
|---|---|---|
| Collector agent | DaemonSet (one pod per node) | Observes HTTP and gRPC network traffic and sends measurements to the metrics database |
| Metrics database | StatefulSet (ClickHouse) | Stores and aggregates traffic measurements temporarily before they are exported to Cast AI |
| Exporter | Deployment | Reads aggregated data from the metrics database and streams it to Cast AI continuously |
| ClickHouse Operator | Deployment (cluster-scoped) | Manages the lifecycle of the ClickHouse StatefulSet. Skipped if an operator is already present. |

All components are managed by castctl and castctl castware upgrade. You do not need to interact with them directly under normal operation.
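
If you want to see what was deployed, the workload types from the table can be listed with kubectl. This is a minimal sketch: list_reliability_components is a hypothetical helper name, and because the resource names and labels of the individual components are not documented here, it lists every DaemonSet, StatefulSet, and Deployment in the castai-agent namespace rather than filtering for Reliability Metrics. A pre-existing ClickHouse operator may run in a different namespace and would not appear.

```shell
# Hypothetical helper: list the workload kinds named in the table above
# for the castai-agent namespace. Requires kubectl access to the cluster.
list_reliability_components() {
  kubectl get daemonsets,statefulsets,deployments -n castai-agent
}
```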


Default monitored ports

The collector observes traffic on these ports by default:

| Port | Common use |
|---|---|
| 8080 | HTTP application traffic |
| 8443 | HTTPS application traffic |
| 8090 | Common alternative HTTP port |
| 6379 | Redis |

Services on other ports

Traffic on ports not in this list is not collected. Configuring additional ports requires Helm support, which is coming in a future release. If your services run exclusively on other ports, contact Cast AI support.
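
To find out which of your services fall outside the default list, you can compare their ports against it. The sketch below hard-codes the four default ports from the table; is_monitored_port and DEFAULT_MONITORED_PORTS are hypothetical names used for illustration and are not castctl settings.

```shell
# Default ports taken from the table above.
DEFAULT_MONITORED_PORTS="8080 8443 8090 6379"

# Hypothetical helper: succeeds if the given port is in the default list.
is_monitored_port() {
  for port in $DEFAULT_MONITORED_PORTS; do
    if [ "$port" = "$1" ]; then
      return 0
    fi
  done
  return 1
}
```

For example, is_monitored_port 9090 || echo "9090 is not collected by default" prints the message, because 9090 is not one of the default ports.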


Coming soon: Helm configuration

Helm chart and GitOps installation support is planned for a future release. When available, it will allow you to configure:

  • Additional monitored ports
  • Collector resource sizing overrides
  • Namespace and workload exclusions
  • Metrics database storage class and volume size
  • Collector resource autoscaling via Vertical Pod Autoscaler

This page will be updated with the full Helm values reference when Helm installation is released.