Notifications

This Cast AI feature informs you, via the UI or a webhook, about key issues affecting the cluster. It also delivers other useful information, such as the daily vulnerability report. This guide lists the notification types you may see in Cast AI, with examples and relevant action points.

When new items are ready for you to view, the bell icon in the top menu shows a count. You can view all items on the Notifications page.

Due to the dynamic nature of Kubernetes clusters, notifications expire automatically after 24 hours.
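
If you archive notifications in your own tooling, the 24-hour window can be mirrored with a small check. This is a minimal sketch, not Cast AI code: the function name is_expired and the idea of measuring the window from a creation timestamp are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# The 24-hour window comes from the text above; measuring it from the
# creation time is an assumption of this sketch.
NOTIFICATION_TTL = timedelta(hours=24)


def is_expired(created_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True once a notification is at least 24 hours old."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - created_at >= NOTIFICATION_TTL


# Hypothetical creation time, used only to exercise the function.
created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_expired(created, now=created + timedelta(hours=23)))  # False
print(is_expired(created, now=created + timedelta(hours=24)))  # True
```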

Notification severity types

Cast AI uses five severity types to indicate the importance of each message.

Severity | Description
Critical | Indicates a severe issue that requires immediate attention and may significantly impact cluster operations.
Error | Signifies a problem that is causing a malfunction or preventing expected behavior.
Warning | Alerts about potential issues or situations that could lead to problems if not addressed.
Info | Provides general information about cluster operations, updates, or status changes.
Success | Confirms that an operation or process has completed successfully.
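
When you process notifications in a script, for example after receiving them through a webhook, it can help to treat the severity types as an ordered scale. The sketch below is illustrative only: the numeric ranks, the dictionary keys "name" and "severity", and the helper at_least are assumptions, not part of any Cast AI API.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Numeric ranks are an assumption of this sketch; the source only
    # names the five severity types.
    SUCCESS = 0
    INFO = 1
    WARNING = 2
    ERROR = 3
    CRITICAL = 4


def at_least(notifications, minimum):
    """Keep notifications whose severity is at or above `minimum`."""
    return [
        n for n in notifications
        if Severity[n["severity"].upper()] >= minimum
    ]


# Hypothetical records; the names are taken from the reference tables.
sample = [
    {"name": "Cluster reconciled", "severity": "success"},
    {"name": "GPU quota exceeded", "severity": "warning"},
    {"name": "Node deletion failed", "severity": "critical"},
    {"name": "SSO Connection problem", "severity": "error"},
]

urgent = at_least(sample, Severity.ERROR)
print([n["name"] for n in urgent])
```

Using an IntEnum keeps comparisons such as "error or worse" to a single expression, at the cost of fixing an ordering that you may want to adjust for your own workflow.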

Notification categories

Cast AI organizes notifications into the following categories:

  • Reporting anomalies - Cost and performance anomaly detection notifications
  • Inventory - New cloud provider instance availability notifications
  • Security - Runtime anomaly and image security notifications
  • Other - General cluster operations, system connectivity, and configuration issues

Complete notifications reference

This section provides a comprehensive list of all notifications that Cast AI can generate, organized by category and severity level.

Critical notifications

Other category

Notification | Description
Cast AI agent is not able to connect to the API | The Cast AI agent cannot communicate with Cast AI API endpoints. Check network connectivity, firewall restrictions, or authentication.
Cluster controller not responding | The cluster controller component is unresponsive. This can prevent cluster management operations and autoscaling from functioning properly.
Failed to Reconcile Cluster | A severe reconciliation failure occurred. This can happen when Cast AI service accounts are modified or there are significant configuration conflicts.
IP Address quota exceeded | Your cloud provider's IP address quota has been exceeded, preventing new nodes from being created with proper network connectivity.
Node Configuration Validation Failed | Node configuration is invalid. Check the notification details for specific configuration errors.
Node deletion failed | Cast AI was unable to delete a node from the cluster. This may indicate permission issues or cloud provider API problems.
Operation failed | A critical cluster operation has failed. Check the notification details to see which operation failed and why.
Spot Instance quota exceeded | Your cloud provider's Spot Instance quota has been exceeded, preventing Cast AI from launching cost-effective Spot Instances for your workloads. Additionally, since Spot Fallback is not enabled, the Autoscaler might not be able to add any capacity.
The Cast AI agent is unable to connect to the API | Duplicate of the "Cast AI agent is not able to connect to the API" notification.

Error notifications

Other category

Notification | Description
Missing permission when adding a node to a target group | Cast AI lacks the necessary IAM permissions to add nodes to target groups in your load balancer.
Missing permission when adding a node to load balancer(s) | Cast AI lacks the necessary IAM permissions to add nodes to load balancers.
Missing permission when adding a node to target groups | Cast AI lacks the necessary IAM permissions to add nodes to target groups.
Missing permission when adding VMSS IP address to a backend pool | Cast AI lacks the necessary Azure permissions to add Virtual Machine Scale Set IP addresses to backend pools.
Missing permission when deleting a node from target groups | Cast AI lacks the necessary IAM permissions to remove nodes from target groups.
Missing permission when removing a node from load balancer(s) | Cast AI lacks the necessary IAM permissions to remove nodes from load balancers.
SSO Connection problem | There is an issue with your Single Sign-On configuration preventing proper user authentication.

Warning notifications

Other category

Notification | Description
Network traffic anomaly notifications | Cast AI monitors network traffic patterns and alerts you when unusual activity is detected. May include Cloud API, Internet, inter-region, or inter-zone traffic anomalies.
Resource overprovisioning anomaly notifications | Cast AI detects when CPU or RAM resources are significantly overprovisioned, helping identify optimization opportunities.
Cost anomaly notifications | Cast AI monitors cost metrics and alerts you to unusual spending patterns. May focus on compute costs, CPU provisioning costs, or cost-per-resource metrics. Content varies based on detected patterns.
Cannot find valid instance types for the given workloads | Cast AI cannot identify suitable instance types for your workloads. Consider adjusting workload resource requests or instance type preferences.
Continuous OOMKilled Events Detected | Pods are being continuously killed due to out-of-memory conditions, indicating insufficient memory allocation or memory leaks.
Failed Helm Test of castai-workload-autoscaler | The Helm test for the Cast AI workload autoscaler component has failed, indicating potential deployment or configuration issues.
Failed to reconcile cluster | A non-critical reconciliation issue occurred. While not immediately severe, this should be monitored and addressed.
GPU quota exceeded | Your cloud provider's GPU quota has been exceeded, preventing allocation of GPU resources for workloads that require them.
Outdated cluster-controller | The cluster controller component is running an outdated version and should be updated for optimal functionality and security.
Unable to create castpoolarm. ARM VMs will not work | Cast AI cannot create the ARM instance pool, preventing ARM-based virtual machines from being used in your cluster.
Unable to update pool | Cast AI was unable to update a node pool configuration, which may prevent scaling operations or configuration changes.
Spot Instance quota exceeded | Your cloud provider's Spot Instance quota has been exceeded, preventing Cast AI from launching cost-effective Spot Instances for your workloads.

Reporting anomalies category

Notification | Description
Cost anomaly notifications | Cast AI monitors cost and efficiency metrics across your cluster. Notifications vary in focus and specificity, targeting different metric combinations. Each notification specifies which metrics triggered detection.

Info notifications

Inventory category

Notification | Description
New machines available in AWS | New AWS instance types have become available and can now be used by Cast AI for your clusters.
New machines available in Azure | New Azure virtual machine sizes have become available and can now be used by Cast AI for your clusters.
New machines available in GCP | New Google Cloud Platform machine types have become available and can now be used by Cast AI for your clusters.

Other category

Notification | Description
Read-Only access activated | Cast AI has been activated in read-only mode, which means it can monitor your cluster but cannot make changes to it.
Trial expires soon | Your Cast AI trial period is approaching its expiration date. Consider upgrading to a paid plan to continue using Cast AI features.
Trial has expired | Your Cast AI trial period has expired. Upgrade to a paid plan to restore full functionality.

Reporting anomalies category

Notification | Description
Daily Vulnerability Report | Your daily security vulnerability report is available, containing information about potential security issues in your cluster workloads.

Success notifications

Other category

Notification | Description
Cluster reconciled | The cluster has been reconciled: Cast AI has synchronized the desired cluster state with the actual state.

Taking action on notifications

When you receive notifications, consider the following general action steps:

  1. Critical notifications - Address immediately as they can severely impact cluster operations
  2. Error notifications - Investigate and resolve permission issues or configuration problems
  3. Warning notifications - Review for potential cost savings or performance improvements
  4. Info notifications - Stay informed about new capabilities and system status
  5. Success notifications - Confirm that operations completed as expected
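
The steps above can be sketched as a small triage helper that orders notifications from most to least urgent and attaches the matching guidance. This is an illustrative sketch only: the record keys "name" and "severity" and the function triage are hypothetical, and the action texts paraphrase the list above.

```python
# Guidance per severity, paraphrased from the action steps above.
# Dict insertion order doubles as the priority order (most urgent first).
ACTIONS = {
    "critical": "Address immediately",
    "error": "Investigate and resolve permission or configuration problems",
    "warning": "Review for potential cost savings or performance improvements",
    "info": "Stay informed about new capabilities and system status",
    "success": "Confirm that the operation completed as expected",
}
PRIORITY = list(ACTIONS)


def triage(notifications):
    """Return (name, suggested action) pairs, most urgent first."""
    ordered = sorted(notifications, key=lambda n: PRIORITY.index(n["severity"]))
    return [(n["name"], ACTIONS[n["severity"]]) for n in ordered]


# Hypothetical records; the names are taken from the reference tables.
inbox = [
    {"name": "Trial expires soon", "severity": "info"},
    {"name": "Operation failed", "severity": "critical"},
    {"name": "Unable to update pool", "severity": "warning"},
]

for name, action in triage(inbox):
    print(f"{name}: {action}")
```

Because sorted is stable, notifications with the same severity keep the order in which they arrived.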

For specific troubleshooting steps related to individual notifications, consult the relevant Cast AI documentation or contact support if the issue persists.