Alert dependencies

Note

The alert dependencies in this section apply to the latest supported Cluster releases.

Using alert inhibition rules, Alertmanager decreases alert noise by suppressing notifications for dependent alerts, which provides a clearer view of the cloud status and simplifies troubleshooting. Alert inhibition rules are enabled by default.

The following table describes the dependencies between alerts. Once an alert from the Alert column fires, the alerts listed in the Inhibits and rules column are suppressed and displayed with the Inhibited status in the Alertmanager web UI.

The Inhibits and rules column also lists the labels and conditions, if any, that must match for the inhibition to apply.
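To illustrate how the labels and conditions in the table are meant to be read, the following minimal Python sketch models three rows of the table. It is a simplified model written for this page, not the Alertmanager implementation or the actual product configuration. The InhibitRule class, the is_inhibited function, and the sample label values such as node-1 are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class InhibitRule:
    """One table row: a firing source alert suppresses a target alert."""
    source: str                 # alert name from the Alert column
    target: str                 # alert name from the Inhibits and rules column
    equal: tuple = ()           # labels that must have the same value on both alerts
    target_labels: dict = field(default_factory=dict)  # extra conditions on the target


# Three rows taken from the table (hypothetical representation).
RULES = [
    InhibitRule("KubeAPIOutage", "KubeAPIDown"),
    InhibitRule("SystemDiskFullMajor", "SystemDiskFullWarning",
                equal=("device", "mountpoint", "node")),
    InhibitRule("KubeJobFailed", "KubePodsNotReady",
                equal=("created_by_name",),
                target_labels={"created_by_kind": "Job"}),
]


def is_inhibited(target_alert, firing_alerts, rules):
    """Return True if any firing alert suppresses target_alert under the rules.

    Alerts are plain dicts of label names to values. A label that is absent
    is compared as an empty string.
    """
    for rule in rules:
        if target_alert.get("alertname") != rule.target:
            continue
        # The target must satisfy the row's extra conditions, if any.
        if any(target_alert.get(k) != v for k, v in rule.target_labels.items()):
            continue
        for source in firing_alerts:
            if source.get("alertname") != rule.source:
                continue
            # Every label listed as "with the same ... label" must be equal.
            if all(source.get(l, "") == target_alert.get(l, "") for l in rule.equal):
                return True
    return False


major = {"alertname": "SystemDiskFullMajor", "device": "/dev/sda1",
         "mountpoint": "/", "node": "node-1"}
warning_same_node = {"alertname": "SystemDiskFullWarning", "device": "/dev/sda1",
                     "mountpoint": "/", "node": "node-1"}
warning_other_node = dict(warning_same_node, node="node-2")

print(is_inhibited(warning_same_node, [major], RULES))   # True
print(is_inhibited(warning_other_node, [major], RULES))  # False
```

In this sketch, the warning on node-2 is not suppressed because its node label differs from the one on the firing major alert, which matches the "with the same device, mountpoint, and node labels" wording used in the table.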

Alert

Inhibits and rules

cAdvisorTargetsOutage

cAdvisorTargetDown

CalicoTargetsOutage

CalicoTargetDown

CephClusterFullCritical

CephClusterFullWarning

CephClusterHealthCritical

CephClusterHealthWarning

CephOSDDiskNotResponding

CephOSDDown with the same rook_cluster label Before 17.0.0, 16.0.0, 14.1.0

CephOSDDiskUnavailable

CephOSDDown with the same rook_cluster label Before 17.0.0, 16.0.0, 14.1.0

CephOSDNodeDown Since 17.0.0, 16.0.0, 14.1.0

With the same node label:

  • CephOSDDiskNotResponding

  • CephOSDDiskUnavailable

CephOSDPgNumTooHighCritical

CephOSDPgNumTooHighWarning

DockerSwarmServiceReplicasFlapping

DockerSwarmServiceReplicasDown with the same service_id, service_mode, and service_name labels

DockerSwarmServiceReplicasOutage

DockerSwarmServiceReplicasDown with the same service_id, service_mode, and service_name labels

etcdDbSizeCritical

etcdDbSizeMajor with the same job and instance labels

etcdHighNumberOfFailedGRPCRequestsCritical

etcdHighNumberOfFailedGRPCRequestsWarning with the same grpc_method, grpc_service, job, and instance labels

ExternalEndpointDown

ExternalEndpointTCPFailure with the same instance and job labels

FileDescriptorUsageMajor

FileDescriptorUsageWarning with the same node label

FluentdTargetsOutage

FluentdTargetDown

KubeAPICertExpirationHigh

KubeAPICertExpirationMedium

KubeAPIErrorsHighMajor

KubeAPIErrorsHighWarning with the same instance label

KubeAPIOutage

KubeAPIDown

KubeAPIResourceErrorsHighMajor

KubeAPIResourceErrorsHighWarning with the same instance, resource, and subresource labels

KubeClientCertificateExpirationInOneDay

KubeClientCertificateExpirationInSevenDays with the same instance label

KubeDaemonSetOutage

  • CalicoTargetsOutage

  • KubeDaemonSetRolloutStuck with the same daemonset and namespace labels

  • FluentdTargetsOutage

  • NodeExporterTargetsOutage

  • TelegrafSMARTTargetsOutage

KubeDeploymentOutage

  • KubeDeploymentReplicasMismatch with the same deployment and namespace labels

  • GrafanaTargetDown

  • KubeDNSTargetsOutage Removed in 17.0.0, 16.0.0, 14.1.0

  • KubernetesMasterAPITargetsOutage

  • KubeStateMetricsTargetDown

  • PrometheusEsExporterTargetDown

  • PrometheusMsTeamsTargetDown

  • PrometheusRelayTargetDown

  • ServiceNowWebhookReceiverTargetDown

  • SfNotifierTargetDown

  • TelegrafDockerSwarmTargetDown

  • TelegrafOpenstackTargetDown

KubeJobFailed

KubePodsNotReady for created_by_kind=Job and with the same created_by_name label (removed in 17.0.0, 16.0.0, 14.1.0)

KubeletTargetsOutage

KubeletTargetDown

KubePersistentVolumeUsageCritical

With the same namespace and persistentvolumeclaim labels:

  • KubePersistentVolumeFullInFourDays

  • OpenSearchStorageUsageCritical Since 2.26.0 (17.1.0 and 16.1.0)

  • OpenSearchStorageUsageMajor Since 2.26.0 (17.1.0 and 16.1.0)

KubePodsCrashLooping

KubePodsRegularLongTermRestarts with the same created_by_name, created_by_kind, and namespace labels

KubeStatefulSetOutage

  • Alerts with the same namespace and statefulset labels:

    • KubeStatefulSetUpdateNotRolledOut

    • KubeStatefulSetReplicasMismatch

  • AlertmanagerTargetDown Since 17.0.0, 16.0.0, 14.1.0

  • AlertmanagerClusterTargetDown Before 17.0.0, 16.0.0, 14.1.0

  • ElasticsearchExporterTargetDown

  • FluentdTargetsOutage

  • OpenSearchClusterStatusCritical

  • PostgresqlReplicaDown

  • PostgresqlTargetDown Since 17.0.0, 16.0.0, 14.1.0

  • PostgresqlTargetsOutage Before 17.0.0, 16.0.0, 14.1.0

  • PrometheusEsExporterTargetDown

  • PrometheusServerTargetDown Since 17.0.0, 16.0.0, 14.1.0

  • PrometheusServerTargetsOutage Before 17.0.0, 16.0.0, 14.1.0

MCCLicenseExpirationHigh

MCCLicenseExpirationMedium

MCCSSLCertExpirationHigh

MCCSSLCertExpirationMedium with the same namespace and service_name labels

MCCSSLProbesServiceTargetOutage

MCCSSLProbesEndpointTargetOutage with the same namespace and service_name labels

MKEAPICertExpirationHigh

MKEAPICertExpirationMedium

MKEAPIOutage

MKEAPIDown

MKEMetricsEngineTargetsOutage

MKEMetricsEngineTargetDown

MKENodeDiskFullCritical

MKENodeDiskFullWarning with the same node label

NodeDown

  • KubeDaemonSetMisScheduled for the following DaemonSets:

    • cadvisor

    • csi-cephfsplugin

    • csi-cinder-nodeplugin

    • csi-rbdplugin

    • fluentd-logs

    • local-volume-provisioner

    • metallb-speaker

    • openstack-ccm

    • prometheus-libvirt-exporter

    • prometheus-node-exporter

    • rook-discover

    • telegraf-ds-smart

    • ucp-metrics

  • KubeDaemonSetRolloutStuck for the calico-node and ucp-nvidia-device-plugin DaemonSets

  • For resource=nodes:

    • KubeAPIResourceErrorsHighMajor

    • KubeAPIResourceErrorsHighWarning

  • Alerts with the same node label:

    • cAdvisorTargetDown

    • CalicoTargetDown

    • FluentdTargetDown

    • KubeletDown

    • KubeletTargetDown

    • KubeNodeNotReady

    • LibvirtExporterTargetDown

    • MKEMetricsEngineTargetDown

    • MKENodeDown

    • NodeExporterTargetDown

    • TelegrafSMARTTargetDown

    Since Cluster releases 17.0.0, 16.0.0, and 14.1.0:

    • AlertmanagerTargetDown

    • CephClusterTargetDown

    • etcdTargetDown

    • GrafanaTargetDown

    • HelmControllerTargetDown

    • KubeAPIDown

    • MCCCacheTargetDown

    • MCCControllerTargetDown

    • MCCProviderTargetDown

    • MKEAPIDown

    • PostgresqlTargetDown

    • PrometheusMsTeamsTargetDown

    • PrometheusRelayTargetDown

    • PrometheusServerTargetDown

    • ServiceNowWebhookReceiverTargetDown

    • SfNotifierTargetDown

    • TelegrafDockerSwarmTargetDown

    • TelemeterClientTargetDown

    • TelemeterServerFederationTargetDown

    • TelemeterServerTargetDown

NodeExporterTargetsOutage

NodeExporterTargetDown

OpenSearchClusterStatusCritical

  • OpenSearchClusterStatusWarning and OpenSearchNumberOfUnassignedShards with the same cluster label

  • For created_by_name=~"elasticsearch-curator-.":

    • KubeJobFailed

    • KubePodsNotReady Removed in 17.0.0, 16.0.0, 14.1.0

OpenSearchClusterStatusWarning Since 2.26.0 (17.1.0 and 16.1.0)

  • OpenSearchNumberOfUnassignedShards with the same cluster label

OpenSearchHeapUsageCritical

OpenSearchHeapUsageWarning with the same cluster and name labels

OpenSearchStorageUsageCritical Since 2.26.0 (17.1.0 and 16.1.0)

KubePersistentVolumeFullInFourDays and OpenSearchStorageUsageMajor with the same namespace and persistentvolumeclaim labels

OpenSearchStorageUsageMajor Since 2.26.0 (17.1.0 and 16.1.0)

KubePersistentVolumeFullInFourDays with the same namespace and persistentvolumeclaim labels

PostgresqlPatroniClusterUnlocked

With the same cluster and namespace labels:

  • PostgresqlReplicationNonStreamingReplicas

  • PostgresqlReplicationPaused

PostgresqlReplicaDown

  • Alerts with the same cluster and namespace labels:

    • PostgresqlReplicationNonStreamingReplicas

    • PostgresqlReplicationPaused

    • PostgresqlReplicationSlowWalApplication

    • PostgresqlReplicationSlowWalDownload

    • PostgresqlReplicationWalArchiveWriteFailing

PrometheusErrorSendingAlertsMajor

PrometheusErrorSendingAlertsWarning with the same alertmanager and pod labels

SystemDiskFullMajor

SystemDiskFullWarning with the same device, mountpoint, and node labels

SystemDiskInodesFullMajor

SystemDiskInodesFullWarning with the same device, mountpoint, and node labels

SystemLoadTooHighCritical

SystemLoadTooHighWarning with the same node label

SystemMemoryFullMajor

SystemMemoryFullWarning with the same node label

SSLCertExpirationHigh

SSLCertExpirationMedium with the same instance label

TelegrafSMARTTargetsOutage

TelegrafSMARTTargetDown

TelemeterServerTargetDown

TelemeterServerFederationTargetDown