Alert dependencies

Note

The alert dependencies in this section apply to the latest supported Cluster releases.

Using alert inhibition rules, Alertmanager reduces alert noise by suppressing notifications for dependent alerts, which provides a clearer view of the cloud status and simplifies troubleshooting. Alert inhibition rules are enabled by default.

The following table describes the dependencies between alerts. Once an alert from the Alert column fires, Alertmanager suppresses the alerts listed in the Inhibits and rules column and displays them with the Inhibited status in the Alertmanager web UI.

The Inhibits and rules column lists the labels and conditions, if any, for the inhibition to apply.
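The matching logic behind a table row can be sketched as follows. This is a minimal illustrative model, not the Alertmanager implementation: the InhibitRule class, the is_inhibited function, and the RULES list are hypothetical names, and only three rows of the table are encoded. The sketch assumes that labels named in a "with the same ... label" condition must carry equal values in both alerts, and that a label missing from an alert counts as an empty value.

```python
from dataclasses import dataclass, field


@dataclass
class InhibitRule:
    """Simplified, hypothetical model of one row of the dependency table."""
    source: str                    # alert from the "Alert" column
    target: str                    # alert from the "Inhibits and rules" column
    equal: tuple = ()              # labels that must have the same value in both alerts
    target_labels: dict = field(default_factory=dict)  # fixed conditions on the target


# Three rows of the table, encoded for illustration only.
RULES = [
    InhibitRule("CephClusterFullCritical", "CephClusterFullWarning",
                equal=("rook_cluster",)),
    InhibitRule("KubeJobFailed", "KubePodsNotReady",
                equal=("created_by_name",),
                target_labels={"created_by_kind": "Job"}),
    InhibitRule("KubeAPIOutage", "KubeAPIDown"),
]


def is_inhibited(target, firing, rules=RULES):
    """Return True if any alert in `firing` inhibits `target` under `rules`.

    Alerts are plain dicts of labels that include the "alertname" label.
    """
    for rule in rules:
        if target.get("alertname") != rule.target:
            continue
        # Fixed label conditions, such as created_by_kind=Job, apply to the target.
        if any(target.get(k) != v for k, v in rule.target_labels.items()):
            continue
        for source in firing:
            if source.get("alertname") != rule.source:
                continue
            # A missing label is compared as an empty string.
            if all(source.get(name, "") == target.get(name, "") for name in rule.equal):
                return True
    return False
```

In this model, a firing CephClusterFullCritical alert suppresses CephClusterFullWarning only when both alerts carry the same rook_cluster value, whereas KubeAPIOutage suppresses KubeAPIDown unconditionally because that row lists no labels.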

Alert

Inhibits and rules

cAdvisorTargetsOutage

cAdvisorTargetDown

CalicoTargetsOutage

CalicoTargetDown

CephClusterFullCritical

CephClusterFullWarning with the same rook_cluster label

CephClusterHealthCritical

CephClusterHealthWarning with the same rook_cluster label

CephNodeDown

CephOSDDiskUnavailable with the same node and rook_cluster labels

CephOSDDiskNotResponding

CephOSDDown with the same rook_cluster label

CephOSDDiskUnavailable

CephOSDDown

CephOSDPgNumTooHighCritical

CephOSDPgNumTooHighWarning with the same rook_cluster label

DockerSwarmServiceReplicasFlapping

DockerSwarmServiceReplicasDown with the same service_id, service_mode, and service_name labels

DockerSwarmServiceReplicasOutage

DockerSwarmServiceReplicasDown with the same service_id, service_mode, and service_name labels

etcdDbSizeCritical

etcdDbSizeMajor with the same job and instance labels

etcdHighNumberOfFailedGRPCRequestsCritical

etcdHighNumberOfFailedGRPCRequestsWarning with the same grpc_method, grpc_service, job, and instance labels

ExternalEndpointDown

ExternalEndpointTCPFailure with the same instance and job labels

FileDescriptorUsageMajor

FileDescriptorUsageWarning with the same node label

FluentdTargetsOutage

FluentdTargetDown

IronicBmApiOutage

IronicBmMetricsMissing

IronicBmTargetDown

IronicBmApiOutage

KubeAPICertExpirationHigh

KubeAPICertExpirationMedium

KubeAPIErrorsHighMajor

KubeAPIErrorsHighWarning

KubeAPIOutage

KubeAPIDown

KubeAPIResourceErrorsHighMajor

KubeAPIResourceErrorsHighWarning with the same instance, resource, and subresource labels

KubeClientCertificateExpirationInOneDay

KubeClientCertificateExpirationInSevenDays with the same instance label

KubeDaemonSetOutage

  • KubeDaemonSetRolloutStuck with the same daemonset and namespace labels

  • FluentdTargetsOutage

  • NodeExporterTargetsOutage

  • TelegrafArpingCheckTargetsOutage

  • TelegrafSMARTTargetsOutage

KubeDeploymentOutage

  • KubeDeploymentReplicasMismatch with the same deployment and namespace labels

  • GrafanaTargetDown

  • KubeDNSTargetsOutage

  • KubernetesMasterAPITargetsOutage

  • KubeStateMetricsTargetDown

  • PrometheusEsExporterTargetDown

  • PrometheusMsTeamsTargetDown

  • PrometheusRelayTargetDown

  • PushgatewayTargetDown

  • ServiceNowWebhookReceiverTargetDown

  • SfNotifierTargetDown

  • TelegrafDockerSwarmTargetDown

  • TelegrafOpenstackTargetDown

KubeJobFailed

KubePodsNotReady for created_by_kind=Job and with the same created_by_name label

KubeletTargetsOutage

KubeletTargetDown

KubePersistentVolumeUsageCritical

KubePersistentVolumeFullInFourDays with the same namespace and persistentvolumeclaim labels

KubePodsCrashLooping

KubePodsRegularLongTermRestarts with the same created_by_name, created_by_kind, and namespace labels

KubeStatefulSetOutage

  • Alerts with the same namespace and statefulset labels:

    • KubeStatefulSetUpdateNotRolledOut

    • KubeStatefulSetReplicasMismatch

  • AlertmanagerClusterTargetsOutage

  • ElasticsearchExporterTargetDown

  • FluentdTargetsOutage

  • OpenSearchClusterStatusCritical

  • PostgresqlReplicaDown

  • PostgresqlTargetsOutage

  • PrometheusEsExporterTargetDown

  • PrometheusServerTargetsOutage

MCCLicenseExpirationCritical

MCCLicenseExpirationMajor

MCCSSLCertExpirationHigh

MCCSSLCertExpirationMedium with the same namespace and service_name labels

MCCSSLProbesServiceTargetOutage

MCCSSLProbesEndpointTargetOutage with the same namespace and service_name labels

MKEAPICertExpirationHigh

MKEAPICertExpirationMedium

MKEAPIOutage

MKEAPIDown

MKEMetricsEngineTargetsOutage

MKEMetricsEngineTargetDown

MKENodeDiskFullCritical

MKENodeDiskFullWarning with the same node label

NodeDown

  • KubeDaemonSetMisScheduled for the following DaemonSets:

    • csi-cinder-nodeplugin

    • csi-rbdplugin

    • fluentd-logs

    • local-volume-provisioner

    • metallb-speaker

    • openstack-ccm

    • prometheus-node-exporters

    • rook-discover

    • ucp-metrics

  • KubeDaemonSetRolloutStuck for the calico-node and ucp-nvidia-device-plugin DaemonSets

  • For resource=nodes:

    • KubeAPIResourceErrorsHighMajor

    • KubeAPIResourceErrorsHighWarning

  • Alerts with the same node label:

    • KubeletDown

    • KubeNodeNotReady

    • MKENodeDown

    • cAdvisorTargetDown

    • CalicoTargetDown

    • FluentdTargetDown

    • KubeletTargetDown

    • LibvirtExporterTargetDown

    • MKEMetricsEngineTargetDown

    • NodeExporterTargetDown

    • TelegrafArpingCheckTargetDown

    • TelegrafSMARTTargetDown

NodeExporterTargetsOutage

NodeExporterTargetDown

OpenSearchClusterStatusCritical

  • OpenSearchClusterStatusWarning with the same cluster label

  • For created_by_name=~"elasticsearch-curator-.*":

    • KubeJobFailed

    • KubePodsNotReady

OpenSearchHeapUsageCritical

OpenSearchHeapUsageWarning with the same cluster and name labels

PostgresqlPatroniClusterUnlocked

Alerts with the same cluster and namespace labels:

  • PostgresqlReplicationNonStreamingReplicas

  • PostgresqlReplicationPaused

PostgresqlReplicaDown

  • Alerts with the same cluster and namespace labels:

    • PostgresqlReplicationNonStreamingReplicas

    • PostgresqlReplicationPaused

    • PostgresqlReplicationSlowWalApplication

    • PostgresqlReplicationSlowWalDownload

    • PostgresqlReplicationWalArchiveWriteFailing

PrometheusErrorSendingAlertsMajor

PrometheusErrorSendingAlertsWarning with the same alertmanager and pod labels

SystemDiskFullMajor

SystemDiskFullWarning with the same device, mountpoint, and node labels

SystemDiskInodesFullMajor

SystemDiskInodesFullWarning with the same device, mountpoint, and node labels

SystemLoadTooHighCritical

SystemLoadTooHighWarning with the same node label

SystemMemoryFullMajor

SystemMemoryFullWarning with the same node label

SSLCertExpirationHigh

SSLCertExpirationMedium with the same instance label

TelegrafArpingCheckTargetsOutage

TelegrafArpingCheckTargetDown

TelegrafSMARTTargetsOutage

TelegrafSMARTTargetDown

TelemeterServerTargetDown

TelemeterServerFederationTargetDown
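
To check which alerts are currently inhibited outside the web UI, the alert list returned by the Alertmanager API can be filtered. The following is a minimal sketch, assuming the input is the decoded JSON list from the Alertmanager API v2 alerts endpoint (/api/v2/alerts), where each alert carries a status object with the state and inhibitedBy fields. The inhibited_alerts function name and the sample payload are hypothetical.

```python
def inhibited_alerts(alerts):
    """Map each inhibited alert name to the fingerprints of its inhibiting alerts.

    `alerts` is assumed to be the decoded JSON list returned by the
    Alertmanager API v2 alerts endpoint.
    """
    result = {}
    for alert in alerts:
        status = alert.get("status", {})
        inhibited_by = status.get("inhibitedBy") or []
        # Silenced alerts are also "suppressed", so require a non-empty inhibitedBy.
        if status.get("state") == "suppressed" and inhibited_by:
            name = alert.get("labels", {}).get("alertname", "")
            result.setdefault(name, []).extend(inhibited_by)
    return result


# Illustrative payload only; the fingerprints are made up.
sample = [
    {"labels": {"alertname": "NodeExporterTargetDown", "node": "worker-1"},
     "status": {"state": "suppressed", "silencedBy": [], "inhibitedBy": ["abc123"]}},
    {"labels": {"alertname": "NodeDown", "node": "worker-1"},
     "status": {"state": "active", "silencedBy": [], "inhibitedBy": []}},
]
print(inhibited_alerts(sample))
```

An alert that appears in the result corresponds to an entry in the Inhibits and rules column, and the listed fingerprints identify the firing alerts from the Alert column that suppress it.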