Alert dependencies

Using alert inhibition rules, Alertmanager reduces alert noise by suppressing notifications for dependent alerts, which provides a clearer view of the cloud status and simplifies troubleshooting. Alert inhibition rules are enabled by default.

The following tables describe the dependencies between the OpenStack-related and MOSK cluster alerts.

Once an alert from the Alert column fires, the alerts listed in the Inhibits and rules column are suppressed and displayed with the Inhibited status in the Alertmanager web UI.

The Inhibits and rules column also lists the labels and conditions, if any, that must match for the inhibition to apply.
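The "with the same ... labels" condition can be illustrated with a minimal Python sketch. It is not the rule set shipped with the product: the `InhibitRule` class, the `is_inhibited` helper, and the `binary` label values are hypothetical and only model one table row, here CinderServiceOutage inhibiting CinderServiceDown with the same binary label. Treating a missing label like an empty value follows the behavior documented for upstream Alertmanager.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InhibitRule:
    """Hypothetical model of one table row."""
    source: str        # alert from the "Alert" column
    target: str        # alert from the "Inhibits and rules" column
    equal: tuple = ()  # labels that must carry the same value on both alerts


def is_inhibited(target_alert: dict, firing: list, rule: InhibitRule) -> bool:
    """Return True if a firing source alert suppresses target_alert under rule."""
    if target_alert.get("alertname") != rule.target:
        return False
    for alert in firing:
        if alert.get("alertname") != rule.source:
            continue
        # A missing label is compared as an empty value (assumption taken
        # from the upstream Alertmanager documentation).
        if all(alert.get(name, "") == target_alert.get(name, "") for name in rule.equal):
            return True
    return False


rule = InhibitRule("CinderServiceOutage", "CinderServiceDown", equal=("binary",))

# Label values below are made up for illustration.
firing = [{"alertname": "CinderServiceOutage", "binary": "cinder-volume"}]
same_binary = {"alertname": "CinderServiceDown", "binary": "cinder-volume"}
other_binary = {"alertname": "CinderServiceDown", "binary": "cinder-backup"}

print(is_inhibited(same_binary, firing, rule))
print(is_inhibited(other_binary, firing, rule))
```

In this sketch, only the CinderServiceDown alert that shares the `binary` value with the firing CinderServiceOutage alert is reported as inhibited. The alert with a different `binary` value would still produce a notification.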

Alert inhibition rules for OpenStack clusters

Alert

Inhibits and rules

CassandraTombstonesTooManyCritical

CassandraTombstonesTooManyMajor with the same cassandra_cluster, namespace, and pod labels

CassandraTombstonesTooManyMajor

CassandraTombstonesTooManyWarning with the same cassandra_cluster, namespace, and pod labels

CinderServiceOutage

CinderServiceDown with the same binary label

KubeDaemonSetOutage

  • LibvirtExporterTargetsOutage

  • TungstenFabricControllerOutage

  • TungstenFabricControllerTargetsOutage

  • TungstenFabricVrouterOutage

  • TungstenFabricVrouterTargetsOutage

And other alerts described in Alert inhibition rules for MOSK clusters.

KubeDeploymentOutage

  • RabbitMQExporterTargetDown for the particular OpenStack service

  • RabbitMQOperatorTargetDown

  • TelegrafOpenstackTargetDown

And other alerts described in Alert inhibition rules for MOSK clusters.

KubeStatefulSetOutage

  • CassandraClusterTargetDown Since 23.3

  • CassandraClusterTargetsOutage Before 23.3

  • KafkaClusterTargetDown Since 23.3

  • KafkaClusterTargetsOutage Before 23.3

  • MariadbClusterDown

  • MariadbExporterTargetDown Since 23.3

  • MariadbExporterClusterTargetsOutage Before 23.3

  • MemcachedClusterDown

  • MemcachedExporterTargetDown Since 23.3

  • MemcachedExporterClusterTargetsOutage Before 23.3

  • OpenstackPowerDNSTargetDown Since 24.3

  • OpenstackPowerDNSProbeFailure Since 24.3

  • RabbitMQDown for the particular OpenStack service

  • ZooKeeperClusterTargetDown Since 23.3

  • ZooKeeperClusterTargetsOutage Before 23.3

And other alerts described in Alert inhibition rules for MOSK clusters.

LibvirtExporterTargetsOutage

LibvirtExporterTargetDown

MemcachedConnectionsNoneMajor

MemcachedConnectionsNoneWarning with the same namespace label

NeutronAgentOutage

NeutronAgentDown with the same binary label

NodeDown

  • Alerts with the same node label:

    • LibvirtExporterTargetDown

    • OpenstackPowerDNSTargetDown Since 24.3

    Since 23.3:

    • CassandraClusterTargetDown

    • KafkaClusterTargetDown

    • MariadbExporterTargetDown

    • MemcachedExporterTargetDown

    • OpenstackCloudproberTargetDown

    • RabbitMQOperatorTargetDown

    • RabbitMQExporterTargetDown

    • RedisClusterTargetDown

    • ZooKeeperClusterTargetDown

    And other alerts described in Alert inhibition rules for MOSK clusters.

NovaServiceOutage

NovaServiceDown with the same binary label

OpenstackPowerDNSProbeFailure Since 24.3

OpenstackPowerDNSQueryDurationHigh with the same target_name, target_type, and protocol

OpenstackSSLCertExpirationHigh

OpenstackSSLCertExpirationMedium with the same namespace and service_name labels

OsDplSSLCertExpirationHigh

OsDplSSLCertExpirationMedium with the same identifier label

TungstenFabricControllerOutage

TungstenFabricControllerDown

TungstenFabricVrouterOutage

TungstenFabricVrouterDown

TungstenFabricVrouterTargetsOutage

TungstenFabricVrouterTargetDown
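Rows such as NodeDown list many inhibited alerts for a single source alert. As a rough sketch of how such a row filters a set of firing alerts, the Python snippet below splits alerts into notified and inhibited groups. The `split_by_node_down` helper and the node names are hypothetical, and the target set is only a subset of the NodeDown row above.

```python
# Subset of the alerts listed for NodeDown in the table above.
NODE_DOWN_TARGETS = {
    "LibvirtExporterTargetDown",
    "MariadbExporterTargetDown",
    "RabbitMQExporterTargetDown",
}


def split_by_node_down(firing):
    """Split firing alerts into (notified, inhibited) using the NodeDown row.

    An alert is inhibited when its name is one of the NodeDown targets and
    a NodeDown alert is firing with the same value of the node label.
    """
    down_nodes = {a.get("node", "") for a in firing if a["alertname"] == "NodeDown"}
    notified, inhibited = [], []
    for alert in firing:
        hit = (
            alert["alertname"] in NODE_DOWN_TARGETS
            and alert.get("node", "") in down_nodes
        )
        (inhibited if hit else notified).append(alert)
    return notified, inhibited


# Node names are made up for illustration.
firing = [
    {"alertname": "NodeDown", "node": "cmp-1"},
    {"alertname": "LibvirtExporterTargetDown", "node": "cmp-1"},
    {"alertname": "MariadbExporterTargetDown", "node": "ctl-2"},
]
notified, inhibited = split_by_node_down(firing)
print([a["alertname"] for a in notified])
print([a["alertname"] for a in inhibited])
```

Here LibvirtExporterTargetDown on `cmp-1` is suppressed because NodeDown fires for the same node, while MariadbExporterTargetDown on `ctl-2` still notifies because no NodeDown alert exists for that node.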

Alert inhibition rules for MOSK clusters

Alert

Inhibits and rules

cAdvisorTargetsOutage

cAdvisorTargetDown

CalicoTargetsOutage

CalicoTargetDown

CephClusterFullCritical

CephClusterFullWarning

CephClusterHealthCritical

CephClusterHealthWarning

CephOSDDiskNotResponding

CephOSDDown with the same rook_cluster label Before MCC 2.25.0 (17.0.0 and 16.0.0)

CephOSDDiskUnavailable

CephOSDDown with the same rook_cluster label Before MCC 2.25.0 (17.0.0 and 16.0.0)

CephOSDNodeDown Since MCC 2.25.0 (17.0.0 and 16.0.0)

With the same node label:

  • CephOSDDiskNotResponding

  • CephOSDDiskUnavailable

CephOSDPgNumTooHighCritical

CephOSDPgNumTooHighWarning

DockerSwarmServiceReplicasFlapping

DockerSwarmServiceReplicasDown with the same service_id, service_mode, and service_name labels

DockerSwarmServiceReplicasOutage

DockerSwarmServiceReplicasDown with the same service_id, service_mode, and service_name labels

etcdDbSizeCritical

etcdDbSizeMajor with the same job and instance labels

etcdHighNumberOfFailedGRPCRequestsCritical

etcdHighNumberOfFailedGRPCRequestsWarning with the same grpc_method, grpc_service, job, and instance labels

ExternalEndpointDown

ExternalEndpointTCPFailure with the same instance and job labels

FileDescriptorUsageMajor

FileDescriptorUsageWarning with the same node label

FluentdTargetsOutage

FluentdTargetDown

KubeAPICertExpirationHigh

KubeAPICertExpirationMedium

KubeAPIErrorsHighMajor

KubeAPIErrorsHighWarning with the same instance label

KubeAPIOutage

KubeAPIDown

KubeAPIResourceErrorsHighMajor

KubeAPIResourceErrorsHighWarning with the same instance, resource, and subresource labels

KubeClientCertificateExpirationInOneDay Removed in MCC 2.28.0 (17.3.0 and 16.3.0)

KubeClientCertificateExpirationInSevenDays with the same instance label

KubeDaemonSetOutage

  • CalicoTargetsOutage

  • KubeDaemonSetRolloutStuck with the same daemonset and namespace labels

  • FluentdTargetsOutage

  • NodeExporterTargetsOutage

  • TelegrafSMARTTargetsOutage

KubeDeploymentOutage

  • KubeDeploymentReplicasMismatch with the same deployment and namespace labels

  • GrafanaTargetDown

  • KubeDNSTargetsOutage Removed in MCC 2.25.0 (17.0.0 and 16.0.0)

  • KubernetesMasterAPITargetsOutage

  • KubeStateMetricsTargetDown

  • PrometheusEsExporterTargetDown

  • PrometheusMsTeamsTargetDown

  • PrometheusRelayTargetDown

  • ServiceNowWebhookReceiverTargetDown

  • SfNotifierTargetDown

  • TelegrafDockerSwarmTargetDown

  • TelegrafOpenstackTargetDown

KubeJobFailed

KubePodsNotReady for created_by_kind=Job and with the same created_by_name label (removed in Container Cloud 2.25.0, Cluster releases 17.0.0 and 16.0.0)

KubeletTargetsOutage

KubeletTargetDown

KubePersistentVolumeUsageCritical

With the same namespace and persistentvolumeclaim labels:

  • KubePersistentVolumeFullInFourDays

  • OpenSearchStorageUsageCritical Since MCC 2.26.0 (17.1.0 and 16.1.0)

  • OpenSearchStorageUsageMajor Since MCC 2.26.0 (17.1.0 and 16.1.0)

KubePodsCrashLooping

KubePodsRegularLongTermRestarts with the same created_by_name, created_by_kind, and namespace labels

KubeStatefulSetOutage

  • Alerts with the same namespace and statefulset labels:

    • KubeStatefulSetUpdateNotRolledOut

    • KubeStatefulSetReplicasMismatch

  • AlertmanagerTargetDown Since MCC 2.25.0 (17.0.0 and 16.0.0)

  • AlertmanagerClusterTargetDown Before MCC 2.25.0 (17.0.0 and 16.0.0)

  • ElasticsearchExporterTargetDown

  • FluentdTargetsOutage

  • OpenSearchClusterStatusCritical

  • PostgresqlReplicaDown

  • PostgresqlTargetDown Since MCC 2.25.0 (17.0.0 and 16.0.0)

  • PostgresqlTargetsOutage Before MCC 2.25.0 (17.0.0 and 16.0.0)

  • PrometheusEsExporterTargetDown

  • PrometheusServerTargetDown Since MCC 2.25.0 (17.0.0 and 16.0.0)

  • PrometheusServerTargetsOutage Before MCC 2.25.0 (17.0.0 and 16.0.0)

MCCLicenseExpirationHigh

MCCLicenseExpirationMedium

MCCSSLCertExpirationHigh

MCCSSLCertExpirationMedium with the same namespace and service_name labels

MCCSSLProbesServiceTargetOutage

MCCSSLProbesEndpointTargetOutage with the same namespace and service_name labels

MKEAPICertExpirationHigh

MKEAPICertExpirationMedium

MKEAPIOutage

MKEAPIDown

MKEMetricsEngineTargetsOutage

MKEMetricsEngineTargetDown

MKENodeDiskFullCritical

MKENodeDiskFullWarning with the same node label

NodeDown

  • KubeDaemonSetMisScheduled for the following DaemonSets (removed in Container Cloud 2.27.0, Cluster releases 17.2.0 and 16.2.0):

    • cadvisor

    • csi-cephfsplugin

    • csi-cinder-nodeplugin

    • csi-rbdplugin

    • fluentd-logs

    • local-volume-provisioner

    • metallb-speaker

    • openstack-ccm

    • prometheus-libvirt-exporter

    • prometheus-node-exporter

    • rook-discover

    • telegraf-ds-smart

    • ucp-metrics

  • KubeDaemonSetRolloutStuck for the calico-node and ucp-nvidia-device-plugin DaemonSets

  • For resource=nodes:

    • KubeAPIResourceErrorsHighMajor

    • KubeAPIResourceErrorsHighWarning

  • Alerts with the same node label:

    • cAdvisorTargetDown

    • CalicoTargetDown

    • FluentdTargetDown

    • KubeletDown

    • KubeletTargetDown

    • KubeNodeNotReady

    • LibvirtExporterTargetDown

    • MKEMetricsEngineTargetDown

    • MKENodeDown

    • NodeExporterTargetDown

    • TelegrafSMARTTargetDown

    Since MCC 2.25.0 (Cluster releases 17.0.0 and 16.0.0):

    • AlertmanagerTargetDown

    • CephClusterTargetDown

    • etcdTargetDown

    • GrafanaTargetDown

    • HelmControllerTargetDown

    • KubeAPIDown

    • MCCCacheTargetDown

    • MCCControllerTargetDown

    • MCCProviderTargetDown

    • MKEAPIDown

    • PostgresqlTargetDown

    • PrometheusMsTeamsTargetDown

    • PrometheusRelayTargetDown

    • PrometheusServerTargetDown

    • ServiceNowWebhookReceiverTargetDown

    • SfNotifierTargetDown

    • TelegrafDockerSwarmTargetDown

    • TelemeterClientTargetDown

    • TelemeterServerFederationTargetDown

    • TelemeterServerTargetDown

NodeExporterTargetsOutage

NodeExporterTargetDown

OpenSearchClusterStatusCritical

  • OpenSearchClusterStatusWarning and OpenSearchNumberOfUnassignedShards (removed in Container Cloud 2.27.0, Cluster releases 17.2.0 and 16.2.0) with the same cluster label

  • For created_by_name=~"elasticsearch-curator-.":

    • KubeJobFailed

    • KubePodsNotReady (removed in Container Cloud 2.25.0, Cluster releases 17.0.0 and 16.0.0)

OpenSearchClusterStatusWarning Since MCC 2.26.0 (17.1.0 and 16.1.0)

OpenSearchNumberOfUnassignedShards with the same cluster label (removed in Container Cloud 2.27.0, Cluster releases 17.2.0 and 16.2.0)

OpenSearchHeapUsageCritical

OpenSearchHeapUsageWarning with the same cluster and name labels

OpenSearchStorageUsageCritical Since MCC 2.26.0 (17.1.0 and 16.1.0)

KubePersistentVolumeFullInFourDays and OpenSearchStorageUsageMajor with the same namespace and persistentvolumeclaim labels

OpenSearchStorageUsageMajor Since MCC 2.26.0 (17.1.0 and 16.1.0)

KubePersistentVolumeFullInFourDays with the same namespace and persistentvolumeclaim labels

PostgresqlPatroniClusterUnlocked

With the same cluster and namespace labels:

  • PostgresqlReplicationNonStreamingReplicas

  • PostgresqlReplicationPaused

PostgresqlReplicaDown

  • Alerts with the same cluster and namespace labels:

    • PostgresqlReplicationNonStreamingReplicas

    • PostgresqlReplicationPaused

    • PostgresqlReplicationSlowWalApplication

    • PostgresqlReplicationSlowWalDownload

    • PostgresqlReplicationWalArchiveWriteFailing

PrometheusErrorSendingAlertsMajor

PrometheusErrorSendingAlertsWarning with the same alertmanager and pod labels

SystemDiskFullMajor

SystemDiskFullWarning with the same device, mountpoint, and node labels

SystemDiskInodesFullMajor

SystemDiskInodesFullWarning with the same device, mountpoint, and node labels

SystemLoadTooHighCritical

SystemLoadTooHighWarning with the same node label

SystemMemoryFullMajor

SystemMemoryFullWarning with the same node label

SSLCertExpirationHigh

SSLCertExpirationMedium with the same instance label

TelegrafSMARTTargetsOutage

TelegrafSMARTTargetDown

TelemeterServerTargetDown

TelemeterServerFederationTargetDown
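Some rows combine a condition on the inhibited alert with a shared label, for example KubeJobFailed inhibiting KubePodsNotReady for created_by_kind=Job with the same created_by_name label. A minimal Python sketch of that combination follows. The `inhibits` helper and the `backup-job` value are hypothetical. The parameter names loosely mirror the `source_match`, `target_match`, and `equal` fields of an Alertmanager inhibition rule, but the snippet does not reproduce the shipped configuration.

```python
def inhibits(source, target, source_match, target_match, equal):
    """Check one source/target alert pair against an inhibition rule.

    source_match / target_match: label values each alert must carry.
    equal: labels whose values must agree between the two alerts.
    """
    def matches(alert, wanted):
        return all(alert.get(key, "") == value for key, value in wanted.items())

    return (
        matches(source, source_match)
        and matches(target, target_match)
        and all(source.get(name, "") == target.get(name, "") for name in equal)
    )


rule = dict(
    source_match={"alertname": "KubeJobFailed"},
    target_match={"alertname": "KubePodsNotReady", "created_by_kind": "Job"},
    equal=["created_by_name"],
)

# Label values below are made up for illustration.
job_failed = {"alertname": "KubeJobFailed", "created_by_name": "backup-job"}
job_pod = {
    "alertname": "KubePodsNotReady",
    "created_by_kind": "Job",
    "created_by_name": "backup-job",
}
sts_pod = {
    "alertname": "KubePodsNotReady",
    "created_by_kind": "StatefulSet",
    "created_by_name": "backup-job",
}

print(inhibits(job_failed, job_pod, **rule))
print(inhibits(job_failed, sts_pod, **rule))
```

The pod alert created by a Job is suppressed, while the alert for a pod owned by a StatefulSet is not, even though both carry the same created_by_name value.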