Mirantis Kubernetes Engine

This section describes the alerts for the Mirantis Kubernetes Engine (MKE) cluster, including the Docker Swarm service.

For troubleshooting guidelines, see Troubleshoot Mirantis Kubernetes Engine alerts.


DockerSwarmNetworkUnhealthy

Severity

Warning

Summary

Docker Swarm network unhealthy.

Description

The qLen size and NetMsg showed unexpected output for the last 10 minutes. Verify the NetworkDb Stats output for the qLen size and NetMsg using journalctl -u docker.

Note

For the DockerSwarmNetworkUnhealthy alert, StackLight collects metrics from logs. Therefore, this alert is available only if logging is enabled.
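When checking the Docker journal by hand, a small script can help pull the qLen and NetMsg values out of the log lines. The following Python sketch is an illustration only: the sample log line format, the regular expression, and the parse_networkdb_stats helper are assumptions, and the actual NetworkDb Stats output may differ between Docker versions.

```python
import re

# Assumed shape of a NetworkDb Stats log line; verify it against your own journal output.
STATS_RE = re.compile(
    r"NetworkDB stats .*?netID:(?P<net_id>\S+).*?"
    r"qLen:(?P<qlen>\d+)\s+netMsg/s:(?P<net_msg>\d+)"
)


def parse_networkdb_stats(line):
    """Return (net_id, qlen, net_msg_per_sec), or None if the line does not match."""
    match = STATS_RE.search(line)
    if match is None:
        return None
    return match.group("net_id"), int(match.group("qlen")), int(match.group("net_msg"))


# Hypothetical sample line, used only to exercise the parser.
sample = (
    'level=info msg="NetworkDB stats node1(abc123) - netID:xyz789 '
    'leaving:false netPeers:3 entries:12 Queue qLen:0 netMsg/s:1"'
)
print(parse_networkdb_stats(sample))
```

Lines saved from the Docker journal can be passed to parse_networkdb_stats one at a time; lines that are not NetworkDb Stats entries return None and can be skipped.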

DockerSwarmNodeFlapping

Severity

Major

Summary

Docker Swarm node is flapping.

Description

The {{ $labels.node_name }} Docker Swarm node (ID: {{ $labels.node_id }}) state flapped more than 3 times for the last 10 minutes.
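To make the flapping condition concrete, the Python sketch below counts state transitions inside a 10-minute window and compares the count against the threshold of 3 from the description. It does not reproduce the alert expression that StackLight uses, which is not shown here; the function names and the sample states are hypothetical.

```python
from datetime import datetime, timedelta


def count_state_changes(samples, now, window=timedelta(minutes=10)):
    """Count transitions between consecutive states observed inside the window.

    samples: list of (timestamp, state) tuples sorted by timestamp.
    """
    recent = [state for ts, state in samples if now - window <= ts <= now]
    return sum(1 for prev, cur in zip(recent, recent[1:]) if prev != cur)


def is_flapping(samples, now, threshold=3):
    # "More than 3 times" in the description, hence a strict comparison.
    return count_state_changes(samples, now) > threshold


# Hypothetical observations: the node alternates between ready and down.
now = datetime(2024, 1, 1, 12, 0)
states = ["ready", "down", "ready", "down", "ready"]
samples = [(now - timedelta(minutes=9 - i), s) for i, s in enumerate(states)]
print(is_flapping(samples, now))
```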

DockerSwarmServiceReplicasDown

Severity

Major

Summary

Docker Swarm replica is down.

Description

The {{ $labels.service_name }} Docker Swarm {{ $labels.service_mode }} service replica is down for 5 minutes.

DockerSwarmServiceReplicasFlapping

Severity

Major

Summary

Docker Swarm service is flapping.

Description

The {{ $labels.service_name }} Docker Swarm {{ $labels.service_mode }} service replica is flapping for 10 minutes.

DockerSwarmServiceReplicasOutage

Severity

Critical

Summary

Docker Swarm service outage.

Description

All {{ $labels.service_name }} Docker Swarm {{ $labels.service_mode }} service replicas are down for 2 minutes.

MKEAPICertExpirationHigh

Severity

Critical

Summary

MKE API certificate expires on {{ $value | humanizeTimestamp }}

Description

The SSL certificate for MKE API expires on {{ $value | humanizeTimestamp }}, less than 10 days are left.

MKEAPICertExpirationMedium

Severity

Major

Summary

MKE API certificate expires on {{ $value | humanizeTimestamp }}

Description

The SSL certificate for MKE API expires on {{ $value | humanizeTimestamp }}, less than 30 days are left.
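The two certificate alerts differ only in the remaining-time threshold. The following Python sketch shows how the 10-day and 30-day limits from the descriptions map to severities. The cert_expiration_severity function is a hypothetical helper for illustration and is not part of StackLight; the lowercase severity strings are likewise an assumption.

```python
from datetime import datetime, timedelta, timezone


def cert_expiration_severity(expires_at, now):
    """Map the time left before expiry to the severity named in the alerts above."""
    remaining = expires_at - now
    if remaining < timedelta(days=10):
        return "critical"  # MKEAPICertExpirationHigh
    if remaining < timedelta(days=30):
        return "major"  # MKEAPICertExpirationMedium
    return None  # neither alert applies


# Hypothetical dates, chosen only to exercise each branch.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
for days in (5, 20, 45):
    print(days, cert_expiration_severity(now + timedelta(days=days), now))
```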

MKEAPIDown

Severity

Critical

Summary

MKE API endpoint is down.

Description

The MKE API endpoint on the {{ $labels.node }} node is not accessible for the last 3 minutes.

MKEAPIOutage

Severity

Critical

Summary

MKE API is down.

Description

The MKE API (port 443) is not accessible for the last 1 minute.

MKEContainersUnhealthy

Severity

Major

Summary

MKE containers are Unhealthy.

Description

{{ $value }} MKE {{ $labels.name }} containers are Unhealthy.

MKEManagerAPITargetsOutage

Severity

Critical

Summary

MKE manager API cluster Prometheus targets outage.

Description

Prometheus fails to scrape metrics from 2/3 of MKE manager API nodes.

MKEMetricsControllerTargetsOutage

Severity

Critical

Summary

MKE metrics controller Prometheus targets outage.

Description

Prometheus fails to scrape metrics from all MKE metrics controller endpoints.

MKEMetricsEngineTargetDown

Severity

Major

Summary

MKE metrics engine Prometheus target is down.

Description

Prometheus fails to scrape metrics from the MKE metrics engine on the {{ $labels.node }} node.

MKEMetricsEngineTargetsOutage

Severity

Critical

Summary

MKE metrics engine Prometheus targets outage.

Description

Prometheus fails to scrape metrics from the MKE metrics engine on all nodes.

MKENodeDiskFullCritical

Severity

Critical

Summary

MKE node disk is 95% full.

Description

The {{ $labels.node }} MKE node disk is 95% full.

MKENodeDiskFullWarning

Severity

Warning

Summary

MKE node disk is 85% full.

Description

The {{ $labels.node }} MKE node disk is 85% full.
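For a quick local check against the 85% and 95% levels named in the two disk alerts, the Python sketch below reads file system usage with the standard shutil.disk_usage call. Which file system the alerts actually monitor, and whether the comparison is inclusive, are not stated here, so the path argument and the use of >= are assumptions; the helper names are hypothetical.

```python
import shutil


def used_percent(path):
    """Return the used share of the file system that holds path, in percent."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def disk_severity(percent):
    # Thresholds taken from the alert descriptions; inclusive comparison is an assumption.
    if percent >= 95:
        return "critical"  # MKENodeDiskFullCritical
    if percent >= 85:
        return "warning"  # MKENodeDiskFullWarning
    return None


current = used_percent(".")
print(f"{current:.1f}% used, severity: {disk_severity(current)}")
```

The percentage computed this way can differ slightly from the figure that df reports, because df excludes blocks reserved for the root user from its calculation.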

MKENodeDown

Severity

Critical

Summary

MKE node is down.

Description

The {{ $labels.node }} MKE node is down.