Docker

This section describes the alerts for the Docker service.


DockerdProcessDown

Severity

Minor

Summary

The dockerd process on the {{ $labels.host }} node is down.

Raise condition

procstat_running{process_name="dockerd"} == 0

Description

Raises when Telegraf cannot find running dockerd processes on a host. The host label in the raised alert contains the name of the affected node.

Troubleshooting

  • Verify Docker status by running systemctl status docker on the affected node.

  • Inspect Docker logs using journalctl -u docker.

Tuning

Not required

DockerServiceOutage

Severity

Critical

Summary

All dockerd processes within the {{ $labels.cluster }} cluster are down.

Raise condition

count(label_replace(procstat_running{process_name="dockerd"}, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster) == count(label_replace(procstat_running{process_name="dockerd"} == 0, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster)

Description

Raises when Telegraf cannot find running dockerd processes on any host of a cluster. The cluster label in the raised alert identifies a set of nodes that share the same host name prefix, for example, mon or cid.
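The way the raise condition derives the cluster label and compares the two counts can be sketched in Python. This is an illustration only, not the StackLight implementation: the host names and sample values are hypothetical, and the helper names are invented for this example. Prometheus anchors the label_replace pattern at both ends, which re.fullmatch reproduces here, and a host that does not match keeps an empty cluster label.

```python
import re
from collections import defaultdict

# Same pattern as in the raise condition.
CLUSTER_PATTERN = re.compile(r"([^0-9]+).+")


def cluster_of(host):
    """Return the non-digit host name prefix, or "" if the host does not match."""
    match = CLUSTER_PATTERN.fullmatch(host)
    return match.group(1) if match else ""


def clusters_in_outage(running_by_host):
    """Return clusters where the host count equals the count of hosts reporting 0."""
    total = defaultdict(int)
    down = defaultdict(int)
    for host, running in running_by_host.items():
        cluster = cluster_of(host)
        total[cluster] += 1
        if running == 0:
            down[cluster] += 1
    return sorted(c for c in total if total[c] == down[c])


# Hypothetical procstat_running{process_name="dockerd"} samples per host.
samples = {"mon01": 0, "mon02": 0, "mon03": 0, "cid01": 1, "cid02": 0}
print(clusters_in_outage(samples))
```

With these sample values, only the mon cluster qualifies, because cid01 still reports a running dockerd process.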

Troubleshooting

  • Inspect the DockerdProcessDown alerts for the host names of the affected nodes.

  • Verify the Docker service status on the affected nodes using service docker status.

  • Inspect Docker logs using journalctl -u docker.

Tuning

Not required

DockerService{{ camel_case_name }}ReplicasDownMinor

Severity

Minor

Summary

More than 30% of the Docker Swarm {{ full_service_name }} service replicas are down for 2 minutes.

Raise condition

  • In 2019.2.8 and prior: {{ service.deploy.replicas }} - min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) >= {{ service.deploy.replicas }} * {{ monitoring.replicas_failed_warning_threshold_percent }}

  • In 2019.2.9 and newer: min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) / min(docker_swarm_tasks_desired{{ '{' + label_selector + '}' }}) <= {{ 1 - monitoring.replicas_failed_warning_threshold_percent }}

Description

A generated set of alerts for each Docker Swarm service. Applies only to replicated Docker Swarm services.

Raises when more than 30% of the service replicas are unavailable. The service_name label in the raised alert contains the Docker service name.
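As a rough illustration of the 2019.2.9 and newer condition, the following Python sketch evaluates the running-to-desired ratio against the threshold. It assumes that replicas_failed_warning_threshold_percent is expressed as a fraction and equals 0.3, which is inferred from the 30% in the summary rather than stated in this section. The function name and the replica counts are hypothetical.

```python
def replicas_down(running, desired, threshold):
    """Evaluate running / desired <= 1 - threshold for already aggregated values.

    Mirrors PromQL arithmetic: dividing by zero yields NaN or +Inf, and
    neither satisfies the comparison, so no alert is raised in that case.
    """
    if desired == 0:
        return False
    return running / desired <= 1 - threshold


WARNING_THRESHOLD = 0.3  # assumed value, inferred from "More than 30%"

print(replicas_down(running=6, desired=10, threshold=WARNING_THRESHOLD))  # True
print(replicas_down(running=8, desired=10, threshold=WARNING_THRESHOLD))  # False
```

With 6 of 10 replicas running, the ratio is 0.6, which is below the 0.7 limit, so the condition holds. With 8 of 10 running, it does not.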

Troubleshooting

Run docker service ps <service_name> on any node of the affected cluster to verify the Docker service.

Tuning

Not required

DockerService{{ camel_case_name }}ReplicasDownMajor

Severity

Major

Summary

More than 60% of the Docker Swarm {{ full_service_name }} service replicas are down for 2 minutes.

Raise condition

  • In 2019.2.8 and prior: {{ service.deploy.replicas }} - min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) >= {{ service.deploy.replicas }} * {{ monitoring.replicas_failed_critical_threshold_percent }}

  • In 2019.2.9 and newer: min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) / min(docker_swarm_tasks_desired{{ '{' + label_selector + '}' }}) <= {{ 1 - monitoring.replicas_failed_critical_threshold_percent }}

Description

A generated set of alerts for each Docker Swarm service. Applies only to replicated Docker Swarm services.

Raises when more than 60% of the service replicas are unavailable. The service_name label in the raised alert contains the Docker service name.
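The 2019.2.8 and prior condition compares the number of missing replicas with a share of the configured replica count. A minimal Python sketch of that arithmetic follows. It assumes that replicas_failed_critical_threshold_percent is a fraction equal to 0.6, which is inferred from the 60% in the summary. The function name and the replica counts are hypothetical.

```python
def replicas_down_by_count(configured_replicas, running, threshold):
    """Evaluate replicas - running >= replicas * threshold."""
    missing = configured_replicas - running
    return missing >= configured_replicas * threshold


CRITICAL_THRESHOLD = 0.6  # assumed value, inferred from "More than 60%"

# 10 configured replicas, 3 running: 7 missing against a limit of 6.
print(replicas_down_by_count(10, 3, CRITICAL_THRESHOLD))  # True
# 10 configured replicas, 5 running: 5 missing against a limit of 6.
print(replicas_down_by_count(10, 5, CRITICAL_THRESHOLD))  # False
```

Unlike the newer condition, this form takes the replica count from the service definition rather than from the docker_swarm_tasks_desired metric.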

Troubleshooting

Run docker service ps <service_name> on any node of the affected cluster to verify the Docker service.

Tuning

Not required

DockerService{{ camel_case_name }}Outage

Severity

Critical

Summary

All Docker Swarm {{ full_service_name }} replicas are down for 2 minutes.

Raise condition

  • In 2019.2.5 and prior: docker_swarm_tasks_running{{ '{' + label_selector + '}' }} == 0 or absent (docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) == 1

  • In 2019.2.6-2019.2.8: docker_swarm_tasks_desired{{ '{' + label_selector + '}' }} > 0 and (docker_swarm_tasks_running{{ '{' + label_selector + '}' }} == 0 or absent (docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) == 1)

  • In 2019.2.9 and newer: min(docker_swarm_tasks_desired{{ '{' + label_selector + '}' }}) > 0 and (min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) == 0 or absent(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) == 1)

Description

A generated set of alerts for each Docker Swarm service. Applies only to replicated Docker Swarm services.

Raises when the cluster has no available service replicas. The service_name label in the raised alert contains the Docker service name.
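The boolean logic that the 2019.2.6 and newer conditions express can be sketched in Python as follows. This is a simplified reading that ignores PromQL label matching between the operands. None stands for a series that returns no samples, and the function name is hypothetical.

```python
from typing import Optional


def service_outage(desired: Optional[int], running: Optional[int]) -> bool:
    """Return True when replicas are desired but none are running or reported."""
    # No desired tasks (or no desired metric at all): nothing is expected to run.
    if desired is None or desired <= 0:
        return False
    # running is None models absent(docker_swarm_tasks_running) == 1.
    return running is None or running == 0


print(service_outage(desired=3, running=0))     # True
print(service_outage(desired=3, running=None))  # True
print(service_outage(desired=0, running=0))     # False
print(service_outage(desired=3, running=1))     # False
```

The third case shows the difference from the 2019.2.5 and prior condition, which does not check the desired task count.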

Troubleshooting

Run docker service ps <service_name> on any node of the affected cluster to verify the Docker service.

Tuning

Not required

DockerdServiceReplicaFlapping

Available starting from the 2019.2.6 maintenance update

Severity

Critical

Summary

The Docker Swarm {{ $labels.service_name }} service replica is flapping for 15 minutes.

Raise condition

sum(changes(docker_swarm_tasks_running[10m])) by (service_name) > 0

Description

Raises when a service container in the Docker Swarm cluster fails to start properly and stops shortly after starting, which means that the service may be unavailable. The service outage alert may not fire in this case because the container is constantly being restarted.
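To illustrate what the raise condition measures, the following Python sketch counts value changes in a window of samples, as the PromQL changes() function does, and sums them per service name. The service names, series identifiers, and sample values are hypothetical, and the helper names are invented for this example.

```python
def changes(samples):
    """Count how many times a value differs from the sample before it."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def flapping_services(window_by_series):
    """Sum changes per service_name and keep services with a positive total."""
    totals = {}
    for (service_name, _series_id), samples in window_by_series.items():
        totals[service_name] = totals.get(service_name, 0) + changes(samples)
    return sorted(name for name, total in totals.items() if total > 0)


# Hypothetical docker_swarm_tasks_running samples within a 10-minute window.
window = {
    ("monitoring_server", "series-a"): [1, 0, 1, 0, 1],
    ("dashboard", "series-a"): [2, 2, 2, 2],
}
print(flapping_services(window))
```

In this sample window, only monitoring_server is reported, because its running task count changes four times while the dashboard count stays constant.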

Troubleshooting

Inspect the failed container logs using docker logs <container_id>.

Tuning

Not required