Docker

This section describes the alerts for the Docker service.


DockerdProcessDown

Severity Minor
Summary The dockerd process on the {{ $labels.host }} node is down.
Raise condition procstat_running{process_name="dockerd"} == 0
Description Raises when Telegraf cannot find running dockerd processes on a host. The host label in the raised alert contains the name of the affected node.
Troubleshooting
  • Verify Docker status by running systemctl status docker on the affected node.
  • Inspect Docker logs using journalctl -u docker.
Tuning Not required

DockerServiceOutage

Severity Critical
Summary All dockerd processes within the {{ $labels.cluster }} cluster are down.
Raise condition count(label_replace(procstat_running{process_name="dockerd"}, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster) == count(label_replace(procstat_running{process_name="dockerd"} == 0, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster)
Description Raises when Telegraf cannot find a running dockerd process on any host of a cluster. The cluster label in the raised alert identifies a set of nodes that share the same host name prefix, for example, mon or cid.
Troubleshooting
  • Inspect the DockerdProcessDown alerts for the host names of the affected nodes.
  • Verify the Docker service status using service docker status.
  • Inspect Docker logs using journalctl -u docker.
Tuning Not required
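
The raise condition above derives the cluster label from the host label with label_replace, then fires when the number of dockerd series that report zero equals the total number of dockerd series in that cluster. The following is a minimal Python sketch of that logic, not the monitoring stack's implementation. It assumes fully anchored regex matching, as Prometheus applies it. The host names and the helper names cluster_of and clusters_in_outage are hypothetical.

```python
import re
from collections import defaultdict

# Same pattern as in the raise condition. Prometheus anchors regexes,
# so re.fullmatch is used here to mimic that behavior.
CLUSTER_RE = re.compile(r"([^0-9]+).+")


def cluster_of(host):
    """Return the non-numeric host name prefix, or "" if the pattern does not match."""
    match = CLUSTER_RE.fullmatch(host)
    return match.group(1) if match else ""


def clusters_in_outage(procstat_running):
    """Return clusters where every dockerd series reports 0 running processes."""
    total = defaultdict(int)
    down = defaultdict(int)
    for host, running in procstat_running.items():
        cluster = cluster_of(host)
        total[cluster] += 1
        if running == 0:
            down[cluster] += 1
    return sorted(c for c in total if total[c] == down[c])


# Hypothetical sample values of procstat_running{process_name="dockerd"}.
samples = {"mon01": 0, "mon02": 0, "mon03": 0, "cid01": 1, "cid02": 0}
print(clusters_in_outage(samples))  # ['mon']
```

In this sample, cid02 would raise DockerdProcessDown only, because cid01 still reports a running dockerd process.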

DockerService {{ camel_case_name }} ReplicasDownMinor

Severity Minor
Summary More than 30% of the Docker Swarm {{ full_service_name }} service replicas are down for 2 minutes.
Raise condition
  • In 2019.2.8 and prior: {{ service.deploy.replicas }} - min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) >= {{ service.deploy.replicas }} * {{ monitoring.replicas_failed_warning_threshold_percent }}
  • In 2019.2.9 and newer: min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) / min(docker_swarm_tasks_desired{{ '{' + label_selector + '}' }}) <= {{ 1 - monitoring.replicas_failed_warning_threshold_percent }}
Description

This alert is generated for each Docker Swarm service and applies only to replicated Docker Swarm services.

Raises when more than 30% of the service replicas in the cluster are unavailable. The service_name label in the raised alert contains the Docker service name.

Troubleshooting Run docker service ps <service_name> on any node of the affected cluster to list the service tasks and their current state.
Tuning Not required
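
The two variants of the raise condition express the same threshold differently. The older one compares the number of missing replicas with a share of the configured replica count. The newer one compares the ratio of running to desired tasks with one minus the threshold. The sketch below models both in Python under stated assumptions: the threshold value 0.3 is inferred from the "More than 30%" summary, the function names are hypothetical, and the desired count is assumed to be greater than zero.

```python
def fires_2019_2_8_and_prior(replicas, min_running, threshold):
    """Configured replicas minus running tasks reaches the allowed share."""
    return replicas - min_running >= replicas * threshold


def fires_2019_2_9_and_newer(min_running, min_desired, threshold):
    """Ratio of running to desired tasks drops to 1 - threshold or below.

    Assumes min_desired > 0.
    """
    return min_running / min_desired <= 1 - threshold


# Assumed value of replicas_failed_warning_threshold_percent.
THRESHOLD = 0.3

# 10 replicas, 6 running: 40% are down, both variants fire.
print(fires_2019_2_8_and_prior(10, 6, THRESHOLD))  # True
print(fires_2019_2_9_and_newer(6, 10, THRESHOLD))  # True

# 10 replicas, 8 running: 20% are down, neither variant fires.
print(fires_2019_2_8_and_prior(10, 8, THRESHOLD))  # False
print(fires_2019_2_9_and_newer(8, 10, THRESHOLD))  # False
```

The newer variant reads the desired count from the docker_swarm_tasks_desired metric instead of the templated replica count.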

DockerService {{ camel_case_name }} ReplicasDownMajor

Severity Major
Summary More than 60% of the Docker Swarm {{ full_service_name }} service replicas are down for 2 minutes.
Raise condition
  • In 2019.2.8 and prior: {{ service.deploy.replicas }} - min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) >= {{ service.deploy.replicas }} * {{ monitoring.replicas_failed_critical_threshold_percent }}
  • In 2019.2.9 and newer: min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) / min(docker_swarm_tasks_desired{{ '{' + label_selector + '}' }}) <= {{ 1 - monitoring.replicas_failed_critical_threshold_percent }}
Description

This alert is generated for each Docker Swarm service and applies only to replicated Docker Swarm services.

Raises when more than 60% of the service replicas in the cluster are unavailable. The service_name label in the raised alert contains the Docker service name.

Troubleshooting Run docker service ps <service_name> on any node of the affected cluster to list the service tasks and their current state.
Tuning Not required

DockerService {{ camel_case_name }} Outage

Severity Critical
Summary All Docker Swarm {{ full_service_name }} replicas are down for 2 minutes.
Raise condition
  • In 2019.2.5 and prior: docker_swarm_tasks_running{{ '{' + label_selector + '}' }} == 0 or absent(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) == 1
  • In 2019.2.6-2019.2.8: docker_swarm_tasks_desired{{ '{' + label_selector + '}' }} > 0 and (docker_swarm_tasks_running{{ '{' + label_selector + '}' }} == 0 or absent(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) == 1)
  • In 2019.2.9 and newer: min(docker_swarm_tasks_desired{{ '{' + label_selector + '}' }}) > 0 and (min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) == 0 or absent(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) == 1)
Description

This alert is generated for each Docker Swarm service and applies only to replicated Docker Swarm services.

Raises when the cluster has no available service replicas. The service_name label in the raised alert contains the Docker service name.

Troubleshooting Run docker service ps <service_name> on any node of the affected cluster to list the service tasks and their current state.
Tuning Not required
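
The raise conditions above differ mainly in whether they require a positive desired task count. A simplified scalar model in Python may make the difference easier to see. It is a sketch only: it ignores PromQL label matching and the min() aggregation, uses None to stand for an absent metric, and the function names are hypothetical.

```python
from typing import Optional


def outage_2019_2_5_and_prior(running: Optional[int]) -> bool:
    """Fire when no tasks run or the running metric is absent (None)."""
    return running is None or running == 0


def outage_2019_2_6_and_newer(desired: Optional[int], running: Optional[int]) -> bool:
    """Additionally require that at least one task is desired."""
    desired_positive = desired is not None and desired > 0
    return desired_positive and (running is None or running == 0)


# A service with 3 desired tasks and none running fires in both variants.
print(outage_2019_2_5_and_prior(0))        # True
print(outage_2019_2_6_and_newer(3, 0))     # True

# A service scaled down to zero fires only in the older variant.
print(outage_2019_2_5_and_prior(0))        # True
print(outage_2019_2_6_and_newer(0, 0))     # False
```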

DockerdServiceReplicaFlapping

Available starting from the 2019.2.6 maintenance update

Severity Critical
Summary The Docker Swarm {{ $labels.service_name }} service replica is flapping for 15 minutes.
Raise condition sum(changes(docker_swarm_tasks_running[10m])) by (service_name) > 0
Description Raises when a service container in the Docker Swarm cluster fails to start properly and stops shortly after starting, which means that the service may be unavailable. The service unavailability alert may still not fire in this case because the container is constantly being restarted.
Troubleshooting Inspect the failed container logs using docker logs <container_id>.
Tuning Not required
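
The raise condition above relies on the PromQL changes() function, which counts how many times a series value changed within the given range, and sums that count per service_name. A minimal Python sketch of the same idea follows. The service names, host names, sample values, and the helper name flapping_services are hypothetical.

```python
def changes(samples):
    """Count how many times consecutive samples differ, like PromQL changes()."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def flapping_services(window):
    """Sum changes over all series of a service; report services with a sum > 0."""
    totals = {}
    for (service_name, _host), samples in window.items():
        totals[service_name] = totals.get(service_name, 0) + changes(samples)
    return sorted(name for name, total in totals.items() if total > 0)


# Hypothetical samples of docker_swarm_tasks_running over a 10-minute window.
window = {
    ("monitoring_alerta", "mon01"): [1, 0, 1, 0, 1],
    ("monitoring_server", "mon01"): [2, 2, 2, 2, 2],
}
print(flapping_services(window))  # ['monitoring_alerta']
```

A service whose running task count stays constant over the window, even at zero, produces no changes and therefore does not match this condition.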