This section describes the alerts for the Docker service.
Severity | Minor |
---|---|
Summary | The dockerd process on the {{ $labels.host }} node is down. |
Raise condition | procstat_running{process_name="dockerd"} == 0 |
Description | Raises when Telegraf cannot find running dockerd processes on a host. The host label in the raised alert contains the name of the affected node. |
Troubleshooting |
|
Tuning | Not required |
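
The raise condition above is a plain PromQL comparison filter: only series whose value equals 0 survive, and each surviving series raises its own alert carrying its host label. A minimal Python sketch of that behavior follows. The function name `dockerd_down_alerts` and the host names are hypothetical and used only for illustration.

```python
def dockerd_down_alerts(samples):
    """Return one summary line per host whose dockerd process count is 0.

    `samples` is a list of (labels, value) pairs that stand in for the
    procstat_running series. The data layout is an assumption of this sketch.
    """
    alerts = []
    for labels, value in samples:
        # Mirrors procstat_running{process_name="dockerd"} == 0
        if labels.get("process_name") == "dockerd" and value == 0:
            alerts.append(
                f"The dockerd process on the {labels['host']} node is down."
            )
    return alerts


# Hypothetical sample data: cmp01 has no running dockerd process.
samples = [
    ({"host": "cmp01", "process_name": "dockerd"}, 0),
    ({"host": "cmp02", "process_name": "dockerd"}, 1),
    ({"host": "cmp03", "process_name": "telegraf"}, 0),
]
print(dockerd_down_alerts(samples))
```
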
Severity | Critical |
---|---|
Summary | All dockerd processes within the {{ $labels.cluster }} cluster are down. |
Raise condition | count(label_replace(procstat_running{process_name="dockerd"}, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster) == count(label_replace(procstat_running{process_name="dockerd"} == 0, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster) |
Description | Raises when Telegraf cannot find running dockerd processes on any host of a cluster. The cluster label in the raised alert identifies a set of nodes with the same host name prefix, for example, mon or cid. |
Troubleshooting |
|
Tuning | Not required |
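
The raise condition above derives a cluster label from the host label and then compares, per cluster, the number of dockerd series with the number of those series that equal 0. The following Python sketch illustrates that logic. The function names and host names are hypothetical. It uses `re.fullmatch` because Prometheus anchors regular expressions at both ends, and it falls back to an empty cluster name when the pattern does not match, which approximates how a missing label is grouped.

```python
import re
from collections import Counter

# Same pattern as in the raise condition: a non-digit prefix, then the rest.
CLUSTER_RE = re.compile(r"([^0-9]+).+")


def cluster_of(host):
    """Derive the cluster name from a host name, for example mon01 -> mon."""
    match = CLUSTER_RE.fullmatch(host)
    return match.group(1) if match else ""


def clusters_fully_down(procstat_running):
    """Return clusters in which every host reports 0 running dockerd processes.

    `procstat_running` maps host name to the sampled value (an assumed layout).
    """
    total = Counter()
    down = Counter()
    for host, running in procstat_running.items():
        cluster = cluster_of(host)
        total[cluster] += 1
        if running == 0:
            down[cluster] += 1
    # The alert condition holds when both counts are equal for a cluster.
    return sorted(c for c in down if down[c] == total[c])


# Hypothetical sample: all mon nodes are down, one cid node is still running.
sample = {"mon01": 0, "mon02": 0, "mon03": 0, "cid01": 0, "cid02": 1}
print(clusters_fully_down(sample))
```
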
Severity | Minor |
---|---|
Summary | More than 30% of the Docker Swarm {{ full_service_name }} service replicas are down for 2 minutes. |
Raise condition |
|
Description | A generated set of alerts for each Docker Swarm service. Applicable only to replicated Docker Swarm services. Raises when more than 30% of the service replicas are unavailable. |
Troubleshooting | Run docker service ps <service_name> on any node of the affected cluster to verify the Docker service. |
Tuning | Not required |
Severity | Major |
---|---|
Summary | More than 60% of the Docker Swarm {{ full_service_name }} service replicas are down for 2 minutes. |
Raise condition |
|
Description | A generated set of alerts for each Docker Swarm service. Applicable only to replicated Docker Swarm services. Raises when more than 60% of the service replicas are unavailable. |
Troubleshooting | Run docker service ps <service_name> on any node of the affected cluster to verify the Docker service. |
Tuning | Not required |
Severity | Critical |
---|---|
Summary | All Docker Swarm {{ full_service_name }} replicas are down for 2 minutes. |
Raise condition |
|
Description | A generated set of alerts for each Docker Swarm service. Applicable only to replicated Docker Swarm services. Raises when the cluster has no available service replicas. |
Troubleshooting | Run docker service ps <service_name> on any node of the affected cluster to verify the Docker service. |
Tuning | Not required |
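
The three replica alerts above differ only in their threshold. The sketch below shows how the described conditions relate to each other, assuming that the unavailable share is computed from the desired and running replica counts. The source does not list the raise conditions, so this calculation is an assumption. The function name `replica_alerts` is hypothetical, and the 2-minute duration is not modeled.

```python
def replica_alerts(desired, running):
    """Return the severities whose described condition holds for one service.

    Assumes the unavailable share is (desired - running) / desired. Integer
    arithmetic is used so that the 30% and 60% boundaries are exact.
    """
    if desired <= 0:
        return []  # Nothing to evaluate for a service without replicas.
    unavailable = desired - running
    firing = []
    if 100 * unavailable > 30 * desired:
        firing.append("Minor")
    if 100 * unavailable > 60 * desired:
        firing.append("Major")
    if running == 0:
        firing.append("Critical")
    return firing


# With 10 desired replicas: 6 running is Minor, 3 running is also Major.
print(replica_alerts(10, 6))
print(replica_alerts(10, 3))
```

Exactly 30% of unavailable replicas (for example, 7 of 10 running) does not match the Minor condition in this sketch, because the description states "more than 30%".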
Available starting from the 2019.2.6 maintenance update
Severity | Critical |
---|---|
Summary | The Docker Swarm {{ $labels.service_name }} service replica is flapping for 15 minutes. |
Raise condition | sum(changes(docker_swarm_tasks_running[10m])) by (service_name) > 0 |
Description | Raises when a service container cannot start properly in the Docker Swarm cluster and stops shortly after starting, which means that the service may be unavailable. The service unavailability alerts may not fire in this case because the container is constantly being restarted. |
Troubleshooting | Inspect the failed container logs using docker logs <container_id>. |
Tuning | Not required |
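
The raise condition above relies on the PromQL changes() function, which counts how many times a series value changed within the window, and sums those counts per service_name. A minimal Python sketch of that calculation follows. The sample data and the function name `flapping_services` are hypothetical, and each list stands in for the samples of one series within the 10-minute window.

```python
def changes(samples):
    """Count how many times the value differs from the previous sample."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def flapping_services(series):
    """Return service names whose summed change count is greater than 0.

    `series` is a list of (service_name, samples) pairs (an assumed layout).
    """
    totals = {}
    for service_name, samples in series:
        totals[service_name] = totals.get(service_name, 0) + changes(samples)
    return sorted(name for name, total in totals.items() if total > 0)


# Hypothetical window: the web service lost and regained one running task.
window = [
    ("web", [3, 3, 2, 3, 3]),
    ("db", [1, 1, 1, 1, 1]),
]
print(flapping_services(window))
```
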