Docker
This section describes the alerts for the Docker service.
DockerdProcessDown
Severity |
Minor |
Summary |
The dockerd process on the {{ $labels.host }} node is down. |
Raise condition |
procstat_running{process_name="dockerd"} == 0
|
Description |
Raises when Telegraf cannot find running dockerd processes on a
host. The host label in the raised alert contains the name of
the affected node. |
Troubleshooting |
|
Tuning |
Not required |
DockerServiceOutage
Severity |
Critical |
Summary |
All dockerd processes within the {{ $labels.cluster }} cluster
are down. |
Raise condition |
count(label_replace(procstat_running{process_name="dockerd"},
"cluster", "$1", "host", "([^0-9]+).+")) by (cluster) ==
count(label_replace(procstat_running{process_name="dockerd"} == 0,
"cluster", "$1", "host", "([^0-9]+).+")) by (cluster)
|
Description |
Raises when Telegraf cannot find a running dockerd process on any
host of a cluster. The cluster label in the raised alert identifies
a set of nodes with the same host name prefix, for example, mon or
cid. |
Troubleshooting |
Inspect the DockerdProcessDown alerts for the host names of the
affected nodes.
Verify the Docker service status using service docker status.
Inspect Docker logs using journalctl -u docker.
|
Tuning |
Not required |
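The grouping in the raise condition above can be sketched in Python. This is an illustration only, not how Prometheus evaluates the rule: the function names and the sample host names are hypothetical, and a series whose host does not match the pattern is modeled with an empty cluster value because label_replace leaves such a series unchanged.

```python
import re
from collections import defaultdict

# Same pattern as in the raise condition. Prometheus anchors regular
# expressions at both ends, so fullmatch() is used here.
CLUSTER_RE = re.compile(r"([^0-9]+).+")


def cluster_of(host):
    """Return the value label_replace would put in the cluster label.

    An empty string models a host that does not match the pattern.
    """
    match = CLUSTER_RE.fullmatch(host)
    return match.group(1) if match else ""


def clusters_in_outage(procstat_running):
    """Return clusters where every dockerd series reports 0.

    procstat_running maps a host name to the latest sample value.
    """
    total = defaultdict(int)
    down = defaultdict(int)
    for host, value in procstat_running.items():
        cluster = cluster_of(host)
        total[cluster] += 1
        if value == 0:
            down[cluster] += 1
    # The alert compares the count of all series with the count of
    # series equal to 0, per cluster.
    return sorted(c for c in total if total[c] == down[c])


# Hypothetical sample data: all mon nodes are down, one cid node is up.
samples = {"mon01": 0, "mon02": 0, "mon03": 0, "cid01": 0, "cid02": 1}
print(clusters_in_outage(samples))  # ['mon']
```

With this pattern, the cluster name is the part of the host name that precedes the first digit, so mon01, mon02, and mon03 all map to mon.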
DockerService{{ camel_case_name }}ReplicasDownMinor
Severity |
Minor |
Summary |
More than 30% of the Docker Swarm {{ full_service_name }} service
replicas are down for 2 minutes. |
Raise condition |
In 2019.2.8 and prior:
{{ service.deploy.replicas }} -
min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) >=
{{ service.deploy.replicas }} *
{{ monitoring.replicas_failed_warning_threshold_percent }}
In 2019.2.9 and newer:
min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) /
min(docker_swarm_tasks_desired{{ '{' + label_selector + '}' }}) <=
{{ 1 - monitoring.replicas_failed_warning_threshold_percent }}
|
Description |
Generated for each Docker Swarm service. Applies only to replicated
Docker Swarm services.
Raises when more than 30% of the service replicas are unavailable. The
service_name label in the raised alert contains the Docker service
name.
|
Troubleshooting |
Run docker service ps <service_name> on any node of the affected
cluster to inspect the state of the service tasks. |
Tuning |
Not required |
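The two formulations of the raise condition above can be compared with a short Python sketch. The function names are hypothetical, and the 30% threshold is an assumption taken from the alert summary rather than from the monitoring configuration. Exact fractions are used to avoid floating-point rounding at the boundary.

```python
from fractions import Fraction


def replicas_down_legacy(configured, running, threshold):
    """2019.2.8 and prior: configured - running >= configured * threshold."""
    return configured - running >= configured * threshold


def replicas_down_ratio(desired, running, threshold):
    """2019.2.9 and newer: running / desired <= 1 - threshold.

    Assumes desired > 0.
    """
    return Fraction(running, desired) <= 1 - threshold


# Assumed threshold, inferred from the "more than 30%" summary.
MINOR = Fraction(30, 100)

# When the desired task count equals the configured replica count,
# both formulations give the same answer.
for running in range(0, 11):
    assert replicas_down_legacy(10, running, MINOR) == replicas_down_ratio(
        10, running, MINOR
    )

print(replicas_down_ratio(10, 8, MINOR))  # False
print(replicas_down_ratio(10, 7, MINOR))  # True
```

Both comparisons are inclusive, so under these assumptions the condition already holds when exactly 30% of the replicas are down.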
DockerService{{ camel_case_name }}ReplicasDownMajor
Severity |
Major |
Summary |
More than 60% of the Docker Swarm {{ full_service_name }} service
replicas are down for 2 minutes. |
Raise condition |
In 2019.2.8 and prior:
{{ service.deploy.replicas }} -
min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) >=
{{ service.deploy.replicas }} *
{{ monitoring.replicas_failed_critical_threshold_percent }}
In 2019.2.9 and newer:
min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) /
min(docker_swarm_tasks_desired{{ '{' + label_selector + '}' }}) <=
{{ 1 - monitoring.replicas_failed_critical_threshold_percent }}
|
Description |
Generated for each Docker Swarm service. Applies only to replicated
Docker Swarm services.
Raises when more than 60% of the service replicas are unavailable. The
service_name label in the raised alert contains the Docker service
name.
|
Troubleshooting |
Run docker service ps <service_name> on any node of the affected
cluster to inspect the state of the service tasks. |
Tuning |
Not required |
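To see how the percentage thresholds translate into replica counts, the following Python sketch computes the smallest number of failed replicas that satisfies the 2019.2.9 and newer raise condition. The function name is hypothetical, and the 30% and 60% values are assumptions taken from the alert summaries.

```python
import math
from fractions import Fraction

# Assumed thresholds, inferred from the alert summaries.
MINOR = Fraction(30, 100)
MAJOR = Fraction(60, 100)


def min_failed_replicas(desired, threshold):
    """Smallest number of failed replicas that satisfies the condition.

    running / desired <= 1 - threshold is equivalent to
    failed >= desired * threshold, where failed = desired - running.
    """
    return math.ceil(desired * threshold)


for desired in (1, 2, 3, 5, 10):
    print(
        desired,
        min_failed_replicas(desired, MINOR),
        min_failed_replicas(desired, MAJOR),
    )
```

Under these assumptions, a service with 3 replicas satisfies the Minor condition after 1 failed replica and the Major condition after 2, while a single-replica service satisfies both as soon as its only replica is down.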
DockerService{{ camel_case_name }}Outage
Severity |
Critical |
Summary |
All Docker Swarm {{ full_service_name }} replicas are down for 2 minutes. |
Raise condition |
In 2019.2.5 and prior:
docker_swarm_tasks_running{{ '{' + label_selector + '}' }} == 0 or
absent(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) == 1
In 2019.2.6 to 2019.2.8:
docker_swarm_tasks_desired{{ '{' + label_selector + '}' }} > 0 and
(docker_swarm_tasks_running{{ '{' + label_selector + '}' }} == 0 or
absent(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) == 1)
In 2019.2.9 and newer:
min(docker_swarm_tasks_desired{{ '{' + label_selector + '}' }}) > 0 and
(min(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) == 0 or
absent(docker_swarm_tasks_running{{ '{' + label_selector + '}' }}) == 1)
|
Description |
Generated for each Docker Swarm service. Applies only to replicated
Docker Swarm services.
Raises when the service has no available replicas. The
service_name label in the raised alert contains the Docker service
name.
|
Troubleshooting |
Run docker service ps <service_name> on any node of the affected
cluster to inspect the state of the service tasks. |
Tuning |
Not required |
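The difference between the raise conditions above can be sketched as plain boolean logic in Python. This is a simplification under stated assumptions: the function names are hypothetical, None models a missing series (the absent() case), and PromQL label matching between the operands is ignored.

```python
from typing import Optional


def outage_legacy(running: Optional[float]) -> bool:
    """2019.2.5 and prior: running == 0 or the series is absent."""
    return running is None or running == 0


def outage_guarded(desired: Optional[float], running: Optional[float]) -> bool:
    """2019.2.6 and newer: additionally require desired > 0."""
    if desired is None or desired <= 0:
        return False
    return running is None or running == 0


# A service deliberately scaled to zero replicas.
print(outage_legacy(running=0))              # True
print(outage_guarded(desired=0, running=0))  # False

# A service that should run 3 replicas but runs none.
print(outage_guarded(desired=3, running=0))  # True
```

In this model, the check on docker_swarm_tasks_desired keeps the alert from matching a service that has no desired tasks.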
DockerdServiceReplicaFlapping
Available starting from the 2019.2.6 maintenance update
Severity |
Critical |
Summary |
The Docker Swarm {{ $labels.service_name }} service replica is
flapping for 15 minutes. |
Raise condition |
sum(changes(docker_swarm_tasks_running[10m])) by (service_name) > 0
|
Description |
Raises when a service container in the Docker Swarm cluster cannot
start properly and stops shortly after starting, which means that the
service may be unavailable. The service outage alert may not fire in
this case because the container is constantly being restarted. |
Troubleshooting |
Inspect the failed container logs using docker logs <container_id>. |
Tuning |
Not required |
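The raise condition above relies on the PromQL changes() function, which counts how many times a sample value differs from the previous one within the range. A minimal Python sketch of that logic follows. The function names and sample data are hypothetical, and each list stands for the samples of one series inside the 10-minute window.

```python
def changes(samples):
    """Count how many times the value differs from the previous sample."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def flapping_services(series_by_service):
    """Return service names whose summed change count is above zero.

    series_by_service maps a service name to a list of series, where
    each series is a list of docker_swarm_tasks_running samples.
    """
    return sorted(
        name
        for name, series_list in series_by_service.items()
        if sum(changes(series) for series in series_list) > 0
    )


# Hypothetical data: one service keeps restarting, one is stable.
window = {
    "restarting_service": [[1, 0, 1, 0, 1]],
    "stable_service": [[1, 1, 1, 1, 1]],
}
print(changes(window["restarting_service"][0]))  # 4
print(flapping_services(window))  # ['restarting_service']
```

A stable service produces zero changes and is not reported, while any change in the running task count within the window satisfies the expression.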