etcd
This section describes the alerts for the etcd service.
EtcdRequestFailureTooHigh

Severity: Minor
Summary: More than 1% of HTTP {{ $labels.method }} requests to the etcd API failed on the {{ $labels.instance }} instance.
Raise condition: sum by(method) (rate(etcd_http_failed_total[5m])) / sum by(method) (rate(etcd_http_received_total[5m])) > 0.01
Description: Raises when the total percentage rate of failed HTTP requests from the client to etcd is higher than 1%. The host label in the raised alert contains the host name of the affected node.
Troubleshooting:
- Verify the etcd service status on the affected node using systemctl status etcd.
- Inspect the etcd service logs using journalctl -u etcd.
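The raise condition divides two 5-minute rate() values per HTTP method. A minimal sketch of that arithmetic, assuming hypothetical counter samples taken 5 minutes apart (the rate() approximation and the sample values are illustrative, not real cluster data):

```python
WINDOW = 300  # seconds; matches the [5m] range in the PromQL expression

def rate(start: float, end: float) -> float:
    """Rough equivalent of PromQL rate() for a monotonic counter."""
    return (end - start) / WINDOW

def failure_ratio(failed_then, failed_now, received_then, received_now):
    """Per-method failure ratio, compared against the 1% alert threshold."""
    ratio = rate(failed_then, failed_now) / rate(received_then, received_now)
    return ratio, ratio > 0.01

# Hypothetical etcd_http_failed_total / etcd_http_received_total samples
# for the GET method, taken 5 minutes apart.
ratio, fires = failure_ratio(10, 14, 1000, 1200)
print(f"GET failure ratio: {ratio:.3f}, alert fires: {fires}")
```

With these sample numbers the ratio is 2%, so the default 1% threshold would fire.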
Tuning:
For example, to change the threshold to 2%:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
   touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
   classes:
   - cluster.<cluster_name>.stacklight.custom.alerts
   ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:
   parameters:
     prometheus:
       server:
         alert:
           EtcdRequestFailureTooHigh:
             if: >-
               sum by(method) (rate(etcd_http_failed_total[5m])) /
               sum by(method) (rate(etcd_http_received_total[5m])) > 0.02
3. From the Salt Master node, apply the changes:
   salt 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
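Besides the web UI, the updated rule can be checked against the standard Prometheus HTTP API endpoint /api/v1/rules. The helper below is a sketch only: the sample payload is a hypothetical response body, and in practice you would fetch the JSON from a mon node with curl or urllib.

```python
import json

def find_alert_query(rules_payload: dict, alert_name: str):
    """Return the PromQL expression of the named alerting rule, or None."""
    for group in rules_payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") == "alerting" and rule.get("name") == alert_name:
                return rule.get("query")
    return None

# Hypothetical response body from GET http://<prometheus_address>/api/v1/rules.
payload = json.loads("""{
  "status": "success",
  "data": {"groups": [{"name": "etcd", "rules": [
    {"type": "alerting", "name": "EtcdRequestFailureTooHigh",
     "query": "sum by(method) (rate(etcd_http_failed_total[5m])) / sum by(method) (rate(etcd_http_received_total[5m])) > 0.02"}
  ]}]}
}""")

print(find_alert_query(payload, "EtcdRequestFailureTooHigh"))
```

If the printed expression still ends in `> 0.01`, the override has not been applied yet.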
EtcdInstanceNoLeader

Severity: Major
Summary: The etcd {{ $labels.instance }} instance has no leader.
Raise condition: etcd_server_has_leader != 1
Description: Raises when the etcd server reports that it has no leader.
Troubleshooting:
Verify all etcd services on the ctl nodes:
- Verify the etcd service status using systemctl status etcd.
- Inspect the etcd service logs using journalctl -u etcd.
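The raise condition reads etcd's own etcd_server_has_leader gauge, so an individual node can also be checked directly from its metrics endpoint (for example, with curl against the node's client URL followed by /metrics). A small sketch that evaluates the gauge from the exposition text; the sample text is illustrative:

```python
def has_leader(metrics_text: str) -> bool:
    """Parse Prometheus exposition text and check etcd_server_has_leader == 1."""
    for line in metrics_text.splitlines():
        # Skip "# HELP" / "# TYPE" comment lines; match the sample line only.
        if line.startswith("etcd_server_has_leader"):
            return float(line.split()[-1]) == 1.0
    return False

# Hypothetical snippet of an etcd /metrics response.
sample = """\
# HELP etcd_server_has_leader Whether or not a leader exists. 1 is existence, 0 is not.
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
"""
print(has_leader(sample))  # True: this node sees a leader, alert stays silent
```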
Tuning: Not required
EtcdServiceDownMinor

Severity: Minor
Summary: The etcd {{ $labels.instance }} instance is down for 2 minutes.
Raise condition: up{job='etcd'} == 0
Description: Raises when Prometheus fails to scrape the etcd target. The host label in the raised alert contains the host name of the affected node.
Troubleshooting:
- Verify the availability of the etcd target on the mon nodes. To obtain the address of a target, navigate to Status -> Targets -> etcd in the Prometheus web UI.
- Verify the etcd service status on the affected nodes using systemctl status etcd and journalctl -u etcd.
Tuning: Not required
EtcdServiceDownMajor

Severity: Major
Summary: More than 30% of etcd instances are down for 2 minutes.
Raise condition: count(up{job='etcd'} == 0) > count(up{job='etcd'}) * 0.3
Description: Raises when Prometheus fails to scrape more than 30% of etcd targets. Inspect the EtcdServiceDownMinor alerts for the host names of the affected nodes.
Troubleshooting:
- Verify the availability of the etcd target on the mon nodes. To obtain the address of a target, navigate to Status -> Targets -> etcd in the Prometheus web UI.
- Verify the etcd service status on the affected nodes using systemctl status etcd and journalctl -u etcd.
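The Minor, Major, and Outage availability alerts are all derived from the same up{job='etcd'} vector. A sketch of the three raise conditions over a hypothetical list of per-target scrape results (1 = scraped, 0 = down):

```python
def classify(up_values):
    """Mirror the Minor/Major/Outage raise conditions over an up{job='etcd'} vector."""
    down = sum(1 for v in up_values if v == 0)
    total = len(up_values)
    return {
        "minor": down > 0,                      # any single instance down
        "major": down > total * 0.3,            # more than 30% of instances down
        "outage": down == total and total > 0,  # every instance down
    }

# Hypothetical 3-node etcd cluster with one instance down.
print(classify([1, 0, 1]))
```

Note that in a 3-node cluster a single down instance already exceeds the 30% threshold (1 > 3 * 0.3), so Minor and Major fire together.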
Tuning: Not required
EtcdServiceOutage

Severity: Critical
Summary: All etcd services within the cluster are down.
Raise condition: count(up{job='etcd'} == 0) == count(up{job='etcd'})
Description: Raises when Prometheus fails to scrape all etcd targets. Inspect the EtcdServiceDownMinor alerts for the host names of the affected nodes.
Troubleshooting:
- Verify the availability of the etcd target on the mon nodes. To obtain the address of a target, navigate to Status -> Targets -> etcd in the Prometheus web UI.
- Verify the etcd service status on the affected nodes using systemctl status etcd and journalctl -u etcd.
Tuning: Not required