etcd

This section describes the alerts for the etcd service.


EtcdRequestFailureTooHigh

Severity Minor
Summary More than 1% of HTTP {{ $labels.method }} requests to the etcd API failed on the {{ $labels.instance }} instance.
Raise condition sum by(method) (rate(etcd_http_failed_total[5m])) / sum by(method) (rate(etcd_http_received_total[5m])) > 0.01
Description Raises when more than 1% of HTTP requests from clients to the etcd API fail, calculated per method over a 5-minute window. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the etcd service status on the affected node using systemctl status etcd.
  • Inspect the etcd service logs using journalctl -u etcd.
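
For example, to inspect the service state and the most recent etcd log entries on the affected node (the one-hour window below is only an example):

  systemctl status etcd
  journalctl -u etcd --since "1 hour ago" --no-pager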
Tuning

For example, to change the threshold to 2%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. If such a file is already defined, skip this step.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            EtcdRequestFailureTooHigh:
              if: >-
                sum by(method) (rate(etcd_http_failed_total[5m])) /
                sum by(method) (rate(etcd_http_received_total[5m])) > 0.02
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
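
Alternatively to the web UI, you can query the Prometheus HTTP API to confirm the updated expression. The sketch below is only an example: <prometheus_host> and <prometheus_port> are placeholders for your deployment, and the jq utility is assumed to be available.

  curl -s http://<prometheus_host>:<prometheus_port>/api/v1/rules | \
    jq '.data.groups[].rules[] | select(.name == "EtcdRequestFailureTooHigh") | .query'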

EtcdInstanceNoLeader

Severity Major
Summary The etcd {{ $labels.instance }} instance has no leader.
Raise condition etcd_server_has_leader != 1
Description Raises when the etcd server reports that it has no leader.
Troubleshooting

Verify all etcd services on the ctl nodes:

  • Verify the etcd service status using systemctl status etcd.
  • Inspect the etcd service logs using journalctl -u etcd.
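
For example, to check the leader status directly from an etcd member on a ctl node (a sketch assuming etcd v3 and the etcdctl v3 API; <etcd_endpoint> is a placeholder, and TLS client options may be required depending on your deployment):

  export ETCDCTL_API=3
  etcdctl --endpoints=<etcd_endpoint> endpoint status --write-out=table
  etcdctl --endpoints=<etcd_endpoint> endpoint health

The IS LEADER column of the endpoint status output shows which member currently holds leadership.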
Tuning Not required

EtcdServiceDownMinor

Severity Minor
Summary The etcd {{ $labels.instance }} instance is down for 2 minutes.
Raise condition up{job='etcd'} == 0
Description Raises when Prometheus fails to scrape the etcd target. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the availability of the etcd target on the mon nodes. To obtain the address of a target, navigate to Status -> Targets -> etcd in the Prometheus web UI.
  • Verify the etcd service status on the affected nodes using systemctl status etcd and journalctl -u etcd.
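
For example, to list the etcd targets and their scrape state without the web UI, you can query the Prometheus HTTP API (a sketch; <prometheus_host>, <prometheus_port>, and the jq utility are assumptions):

  curl -s http://<prometheus_host>:<prometheus_port>/api/v1/targets | \
    jq '.data.activeTargets[] | select(.labels.job == "etcd") | {instance: .labels.instance, health: .health, lastError: .lastError}'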
Tuning Not required

EtcdServiceDownMajor

Severity Major
Summary More than 30% of etcd instances are down for 2 minutes.
Raise condition count(up{job='etcd'} == 0) > count(up{job='etcd'}) * 0.3
Description Raises when Prometheus fails to scrape more than 30% of etcd targets. Inspect the EtcdServiceDownMinor alerts for the host names of the affected nodes.
Troubleshooting
  • Verify the availability of the etcd target on the mon nodes. To obtain the address of a target, navigate to Status -> Targets -> etcd in the Prometheus web UI.
  • Verify the etcd service status on the affected nodes using systemctl status etcd and journalctl -u etcd.
Tuning Not required

EtcdServiceOutage

Severity Critical
Summary All etcd services within the cluster are down.
Raise condition count(up{job='etcd'} == 0) == count(up{job='etcd'})
Description Raises when Prometheus fails to scrape all etcd targets. Inspect the EtcdServiceDownMinor alerts for the host names of the affected nodes.
Troubleshooting
  • Verify the availability of the etcd target on the mon nodes. To obtain the address of a target, navigate to Status -> Targets -> etcd in the Prometheus web UI.
  • Verify the etcd service status on the affected nodes using systemctl status etcd and journalctl -u etcd.
Tuning Not required
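
Because all instances are affected, you can also check the etcd service on every etcd node at once from the Salt Master node. The compound target below assumes the etcd:server pillar of the etcd Salt formula and may differ in your Reclass model:

  salt -C 'I@etcd:server' cmd.run 'systemctl status etcd'
  salt -C 'I@etcd:server' cmd.run 'journalctl -u etcd --since "1 hour ago" --no-pager | tail -n 50'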