etcd

This section describes the alerts for the etcd service.


EtcdRequestFailureTooHigh

Severity

Minor

Summary

More than 1% of HTTP {{ $labels.method }} requests to the etcd API failed on the {{ $labels.instance }} instance.

Raise condition

sum by(method) (rate(etcd_http_failed_total[5m])) / sum by(method) (rate(etcd_http_received_total[5m])) > 0.01

Description

Raises when the percentage of failed HTTP requests from clients to etcd is higher than 1%. The host label in the raised alert contains the host name of the affected node.
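As a rough illustration of the raise condition, the following sketch computes the failure ratio for a single method from two per-second rates and compares it with the threshold. The function name failure_ratio_exceeds and the sample numbers are hypothetical and are not part of StackLight. Treating "no traffic" as "not exceeded" is an assumption of this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper that mirrors the raise condition for one method.
# $1 = failed requests per second, $2 = received requests per second,
# $3 = threshold as a fraction (defaults to 0.01, that is, 1%).
failure_ratio_exceeds() {
  awk -v failed="$1" -v received="$2" -v threshold="${3:-0.01}" 'BEGIN {
    # Guard against division by zero; this sketch reports "no" without traffic.
    if (received == 0) { print "no"; exit }
    if (failed / received > threshold) print "yes"; else print "no"
  }'
}

failure_ratio_exceeds 0.5 20   # 2.5% of requests failed, prints "yes"
failure_ratio_exceeds 0.1 20   # 0.5% of requests failed, prints "no"
```

Passing a third argument, for example 0.02, shows how the same rates behave under a tuned threshold.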

Troubleshooting

  • Verify the etcd service status on the affected node using systemctl status etcd.

  • Inspect the etcd service logs using journalctl -u etcd.
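The two checks above can be combined into one helper to run on the affected node. This is a minimal sketch: the function name check_etcd_service, the one-hour window, and the err priority filter are illustrative choices, not values prescribed by this guide.

```shell
#!/usr/bin/env bash
# Sketch: report the etcd unit state and recent error-level log entries.
check_etcd_service() {
  if systemctl is-active --quiet etcd; then
    echo "etcd is active"
  else
    echo "etcd is NOT active"
    # Print the full unit status to see why the service is not running.
    systemctl status etcd --no-pager
  fi
  # Show only messages of priority "err" or worse from the last hour.
  journalctl -u etcd --since "1 hour ago" -p err --no-pager
}

# Run on the affected node:
#   check_etcd_service
```

Widen the time window or drop the -p filter if the failed requests do not correlate with error-level messages.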

Tuning

For example, to change the threshold to 2%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file is already defined.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            EtcdRequestFailureTooHigh:
              if: >-
                sum by(method) (rate(etcd_http_failed_total[5m])) / sum
                by(method) (rate(etcd_http_received_total[5m])) > 0.02
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
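As an alternative to the web UI, the loaded expression can be read from the Prometheus HTTP API, which exposes rule definitions at /api/v1/rules. The sketch below assumes that jq is installed and that PROMETHEUS_URL is a placeholder for the address of your Prometheus server. The function name alert_query is hypothetical.

```shell
#!/usr/bin/env bash
# Read the JSON response of /api/v1/rules on stdin and print the
# expression of the alert whose name is passed as $1.
alert_query() {
  jq -r --arg alert "$1" \
    '.data.groups[].rules[] | select(.name == $alert) | .query'
}

# Example (PROMETHEUS_URL is a placeholder, not a value from this guide):
#   curl -s "${PROMETHEUS_URL}/api/v1/rules" | alert_query EtcdRequestFailureTooHigh
```

If the override was applied, the printed expression should end with the new threshold, for example > 0.02.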

EtcdInstanceNoLeader

Severity

Major

Summary

The etcd {{ $labels.instance }} instance has no leader.

Raise condition

etcd_server_has_leader != 1

Description

Raises when the etcd server reports that it has no leader.

Troubleshooting

Verify all etcd services on the ctl nodes:

  • Verify the etcd service status using systemctl status etcd.

  • Inspect the etcd service logs using journalctl -u etcd.
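To confirm what a particular member reports, the etcd_server_has_leader metric from the raise condition can be read directly from the member's metrics output. The sketch below parses Prometheus text-format metrics from standard input. The function name leader_state and the address in the usage comment are assumptions; adjust the scheme, host, port, and TLS options to your deployment.

```shell
#!/usr/bin/env bash
# Read Prometheus text-format metrics on stdin and report the leader state.
leader_state() {
  awk '
    # HELP and TYPE lines start with "#", so only the sample line matches.
    $1 == "etcd_server_has_leader" {
      found = 1
      print ($2 == 1 ? "has leader" : "no leader")
    }
    END { if (!found) print "metric not found" }
  '
}

# Example (the address is an assumption, not a value from this guide):
#   curl -s http://127.0.0.1:2379/metrics | leader_state
```

Running the check against each member helps to distinguish a single isolated member from a cluster-wide loss of quorum.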

Tuning

Not required

EtcdServiceDownMinor

Severity

Minor

Summary

The etcd {{ $labels.instance }} instance is down for 2 minutes.

Raise condition

up{job='etcd'} == 0

Description

Raises when Prometheus fails to scrape the etcd target. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

  • Verify the availability of the etcd target on the mon nodes. To obtain the address of a target, navigate to Status -> Targets -> etcd in the Prometheus web UI.

  • Verify the etcd service status on the affected nodes using systemctl status etcd and journalctl -u etcd.
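The failed targets can also be listed from the command line by sending the raise condition to the Prometheus HTTP API as an instant query. The sketch below assumes that jq is installed and that PROMETHEUS_URL is a placeholder for the address of your Prometheus server. The function name down_instances is hypothetical.

```shell
#!/usr/bin/env bash
# Read an instant-query response (/api/v1/query) on stdin and print the
# instance label of every returned series, one per line.
down_instances() {
  jq -r '.data.result[].metric.instance'
}

# Example (PROMETHEUS_URL is a placeholder, not a value from this guide):
#   curl -s -G "${PROMETHEUS_URL}/api/v1/query" \
#     --data-urlencode "query=up{job='etcd'} == 0" | down_instances
```

An empty output means that no etcd target is currently reported as down.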

Tuning

Not required

EtcdServiceDownMajor

Severity

Major

Summary

More than 30% of etcd instances are down for 2 minutes.

Raise condition

count(up{job='etcd'} == 0) > count(up{job='etcd'}) * 0.3

Description

Raises when Prometheus fails to scrape more than 30% of etcd targets. Inspect the EtcdServiceDownMinor alerts for the host names of the affected nodes.
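To see how the 30% threshold maps to cluster sizes, the following sketch evaluates the same comparison for given counts. The function name major_alert_fires is hypothetical, and the sketch ignores how long the targets have been down.

```shell
#!/usr/bin/env bash
# Hypothetical helper that mirrors count(down) > count(all) * 0.3.
# $1 = number of etcd targets that are down, $2 = total number of etcd targets.
major_alert_fires() {
  awk -v down="$1" -v total="$2" \
    'BEGIN { print ((down > total * 0.3) ? "yes" : "no") }'
}

major_alert_fires 1 3   # 1 > 0.9, prints "yes"
major_alert_fires 1 5   # 1 > 1.5, prints "no"
major_alert_fires 2 5   # 2 > 1.5, prints "yes"
```

In a three-node cluster, a single failed instance therefore already satisfies the condition of this alert, while a five-node cluster tolerates one failed instance before the condition is met.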

Troubleshooting

  • Verify the availability of the etcd target on the mon nodes. To obtain the address of a target, navigate to Status -> Targets -> etcd in the Prometheus web UI.

  • Verify the etcd service status on the affected nodes using systemctl status etcd and journalctl -u etcd.

Tuning

Not required

EtcdServiceOutage

Severity

Critical

Summary

All etcd services within the cluster are down.

Raise condition

count(up{job='etcd'} == 0) == count(up{job='etcd'})

Description

Raises when Prometheus cannot scrape any of the etcd targets. Inspect the EtcdServiceDownMinor alerts for the host names of the affected nodes.

Troubleshooting

  • Verify the availability of the etcd target on the mon nodes. To obtain the address of a target, navigate to Status -> Targets -> etcd in the Prometheus web UI.

  • Verify the etcd service status on the affected nodes using systemctl status etcd and journalctl -u etcd.

Tuning

Not required