etcd

This section describes the alerts for the etcd service.


EtcdRequestFailureTooHigh

Severity Minor
Summary More than 1% of HTTP {{ $labels.method }} requests to the etcd API failed on the {{ $labels.instance }} instance.
Raise condition sum by(method) (rate(etcd_http_failed_total[5m])) / sum by(method) (rate(etcd_http_received_total[5m])) > 0.01
Description Raises when more than 1% of HTTP requests from clients to the etcd API fail, calculated per method over a 5-minute window. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the etcd service status on the affected node using systemctl status etcd.
  • Inspect the etcd service logs using journalctl -u etcd.
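
For example, to inspect the service state and the most recent etcd log entries on the affected node (the one-hour window below is only an example):

  systemctl status etcd
  journalctl -u etcd --since "1 hour ago" --no-pager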
Tuning

For example, to change the threshold to 2%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. If such a file is already defined, skip this step.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            EtcdRequestFailureTooHigh:
              if: >-
                sum by(method) (rate(etcd_http_failed_total[5m])) /
                sum by(method) (rate(etcd_http_received_total[5m])) > 0.02
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
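
Alternatively to the web UI, you can query the Prometheus HTTP API to confirm the updated expression. The sketch below is only an example: <prometheus_host> and <prometheus_port> are placeholders for your deployment, and the jq utility is assumed to be available.

  curl -s http://<prometheus_host>:<prometheus_port>/api/v1/rules | \
    jq '.data.groups[].rules[] | select(.name == "EtcdRequestFailureTooHigh") | .query'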

EtcdInstanceNoLeader

Severity Major
Summary The etcd {{ $labels.instance }} instance has no leader.
Raise condition etcd_server_has_leader != 1
Description Raises when the etcd server reports that it has no leader.
Troubleshooting

Verify all etcd services on the ctl nodes:

  • Verify the etcd service status using systemctl status etcd.
  • Inspect the etcd service logs using journalctl -u etcd.
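
For example, to check the leader status directly from an etcd member on a ctl node (a sketch assuming etcd v3 and the etcdctl v3 API; <etcd_endpoint> is a placeholder, and TLS client options may be required depending on your deployment):

  export ETCDCTL_API=3
  etcdctl --endpoints=<etcd_endpoint> endpoint status --write-out=table
  etcdctl --endpoints=<etcd_endpoint> endpoint health

The IS LEADER column of the endpoint status output shows which member currently holds leadership.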
Tuning Not required

EtcdServiceDownMinor

Severity Minor
Summary The etcd {{ $labels.instance }} instance is down for 2 minutes.
Raise condition up{job='etcd'} == 0
Description Raises when Prometheus fails to scrape the etcd target. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the availability of the etcd target on the mon nodes. To obtain the address of a target, navigate to Status -> Targets -> etcd in the Prometheus web UI.
  • Verify the etcd service status on the affected nodes using systemctl status etcd and journalctl -u etcd.
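
For example, to list the etcd targets and their scrape state without the web UI, you can query the Prometheus HTTP API (a sketch; <prometheus_host>, <prometheus_port>, and the jq utility are assumptions):

  curl -s http://<prometheus_host>:<prometheus_port>/api/v1/targets | \
    jq '.data.activeTargets[] | select(.labels.job == "etcd") | {instance: .labels.instance, health: .health, lastError: .lastError}'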
Tuning Not required

EtcdServiceDownMajor

Severity Major
Summary More than 30% of etcd instances are down for 2 minutes.
Raise condition count(up{job='etcd'} == 0) > count(up{job='etcd'}) * 0.3
Description Raises when Prometheus fails to scrape more than 30% of etcd targets. Inspect the EtcdServiceDownMinor alerts for the host names of the affected nodes.
Troubleshooting
  • Verify the availability of the etcd target on the mon nodes. To obtain the address of a target, navigate to Status -> Targets -> etcd in the Prometheus web UI.
  • Verify the etcd service status on the affected nodes using systemctl status etcd and journalctl -u etcd.
Tuning Not required

EtcdServiceOutage

Severity Critical
Summary All etcd services within the cluster are down.
Raise condition count(up{job='etcd'} == 0) == count(up{job='etcd'})
Description Raises when Prometheus fails to scrape all etcd targets. Inspect the EtcdServiceDownMinor alerts for the host names of the affected nodes.
Troubleshooting
  • Verify the availability of the etcd target on the mon nodes. To obtain the address of a target, navigate to Status -> Targets -> etcd in the Prometheus web UI.
  • Verify the etcd service status on the affected nodes using systemctl status etcd and journalctl -u etcd.
Tuning Not required
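
Because all instances are affected, you can also check the etcd service on every etcd node at once from the Salt Master node. The compound target below assumes the etcd:server pillar of the etcd Salt formula and may differ in your Reclass model:

  salt -C 'I@etcd:server' cmd.run 'systemctl status etcd'
  salt -C 'I@etcd:server' cmd.run 'journalctl -u etcd --since "1 hour ago" --no-pager | tail -n 50'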