HAProxy

This section describes the alerts for the HAProxy service.


HaproxyServiceDown

Severity Minor
Summary The HAProxy service on the {{ $labels.host }} node is down.
Raise condition haproxy_up != 1
Description Raises when the HAProxy service on a node does not respond to Telegraf, which typically means that the HAProxy process on that node is down. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the HAProxy status by running systemctl status haproxy on the affected node.
  • If HAProxy is up and running, inspect the Telegraf logs on the affected node using journalctl -u telegraf.
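  For example, both checks above can be combined into a quick pass on the affected node (a sketch; the unit names assume the default haproxy and telegraf systemd units):
    # Is the HAProxy process running?
    systemctl status haproxy
    # If it is, look for recent HAProxy-related errors reported by Telegraf
    journalctl -u telegraf --since "30 min ago" --no-pager | grep -i haproxy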
Tuning Not required

HaproxyServiceDownMajor

Severity Major
Summary More than 50% of HAProxy services within the {{ $labels.cluster }} cluster are down.
Raise condition count(label_replace(haproxy_up, "cluster", "$1", "host", "([^0-9]+).+") != 1) by (cluster) >= 0.5 * count(label_replace(haproxy_up, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster)
Description Raises when the HAProxy service does not respond to Telegraf on more than 50% of cluster nodes. The cluster label in the raised alert contains the cluster prefix, for example, ctl, dbs, or mon.
Troubleshooting
  • Inspect the HaproxyServiceDown alerts for the host names of the affected nodes.
  • Inspect dmesg and /var/log/kern.log.
  • Inspect the logs in /var/log/haproxy.log.
  • Inspect the Telegraf logs using journalctl -u telegraf.
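  The log checks above can be scripted for a quick first pass (a sketch; the file locations are the usual defaults and may differ on your nodes):
    dmesg -T | tail -n 50
    grep -i haproxy /var/log/kern.log | tail -n 50
    tail -n 100 /var/log/haproxy.log
    journalctl -u telegraf --since "1 hour ago" --no-pager | tail -n 50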
Tuning Not required

HaproxyServiceOutage

Severity Critical
Summary All HAProxy services within the {{ $labels.cluster }} cluster are down.
Raise condition count(label_replace(haproxy_up, "cluster", "$1", "host", "([^0-9]+).+") != 1) by (cluster) == count(label_replace(haproxy_up, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster)
Description Raises when the HAProxy service does not respond to Telegraf on all nodes of a cluster, typically indicating deployment or configuration issues. The cluster label in the raised alert contains the cluster prefix, for example, ctl, dbs, or mon.
Troubleshooting
  • Inspect the HaproxyServiceDown alerts for the host names of the affected nodes.
  • Inspect dmesg and /var/log/kern.log.
  • Inspect the logs in /var/log/haproxy.log.
  • Inspect the Telegraf logs using journalctl -u telegraf.
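  Because a cluster-wide outage typically points to a deployment or configuration problem, it can also help to validate the HAProxy configuration on one of the affected nodes (a sketch; /etc/haproxy/haproxy.cfg is the default configuration path and may differ in your deployment):
    # Exits non-zero and reports parsing errors if the configuration is invalid
    haproxy -c -f /etc/haproxy/haproxy.cfg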
Tuning Not required

HaproxyHTTPResponse5xxTooHigh

Severity Warning
Summary The average per-second rate of 5xx HTTP errors on the {{ $labels.host }} node for the {{ $labels.proxy }} back end is {{ $value }} (as measured over the last 2 minutes).
Raise condition rate(haproxy_http_response_5xx{sv="FRONTEND"}[2m]) > 1
Description Raises when the average per-second rate of HTTP 5xx responses sent by HAProxy over the last 2 minutes exceeds 1, indicating a configuration issue with the HAProxy service or with the back-end servers within the cluster. The host label in the raised alert contains the host name of the affected node.
Troubleshooting Inspect the HAProxy logs by running journalctl -u haproxy on the affected node and verify the state of the back-end servers.
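  For a first pass, the HAProxy logs can be filtered for 5xx status codes (a rough sketch; the exact log format depends on your HAProxy logging configuration, so treat the pattern as a heuristic):
    journalctl -u haproxy --since "10 min ago" --no-pager | grep -E ' 5[0-9]{2} '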
Tuning Not required

HaproxyBackendDown

Severity Minor
Summary The {{ $labels.proxy }} back end on the {{ $labels.host }} node is down.
Raise condition increase(haproxy_chkdown{sv="BACKEND"}[1m]) > 0
Description Raises when an internal HAProxy availability check reports an outage of a back end. The host and proxy labels in the raised alert contain the host name of the affected node and the service proxy name.
Troubleshooting
  • Inspect the HAProxy logs by running journalctl -u haproxy on the affected node.
  • Verify the state of the affected back-end server:
    • Verify that the server is responding and the back-end service is active and responsive.
    • Verify the state of the back-end service using an HTTP GET request, for example, curl -XGET http://ctl01:8888/. Typically, a 200 response code indicates a healthy state (see the expanded example after this list).
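  A variant of the check above that prints only the HTTP status code (ctl01 and port 8888 are the example endpoint from the step above; substitute the address of the affected back end):
    curl -s -o /dev/null -w '%{http_code}\n' -X GET http://ctl01:8888/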
Tuning Not required

HaproxyBackendDownMajor

Severity Major
Summary More than 50% of {{ $labels.proxy }} back ends are down.
Raise condition
  • In 2019.2.10 and prior: 0.5 * avg(sum(haproxy_active_servers{type="server"}) by (host, proxy) + sum(haproxy_backup_servers{type="server"}) by (host, proxy)) by (proxy) >= avg(sum(haproxy_active_servers{type="backend"}) by (host, proxy) + sum(haproxy_backup_servers{type="backend"}) by (host, proxy)) by (proxy)
  • In 2019.2.11 and newer: avg(sum(haproxy_active_servers{type="server"}) by (host, proxy) + sum(haproxy_backup_servers{type="server"}) by (host, proxy)) by (proxy) - avg(sum(haproxy_active_servers{type="backend"}) by (host, proxy) + sum(haproxy_backup_servers{type="backend"}) by (host, proxy)) by (proxy) >= 0.5 * avg(sum(haproxy_active_servers{type="server"}) by (host, proxy) + sum(haproxy_backup_servers{type="server"}) by (host, proxy)) by (proxy)
Description Raises when 50% or more of the back-end servers used by the HAProxy service are in the DOWN state. The host and proxy labels in the raised alert contain the host name of the affected node and the service proxy name.
Troubleshooting
  • Inspect the HAProxy logs by running journalctl -u haproxy on the affected node.
  • Verify the state of the affected back-end server:
    • Verify that the server is responding and the back-end service is active and responsive.
    • Verify the state of the back-end service using an HTTP GET request, for example, curl -XGET http://ctl01:8888/. Typically, a 200 response code indicates a healthy state.
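  In addition to the checks above, HAProxy itself can report the state of every server it balances. A sketch using the HAProxy stats socket (it assumes the stats socket is enabled at /run/haproxy/admin.sock, which depends on your configuration, and that socat is installed; fields 1, 2, and 18 of the CSV output are the proxy name, the server name, and the server status):
    echo "show stat" | socat stdio /run/haproxy/admin.sock | cut -d, -f1,2,18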
Tuning Not required

HaproxyBackendOutage

Severity Critical
Summary All {{ $labels.proxy }} back ends are down.
Raise condition max(haproxy_active_servers{sv="BACKEND"}) by (proxy) + max(haproxy_backup_servers{sv="BACKEND"}) by (proxy) == 0
Description Raises when all back-end servers used by the HAProxy service across the cluster are not available to process the requests proxied by HAProxy, typically indicating deployment or configuration issues. The proxy label in the raised alert contains the service proxy name.
Troubleshooting
  • Verify the affected back ends.
  • Inspect the HAProxy logs by running journalctl -u haproxy on the affected node.
  • Inspect Telegraf logs by running journalctl -u telegraf on the affected node.
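  Since this alert typically means that no back-end server is reachable at all, it is also worth checking a back-end node directly, bypassing HAProxy (a sketch; port 8888 is the example port used in the HaproxyBackendDown troubleshooting, substitute the port of the affected service):
    ss -tlnp | grep -w 8888                                           # is the back-end service listening?
    curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8888/   # does it respond locally?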
Tuning Not required