NGINX

This section describes the alerts for the NGINX service.


NginxServiceDown

Severity Minor
Summary The NGINX service on the {{ $labels.host }} node is down.
Raise condition nginx_up != 1
Description Raises when the NGINX service on a host node does not respond to Telegraf for 1 minute, typically indicating that the NGINX service is not running on that node. The host label in the raised alert contains the name of the affected node.
Troubleshooting
  • Verify the NGINX status on the affected node using service nginx status.
  • If NGINX is up and running, inspect the Telegraf logs on the affected node using journalctl -u telegraf.
Tuning Not required
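
The 1-minute hold described above can be sketched in Python. This is an illustrative model only, not the alerting engine's implementation: the sample format, the should_fire helper, and the hold_seconds parameter are assumptions introduced for this example.

```python
def should_fire(samples, hold_seconds=60):
    """Decide whether the alert would be raised for one host.

    samples: list of (timestamp_seconds, nginx_up_value), oldest first
             (hypothetical format, for illustration only).
    Returns True if nginx_up != 1 has held without interruption for at
    least hold_seconds, measured up to the latest sample.
    """
    failing_since = None
    for ts, value in samples:
        if value != 1:
            # Remember when the current failing streak started.
            if failing_since is None:
                failing_since = ts
        else:
            # A healthy sample resets the streak.
            failing_since = None
    if failing_since is None:
        return False
    return samples[-1][0] - failing_since >= hold_seconds


# Hypothetical samples scraped every 15 seconds.
print(should_fire([(0, 1), (15, 0), (30, 0), (45, 0), (60, 0), (75, 0)]))
print(should_fire([(0, 0), (30, 1), (45, 0)]))
```

A single healthy sample resets the streak, which is why a short Telegraf scrape failure would not raise the alert in this model.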

NginxServiceOutage

Severity Critical
Summary All NGINX processes within the {{ $labels.cluster }} cluster are down.
Raise condition count(label_replace(nginx_up, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster) == count(label_replace(nginx_up == 0, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster)
Description Raises when none of the NGINX services in a cluster respond to Telegraf, typically indicating deployment or configuration issues. The cluster label in the raised alert contains the prefix of a cluster, for example, ctl, dbs, or mon.
Troubleshooting Inspect the Telegraf logs on the affected nodes using journalctl -u telegraf.
Tuning Not required
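
The cluster grouping in the raise condition can be sketched in Python. This is an illustration under stated assumptions, not how Prometheus evaluates the rule: the host names (ctl01, mon01, and so on) and the helper functions are hypothetical. The sketch mimics label_replace, which matches the regular expression against the whole host value and adds no cluster label when there is no match.

```python
import re
from collections import defaultdict

# Same pattern as in the raise condition. Prometheus anchors the pattern
# to the full label value, so fullmatch is used below.
CLUSTER_RE = re.compile(r"([^0-9]+).+")


def cluster_of(host):
    """Return the non-digit prefix of a host name, or "" if there is no match."""
    match = CLUSTER_RE.fullmatch(host)
    return match.group(1) if match else ""


def clusters_in_outage(nginx_up):
    """nginx_up maps host -> 0 or 1. Return clusters where every host reports 0."""
    total = defaultdict(int)
    down = defaultdict(int)
    for host, up in nginx_up.items():
        cluster = cluster_of(host)
        total[cluster] += 1
        if up == 0:
            down[cluster] += 1
    # The alert condition: the count of all hosts equals the count of down hosts.
    return sorted(c for c in total if total[c] == down[c])


# Hypothetical sample values, for illustration only.
sample = {"ctl01": 0, "ctl02": 0, "ctl03": 0, "mon01": 1, "mon02": 0}
print(clusters_in_outage(sample))  # prints ['ctl']
```

In this sample the mon cluster still has one responding host, so only ctl satisfies the condition.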

NginxDroppedIncomingConnections

Severity Minor
Summary NGINX drops {{ $value }} accepted connections per second for 5 minutes.
Raise condition irate(nginx_accepts[5m]) - irate(nginx_handled[5m]) > 0
Description Raises when NGINX has been dropping accepted connections over the last 5 minutes, meaning that it does not handle every incoming connection. A resource or configuration limit may be the cause. The host label in the raised alert contains the name of the affected node.
Troubleshooting Inspect the NGINX logs using journalctl -u nginx.
Tuning Not required
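
The arithmetic behind the raise condition can be sketched in Python. This is a simplified model, assuming that nginx_accepts and nginx_handled are monotonically increasing counters and that irate uses the last two samples in the window. The sample format and the helper names are hypothetical.

```python
def irate(samples):
    """Per-second rate from the last two samples of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    # If the counter went backwards, assume it was reset to zero in between.
    delta = v1 - v0 if v1 >= v0 else v1
    return delta / (t1 - t0)


def dropped_per_second(accepts, handled):
    """Accepted-but-not-handled connections per second, or None if unknown."""
    accept_rate = irate(accepts)
    handle_rate = irate(handled)
    if accept_rate is None or handle_rate is None:
        return None
    return accept_rate - handle_rate


# Hypothetical counter samples taken every 15 seconds.
accepts = [(0, 1000), (15, 1150), (30, 1300)]
handled = [(0, 1000), (15, 1150), (30, 1270)]
print(dropped_per_second(accepts, handled))  # prints 2.0
```

In this sample, NGINX accepts 10 connections per second and handles 8, so the difference is greater than 0 and the condition holds.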