NGINX

This section describes the alerts for the NGINX service.

NginxServiceDown

Severity

Minor

Summary

The NGINX service on the {{ $labels.host }} node is down.

Raise condition

nginx_up != 1

Description

Raises when the NGINX service on a host node does not respond to Telegraf for 1 minute, typically indicating that NGINX is not running on that node. The host label in the raised alert contains the name of the affected node.
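
The following Python sketch models how the raise condition combines with the 1-minute duration. It is a simplified illustration, not how Prometheus evaluates rules internally: the per-host sample lists, the firing_hosts function, and the FOR_SECONDS constant are hypothetical, and real alert evaluation runs at a fixed rule-evaluation interval with a pending state.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical input format: per host, (timestamp_seconds, nginx_up) samples
# in ascending time order.
Samples = List[Tuple[float, float]]

FOR_SECONDS = 60.0  # the 1-minute duration described above


def firing_hosts(series: Dict[str, Samples], now: float) -> List[str]:
    """Return hosts where nginx_up != 1 has held continuously for FOR_SECONDS."""
    firing = []
    for host, samples in series.items():
        # Walk backwards to find when the current run of "down" samples began.
        down_since: Optional[float] = None
        for ts, value in reversed(samples):
            if value != 1:
                down_since = ts
            else:
                break  # the most recent "up" sample ends the search
        if down_since is not None and now - down_since >= FOR_SECONDS:
            firing.append(host)
    return sorted(firing)


series = {
    "ctl01": [(0, 1), (30, 0), (60, 0), (90, 0)],  # down since t=30
    "ctl02": [(0, 1), (30, 1), (60, 0), (90, 0)],  # down since t=60
}
print(firing_hosts(series, now=90))
```

In this example, ctl01 has been down since t=30, so at now=90 it has met the 60-second threshold and is reported, while ctl02 has been down for only 30 seconds and is not.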

Troubleshooting

  • Verify the NGINX status on the affected node using service nginx status.

  • If NGINX is up and running, inspect the Telegraf logs on the affected node using journalctl -u telegraf.

Tuning

Not required

NginxServiceOutage

Severity

Critical

Summary

All NGINX processes within the {{ $labels.cluster }} cluster are down.

Raise condition

count(label_replace(nginx_up, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster) == count(label_replace(nginx_up == 0, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster)

Description

Raises when all NGINX services across a cluster do not respond to Telegraf, typically indicating deployment or configuration issues. The cluster label in the raised alert contains the cluster prefix extracted from the host names, for example, ctl, dbs, or mon.
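
A minimal Python sketch of the raise condition, assuming nginx_up is available as a hypothetical dictionary that maps host names to their latest values. It mirrors label_replace by applying the same regular expression to the host label, then compares the total number of hosts per cluster with the number of hosts reporting 0. The helper names cluster_of and outage_clusters are illustrative only.

```python
import re
from collections import Counter
from typing import Dict, List

# Same regex as the raise condition. PromQL regexes are fully anchored,
# so re.fullmatch is used to mirror label_replace.
HOST_RE = re.compile(r"([^0-9]+).+")


def cluster_of(host: str) -> str:
    match = HOST_RE.fullmatch(host)
    # label_replace leaves the series unchanged when the regex does not match;
    # this sketch treats the missing cluster label as an empty string.
    return match.group(1) if match else ""


def outage_clusters(nginx_up: Dict[str, float]) -> List[str]:
    # Left-hand side: number of hosts reporting nginx_up, per cluster.
    total = Counter(cluster_of(h) for h in nginx_up)
    # Right-hand side: number of hosts with nginx_up == 0, per cluster.
    down = Counter(cluster_of(h) for h, v in nginx_up.items() if v == 0)
    # A cluster appears in `down` only if at least one host is down, just as
    # the right-hand side of the PromQL comparison has no series otherwise.
    return sorted(c for c, n in down.items() if n == total[c])


print(outage_clusters({"ctl01": 0, "ctl02": 0, "mon01": 0, "mon02": 1}))
```

For instance, the host ctl01 maps to the cluster ctl. The sketch returns a cluster only when every host in it reports 0; a cluster with at least one responding host, or with no down hosts at all, is not returned.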

Troubleshooting

Inspect the Telegraf logs on the affected nodes using journalctl -u telegraf.

Tuning

Not required

NginxDroppedIncomingConnections

Severity

Minor

Summary

NGINX drops {{ $value }} accepted connections per second for 5 minutes.

Raise condition

irate(nginx_accepts[5m]) - irate(nginx_handled[5m]) > 0

Description

Raises when NGINX has been dropping accepted connections for the last 5 minutes, indicating that NGINX does not handle every incoming connection. This may be caused by a resource or configuration limit. The host label in the raised alert contains the name of the affected node.
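
The raise condition subtracts two per-second rates. The Python sketch below approximates PromQL irate(), which uses only the last two samples in the 5-minute range and treats a decreasing counter value as a counter reset. The sample format and the irate and dropped_per_second function names are assumptions for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # hypothetical (timestamp_seconds, counter_value)


def irate(samples: List[Sample]) -> float:
    """Per-second rate from the last two samples in the range, like PromQL irate()."""
    if len(samples) < 2:
        raise ValueError("irate needs at least two samples in the range")
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    # Counter reset: if the value went down, the counter restarted from zero.
    delta = v1 - v0 if v1 >= v0 else v1
    return delta / (t1 - t0)


def dropped_per_second(accepts: List[Sample], handled: List[Sample]) -> float:
    # Mirrors irate(nginx_accepts[5m]) - irate(nginx_handled[5m]).
    return irate(accepts) - irate(handled)


accepts = [(0.0, 100.0), (10.0, 200.0)]
handled = [(0.0, 100.0), (10.0, 150.0)]
print(dropped_per_second(accepts, handled))
```

In this example, nginx_accepts grows by 100 and nginx_handled by 50 over 10 seconds, so the sketch reports 5 dropped connections per second and the alert condition is met.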

Troubleshooting

Inspect the NGINX logs on the affected node using journalctl -u nginx.

Tuning

Not required