NGINX
This section describes the alerts for the NGINX service.
NginxServiceDown
Severity |
Minor |
Summary |
The NGINX service on the {{ $labels.host }} node is down. |
Raise condition |
nginx_up != 1 |
Description |
Raises when the NGINX service on a host node does not respond to
Telegraf for 1 minute, typically indicating that the NGINX service is
not running on that node. The host label in the raised alert contains
the name of the affected node. |
Troubleshooting |
- Run service nginx status on the affected node to verify the NGINX
status.
- If NGINX is up and running, run journalctl -u telegraf on the affected
node to inspect the Telegraf logs. |
Tuning |
Not required |
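
The 1-minute window in the description can be illustrated with a short simulation. This is a minimal sketch, not the actual rule definition: it assumes the window is implemented as a hold period (such as a Prometheus `for: 1m` clause), which the raise condition above does not show. The function name `firing_since` and the 15-second evaluation interval are hypothetical.

```python
def firing_since(evaluations, hold_seconds=60):
    """Return the timestamp at which the alert would start firing, or None.

    evaluations: list of (timestamp_seconds, nginx_up_value), oldest first.
    Assumption: the alert fires once `nginx_up != 1` has held at every
    evaluation for at least `hold_seconds`.
    """
    active_at = None  # time at which the condition first became true
    for ts, nginx_up in evaluations:
        if nginx_up != 1:
            if active_at is None:
                active_at = ts
            if ts - active_at >= hold_seconds:
                return ts
        else:
            # The service responded again, so the pending state resets.
            active_at = None
    return None


# NGINX stops responding at t=15 and stays down.
print(firing_since([(0, 1), (15, 0), (30, 0), (45, 0), (60, 0), (75, 0)]))  # 75

# A short recovery at t=30 resets the hold period, so nothing fires.
print(firing_since([(0, 0), (15, 0), (30, 1), (45, 0), (60, 0)]))  # None
```

Under this assumption, a brief NGINX restart that completes within the hold period does not raise the alert.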
NginxServiceOutage
Severity |
Critical |
Summary |
All NGINX processes within the {{ $labels.cluster }} cluster are
down. |
Raise condition |
count(label_replace(nginx_up, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster)
== count(label_replace(nginx_up == 0, "cluster", "$1", "host", "([^0-9]+).+")) by (cluster) |
Description |
Raises when none of the NGINX services across a cluster respond to
Telegraf, typically indicating deployment or configuration issues. The
cluster label in the raised alert contains the prefix of a cluster,
for example, ctl, dbs, or mon. |
Troubleshooting |
Run journalctl -u telegraf on the affected nodes to inspect the
Telegraf logs. |
Tuning |
Not required |
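
The raise condition above derives the cluster label from the host label and then compares two counts per cluster. The following minimal sketch mimics that logic in plain Python to show which clusters would match. The host names and values are made-up sample data, and the helper names `cluster_of` and `clusters_in_outage` are hypothetical.

```python
import re
from collections import Counter

# Same pattern as in the raise condition. Prometheus anchors label_replace
# regexes at both ends, hence fullmatch() below.
CLUSTER_RE = re.compile(r"([^0-9]+).+")


def cluster_of(host):
    """Return the non-digit prefix of a host name, for example ctl for ctl01."""
    match = CLUSTER_RE.fullmatch(host)
    # label_replace leaves the series without a cluster label on no match.
    return match.group(1) if match else ""


def clusters_in_outage(nginx_up):
    """nginx_up maps host name -> last nginx_up value (1 = up, 0 = down)."""
    total = Counter(cluster_of(host) for host in nginx_up)
    down = Counter(cluster_of(host) for host, value in nginx_up.items() if value == 0)
    # The alert matches a cluster only when every one of its hosts reports 0.
    return sorted(cluster for cluster, count in down.items() if count == total[cluster])


samples = {"ctl01": 0, "ctl02": 0, "ctl03": 0, "dbs01": 1, "dbs02": 0, "mon01": 1}
print(clusters_in_outage(samples))  # ['ctl']
```

In this sample, dbs has one host down but does not match, because the two counts differ for that cluster.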
NginxDroppedIncomingConnections
Severity |
Minor |
Summary |
NGINX drops {{ $value }} accepted connections per second for 5
minutes. |
Raise condition |
irate(nginx_accepts[5m]) - irate(nginx_handled[5m]) > 0 |
Description |
Raises when NGINX has dropped accepted connections during the last 5
minutes, meaning that NGINX does not handle every incoming connection.
This may be caused by a resource or configuration limit. The host label
contains the name of the affected node. |
Troubleshooting |
Run journalctl -u nginx to inspect the NGINX logs. |
Tuning |
Not required |
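
The raise condition above subtracts the per-second rate of handled connections from the rate of accepted connections. In Prometheus, irate() computes that rate from the two most recent samples inside the window. The following minimal sketch reproduces the arithmetic, including the counter-reset case. The sample values and the 15-second scrape interval are illustrative assumptions, and `dropped_per_second` is a hypothetical helper name.

```python
def irate(samples):
    """Per-second rate from the last two samples of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first,
    already limited to the 5-minute window.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    # A counter that went backwards was reset, so count from zero.
    delta = v_last - v_prev if v_last >= v_prev else v_last
    return delta / (t_last - t_prev)


def dropped_per_second(accepts, handled):
    """Mirror irate(nginx_accepts[5m]) - irate(nginx_handled[5m])."""
    accept_rate, handled_rate = irate(accepts), irate(handled)
    if accept_rate is None or handled_rate is None:
        return None
    return accept_rate - handled_rate


accepts = [(0, 1000), (15, 1300)]  # 300 connections accepted in 15 seconds
handled = [(0, 1000), (15, 1270)]  # 270 of them handled
print(dropped_per_second(accepts, handled))  # 2.0, so the condition > 0 holds
```

A positive result corresponds to the {{ $value }} reported in the alert summary.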