Prometheus
This section describes the alerts for the Prometheus service.

PrometheusTargetDown

Severity
    Critical
Summary
    The Prometheus target for the {{ $labels.job }} job on the
    {{ or $labels.host $labels.instance }} node is down for 2 minutes.
Raise condition
    up != 1
Description
    Raises when Prometheus fails to scrape a target for 2 minutes. The
    reasons depend on the target type, for example, Telegraf-related or
    connectivity issues, Fluentd misconfiguration, or issues with libvirt
    or JMX Exporter.
Troubleshooting
    Depending on the target type:
    - Inspect the Telegraf logs using journalctl -u telegraf.
    - Inspect the Fluentd logs using journalctl -u td-agent.
    - Inspect the libvirt-exporter logs using journalctl -u libvirt-exporter.
    - Inspect the jmx-exporter logs using journalctl -u jmx-exporter.
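    If it is unclear which targets are affected, you can also list the
    targets that are currently down through the Prometheus HTTP API. The
    following is a minimal sketch: the address 127.0.0.1:9090 is an
    assumption and must be replaced with the address of the Prometheus
    server in your environment.
        # Query the Prometheus HTTP API for all targets with up != 1.
        # The JSON response contains the job and instance labels of every
        # target that is currently down.
        curl -sG 'http://127.0.0.1:9090/api/v1/query' \
             --data-urlencode 'query=up != 1'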
Tuning
    Not required

PrometheusTargetSamplesOrderWarning
Removed since the 2019.2.4 maintenance update.

Severity
    Warning
Summary
    {{ $value }} Prometheus samples on the {{ $labels.instance }}
    instance are out of order (as measured over the last minute).
Raise condition
    increase(prometheus_target_scrapes_sample_out_of_order_total[1m]) > 0
Description
    Raises when Prometheus observes out-of-order time series samples.
    Warning: The alert has been removed starting from the 2019.2.4
    maintenance update. For the existing MCP deployments, disable this
    alert.
Tuning
    Disable the alert as described in Manage alerts. See also the sketch
    below.
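    A minimal sketch of such an override, assuming the alert customizations
    file described in the PrometheusRemoteStorageQueueFullWarning tuning
    section below and assuming that the prometheus Salt formula supports
    the enabled flag for alert definitions:
        # cluster/<cluster_name>/stacklight/custom/alerts.yml
        parameters:
          prometheus:
            server:
              alert:
                PrometheusTargetSamplesOrderWarning:
                  enabled: false  # do not generate this alert definition
    Apply the change from the Salt Master node with
    salt 'I@prometheus:server' state.sls prometheus.server, as in the
    PrometheusRemoteStorageQueueFullWarning tuning example.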

PrometheusTargetSamplesBoundsWarning
Removed since the 2019.2.4 maintenance update.

Severity
    Warning
Summary
    {{ $value }} Prometheus samples on the {{ $labels.instance }}
    instance have time stamps out of bounds (as measured over the last
    minute).
Raise condition
    increase(prometheus_target_scrapes_sample_out_of_bounds_total[1m]) > 0
Description
    Raises when Prometheus observes samples with time stamps greater than
    the current time.
    Warning: The alert has been removed starting from the 2019.2.4
    maintenance update. For the existing MCP deployments, disable this
    alert.
Tuning
    Disable the alert as described in Manage alerts.

PrometheusTargetSamplesDuplicateWarning
Removed since the 2019.2.4 maintenance update.

Severity
    Warning
Summary
    {{ $value }} Prometheus samples on the {{ $labels.instance }}
    instance have duplicate time stamps (as measured over the last minute).
Raise condition
    increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[1m]) > 0
Description
    Raises when Prometheus observes samples with duplicated time stamps.
    Warning: The alert has been removed starting from the 2019.2.4
    maintenance update. For the existing MCP deployments, disable this
    alert.
Tuning
    Disable the alert as described in Manage alerts.

PrometheusDataIngestionWarning
Removed since the 2019.2.4 maintenance update.

Severity
    Warning
Summary
    The Prometheus service writes on the {{ $labels.instance }} instance
    do not keep up with the data ingestion speed for 10 minutes.
Raise condition
    prometheus_local_storage_rushed_mode != 0
Description
    Raises when the Prometheus service writes do not keep up with the data
    ingestion speed for 10 minutes.
    Warning: The alert is deprecated for Prometheus versions newer than 1.7
    and has been removed starting from the 2019.2.4 maintenance update. For
    the existing MCP deployments, disable this alert.
Tuning
    Disable the alert as described in Manage alerts.

PrometheusRemoteStorageQueueFullWarning

Severity
    Warning
Summary
    The Prometheus remote storage queue on the {{ $labels.instance }}
    instance is 75% full for 2 minutes.
Raise condition
    prometheus_remote_storage_queue_length /
    prometheus_remote_storage_queue_capacity * 100 > 75
Description
    Raises when the Prometheus remote write queue is 75% full.
Troubleshooting
    Inspect the remote write service configuration in the remote_write
    section in /srv/volumes/local/prometheus/config/prometheus.yml.
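    For reference, a remote_write section with an explicit queue
    configuration typically looks similar to the following. This is an
    illustrative sketch only: it assumes a Prometheus version that supports
    queue_config under remote_write, and the URL and values are
    placeholders, not recommended settings.
        remote_write:
          - url: http://<remote_storage_address>/write   # placeholder endpoint
            queue_config:
              capacity: 10000            # samples buffered per shard
              max_shards: 10             # upper bound on parallel senders
              max_samples_per_send: 500  # batch size per request
    Increasing the queue capacity or the number of shards, or fixing a slow
    or unreachable remote storage endpoint, reduces how full the queue gets.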
Tuning
    For example, to change the warning threshold to 90%:
    1. On the cluster level of the Reclass model, create a common file for
       all alert customizations. Skip this step if such a file already
       exists.
       Create a file for alert customizations:
           touch cluster/<cluster_name>/stacklight/custom/alerts.yml
       Define the new file in cluster/<cluster_name>/stacklight/server.yml:
           classes:
           - cluster.<cluster_name>.stacklight.custom.alerts
           ...
    2. In the defined alert customizations file, modify the alert threshold
       by overriding the if parameter:
           parameters:
             prometheus:
               server:
                 alert:
                   PrometheusRemoteStorageQueueFullWarning:
                     if: >-
                       prometheus_remote_storage_queue_length /
                       prometheus_remote_storage_queue_capacity * 100 > 90
    3. From the Salt Master node, apply the changes:
           salt 'I@prometheus:server' state.sls prometheus.server
    4. Verify the updated alert definition in the Prometheus web UI.

PrometheusRelayServiceDown

Severity
    Minor
Summary
    The Prometheus Relay service on the {{ $labels.host }} node is down
    for 2 minutes.
Raise condition
    procstat_running{process_name="prometheus-relay"} == 0
Description
    Raises when Telegraf cannot find running prometheus-relay processes
    on any mtr host.
Troubleshooting
    - Verify the status of the Prometheus Relay service on the affected
      node using service prometheus-relay status.
    - Inspect the Prometheus Relay logs on the affected node using
      journalctl -u prometheus-relay.
Tuning
    Not required

PrometheusRelayServiceDownMajor

Severity
    Major
Summary
    More than 50% of Prometheus Relay services are down for 2 minutes.
Raise condition
    count(procstat_running{process_name="prometheus-relay"} == 0) >=
    count(procstat_running{process_name="prometheus-relay"}) * 0.5
Description
    Raises when Telegraf cannot find running prometheus-relay processes
    on more than 50% of the mtr hosts.
Troubleshooting
    - Inspect the PrometheusRelayServiceDown alerts for the host names of
      the affected nodes.
    - Verify the status of the Prometheus Relay service on the affected
      node using service prometheus-relay status.
    - Inspect the Prometheus Relay logs on the affected node using
      journalctl -u prometheus-relay.
Tuning
    Not required

PrometheusRelayServiceOutage

Severity
    Critical
Summary
    All Prometheus Relay services are down for 2 minutes.
Raise condition
    count(procstat_running{process_name="prometheus-relay"} == 0) ==
    count(procstat_running{process_name="prometheus-relay"})
Description
    Raises when Telegraf cannot find running prometheus-relay processes
    on all mtr hosts.
Troubleshooting
    - Inspect the PrometheusRelayServiceDown alerts for the host names of
      the affected nodes.
    - Verify the status of the Prometheus Relay service on the affected
      node using service prometheus-relay status.
    - Inspect the Prometheus Relay logs on the affected node using
      journalctl -u prometheus-relay.
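    During a full outage, checking all nodes one by one can be slow. A
    quick way to query the service status on every node that runs
    Prometheus Relay is to broadcast the check from the Salt Master node.
    This is a sketch only: the pillar target I@prometheus:relay is an
    assumption and may need to be adjusted to the targeting used in your
    deployment.
        # Run the status check on all nodes that have the prometheus:relay
        # pillar (assumed targeting), in one command from the Salt Master.
        salt -C 'I@prometheus:relay' cmd.run 'systemctl status prometheus-relay'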
Tuning
    Not required

PrometheusLTSServiceDown

Severity
    Minor
Summary
    The Prometheus long-term storage service on the {{ $labels.host }}
    node is down for 2 minutes.
Raise condition
    procstat_running{process_name="prometheus"} == 0
Description
    Raises when Telegraf cannot find running prometheus processes on
    any mtr host.
Troubleshooting
    - Verify the status of the Prometheus service on the affected node
      using service prometheus status.
    - Inspect the Prometheus logs on the affected node using
      journalctl -u prometheus.
Tuning
    Not required

PrometheusLTSServiceDownMajor

Severity
    Major
Summary
    More than 50% of the Prometheus long-term storage services are down
    for 2 minutes.
Raise condition
    count(procstat_running{process_name="prometheus"} == 0) >=
    count(procstat_running{process_name="prometheus"}) * 0.5
Description
    Raises when Telegraf cannot find running prometheus processes on
    more than 50% of the mtr hosts.
Troubleshooting
    - Inspect the PrometheusLTSServiceDown alerts for the host names of
      the affected nodes.
    - Verify the status of the Prometheus service on the affected node
      using service prometheus status.
    - Inspect the Prometheus logs on the affected node using
      journalctl -u prometheus.
Tuning
    Not required

PrometheusLTSServiceOutage

Severity
    Critical
Summary
    All Prometheus long-term storage services are down for 2 minutes.
Raise condition
    count(procstat_running{process_name="prometheus"} == 0) ==
    count(procstat_running{process_name="prometheus"})
Description
    Raises when Telegraf cannot find running prometheus processes on all
    mtr hosts.
Troubleshooting
    - Inspect the PrometheusLTSServiceDown alerts for the host names of
      the affected nodes.
    - Verify the status of the Prometheus service on the affected node
      using service prometheus status.
    - Inspect the Prometheus logs on the affected node using
      journalctl -u prometheus.
Tuning
    Not required

PrometheusRuleEvaluationsFailed
Available starting from the 2019.2.7 maintenance update.

Severity
    Warning
Summary
    The Prometheus server for the {{ $labels.job }} job on the
    {{ or $labels.host $labels.instance }} node has failed evaluations
    for recording rules. Verify the rules state in the Status/Rules
    section of the Prometheus web UI.
Raise condition
    rate(prometheus_rule_evaluation_failures_total[5m]) > 0
Description
    Raises when the number of evaluation failures of Prometheus recording
    rules continuously increases for 10 minutes. The issue typically occurs
    once you reload the Prometheus service after configuration changes.
Troubleshooting
    Verify the syntax and metrics in the recently added custom Prometheus
    recording rules in the cluster model, for example, with the syntax
    check sketched below.
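    The syntax of a rule file can be checked with promtool before the
    change is applied. This is a sketch only: it assumes that promtool is
    available on the node (or inside the Prometheus container), and the
    rule file path is a placeholder, not the actual path used by the
    deployment.
        # Validate the syntax of a Prometheus rule file; a non-zero exit
        # code indicates a syntax error.
        promtool check rules /path/to/custom_rules.yml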
Tuning
    Not required