This section describes the alerts for the Prometheus service.
Severity | Critical |
---|---|
Summary | The Prometheus target for the {{ $labels.job }} job on the {{ or $labels.host $labels.instance }} node is down for 2 minutes. |
Raise condition | up != 1 |
Description | Raises when Prometheus fails to scrape a target for 2 minutes. The possible reasons depend on the target type, for example, Telegraf-related or connectivity issues, Fluentd misconfiguration, or issues with libvirt or the JMX Exporter. |
Troubleshooting | Depending on the target type, inspect the related exporter (Telegraf, Fluentd, libvirt, or the JMX Exporter) and verify the connectivity between Prometheus and the target. |
Tuning | Not required |
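To identify which targets trigger this alert, you can run the raise condition directly in the Prometheus web UI. The following sketch uses only the expression from the table; the `count by (job)` aggregation is an illustrative refinement for triage, not part of the shipped alert:

```promql
# All scrape targets that Prometheus currently fails to scrape.
up != 1

# Illustrative: count the down targets per job for faster triage.
count by (job) (up != 1)
```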
Removed since the 2019.2.4 maintenance update.
Severity | Warning |
---|---|
Summary | {{ $value }} Prometheus samples on the {{ $labels.instance }} instance are out of order (as measured over the last minute). |
Raise condition | increase(prometheus_target_scrapes_sample_out_of_order_total[1m]) > 0 |
Description | Raises when Prometheus observes out-of-order time series samples. Warning: The alert has been removed starting from the 2019.2.4 maintenance update. For existing MCP deployments, disable this alert. |
Tuning | Disable the alert as described in Manage alerts; a sketch follows the table. |
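A minimal sketch of disabling a removed alert on the cluster level of the Reclass model. The alert name PrometheusSamplesOrder and the prometheus:server:alert pillar path are assumptions; verify both against your deployment and follow Manage alerts for the authoritative procedure:

```yaml
# Cluster model override: disable the removed alert.
# The alert name below is an assumption; confirm it in the
# Status/Rules section of the Prometheus web UI first.
parameters:
  prometheus:
    server:
      alert:
        PrometheusSamplesOrder:
          enabled: false
```

The same override pattern applies to the other alerts removed in the 2019.2.4 maintenance update; only the alert name changes.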
Removed since the 2019.2.4 maintenance update.
Severity | Warning |
---|---|
Summary | {{ $value }} Prometheus samples on the {{ $labels.instance }} instance have time stamps out of bounds (as measured over the last minute). |
Raise condition | increase(prometheus_target_scrapes_sample_out_of_bounds_total[1m]) > 0 |
Description | Raises when Prometheus observes samples with time stamps greater than the current time. Warning: The alert has been removed starting from the 2019.2.4 maintenance update. For existing MCP deployments, disable this alert. |
Tuning | Disable the alert as described in Manage alerts. |
Removed since the 2019.2.4 maintenance update.
Severity | Warning |
---|---|
Summary | {{ $value }} Prometheus samples on the {{ $labels.instance }} instance have duplicate time stamps (as measured over the last minute). |
Raise condition | increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[1m]) > 0 |
Description | Raises when Prometheus observes samples with duplicated time stamps. Warning: The alert has been removed starting from the 2019.2.4 maintenance update. For existing MCP deployments, disable this alert. |
Tuning | Disable the alert as described in Manage alerts. |
Removed since the 2019.2.4 maintenance update.
Severity | Warning |
---|---|
Summary | The Prometheus service writes on the {{ $labels.instance }} instance do not keep up with the data ingestion speed for 10 minutes. |
Raise condition | prometheus_local_storage_rushed_mode != 0 |
Description | Raises when the Prometheus service writes do not keep up with the data ingestion speed for 10 minutes. Warning: The alert is deprecated for Prometheus versions newer than 1.7 and has been removed starting from the 2019.2.4 maintenance update. For existing MCP deployments, disable this alert. |
Tuning | Disable the alert as described in Manage alerts. |
Severity | Warning |
---|---|
Summary | The Prometheus remote storage queue on the {{ $labels.instance }} instance is 75% full for 2 minutes. |
Raise condition | prometheus_remote_storage_queue_length / prometheus_remote_storage_queue_capacity * 100 > 75 |
Description | Raises when the Prometheus remote write queue is 75% full. |
Troubleshooting | Inspect the remote write service configuration in the remote_write section of /srv/volumes/local/prometheus/config/prometheus.yml. |
Tuning | To change the warning threshold, override the alert in the cluster model, for example, as shown in the sketch below the table. |
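A minimal sketch of raising the warning threshold from 75% to 90%. The alert name PrometheusRemoteStorageQueue, the 90% value, and the prometheus:server:alert pillar path are assumptions; adapt them to your cluster model:

```yaml
# Cluster model override: raise the remote storage queue
# warning threshold from 75% to 90%. The alert name is an
# assumption; confirm it in the Prometheus web UI first.
parameters:
  prometheus:
    server:
      alert:
        PrometheusRemoteStorageQueue:
          if: >-
            prometheus_remote_storage_queue_length /
            prometheus_remote_storage_queue_capacity * 100 > 90
```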
Severity | Minor |
---|---|
Summary | The Prometheus Relay service on the {{ $labels.host }} node is down for 2 minutes. |
Raise condition | procstat_running{process_name="prometheus-relay"} == 0 |
Description | Raises when Telegraf cannot find running prometheus-relay processes on any mtr host. |
Troubleshooting | Inspect the prometheus-relay service on the affected mtr node. |
Tuning | Not required |
Severity | Major |
---|---|
Summary | More than 50% of Prometheus Relay services are down for 2 minutes. |
Raise condition | count(procstat_running{process_name="prometheus-relay"} == 0) >= count(procstat_running{process_name="prometheus-relay"}) * 0.5 |
Description | Raises when Telegraf cannot find running prometheus-relay processes on more than 50% of the mtr hosts. |
Troubleshooting | Inspect the prometheus-relay services on the affected mtr nodes. |
Tuning | Not required |
Severity | Critical |
---|---|
Summary | All Prometheus Relay services are down for 2 minutes. |
Raise condition | count(procstat_running{process_name="prometheus-relay"} == 0) == count(procstat_running{process_name="prometheus-relay"}) |
Description | Raises when Telegraf cannot find running prometheus-relay processes on all mtr hosts. |
Troubleshooting | Inspect the prometheus-relay services on all mtr nodes. |
Tuning | Not required |
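The three Prometheus Relay alerts use the same procstat metric and differ only in the fraction of down processes. As a triage aid, the following query, a sketch derived from the raise conditions above, returns the current fraction of prometheus-relay processes that are down; the same expression with process_name="prometheus" applies to the long-term storage alerts that follow:

```promql
# Fraction of prometheus-relay processes that are currently down:
# 0.5 or more raises the Major alert, 1 raises the Critical one.
# Note: returns no data when every relay process is running,
# because the == 0 filter leaves an empty vector.
count(procstat_running{process_name="prometheus-relay"} == 0)
  / count(procstat_running{process_name="prometheus-relay"})
```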
Severity | Minor |
---|---|
Summary | The Prometheus long-term storage service on the {{ $labels.host }} node is down for 2 minutes. |
Raise condition | procstat_running{process_name="prometheus"} == 0 |
Description | Raises when Telegraf cannot find running prometheus processes on any mtr host. |
Troubleshooting | Inspect the prometheus service on the affected mtr node. |
Tuning | Not required |
Severity | Major |
---|---|
Summary | More than 50% of the Prometheus long-term storage services are down for 2 minutes. |
Raise condition | count(procstat_running{process_name="prometheus"} == 0) >= count(procstat_running{process_name="prometheus"}) * 0.5 |
Description | Raises when Telegraf cannot find running prometheus processes on more than 50% of the mtr hosts. |
Troubleshooting | Inspect the prometheus services on the affected mtr nodes. |
Tuning | Not required |
Severity | Critical |
---|---|
Summary | All Prometheus long-term storage services are down for 2 minutes. |
Raise condition | count(procstat_running{process_name="prometheus"} == 0) == count(procstat_running{process_name="prometheus"}) |
Description | Raises when Telegraf cannot find running prometheus processes on all mtr hosts. |
Troubleshooting | Inspect the prometheus services on all mtr nodes. |
Tuning | Not required |
Available starting from the 2019.2.7 maintenance update.
Severity | Warning |
---|---|
Summary | The Prometheus server for the {{ $labels.job }} job on the {{ or $labels.host $labels.instance }} node has failed evaluations for recording rules. Verify the rules state in the Status/Rules section of the Prometheus web UI. |
Raise condition | rate(prometheus_rule_evaluation_failures_total[5m]) > 0 |
Description | Raises when the number of evaluation failures of the Prometheus recording rules continuously increases for 10 minutes. The issue typically occurs after you reload the Prometheus service with configuration changes. |
Troubleshooting | Verify the syntax and metrics in the recently added custom Prometheus recording rules in the cluster model. A query sketch for locating the failing rule groups follows the table. |
Tuning | Not required |
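To narrow down which rules fail, you can query the underlying counter directly. The sketch below assumes your Prometheus version exports the rule_group label on this metric, which holds for Prometheus 2.x but may not for older versions:

```promql
# Show only the rule groups whose evaluations currently fail.
# The rule_group label is an assumption; drop the aggregation
# if your Prometheus version does not expose it.
sum by (rule_group) (rate(prometheus_rule_evaluation_failures_total[5m])) > 0
```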