Prometheus

This section describes the alerts for the Prometheus service.


PrometheusTargetDown

Severity

Critical

Summary

The Prometheus target for the {{ $labels.job }} job on the {{ or $labels.host $labels.instance }} node is down for 2 minutes.

Raise condition

up != 1

Description

Raises when Prometheus fails to scrape a target for 2 minutes. The root cause depends on the target type: for example, Telegraf or connectivity issues, Fluentd misconfiguration, or issues with libvirt or the JMX Exporter.

Troubleshooting

Depending on the target type:

  • Inspect the Telegraf logs using journalctl -u telegraf.

  • Inspect the Fluentd logs using journalctl -u td-agent.

  • Inspect the libvirt-exporter logs using journalctl -u libvirt-exporter.

  • Inspect the jmx-exporter logs using journalctl -u jmx-exporter.
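Before drilling into per-target logs, it can help to list every target that currently fails the up != 1 condition. The sketch below parses a response in the shape returned by Prometheus's instant-query HTTP endpoint (GET /api/v1/query); the jobs, instances, and timestamp in the sample payload are made up for illustration.

```python
import json

# Illustrative payload in the shape returned by
# GET /api/v1/query?query=up%20!=%201 (hosts and jobs are made up).
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"job": "telegraf", "instance": "cmp01:9126"},
       "value": [1710000000, "0"]},
      {"metric": {"job": "fluentd", "instance": "log02:24231"},
       "value": [1710000000, "0"]}
    ]
  }
}
""")

def down_targets(response):
    """Return (job, instance) pairs for every series matched by up != 1."""
    if response.get("status") != "success":
        raise RuntimeError("query failed: %s" % response.get("error"))
    return [(r["metric"].get("job"), r["metric"].get("instance"))
            for r in response["data"]["result"]]

assert down_targets(SAMPLE_RESPONSE) == [
    ("telegraf", "cmp01:9126"),
    ("fluentd", "log02:24231"),
]
```

Against a live deployment, the same JSON can be fetched with, for example, curl 'http://<prometheus_host>:9090/api/v1/query?query=up%20!=%201'.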

Tuning

Not required

PrometheusTargetSamplesOrderWarning

Removed since the 2019.2.4 maintenance update.

Severity

Warning

Summary

{{ $value }} Prometheus samples on the {{ $labels.instance }} instance are out of order (as measured over the last minute).

Raise condition

increase(prometheus_target_scrapes_sample_out_of_order_total[1m]) > 0

Description

Raises when Prometheus observes out-of-order time series samples.
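This alert and the two sample alerts that follow all use the same pattern: increase(counter[1m]) > 0, which fires whenever the counter grew at all during the last minute. A minimal sketch of that semantics, assuming no counter reset inside the window (real PromQL additionally handles resets and extrapolates to the window boundaries):

```python
# Sketch of `increase(counter[1m]) > 0`: the alert fires whenever the
# out-of-order counter grew at all within the lookback window.
def increase(samples):
    """samples: [(timestamp, counter_value), ...] inside the window."""
    if len(samples) < 2:
        return 0.0
    return samples[-1][1] - samples[0][1]

window = [(0, 120.0), (30, 120.0), (60, 123.0)]  # counter grew by 3
assert increase(window) == 3.0   # > 0, so the alert condition is met
assert increase([(0, 5.0)]) == 0.0  # a single sample cannot trigger it
```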

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Tuning

Disable the alert as described in Manage alerts.

PrometheusTargetSamplesBoundsWarning

Removed since the 2019.2.4 maintenance update.

Severity

Warning

Summary

{{ $value }} Prometheus samples on the {{ $labels.instance }} instance have time stamps out of bounds (as measured over the last minute).

Raise condition

increase(prometheus_target_scrapes_sample_out_of_bounds_total[1m]) > 0

Description

Raises when Prometheus observes samples with time stamps greater than the current time.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Tuning

Disable the alert as described in Manage alerts.

PrometheusTargetSamplesDuplicateWarning

Removed since the 2019.2.4 maintenance update.

Severity

Warning

Summary

{{ $value }} Prometheus samples on the {{ $labels.instance }} instance have duplicate time stamps (as measured over the last minute).

Raise condition

increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[1m]) > 0

Description

Raises when Prometheus observes samples with duplicated time stamps.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Tuning

Disable the alert as described in Manage alerts.

PrometheusDataIngestionWarning

Removed since the 2019.2.4 maintenance update.

Severity

Warning

Summary

The Prometheus service writes on the {{ $labels.instance }} instance do not keep up with the data ingestion speed for 10 minutes.

Raise condition

prometheus_local_storage_rushed_mode != 0

Description

Raises when the Prometheus service writes do not keep up with the data ingestion speed for 10 minutes.

Warning

The alert is deprecated for Prometheus versions newer than 1.7 and has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Tuning

Disable the alert as described in Manage alerts.

PrometheusRemoteStorageQueueFullWarning

Severity

Warning

Summary

The Prometheus remote storage queue on the {{ $labels.instance }} instance is 75% full for 2 minutes.

Raise condition

prometheus_remote_storage_queue_length / prometheus_remote_storage_queue_capacity * 100 > 75

Description

Raises when the Prometheus remote write queue is 75% full.

Troubleshooting

Inspect the remote write service configuration in the remote_write section in /srv/volumes/local/prometheus/config/prometheus.yml.
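The raise condition compares queue length to capacity as a percentage against a threshold (75 by default, 90 in the Tuning example below). A minimal sketch of that check:

```python
# Sketch of the PrometheusRemoteStorageQueueFullWarning condition:
# prometheus_remote_storage_queue_length /
#   prometheus_remote_storage_queue_capacity * 100 > threshold
def queue_alert_fires(length, capacity, threshold_percent=75):
    """True when the remote write queue exceeds the given fullness threshold."""
    return length / capacity * 100 > threshold_percent

assert queue_alert_fires(80, 100)          # 80% > 75%: fires
assert not queue_alert_fires(50, 100)      # 50% is below the threshold
assert not queue_alert_fires(80, 100, threshold_percent=90)  # raised threshold
```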

Tuning

For example, to change the warning threshold to 90%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file is already defined.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            PrometheusRemoteStorageQueueFullWarning:
              if: >-
                prometheus_remote_storage_queue_length /
                prometheus_remote_storage_queue_capacity * 100 > 90
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

PrometheusRelayServiceDown

Severity

Minor

Summary

The Prometheus Relay service on the {{ $labels.host }} node is down for 2 minutes.

Raise condition

procstat_running{process_name="prometheus-relay"} == 0

Description

Raises when Telegraf cannot find running prometheus-relay processes on any mtr host.

Troubleshooting

  • Verify the status of the Prometheus Relay service on the affected node using service prometheus-relay status.

  • Inspect the Prometheus Relay logs on the affected node using journalctl -u prometheus-relay.

Tuning

Not required

PrometheusRelayServiceDownMajor

Severity

Major

Summary

More than 50% of Prometheus Relay services are down for 2 minutes.

Raise condition

count(procstat_running{process_name="prometheus-relay"} == 0) >= count(procstat_running{process_name="prometheus-relay"}) * 0.5

Description

Raises when Telegraf cannot find running prometheus-relay processes on more than 50% of the mtr hosts.

Troubleshooting

  • Inspect the PrometheusRelayServiceDown alerts for the host names of the affected nodes.

  • Verify the status of the Prometheus Relay service on the affected node using service prometheus-relay status.

  • Inspect the Prometheus Relay logs on the affected node using journalctl -u prometheus-relay.

Tuning

Not required

PrometheusRelayServiceOutage

Severity

Critical

Summary

All Prometheus Relay services are down for 2 minutes.

Raise condition

count(procstat_running{process_name="prometheus-relay"} == 0) == count(procstat_running{process_name="prometheus-relay"})

Description

Raises when Telegraf cannot find running prometheus-relay processes on all mtr hosts.

Troubleshooting

  • Inspect the PrometheusRelayServiceDown alerts for the host names of the affected nodes.

  • Verify the status of the Prometheus Relay service on the affected node using service prometheus-relay status.

  • Inspect the Prometheus Relay logs on the affected node using journalctl -u prometheus-relay.

Tuning

Not required
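The three Prometheus Relay alerts form a single severity tier based on how many mtr hosts report a stopped process; the long-term storage alerts below apply the same pattern to the prometheus process. A sketch of that tiering, where running maps each mtr host to its procstat_running value (1 up, 0 down):

```python
# Tiered severity behind PrometheusRelayServiceDown / ...DownMajor /
# ...Outage. Host names here are illustrative.
def relay_severity(running):
    total = len(running)
    down = sum(1 for v in running.values() if v == 0)
    if down == total:
        return "Critical"   # PrometheusRelayServiceOutage: all relays down
    if down >= total * 0.5:
        return "Major"      # PrometheusRelayServiceDownMajor: >= 50% down
    if down > 0:
        return "Minor"      # PrometheusRelayServiceDown: at least one down
    return None             # no alert

assert relay_severity({"mtr01": 0, "mtr02": 1, "mtr03": 1}) == "Minor"
assert relay_severity({"mtr01": 0, "mtr02": 0, "mtr03": 1}) == "Major"
assert relay_severity({"mtr01": 0, "mtr02": 0, "mtr03": 0}) == "Critical"
```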

PrometheusLTSServiceDown

Severity

Minor

Summary

The Prometheus long-term storage service on the {{ $labels.host }} node is down for 2 minutes.

Raise condition

procstat_running{process_name="prometheus"} == 0

Description

Raises when Telegraf cannot find running prometheus processes on any mtr host.

Troubleshooting

  • Verify the status of the Prometheus service on the affected node using service prometheus status.

  • Inspect the Prometheus logs on the affected node using journalctl -u prometheus.

Tuning

Not required

PrometheusLTSServiceDownMajor

Severity

Major

Summary

More than 50% of the Prometheus long-term storage services are down for 2 minutes.

Raise condition

count(procstat_running{process_name="prometheus"} == 0) >= count(procstat_running{process_name="prometheus"}) * 0.5

Description

Raises when Telegraf cannot find running prometheus processes on more than 50% of the mtr hosts.

Troubleshooting

  • Inspect the PrometheusLTSServiceDown alerts for the host names of the affected nodes.

  • Verify the status of the Prometheus service on the affected node using service prometheus status.

  • Inspect the Prometheus logs on the affected node using journalctl -u prometheus.

Tuning

Not required

PrometheusLTSServiceOutage

Severity

Critical

Summary

All Prometheus long-term storage services are down for 2 minutes.

Raise condition

count(procstat_running{process_name="prometheus"} == 0) == count(procstat_running{process_name="prometheus"})

Description

Raises when Telegraf cannot find running prometheus processes on all mtr hosts.

Troubleshooting

  • Inspect the PrometheusLTSServiceDown alerts for the host names of the affected nodes.

  • Verify the status of the Prometheus service on the affected node using service prometheus status.

  • Inspect the Prometheus logs on the affected node using journalctl -u prometheus.

Tuning

Not required

PrometheusRuleEvaluationsFailed

Available starting from the 2019.2.7 maintenance update.

Severity

Warning

Summary

The Prometheus server for the {{ $labels.job }} job on the {{ or $labels.host $labels.instance }} node has failed evaluations for recording rules. Verify the rules state in the Status/Rules section of the Prometheus web UI.

Raise condition

rate(prometheus_rule_evaluation_failures_total[5m]) > 0

Description

Raises when the number of evaluation failures of Prometheus recording rules continuously increases for 10 minutes. The issue typically occurs after you reload the Prometheus service following configuration changes.
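The raise condition uses rate(), the per-second rate of a counter over the lookback window: any failed rule evaluation inside the window makes the rate positive. A sketch, assuming no counter reset inside the window (real PromQL also handles resets and boundary extrapolation):

```python
# Sketch of `rate(prometheus_rule_evaluation_failures_total[5m]) > 0`.
def per_second_rate(samples):
    """samples: [(timestamp_seconds, counter_value), ...] inside the window."""
    if len(samples) < 2:
        return 0.0
    dt = samples[-1][0] - samples[0][0]
    dv = samples[-1][1] - samples[0][1]
    return dv / dt if dt else 0.0

healthy = [(0, 40.0), (150, 40.0), (300, 40.0)]   # no new failures
failing = [(0, 40.0), (150, 41.0), (300, 43.0)]   # 3 failures in 5 minutes

assert per_second_rate(healthy) == 0.0
assert per_second_rate(failing) > 0   # alert condition met
```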

Troubleshooting

Verify the syntax and metrics in the recently added custom Prometheus recording rules in the cluster model.

Tuning

Not required