Prometheus

This section describes the alerts for the Prometheus service.


PrometheusTargetDown

Severity Critical
Summary The Prometheus target for the {{ $labels.job }} job on the {{ or $labels.host $labels.instance }} node is down for 2 minutes.
Raise condition up != 1
Description Raises when Prometheus fails to scrape a target for 2 minutes. The possible reasons depend on the target type and include, for example, Telegraf or connectivity issues, Fluentd misconfiguration, or issues with libvirt or JMX Exporter.
Troubleshooting

Depending on the target type:

  • Inspect the Telegraf logs using journalctl -u telegraf.
  • Inspect the Fluentd logs using journalctl -u td-agent.
  • Inspect the libvirt-exporter logs using journalctl -u libvirt-exporter.
  • Inspect the jmx-exporter logs using journalctl -u jmx-exporter.
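To list the targets that are currently down, you can also query the Prometheus HTTP API with the alert expression. A minimal sketch, assuming the Prometheus server API is reachable at localhost:9090 (adjust the endpoint to your deployment):

  # Return every scrape target for which the up metric is not 1
  curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up != 1'

The job and instance labels in the response identify the failing target and the node to troubleshoot.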
Tuning Not required

PrometheusTargetSamplesOrderWarning

Removed since the 2019.2.4 maintenance update.

Severity Warning
Summary {{ $value }} Prometheus samples on the {{ $labels.instance }} instance are out of order (as measured over the last minute).
Raise condition increase(prometheus_target_scrapes_sample_out_of_order_total[1m]) > 0
Description

Raises when Prometheus observes time series samples with a wrong order.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Tuning Disable the alert as described in Manage alerts.
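For reference, disabling an alert uses the same cluster model override pattern as the threshold customization shown later in this section. A minimal sketch, assuming the enabled parameter described in Manage alerts:

  parameters:
    prometheus:
      server:
        alert:
          PrometheusTargetSamplesOrderWarning:
            # The enabled flag is assumed from the Manage alerts procedure
            enabled: false

Apply the change from the Salt Master node with salt 'I@prometheus:server' state.sls prometheus.server, as in the tuning procedure below.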

PrometheusTargetSamplesBoundsWarning

Removed since the 2019.2.4 maintenance update.

Severity Warning
Summary {{ $value }} Prometheus samples on the {{ $labels.instance }} instance have time stamps out of bounds (as measured over the last minute).
Raise condition increase(prometheus_target_scrapes_sample_out_of_bounds_total[1m]) > 0
Description

Raises when Prometheus observes samples with time stamps greater than the current time.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Tuning Disable the alert as described in Manage alerts.

PrometheusTargetSamplesDuplicateWarning

Removed since the 2019.2.4 maintenance update.

Severity Warning
Summary {{ $value }} Prometheus samples on the {{ $labels.instance }} instance have duplicate time stamps (as measured over the last minute).
Raise condition increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[1m]) > 0
Description

Raises when Prometheus observes samples with duplicated time stamps.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Tuning Disable the alert as described in Manage alerts.

PrometheusDataIngestionWarning

Removed since the 2019.2.4 maintenance update.

Severity Warning
Summary The Prometheus service writes on the {{ $labels.instance }} instance do not keep up with the data ingestion speed for 10 minutes.
Raise condition prometheus_local_storage_rushed_mode != 0
Description

Raises when the Prometheus service writes do not keep up with the data ingestion speed for 10 minutes.

Warning

The alert is deprecated for Prometheus versions newer than 1.7 and has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Tuning Disable the alert as described in Manage alerts.

PrometheusRemoteStorageQueueFullWarning

Severity Warning
Summary The Prometheus remote storage queue on the {{ $labels.instance }} instance is 75% full for 2 minutes.
Raise condition prometheus_remote_storage_queue_length / prometheus_remote_storage_queue_capacity * 100 > 75
Description Raises when the Prometheus remote write queue is more than 75% full.
Troubleshooting Inspect the remote write service configuration in the remote_write section in /srv/volumes/local/prometheus/config/prometheus.yml.
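To check how full the queue currently is, you can evaluate the alert expression ad hoc through the Prometheus HTTP API. A minimal sketch, assuming the API is reachable at localhost:9090 (adjust the endpoint to your deployment):

  # Current remote storage queue utilization in percent
  curl -sG 'http://localhost:9090/api/v1/query' \
       --data-urlencode 'query=prometheus_remote_storage_queue_length / prometheus_remote_storage_queue_capacity * 100'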
Tuning

For example, to change the warning threshold to 90%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file has already been defined.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            PrometheusRemoteStorageQueueFullWarning:
              if: >-
                prometheus_remote_storage_queue_length /
                prometheus_remote_storage_queue_capacity * 100 > 90
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

PrometheusRelayServiceDown

Severity Minor
Summary The Prometheus Relay service on the {{$labels.host}} node is down for 2 minutes.
Raise condition procstat_running{process_name="prometheus-relay"} == 0
Description Raises when Telegraf cannot find running prometheus-relay processes on any mtr host.
Troubleshooting
  • Verify the status of the Prometheus Relay service on the affected node using service prometheus-relay status.
  • Inspect the Prometheus Relay logs on the affected node using journalctl -u prometheus-relay.
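To run these checks on all mtr nodes at once, you can use Salt from the Salt Master node. A sketch, assuming the Prometheus Relay nodes can be targeted through the prometheus:relay pillar (adjust the target to your deployment):

  # Service status on every node that runs Prometheus Relay
  salt -C 'I@prometheus:relay' cmd.run 'service prometheus-relay status'
  # Recent log lines of the service on every targeted node
  salt -C 'I@prometheus:relay' cmd.run 'journalctl -u prometheus-relay -n 50 --no-pager'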
Tuning Not required

PrometheusRelayServiceDownMajor

Severity Major
Summary More than 50% of Prometheus Relay services are down for 2 minutes.
Raise condition count(procstat_running{process_name="prometheus-relay"} == 0) >= count(procstat_running{process_name="prometheus-relay"}) * 0.5
Description Raises when Telegraf cannot find running prometheus-relay processes on more than 50% of the mtr hosts.
Troubleshooting
  • Inspect the PrometheusRelayServiceDown alerts for the host names of the affected nodes.
  • Verify the status of the Prometheus Relay service on the affected node using service prometheus-relay status.
  • Inspect the Prometheus Relay logs on the affected node using journalctl -u prometheus-relay.
Tuning Not required

PrometheusRelayServiceOutage

Severity Critical
Summary All Prometheus Relay services are down for 2 minutes.
Raise condition count(procstat_running{process_name="prometheus-relay"} == 0) == count(procstat_running{process_name="prometheus-relay"})
Description Raises when Telegraf cannot find running prometheus-relay processes on all mtr hosts.
Troubleshooting
  • Inspect the PrometheusRelayServiceDown alerts for the host names of the affected nodes.
  • Verify the status of the Prometheus Relay service on the affected node using service prometheus-relay status.
  • Inspect the Prometheus Relay logs on the affected node using journalctl -u prometheus-relay.
Tuning Not required

PrometheusLTSServiceDown

Severity Minor
Summary The Prometheus long-term storage service on the {{$labels.host}} node is down for 2 minutes.
Raise condition procstat_running{process_name="prometheus"} == 0
Description Raises when Telegraf cannot find running prometheus processes on any mtr host.
Troubleshooting
  • Verify the status of the Prometheus service on the affected node using service prometheus status.
  • Inspect the Prometheus logs on the affected node using journalctl -u prometheus.
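Similarly to the Prometheus Relay checks above, you can inspect the Prometheus long-term storage service on all mtr nodes from the Salt Master node. A sketch, assuming the mtr nodes match the mtr* hostname pattern (adjust the target to your deployment):

  # Service status and recent logs on all mtr nodes
  salt 'mtr*' cmd.run 'service prometheus status'
  salt 'mtr*' cmd.run 'journalctl -u prometheus -n 50 --no-pager'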
Tuning Not required

PrometheusLTSServiceDownMajor

Severity Major
Summary More than 50% of the Prometheus long-term storage services are down for 2 minutes.
Raise condition count(procstat_running{process_name="prometheus"} == 0) >= count(procstat_running{process_name="prometheus"}) * 0.5
Description Raises when Telegraf cannot find running prometheus processes on more than 50% of the mtr hosts.
Troubleshooting
  • Inspect the PrometheusLTSServiceDown alerts for the host names of the affected nodes.
  • Verify the status of the Prometheus service on the affected node using service prometheus status.
  • Inspect the Prometheus logs on the affected node using journalctl -u prometheus.
Tuning Not required

PrometheusLTSServiceOutage

Severity Critical
Summary All Prometheus long-term storage services are down for 2 minutes.
Raise condition count(procstat_running{process_name="prometheus"} == 0) == count(procstat_running{process_name="prometheus"})
Description Raises when Telegraf cannot find running prometheus processes on all mtr hosts.
Troubleshooting
  • Inspect the PrometheusLTSServiceDown alerts for the host names of the affected nodes.
  • Verify the status of the Prometheus service on the affected node using service prometheus status.
  • Inspect the Prometheus logs on the affected node using journalctl -u prometheus.
Tuning Not required

PrometheusRuleEvaluationsFailed

Available starting from the 2019.2.7 maintenance update.

Severity Warning
Summary The Prometheus server for the {{ $labels.job }} job on the {{ or $labels.host $labels.instance }} node has failed evaluations for recording rules. Verify the rules state in the Status/Rules section of the Prometheus web UI.
Raise condition rate(prometheus_rule_evaluation_failures_total[5m]) > 0
Description Raises when the number of evaluation failures of Prometheus recording rules continuously increases for 10 minutes. The issue typically occurs once you reload the Prometheus service after configuration changes.
Troubleshooting Verify the syntax and metrics in the recently added custom Prometheus recording rules in the cluster model.
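If you recently added custom recording rules, you can validate them before reloading Prometheus. A sketch, assuming Prometheus 2.x and a hypothetical rule file name (substitute the path used in your deployment):

  # Validate the server configuration, including referenced rule files
  promtool check config /srv/volumes/local/prometheus/config/prometheus.yml
  # Validate a specific rule file
  promtool check rules /srv/volumes/local/prometheus/config/<custom_rules_file>.yml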
Tuning Not required