Prometheus

This section describes the alerts for the Prometheus service.


PrometheusTargetDown

Severity

Critical

Summary

The Prometheus target for the {{ $labels.job }} job on the {{ or $labels.host $labels.instance }} node is down for 2 minutes.

Raise condition

up != 1

Description

Raises when Prometheus fails to scrape a target for 2 minutes. The root cause depends on the target type: for example, Telegraf or connectivity issues, Fluentd misconfiguration, or issues with libvirt or the JMX Exporter.

Troubleshooting

Depending on the target type:

  • Inspect the Telegraf logs using journalctl -u telegraf.

  • Inspect the Fluentd logs using journalctl -u td-agent.

  • Inspect the libvirt-exporter logs using journalctl -u libvirt-exporter.

  • Inspect the jmx-exporter logs using journalctl -u jmx-exporter.
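Before drilling into per-target logs, it can help to list every target that currently fails the up != 1 condition. The sketch below parses a response in the shape returned by Prometheus's instant-query HTTP endpoint (GET /api/v1/query); the jobs, instances, and timestamp in the sample payload are made up for illustration.

```python
import json

# Illustrative payload in the shape returned by
# GET /api/v1/query?query=up%20!=%201 (hosts and jobs are made up).
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"job": "telegraf", "instance": "cmp01:9126"},
       "value": [1710000000, "0"]},
      {"metric": {"job": "fluentd", "instance": "log02:24231"},
       "value": [1710000000, "0"]}
    ]
  }
}
""")

def down_targets(response):
    """Return (job, instance) pairs for every series matched by up != 1."""
    if response.get("status") != "success":
        raise RuntimeError("query failed: %s" % response.get("error"))
    return [(r["metric"].get("job"), r["metric"].get("instance"))
            for r in response["data"]["result"]]

assert down_targets(SAMPLE_RESPONSE) == [
    ("telegraf", "cmp01:9126"),
    ("fluentd", "log02:24231"),
]
```

Against a live deployment, the same JSON can be fetched with, for example, curl 'http://<prometheus_host>:9090/api/v1/query?query=up%20!=%201'.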

Tuning

Not required

PrometheusTargetSamplesOrderWarning

Removed since the 2019.2.4 maintenance update.

Severity

Warning

Summary

{{ $value }} Prometheus samples on the {{ $labels.instance }} instance are out of order (as measured over the last minute).

Raise condition

increase(prometheus_target_scrapes_sample_out_of_order_total[1m]) > 0

Description

Raises when Prometheus observes out-of-order time series samples.
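This alert and the two sample alerts that follow all use the same pattern: increase(counter[1m]) > 0, which fires whenever the counter grew at all during the last minute. A minimal sketch of that semantics, assuming no counter reset inside the window (real PromQL additionally handles resets and extrapolates to the window boundaries):

```python
# Sketch of `increase(counter[1m]) > 0`: the alert fires whenever the
# out-of-order counter grew at all within the lookback window.
def increase(samples):
    """samples: [(timestamp, counter_value), ...] inside the window."""
    if len(samples) < 2:
        return 0.0
    return samples[-1][1] - samples[0][1]

window = [(0, 120.0), (30, 120.0), (60, 123.0)]  # counter grew by 3
assert increase(window) == 3.0   # > 0, so the alert condition is met
assert increase([(0, 5.0)]) == 0.0  # a single sample cannot trigger it
```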

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Tuning

Disable the alert as described in Manage alerts.

PrometheusTargetSamplesBoundsWarning

Removed since the 2019.2.4 maintenance update.

Severity

Warning

Summary

{{ $value }} Prometheus samples on the {{ $labels.instance }} instance have time stamps out of bounds (as measured over the last minute).

Raise condition

increase(prometheus_target_scrapes_sample_out_of_bounds_total[1m]) > 0

Description

Raises when Prometheus observes samples with time stamps greater than the current time.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Tuning

Disable the alert as described in Manage alerts.

PrometheusTargetSamplesDuplicateWarning

Removed since the 2019.2.4 maintenance update.

Severity

Warning

Summary

{{ $value }} Prometheus samples on the {{ $labels.instance }} instance have duplicate time stamps (as measured over the last minute).

Raise condition

increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[1m]) > 0

Description

Raises when Prometheus observes samples with duplicated time stamps.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Tuning

Disable the alert as described in Manage alerts.

PrometheusDataIngestionWarning

Removed since the 2019.2.4 maintenance update.

Severity

Warning

Summary

The Prometheus service writes on the {{ $labels.instance }} instance do not keep up with the data ingestion speed for 10 minutes.

Raise condition

prometheus_local_storage_rushed_mode != 0

Description

Raises when the Prometheus service writes do not keep up with the data ingestion speed for 10 minutes.

Warning

The alert is deprecated for Prometheus versions newer than 1.7 and has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Tuning

Disable the alert as described in Manage alerts.

PrometheusRemoteStorageQueueFullWarning

Severity

Warning

Summary

The Prometheus remote storage queue on the {{ $labels.instance }} instance is 75% full for 2 minutes.

Raise condition

prometheus_remote_storage_queue_length / prometheus_remote_storage_queue_capacity * 100 > 75

Description

Raises when the Prometheus remote write queue is 75% full.

Troubleshooting

Inspect the remote write service configuration in the remote_write section in /srv/volumes/local/prometheus/config/prometheus.yml.
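The raise condition compares queue length to capacity as a percentage against a threshold (75 by default, 90 in the Tuning example below). A minimal sketch of that check:

```python
# Sketch of the PrometheusRemoteStorageQueueFullWarning condition:
# prometheus_remote_storage_queue_length /
#   prometheus_remote_storage_queue_capacity * 100 > threshold
def queue_alert_fires(length, capacity, threshold_percent=75):
    """True when the remote write queue exceeds the given fullness threshold."""
    return length / capacity * 100 > threshold_percent

assert queue_alert_fires(80, 100)          # 80% > 75%: fires
assert not queue_alert_fires(50, 100)      # 50% is below the threshold
assert not queue_alert_fires(80, 100, threshold_percent=90)  # raised threshold
```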

Tuning

For example, to change the warning threshold to 90%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file is already defined.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            PrometheusRemoteStorageQueueFullWarning:
              if: >-
                prometheus_remote_storage_queue_length /
                prometheus_remote_storage_queue_capacity * 100 > 90
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

PrometheusRelayServiceDown

Severity

Minor

Summary

The Prometheus Relay service on the {{ $labels.host }} node is down for 2 minutes.

Raise condition

procstat_running{process_name="prometheus-relay"} == 0

Description

Raises when Telegraf cannot find running prometheus-relay processes on any mtr host.

Troubleshooting

  • Verify the status of the Prometheus Relay service on the affected node using service prometheus-relay status.

  • Inspect the Prometheus Relay logs on the affected node using journalctl -u prometheus-relay.

Tuning

Not required

PrometheusRelayServiceDownMajor

Severity

Major

Summary

More than 50% of Prometheus Relay services are down for 2 minutes.

Raise condition

count(procstat_running{process_name="prometheus-relay"} == 0) >= count(procstat_running{process_name="prometheus-relay"}) * 0.5

Description

Raises when Telegraf cannot find running prometheus-relay processes on more than 50% of the mtr hosts.

Troubleshooting

  • Inspect the PrometheusRelayServiceDown alerts for the host names of the affected nodes.

  • Verify the status of the Prometheus Relay service on the affected node using service prometheus-relay status.

  • Inspect the Prometheus Relay logs on the affected node using journalctl -u prometheus-relay.

Tuning

Not required

PrometheusRelayServiceOutage

Severity

Critical

Summary

All Prometheus Relay services are down for 2 minutes.

Raise condition

count(procstat_running{process_name="prometheus-relay"} == 0) == count(procstat_running{process_name="prometheus-relay"})

Description

Raises when Telegraf cannot find running prometheus-relay processes on all mtr hosts.

Troubleshooting

  • Inspect the PrometheusRelayServiceDown alerts for the host names of the affected nodes.

  • Verify the status of the Prometheus Relay service on the affected node using service prometheus-relay status.

  • Inspect the Prometheus Relay logs on the affected node using journalctl -u prometheus-relay.

Tuning

Not required
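The three Prometheus Relay alerts form a single severity tier based on how many mtr hosts report a stopped process; the long-term storage alerts below apply the same pattern to the prometheus process. A sketch of that tiering, where running maps each mtr host to its procstat_running value (1 up, 0 down):

```python
# Tiered severity behind PrometheusRelayServiceDown / ...DownMajor /
# ...Outage. Host names here are illustrative.
def relay_severity(running):
    total = len(running)
    down = sum(1 for v in running.values() if v == 0)
    if down == total:
        return "Critical"   # PrometheusRelayServiceOutage: all relays down
    if down >= total * 0.5:
        return "Major"      # PrometheusRelayServiceDownMajor: >= 50% down
    if down > 0:
        return "Minor"      # PrometheusRelayServiceDown: at least one down
    return None             # no alert

assert relay_severity({"mtr01": 0, "mtr02": 1, "mtr03": 1}) == "Minor"
assert relay_severity({"mtr01": 0, "mtr02": 0, "mtr03": 1}) == "Major"
assert relay_severity({"mtr01": 0, "mtr02": 0, "mtr03": 0}) == "Critical"
```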

PrometheusLTSServiceDown

Severity

Minor

Summary

The Prometheus long-term storage service on the {{ $labels.host }} node is down for 2 minutes.

Raise condition

procstat_running{process_name="prometheus"} == 0

Description

Raises when Telegraf cannot find running prometheus processes on any mtr host.

Troubleshooting

  • Verify the status of the Prometheus service on the affected node using service prometheus status.

  • Inspect the Prometheus logs on the affected node using journalctl -u prometheus.

Tuning

Not required

PrometheusLTSServiceDownMajor

Severity

Major

Summary

More than 50% of the Prometheus long-term storage services are down for 2 minutes.

Raise condition

count(procstat_running{process_name="prometheus"} == 0) >= count(procstat_running{process_name="prometheus"}) * 0.5

Description

Raises when Telegraf cannot find running prometheus processes on more than 50% of the mtr hosts.

Troubleshooting

  • Inspect the PrometheusLTSServiceDown alerts for the host names of the affected nodes.

  • Verify the status of the Prometheus service on the affected node using service prometheus status.

  • Inspect the Prometheus logs on the affected node using journalctl -u prometheus.

Tuning

Not required

PrometheusLTSServiceOutage

Severity

Critical

Summary

All Prometheus long-term storage services are down for 2 minutes.

Raise condition

count(procstat_running{process_name="prometheus"} == 0) == count(procstat_running{process_name="prometheus"})

Description

Raises when Telegraf cannot find running prometheus processes on all mtr hosts.

Troubleshooting

  • Inspect the PrometheusLTSServiceDown alerts for the host names of the affected nodes.

  • Verify the status of the Prometheus service on the affected node using service prometheus status.

  • Inspect the Prometheus logs on the affected node using journalctl -u prometheus.

Tuning

Not required

PrometheusRuleEvaluationsFailed

Available starting from the 2019.2.7 maintenance update.

Severity

Warning

Summary

The Prometheus server for the {{ $labels.job }} job on the {{ or $labels.host $labels.instance }} node has failed evaluations for recording rules. Verify the rules state in the Status/Rules section of the Prometheus web UI.

Raise condition

rate(prometheus_rule_evaluation_failures_total[5m]) > 0

Description

Raises when the number of evaluation failures of Prometheus recording rules continuously increases for 10 minutes. The issue typically occurs after you reload the Prometheus service following configuration changes.
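The raise condition uses rate(), the per-second rate of a counter over the lookback window: any failed rule evaluation inside the window makes the rate positive. A sketch, assuming no counter reset inside the window (real PromQL also handles resets and boundary extrapolation):

```python
# Sketch of `rate(prometheus_rule_evaluation_failures_total[5m]) > 0`.
def per_second_rate(samples):
    """samples: [(timestamp_seconds, counter_value), ...] inside the window."""
    if len(samples) < 2:
        return 0.0
    dt = samples[-1][0] - samples[0][0]
    dv = samples[-1][1] - samples[0][1]
    return dv / dt if dt else 0.0

healthy = [(0, 40.0), (150, 40.0), (300, 40.0)]   # no new failures
failing = [(0, 40.0), (150, 41.0), (300, 43.0)]   # 3 failures in 5 minutes

assert per_second_rate(healthy) == 0.0
assert per_second_rate(failing) > 0   # alert condition met
```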

Troubleshooting

Verify the syntax and metrics in the recently added custom Prometheus recording rules in the cluster model.

Tuning

Not required