InfluxDB
This section describes the alerts for InfluxDB, InfluxDB Relay, and the remote storage adapter.
Warning
InfluxDB, including InfluxDB Relay and the remote storage adapter, is deprecated in the Q4'18 MCP release and will be removed in the next release.
InfluxdbServiceDown
Severity | Minor
Summary | The InfluxDB service on the {{ $labels.host }} node is down.
Raise condition | influxdb_up == 0
Description | Raises when the InfluxDB service on one of the mtr nodes is down. The host label in the raised alert contains the host name of the affected node.
Troubleshooting |
- Verify the InfluxDB service status on the affected node using systemctl status influxdb.
- Inspect the InfluxDB service logs on the affected node using journalctl -xfu influxdb.
- Verify the available disk space using df -h (see the combined example after this list).
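For convenience, these checks can be run in one pass from a shell on the affected node. This is a minimal sketch that uses only the commands listed above:

# Check whether the InfluxDB unit is running and when it last changed state.
systemctl status influxdb

# Follow the InfluxDB journal to look for startup errors (press Ctrl+C to stop).
journalctl -xfu influxdb

# Confirm that the data partition still has free space.
df -h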
Tuning | Not required
InfluxdbServicesDownMinor
Severity | Minor
Summary | More than 30% of InfluxDB services are down.
Raise condition | count(influxdb_up == 0) >= count(influxdb_up) * 0.3
Description | Raises when InfluxDB services are down on more than 30% of the mtr nodes.
Troubleshooting |
- Inspect the
InfluxdbServiceDown alerts for the host names of the
affected nodes.
- Verify the InfluxDB service status on the affected node using
systemctl status influxdb .
- Inspect the InfluxDB service logs on the affected node using
journalctl -xfu influxdb .
- Verify the available disk space using
df -h .
|
Tuning | Not required
InfluxdbServicesDownMajor
Severity | Major
Summary | More than 60% of InfluxDB services are down.
Raise condition | count(influxdb_up == 0) >= count(influxdb_up) * 0.6
Description | Raises when InfluxDB services are down on more than 60% of the mtr nodes.
Troubleshooting |
- Inspect the
InfluxdbServiceDown alerts for the host names of the
affected nodes.
- Verify the InfluxDB service status on the affected node using
systemctl status influxdb .
- Inspect the InfluxDB service logs on the affected node using
journalctl -xfu influxdb .
- Verify the available disk space using
df -h .
|
Tuning | Not required
InfluxdbServiceOutage
Severity | Critical
Summary | All InfluxDB services are down.
Raise condition | count(influxdb_up == 0) == count(influxdb_up)
Description | Raises when InfluxDB services are down on all mtr nodes.
Troubleshooting |
- Inspect the
InfluxdbServiceDown alerts for the host names of the
affected nodes.
- Verify the InfluxDB service status on the affected node using
systemctl status influxdb .
- Inspect the InfluxDB service logs on the affected node using
journalctl -xfu influxdb .
- Verify the available disk space using
df -h .
|
Tuning | Not required
InfluxdbSeriesMaxNumberWarning
Severity | Warning
Summary | The InfluxDB database contains 950000 time series.
Raise condition | influxdb_database_numSeries >= 950000
Description | Raises when the number of series collected by InfluxDB reaches 95% of the maximum of 1 000 000 series set by the default max_series_per_database value. InfluxDB continues collecting data. However, reaching the maximum series threshold is critical.
Troubleshooting |
- Decrease the retention policy for the affected database.
- Remove unused data.
- Increase the maximum number of series to keep in the database.
|
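The following is a minimal sketch of these actions using the influx CLI on an InfluxDB node. The database, retention policy, measurement, and host values are placeholders, not names defined by this document; SHOW SERIES CARDINALITY requires InfluxDB 1.4 or later:

# Find the database that is close to the series limit.
influx -execute 'SHOW DATABASES'
influx -execute 'SHOW SERIES CARDINALITY ON "<database>"'

# Decrease the retention policy duration so that old series expire sooner
# ("autogen" is the default retention policy name; yours may differ).
influx -execute 'ALTER RETENTION POLICY "autogen" ON "<database>" DURATION 7d'

# Remove series that are no longer written, for example from a removed node.
influx -database '<database>' \
  -execute "DROP SERIES FROM \"<measurement>\" WHERE \"host\" = '<old_host>'"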
Tuning |
Typically, you should not change the default value. If the alert is constantly firing, increase the max_series_per_database parameter to a value ten times bigger.
For example, to change the threshold to 9 500 000 and the maximum number of series to 10 000 000:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:
     parameters:
       prometheus:
         server:
           alert:
             InfluxdbSeriesMaxNumberWarning:
               if: >-
                 influxdb_database_numSeries >= 9500000
3. In cluster/<cluster_name>/stacklight/telemetry.yml, add:
     parameters:
       influxdb:
         server:
           data:
             max_series_per_database: 10000000
4. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
     salt -C 'I@influxdb:server' state.sls influxdb.server
5. Verify the updated alert definition in the Prometheus web UI (or use the command-line check after this procedure).
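As an optional command-line alternative to the web UI check, the standard Prometheus query API can be used to see how close the current series count is to the new threshold. The host and port below are placeholders for your Prometheus server endpoint:

# Returns influxdb_database_numSeries per database; compare the values
# against the new 9 500 000 threshold.
curl -s 'http://<prometheus_host>:<prometheus_port>/api/v1/query?query=influxdb_database_numSeries'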
InfluxdbSeriesMaxNumberCritical
Severity | Critical
Summary | The InfluxDB database contains 1000000 time series. No more series can be saved.
Raise condition | influxdb_database_numSeries >= 1000000
Description |
Raises when the number of series collected by InfluxDB reaches the critical threshold of 1 000 000 series. InfluxDB is available but cannot collect more data. Any write request to the database ends with the HTTP 500 status code and the max series per database exceeded error message (see the example after this note). It is not possible to determine which data has not been recorded.
Warning
For production environments, after deployment set both the threshold and the max_series_per_database parameter value to 10 000 000.
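The following sketch shows how the failure mode described above appears to a client. The host, database, and measurement are placeholders, and 8086 is the default InfluxDB HTTP port:

# Attempt a write through the standard InfluxDB 1.x HTTP write API.
curl -i -XPOST 'http://<influxdb_host>:8086/write?db=<database>' \
  --data-binary 'test_measurement,host=test value=1'

# Once the limit is reached, the response is an HTTP 500 status code with the
# "max series per database exceeded" error message described above.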
Troubleshooting |
- Decrease the retention policy for the affected database.
- Remove unused data.
- Increase the maximum number of series to keep in the database.
Tuning |
For example, to change both the maximum number of series and the threshold to 10 000 000:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:
     parameters:
       prometheus:
         server:
           alert:
             InfluxdbSeriesMaxNumberCritical:
               if: >-
                 influxdb_database_numSeries >= 10000000
3. In cluster/<cluster_name>/stacklight/telemetry.yml, add:
     parameters:
       influxdb:
         server:
           data:
             max_series_per_database: 10000000
4. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
     salt -C 'I@influxdb:server' state.sls influxdb.server
5. Verify the updated alert definition in the Prometheus web UI.
InfluxdbHTTPClientErrorsWarning
Severity | Warning
Summary | An average of 5% of HTTP client requests on the {{ $labels.host }} node fail.
Raise condition | rate(influxdb_httpd_clientError[1m]) / rate(influxdb_httpd_req[1m]) * 100 > 5
Description | Raises when the rate of client error HTTP requests reaches the threshold of 5% of all requests, indicating issues with the request format, service performance, or the maximum number of series being reached. The host label in the raised alert contains the host name of the affected node.
Troubleshooting |
- Inspect the InfluxDB logs on the affected node using journalctl -xfu influxdb (see the example after this list).
- Verify whether the InfluxdbSeriesMaxNumberWarning alert is firing.
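As a rough way to see which requests fail, the journal can be filtered for InfluxDB HTTP access-log lines with 4xx status codes. This assumes the default [httpd] access-log lines are written to the journal; the grep pattern is a heuristic, not an exact format guarantee:

# Show recent [httpd] log lines and keep those with a 4xx status code.
journalctl -u influxdb --since "1 hour ago" --no-pager \
  | grep '\[httpd\]' | grep -E ' 4[0-9]{2} '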
Tuning |
For example, to change the threshold to 10%:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:
     parameters:
       prometheus:
         server:
           alert:
             InfluxdbHTTPClientErrorsWarning:
               if: >-
                 rate(influxdb_httpd_clientError[1m]) /
                 rate(influxdb_httpd_req[1m]) * 100 > 10
3. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
InfluxdbHTTPPointsWritesFailWarning
Severity | Warning
Summary | More than 5% of HTTP points writes on the {{ $labels.host }} node fail.
Raise condition |
rate(influxdb_httpd_pointsWrittenFail[1m]) /
(rate(influxdb_httpd_pointsWrittenOK[1m]) +
rate(influxdb_httpd_pointsWrittenFail[1m]) +
rate(influxdb_httpd_pointsWrittenDropped[1m])) * 100 > 5
Description | Raises when the percentage of failed HTTP write requests reaches the threshold of 5%, indicating a non-existing target database or that the maximum series threshold has been reached. The host label in the raised alert contains the host name of the affected node.
Troubleshooting |
- Inspect the InfluxDB logs on the affected node using journalctl -xfu influxdb.
- Verify whether the InfluxdbSeriesMaxNumberWarning alert is firing.
- Verify that the target database exists (see the example after this list).
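Because a non-existing target database is one of the causes listed above, the following minimal sketch verifies it with the influx CLI; the database name is a placeholder:

# List the existing databases and check that the one being written to exists.
influx -execute 'SHOW DATABASES'

# If it is missing and is expected to exist, it can be created manually.
influx -execute 'CREATE DATABASE "<database>"'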
Tuning |
For example, to change the threshold to 10%:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:
     parameters:
       prometheus:
         server:
           alert:
             InfluxdbHTTPPointsWritesFailWarning:
               if: >-
                 rate(influxdb_httpd_pointsWrittenFail[1m]) /
                 (rate(influxdb_httpd_pointsWrittenOK[1m]) +
                 rate(influxdb_httpd_pointsWrittenFail[1m]) +
                 rate(influxdb_httpd_pointsWrittenDropped[1m])) * 100 > 10
3. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
InfluxdbHTTPPointsWritesDropWarning
Severity | Warning
Summary | More than 5% of HTTP points writes on the {{ $labels.host }} node were dropped.
Raise condition |
rate(influxdb_httpd_pointsWrittenDropped[1m]) /
(rate(influxdb_httpd_pointsWrittenOK[1m]) +
rate(influxdb_httpd_pointsWrittenFail[1m]) +
rate(influxdb_httpd_pointsWrittenDropped[1m])) * 100 > 5
Description | Raises when the percentage of dropped HTTP points writes reaches the threshold of 5%. Dropping of measurements must be a controlled operation determined by the retention policy or manual actions. This alert is expected during maintenance. Otherwise, investigate the reasons. The host label in the raised alert contains the host name of the affected node.
Troubleshooting | Inspect the InfluxDB logs on the affected node using journalctl -xfu influxdb. Since drops are normally driven by the retention policy, also review the retention policies of the affected database (see the example below).
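A minimal sketch for reviewing the retention policies with the influx CLI; the database name is a placeholder:

# Show the retention policies and their durations for the affected database
# to confirm that the observed drops match the configured retention.
influx -execute 'SHOW RETENTION POLICIES ON "<database>"'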
Tuning |
For example, to change the threshold to 10%:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:
     parameters:
       prometheus:
         server:
           alert:
             InfluxdbHTTPPointsWritesDropWarning:
               if: >-
                 rate(influxdb_httpd_pointsWrittenDropped[1m]) /
                 (rate(influxdb_httpd_pointsWrittenOK[1m]) +
                 rate(influxdb_httpd_pointsWrittenFail[1m]) +
                 rate(influxdb_httpd_pointsWrittenDropped[1m])) * 100 > 10
3. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
InfluxdbRelayBufferFullWarning
Severity | Warning
Summary | The InfluxDB Relay {{ $labels.host }} back-end buffer is 80% full.
Raise condition | influxdb_relay_backend_buffer_bytes / 5.36870912e+08 * 100 > 80
Description | Raises when the total InfluxDB Relay back-end buffer usage reaches 80% of the 512 MB buffer size (536 870 912 bytes), which may indicate InfluxDB issues. When the buffer is full, requests can no longer be cached.
Troubleshooting | Increase the buffer size as required.
Tuning |
For example, to change the threshold to 90% and the buffer size to 1024 MB:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter (2^20 converts bytes to MB, so the expression compares the buffer usage in MB against 90% of the 1024 MB size):
     parameters:
       prometheus:
         server:
           alert:
             InfluxdbRelayBufferFullWarning:
               if: >-
                 influxdb_relay_backend_buffer_bytes / 2^20 > 1024 * 0.9
3. In the _param section of cluster/<cluster_name>/stacklight/telemetry.yml, specify:
     influxdb_relay_buffer_size_mb: 1024
4. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
     salt -C 'I@influxdb:server' state.sls influxdb.relay
5. Verify the updated alert definition in the Prometheus web UI.
InfluxdbRelayRequestsFailWarning
Severity | Warning
Summary | An average of 5% of InfluxDB Relay requests on the {{ $labels.host }} node fail.
Raise condition |
rate(influxdb_relay_failed_requests_total[1m]) /
rate(influxdb_relay_requests_total[1m]) * 100 > 5
Description | Raises when the percentage of failed InfluxDB Relay requests reaches the threshold of 5%, indicating issues with the availability of the InfluxDB Relay back ends.
Troubleshooting |
- Inspect the InfluxDB logs on the affected node using journalctl -xfu influxdb.
- Inspect the InfluxdbRelayBufferFullWarning alert.
- Inspect the InfluxdbSeriesMaxNumberWarning or InfluxdbSeriesMaxNumberCritical alerts.
Tuning |
For example, to change the threshold to 10%:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:
     parameters:
       prometheus:
         server:
           alert:
             InfluxdbRelayRequestsFailWarning:
               if: >-
                 rate(influxdb_relay_failed_requests_total[1m]) /
                 rate(influxdb_relay_requests_total[1m]) * 100 > 10
3. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
RemoteStorageAdapterMetricsSendingWarning
Severity | Warning
Summary | The ratio of sent to received metrics for the remote storage adapter on the {{ $labels.instance }} instance is less than 0.9.
Raise condition |
increase(sent_samples_total{job="remote_storage_adapter"}[1m]) /
on (job, instance) increase(received_samples_total[1m]) < 0.9
Description | Raises when the sent-to-received metrics ratio of the remote storage adapter drops below 0.9 (90%). If this ratio keeps decreasing, the adapter stops sending new metrics to the remote storage.
Troubleshooting |
- Verify that the remote storage adapter container is operating by running docker ps on the mon nodes.
- Inspect the remote storage adapter service logs by running docker service logs monitoring_remote_storage_adapter on any mon node (see the example after this list).
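The following sketch combines these checks; run it on a mon node. The service name monitoring_remote_storage_adapter is taken from this document, and the filter value below is an assumption based on that name:

# Check that the service is defined and its replicas are running.
docker service ls | grep remote_storage_adapter

# Check whether a container for the service runs on this node.
docker ps --filter name=remote_storage_adapter

# Review the most recent service logs.
docker service logs --tail 100 monitoring_remote_storage_adapter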
Tuning |
For example, to change the evaluation range to 10 minutes and the ratio threshold to 1:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:
     parameters:
       prometheus:
         server:
           alert:
             RemoteStorageAdapterMetricsSendingWarning:
               if: >-
                 increase(sent_samples_total{job="remote_storage_adapter"}[10m])
                 / on (job, instance) increase(received_samples_total[10m]) < 1
3. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
RemoteStorageAdapterMetricsIgnoredWarning
Severity | Warning
Summary | More than 5% of remote storage adapter metrics on the {{ $labels.instance }} instance are invalid.
Raise condition |
increase(prometheus_influxdb_ignored_samples_total{job="remote_storage_adapter"}[1m]) /
on (job, instance) increase(sent_samples_total[1m]) >= 0.05
Description | Raises when the ignored-to-sent metrics ratio of the remote storage adapter reaches the default threshold of 5%, indicating that at least 5% of the metrics sent by the remote storage adapter were ignored by InfluxDB.
Troubleshooting |
- Inspect the InfluxDB alerts.
- Inspect the remote storage adapter service logs by running docker service logs monitoring_remote_storage_adapter on any mon node.
Tuning |
For example, to change the threshold to 10%:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:
     parameters:
       prometheus:
         server:
           alert:
             RemoteStorageAdapterMetricsIgnoredWarning:
               if: >-
                 increase(prometheus_influxdb_ignored_samples_total{job="remote_storage_adapter"}[1m]) /
                 on (job, instance) increase(sent_samples_total[1m]) >= 0.1
3. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.