InfluxDB
This section describes the alerts for InfluxDB, InfluxDB Relay, and the remote storage adapter.
Warning
InfluxDB, including InfluxDB Relay and the remote storage adapter, is deprecated in the Q4'18 MCP release and will be removed in the next release.
InfluxdbServiceDown
Severity | Minor
Summary | The InfluxDB service on the {{ $labels.host }} node is down.
Raise condition | influxdb_up == 0
Description | Raises when the InfluxDB service on one of the mtr nodes is down. The host label in the raised alert contains the host name of the affected node.
Troubleshooting |
- Verify the InfluxDB service status on the affected node using systemctl status influxdb.
- Inspect the InfluxDB service logs on the affected node using journalctl -xfu influxdb.
- Verify the available disk space using df -h (see the combined example after this list).
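For convenience, these checks can be run in one pass from a shell on the affected node. This is a minimal sketch that uses only the commands listed above:

# Check whether the InfluxDB unit is running and when it last changed state.
systemctl status influxdb

# Follow the InfluxDB journal to look for startup errors (press Ctrl+C to stop).
journalctl -xfu influxdb

# Confirm that the data partition still has free space.
df -h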
Tuning | Not required
InfluxdbServicesDownMinor
Severity | Minor
Summary | More than 30% of InfluxDB services are down.
Raise condition | count(influxdb_up == 0) >= count(influxdb_up) * 0.3
Description | Raises when InfluxDB services are down on more than 30% of the mtr nodes.
Troubleshooting |
- Inspect the
InfluxdbServiceDown alerts for the host names of the
affected nodes.
- Verify the InfluxDB service status on the affected node using
systemctl status influxdb .
- Inspect the InfluxDB service logs on the affected node using
journalctl -xfu influxdb .
- Verify the available disk space using
df -h .
|
Tuning | Not required
InfluxdbServicesDownMajor
Severity | Major
Summary | More than 60% of InfluxDB services are down.
Raise condition | count(influxdb_up == 0) >= count(influxdb_up) * 0.6
Description | Raises when InfluxDB services are down on more than 60% of the mtr nodes.
Troubleshooting |
- Inspect the
InfluxdbServiceDown alerts for the host names of the
affected nodes.
- Verify the InfluxDB service status on the affected node using
systemctl status influxdb .
- Inspect the InfluxDB service logs on the affected node using
journalctl -xfu influxdb .
- Verify the available disk space using
df -h .
|
Tuning | Not required
InfluxdbServiceOutage
Severity | Critical
Summary | All InfluxDB services are down.
Raise condition | count(influxdb_up == 0) == count(influxdb_up)
Description | Raises when InfluxDB services are down on all mtr nodes.
Troubleshooting |
- Inspect the
InfluxdbServiceDown alerts for the host names of the
affected nodes.
- Verify the InfluxDB service status on the affected node using
systemctl status influxdb .
- Inspect the InfluxDB service logs on the affected node using
journalctl -xfu influxdb .
- Verify the available disk space using
df -h .
|
Tuning | Not required
InfluxdbSeriesMaxNumberWarning
Severity | Warning
Summary | The InfluxDB database contains 950000 time series.
Raise condition | influxdb_database_numSeries >= 950000
Description | Raises when the number of series collected by InfluxDB reaches 95% of the maximum of 1 000 000 series set by the default max_series_per_database value. InfluxDB continues collecting data. However, reaching the maximum series threshold is critical.
Troubleshooting |
- Decrease the retention policy for the affected database.
- Remove unused data.
- Increase the maximum number of series to keep in the database.
|
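The following is a minimal sketch of these actions using the influx CLI on an InfluxDB node. The database, retention policy, measurement, and host values are placeholders, not names defined by this document; SHOW SERIES CARDINALITY requires InfluxDB 1.4 or later:

# Find the database that is close to the series limit.
influx -execute 'SHOW DATABASES'
influx -execute 'SHOW SERIES CARDINALITY ON "<database>"'

# Decrease the retention policy duration so that old series expire sooner
# ("autogen" is the default retention policy name; yours may differ).
influx -execute 'ALTER RETENTION POLICY "autogen" ON "<database>" DURATION 7d'

# Remove series that are no longer written, for example from a removed node.
influx -database '<database>' \
  -execute "DROP SERIES FROM \"<measurement>\" WHERE \"host\" = '<old_host>'"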
Tuning |
Typically, you should not change the default value. If the alert is constantly firing, increase the max_series_per_database parameter to a value ten times bigger.
For example, to change the threshold to 9 500 000 and the maximum number of series to 10 000 000:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:
     parameters:
       prometheus:
         server:
           alert:
             InfluxdbSeriesMaxNumberWarning:
               if: >-
                 influxdb_database_numSeries >= 9500000
3. In cluster/<cluster_name>/stacklight/telemetry.yml, add:
     parameters:
       influxdb:
         server:
           data:
             max_series_per_database: 10000000
4. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
     salt -C 'I@influxdb:server' state.sls influxdb.server
5. Verify the updated alert definition in the Prometheus web UI (or use the command-line check after this procedure).
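As an optional command-line alternative to the web UI check, the standard Prometheus query API can be used to see how close the current series count is to the new threshold. The host and port below are placeholders for your Prometheus server endpoint:

# Returns influxdb_database_numSeries per database; compare the values
# against the new 9 500 000 threshold.
curl -s 'http://<prometheus_host>:<prometheus_port>/api/v1/query?query=influxdb_database_numSeries'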
InfluxdbSeriesMaxNumberCritical
Severity | Critical
Summary | The InfluxDB database contains 1000000 time series. No more series can be saved.
Raise condition | influxdb_database_numSeries >= 1000000
Description |
Raises when the number of series collected by InfluxDB reaches the critical threshold of 1 000 000 series. InfluxDB is available but cannot collect more data. Any write request to the database ends with the HTTP 500 status code and the max series per database exceeded error message (see the example after this note). It is not possible to determine which data has not been recorded.
Warning
For production environments, after deployment set both the threshold and the max_series_per_database parameter value to 10 000 000.
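The following sketch shows how the failure mode described above appears to a client. The host, database, and measurement are placeholders, and 8086 is the default InfluxDB HTTP port:

# Attempt a write through the standard InfluxDB 1.x HTTP write API.
curl -i -XPOST 'http://<influxdb_host>:8086/write?db=<database>' \
  --data-binary 'test_measurement,host=test value=1'

# Once the limit is reached, the response is an HTTP 500 status code with the
# "max series per database exceeded" error message described above.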
Troubleshooting |
- Decrease the retention policy for the affected database.
- Remove unused data.
- Increase the maximum number of series to keep in the database.
Tuning |
For example, to change both the maximum number of series and the threshold to 10 000 000:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:
     parameters:
       prometheus:
         server:
           alert:
             InfluxdbSeriesMaxNumberCritical:
               if: >-
                 influxdb_database_numSeries >= 10000000
3. In cluster/<cluster_name>/stacklight/telemetry.yml, add:
     parameters:
       influxdb:
         server:
           data:
             max_series_per_database: 10000000
4. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
     salt -C 'I@influxdb:server' state.sls influxdb.server
5. Verify the updated alert definition in the Prometheus web UI.
InfluxdbHTTPClientErrorsWarning
Severity | Warning
Summary | An average of 5% of HTTP client requests on the {{ $labels.host }} node fail.
Raise condition | rate(influxdb_httpd_clientError[1m]) / rate(influxdb_httpd_req[1m]) * 100 > 5
Description | Raises when the rate of client error HTTP requests reaches the threshold of 5% of all requests, indicating issues with the request format, service performance, or the maximum number of series being reached. The host label in the raised alert contains the host name of the affected node.
Troubleshooting |
- Inspect the InfluxDB logs on the affected node using journalctl -xfu influxdb (see the example after this list).
- Verify whether the InfluxdbSeriesMaxNumberWarning alert is firing.
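As a rough way to see which requests fail, the journal can be filtered for InfluxDB HTTP access-log lines with 4xx status codes. This assumes the default [httpd] access-log lines are written to the journal; the grep pattern is a heuristic, not an exact format guarantee:

# Show recent [httpd] log lines and keep those with a 4xx status code.
journalctl -u influxdb --since "1 hour ago" --no-pager \
  | grep '\[httpd\]' | grep -E ' 4[0-9]{2} '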
Tuning |
For example, to change the threshold to 10%:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:
     parameters:
       prometheus:
         server:
           alert:
             InfluxdbHTTPClientErrorsWarning:
               if: >-
                 rate(influxdb_httpd_clientError[1m]) /
                 rate(influxdb_httpd_req[1m]) * 100 > 10
3. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
InfluxdbHTTPPointsWritesFailWarning
Severity | Warning
Summary | More than 5% of HTTP points writes on the {{ $labels.host }} node fail.
Raise condition |
rate(influxdb_httpd_pointsWrittenFail[1m]) /
(rate(influxdb_httpd_pointsWrittenOK[1m]) +
rate(influxdb_httpd_pointsWrittenFail[1m]) +
rate(influxdb_httpd_pointsWrittenDropped[1m])) * 100 > 5
Description | Raises when the percentage of failed HTTP write requests reaches the threshold of 5%, indicating a non-existing target database or that the maximum series threshold has been reached. The host label in the raised alert contains the host name of the affected node.
Troubleshooting |
- Inspect the InfluxDB logs on the affected node using journalctl -xfu influxdb.
- Verify whether the InfluxdbSeriesMaxNumberWarning alert is firing.
- Verify that the target database exists (see the example after this list).
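Because a non-existing target database is one of the causes listed above, the following minimal sketch verifies it with the influx CLI; the database name is a placeholder:

# List the existing databases and check that the one being written to exists.
influx -execute 'SHOW DATABASES'

# If it is missing and is expected to exist, it can be created manually.
influx -execute 'CREATE DATABASE "<database>"'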
Tuning |
For example, to change the threshold to 10%:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:
     parameters:
       prometheus:
         server:
           alert:
             InfluxdbHTTPPointsWritesFailWarning:
               if: >-
                 rate(influxdb_httpd_pointsWrittenFail[1m]) /
                 (rate(influxdb_httpd_pointsWrittenOK[1m]) +
                 rate(influxdb_httpd_pointsWrittenFail[1m]) +
                 rate(influxdb_httpd_pointsWrittenDropped[1m])) * 100 > 10
3. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
InfluxdbHTTPPointsWritesDropWarning
Severity | Warning
Summary | More than 5% of HTTP points writes on the {{ $labels.host }} node were dropped.
Raise condition |
rate(influxdb_httpd_pointsWrittenDropped[1m]) /
(rate(influxdb_httpd_pointsWrittenOK[1m]) +
rate(influxdb_httpd_pointsWrittenFail[1m]) +
rate(influxdb_httpd_pointsWrittenDropped[1m])) * 100 > 5
Description | Raises when the percentage of dropped HTTP points writes reaches the threshold of 5%. Dropping of measurements must be a controlled operation determined by the retention policy or manual actions. This alert is expected during maintenance. Otherwise, investigate the reasons. The host label in the raised alert contains the host name of the affected node.
Troubleshooting | Inspect the InfluxDB logs on the affected node using journalctl -xfu influxdb. Since drops are normally driven by the retention policy, also review the retention policies of the affected database (see the example below).
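A minimal sketch for reviewing the retention policies with the influx CLI; the database name is a placeholder:

# Show the retention policies and their durations for the affected database
# to confirm that the observed drops match the configured retention.
influx -execute 'SHOW RETENTION POLICIES ON "<database>"'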
Tuning |
For example, to change the threshold to 10%:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:
     parameters:
       prometheus:
         server:
           alert:
             InfluxdbHTTPPointsWritesDropWarning:
               if: >-
                 rate(influxdb_httpd_pointsWrittenDropped[1m]) /
                 (rate(influxdb_httpd_pointsWrittenOK[1m]) +
                 rate(influxdb_httpd_pointsWrittenFail[1m]) +
                 rate(influxdb_httpd_pointsWrittenDropped[1m])) * 100 > 10
3. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
InfluxdbRelayBufferFullWarning
Severity | Warning
Summary | The InfluxDB Relay {{ $labels.host }} back-end buffer is 80% full.
Raise condition | influxdb_relay_backend_buffer_bytes / 5.36870912e+08 * 100 > 80
Description | Raises when the total InfluxDB Relay back-end buffer usage reaches 80% of the 512 MB buffer size (536 870 912 bytes), which may indicate InfluxDB issues. When the buffer is full, requests can no longer be cached.
Troubleshooting | Increase the buffer size as required.
Tuning |
For example, to change the threshold to 90% and the buffer size to 1024 MB:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter (2^20 converts bytes to MB, so the expression compares the buffer usage in MB against 90% of the 1024 MB size):
     parameters:
       prometheus:
         server:
           alert:
             InfluxdbRelayBufferFullWarning:
               if: >-
                 influxdb_relay_backend_buffer_bytes / 2^20 > 1024 * 0.9
3. In the _param section of cluster/<cluster_name>/stacklight/telemetry.yml, specify:
     influxdb_relay_buffer_size_mb: 1024
4. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
     salt -C 'I@influxdb:server' state.sls influxdb.relay
5. Verify the updated alert definition in the Prometheus web UI.
InfluxdbRelayRequestsFailWarning
Severity | Warning
Summary | An average of 5% of InfluxDB Relay requests on the {{ $labels.host }} node fail.
Raise condition |
rate(influxdb_relay_failed_requests_total[1m]) /
rate(influxdb_relay_requests_total[1m]) * 100 > 5
Description | Raises when the percentage of failed InfluxDB Relay requests reaches the threshold of 5%, indicating issues with the availability of the InfluxDB Relay back ends.
Troubleshooting |
- Inspect the InfluxDB logs on the affected node using journalctl -xfu influxdb.
- Inspect the InfluxdbRelayBufferFullWarning alert.
- Inspect the InfluxdbSeriesMaxNumberWarning or InfluxdbSeriesMaxNumberCritical alerts.
Tuning |
For example, to change the threshold to 10%:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:
     parameters:
       prometheus:
         server:
           alert:
             InfluxdbRelayRequestsFailWarning:
               if: >-
                 rate(influxdb_relay_failed_requests_total[1m]) /
                 rate(influxdb_relay_requests_total[1m]) * 100 > 10
3. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
RemoteStorageAdapterMetricsSendingWarning
Severity | Warning
Summary | The ratio of sent to received metrics for the remote storage adapter on the {{ $labels.instance }} instance is less than 0.9.
Raise condition |
increase(sent_samples_total{job="remote_storage_adapter"}[1m]) /
on (job, instance) increase(received_samples_total[1m]) < 0.9
Description | Raises when the sent-to-received metrics ratio of the remote storage adapter drops below 0.9 (90%). If this ratio keeps decreasing, the adapter stops sending new metrics to the remote storage.
Troubleshooting |
- Verify that the remote storage adapter container is operating by running docker ps on the mon nodes.
- Inspect the remote storage adapter service logs by running docker service logs monitoring_remote_storage_adapter on any mon node (see the example after this list).
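The following sketch combines these checks; run it on a mon node. The service name monitoring_remote_storage_adapter is taken from this document, and the filter value below is an assumption based on that name:

# Check that the service is defined and its replicas are running.
docker service ls | grep remote_storage_adapter

# Check whether a container for the service runs on this node.
docker ps --filter name=remote_storage_adapter

# Review the most recent service logs.
docker service logs --tail 100 monitoring_remote_storage_adapter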
Tuning |
For example, to change the evaluation range to 10 minutes and the ratio threshold to 1:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:
     parameters:
       prometheus:
         server:
           alert:
             RemoteStorageAdapterMetricsSendingWarning:
               if: >-
                 increase(sent_samples_total{job="remote_storage_adapter"}[10m])
                 / on (job, instance) increase(received_samples_total[10m]) < 1
3. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.
RemoteStorageAdapterMetricsIgnoredWarning
Severity | Warning
Summary | More than 5% of remote storage adapter metrics on the {{ $labels.instance }} instance are invalid.
Raise condition |
increase(prometheus_influxdb_ignored_samples_total{job="remote_storage_adapter"}[1m]) /
on (job, instance) increase(sent_samples_total[1m]) >= 0.05
Description | Raises when the ignored-to-sent metrics ratio of the remote storage adapter reaches the default threshold of 5%, indicating that at least 5% of the metrics sent by the remote storage adapter were ignored by InfluxDB.
Troubleshooting |
- Inspect the InfluxDB alerts.
- Inspect the remote storage adapter service logs by running docker service logs monitoring_remote_storage_adapter on any mon node.
Tuning |
For example, to change the threshold to 10%:
1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.
   Create a file for alert customizations:
     touch cluster/<cluster_name>/stacklight/custom/alerts.yml
   Define the new file in cluster/<cluster_name>/stacklight/server.yml:
     classes:
     - cluster.<cluster_name>.stacklight.custom.alerts
     ...
2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:
     parameters:
       prometheus:
         server:
           alert:
             RemoteStorageAdapterMetricsIgnoredWarning:
               if: >-
                 increase(prometheus_influxdb_ignored_samples_total{job="remote_storage_adapter"}[1m]) /
                 on (job, instance) increase(sent_samples_total[1m]) >= 0.1
3. From the Salt Master node, apply the changes:
     salt -C 'I@prometheus:server' state.sls prometheus.server
4. Verify the updated alert definition in the Prometheus web UI.