InfluxDB

This section describes the alerts for InfluxDB, InfluxDB Relay, and remote storage adapter.

Warning

InfluxDB, including InfluxDB Relay and remote storage adapter, is deprecated in the Q4'18 MCP release and will be removed in the next release.


InfluxdbServiceDown

Severity

Minor

Summary

The InfluxDB service on the {{ $labels.host }} node is down.

Raise condition

influxdb_up == 0

Description

Raises when the InfluxDB service on one of the mtr nodes is down. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

  • Verify the InfluxDB service status on the affected node using systemctl status influxdb.

  • Inspect the InfluxDB service logs on the affected node using journalctl -xfu influxdb.

  • Verify the available disk space using df -h.

Tuning

Not required

InfluxdbServicesDownMinor

Severity

Minor

Summary

More than 30% of InfluxDB services are down.

Raise condition

count(influxdb_up == 0) >= count(influxdb_up) * 0.3

Description

Raises when InfluxDB services are down on more than 30% of mtr nodes.
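The raise condition compares the count of down services against a fraction of the total. The following is a hypothetical arithmetic sketch of that logic in Python (an illustration only, not the actual Prometheus evaluator):

```python
def services_down_minor(influxdb_up):
    """Mimic count(influxdb_up == 0) >= count(influxdb_up) * 0.3
    for a list of per-node 0/1 samples, one entry per mtr node."""
    down = sum(1 for v in influxdb_up if v == 0)
    return down >= len(influxdb_up) * 0.3

# On a typical three-node mtr cluster, one down service (~33%) already fires:
print(services_down_minor([0, 1, 1]))  # True
print(services_down_minor([1, 1, 1]))  # False
```

The major (60%) and outage (100%) conditions follow the same pattern with higher fractions.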

Troubleshooting

  • Inspect the InfluxdbServiceDown alerts for the host names of the affected nodes.

  • Verify the InfluxDB service status on the affected node using systemctl status influxdb.

  • Inspect the InfluxDB service logs on the affected node using journalctl -xfu influxdb.

  • Verify the available disk space using df -h.

Tuning

Not required

InfluxdbServicesDownMajor

Severity

Major

Summary

More than 60% of InfluxDB services are down.

Raise condition

count(influxdb_up == 0) >= count(influxdb_up) * 0.6

Description

Raises when InfluxDB services are down on more than 60% of the mtr nodes.

Troubleshooting

  • Inspect the InfluxdbServiceDown alerts for the host names of the affected nodes.

  • Verify the InfluxDB service status on the affected node using systemctl status influxdb.

  • Inspect the InfluxDB service logs on the affected node using journalctl -xfu influxdb.

  • Verify the available disk space using df -h.

Tuning

Not required

InfluxdbServiceOutage

Severity

Critical

Summary

All InfluxDB services are down.

Raise condition

count(influxdb_up == 0) == count(influxdb_up)

Description

Raises when InfluxDB services are down on all mtr nodes.

Troubleshooting

  • Inspect the InfluxdbServiceDown alerts for the host names of the affected nodes.

  • Verify the InfluxDB service status on the affected node using systemctl status influxdb.

  • Inspect the InfluxDB service logs on the affected node using journalctl -xfu influxdb.

  • Verify the available disk space using df -h.

Tuning

Not required

InfluxdbSeriesMaxNumberWarning

Severity

Warning

Summary

The InfluxDB database contains 950000 time series.

Raise condition

influxdb_database_numSeries >= 950000

Description

Raises when the number of series stored in InfluxDB reaches 95% of the maximum allowed number of series (950 000 of the default 1 000 000). InfluxDB continues collecting data. However, reaching the maximum series threshold is critical.
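As an illustration of how the warning and critical thresholds relate to the default maximum, here is a plain arithmetic sketch (a hypothetical helper, not part of the alerting stack):

```python
# Default maximum number of series per database in this setup:
MAX_SERIES_PER_DATABASE = 1_000_000

def series_alert_level(num_series, max_series=MAX_SERIES_PER_DATABASE):
    """Map the current series count to the alert that would fire."""
    if num_series >= max_series:
        return "InfluxdbSeriesMaxNumberCritical"   # at 100%, writes start failing
    if num_series >= max_series * 0.95:
        return "InfluxdbSeriesMaxNumberWarning"    # at 95%, i.e. 950 000 by default
    return None

print(series_alert_level(960_000))    # InfluxdbSeriesMaxNumberWarning
print(series_alert_level(1_000_000))  # InfluxdbSeriesMaxNumberCritical
```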

Troubleshooting

  • Decrease the retention policy for the affected database.

  • Remove unused data.

  • Increase the maximum number of series to keep in the database.

Tuning

Typically, you should not change the default value. If the alert constantly fires, increase the max_series_per_database parameter to a value ten times greater.

For example, to change the threshold to 9 500 000 and the number of series to 10 000 000:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            InfluxdbSeriesMaxNumberWarning:
              if: >-
                influxdb_database_numSeries >= 9500000
    
  3. In cluster/<cluster_name>/stacklight/telemetry.yml, add:

    parameters:
      influxdb:
        server:
          data:
            max_series_per_database: 10000000
    
  4. From the Salt Master node, apply the changes:

salt -C 'I@prometheus:server' state.sls prometheus.server
salt -C 'I@influxdb:server' state.sls influxdb.server
    
  5. Verify the updated alert definition in the Prometheus web UI.

InfluxdbSeriesMaxNumberCritical

Severity

Critical

Summary

The InfluxDB database contains 1000000 time series. No more series can be saved.

Raise condition

influxdb_database_numSeries >= 1000000

Description

Raises when the number of series collected by InfluxDB reaches the critical threshold of 1 000 000 series. InfluxDB is available but cannot collect more data. Any write request to the database ends with the HTTP 500 status code and the max series per database exceeded error message. It is not possible to determine which data has not been recorded.

Warning

For production environments, after deployment set both the threshold and the max_series_per_database parameter value to 10 000 000.

Troubleshooting

  • Decrease the retention policy for the affected database.

  • Remove unused data.

  • Increase the maximum number of series to keep in the database.

Tuning

For example, to change the number of series and the threshold to 10 000 000:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            InfluxdbSeriesMaxNumberCritical:
              if: >-
                influxdb_database_numSeries >= 10000000
    
  3. In cluster/<cluster_name>/stacklight/telemetry.yml, add:

    parameters:
      influxdb:
        server:
          data:
            max_series_per_database: 10000000
    
  4. From the Salt Master node, apply the changes:

salt -C 'I@prometheus:server' state.sls prometheus.server
salt -C 'I@influxdb:server' state.sls influxdb.server
    
  5. Verify the updated alert definition in the Prometheus web UI.

InfluxdbHTTPClientErrorsWarning

Severity

Warning

Summary

An average of 5% of HTTP client requests on the {{ $labels.host }} node fail.

Raise condition

rate(influxdb_httpd_clientError[1m]) / rate(influxdb_httpd_req[1m]) * 100 > 5

Description

Raises when the rate of HTTP client errors exceeds the threshold of 5% of all requests, indicating issues with the request format, service performance, or the maximum number of series being reached. The host label in the raised alert contains the host name of the affected node.
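The condition divides two per-second rates, so it is effectively the percentage of requests over the last minute that returned a client error. A minimal sketch of the same arithmetic (hypothetical helper, not part of StackLight):

```python
def client_error_percent(client_errors_per_s, requests_per_s):
    """Percentage of HTTP requests that ended in a client error, mirroring
    rate(influxdb_httpd_clientError[1m]) / rate(influxdb_httpd_req[1m]) * 100."""
    return client_errors_per_s / requests_per_s * 100

# 12 errors/s out of 200 requests/s is 6%, above the default 5% threshold:
pct = client_error_percent(12, 200)
print(pct, pct > 5)  # 6.0 True
```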

Troubleshooting

  • Inspect InfluxDB logs on the affected node using journalctl -xfu influxdb.

  • Verify whether the InfluxdbSeriesMaxNumberWarning alert is firing.

Tuning

For example, to change the threshold to 10%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            InfluxdbHTTPClientErrorsWarning:
              if: >-
                rate(influxdb_httpd_clientError[1m]) / \
                rate(influxdb_httpd_req[1m]) * 100 > 10
    
  3. From the Salt Master node, apply the changes:

salt -C 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

InfluxdbHTTPPointsWritesFailWarning

Severity

Warning

Summary

More than 5% of HTTP points writes on the {{ $labels.host }} node fail.

Raise condition

rate(influxdb_httpd_pointsWrittenFail[1m]) / (rate(influxdb_httpd_pointsWrittenOK[1m]) + rate(influxdb_httpd_pointsWrittenFail[1m]) + rate(influxdb_httpd_pointsWrittenDropped[1m])) * 100 > 5

Description

Raises when the percentage of failed HTTP point writes reaches the threshold of 5%, indicating a write to a nonexistent database or that the maximum series threshold has been reached. The host label in the raised alert contains the host name of the affected node.
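Here, the denominator sums the successful, failed, and dropped point-write rates, so the result is the failed share of all point writes. A plain arithmetic sketch of the check (hypothetical helper, not part of StackLight):

```python
def failed_writes_percent(ok_per_s, fail_per_s, dropped_per_s):
    """Share of point writes that failed, mirroring the raise condition:
    failed / (ok + failed + dropped) * 100."""
    total = ok_per_s + fail_per_s + dropped_per_s
    return fail_per_s / total * 100

# 10 failed, 180 OK, and 10 dropped points/s is exactly 5%, which does not
# yet exceed the strict > 5 threshold:
print(failed_writes_percent(180, 10, 10))  # 5.0
```

The InfluxdbHTTPPointsWritesDropWarning alert below uses the same denominator with the dropped rate in the numerator.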

Troubleshooting

  • Inspect the InfluxDB logs on the affected node using journalctl -xfu influxdb.

  • Verify whether the InfluxdbSeriesMaxNumberWarning alert is firing.

Tuning

For example, to change the threshold to 10%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            InfluxdbHTTPPointsWritesFailWarning:
              if: >-
                rate(influxdb_httpd_pointsWrittenFail[1m]) / \
                (rate(influxdb_httpd_pointsWrittenOK[1m]) + \
                rate(influxdb_httpd_pointsWrittenFail[1m]) + \
                rate(influxdb_httpd_pointsWrittenDropped[1m])) * 100 > 10
    
  3. From the Salt Master node, apply the changes:

salt -C 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

InfluxdbHTTPPointsWritesDropWarning

Severity

Warning

Summary

More than 5% of HTTP points writes on the {{ $labels.host }} node were dropped.

Raise condition

rate(influxdb_httpd_pointsWrittenDropped[1m]) / (rate(influxdb_httpd_pointsWrittenOK[1m]) + rate(influxdb_httpd_pointsWrittenFail[1m]) + rate(influxdb_httpd_pointsWrittenDropped[1m])) * 100 > 5

Description

Raises when the percentage of dropped HTTP point writes reaches the threshold of 5%. Dropping of measurements must be a controlled operation, determined by the retention policy or by manual actions. This alert is expected during maintenance; otherwise, investigate the reasons. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

Inspect the InfluxDB logs on the affected node using journalctl -xfu influxdb.

Tuning

For example, to change the threshold to 10%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            InfluxdbHTTPPointsWritesDropWarning:
              if: >-
            rate(influxdb_httpd_pointsWrittenDropped[1m]) / \
                (rate(influxdb_httpd_pointsWrittenOK[1m]) + \
                rate(influxdb_httpd_pointsWrittenFail[1m]) + \
                rate(influxdb_httpd_pointsWrittenDropped[1m])) * 100 > 10
    
  3. From the Salt Master node, apply the changes:

salt -C 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

InfluxdbRelayBufferFullWarning

Severity

Warning

Summary

The InfluxDB Relay {{ $labels.host }} back-end buffer is 80% full.

Raise condition

influxdb_relay_backend_buffer_bytes / 5.36870912e+08 * 100 > 80

Description

Raises when the InfluxDB Relay buffer usage reaches 80% of the buffer size, which is set to 512 MB by default, and may indicate issues with InfluxDB. When the buffer is full, requests can no longer be cached.
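The constant 5.36870912e+08 in the raise condition is simply 512 MB expressed in bytes, so the expression is the buffer usage as a percentage of the default buffer size. A quick check of the arithmetic (an illustration only):

```python
# 512 MB in bytes, matching the 5.36870912e+08 constant in the condition:
BUFFER_SIZE_BYTES = 512 * 2**20

def relay_buffer_percent(buffer_bytes, size_bytes=BUFFER_SIZE_BYTES):
    """Relay buffer usage as a percentage of the configured buffer size."""
    return buffer_bytes / size_bytes * 100

print(BUFFER_SIZE_BYTES == 5.36870912e+08)     # True
print(relay_buffer_percent(450 * 2**20) > 80)  # True: ~87.9% full
```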

Troubleshooting

Increase the buffer size as required.

Tuning

For example, to change the threshold to 90% and the buffer size to 1024 MB:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            InfluxdbRelayBufferFullWarning:
              if: >-
            influxdb_relay_backend_buffer_bytes / 2^20 > 1024 * 0.9
    
  3. In the _params section in cluster/<cluster_name>/stacklight/telemetry.yml, specify:

    influxdb_relay_buffer_size_mb: 1024
    
  4. From the Salt Master node, apply the changes:

salt -C 'I@prometheus:server' state.sls prometheus.server
    salt -C 'I@influxdb:server' state.sls influxdb.relay
    
  5. Verify the updated alert definition in the Prometheus web UI.

InfluxdbRelayRequestsFailWarning

Severity

Warning

Summary

An average of 5% of InfluxDB Relay requests on the {{ $labels.host }} node fail.

Raise condition

rate(influxdb_relay_failed_requests_total[1m]) / rate(influxdb_relay_requests_total[1m]) * 100 > 5

Description

Raises when the percentage of InfluxDB Relay failed requests reaches the threshold of 5%, indicating issues with the InfluxDB Relay back end availability.

Troubleshooting

  • Inspect the InfluxDB logs on the affected node using journalctl -xfu influxdb.

  • Inspect the InfluxdbRelayBufferFullWarning alert.

  • Inspect the InfluxdbSeriesMaxNumberWarning or InfluxdbSeriesMaxNumberCritical alerts.

Tuning

For example, to change the threshold to 10%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            InfluxdbRelayRequestsFailWarning:
              if: >-
                rate(influxdb_relay_failed_requests_total[1m]) / \
                rate(influxdb_relay_requests_total[1m]) * 100 > 10
    
  3. From the Salt Master node, apply the changes:

salt -C 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RemoteStorageAdapterMetricsSendingWarning

Severity

Warning

Summary

The sent-to-received metrics ratio of the remote storage adapter on the {{ $labels.instance }} instance is less than 0.9.

Raise condition

increase(sent_samples_total{job="remote_storage_adapter"}[1m]) / on (job, instance) increase(received_samples_total[1m]) < 0.9

Description

Raises when the sent-to-received metrics ratio of the remote storage adapter drops below 0.9. A decreasing ratio indicates that the adapter fails to send some of the received metrics to the remote storage.
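The `on (job, instance)` clause matches the two series by their job and instance labels before dividing; the ratio itself is a plain division of the two one-minute increases. A hypothetical sketch of the check:

```python
def sent_to_received_ratio(sent_increase, received_increase):
    """Ratio of samples sent to samples received over the window, mirroring
    increase(sent_samples_total[1m]) / increase(received_samples_total[1m])."""
    return sent_increase / received_increase

# 850 of 1000 received samples sent on gives a ratio of 0.85, which is
# below 0.9, so the alert fires:
ratio = sent_to_received_ratio(850, 1000)
print(ratio, ratio < 0.9)  # 0.85 True
```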

Troubleshooting

  • Verify that the remote storage adapter container is operating by running docker ps on the mon nodes.

  • Inspect the remote storage service logs by running docker service logs monitoring_remote_storage_adapter on any mon node.

Tuning

For example, to change the evaluation window to 10 minutes and the threshold to 1:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RemoteStorageAdapterMetricsSendingWarning:
              if: >-
                increase(sent_samples_total{job="remote_storage_adapter"}[10m])\
                 / on (job, instance) increase(received_samples_total[10m]) < 1
    
  3. From the Salt Master node, apply the changes:

salt -C 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.

RemoteStorageAdapterMetricsIgnoredWarning

Severity

Warning

Summary

More than 5% of remote storage adapter metrics on the {{ $labels.instance }} instance are invalid.

Raise condition

increase(prometheus_influxdb_ignored_samples_total{job="remote_storage_adapter"}[1m]) / on (job, instance) increase(sent_samples_total[1m]) >= 0.05

Description

Raises when the ignored-to-sent metrics ratio of the remote storage adapter reaches the default 5%, indicating that at least 5% of the metrics sent from the remote storage adapter were ignored by InfluxDB.

Troubleshooting

  • Inspect the InfluxDB alerts.

  • Inspect the remote storage service logs by running docker service logs monitoring_remote_storage_adapter on any mon node.

Tuning

For example, to change the threshold to 10%:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            RemoteStorageAdapterMetricsIgnoredWarning:
              if: >-
                increase(prometheus_influxdb_ignored_samples_total\
                {job="remote_storage_adapter"}[1m]) / on (job, instance) \
                increase(sent_samples_total[1m]) >= 0.1
    
  3. From the Salt Master node, apply the changes:

salt -C 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.