Memcached

This section describes the alerts for the Memcached service.


MemcachedServiceDown

Severity Minor
Summary The Memcached service on the {{ $labels.host }} node is down.
Raise condition memcached_up == 0
Description Raises when Telegraf cannot gather metrics from the Memcached service, typically indicating that Memcached is down on one node and caching does not work on that node. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Verify the Memcached service status using systemctl status memcached.
  • Inspect the Memcached service logs using journalctl -xfu memcached.
Tuning Not required

MemcachedServiceRespawn

Removed since the 2019.2.4 maintenance update.

Severity Warning
Summary The Memcached service on the {{ $labels.host }} node was respawned.
Raise condition memcached_uptime < 180
Description

Raises when the Memcached service uptime is below 180 seconds, indicating that it was recently respawned (restarted). If Memcached respawning happened during maintenance, the alert is expected. Otherwise, this alert indicates an issue with the service. The host label in the raised alert contains the host name of the affected node.

Warning

The alert is a partial duplicate of MemcachedServiceDown and has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, verify and disable this alert.

Troubleshooting
  • Verify the Memcached service status using systemctl status memcached.
  • Inspect the Memcached service logs using journalctl -xfu memcached.
Tuning Disable the alert as described in Manage alerts.

MemcachedConnectionThrottled

Severity Warning
Summary More than 5 client connections to the Memcached database on the {{ $labels.host }} node throttle for 2 minutes.
Raise condition increase(memcached_conn_yields[1m]) > 5
Description Raises when the number of times the Memcached connection was throttled exceeds 5 over the last minute. This warning accompanies the Too many open connections error message in Memcached. Too many connections may cause write errors due to process starvation (blocking). To avoid this, Memcached throttles the connection. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Use telnet to connect to Memcached by running telnet localhost 11211 on the affected node. Then run stats to obtain the server information.
  • Inspect the Memcached service logs using journalctl -xfu memcached.
  • Adjust the threshold if required.
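
The telnet session described above can also be scripted. The following is a minimal sketch, assuming that Memcached listens on localhost:11211 as in the telnet command and speaks the plain text protocol. The function names parse_stats and fetch_stats are hypothetical and not part of any MCP tooling.

```python
import socket


def parse_stats(raw: str) -> dict:
    """Turn 'STAT <name> <value>' lines into a dict, skipping END and blank lines."""
    stats = {}
    for line in raw.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats(host: str = "localhost", port: int = 11211,
                timeout: float = 5.0) -> dict:
    """Send the 'stats' command over TCP and read until the END marker."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\r\n")
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection early
                break
            buf += chunk
    return parse_stats(buf.decode("ascii", errors="replace"))
```

Calling fetch_stats() on the affected node and reading the curr_connections, conn_yields, and listen_disabled_num keys from the result gives the same figures as the interactive stats command. Values are returned as strings because some stats, such as version, are not numeric.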
Tuning

To change the throttling threshold to 10:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            MemcachedConnectionThrottled:
              if: >-
                increase(memcached_conn_yields[1m]) > 10
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
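
As an alternative to checking the web UI by hand, the loaded expression can be read from the Prometheus HTTP API. This is a sketch only, assuming that the deployed Prometheus version exposes the /api/v1/rules endpoint and that its base URL is known. The function names and the base_url parameter are hypothetical.

```python
import json
import urllib.request


def find_alert_expr(rules_payload: dict, alert_name: str):
    """Return the expression of the named alerting rule, or None if it is absent."""
    for group in rules_payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") == "alerting" and rule.get("name") == alert_name:
                return rule.get("query")
    return None


def fetch_rules(base_url: str, timeout: float = 10.0) -> dict:
    """Download the currently loaded rule groups from the Prometheus HTTP API."""
    url = f"{base_url}/api/v1/rules"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

After the Salt state is applied, find_alert_expr(fetch_rules(base_url), "MemcachedConnectionThrottled") should return the overridden expression with the new threshold.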

MemcachedConnectionsNoneMinor

Severity Minor
Summary The Memcached database on the {{ $labels.host }} node has no open connections.
Raise condition memcached_curr_connections == 0
Description Raises when no connections to Memcached exist on one node, typically indicating that the connections were dropped. The state may affect performance. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  • Use telnet to connect to Memcached by running telnet localhost 11211 on the affected node. Then run stats to obtain the server information.
  • Inspect the Memcached service logs using journalctl -xfu memcached.
Tuning Not required

MemcachedConnectionsNoneMajor

Severity Major
Summary The Memcached database has no open connections on all nodes.
Raise condition count(memcached_curr_connections == 0) == count(memcached_up)
Description Raises when none of the nodes has open connections to Memcached, indicating that no client is connected to Memcached and it does not receive data.
Troubleshooting
  • Use telnet to connect to Memcached by running telnet localhost 11211 on each affected node. Then run stats to obtain the server information.
  • Inspect the Memcached service logs using journalctl -xfu memcached.
Tuning Not required
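
The raise condition of this alert compares two counts: the number of nodes that report zero connections and the number of nodes that report the memcached_up metric. The sketch below restates that logic in Python for illustration. It is not how Prometheus evaluates the rule internally, and the function name and host names are hypothetical.

```python
def all_nodes_idle(curr_connections: dict, up: dict) -> bool:
    """Mimic count(memcached_curr_connections == 0) == count(memcached_up).

    Both arguments map a host name to the latest sample of the metric.
    """
    idle_hosts = [host for host, value in curr_connections.items() if value == 0]
    # In PromQL, count() over an empty vector returns no result, so the
    # comparison produces nothing and the alert does not fire.
    if not idle_hosts or not up:
        return False
    # count(memcached_up) counts every reporting series, whatever its value.
    return len(idle_hosts) == len(up)
```

With three nodes that all report 0 connections, the function returns True. As soon as one node reports at least one connection, the counts differ and it returns False, which leaves only the per-node MemcachedConnectionsNoneMinor alert for the idle nodes.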

MemcachedItemsNoneMinor

Removed since the 2019.2.4 maintenance update.

Severity Minor
Summary The Memcached database on the {{ $labels.host }} node is empty.
Raise condition memcached_curr_items == 0
Description

Raises when a Memcached database has no items on one node. Since Memcached is an in-memory database, this may be the result of a Memcached respawn. Otherwise, investigate the reason. The host label in the raised alert contains the host name of the affected node.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For the existing MCP deployments, disable this alert.

Troubleshooting
  1. To confirm the issue, use telnet to connect to Memcached by running telnet localhost 11211 on the affected node.
  2. Run stats and search for curr_items and evictions to verify that the items were not removed before their TTL.
  3. Run stats items for further details on the status of the items.
Tuning Disable the alert as described in Manage alerts.

MemcachedEvictionsLimit

Severity Warning
Summary More than 10 evictions in the Memcached database occurred on the {{ $labels.host }} node during the last minute.
Raise condition increase(memcached_evictions[1m]) > 10
Description Raises when the number of Memcached items that were evicted before their TTL expired increases by more than 10 (default threshold) over the last minute. Memcached is used on the OpenStack controller nodes to cache the service authentication tokens. A high number of evictions indicates heavy token rotation, since old items must be removed to free space for the new ones, based on pseudo-LRU. The host label in the raised alert contains the host name of the affected node.
Troubleshooting
  1. Use telnet to connect to Memcached by running telnet localhost 11211 on the affected node.
  2. Run stats slabs and search for total_pages, chunk_size, and chunks_per_page to verify whether the slabs consume too much space.
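
To make the slab check above less manual, the three fields can be combined into an approximate memory figure per slab class. The following is a rough sketch, assuming the usual STAT <class>:<field> <value> layout of the stats slabs output and integer values for per-class fields. The function names are hypothetical.

```python
from collections import defaultdict


def parse_slab_stats(raw: str) -> dict:
    """Group 'STAT <class>:<field> <value>' lines by slab class ID."""
    slabs = defaultdict(dict)
    for line in raw.splitlines():
        parts = line.split()
        # Skip END, blank lines, and global stats such as active_slabs.
        if len(parts) != 3 or parts[0] != "STAT" or ":" not in parts[1]:
            continue
        class_id, field = parts[1].split(":", 1)
        slabs[int(class_id)][field] = int(parts[2])
    return dict(slabs)


def slab_bytes(slab: dict) -> int:
    """Approximate the memory assigned to one slab class, in bytes."""
    return slab["total_pages"] * slab["chunks_per_page"] * slab["chunk_size"]
```

Sorting the slab classes by slab_bytes shows which chunk sizes hold most of the cache memory. A class that dominates the total while evictions keep growing is a candidate for closer inspection.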
Tuning

To change the evictions limit to 60:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            MemcachedEvictionsLimit:
              if: >-
                increase(memcached_evictions[1m]) > 60
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.