Memcached

This section describes the alerts for the Memcached service.


MemcachedServiceDown

Severity

Minor

Summary

The Memcached service on the {{ $labels.host }} node is down.

Raise condition

memcached_up == 0

Description

Raises when Telegraf cannot gather metrics from the Memcached service, which typically indicates that Memcached is down on one node and caching does not work on that node. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

  • Verify the Memcached service status using systemctl status memcached.

  • Inspect the Memcached service logs using journalctl -xfu memcached.

Tuning

Not required

MemcachedServiceRespawn

Removed since the 2019.2.4 maintenance update.

Severity

Warning

Summary

The Memcached service on the {{ $labels.host }} node was respawned.

Raise condition

memcached_uptime < 180

Description

Raises when the Memcached service uptime is below 180 seconds, indicating that it was recently respawned (restarted). If Memcached respawning happened during maintenance, the alert is expected. Otherwise, this alert indicates an issue with the service. The host label in the raised alert contains the host name of the affected node.

Warning

The alert is a partial duplicate of MemcachedServiceDown and has been removed starting from the 2019.2.4 maintenance update. For existing MCP deployments, verify and disable this alert.

Troubleshooting

  • Verify the Memcached service status using systemctl status memcached.

  • Inspect the Memcached service logs using journalctl -xfu memcached.

Tuning

Disable the alert as described in Manage alerts.

MemcachedConnectionThrottled

Severity

Warning

Summary

More than 5 client connections to the Memcached database on the {{ $labels.host }} node throttle for 2 minutes.

Raise condition

increase(memcached_conn_yields[1m]) > 5

Description

Raises when Memcached throttled client connections more than 5 times over the last minute. This warning accompanies the Too many open connections error message in Memcached. Too many connections may cause write errors because of process starvation (blocking). To avoid this, Memcached throttles the connection. The host label in the raised alert contains the host name of the affected node.
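
As a rough illustration of what the raise condition measures, the sketch below approximates increase() over a window of counter samples. It is a simplification, not the Prometheus implementation: it handles counter resets but skips the extrapolation to the window boundaries that Prometheus applies. The function name simple_increase and the sample values are hypothetical.

```python
def simple_increase(samples):
    """Approximate the growth of a counter across ordered samples.

    A drop between two samples is treated as a counter reset (for example,
    after a Memcached restart), so the new value counts as growth from zero.
    Assumption: no extrapolation to window edges, unlike Prometheus.
    """
    total = 0
    for prev, curr in zip(samples, samples[1:]):
        total += (curr - prev) if curr >= prev else curr
    return total


# Hypothetical conn_yields samples scraped within one minute.
window = [120, 122, 125, 127]
yields = simple_increase(window)
print(yields, yields > 5)  # 7 True
```

With these samples the counter grew by 7, which is above the default threshold of 5, so the alert condition would hold.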

Troubleshooting

  • Use telnet to connect to Memcached by running telnet localhost 11211 on the affected node. Then run stats to obtain the server information.

  • Inspect the Memcached service logs using journalctl -xfu memcached.

  • Adjust the threshold if required.
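
The telnet session can also be scripted. The following is a minimal sketch, assuming Memcached listens on localhost:11211 and speaks the text protocol, where the stats command returns STAT <name> <value> lines terminated by END. The helper names parse_stats and fetch_stats are illustrative and not part of any MCP tooling.

```python
import socket


def parse_stats(raw):
    """Turn 'STAT <name> <value>' lines into a dict of strings."""
    stats = {}
    for line in raw.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats(host="localhost", port=11211, command="stats", timeout=5.0):
    """Send one stats command and return the parsed reply.

    Assumption: the reply ends with 'END'; any other reply (such as ERROR)
    makes the read time out and raise an exception.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii") + b"\r\n")
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    return parse_stats(buf.decode("ascii", errors="replace"))


# Parsing works on captured output as well; these values are made up.
sample = "STAT curr_connections 12\r\nSTAT conn_yields 7\r\nEND\r\n"
print(parse_stats(sample))  # {'curr_connections': '12', 'conn_yields': '7'}
```

On the affected node, fetch_stats()["conn_yields"] holds the cumulative throttling counter, and fetch_stats(command="stats settings") lists the runtime settings of the server.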

Tuning

To change the throttling threshold to 10:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            MemcachedConnectionThrottled:
              if: >-
                increase(memcached_conn_yields[1m]) > 10
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
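
Besides the web UI, the loaded rule can be checked through the Prometheus HTTP API, which exposes rule definitions at /api/v1/rules. The sketch below assumes that the API is reachable at a base URL that you supply. The function names and the URL in the usage note are placeholders.

```python
import json
import urllib.request


def find_alert_query(rules_payload, alert_name):
    """Return the expression of the named rule, or None if it is absent.

    rules_payload is the decoded JSON body of GET /api/v1/rules.
    """
    for group in rules_payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("name") == alert_name:
                return rule.get("query")
    return None


def fetch_rules(base_url, timeout=10):
    """Download the rule groups from a Prometheus server."""
    url = base_url.rstrip("/") + "/api/v1/rules"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, find_alert_query(fetch_rules("http://<prometheus_host>:<port>"), "MemcachedConnectionThrottled") should return an expression that contains the new threshold once the state has been applied.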

MemcachedConnectionsNoneMinor

Severity

Minor

Summary

The Memcached database on the {{ $labels.host }} node has no open connections.

Raise condition

memcached_curr_connections == 0

Description

Raises when no connections to Memcached exist on one node, typically indicating that the connections were dropped. This state may affect performance. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

  • Use telnet to connect to Memcached by running telnet localhost 11211 on the affected node. Then run stats to obtain the server information.

  • Inspect the Memcached service logs using journalctl -xfu memcached.

Tuning

Not required

MemcachedConnectionsNoneMajor

Severity

Major

Summary

The Memcached database has no open connections on all nodes.

Raise condition

count(memcached_curr_connections == 0) == count(memcached_up)

Description

Raises when no connections to Memcached exist on any node, indicating that no client is connected to Memcached and it does not receive data.
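
The raise condition compares two counts: the number of nodes that report zero connections and the number of nodes that report memcached_up. The following sketch mimics that comparison on plain dictionaries to show when it holds. It illustrates the logic only, with hypothetical host names, and is not how Prometheus evaluates the rule internally.

```python
def connections_none_major(curr_connections, up):
    """Mimic count(memcached_curr_connections == 0) == count(memcached_up).

    Both arguments map a host name to the latest sample value.
    """
    idle_hosts = [host for host, value in curr_connections.items() if value == 0]
    # In PromQL, count() over an empty result yields no series, so the
    # comparison produces nothing and the alert does not fire.
    if not idle_hosts or not up:
        return False
    return len(idle_hosts) == len(up)


up = {"ctl01": 1, "ctl02": 1, "ctl03": 1}
print(connections_none_major({"ctl01": 0, "ctl02": 0, "ctl03": 0}, up))  # True
print(connections_none_major({"ctl01": 0, "ctl02": 4, "ctl03": 0}, up))  # False
```

Note that count(memcached_up) counts every node that reports the metric, regardless of the metric value.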

Troubleshooting

  • Use telnet to connect to Memcached by running telnet localhost 11211 on the affected node. Then run stats to obtain the server information.

  • Inspect the Memcached service logs using journalctl -xfu memcached.

Tuning

Not required

MemcachedItemsNoneMinor

Removed since the 2019.2.4 maintenance update.

Severity

Minor

Summary

The Memcached database on the {{ $labels.host }} node is empty.

Raise condition

memcached_curr_items == 0

Description

Raises when a Memcached database has no items on one node. Because Memcached is an in-memory database, this may be the result of a Memcached respawn. Otherwise, investigate the reason. The host label in the raised alert contains the host name of the affected node.

Warning

The alert has been removed starting from the 2019.2.4 maintenance update. For existing MCP deployments, disable this alert.

Troubleshooting

  1. To confirm the issue, use telnet to connect to Memcached by running telnet localhost 11211 on the affected node.

  2. Run stats and search for curr_items and evictions to verify that the items were not removed before their TTL.

  3. Run stats items for further details on the status of the items.
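
The output of stats items can be long on a busy node. As a minimal sketch, the following groups the STAT items:<slab_class>:<name> <value> lines by slab class so that the item count and evictions per class are easy to compare. It works on text captured from the telnet session. The function name and the sample values are hypothetical.

```python
from collections import defaultdict


def parse_stats_items(raw):
    """Group 'STAT items:<class>:<name> <value>' lines by slab class."""
    classes = defaultdict(dict)
    for line in raw.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] != "STAT":
            continue
        key = parts[1].split(":")
        # Keep only well-formed per-class counters with numeric values.
        if len(key) == 3 and key[0] == "items" and key[1].isdigit() and parts[2].isdigit():
            classes[int(key[1])][key[2]] = int(parts[2])
    return dict(classes)


# Made-up output for two slab classes.
sample = (
    "STAT items:1:number 42\r\n"
    "STAT items:1:evicted 0\r\n"
    "STAT items:5:number 3\r\n"
    "STAT items:5:evicted 17\r\n"
    "END\r\n"
)
for slab_class, values in sorted(parse_stats_items(sample).items()):
    print(slab_class, values["number"], values["evicted"])
```

An empty reply, which consists of the END line only, produces an empty dictionary and confirms that the node holds no items.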

Tuning

Disable the alert as described in Manage alerts.

MemcachedEvictionsLimit

Severity

Warning

Summary

More than 10 evictions in the Memcached database occurred on the {{ $labels.host }} node during the last minute.

Raise condition

increase(memcached_evictions[1m]) > 10

Description

Raises when the number of Memcached items that were removed before their TTL expired has increased by more than 10 (default threshold) over the last minute. Memcached is used on the OpenStack controller nodes to cache the service authentication tokens. A high number of evictions indicates heavy token rotation, since old items must be removed to free space for new ones, based on pseudo-LRU. The host label in the raised alert contains the host name of the affected node.

Troubleshooting

  1. Use telnet to connect to Memcached by running telnet localhost 11211 on the affected node.

  2. Run stats slabs and search for total_pages, chunk_size, and chunks_per_page to verify whether the slabs consume too much space.
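
To judge slab usage from the stats slabs output, it can help to estimate the memory held by each slab class as total_pages * chunks_per_page * chunk_size. The sketch below does that on captured output. The function name and the sample values are hypothetical, and the estimate ignores per-page overhead.

```python
def slab_memory(raw):
    """Estimate bytes allocated per slab class from 'stats slabs' output."""
    fields = {}
    for line in raw.splitlines():
        parts = line.split()
        # Per-class lines look like 'STAT <class>:<name> <value>'.
        if len(parts) == 3 and parts[0] == "STAT" and ":" in parts[1]:
            slab_class, name = parts[1].split(":", 1)
            if slab_class.isdigit() and parts[2].isdigit():
                fields.setdefault(int(slab_class), {})[name] = int(parts[2])
    wanted = ("total_pages", "chunks_per_page", "chunk_size")
    return {
        slab_class: stats["total_pages"] * stats["chunks_per_page"] * stats["chunk_size"]
        for slab_class, stats in fields.items()
        if all(name in stats for name in wanted)
    }


# Made-up output for a single slab class holding two pages.
sample = (
    "STAT 1:chunk_size 96\r\n"
    "STAT 1:chunks_per_page 10922\r\n"
    "STAT 1:total_pages 2\r\n"
    "STAT active_slabs 1\r\n"
    "END\r\n"
)
print(slab_memory(sample))  # {1: 2097024}
```

Comparing the per-class totals with the memory limit configured for Memcached shows which slab classes hold most of the cache.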

Tuning

To change the evictions limit to 60:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step to use an existing defined file.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            MemcachedEvictionsLimit:
              if: >-
                increase(memcached_evictions[1m]) > 60
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.