
Troubleshoot Ceph alerts

This section describes the investigation and troubleshooting steps for Ceph services.


CephOSDSlowClusterNetwork

Root cause

Network issues slow down Ceph OSD heartbeats.

Investigation

  1. Inspect the network for latency issues on the affected subnet.

  2. Display the affected Ceph OSDs:

    ceph health detail
    
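  3. In the Ceph Manager daemon, collect a complete dump of heartbeats between Ceph OSDs. The 0 argument is the latency threshold in milliseconds, so every entry is included. If your active Ceph Manager has a different name, substitute it for mgr.a: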

    ceph tell mgr.a dump_osd_network 0
    
  4. Check whether the logs of the Ceph OSD daemons contain the "no reply" message, which indicates network connectivity issues. For example:

    debug 2024-10-13T12:10:15.728+0000 7fac2ebcc640 -1 osd.95 87508 \
    heartbeat_check: no reply from 10.202.54.6:6802 osd.33 ever on either front or back, \
    first ping sent 2024-10-13T12:09:55.472956+0000 (oldest deadline 2024-10-13T12:10:15.472956+0000)
    
    debug 2024-10-13T12:10:15.728+0000 7fac2ebcc640 -1 osd.95 87508 \
    heartbeat_check: no reply from 10.202.54.2:6834 osd.51 ever on either front or back, \
    first ping sent 2024-10-13T12:09:55.472956+0000 (oldest deadline 2024-10-13T12:10:15.472956+0000)
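
To list the affected peers from a larger log, the heartbeat_check messages can be reduced to the reporting Ceph OSD, the unresponsive peer, and the peer address. The following is a minimal sketch rather than product tooling: the no_reply_peers function name is hypothetical, and it assumes IPv4 addresses and the single-line log format shown above, without the backslashes that wrap the example.

```shell
# Hypothetical helper: read Ceph OSD log text on stdin and print one
# "reporter peer address" line per unique heartbeat_check "no reply" message.
# Assumes IPv4 addresses and one log message per line.
no_reply_peers() {
  sed -n 's/.*\(osd\.[0-9][0-9]*\) [0-9][0-9]* heartbeat_check: no reply from \([0-9.]*:[0-9][0-9]*\) \(osd\.[0-9][0-9]*\).*/\1 \3 \2/p' \
    | sort -u
}
```

Pipe the output of whichever command you use to read the Ceph OSD logs into no_reply_peers. For the two sample messages above, it prints osd.95 osd.33 10.202.54.6:6802 and osd.95 osd.51 10.202.54.2:6834.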
    

Mitigation

  1. Identify a running Ceph OSD that reports no reply from another Ceph OSD.

  2. Use SSH to connect to the node that hosts the running Ceph OSD.

  3. Run curl against the IP address and port of the unresponsive Ceph OSD, and ping the IP address alone. If either command fails, check for the following issues:

    • Missing routes. Verify that the L2Template on the related management cluster contains all required routes between the LCM network and the Ceph public and Ceph cluster networks.

    • Incorrect firewall rules. If all routes are in place but the ping or curl command still fails, check for firewall rules that conflict with the required connectivity.

    • Failed switch, NIC, or layer-1 network. Ask your network administrator to verify the related hardware and fix issues, if any.
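
The ping and curl checks from step 3 can be scripted as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the ping options are those of Linux iputils, and the address is IPv4 in the IP:PORT form printed by heartbeat_check. A Ceph OSD does not speak HTTP, so curl is used only to learn whether a TCP connection can be opened. The sketch relies on the documented curl exit codes 7 (failed to connect) and 28 (timeout) and treats other codes as a sign that the port answered.

```shell
# Hypothetical helper: print the host part of an "IP:PORT" address (IPv4 only).
peer_host() {
  printf '%s\n' "${1%:*}"
}

# Hypothetical helper: interpret a curl exit code as a TCP reachability result.
# 7 = failed to connect, 28 = operation timed out. Other codes usually mean
# the connection was opened but the reply was not valid HTTP, which is
# expected for a Ceph OSD port.
curl_rc_to_status() {
  case "$1" in
    7)  echo "connection failed" ;;
    28) echo "timed out" ;;
    *)  echo "port reachable" ;;
  esac
}

# Run both checks against one unresponsive peer, for example 10.202.54.6:6802.
check_osd_peer() {
  peer_addr="$1"
  host="$(peer_host "$peer_addr")"
  if ping -c 3 -W 2 "$host" >/dev/null 2>&1; then
    echo "ping $host: ok"
  else
    echo "ping $host: failed"
  fi
  curl --silent --output /dev/null --connect-timeout 5 --max-time 10 "http://$peer_addr"
  rc=$?
  echo "curl $peer_addr: $(curl_rc_to_status "$rc")"
}
```

Run check_osd_peer with the address of the unresponsive Ceph OSD on the node from step 2. A failed ping points to missing routes or hardware issues, while a successful ping combined with a failed or timed-out curl more often points to firewall rules.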

Reference

Ceph documentation: Network performance checks

CephOSDSlowPublicNetwork

Root cause

Network issues slow down Ceph OSD heartbeats.

Investigation and mitigation

Refer to the Investigation and Mitigation sections of the CephOSDSlowClusterNetwork alert.