Troubleshoot Ceph alerts

This section describes how to investigate and mitigate the alerts raised for Ceph services.


CephOSDSlowClusterNetwork

Root cause

Network issues slow down Ceph OSD heartbeats on the Ceph cluster network.

Investigation

  1. Inspect the network for latency issues on the affected subnet.

  2. Display the affected Ceph OSDs:

    ceph health detail
    
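  3. From the Ceph Manager daemon, collect a complete dump of heartbeat ping times between Ceph OSDs. The trailing 0 is the threshold in milliseconds; a threshold of 0 returns all collected entries: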

    ceph tell mgr.a dump_osd_network 0
    
  4. Check whether the logs of the Ceph OSD daemons contain "no reply" messages, which indicate network connectivity issues. For example:

    debug 2024-10-13T12:10:15.728+0000 7fac2ebcc640 -1 osd.95 87508 \
    heartbeat_check: no reply from 10.202.54.6:6802 osd.33 ever on either front or back, \
    first ping sent 2024-10-13T12:09:55.472956+0000 (oldest deadline 2024-10-13T12:10:15.472956+0000)
    
    debug 2024-10-13T12:10:15.728+0000 7fac2ebcc640 -1 osd.95 87508 \
    heartbeat_check: no reply from 10.202.54.2:6834 osd.51 ever on either front or back, \
    first ping sent 2024-10-13T12:09:55.472956+0000 (oldest deadline 2024-10-13T12:10:15.472956+0000)
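
As a minimal sketch of step 4, the following shell function condenses "no reply" entries into one line per pair of Ceph OSDs. It assumes that each log entry arrives on a single line (the example above is wrapped for readability) and that peers use IPv4 addresses. The function name summarize_no_reply and the file name in the usage comment are hypothetical and not part of any Ceph or MOSK tooling.

```shell
# Read Ceph OSD log text on stdin and print one line per failing pair:
#   <reporting OSD> <unresponsive OSD> <unresponsive IP:PORT>
# Assumption: every heartbeat_check entry is a single line with an IPv4 peer.
summarize_no_reply() {
  sed -n 's/.* \(osd\.[0-9][0-9]*\) [0-9][0-9]* heartbeat_check: no reply from \([0-9.][0-9.]*:[0-9][0-9]*\) \(osd\.[0-9][0-9]*\).*/\1 \3 \2/p' |
    LC_ALL=C sort -u   # drop repeated reports of the same pair
}

# Usage (osd.95.log is a hypothetical file that holds the OSD log):
#   summarize_no_reply < osd.95.log
```

The third column is the address and port to probe from the reporting side, as described in the Mitigation section.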
    

Mitigation

  1. Find an available Ceph OSD whose log reports "no reply" from another Ceph OSD.

  2. Log in over SSH to the node that hosts the available Ceph OSD.

  3. Run curl against the IP address and port of the unresponsive Ceph OSD, and ping the IP address alone. If either command fails, check for the following issues:

    • Missing routes. Verify that the L2Template on the related management cluster contains all required routes between the LCM, Ceph public, and Ceph cluster networks.

    • Incorrect firewall rules. If all routes are in place but the ping or curl command still fails, check for firewall rules that conflict with the required connectivity.

    • Failed switch, NIC, or layer-1 network. Ask your network administrator to verify the related hardware and fix issues, if any.
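
The reachability checks in step 3 can be sketched as a small shell function. This is an illustration under assumptions, not a MOSK tool: the name check_osd_peer is hypothetical, the ping options assume Linux iputils, and curl is used only to see whether a TCP connection to the Ceph OSD port can be established, so any HTTP-level error after connecting is ignored.

```shell
# Probe an unresponsive Ceph OSD given as "IP:PORT", for example the address
# printed in a heartbeat_check log line. Reports ICMP and TCP results separately.
check_osd_peer() {
  peer=$1
  ip=${peer%:*}      # everything before the last ":"
  port=${peer##*:}   # everything after the last ":"

  if ping -c 3 -W 2 "$ip" >/dev/null 2>&1; then
    echo "ping $ip: ok"
  else
    echo "ping $ip: FAILED"
  fi

  # The OSD does not speak HTTP, so only the connect phase matters here.
  # curl exit code 7 means the connection failed; 28 means it timed out.
  curl -s -o /dev/null --connect-timeout 5 "http://$ip:$port"
  rc=$?
  case $rc in
    7|28) echo "tcp $ip:$port: FAILED (curl exit $rc)" ;;
    *)    echo "tcp $ip:$port: connected (curl exit $rc)" ;;
  esac
}

# Usage, run on the node that hosts the available Ceph OSD:
#   check_osd_peer 10.202.54.6:6802
```

A failed ping together with a failed TCP check points toward missing routes or hardware, while a successful ping with a failed TCP check is more consistent with a firewall rule blocking the port.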

Reference

Ceph documentation: Network performance checks

CephOSDSlowPublicNetwork

Root cause

Network issues slow down Ceph OSD heartbeats on the Ceph public network.

Investigation and mitigation

Refer to the Investigation and Mitigation sections of the CephOSDSlowClusterNetwork alert.