Ceph Monitors recovery
This section describes how to recover failed Ceph Monitors of an existing Ceph cluster in the following state:
The Ceph cluster contains failed Ceph Monitors that cannot start and hang in the Error or CrashLoopBackOff state.
The logs of the failed Ceph Monitor pods contain the following lines:
mon.g does not exist in monmap, will attempt to join an existing cluster
...
mon.g@-1(???) e11 not in monmap and have been in a quorum before; must have been removed
mon.g@-1(???) e11 commit suicide!
The Ceph cluster contains at least one Running Ceph Monitor, and the ceph -s command outputs one healthy mon and one healthy mgr instance.
Unless stated otherwise, perform the following steps for all failed Ceph Monitors at once.
To recover failed Ceph Monitors:
Obtain and export the kubeconfig of the affected cluster.
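How you obtain the kubeconfig depends on your environment. Once the file is available locally, export it so that kubectl targets the affected cluster; the path below is a placeholder, and the kubectl get nodes call is only a quick sanity check:
export KUBECONFIG=<pathToKubeconfigFile>
kubectl get nodes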
Scale the rook-ceph/rook-ceph-operator deployment down to 0 replicas:
kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 0
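Optionally, verify that the operator has actually scaled down before you continue. This check is an addition to the original procedure; the deployment must report 0 ready replicas:
kubectl -n rook-ceph get deploy rook-ceph-operator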
Delete all failed Ceph Monitor deployments:
Identify the Ceph Monitor pods in the Error or CrashLoopBackOff state:
kubectl -n rook-ceph get pod -l 'app in (rook-ceph-mon,rook-ceph-mon-canary)'
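The output resembles the following. The listing is illustrative only and reuses the example pod name from this section; actual names, restart counts, and ages will differ:
NAME                               READY   STATUS             RESTARTS   AGE
rook-ceph-mon-g-845d44b9c6-fjc5d   0/1     CrashLoopBackOff   7          15m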
Verify that the affected pods contain the failure logs described above:
kubectl -n rook-ceph logs <failedMonPodName>
Substitute <failedMonPodName> with the Ceph Monitor pod name. For example, rook-ceph-mon-g-845d44b9c6-fjc5d.
Save the identifying letters of the failed Ceph Monitors for further usage, that is, the letter that follows rook-ceph-mon- in the pod name. For example, f, e, and so on.
Delete all corresponding deployments of these pods:
Identify the affected Ceph Monitor pod deployments:
kubectl -n rook-ceph get deploy -l 'app in (rook-ceph-mon,rook-ceph-mon-canary)'
Delete the affected Ceph Monitor pod deployments. For example, if the Ceph cluster has the rook-ceph-mon-c-845d44b9c6-fjc5d pod in the CrashLoopBackOff state, remove the corresponding rook-ceph-mon-c deployment:
kubectl -n rook-ceph delete deploy rook-ceph-mon-c
Canary mon deployments have the suffix -canary.
Remove all corresponding entries of Ceph Monitors from the MON map:
Enter the ceph-tools pod:
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l \
app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') bash
Inspect the current MON map and save the IP addresses of the failed Ceph Monitors for further usage:
ceph mon dump
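In the dump, each Ceph Monitor appears on a numbered line that contains its address and its letter after mon.. The snippet below is illustrative and trimmed; header fields and address formats vary between Ceph releases, and the IP addresses are taken from the ConfigMap example later in this section:
3: [v2:172.16.52.216:3300/0,v1:172.16.52.216:6789/0] mon.e
4: [v2:172.16.52.217:3300/0,v1:172.16.52.217:6789/0] mon.f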
Remove all entries of failed Ceph Monitors using the previously saved letters:
ceph mon rm <monLetter>
Substitute <monLetter> with the corresponding letter of a failed Ceph Monitor.
Exit the ceph-tools pod.
Remove all failed Ceph Monitor entries from the Rook mon endpoints ConfigMap:
Open the rook-ceph/rook-ceph-mon-endpoints ConfigMap for editing:
kubectl -n rook-ceph edit cm rook-ceph-mon-endpoints
Remove all entries of failed Ceph Monitors from the ConfigMap data and update the maxMonId value with the current number of Running Ceph Monitors. For example, rook-ceph-mon-endpoints has the following data:
data:
  csi-cluster-config-json: '[{"clusterID":"rook-ceph","monitors":["172.0.0.222:6789","172.0.0.223:6789","172.0.0.224:6789","172.16.52.217:6789","172.16.52.216:6789"]}]'
  data: a=172.0.0.222:6789,b=172.0.0.223:6789,c=172.0.0.224:6789,f=172.0.0.217:6789,e=172.0.0.216:6789
  mapping: '{"node":{
    "a":{"Name":"kaas-node-21465871-42d0-4d56-911f-7b5b95cb4d34","Hostname":"kaas-node-21465871-42d0-4d56-911f-7b5b95cb4d34","Address":"172.16.52.222"},
    "b":{"Name":"kaas-node-43991b09-6dad-40cd-93e7-1f02ed821b9f","Hostname":"kaas-node-43991b09-6dad-40cd-93e7-1f02ed821b9f","Address":"172.16.52.223"},
    "c":{"Name":"kaas-node-15225c81-3f7a-4eba-b3e4-a23fd86331bd","Hostname":"kaas-node-15225c81-3f7a-4eba-b3e4-a23fd86331bd","Address":"172.16.52.224"},
    "e":{"Name":"kaas-node-ba3bfa17-77d2-467c-91eb-6291fb219a80","Hostname":"kaas-node-ba3bfa17-77d2-467c-91eb-6291fb219a80","Address":"172.16.52.216"},
    "f":{"Name":"kaas-node-6f669490-f0c7-4d19-bf73-e51fbd6c7672","Hostname":"kaas-node-6f669490-f0c7-4d19-bf73-e51fbd6c7672","Address":"172.16.52.217"}}
    }'
  maxMonId: "5"
If e and f are the letters of the failed Ceph Monitors, the resulting ConfigMap data must be as follows:
data:
  csi-cluster-config-json: '[{"clusterID":"rook-ceph","monitors":["172.0.0.222:6789","172.0.0.223:6789","172.0.0.224:6789"]}]'
  data: a=172.0.0.222:6789,b=172.0.0.223:6789,c=172.0.0.224:6789
  mapping: '{"node":{
    "a":{"Name":"kaas-node-21465871-42d0-4d56-911f-7b5b95cb4d34","Hostname":"kaas-node-21465871-42d0-4d56-911f-7b5b95cb4d34","Address":"172.16.52.222"},
    "b":{"Name":"kaas-node-43991b09-6dad-40cd-93e7-1f02ed821b9f","Hostname":"kaas-node-43991b09-6dad-40cd-93e7-1f02ed821b9f","Address":"172.16.52.223"},
    "c":{"Name":"kaas-node-15225c81-3f7a-4eba-b3e4-a23fd86331bd","Hostname":"kaas-node-15225c81-3f7a-4eba-b3e4-a23fd86331bd","Address":"172.16.52.224"}}
    }'
  maxMonId: "3"
Back up the data of the failed Ceph Monitors one by one:
SSH to the node of a failed Ceph Monitor using the previously saved IP address.
Move the Ceph Monitor data directory to another place:
mv /var/lib/rook/mon-<letter> /var/lib/rook/mon-<letter>.backup
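For example, if Ceph Monitor e is one of the failed instances, as in the ConfigMap example above, the command becomes:
mv /var/lib/rook/mon-e /var/lib/rook/mon-e.backup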
Close the SSH connection.
Scale the rook-ceph/rook-ceph-operator deployment up to 1 replica:
kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 1
Wait until all Ceph Monitors are in the Running state:
kubectl -n rook-ceph get pod -l app=rook-ceph-mon -w
Restore the data from the backup for each recovered Ceph Monitor one by one:
Enter a recovered Ceph Monitor pod:
kubectl -n rook-ceph exec -it <monPodName> bash
Substitute <monPodName> with the recovered Ceph Monitor pod name. For example, rook-ceph-mon-g-845d44b9c6-fjc5d.
Recover the mon data backup for the current Ceph Monitor:
ceph-monstore-tool /var/lib/rook/mon-<letter>.backup/data store-copy /var/lib/rook/mon-<letter>/data/
Substitute <letter> with the current Ceph Monitor pod letter, for example, e.
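Continuing the same example with the letter e, the fully substituted command run inside the pod is:
ceph-monstore-tool /var/lib/rook/mon-e.backup/data store-copy /var/lib/rook/mon-e/data/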
Verify the Ceph state. The output must indicate the desired number of Ceph Monitors and all of them must be in quorum.
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- ceph -s
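As an additional check beyond the original procedure, you can also query quorum membership directly; the quorum_names field of the output must list every recovered Ceph Monitor:
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- ceph quorum_status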