Migrate Ceph pools from one failure domain to another¶
This document describes how to change the failure domain of an already deployed Ceph cluster.
Note
This document focuses on changing the failure domain from a smaller to a wider one, for example, from host to rack. Using the same instruction, you can also move the failure domain from a wider to a smaller scale.
Caution
Data movement implies Ceph cluster rebalancing, which may impact cluster performance depending on the cluster size.
The high-level overview of the procedure includes the following steps:
Set correct labels on the nodes.
Create the new bucket hierarchy.
Move nodes to new buckets.
Modify the CRUSH rules.
Add the manual changes to the KaaSCephCluster spec.
Scale the Ceph controllers.
Prerequisites¶
Verify that the Ceph cluster has enough space for multiple copies of data to migrate. Mirantis highly recommends that the Ceph cluster has a minimum of 25% of free space for the procedure to succeed.
Note
The migration procedure implies data movement and optional modification of CRUSH rules, which causes a large amount of data (depending on the cluster size) to first be copied to a new location in the Ceph cluster before the old data is removed.
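For reference, you can check the current utilization from the rook-ceph-tools pod before starting, for example:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df
# The global AVAIL/USED figures and the per-pool MAX AVAIL values indicate whether at least 25% of capacity is free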
Create a backup of the current KaaSCephCluster object from the managed namespace of the management cluster:
kubectl -n <managedClusterProject> get kaascephcluster -o yaml > kcc-backup.yaml
Substitute <managedClusterProject> with the corresponding managed cluster namespace of the management cluster.
In the rook-ceph-tools pod on a managed cluster, obtain a backup of the CRUSH map:
ceph osd getcrushmap -o /tmp/crush-map-orig
crushtool -d /tmp/crush-map-orig -o /tmp/crush-map-orig.txt
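As a recovery hint that is not part of the standard procedure: if you later need to roll back manual CRUSH changes, the saved binary map can be reapplied from the same pod. Use it with care because it reverts all CRUSH changes made after the backup:
ceph osd setcrushmap -i /tmp/crush-map-orig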
Migrate Ceph pools¶
This procedure contains an example of moving the failure domain of all pools from host to rack. Using the same instruction, you can migrate pools between other types of failure domains, migrate pools separately, and so on.
To migrate Ceph pools from one failure domain to another:
Set the required CRUSH topology in the KaaSCephCluster object for each defined node. For details on the crush parameter, see Node parameters. Setting the CRUSH topology for each node causes the Ceph Controller to set the proper Kubernetes labels on the nodes.
Example of adding the rack CRUSH topology key for each node in the nodes section:
spec:
  cephClusterSpec:
    nodes:
      machine1:
        crush:
          rack: rack-1
      machine2:
        crush:
          rack: rack-1
      machine3:
        crush:
          rack: rack-2
      machine4:
        crush:
          rack: rack-2
      machine5:
        crush:
          rack: rack-3
      machine6:
        crush:
          rack: rack-3
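To confirm that the Ceph Controller propagated the topology, you can check the node labels. The exact label key depends on the product version; Rook commonly consumes topology.rook.io/rack or topology.kubernetes.io/* keys, so a broad filter such as the following is usually enough:
kubectl get nodes --show-labels | grep -i rack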
On the managed cluster, verify that the required buckets and bucket types are present in the Ceph hierarchy:
Enter the ceph-tools pod:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
Verify that the required bucket type is present by default:
ceph osd getcrushmap -o /tmp/crush-map
crushtool -d /tmp/crush-map -o /tmp/crush-map.txt
cat /tmp/crush-map.txt
# Look for the section named "# types"
Example of system response:
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root
Verify that the buckets with the required bucket type are present:
cat /tmp/crush-map.txt
# Look for the section named "# buckets"
Example of system response of an existing rack bucket:
# buckets
rack rack-1 {
    id -15
    id -16 class hdd
    # weight 0.00000
    alg straw2
    hash 0
}
If the required buckets are not created, create new ones with the required bucket type:
ceph osd crush add-bucket <bucketName> <bucketType> root=default
For example:
ceph osd crush add-bucket rack-1 rack root=default
ceph osd crush add-bucket rack-2 rack root=default
ceph osd crush add-bucket rack-3 rack root=default
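To verify the result, you can list the CRUSH hierarchy and check that the new rack buckets appear under the default root, for example:
ceph osd tree
# The new rack buckets are expected to show a zero weight until hosts are moved into them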
Exit the ceph-tools pod.
Optional. Order buckets as required:
On the managed cluster, add the first Ceph CRUSH smaller bucket to its respective wider bucket:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
ceph osd crush move <smallerBucketName> <bucketType>=<widerBucketName>
Substitute the following parameters:
<smallerBucketName> with the name of the smaller bucket, for example, the host name
<bucketType> with the required bucket type, for example, rack
<widerBucketName> with the name of the wider bucket, for example, the rack name
For example:
ceph osd crush move kaas-node-1 rack=rack-1 root=default
Warning
Mirantis highly recommends moving one bucket at a time.
For more details, refer to the official Ceph documentation: CRUSH Maps: Moving a bucket.
After the bucket is moved to the new location in the CRUSH hierarchy, verify that no data rebalancing occurs:
ceph -s
Caution
Wait for rebalancing to complete before proceeding to the next step.
Add the remaining Ceph CRUSH smaller buckets to their respective wider buckets one by one.
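For example, assuming the hypothetical node names kaas-node-2 through kaas-node-6 and the rack layout from the nodes example above, the remaining moves could look as follows, executed one at a time with a rebalancing check between the commands:
ceph osd crush move kaas-node-2 rack=rack-1 root=default
ceph osd crush move kaas-node-3 rack=rack-2 root=default
ceph osd crush move kaas-node-4 rack=rack-2 root=default
ceph osd crush move kaas-node-5 rack=rack-3 root=default
ceph osd crush move kaas-node-6 rack=rack-3 root=default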
Scale the Ceph Controller and Rook Operator deployments to 0 replicas:
kubectl -n ceph-lcm-mirantis scale deploy --all --replicas 0
kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 0
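To make sure that no controller reverts the manual changes, you can verify that the corresponding pods have terminated, for example:
kubectl -n ceph-lcm-mirantis get pods
kubectl -n rook-ceph get pods | grep rook-ceph-operator
# No Ceph Controller pods and no rook-ceph-operator pod are expected to be running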
On the managed cluster, manually modify the CRUSH rules for Ceph pools to enable data placement on a new failure domain:
Enter the ceph-tools pod:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
List the CRUSH rules and erasure code profiles for the pools:
ceph osd pool ls detail
Example output
pool 1 'mirablock-k8s-block-hdd' replicated size 2 min_size 1 crush_rule 9 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1193 lfor 0/0/85 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.31
pool 2 '.mgr' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 70 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 6.06
pool 3 'openstack-store.rgw.otp' replicated size 2 min_size 1 crush_rule 11 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 1197 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 2.27
pool 4 'openstack-store.rgw.meta' replicated size 2 min_size 1 crush_rule 12 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 1197 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 1.50
pool 5 'openstack-store.rgw.log' replicated size 2 min_size 1 crush_rule 10 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 1197 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 3.00
pool 6 'openstack-store.rgw.buckets.non-ec' replicated size 2 min_size 1 crush_rule 13 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 1197 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 1.50
pool 7 'openstack-store.rgw.buckets.index' replicated size 2 min_size 1 crush_rule 15 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 1197 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 2.25
pool 8 '.rgw.root' replicated size 2 min_size 1 crush_rule 14 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 1197 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 3.75
pool 9 'openstack-store.rgw.control' replicated size 2 min_size 1 crush_rule 16 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 1197 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 3.00
pool 10 'other-hdd' replicated size 2 min_size 1 crush_rule 19 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1179 lfor 0/0/85 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.69
pool 11 'openstack-store.rgw.buckets.data' erasure profile openstack-store.rgw.buckets.data_ecprofile size 3 min_size 2 crush_rule 18 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1198 lfor 0/0/86 flags hashpspool,ec_overwrites stripe_width 8192 application rook-ceph-rgw
pool 12 'vms-hdd' replicated size 2 min_size 1 crush_rule 21 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode on last_change 1182 lfor 0/0/95 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.4 application rbd read_balance_score 1.24
pool 13 'volumes-hdd' replicated size 2 min_size 1 crush_rule 23 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 1185 lfor 0/0/89 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.2 application rbd read_balance_score 1.31
pool 14 'backup-hdd' replicated size 2 min_size 1 crush_rule 25 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1188 lfor 0/0/90 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.1 application rbd read_balance_score 2.06
pool 15 'images-hdd' replicated size 2 min_size 1 crush_rule 27 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1191 lfor 0/0/90 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.1 application rbd read_balance_score 1.50
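The crush_rule field in this output is a numeric rule ID. To map it to a rule name before modifying anything, you can dump all rules and match rule_id with rule_name, for example:
ceph osd crush rule ls
ceph osd crush rule dump | grep -E 'rule_id|rule_name'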
For each replicated Ceph pool:
Obtain the current CRUSH rule name:
ceph osd crush rule dump <oldCrushRuleName>
Create a new CRUSH rule that uses the same root and device class but the new bucket type as the failure domain:
ceph osd crush rule create-replicated <newCrushRuleName> <root> <bucketType> <deviceClass>
For example:
ceph osd crush rule create-replicated images-hdd-rack default rack hdd
For more details, refer to official Ceph documentation: CRUSH Maps: Creating a rule for a replicated pool.
Apply the new CRUSH rule to the Ceph pool:
ceph osd pool set <poolName> crush_rule <newCrushRuleName>
For example:
ceph osd pool set images-hdd crush_rule images-hdd-rack
Wait for data to be rebalanced after moving the Ceph pool under the new failure domain (bucket type) by monitoring Ceph health:
ceph -s
Caution
Update the next Ceph pool only after data rebalancing completes for the current Ceph pool.
Verify that the old CRUSH rule is not used anymore:
ceph osd pool ls detail
The rule IDs are defined in the CRUSH map; verify that the old rule ID no longer appears as crush_rule in the output of ceph osd dump or ceph osd pool ls detail.
Remove the old unused CRUSH rule and rename the new one to the original name:
ceph osd crush rule rm <oldCrushRuleName>
ceph osd crush rule rename <newCrushRuleName> <oldCrushRuleName>
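Continuing the images-hdd example above, and assuming the original rule carries the pool name (the usual Rook naming convention, so verify it in your cluster), the cleanup could look as follows:
ceph osd crush rule rm images-hdd
ceph osd crush rule rename images-hdd-rack images-hdd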
For each erasure-coded Ceph pool:
Note
Erasure-coded pools require a different number of buckets to store data. Instead of the number of replicas used by replicated pools, erasure-coded pools require as many buckets existing in the Ceph cluster as the sum of coding chunks + data chunks. For example, if an erasure-coded pool has 2 coding chunks and 2 data chunks configured, the pool requires 4 different buckets, for example, 4 racks, to store data.
Obtain the current parameters of the erasure-coded profile:
ceph osd erasure-code-profile get <ecProfile>
In the profile, add the new bucket type as the failure domain using the crush-failure-domain parameter:
ceph osd erasure-code-profile set <ecProfile> k=<int> m=<int> crush-failure-domain=<bucketType> crush-device-class=<deviceClass>
Create a new CRUSH rule based on the profile:
ceph osd crush rule create-erasure <newEcCrushRuleName> <ecProfile>
Apply the new CRUSH rule to the pool:
ceph osd pool set <poolName> crush_rule <newEcCrushRuleName>
Wait for data to be rebalanced after moving the Ceph pool under the new failure domain (bucket type) by monitoring Ceph health:
ceph -s
Caution
Update the next Ceph pool only after data rebalancing completes for the current Ceph pool.
Verify that the old CRUSH rule is not used anymore:
ceph osd pool ls detail
The rule IDs are defined in the CRUSH map; verify that the old rule ID no longer appears as crush_rule in the output of ceph osd dump or ceph osd pool ls detail.
Remove the old unused CRUSH rule and rename the new one to the original name:
ceph osd crush rule rm <oldCrushRuleName>
ceph osd crush rule rename <newCrushRuleName> <oldCrushRuleName>
Note
New erasure-coded profiles cannot be renamed, so they will not be removed automatically during pool cleanup. Remove them manually, if needed.
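As an illustrative sketch only, the sequence for the openstack-store.rgw.buckets.data pool from the example output above could look as follows. The profile name with the _rack suffix, the k and m values, and the device class are placeholders and must match the values returned by the get command for the existing profile:
ceph osd erasure-code-profile get openstack-store.rgw.buckets.data_ecprofile
ceph osd erasure-code-profile set openstack-store.rgw.buckets.data_ecprofile_rack k=2 m=1 crush-failure-domain=rack crush-device-class=hdd
# If you overwrite the existing profile instead of creating a new one, Ceph requires the --force flag
ceph osd crush rule create-erasure openstack-store.rgw.buckets.data-rack openstack-store.rgw.buckets.data_ecprofile_rack
ceph osd pool set openstack-store.rgw.buckets.data crush_rule openstack-store.rgw.buckets.data-rack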
Exit the ceph-tools pod.
In the management cluster, update the KaaSCephCluster object by setting the failureDomain: rack parameter for each pool. The configuration from the Rook perspective must match the manually created configuration. For example:
spec:
  cephClusterSpec:
    pools:
    - name: images
      ...
      failureDomain: rack
    - name: volumes
      ...
      failureDomain: rack
    ...
    objectStorage:
      rgw:
        dataPool:
          failureDomain: rack
          ...
        metadataPool:
          failureDomain: rack
          ...
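One way to apply this change is to edit the object directly in the managed cluster namespace of the management cluster; <clusterName> below is a placeholder for the name of your KaaSCephCluster object:
kubectl -n <managedClusterProject> edit kaascephcluster <clusterName>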
Monitor the Ceph cluster health and wait until rebalancing is completed:
ceph -s
Example of a successful system response:
HEALTH_OK
Scale back the Ceph Controller and Rook Operator deployments:
kubectl -n ceph-lcm-mirantis scale deploy --all --replicas 3
kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 1
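As a final check, you can verify that the controllers are running again and that the Ceph cluster stays healthy, for example:
kubectl -n ceph-lcm-mirantis get pods
kubectl -n rook-ceph get pods | grep rook-ceph-operator
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph -s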