Restore Tungsten Fabric data

This section describes how to restore the Cassandra and ZooKeeper databases from the backups created either automatically or manually as described in Back up TF databases.

Caution

The data backup must be consistent across all systems because the state of the Tungsten Fabric databases is associated with other system databases, such as OpenStack databases.

Automatically restore the data

  1. Verify that the cluster does not contain a tfdbrestore object. If one remains from a previous restoration, delete it using the command that matches the API group of the tfdbrestores resource in your cluster:

    kubectl -n tf delete tfdbrestores.tf.mirantis.com tf-dbrestore
    
    kubectl -n tf delete tfdbrestores.tf-dbrestore.tf.mirantis.com tf-dbrestore
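
    After deleting the object, you can confirm that no tfdbrestore resources remain by listing them, for example (adjust the API group to match the delete command you used):

    kubectl -n tf get tfdbrestores.tf.mirantis.com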
    
  2. Edit the TFOperator custom resource to enable the data restoration mode. Depending on the API version of the TFOperator custom resource in your deployment, use the features or settings section (see the example after the note below for one way to apply this change):

    spec:
      features:
        dbRestoreMode:
          enabled: true
    
    spec:
      settings:
        dbRestoreMode:
          enabled: true
    

    Warning

    When restoring the data, MOSK stops the Tungsten Fabric services and recreates the database backends that include Cassandra, Kafka, and ZooKeeper.

    Note

    The automated restoration process relies on automated database backups configured by the Tungsten Fabric Operator. The Tungsten Fabric data is restored from the backup type specified in the tf-dbBackup section of the Tungsten Fabric Operator custom resource, or the default pvc type if not specified. For the configuration details, refer to Periodic Tungsten Fabric database backups.
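
    For example, assuming that a single TFOperator object exists in the tf namespace (the resource plural tfoperators and the object name may differ depending on the release), you can apply the change as follows; use the settings section instead of features if your cluster uses the older TFOperator API:

    kubectl -n tf get tfoperators
    kubectl -n tf edit tfoperators <TFOPERATOR_CR_NAME>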

  3. Optional. Specify the name of the backup to use through the dbDumpName parameter. By default, the latest db-dump is used.

    spec:
      features:
        dbRestoreMode:
          enabled: true
          dbDumpName: db-dump-20220111-110138.json
    
    spec:
      settings:
        dbRestoreMode:
          enabled: true
          dbDumpName: db-dump-20220111-110138.json
    
  4. To monitor the restoration status and stage, inspect the events recorded for the tfdbrestore object:

    kubectl -n tf describe tfdbrestores.tf.mirantis.com tf-dbrestore
    
    kubectl -n tf describe tfdbrestores.tf-dbrestore.tf.mirantis.com
    

    Example of a system response:

    ...
    Status:
       Health:  Ready
    Events:
       Type    Reason                       Age                From          Message
       ----    ------                       ----               ----          -------
       Normal  TfDaemonSetsDeleted          18m (x4 over 18m)  tf-dbrestore  TF DaemonSets were deleted
       Normal  zookeeperOperatorScaledDown  18m                tf-dbrestore  zookeeper operator scaled to 0
       Normal  zookeeperStsScaledDown       18m                tf-dbrestore  tf-zookeeper statefulset scaled to 0
       Normal  cassandraOperatorScaledDown  17m                tf-dbrestore  cassandra operator scaled to 0
       Normal  cassandraStsScaledDown       17m                tf-dbrestore  tf-cassandra-config-dc1-rack1 statefulset scaled to 0
       Normal  cassandraStsPodsDeleted      16m                tf-dbrestore  tf-cassandra-config-dc1-rack1 statefulset pods deleted
       Normal  cassandraPVCDeleted          16m                tf-dbrestore  tf-cassandra-config-dc1-rack1 PVC deleted
       Normal  zookeeperStsPodsDeleted      16m                tf-dbrestore  tf-zookeeper statefulset pods deleted
       Normal  zookeeperPVCDeleted          16m                tf-dbrestore  tf-zookeeper PVC deleted
       Normal  kafkaOperatorScaledDown      16m                tf-dbrestore  kafka operator scaled to 0
       Normal  kafkaStsScaledDown           16m                tf-dbrestore  tf-kafka statefulset scaled to 0
       Normal  kafkaStsPodsDeleted          16m                tf-dbrestore  tf-kafka statefulset pods deleted
       Normal  AllOperatorsStopped          16m                tf-dbrestore  All 3rd party operator's stopped
       Normal  CassandraOperatorScaledUP    16m                tf-dbrestore  CassandraOperator  scaled to 1
       Normal  CassandraStsScaledUP         16m                tf-dbrestore  Cassandra statefulset scaled to 3
       Normal  CassandraPodsActive          12m                tf-dbrestore  Cassandra pods active
       Normal  ZookeeperOperatorScaledUP    12m                tf-dbrestore  Zookeeper Operator  scaled to 1
       Normal  ZookeeperStsScaledUP         12m                tf-dbrestore  Zookeeper Operator  scaled to 3
       Normal  ZookeeperPodsActive          12m                tf-dbrestore  Zookeeper pods  active
       Normal  DBRestoreFinished            12m                tf-dbrestore  TF db restore finished
       Normal  TFRestoreDisabled            12m                tf-dbrestore  TF Restore disabled
    

    Note

    If the restoration completed several hours ago, kubectl describe may no longer show the events. In this case, check the Status field and obtain the events using the following command:

    kubectl -n tf get events --field-selector involvedObject.name=tf-dbrestore
    
  5. After the restoration job completes, it can take around 15 minutes for the tf-control services to stabilize. If some pods remain in the CrashLoopBackOff status, restart them manually one by one:

    1. List the tf-control pods:

      kubectl -n tf get pods -l app=tf-control
      
    2. Verify that the new pods are successfully spawned.

    3. Verify that no vRouters are connected only to the tf-control pod that is going to be restarted.

    4. Restart the tf-control pods sequentially:

      kubectl -n tf delete pod tf-control-<hash>
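
      Between restarts, you can watch the pods to confirm that each replacement pod becomes Ready before you proceed to the next one, for example:

      kubectl -n tf get pods -l app=tf-control -w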
      

    When the restoration completes, MOSK automatically sets dbRestoreMode to false in the Tungsten Fabric Operator custom resource.
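
    To confirm that the parameter has been reset, you can inspect the TFOperator custom resource, for example (the resource plural and the location of the parameter may differ depending on the TFOperator API version):

    kubectl -n tf get tfoperators -o yaml | grep -A 2 dbRestoreMode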

  6. Delete the tfdbrestore object from the cluster so that you can perform the next restoration:

    kubectl -n tf delete tfdbrestores.tf.mirantis.com tf-dbrestore
    
    kubectl -n tf delete tfdbrestores.tf-dbrestore.tf.mirantis.com tf-dbrestore
    

Manually restore the data

  1. Obtain the config API image repository and tag:

    kubectl -n tf get tfconfig tf-config -o=jsonpath='{.spec.api.containers[?(@.name=="api")].image}'
    

    From the output, copy the entire image link.
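
    Optionally, store the image reference in a shell variable to reuse it when creating the restore pod later in this procedure (TF_CONFIG_API_IMAGE below is an example variable name):

    TF_CONFIG_API_IMAGE=$(kubectl -n tf get tfconfig tf-config -o=jsonpath='{.spec.api.containers[?(@.name=="api")].image}')
    echo ${TF_CONFIG_API_IMAGE}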

  2. Terminate the configuration and analytics services and stop the database changes associated with northbound APIs on all systems.

    Note

    The Tungsten Fabric Operator watches its related resources and keeps them updated and healthy. If any resource is deleted or changed, the Tungsten Fabric Operator automatically runs reconciliation to re-create the resource or revert the configuration to the required state. Therefore, the Tungsten Fabric Operator must not be running during the data restoration.

    1. Scale the tungstenfabric-operator deployment to 0 replicas:

      kubectl -n tf scale deploy tungstenfabric-operator --replicas 0
      
    2. Verify the number of replicas:

      kubectl -n tf get deploy tungstenfabric-operator
      

      Example of a positive system response:

      NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
      tungstenfabric-operator   0/0     0            0           10h
      
    3. Delete the Tungsten Fabric configuration and analytics DaemonSets:

      kubectl -n tf delete daemonset tf-config
      kubectl -n tf delete daemonset tf-config-db
      kubectl -n tf delete daemonset tf-analytics
      kubectl -n tf delete daemonset tf-analytics-snmp
      

      The Tungsten Fabric configuration pods should be automatically terminated.

    4. Verify that the Tungsten Fabric configuration pods are terminated:

      kubectl -n tf get pod -l app=tf-config
      kubectl -n tf get pod -l tungstenfabric=analytics
      kubectl -n tf get pod -l tungstenfabric=analytics-snmp
      

      Example of a positive system response:

      No resources found.
      
  3. Stop Kafka:

    1. Scale the kafka-operator deployment to 0 replicas:

      kubectl -n tf scale deploy kafka-operator --replicas 0
      
    2. Scale the tf-kafka StatefulSet to 0 replicas:

      kubectl -n tf scale sts tf-kafka --replicas 0
      
    3. Verify the number of replicas:

      kubectl -n tf get sts tf-kafka
      

      Example of a positive system response:

      NAME       READY   AGE
      tf-kafka   0/0     10h
      
  4. Stop and wipe the Cassandra database:

    1. Scale the cassandra-operator deployment to 0 replicas:

      kubectl -n tf scale deploy cassandra-operator --replicas 0
      
    2. Scale the tf-cassandra-config-dc1-rack1 StatefulSet to 0 replicas:

      kubectl -n tf scale sts tf-cassandra-config-dc1-rack1 --replicas 0
      
    3. Verify the number of replicas:

      kubectl -n tf get sts tf-cassandra-config-dc1-rack1
      

      Example of a positive system response:

      NAME                            READY   AGE
      tf-cassandra-config-dc1-rack1   0/0     10h
      
    4. Delete Persistent Volume Claims (PVCs) for the Cassandra configuration pods:

      kubectl -n tf delete pvc -l app=cassandracluster,cassandracluster=tf-cassandra-config
      

      Once PVCs are deleted, the related Persistent Volumes are automatically released. The release process takes approximately one minute.
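
      To confirm that the PVCs have been removed, you can rerun the selector from the delete command above; it should return no resources:

      kubectl -n tf get pvc -l app=cassandracluster,cassandracluster=tf-cassandra-config

      The same check applies to the ZooKeeper PVCs (label app=tf-zookeeper) that you delete in the next step.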

  5. Stop and wipe the ZooKeeper database:

    1. Scale the zookeeper-operator deployment to 0 replicas:

      kubectl -n tf scale deploy zookeeper-operator --replicas 0
      
    2. Scale the tf-zookeeper StatefulSet to 0 replicas:

      kubectl -n tf scale sts tf-zookeeper --replicas 0
      
    3. Verify the number of replicas:

      kubectl -n tf get sts tf-zookeeper
      

      Example of a positive system response:

      NAME           READY   AGE
      tf-zookeeper   0/0     10h
      
    4. Delete PVCs for the ZooKeeper configuration pods:

      kubectl -n tf delete pvc -l app=tf-zookeeper
      

      Once PVCs are deleted, the related Persistent Volumes are automatically released. The release process takes approximately one minute.

  6. Restore the number of replicas to run Cassandra and ZooKeeper and re-create the deleted PVCs:

    1. Restore the cassandra-operator deployment replicas:

      kubectl -n tf scale deploy cassandra-operator --replicas 1
      
    2. Restore the tf-cassandra-config-dc1-rack1 StatefulSet replicas:

      kubectl -n tf scale sts tf-cassandra-config-dc1-rack1 --replicas 3
      
    3. Verify that Cassandra pods have been created and are running:

      kubectl -n tf get pod -l app=cassandracluster,cassandracluster=tf-cassandra-config
      

      Example of a positive system response:

      NAME                              READY   STATUS    RESTARTS   AGE
      tf-cassandra-config-dc1-rack1-0   1/1     Running   0          4m43s
      tf-cassandra-config-dc1-rack1-1   1/1     Running   0          3m30s
      tf-cassandra-config-dc1-rack1-2   1/1     Running   0          2m6s
      
    4. Restore the zookeeper-operator deployment replicas:

      kubectl -n tf scale deploy zookeeper-operator --replicas 1
      
    5. Restore the tf-zookeeper StatefulSet replicas:

      kubectl -n tf scale sts tf-zookeeper --replicas 3
      
    6. Verify that ZooKeeper pods have been created and are running:

      kubectl -n tf get pod -l app=tf-zookeeper
      

      Example of a positive system response:

      NAME             READY   STATUS    RESTARTS   AGE
      tf-zookeeper-0   1/1     Running   0          3m23s
      tf-zookeeper-1   1/1     Running   0          2m56s
      tf-zookeeper-2   1/1     Running   0          2m20s
      
  7. Restore the data from the backup:

    Note

    Do not use the Tungsten Fabric API container that was used to create the backup file. In such a container, a session with the Cassandra and ZooKeeper databases is established as soon as the Tungsten Fabric API service starts, while the Tungsten Fabric configuration services must remain stopped during the restoration. Because the tools for the data backup and restore are available only in the Tungsten Fabric configuration API container, use the steps below to start a blind container based on the config-api image.

    1. Deploy a pod using the configuration API image obtained in the first step:

      Note

      Since MOSK 24.1, if your deployment uses the cql Cassandra driver, update the value of the CONFIGDB_CASSANDRA_DRIVER environment variable to cql.

      cat <<EOF | kubectl apply -f -
      apiVersion: v1
      kind: Pod
      metadata:
        annotations:
          kubernetes.io/psp: privileged
        labels:
          app: tf-restore-db
        name: tf-restore-db
        namespace: tf
      spec:
        containers:
          - name: api
            image: <PUT_LINK_TO_CONFIG_API_IMAGE_FROM_STEP_ABOVE>
            command:
              - sleep
              - infinity
            envFrom:
              - configMapRef:
                  name: tf-rabbitmq-cfgmap
              - configMapRef:
                  name: tf-zookeeper-cfgmap
              - configMapRef:
                  name: tf-cassandra-cfgmap
              - configMapRef:
                  name: tf-services-cfgmap
              - secretRef:
                  name: tf-os-secret
            env:
            - name: CONFIGDB_CASSANDRA_DRIVER
              value: thrift
            imagePullPolicy: Always
        nodeSelector:
          tfcontrol: enabled
        dnsPolicy: ClusterFirstWithHostNet
        enableServiceLinks: true
        hostNetwork: true
        priority: 0
        restartPolicy: Always
        serviceAccount: default
        serviceAccountName: default
      EOF
      

      If you use the backup that was created automatically, extend the YAML file content above with the following configuration:

      ...
      spec:
        containers:
        - name: api
          volumeMounts:
            - mountPath: </PATH/TO/MOUNT>
              name: <TF-DBBACKUP-VOL-NAME>
        volumes:
          - name: <TF-DBBACKUP-VOL-NAME>
            persistentVolumeClaim:
              claimName: <TF-DBBACKUP-PVC-NAME>
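
      The names of the backup volume and PVC depend on your backup configuration. To identify the PVC created for the automated backups, you can list the PVCs in the tf namespace, for example:

      kubectl -n tf get pvc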
      
    2. Copy the database dump to the container:

      Warning

      Skip this step if you use the auto-backup and have provided the volume definition as described above.

      kubectl cp <PATH_TO_DB_DUMP> tf/tf-restore-db:/tmp/db-dump.json
      
    3. Copy the contrail-api.conf file to the container:

      kubectl cp <PATH-TO-CONFIG> tf/tf-restore-db:/tmp/contrail-api.conf
      
    4. Open an interactive shell in the tf-restore-db container:

      kubectl -n tf exec -it tf-restore-db -- bash
      
    5. Restore the Cassandra database from the backup:

      (config-api) $ cd /usr/lib/python2.7/site-packages/cfgm_common
      (config-api) $ python db_json_exim.py --import-from /tmp/db-dump.json --api-conf /tmp/contrail-api.conf
      
    6. Delete the restore container:

      kubectl -n tf delete pod tf-restore-db
      
  8. Restore the number of replicas to run Kafka:

    1. Restore the kafka-operator deployment replicas:

      kubectl -n tf scale deploy kafka-operator --replicas 1
      

      The Kafka operator should automatically restore the number of replicas of the appropriate StatefulSet.

    2. Verify the number of replicas:

      kubectl -n tf get sts tf-kafka
      

      Example of a positive system response:

      NAME       READY   AGE
      tf-kafka   3/3     10h
      
  9. Run the Tungsten Fabric Operator to restore the Tungsten Fabric configuration and analytics services:

    1. Restore the replicas of the Tungsten Fabric Operator Deployment:

      kubectl -n tf scale deploy tungstenfabric-operator --replicas 1
      
    2. Verify that the Tungsten Fabric Operator is running properly without any restarts:

      kubectl -n tf get pod -l name=tungstenfabric-operator
      
    3. Verify that the configuration pods have been automatically started:

      kubectl -n tf get pod -l app=tf-config
      kubectl -n tf get pod -l tungstenfabric=analytics
      kubectl -n tf get pod -l tungstenfabric=analytics-snmp
      
  10. Restart the tf-control services:

    Caution

    To avoid network downtime, do not restart all pods simultaneously.

    1. List the tf-control pods:

      kubectl -n tf get pods -l app=tf-control
      
    2. Restart the tf-control pods one by one.

      Caution

      Before restarting the tf-control pods:

      • Verify that the new pods are successfully spawned.

      • Verify that no vRouters are connected only to the tf-control pod that is going to be restarted.

      kubectl -n tf delete pod tf-control-<hash>