Designing a Disaster Recovery Strategy

Introduction

Many organizations rely on business-critical processes to do business. When a critical process is disrupted, the alarms go off and an emergency process is initiated to remediate the issue and restore business continuity. This emergency process is known as a disaster recovery (DR) process/plan/roadmap/runbook/etc. The complexity and sophistication of the plan can vary greatly depending on the system it is designed for. It is standard practice to design a DR plan following the K.I.S.S. principle (Keep it simple, stupid). In other words, the plan should be easy to follow so that it can be executed without requiring an expert who may be unavailable at the time.

What You Will Learn

Containers are expected to be ephemeral, which makes them well suited for microservice-oriented architectures, but it also raises multiple questions that need to be pondered in order to design an adequate disaster recovery plan for a containerized application or service. This reference architecture aims to provoke thought around various scenarios that can disrupt the operation of an application/service or even the entire platform, and provides some examples of topics that one may consider when building a disaster recovery plan. Some of the topics we’ll discuss are:

  • cluster DR
  • what you should back up
  • why you should care about data
  • traffic routing
  • stack/deployment vs. cluster DR
  • active/active vs. active/passive considerations

This list can go on and on, from the top of the application stack all the way down to the hardware that crunches zeros and ones. However, the main objective is to provide a few examples in order to encourage thinking about a DR plan for a containerized application from different angles.

Abbreviations

Throughout this article, the terms application, app, and service are used interchangeably.

Abbreviation  Description
MKE           Mirantis Kubernetes Engine
MSR           Mirantis Secure Registry
DCT           Docker Content Trust
DE            Docker Enterprise
RBAC          Role-Based Access Control
CI            Continuous Integration
CD            Continuous Deployment
HA            High Availability
DR            Disaster Recovery

What is Disaster Recovery

Disaster Recovery is an umbrella plan that encompasses ideas, methods, and techniques to minimize the time needed to restore a disrupted system/application/service/etc. Depending on the complexity of the system/application/service, the DR plan can span from a short list of instructions to a bundle of documentation, checklists, scripts, runbooks, etc.

When a DR plan is required

The main goal of a DR plan is to restore business continuity as fast as possible. This can mean different things depending on the environment and the part of the business that is affected. Production systems/applications have high visibility and typically include a DR plan to restore their operation. While lower-level environments (e.g. Staging/Test/Dev/etc.) could be assumed to be less critical, they may still be very important to the operation of the business.

Unless your Ops team is ready to own full automation of every change (i.e. app release, platform config change, etc.) applied to any environment, your change flow would likely look somewhat similar to this path:

developer workstation -> Dev env -> Test env -> Integration env -> Staging env -> Production env

In this example, applying a change means moving it through the lower-level environments all the way to production. A disruption at any step in the path can slow down deployment and therefore delay the recovery of normal business operation.

Here, having a DR plan to restore a lower-level environment helps speed up resolution of the issue. It can be as simple as restoring the environment to the last known good state, for example a plain restore from a backup.

It is up to your organization/team to determine what systems/applications/services need a DR plan and how sophisticated it should be.

Building a DR plan for a container platform

Building DR for a container platform such as Docker Enterprise requires us to look at it from several angles. At the foundation of a container platform lies a pool of resources such as CPU, RAM, disk, network, etc. (i.e. hardware) that is available for the platform to utilize. At the next level there is the platform itself, which operates and maintains its state and schedules containers to run. Then there are the containers that host your applications. Hardware-level DR techniques are outside the scope of this article; we’ll focus on the last two levels: the container platform and the applications.

When designing a DR plan, keep in mind that it is meant to be executed as fast as possible to bring your application(s) back online. In other words, keep it as simple as possible (i.e. follow the K.I.S.S. principle) and automate as many steps as you can. When possible, automate the entire DR process.

Useful tooling to have a nimble DR process

Well-designed container platforms often rely on CI/CD tools to compile code, build and sign container images, run tests, and deploy an app/service to a target environment. In a similar manner, a well-designed DR plan can employ parts of the CI/CD pipeline to restore a disrupted app/service. Going forward, we’ll treat CI/CD as one of the key tools to automate, and therefore speed up, recovery of a disrupted service.

DR approaches for a container platform

Most approaches discussed in this article focus on how to recover disrupted services. However, it’s important to understand the distinction between a platform DR plan and an app/service DR plan, and when each could be a hard requirement.

An app/service typically has less drag than a container platform when it comes to recovery time. Apps are smaller and often consist of loosely coupled components, and restoring an app to normal operation carries minimal risk of impacting other workloads running on the platform.

The platform also consists of multiple components, but it is much more complex. A failure/disruption in a platform component can have a much broader and higher impact than one in an application component. An add-on feature, like an ingress layer, is often considered part of the platform, as it provides ingress access as a feature of the container platform. In a multi-cluster setup it is not unusual to see a smaller cluster built for the sole purpose of providing ingress to various types of apps/services running on dedicated nodes or even other clusters. For instance, one may have an NGINX or Traefik ingress controller routing traffic to Linux-based apps running in Kubernetes and the Interlock ingress component routing traffic to Windows-based apps running in Swarm. In such an example, having a DR plan to recover a disrupted ingress cluster should be a requirement.

Restore platform operation

When a container platform shows signs of failure and the cause is not immediately known, the priority becomes restoring its operation to a known good state. There are at least a few ways to restore the operation of a container platform:

  • restore platform components from a backup
  • restore underlying virtualized hardware (i.e. VMs) from a snapshot
  • switch over to a known good cluster (a.k.a. failover)

Which option to use depends on what suits your organization.

Platform backup/restore

Restoring the platform components from a backup can take some time. Once the components are restored, you may still need to verify that all your services are up to date, since a backup contains a previously captured state which is likely to be out of date. The Docker Enterprise platform has three main components that may need to be restored: Swarm, MKE, and MSR.

  • Swarm orchestrates operation of all workloads running on the platform. It knows and maintains the state of all members, services, networks, configs, and secrets. Restoring it from a backup will instruct the orchestrator to schedule all the services that were captured at the time of the backup.

    Note

    A Swarm manager backup must be restored on a node with the same IP address where the backup was initiated.

  • The MKE backup captures the state of the control plane configuration, access control, MKE certificates, organizations, volumes, and metrics data.

  • The MSR backup captures the state of the registry configuration, repository metadata, access control to repositories and images, notary data, scan results, and MSR certificates.
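
As an illustration of the Swarm backup described above, here is a minimal sketch for a manager node, assuming a systemd-based host and that briefly stopping the engine on this (ideally non-leader) manager is acceptable:

    # Stop the engine so the Raft data is not being written to.
    systemctl stop docker
    # Archive the Swarm state (certificates, Raft log, keys).
    tar -czf /tmp/swarm-backup-$(date +%F).tar.gz -C /var/lib/docker swarm
    # Bring the engine back; the node rejoins the manager quorum.
    systemctl start docker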

You can see that each component maintains multiple different states. Unless your backup/restore process is mostly automated, it can take some time to complete. Depending on how critical the affected environment is, this option could be unacceptable.

For more information on backup and restore, refer to the Backup and Restore Best Practices success article.

MSR disaster recovery options

MSR can be repaired in a few ways depending on the issue.

  • If a single replica is unhealthy, one can replace the unhealthy replica with a new replica.
  • If the majority of MSR replicas are unhealthy, one can remove all but a single healthy replica and then rebuild the other replicas from the healthy one (also known as an emergency repair).
  • In a worst-case scenario, when all MSR replicas are unhealthy, one can restore MSR from a backup.
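
As a rough sketch of the emergency repair option above, based on the classic docker/dtr tooling (the image name, tag, and flags are assumptions; verify them against the documentation for your MSR release):

    # Rebuild a healthy quorum from the single remaining healthy replica.
    docker run -it --rm mirantis/dtr:2.8.2 emergency-repair \
      --ucp-url https://mke.example.com \
      --ucp-insecure-tls \
      --existing-replica-id <healthy-replica-id>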

For more details on how to do disaster recovery for MSR, refer to MSR disaster recovery overview.

Restore VMs from a snapshot

Systems that provide virtualized hardware may also offer features for taking snapshots/backups of virtual machines. While Docker does not provide direct instructions on how to use VM snapshotting with a particular virtualization product, we have customers who have successfully leveraged VM snapshots to build a cluster backup/restore plan based on this capability.
Depending on the size of the cluster you need to recover, this option could be easier to automate and faster to execute than a full backup/restore of Docker Enterprise.

Failover to another cluster

When the business requirement is to recover from a failure within seconds, the fastest way could be to fail over to another cluster. In a multi-cluster setup there could be a dedicated DR cluster, or multiple clusters running workloads in active/active or active/passive mode. There are many details that need to be sorted out in a multi-cluster configuration. Here are a few common items that you should work through:

  • how to keep cluster component versions in sync
  • what deployment mode to use (i.e. active/active, active/passive)
  • how to keep resources in sync (i.e. configs, secrets, networks, volumes)
  • how to ensure data availability between clusters

The first challenge is to build a maintenance plan to make sure that your clusters do not drift too far apart in terms of their component versions, such as the engine, MKE, and MSR. Ideally, when you schedule OS patching/updates, you should look into updating the platform components too. When doing so, make sure you validate that the targeted OS version is supported by the platform. The best way is to consult our compatibility matrix.

The next challenge is to design a plan for an entire-cluster failover scenario. While active/active and active/passive deployment modes require you to consider the needs of both the platform and the application, an entire-cluster failover requires more planning around platform operation. The challenge is to make sure that when the main cluster fails over to a DR cluster, the latter has all the necessary objects (i.e. configs, secrets, networks, volumes) to support the operation of your applications.
Perhaps the easiest way to ensure that your DR cluster has all the necessary objects is to create them in the DR cluster at the same time you create them in your main cluster(s). There are many configuration management tools that can help standardize and automate this process. Keeping resources in sync between the clusters allows you to restore your workloads via a CD job that re-deploys them into the DR cluster.

Other items are discussed later in this article as they require some app related considerations.

Restore application/service operation

Within a cluster, the orchestrator (i.e. Swarm or Kubernetes) is capable of determining the application health status and repairing (i.e. re-scheduling/re-creating) the application container if the status is deemed unhealthy. While that’s a useful feature to take advantage of, it’s your responsibility to provide a healthcheck option for your application.
Note that each application is different and is likely to require a specialized healthcheck test/endpoint for the orchestrator to use effectively.
Refer to docs.docker.com for more details on how to create a healthcheck in a Dockerfile or a Compose file. Remember that a Dockerfile healthcheck is built into the image, while a Compose-level healthcheck lets you set up a healthcheck that is not defined at the image level or override the one that is built into the image.
Kubernetes offers liveness and readiness probes to configure app/service health and readiness checks.
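
As a minimal sketch, a Compose-level healthcheck might look like the following (the image name and /health endpoint are illustrative, and the test assumes curl is available inside the image):

    version: "3.8"
    services:
      web:
        image: example/web:1.0          # illustrative image name
        healthcheck:
          test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
          interval: 30s                 # probe every 30 seconds
          timeout: 5s                   # fail the probe after 5 seconds
          retries: 3                    # mark unhealthy after 3 failures
          start_period: 15s             # grace period at container start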

There are various ways applications can be deployed onto the container platform depending on their design. Older/legacy apps may be stateful and support only a single replica. A better design supports multiple replicas, which could be stateful or stateless. An even better app architecture allows you to deploy the app in HA mode across multiple clusters/regions. A typical customer application portfolio contains many different app designs and therefore may require different DR approaches. It is up to the business and your team to determine which apps are critical and require a DR plan and which don’t.

Fortunately, a well-designed CI/CD pipeline for your apps can help greatly with DR plan execution. Once you design CD tasks to automate the deployment of each app into a cluster, it becomes a matter of supplying the correct parameters and pointing the deployment at the correct target cluster.
For instance, you may configure a load-balancer-level health check for your app and rely on it to determine whether the app cannot be repaired by the orchestrator. If the app fails to report a healthy status after X health check queries, your monitoring and alerting system may call a webhook in your CD pipeline, triggering steps to remove the app from one cluster and deploy it into another.
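
As a hypothetical sketch of that last step (the URL, token, and pipeline name are placeholders, not a real API):

    # Trigger the failover pipeline after repeated health check failures.
    curl -X POST "https://cd.example.com/api/pipelines/app-failover/trigger" \
      -H "Authorization: Bearer ${CD_TOKEN}" \
      -d '{"app": "web", "from_cluster": "primary", "to_cluster": "dr"}'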

Issues to consider when building a DR plan

There are a number of challenges that may come into play when designing DR plans for the container platform and containerized applications. The topics discussed in this section touch upon some of the most common issues that can surface when designing a DR plan.

It’s worth noting that when app/service deployment is discussed in this article, it refers to the deployment of an entire Swarm stack or Kubernetes Deployment object. Both terms refer to a declarative desired-state configuration for the app/service that defines or references the dependencies necessary for the app to run.

Cluster Configuration

Each cluster maintains its own configuration. When designing a DR plan for a container platform and a containerized app, it’s important, as a prerequisite, to have a mechanism to sync up cluster configurations such as access control/RBAC, configs, secrets, networks, volumes, labels, collections, and namespaces.

The easiest way to keep configuration in sync is to use either the CLI or the web API to execute the same commands against all clusters. The entire access control configuration can be scripted and deployed onto a cluster in one command.
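
For example, a minimal sketch that creates the same secret in every cluster, assuming kubectl contexts named primary and dr:

    # Create the same secret in all clusters so the DR cluster
    # already holds the app's dependencies.
    for ctx in primary dr; do
      kubectl --context "$ctx" create secret generic app-db-credentials \
        --from-literal=username=app \
        --from-file=password=./db-password.txt
    done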

Data Storage

Every app/service works with data. An app can read and process data, pass data through, write data, or all of the above. When moving an app from one cluster to another, it becomes apparent that its data needs to be made accessible from that cluster too. Depending on the data storage solution you use, this can be nontrivial to achieve. The underlying storage solution may have its own constraints on how it handles data distribution and failover: it may allow only a single instance to be writable while the rest are switched into read-only mode, or the data failover may require a manual step to expose the data in another cluster.

In Swarm, it’s necessary to make sure that any storage plugins and volumes are synced up between all clusters.
In Kubernetes, it’s important to sync up any non-default StorageClass objects and PersistentVolume objects.
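
A rough sketch of keeping those Kubernetes objects in sync, again assuming primary and dr contexts (exported objects may contain cluster-specific fields, such as volume handles, that need review before applying):

    # Export StorageClass and PersistentVolume definitions from the main
    # cluster and apply them to the DR cluster.
    kubectl --context primary get storageclass,pv -o yaml > storage-objects.yaml
    kubectl --context dr apply -f storage-objects.yaml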

Traffic Routing

Transferring an application to another cluster usually requires a routing change at the load balancer level. Some load balancers allow such changes to be scripted and automated; others may require manual intervention. Either way, the DR plan should account for executing the switch when needed.
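
As a hypothetical illustration (real load balancer APIs differ per vendor; the endpoint and payload below are placeholders):

    # Repoint the application's pool at the DR cluster's ingress.
    curl -X PATCH "https://lb.example.com/api/pools/app-pool" \
      -H "Authorization: Bearer ${LB_TOKEN}" \
      -d '{"members": ["dr-ingress.example.com:443"]}'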

Stack/Deployment vs. Cluster DR

While both the stack/deployment and the cluster DR approaches require synchronization of the necessary configuration among all clusters, the ways to manage it can differ.

When deploying a stack/deployment, it is possible to include the configuration of necessary resources in the deployment task itself, so you don’t need to pre-create those app dependencies in advance. Resources such as networks, volumes, configs, and secrets can be created during the application deployment. While this is possible, it may not suit your organization’s policies or operations team; the best approach for your organization should be determined as part of the DR plan design process.
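
As a minimal sketch of that approach, a Swarm stack file can declare its own network and secret so they are created at deploy time (names are illustrative):

    version: "3.8"
    services:
      web:
        image: example/web:1.0
        networks:
          - app-net
        secrets:
          - db-password
    networks:
      app-net:
        driver: overlay           # created at deploy time if missing
    secrets:
      db-password:
        file: ./db-password.txt   # created at deploy time from a local file

Deploying it with "docker stack deploy -c stack.yml app" creates the network and secret along with the service.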

The best way to ensure cluster configurations are in sync is to establish a process in which each cluster configuration command is executed against all clusters. Ideally, each configuration should be scripted and checked into a version control system (e.g. Git). That way, you can automate the application of the entire cluster configuration and roll it out in one or a few commands.

Active/Active vs. Active/Passive Considerations

It is equally important to evaluate the capabilities of the application, platform, and any explicit or implicit dependencies in order to establish a feasible path for your DR plan.

The App View

An application that supports high availability is typically deployed with multiple replicas. However, that doesn’t necessarily mean that any HA app can run in active/active mode across multiple clusters. For instance, if the underlying storage solution allows write operations in only a single location (i.e. one cluster), deploying the app across multiple clusters may not be feasible, or it may be limited to running the app in read-only mode in the other clusters. You should evaluate whether it’s possible to switch the storage location in case the primary app instance becomes unavailable; in that case, functionality to promote a read-only instance to the new primary would be required.

In an active/passive configuration, an app can be pre-deployed into another cluster with zero replicas configured. In this case, all resources for the app deployment are created and the app definition is deployed, but no active instance is running. Depending on how complex your deployment is, this can shave a few seconds off the total time needed to get a running instance of the app in a different cluster.
On the other hand, if a few seconds do not make a significant difference for your business, it could be easier to trigger a CD task to deploy the app into another cluster and keep the pipeline as simple as possible.
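
A minimal sketch of the zero-replica approach in Kubernetes, assuming a dr context and a Deployment named web:

    # Pre-deploy the app into the passive cluster with no running instances.
    kubectl --context dr apply -f web-deployment.yaml
    kubectl --context dr scale deployment/web --replicas=0
    # During failover, simply scale it up.
    kubectl --context dr scale deployment/web --replicas=3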

The Cluster View

With multiple clusters and no dedicated DR/failover cluster, an active/active approach can be used, meaning an application can be deployed into any cluster. Such a setup requires a well-designed CI/CD pipeline that helps ensure the cluster configurations are in sync and can quickly deploy the app into the target cluster.

In an active/passive cluster configuration, there is an active cluster that runs all workloads and an idle, passive cluster waiting to accept workloads. The passive cluster can be configured as either hot or cold: hot meaning it is in standby mode ready to schedule workloads, cold meaning it is configured but typically needs more time before it is ready to schedule workloads. Each option has its pros and cons (e.g. cost, keeping configuration in sync, etc.), which should be evaluated in the context of your business to decide what suits your organization best.

One thing to consider for both types of cluster configuration is what happens in a worst-case scenario. If an entire cluster goes down and all workloads need to be deployed into another cluster, the amount of resources available in the other clusters has to be taken into account. If the other clusters are not sized to take over the entire load from the failed cluster, they can also fail under the additional load.

One way to mitigate or even prevent exhaustion of cluster resources is to ensure all your stacks/deployments set resource reservations and limits. This prevents the scheduler from overpopulating your cluster. The combination of resource reservations, limits, and priorities (in Kubernetes) allows you to build a recovery plan that makes sure your business-critical apps always have room in the cluster.
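
For example, a stack file’s deploy section can reserve and cap resources like this (values are illustrative):

    services:
      web:
        image: example/web:1.0
        deploy:
          resources:
            reservations:       # guaranteed to the task at scheduling time
              cpus: "0.25"
              memory: 128M
            limits:             # hard cap the task cannot exceed
              cpus: "0.50"
              memory: 256M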

It is common practice to configure monitoring and alerting tools to gain a better view of cluster resource utilization, and to leverage alerts to give your teams a heads-up when more resources are needed.

Image Signatures in a DR Scenario

When using the Docker Content Trust “Run only signed images” feature, it’s necessary to understand and manage the signing metadata. The DCT metadata is stored in the ${HOME}/.docker/trust directory on the machine that runs the DCT commands (i.e. signs container images). When the DCT “Run only signed images” feature is enabled in your DE cluster, MKE will not deploy an image that does not meet the configured criteria. Each cluster maintains its own set of account objects, so a signature added by a user from one cluster would not be honored in another cluster. You need to make sure that your CI process uses the user key from the same cluster as the MSR it pushes images to.

One way to simplify signature management is to use the same client bundle for the image-signing user across all clusters. MKE allows uploading an existing client bundle into a user’s profile. In this case, as long as images are signed by the user with that client bundle (typically a CI user signs images), they will be admitted by all clusters that have that client bundle.

For backup/restore purposes, or when a containerized CI pipeline is used, it is necessary to store the DCT metadata in persistent storage (e.g. a container volume) and the sensitive pieces in a secure location (e.g. a vault).
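
A rough sketch of a containerized CI signing step, persisting the trust metadata in a named volume (the image tag, registry address, and passphrase handling are illustrative):

    # Push (and sign) from a throwaway CLI container; the trust metadata
    # lives in the "dct-trust" volume so it survives between CI jobs.
    docker run --rm \
      -e DOCKER_CONTENT_TRUST=1 \
      -e DOCKER_CONTENT_TRUST_REPOSITORY_PASSPHRASE="${SIGNING_PASSPHRASE}" \
      -v dct-trust:/root/.docker/trust \
      -v /var/run/docker.sock:/var/run/docker.sock \
      docker:cli docker push msr.example.com/ci/app:1.0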

Summary

A disaster recovery plan is a free-form approach that may employ many different ideas and techniques to restore business continuity of a critical system/application/service/etc. Not every organization needs to design disaster recovery plans for its container platform or applications, but every organization should evaluate the need to do so. Depending on the SLAs for the platform and applications, a strategy to recover from a failure could be a business requirement.

There is no one-plan-fits-all solution when it comes to recovering the platform or services from a failure. All explicit and implicit dependencies should be examined and considered in order to build adequate DR plans.