Many organizations have business-critical processes they rely on to do business. When a critical process is disrupted, the alarms go off and an emergency process is initiated to remediate the issue and restore business continuity. This emergency process is known as a disaster recovery (DR) process/plan/roadmap/runbook/etc. The complexity and sophistication of the plan can vary greatly depending on the system it is designed for. It is considered standard practice to design a DR plan following the K.I.S.S. principle (Keep it simple, stupid). In other words, the plan should be easy to follow so that it can be executed without requiring an expert who could be unavailable at that time.
Because containers are expected to be ephemeral, and hence well suited to microservice-oriented architectures, there are multiple questions to consider when designing an adequate disaster recovery plan for a containerized application or service. This reference architecture aims to provoke thought around various scenarios that can disrupt the operation of an application/service, or even the entire platform, and provides some examples of topics to consider when building a disaster recovery plan. Some of the topics we’ll discuss are:
This list can go on and on, from the top of the application stack all the way down to the hardware that crunches zeros and ones. However, the main objective is to provide a few examples in order to encourage thinking about a DR plan for a containerized application from different angles.
Throughout the article, references to application, app, or service are interchangeable.
Abbreviation | Description |
---|---|
MKE | Mirantis Kubernetes Engine |
MSR | Mirantis Secure Registry |
DCT | Docker Content Trust |
DE | Docker Enterprise |
RBAC | Role Based Access Controls |
CI | Continuous Integration |
CD | Continuous Deployment |
HA | High Availability |
DR | Disaster Recovery |
Disaster Recovery is an umbrella plan that encompasses ideas, methods and techniques to minimize time to restore a disrupted system/application/service/etc. Depending on the complexity of the system/application/service the DR plan can span from a small list of instructions to a bundle of documentation, checklists, scripts, runbooks, etc.
The main goal for a DR plan is to restore business continuity as fast as possible. It can mean different things depending on the environment and the part of business that is affected. Production systems/applications have high visibility and typically include a DR plan to restore their operation. While lower level environments (e.g. Staging/Test/Dev/etc.) could be assumed to be less critical, they may still be very important to ensure operation of the business.
Unless your Ops team is ready to own full automation of every change (i.e. app release, platform config change, etc.) applied to any environment, your change flow would likely look somewhat similar to this path:
developer workstation -> Dev env -> Test env -> Integration env -> Staging env -> Production env
In this example applying a change would mean moving it through different lower level environments all the way to production. The disruption at any step in the path can slow down deployment and therefore delay the recovery of normal business operation.
In this example having a DR plan to restore a lower level environment would help to speed up resolution of the issue. It can be as simple as restoring the environment to a last known good state. It could be a plain restore from a backup.
It is up to your organization/team to determine what systems/applications/services need a DR plan and how sophisticated it should be.
Building DR for a container platform such as Docker Enterprise requires us to look at it from several angles. At the foundation of a container platform lies a pool of resources such as CPU, RAM, Disk, Network, etc. (i.e. hardware) that are available for the platform to utilize. At the next level there is the platform itself, which operates/maintains its state and schedules containers to run. Then there are the containers that host your applications. Hardware (i.e. CPU, RAM, Disk, Network, etc.) DR techniques are outside the scope of this article. We’ll focus on the last two levels: the container platform and the applications.
When designing a DR plan it’s important to keep in mind that it is meant to be executed as fast as possible to bring your application(s) back online. In other words, keep it as simple as possible (i.e. follow the K.I.S.S. principle). Automate as many steps as possible. When possible, automate the entire DR process.
Well designed container platforms often rely on CI/CD tools to compile the code, build and sign container images, run tests and deploy app/service to a target environment. In a similar manner a well designed DR plan could employ parts of CI/CD pipeline to restore disrupted app/service. Going forward we’ll use CI/CD references as one of the key tools to automate, and therefore speed up, disrupted service recovery.
Most approaches discussed in this article focus on how to recover disrupted services. However, it’s important to understand the distinction between a platform DR plan, including when it could be a hard requirement, and an app/service DR plan.
An app/service typically takes less time to recover than a container platform. Apps are smaller and often have components that are loosely coupled. Restoring an app to normal operation does not affect other workloads running on the platform, so the risk of impacting other workloads is minimal.
The platform also consists of multiple components, but it is much more complex. A failure/disruption in a platform component can have a much broader and higher impact than one in an application component. An add-on feature, like an ingress layer, is often considered to be a part of the platform, as it provides ingress access as a feature of the container platform. In a multi-cluster setup it is not unusual to see a smaller cluster built for the sole purpose of providing ingress to various types of apps/services running on dedicated nodes or even other clusters. For instance, one may have an Nginx or Traefik ingress controller routing traffic to Linux-based apps running in Kubernetes and the Interlock ingress component routing traffic to Windows-based apps running in Swarm. In such a case, having a DR plan to recover a disrupted cluster built to provide ingress should be a requirement.
When a container platform shows signs of a failure and it’s not immediately known why, the priority becomes to restore its operation to a known good state. There are at least a few ways to restore operation of a container platform:
Which option to use depends on what suits your organization.
Restoring the platform components from a backup could take some time. Once the components are restored, you may still need to verify that all your services are up to date, since a backup contains a previously captured state which is likely to be out of date. The Docker Enterprise platform has 3 main components that may need to be restored: Swarm, MKE, and MSR.
Swarm orchestrates operation of all workloads running on the platform. It knows and maintains the state of all members, services, networks, configs, and secrets. Restoring it from a backup will instruct the orchestrator to schedule all the services that were captured at the time of the backup.
Note
A Swarm manager backup must be restored on a node with the same IP address where the backup was initiated.
The MKE backup captures the state of the control plane configuration, access control, MKE certificates, organizations, volumes, and metrics data.
The MSR backup captures the state of the registry configuration, repository metadata, access control to repositories and images, notary data, scan results, and MSR certificates.
You can see that each component maintains multiple different states. Unless your backup/restore process is mostly automated, it can take some time to complete. Depending on how critical the affected environment is, this option could be unacceptable.
For more information on the backup and restore topic, refer to the Backup and Restore Best Practices success article.
MSR can be repaired in a few ways depending on the issue.
For more details on how to do disaster recovery for MSR, refer to MSR disaster recovery overview.
When the business requirement is to recover from a failure within seconds, the fastest way could be to fail over to another cluster. In a multi-cluster setup there could be a dedicated DR cluster or multiple clusters running workloads in active/active or active/passive modes. There are many details that need to be sorted out in a multi-cluster configuration. Here are a few common items that you should consider.
The first challenge is to build a maintenance plan to make sure that your clusters do not drift too far apart in terms of the versions of their components, such as the engine, MKE, and MSR. Ideally, when you schedule OS patching/updates, you should look into updating the platform components too. When doing so, make sure you validate that the targeted OS version is supported by the platform. The best way is to consult our compatibility matrix.
Other items are discussed later in this article, as they require some app-related considerations.
The orchestrator (Swarm or Kubernetes) is capable of determining the application health status and repairing (i.e. re-scheduling/re-creating) the application container if the status is deemed to be unhealthy. While that’s a useful feature to take advantage of, it’s your responsibility to provide a healthcheck option for your application.
There are various ways applications can be deployed into the container platform depending on their design. Older/legacy apps may be stateful and support a single replica only. A better design may support multiple replicas that could be stateful or stateless. An even better app architecture allows you to deploy the app in HA mode across multiple clusters/regions. A typical customer application portfolio contains many different app designs and therefore may require different DR approaches. It is up to the business and your team to determine which apps are critical and require a DR plan and which don’t.
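To illustrate the healthcheck option mentioned above, here is a minimal sketch of an HTTP health endpoint using only the Python standard library. The `/healthz` path, port 8080, and the `check_health` function are assumptions made for this example and are not prescribed by the platform; a real application would replace the placeholder with checks of its own dependencies.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple


def check_health() -> Tuple[int, Dict[str, str]]:
    """Return an HTTP status code and a body describing app health.

    Hypothetical placeholder: a real app would verify its own
    dependencies (database connection, queue, cache, ...) here.
    """
    dependencies_ok = True  # assumption: replace with real checks
    if dependencies_ok:
        return 200, {"status": "healthy"}
    return 503, {"status": "unhealthy"}


class HealthHandler(BaseHTTPRequestHandler):
    """Serve the health status on a single, hypothetical path."""

    def do_GET(self) -> None:
        if self.path != "/healthz":  # hypothetical endpoint path
            self.send_error(404)
            return
        code, body = check_health()
        payload = json.dumps(body).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    # Port 8080 is an arbitrary choice for this sketch.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

The orchestrator would then be pointed at this endpoint, for example through a Dockerfile `HEALTHCHECK` instruction or a Kubernetes liveness/readiness probe, so that a non-200 response marks the container as unhealthy.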
There are a number of challenges that may come into play when designing DR plans for the container platform and containerized applications. The topics discussed in this section touch upon some of the most common issues that can surface when designing a DR plan.
It’s worth noting that when app/service deployment is discussed in this article, it refers to deployment of the entire Swarm stack or Kubernetes deployment object. Both terms refer to a declarative desired-state configuration for the app/service that defines or references the dependencies the app needs to execute.
Each cluster maintains its own configuration. When designing a DR plan for a container platform and a containerized app, it’s important, as a prerequisite, to have a mechanism to sync up cluster configurations such as access control/RBAC, configs, secrets, networks, volumes, labels, collections, and namespaces.
The easiest way to keep configuration in sync is to use either CLI or web API to execute the same commands against all clusters. The entire access control configuration can be scripted and deployed onto a cluster in one command.
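As a rough sketch of executing the same commands against all clusters, the snippet below loops a list of Docker CLI commands over a list of clusters and reports which ones failed. It assumes each cluster is reachable through a named Docker CLI context (selected with the CLI's global `--context` option); the context names and the example command are hypothetical, and your environment may use client bundles or the web API instead.

```python
import subprocess
from typing import Callable, Dict, List, Sequence

# A runner takes a cluster context name and CLI arguments, returns an exit code.
Runner = Callable[[str, Sequence[str]], int]


def docker_runner(context: str, args: Sequence[str]) -> int:
    """Run one docker CLI command against a named Docker context."""
    return subprocess.run(["docker", "--context", context, *args]).returncode


def apply_to_all(
    contexts: Sequence[str],
    commands: Sequence[Sequence[str]],
    run: Runner = docker_runner,
) -> Dict[str, List[Sequence[str]]]:
    """Run every command against every cluster; return failed commands per cluster."""
    failures: Dict[str, List[Sequence[str]]] = {}
    for context in contexts:
        for args in commands:
            if run(context, args) != 0:
                failures.setdefault(context, []).append(args)
    return failures


if __name__ == "__main__":
    # Hypothetical context names and network name, for illustration only.
    clusters = ["cluster-east", "cluster-west"]
    commands = [["network", "create", "--driver", "overlay", "app-net"]]
    print(apply_to_all(clusters, commands))
```

Passing the runner as a parameter keeps the loop testable without a live cluster and makes it easy to swap the CLI for API calls later.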
Every app/service works with data. The app can read and process data, pass data through, write data, or do all of the above. When moving an app from one cluster to another, it becomes apparent that its data needs to be made accessible from that cluster too. Depending on the data storage solution you use, this can be non-trivial to achieve. The underlying storage solution may have its own constraints on how it handles data distribution and failover. It could allow only a single instance to be writable, with the rest switched into read-only mode. The data failover may require a manual step to expose the data in another cluster.
Make sure that plugins and volumes are synced up between all clusters. The same applies to StorageClass objects and PersistentVolume objects.
Transferring an application into another cluster usually requires a routing change at the load balancer level. Some load balancers allow such changes to be scripted and automated; others may require manual intervention to make the change. Either way, the DR plan should account for executing the switch when needed.
While both approaches, stack/deployment DR and cluster DR, require synchronization of the necessary configuration among all clusters, the ways to manage it can differ.
When deploying a stack/deployment, it is possible to include configuration of necessary resources into the deployment task. In that case, you don’t need to pre-create those app dependencies in advance. Resources such as networks, volumes, configs, and secrets can be created during the application deployment. While it is possible, it may not be suitable for your organization’s policies and operations team. It should be discussed as a part of the DR plan design process to determine the best approach for your organization.
The best way to ensure cluster configurations are in sync is to establish a process in which each cluster configuration command would be executed against all clusters. Ideally, each configuration should be scripted and checked into a version control system (e.g. Git). In this case, you can automate the application of the entire cluster configuration and roll it out in one or a few commands.
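One way to verify that clusters stay in sync with the version-controlled configuration is to compare the resource names you expect with what each cluster reports. The sketch below assumes you can already export both sides as simple inventories (resource kind mapped to a set of names); how those inventories are collected is left out, and the `find_drift` helper and the sample names are hypothetical.

```python
from typing import Dict, Set

# Resource kind (e.g. "networks", "secrets") mapped to resource names.
Inventory = Dict[str, Set[str]]


def find_drift(desired: Inventory, actual: Inventory) -> Dict[str, Dict[str, Set[str]]]:
    """Report resources missing from, or unexpectedly present in, a cluster.

    Only names are compared; the contents of a config or secret are not.
    """
    drift: Dict[str, Dict[str, Set[str]]] = {}
    for kind in sorted(set(desired) | set(actual)):
        want = desired.get(kind, set())
        have = actual.get(kind, set())
        missing, extra = want - have, have - want
        if missing or extra:
            drift[kind] = {"missing": missing, "extra": extra}
    return drift


if __name__ == "__main__":
    # Hypothetical inventories, for illustration only.
    desired = {"networks": {"app-net"}, "secrets": {"db-pass"}}
    actual = {"networks": {"app-net", "old-net"}, "secrets": set()}
    print(find_drift(desired, actual))
```

Running such a comparison per cluster on a schedule, or as a CI/CD pipeline step, would surface drift before a failover depends on the missing resource.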
It is equally important to evaluate the capabilities of the application, platform, and any explicit or implicit dependencies in order to establish a feasible path for your DR plan.
An application that supports high availability is typically configured to be deployed with multiple replicas. However, that doesn’t necessarily mean that any HA app can be implemented in active/active mode across multiple clusters. For instance, if the underlying storage solution allows write operations in a single location (i.e., one cluster), the deployment of the app across multiple clusters may not be feasible. If it is possible, it could be limited to the app being deployed in read-only mode in subsequent clusters. You should evaluate whether it’s possible to switch the storage location in case the primary app instance becomes unavailable. In this case, the functionality to make a read-only instance the new primary would be required.
With multiple clusters where there is no dedicated DR/failover cluster, an active/active approach can be used. Active/active means that an application can be deployed into either cluster. Such setup requires a well designed CI/CD pipeline that could help to ensure the cluster configurations are in sync and quickly deploy the app into the target cluster.
In an active/passive cluster configuration, there is an active cluster that runs all workloads, and there is an idle, passive cluster waiting to accept workloads. The passive cluster can be configured as either hot or cold. Hot means it’s in standby mode waiting to schedule workloads; cold means it’s configured but typically needs more time to be ready to schedule workloads. Each option has its pros and cons (e.g., cost, keeping configuration in sync, etc.) which should be evaluated in the context of your business to decide what suits your organization best.
One thing to consider for both types of cluster configuration is to understand what happens in a worst-case scenario. If an entire cluster goes down and all workloads need to be deployed into another cluster, the amount of resources available in the other clusters has to be taken into consideration. If the other clusters are not sized to be able to take over the entire load from the failed cluster, it can also fail under the additional load.
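The worst-case check described above can be sketched as simple arithmetic: compare what the failed cluster had reserved with the free capacity of the remaining clusters. The `Cluster` fields and the sample figures are assumptions for illustration; real numbers would come from your monitoring tools.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cluster:
    name: str
    cpu_capacity: float      # total schedulable CPU cores
    cpu_reserved: float      # cores already reserved by workloads
    mem_capacity_gib: float  # total schedulable memory
    mem_reserved_gib: float  # memory already reserved by workloads


def can_absorb(failed: Cluster, survivors: List[Cluster]) -> bool:
    """True if the survivors' combined free CPU and memory cover the
    failed cluster's reservations.

    This is an aggregate check only: it ignores per-node fragmentation,
    placement constraints, and storage or network limits.
    """
    free_cpu = sum(c.cpu_capacity - c.cpu_reserved for c in survivors)
    free_mem = sum(c.mem_capacity_gib - c.mem_reserved_gib for c in survivors)
    return free_cpu >= failed.cpu_reserved and free_mem >= failed.mem_reserved_gib


if __name__ == "__main__":
    # Hypothetical clusters, for illustration only.
    east = Cluster("east", 64, 40, 256, 128)
    west = Cluster("west", 64, 20, 256, 100)
    print(can_absorb(east, [west]))
```

A passing aggregate check is necessary but not sufficient, so a periodic failover exercise remains the more reliable way to confirm the sizing.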
One way to mitigate or even prevent exhaustion of the cluster resources is to ensure all your stacks/deployments set resource reservations and limits. This helps to prevent the scheduler from overpopulating your cluster. The combination of resource reservations, limits, and priorities (in Kubernetes) allows you to build a recovery plan that will make sure your business-critical apps always have room in the cluster.
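To reason about which apps would still fit after a failover, a recovery plan can order apps by priority and admit them until the free capacity is used up. The following is a simplified greedy sketch of that idea, not a model of how the Swarm or Kubernetes scheduler actually works; the `App` fields, priority values, and app names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class App:
    name: str
    priority: int            # higher value = more business critical
    cpu_reservation: float   # CPU cores the app reserves


def plan_placement(apps: Sequence[App], free_cpu: float) -> Tuple[List[str], List[str]]:
    """Admit apps in descending priority order while reservations fit.

    Returns (placed, deferred) app names. Only CPU is considered here.
    """
    placed: List[str] = []
    deferred: List[str] = []
    for app in sorted(apps, key=lambda a: a.priority, reverse=True):
        if app.cpu_reservation <= free_cpu:
            placed.append(app.name)
            free_cpu -= app.cpu_reservation
        else:
            deferred.append(app.name)
    return placed, deferred


if __name__ == "__main__":
    # Hypothetical apps, for illustration only.
    apps = [App("batch", 1, 4), App("checkout", 100, 6), App("reports", 10, 3)]
    print(plan_placement(apps, free_cpu=10))
```

The deferred list gives the team an explicit, pre-agreed answer to which workloads wait when the surviving cluster is short on resources.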
It is a common practice to configure monitoring and alerting tools to have a better view of the cluster resource utilization. You can leverage alerting tools to give your teams a heads-up when more resources are needed.
When using the Docker Content Trust “Run only signed images” feature, it’s necessary to understand and manage the metadata. The DCT metadata is stored in the ${HOME}/.docker/trust directory on the machine that uses the DCT commands (i.e. signs container images). When the DCT “Run only signed images” feature is enabled in your DE cluster, MKE will not deploy an image that does not meet the configured criteria. Each cluster maintains its own set of account objects, and as such the signature added by a user from one cluster would not be honored in another cluster. You need to make sure that your CI process uses the user key from the same cluster as the MSR it pushes the image to.
One way to simplify signature management is to use the same client bundle for the user that signs the images across all clusters. MKE allows uploading an existing client bundle into a user’s profile. In this case, as long as images are signed by the user using the same client bundle (typically a CI user signs images), the images would be admitted by all clusters that have that client bundle.
For backup/restore reasons or when a containerized CI pipeline is used, it is necessary to store DCT metadata in persistent storage (e.g. container volume) and sensitive pieces in a secure location (e.g. vault).
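As a minimal sketch of capturing the DCT metadata for backup, the function below archives the trust directory into a gzip-compressed tar file. The default source is the `${HOME}/.docker/trust` location mentioned earlier; the destination path and the `archive_trust_dir` name are assumptions for this example. Because the archive can include signing keys stored under that directory, it should be treated as sensitive and moved to a secure location.

```python
import tarfile
from pathlib import Path
from typing import Optional


def archive_trust_dir(destination: Path, trust_dir: Optional[Path] = None) -> Path:
    """Write a gzip-compressed tar archive of the DCT metadata directory.

    The archive stores the directory under the top-level name "trust".
    """
    source = trust_dir or Path.home() / ".docker" / "trust"
    if not source.is_dir():
        raise FileNotFoundError(f"DCT metadata directory not found: {source}")
    destination.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(destination), "w:gz") as archive:
        archive.add(str(source), arcname="trust")
    return destination


if __name__ == "__main__":
    # Hypothetical destination on a mounted volume, for illustration only.
    print(archive_trust_dir(Path("/backup/dct/trust.tar.gz")))
```

In a containerized CI pipeline, the destination would typically sit on a mounted volume so the archive outlives the container that created it.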
A disaster recovery plan is a free-form approach that may employ many different ideas and techniques to restore business continuity of a critical system/application/service/etc. Not every organization needs to design disaster recovery plans for its container platform or its applications. However, every organization should evaluate whether DR plans are necessary. Depending on the SLAs for the platform and applications, a strategy to recover from a failure could be a business requirement.
There is no one-plan-fits-all solution when it comes to recovering the platform or services from a failure. All explicit and implicit dependencies should be examined and considered in order to build adequate DR plans.