Running MKE, MSR, and MCR at Scale

This Mirantis Reference Architecture will help you plan large-scale MKE, MSR, and MCR platform deployments, offering guidance in sizing hardware and infrastructure for system deployments and in determining optimal workload configurations.

What You Will Learn

For the three software components (MKE, MSR, and MCR), this guide covers:

  • Use case parameters that are likely to drive scale requirements

  • Known scale limits based on real-world tests

  • Best practices to ensure good performance and future headroom for growth

Configuring MKE and MCR

Configuration factors that apply to the optimization of MKE and MCR include:

  • Number of Managers

  • Manager Size and Type

  • Worker Node Size and Count

  • Segmenting Tasks and Limiting Resource Use

  • Resource Constraints

  • Disk Space

  • Container Images on Worker Nodes

  • Logs

  • Overlay Networking and Routing Mesh

  • HTTP Routing Mesh

Number of Managers

The recommended number of managers for a production cluster is 3 or 5. A 3-manager cluster can tolerate the loss of one manager, and a 5-manager cluster can tolerate two instantaneous manager failures. Clusters with more managers can tolerate more manager failures, but adding more managers also increases the overhead of maintaining and committing cluster state in the Docker Swarm Raft quorum. In some circumstances, clusters with more managers (for example 5 or 7) may be slower (in terms of cluster-update latency and throughput) than a cluster with 3 managers and otherwise similar specs.
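The fault-tolerance figures above follow from the Raft majority quorum: a cluster of N managers needs a majority (floor(N/2) + 1) alive to commit updates, so it tolerates floor((N - 1) / 2) simultaneous failures. A quick sketch of the arithmetic:

```shell
# Raft quorum math: a cluster of N managers tolerates
# floor((N - 1) / 2) simultaneous manager failures.
tolerated_failures() {
  echo $(( ($1 - 1) / 2 ))
}

for n in 1 3 5 7; do
  echo "$n manager(s): tolerates $(tolerated_failures "$n") failure(s)"
done
```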

In general, increasing the manager count does not make cluster operations faster (it may make them slower in some circumstances), does not increase the max cluster update operation throughput, and does not increase the total number of worker nodes that the cluster can manage.

Even when managers are down and there’s no quorum, services and tasks on the cluster keep running and are steady-state stable (although updating cluster state is not possible without quorum). For that reason, Docker recommends investing in quickly recovering from individual manager failures (e.g. automation/scripts for quickly adding replacement managers) rather than planning clusters with a large number of managers.
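Replacing a single failed manager can be scripted with standard Swarm commands. A minimal sketch, run from a surviving manager (the node name manager-2 is hypothetical):

```shell
# Remove the failed manager from the Raft member list.
docker node demote manager-2      # demote first if it is still listed as a manager
docker node rm --force manager-2

# Print the manager join token; run the printed 'docker swarm join'
# command on the replacement host to restore the original manager count.
docker swarm join-token manager
```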

Single-manager clusters should only be used for testing and experimentation, since loss of the sole manager means loss of the cluster.

See also

Check out the documentation on how manager and worker nodes work.

Manager Size and Type

Managers in a production cluster should ideally have at least 16GB of RAM and 4 vCPUs. Testing done by Docker has shown that managers with 16GB RAM are not memory constrained, even in clusters with 100s of workers and many services, networks, and other metadata.

Managers in production clusters should always use SSDs for the /var/lib/docker/swarm mount point. Docker stores swarm cluster state in this directory and will read and write many small updates as cluster state is updated. SSDs ensure that updates can be committed with minimal latency. SSDs are also recommended for clusters used for test and experimentation to ensure good performance.
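One way to verify that the swarm state directory actually sits on non-rotational (SSD) storage is to check the ROTA flag reported by lsblk (0 means non-rotational):

```shell
# Show which device backs the swarm state directory...
df --output=source /var/lib/docker/swarm

# ...and whether each block device is rotational (ROTA=1) or an SSD (ROTA=0).
lsblk -o NAME,ROTA,MOUNTPOINT
```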

Increasing CPU speed and count and improving network latency between manager nodes will also improve cluster performance.

Worker Node Size and Count

For worker nodes, the overhead of Docker components and agents is not large — typically less than 1GB of memory. Deciding worker size and count can be done similarly to how you currently size app or VM environments. For example, you can determine the app memory working set under load and factor in how many replicas you want for each app (for durability in case of task failure and/or for throughput). That will give you an idea of the total memory required across workers in the cluster.

Remember that Docker Swarm automatically reschedules tasks in case of worker node failure (or if you drain a node for upgrade or servicing), so don’t forget to leave headroom to handle tasks being rebalanced to other nodes.
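As a rough illustration of the headroom arithmetic (all the numbers below are hypothetical):

```shell
# Hypothetical sizing: 10 workers, tolerate 1 worker failure,
# 40 task replicas each reserving 2 GB of memory.
WORKERS=10
TOLERATED_FAILURES=1
TASKS=40
RESERVE_GB_PER_TASK=2

TOTAL_GB=$(( TASKS * RESERVE_GB_PER_TASK ))
SURVIVING_WORKERS=$(( WORKERS - TOLERATED_FAILURES ))

# Round up so rescheduled tasks still fit after a node failure.
PER_WORKER_GB=$(( (TOTAL_GB + SURVIVING_WORKERS - 1) / SURVIVING_WORKERS ))

echo "Provision at least ${PER_WORKER_GB} GB of workload memory per worker"
```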

Also remember that, unlike virtual machines, Docker containers add little or no memory or CPU overhead compared to running an app outside of a container. If you’re moving apps from individual VMs into containers, or if you’re consolidating many apps into an MKE cluster, you should be able to do so with fewer resources than you’re currently using.

Segmenting Tasks and Limiting Resource Use

On production clusters, restrict the running of workloads to worker nodes.


Workloads should never be run on manager nodes in a production cluster.
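MKE also exposes an admin setting to prevent scheduling on managers; at the Swarm CLI level, one way to enforce this is to set each manager's availability to drain (the node name is hypothetical):

```shell
# Prevent workloads from being scheduled on a manager node.
docker node update --availability drain manager-1
```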

If the tasks and services deployed on your cluster have very different resource profiles and if you want to use different node types for different tasks (for example with different disk, memory, or CPU characteristics) you can use node labels and service constraints to control where Swarm schedules tasks for a particular service.
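For example, nodes with fast local disks can be labeled and a service pinned to them (labels, node name, service name, and image are hypothetical):

```shell
# Label a worker that has local SSDs.
docker node update --label-add storage=ssd worker-7

# Constrain a service so Swarm only schedules its tasks on labeled nodes.
docker service create \
  --name db \
  --constraint 'node.labels.storage == ssd' \
  postgres:15
```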

You can also put nodes into collections and control access based on user accounts and teams. This is useful for isolating tasks managed by teams or individuals that are prone to deploying apps that consume many resources or exhibit other noisy neighbor characteristics that negatively affect tasks run by other teams. See the RBAC Knowledge Base article for examples of how to structure teams and projects that use MKE, MSR, and MCR.

Resource Constraints

MKE and MCR have support for applying resource limits to containers and service tasks. Docker recommends using the --reserve-memory=<value> and --limit-memory=<value> parameters when creating services. These let MKE and MCR better pack tasks on worker nodes based on expected memory consumption.

Further, it might be a good idea to allocate a global (1 instance per node) “ghost” service that reserves a chunk (for example 2GB) of memory on each node that can be used by non-Docker system services. This is relevant because Docker Swarm does not currently account for worker node memory consumed by workloads not managed by Docker:

docker service create \
  --name system-reservation \
  --reserve-memory 2G \
  --limit-memory 2G \
  --reserve-cpu 1 \
  --mode global \
  nginx

(nginx does not actually do any work in this service; any small image that does not consume much memory or CPU can be used instead.)

Disk Space

For production clusters, there are a few factors that drive worker disk space use that you should look out for:

  • In-use Docker container images on worker nodes

  • Local Docker volumes created for containers

  • Container logs stored on worker nodes

  • Mirantis Container Runtime logs stored on worker nodes

  • Temporary scratch data written by containers

Container Images on Worker Nodes

To determine how much space to allocate for in-use images, try putting some of your apps in containers and see how big the resulting images are. Note that Docker images consist of layers, and if the same layer is used by multiple containers (as is common of OS layers like ubuntu or language framework layers like openjdk), only one copy of that layer is stored and used on any given node or Mirantis Secure Registry. Layer sharing also means that deploying a new version of your app typically only consumes a relatively small amount of extra space on nodes (since only the top layers that hold your app are changed).

Note that Docker Windows container images often end up being somewhat larger than Linux ones.

To keep in-use container image storage in check, try to ensure that app images derive from common base images. Also consider running regular scripts or cron jobs to prune unused images, especially on nodes that handle many image updates (e.g. build servers or test systems that see more frequent deploys). See the docs on image pruning for details.
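A sketch of such a periodic prune job (the age filter and schedule are arbitrary):

```shell
# Remove images not referenced by any container and older than 72 hours.
docker image prune --all --force --filter "until=72h"

# Example cron entry (runs nightly at 02:30):
# 30 2 * * * /usr/bin/docker image prune --all --force --filter "until=72h"
```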


Logs

For production clusters, Docker recommends aggregating container logs using a logging driver or other third-party service. Only the json-file (and possibly journald) log drivers cause container logs to accumulate on nodes; in that case, take care to rotate or remove old container logs. See Logging Design and Best Practices for details.
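If you do keep the default json-file driver, the runtime can rotate container logs itself via daemon options. A sketch of /etc/docker/daemon.json (the size and file-count values are arbitrary):

```shell
# Cap each container's log at 5 files of 50 MB under the json-file driver.
cat <<'EOF' > /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}
EOF
# Restart the runtime; the options apply to newly created containers.
```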

Mirantis Container Runtime logs are stored on worker and manager nodes. The amount of MCR logs generated varies with workload and MCR settings. For example, debug log level causes more logs to be written. MCR logs should be managed (compacted and eventually deleted) with a utility like logrotate.
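A minimal logrotate sketch for runtime logs (the log path varies by distribution and logging setup; this assumes logs land in /var/log/docker.log):

```shell
# Rotate runtime logs weekly and keep 4 compressed copies.
cat <<'EOF' > /etc/logrotate.d/docker
/var/log/docker.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}
EOF
```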

Overlay Networking and Routing Mesh

MKE and MCR ship with a built-in, supported overlay networking driver for multi-host networking for use with Docker Swarm. Overlay networking incurs overhead associated with encapsulating network traffic and with managing IP addresses and other metadata that tracks networked tasks and services.

Customers that have apps with high network throughput requirements or workloads with high-frequency cluster or service updates should consider minimizing reliance on the out-of-the-box Docker overlay networking and routing mesh. There are several ways to do that:

  • Use host-mode publishing instead of routing mesh

  • Use the macvlan driver, which may have better performance than the default driver

  • Use a non-Docker service discovery mechanism (like Consul)

  • Consider using dnsrr instead of vip service endpoints
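The first and last options above can be combined in a single service; for example (service name, ports, and image are hypothetical):

```shell
# Publish directly on each task's host, bypassing the routing mesh,
# and use DNS round-robin instead of a VIP for service discovery.
docker service create \
  --name web \
  --publish mode=host,target=80,published=8080 \
  --endpoint-mode dnsrr \
  nginx
```

Note that with mode=host publishing, only one task per node can bind the published port.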

Overlay networks should not exceed the default /24 block (256 IP addresses) when they are used by services created with the VIP-based endpoint mode (the default). Do not work around this by increasing the IP block size; instead, either use dnsrr endpoint mode or use multiple smaller overlay networks.
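Creating several small overlay networks instead of one oversized block might look like this (subnets and names are hypothetical):

```shell
# Two /24 overlay networks instead of one larger block.
docker network create --driver overlay --subnet 10.0.9.0/24  app-net-a
docker network create --driver overlay --subnet 10.0.10.0/24 app-net-b
```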

Also be aware that you may experience IP exhaustion if many tasks are assigned to a single overlay network. This may occur, for example, if many services are attached to the network or if services on the network are scaled to many replicas. The problem may also manifest when tasks are rescheduled because of node failures. In case of node failure, Docker currently waits 24 hours to release overlay IP addresses. The problem can be diagnosed by looking for failed to allocate network IP for task messages in the Docker daemon logs.
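Diagnosing the condition amounts to searching the daemon logs for that message; on systemd hosts, for example:

```shell
# Search the runtime's journal for overlay IP allocation failures.
journalctl -u docker.service --no-pager | grep "failed to allocate network IP for task"
```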

HTTP Routing Mesh

MKE and MCR come with a built-in HTTP Routing Mesh feature. The HTTP Routing Mesh adds some overhead from extra network hops and routing control and should only be used to manage networking for externally-exposed services. For networking and routing between services hosted on Docker, simply use the standard built-in Docker overlay networking for best performance.

Mirantis Secure Registry

This section covers configuration of Mirantis Secure Registry for scale and performance.

Storage Driver

Mirantis Secure Registry supports a wide range of storage backends. For scaling purposes, backend types can be classified either as filesystem-based (NFS, bind mount, volume) or cloud/blob-based (AWS S3, Swift, Azure Blob Storage, Google Cloud Storage).

For some uses, cloud/blob-based storage is more performant than filesystem-based storage. This is because MSR can redirect layer GET requests from clients directly to the backing store. The actual image contents then do not have to transit through MSR; Docker clients fetch them directly from the backing store (once MSR has served the metadata and checked credentials).

When using filesystem-based storage (like NFS), ensure that MSR performance is not constrained by infrastructure. Common bottlenecks include host network interface cards, the load balancer deployed with MSR, the IOPS, throughput, and latency of the backend storage system, and the CPU/memory of the MSR replica hosts.

Docker has tested MSR performance and determined that it can handle in excess of 1400 concurrent pulls of 1 GB container images using NFS-backed storage with 3 replicas.

Total Storage

The best way to understand future total image storage requirements is to gather and analyze the following data:

  • Average image size

  • Frequency of image updates/pushes

  • Average image update size (considering that many images may share common base layers)

  • Retention policies and requirements for storing old image artifacts

Use Mirantis Secure Registry Garbage Collection in combination with scripts or other automation that delete old images (using the MSR API) to keep storage use in check.

Storage Performance

The Mirantis Secure Registry write-load is likely to be high when many developers or build machines are pushing images to MSR at the same time.

The read-load is likely to be high when a new image version is pushed to MSR and deployed to an MKE cluster with many instances all pulling the updated image.

If the same MSR cluster instance is used for both developer/build-server artifact storage and for production image artifact storage for a large production MKE cluster, the MSR cluster instances will experience both high write and read load. For very large deployments consider using two (or more) MSR clusters - one focused on supporting developers and build-servers writing images and another one that can handle very high instantaneous read loads generated by production deployments.

When estimating MSR performance requirements, consider average image and image update sizes, how many developers and build machines will be pushing and pulling from your MSR setup, and how many production nodes will concurrently pull updated images during deploys. Ensure that you have enough MSR instances and that your backing storage has enough read and write throughput to handle peak load.

To increase image pull throughput, consider using MSR caches as an alternative to adding more replicas.

Replica Count

Mirantis Secure Registry maintains a quorum of replicas that store metadata about repos, images, tags, and other MSR objects. 3 replicas is the minimum number of replicas for a highly available deployment. 1-replica deployments should only be used for testing and experimentation.

When using multiple MSR replicas, configure a load balancer so that requests are distributed across all MSR replicas.

An MSR cluster with 5 or 7 replicas may take longer to commit metadata updates (such as image pushes or tag updates) than one with 3 replicas, because updates take longer to propagate across a larger quorum.

If using MSR Security Scanning, note that MSR will run at most one concurrent scan per MSR replica. Adding more MSR replicas (or changing to replicas with faster hardware) will increase MSR scanning throughput. Note that MSR does not currently re-scan stored images when the vulnerability database is updated. Backlogs of queued scans are most likely to result from lots of images being updated.

In summary, you may want to consider using more than 3 MSR replicas to achieve:

  • Peak image push/pull throughput on NFS-based setups in excess of what 3 replicas can handle

  • More than 3 concurrent image vulnerability scans

  • Durability in case of more than 1 instantaneous MSR replica failure

Metadata Size and Cluster Operations

Mirantis Secure Registry stores metadata about repos, images, tags, and other objects in a database (user data is maintained by MKE). You can determine the size of the MSR database by checking the size of the /data directory in the dtr-rethink container.
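A sketch of checking that directory's size from a host running a replica (the container name follows the paragraph above; the exact running name may include a replica ID suffix):

```shell
# Find the MSR RethinkDB container and report the size of its /data directory.
RETHINK=$(docker ps --filter name=dtr-rethink --format '{{.ID}}' | head -n 1)
docker exec "$RETHINK" du -sh /data
```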

The time required to complete MSR cluster operations such as replica-join, backup, and restore is determined by the amount of metadata held by MSR.

Cluster Size

If you’re planning a large deployment for use by multiple groups or business units, you should consider minimizing the number of clusters you will use. It is possible you will use a separate cluster for each business unit, but you will typically benefit from using as few clusters as possible.

MKE has strong team-based multi-tenancy controls, including the ability to have collections of worker nodes run only tasks and services created by specific teams. Using these features with a minimal number of clusters will let multiple business units or groups use MKE without the overhead you’ll incur by configuring and operating separate clusters for each business unit.

Even so, there are good reasons to consider using a greater number of clusters:

  • Stages: Even for smaller deployments it is a good idea to have separate non-production and production clusters. This allows critical changes and updates to be tested in isolation before deploying in production. More granular segmentation can be done if there are many stages (Test, QA, Staging, Prod, etc).

  • Team or Application separation: team-based multi-tenancy controls allow multiple apps to be safely run in the same cluster. However, more stringent security requirements may necessitate using separate clusters.

  • Region: Regional redundancy, compliance laws, or latency can all be reasons to segment workloads into multiple clusters.

The same concerns apply when planning how many MSR clusters to use. Note that you are currently limited to a 1:1 mapping between MKE and MSR cluster instances. That said, multiple MKE clusters can share a single MSR cluster with some feature limitations.


Planning your cluster deployment with scaling in mind will help maintain optimal performance and adequate disk space, and it will allow you to perform upgrades with little to no downtime.

See also

In planning large-scale MKE, MSR, and MCR platform deployments, also consider the recommendations offered in Docker Enterprise Best Practices and Design Considerations.