This Mirantis Reference Architecture will help you plan large-scale MCR/MKE/MSR platform deployments, offering guidance in sizing hardware and infrastructure for system deployments and in determining optimal workload configurations.
For the three software components (MCR, MKE, and MSR), this guide covers:
Use case parameters that are likely to drive scale requirements
Known scale limits based on real-world tests
Best practices to ensure good performance and future headroom for growth
Configuration factors that apply to the optimization of MCR and MKE include:
Number of Managers
Manager Size and Type
Worker Node Size and Count
Segmenting Tasks and Limiting Resource Use
Resource Constraints
Disk Space
Container Images on Worker Nodes
Logs
Overlay Networking and Routing Mesh
HTTP Routing Mesh
The recommended number of managers for a production cluster is 3 or 5. A 3-manager cluster can tolerate the loss of one manager, and a 5-manager cluster can tolerate two instantaneous manager failures. Clusters with more managers can tolerate more manager failures, but adding more managers also increases the overhead of maintaining and committing cluster state in the Docker Swarm Raft quorum. In some circumstances, clusters with more managers (for example 5 or 7) may be slower (in terms of cluster-update latency and throughput) than a cluster with 3 managers and otherwise similar specs.
In general, increasing the manager count does not make cluster operations faster (it may make them slower in some circumstances), does not increase the max cluster update operation throughput, and does not increase the total number of worker nodes that the cluster can manage.
Even when managers are down and there’s no quorum, services and tasks on the cluster keep running and are steady-state stable (although updating cluster state is not possible without quorum). For that reason, Docker recommends investing in quickly recovering from individual manager failures (e.g. automation/scripts for quickly adding replacement managers) rather than planning clusters with a large number of managers.
1-manager clusters should only be used for testing and experimentation since loss of the manager will cause cluster loss.
See also
Check out the documentation on how manager and worker nodes work.
Managers in a production cluster should ideally have at least 16GB of RAM and 4 vCPUs. Testing done by Docker has shown that managers with 16GB RAM are not memory constrained, even in clusters with 100s of workers and many services, networks, and other metadata.
Managers in production clusters should always use SSDs for the /var/lib/docker/swarm mount point. Docker stores swarm cluster state in this directory and will read and write many small updates as cluster state is updated. SSDs ensure that updates can be committed with minimal latency. SSDs are also recommended for clusters used for test and experimentation to ensure good performance.
Increasing CPU speed and count and improving network latency between manager nodes will also improve cluster performance.
For worker nodes, the overhead of Docker components and agents is not large — typically less than 1GB of memory. Deciding worker size and count can be done similarly to how you currently size app or VM environments. For example, you can determine the app memory working set under load and factor in how many replicas you want for each app (for durability in case of task failure and/or for throughput). That will give you an idea of the total memory required across workers in the cluster.
Remember that Docker Swarm automatically reschedules tasks in case of worker node failure (or if you drain a node for upgrade or servicing), so don’t forget to leave headroom to handle tasks being rebalanced to other nodes.
Also remember that, unlike virtual machines, Docker containers add little or no memory or CPU overhead compared to running an app outside of a container. If you’re moving apps from individual VMs into containers, or if you’re consolidating many apps into an MKE cluster, you should be able to do so with fewer resources than you’re currently using.
On production clusters, restrict the running of workloads to worker nodes.
Important
Workloads should never be run on manager nodes in a production cluster.
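A minimal sketch of enforcing this at the Swarm level (node and service names are hypothetical; MKE also provides an admin scheduler setting that keeps workloads off managers):
# Stop the scheduler from placing tasks on a manager by draining it
docker node update --availability drain manager-node-01

# Or constrain individual services to worker nodes only
docker service create \
  --name web \
  --constraint 'node.role == worker' \
  nginx:latest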
If the tasks and services deployed on your cluster have very different resource profiles and if you want to use different node types for different tasks (for example with different disk, memory, or CPU characteristics) you can use node labels and service constraints to control where Swarm schedules tasks for a particular service.
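For illustration (the node name, label key, and image are hypothetical), a label is applied with docker node update and then referenced in a placement constraint when the service is created:
# Label a worker that has fast local SSD storage
docker node update --label-add storage=ssd worker-node-01

# Schedule the service only on nodes carrying that label
docker service create \
  --name cache \
  --constraint 'node.labels.storage == ssd' \
  redis:latest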
You can also put nodes into collections and control access based on user accounts and teams. This is useful for isolating tasks managed by teams or individuals that are prone to deploying apps that consume many resources or exhibit other noisy neighbor characteristics that negatively affect tasks run by other teams. See the RBAC Knowledge Base article for examples of how to structure teams and projects that use MCR, MKE, and MSR.
MCR and MKE have support for applying resource limits to containers and
service tasks. Docker recommends using the --reserve-memory=<value>
and --limit-memory=<value>
parameters when creating services. These
let MCR and MKE better pack tasks on worker nodes based on expected
memory consumption.
Further, it might be a good idea to allocate a global (1 instance per node) “ghost” service that reserves a chunk (for example 2GB) of memory on each node that can be used by non-Docker system services. This is relevant because Docker Swarm does not currently account for worker node memory consumed by workloads not managed by Docker:
docker service create \
--name system-reservation \
--reserve-memory 2G \
--limit-memory 2G \
--reserve-cpu 1 \
--mode global \
nginx:latest
(nginx does not actually do any work in this service. Any small image that does not consume a lot of memory or CPU can be used instead of nginx).
See also
Check out the docs on container resource constraints and reserving memory or CPUs for a service.
For production clusters, there are a few factors that drive worker disk space use that you should look out for:
In-use Docker container images on worker nodes
Local Docker volumes created for containers
Container logs stored on worker nodes
Mirantis Container Runtime logs stored on worker nodes
Temporary scratch data written by containers
To determine how much space to allocate for in-use images, try putting some of your apps in containers and see how big the resulting images are. Note that Docker images consist of layers, and if the same layer is used by multiple containers (as is common for OS layers like ubuntu or language framework layers like openjdk), only one copy of that layer is stored and used on any given node or Mirantis Secure Registry instance. Layer sharing also means that deploying a new version of your app typically only consumes a relatively small amount of extra space on nodes (since only the top layers that hold your app are changed).
Note that Docker Windows container images often end up being somewhat larger than Linux ones.
To keep in-use container image storage in check, try to ensure that app images derive from common base images. Also consider running regular scripts or cron jobs to prune unused images, especially if nodes handle many image updates (e.g. build servers or test systems that see more frequent deploys). See the docs on image-pruning for details.
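For example (a sketch; the schedule, retention window, and file path are arbitrary assumptions), a nightly cron job on each worker can prune images that are no longer referenced by any container:
# /etc/cron.d/docker-image-prune (hypothetical path)
# Every night at 02:30, remove unused images older than 7 days
30 2 * * * root docker image prune --all --force --filter "until=168h"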
For production clusters, Docker recommends aggregating container logs using a logging driver or other third party service. Only the json-file (and possibly journald) log drivers cause container logs to accumulate on nodes, and in that case, care should be taken to rotate or remove old container logs. See Docker Logging Design and Best Practices for details.
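If you do stay on the default json-file driver, rotation can be configured engine-wide in /etc/docker/daemon.json (the size and file-count values below are illustrative, not recommendations):
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}
Note that these settings only apply to containers created after the daemon is restarted with the new configuration.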
Mirantis Container Runtime logs are stored on worker and manager nodes. The amount of MCR logs generated varies with workload and MCR settings. For example, the debug log level causes more logs to be written. MCR logs should be managed (compacted and eventually deleted) with a utility like logrotate.
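As a sketch (assuming the daemon log is written to a file such as /var/log/docker.log, which depends on how your hosts capture MCR output; on systemd hosts the journal handles this instead), a logrotate policy might look like:
# /etc/logrotate.d/docker (hypothetical; adjust the path to match your hosts)
/var/log/docker.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
}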
MCR and MKE ship with a built-in, supported overlay networking driver for multi-host networking for use with Docker Swarm. Overlay networking incurs overhead associated with encapsulating network traffic and with managing IP addresses and other metadata that tracks networked tasks and services.
Customers that have apps with high network throughput requirements or workloads with high-frequency cluster or service updates should consider minimizing reliance on the out-of-the-box Docker overlay networking and routing mesh. There are several ways to do that:
Use host-mode publishing instead of routing mesh
Use the macvlan driver, which may have better performance than the default driver
Use a non-Docker service discovery mechanism (like Consul)
Consider using dnsrr instead of vip service endpoints (examples of host-mode publishing and dnsrr follow this list)
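For illustration (service names, ports, and the overlay network name are hypothetical), host-mode publishing and dnsrr endpoints are both selected at service-creation time:
# Publish directly on each node's port 8080, bypassing the routing mesh
# (global mode avoids host-port conflicts between replicas on one node)
docker service create \
  --name web \
  --mode global \
  --publish mode=host,target=80,published=8080 \
  nginx:latest

# Use DNS round-robin instead of a virtual IP for an internal service
docker service create \
  --name api \
  --endpoint-mode dnsrr \
  --network app-overlay \
  nginx:latest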
Overlay network size should not exceed /24 blocks (the default) with 256 IP addresses when networks are used by services created using VIP-based endpoint-mode (the default). Users should not work around this by increasing the IP block size. Instead, either use dnsrr endpoint-mode or use multiple smaller overlay networks.
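For example (the subnet and network name are illustrative), create additional overlay networks with explicit /24 subnets rather than growing one network's address block:
# A second /24 overlay network instead of a single, larger block
docker network create \
  --driver overlay \
  --subnet 10.0.10.0/24 \
  app-tier-b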
Also be aware that you may experience IP exhaustion if many tasks
are assigned to a single overlay network. This may occur, for example, if
many services are attached to the network or if services on the network
are scaled to many replicas. The problem may also manifest when tasks are
rescheduled because of node failures. In case of node failure, Docker
currently waits 24 hours to release overlay IP addresses. The problem can be
diagnosed by looking for failed to allocate network IP for task
messages in the Docker daemon logs.
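A sketch for spotting this condition, assuming the engine logs to the systemd journal (adjust the command if your hosts ship daemon logs elsewhere):
# Search the last day of daemon logs for IP-allocation failures
journalctl -u docker --since "24 hours ago" | grep "failed to allocate network IP for task"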
MCR and MKE come with a built-in HTTP Routing Mesh feature. The HTTP Routing Mesh adds some overhead from extra network hops and routing control and should only be used to manage networking for externally-exposed services. For networking and routing between services hosted on Docker, simply use the standard built-in Docker overlay networking for best performance.
This section covers configuration of Mirantis Secure Registry for scale and performance.
Mirantis Secure Registry supports a wide range of storage backends. For scaling purposes, backend types can be classified either as filesystem-based (NFS, bind mount, volume) or cloud/blob-based (AWS S3, Swift, Azure Blob Storage, Google Cloud Storage).
For some uses, cloud/blob-based storage is more performant than
filesystem-based storage. This is because MSR can redirect layer GET
requests from clients directly to the backing store. By doing this the
actual image contents being pulled by Docker clients won’t have to
transit through MSR but can be fetched directly by Docker clients from
the backing store (once metadata has been fetched and credentials
checked by MSR).
When using filesystem-based storage (like NFS), ensure that MSR performance is not constrained by infrastructure. Common bottlenecks include host network interface cards, the load balancer deployed with MSR, throughput (IOPS) and latency of the backend storage system, and the CPU/memory of the MSR replica hosts.
Docker has tested MSR performance and determined that it can handle in excess of 1400 concurrent pulls of 1 GB container images using NFS-backed storage with 3 replicas.
The best way to understand future total image storage requirements is to gather and analyze the following data:
Average image size
Frequency of image updates/pushes
Average size of image updates (considering that many images may share common base layers)
Retention policies and requirements for storing old image artifacts
Use Mirantis Secure Registry Garbage Collection in combination with scripts or other automation that delete old images (using the MSR API) to keep storage use in check.
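As a hedged sketch (the registry host, repository, tag, and credentials are placeholders, and the exact API path can differ between MSR versions, so verify it against the API reference for your release), automation might delete an old tag and let garbage collection reclaim the underlying layers:
# Delete a single outdated tag via the MSR API (placeholders throughout)
curl -u "$MSR_USER:$MSR_TOKEN" -X DELETE \
  "https://msr.example.com/api/v0/repositories/myorg/myapp/tags/v1.0.3"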
The Mirantis Secure Registry write-load is likely to be high when many developers or build machines are pushing images to MSR at the same time.
The read-load is likely to be high when a new image version is pushed to MSR and deployed to an MKE cluster with many instances all pulling the updated image.
If the same MSR cluster instance is used for both developer/build-server artifact storage and for production image artifact storage for a large production MKE cluster, the MSR cluster instances will experience both high write and read load. For very large deployments consider using two (or more) MSR clusters - one focused on supporting developers and build-servers writing images and another one that can handle very high instantaneous read loads generated by production deployments.
When estimating MSR performance requirements, consider average image and image update sizes, how many developers and build machines will be pushing and pulling from your MSR setup, and how many production nodes will concurrently pull updated images during deploys. Ensure that you have enough MSR instances and that your backing storage has enough read and write throughput to handle peak load.
To increase image pull throughput, consider using MSR caches as an alternative to adding more replicas.
Mirantis Secure Registry maintains a quorum of replicas that store metadata about repos, images, tags, and other MSR objects. 3 replicas is the minimum number of replicas for a highly available deployment. 1-replica deployments should only be used for testing and experimentation.
When using multiple MSR replicas, configure a load balancer so that requests are distributed across all MSR replicas.
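A minimal sketch of such a load balancer, assuming three replicas at hypothetical addresses with TLS passed through to the replicas (NGINX stream proxying shown; HAProxy or a cloud load balancer work equally well):
# nginx.conf fragment (replica addresses are placeholders)
stream {
    upstream msr_replicas {
        server 10.0.1.10:443;
        server 10.0.1.11:443;
        server 10.0.1.12:443;
    }
    server {
        listen 443;
        proxy_pass msr_replicas;
    }
}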
An MSR cluster with 5 or 7 replicas may take longer to commit metadata updates (such as image pushes or tag updates) than one with 3 replicas because it takes longer for updates to propagate with a larger quorum.
If using MSR Security Scanning, note that MSR will run at most one concurrent scan per MSR replica. Adding more MSR replicas (or changing to replicas with faster hardware) will increase MSR scanning throughput. Note that MSR does not currently re-scan stored images when the vulnerability database is updated. Backlogs of queued scans are most likely to result from lots of images being updated.
In summary, you may want to consider using more than 3 MSR replicas to achieve:
Peak image push/pull throughput on NFS-based setups in excess of what 3 replicas can handle
More than 3 concurrent image vulnerability scans
Durability in case of more than 1 instantaneous MSR replica failure
Mirantis Secure Registry stores metadata about repos, images, tags, and other objects in a database (user data is maintained by MKE). You can determine the size of the MSR database by checking the size of the /data directory in the dtr-rethink container.
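One way to check this on a replica host (a sketch; it assumes the RethinkDB container's name contains dtr-rethink):
# Report the on-disk size of the MSR metadata database on this replica
docker exec $(docker ps -q --filter name=dtr-rethink) du -sh /data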
The time required to complete MSR cluster operations such as replica-join, backup, and restore is determined by the amount of metadata held by MSR.
If you’re planning a large deployment for use by multiple groups or business units, you should consider minimizing the number of clusters you will use. It is possible you will use a separate cluster for each business unit, but you will typically benefit from using as few clusters as possible.
MKE has strong :products:`team-based multi-tenancy controls <mke/authorize/isolate-nodes>`, including the ability to have collections of worker nodes run only tasks and services created by specific teams. Using these features with a minimal number of clusters will let multiple business units or groups use MKE without the overhead you’ll incur by configuring and operating separate clusters for each business unit.
Even so, there are good reasons to consider using a greater number of clusters:
Stages: Even for smaller deployments it is a good idea to have separate non-production and production clusters. This allows critical changes and updates to be tested in isolation before deploying in production. More granular segmentation can be done if there are many stages (Test, QA, Staging, Prod, etc).
Team or Application separation: team-based multi-tenancy controls allow multiple apps to be safely run in the same cluster. However, more stringent security requirements may necessitate using separate clusters.
Region: Regional redundancy, compliance laws, or latency can all be reasons to segment workloads into multiple clusters.
The same concerns apply when planning how many MSR clusters to use. Note that each MSR cluster instance can currently be registered with only one MKE cluster (a 1:1 mapping). That said, multiple MKE clusters can share a single MSR cluster with some feature limitations.
Planning your cluster deployment with scaling in mind will help maintain optimal performance and adequate disk space, and it will allow you to perform upgrades with little to no downtime.
See also
In planning large-scale MCR/MKE/MSR platform deployments, also consider the recommendations offered in Docker Enterprise Best Practices and Design Considerations.