This reference architecture will help you plan large-scale Docker Enterprise deployments. It covers the core Docker Enterprise platform, Mirantis Kubernetes Engine, and Mirantis Secure Registry. Use this guide to help size hardware and infrastructure for your Docker Enterprise deployments and to determine the optimal configuration for your specific workloads.
For Docker Enterprise, Mirantis Kubernetes Engine, and Mirantis Secure Registry, the guide covers:
This section covers configuration of the base Docker Enterprise platform and Mirantis Kubernetes Engine for optimal performance and growth potential.
The recommended number of managers for a production cluster is 3 or 5. A 3-manager cluster can tolerate the loss of one manager, and a 5-manager cluster can tolerate two simultaneous manager failures. Clusters with more managers can tolerate more manager failures, but adding more managers also increases the overhead of maintaining and committing cluster state in the Docker Swarm Raft quorum. In some circumstances, clusters with more managers (for example 5 or 7) may be slower (in terms of cluster-update latency and throughput) than a cluster with 3 managers and otherwise similar specs.
In general, increasing the manager count does not make cluster operations faster (it may make them slower in some circumstances), does not increase the max cluster update operation throughput, and does not increase the total number of worker nodes that the cluster can manage.
Even when managers are down and there’s no quorum, services and tasks on the cluster keep running and are steady-state stable (although updating cluster state is not possible without quorum). For that reason, Docker recommends investing in quickly recovering from individual manager failures (e.g. automation/scripts for quickly adding replacement managers) rather than planning clusters with a large number of managers.
1-manager clusters should only be used for testing and experimentation since loss of the manager will cause cluster loss.
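A replacement-manager runbook can be as simple as the following sketch (host names and the join token are placeholders; run the first commands on a surviving healthy manager):

```shell
# On a healthy manager: remove the failed manager from the swarm.
docker node demote failed-manager-1
docker node rm --force failed-manager-1

# Still on the healthy manager: print the manager join token.
docker swarm join-token manager

# On the replacement host: join using the token printed above
# (token and address are placeholders).
docker swarm join --token <manager-join-token> healthy-manager-1:2377
```

Automating these steps keeps the window without full quorum short, which is usually a better investment than running extra managers.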
See also
Check out the documentation on how manager and worker nodes work.
Managers in a production cluster should ideally have at least 16GB of RAM and 4 vCPUs. Testing done by Docker has shown that managers with 16GB RAM are not memory constrained, even in clusters with hundreds of workers and many services, networks, and other metadata.
Managers in production clusters should always use SSDs for the `/var/lib/docker/swarm` mount point. Docker stores swarm cluster state in this directory and will read and write many small updates as cluster state changes. SSDs ensure that updates can be committed with minimal latency. SSDs are also recommended for clusters used for test and experimentation to ensure good performance.
Increasing CPU speed and count and improving network latency between manager nodes will also improve cluster performance.
For worker nodes, the overhead of Docker components and agents is not large — typically less than 1GB of memory. Deciding worker size and count can be done much as you currently size app or VM environments. For example, you can determine the app memory working set under load and factor in how many replicas you want for each app (for durability in case of task failure and/or for throughput). That will give you an idea of the total memory required across workers in the cluster.
Remember that Docker Swarm automatically reschedules tasks in case of worker node failure (or if you drain a node for upgrade or servicing), so don’t forget to leave headroom to handle tasks being rebalanced to other nodes.
Also remember that, unlike virtual machines, Docker containers add little or no memory or CPU overhead compared to running an app outside of a container. If you’re moving apps from individual VMs into containers, or if you’re consolidating many apps into a Docker Enterprise cluster, you should be able to do so with fewer resources than are currently used.
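As a sketch of this sizing exercise (the app names and numbers below are hypothetical), total worker memory can be estimated from per-replica working sets, replica counts, per-node agent overhead, and a rebalancing headroom factor:

```shell
# Hypothetical sizing inputs: per-replica working set (GB) and replica count.
APP_A_MEM=2;  APP_A_REPLICAS=6     # e.g. a Java API service
APP_B_MEM=1;  APP_B_REPLICAS=10    # e.g. a lightweight web frontend

# Memory needed by all task replicas combined.
TASK_MEM=$(( APP_A_MEM * APP_A_REPLICAS + APP_B_MEM * APP_B_REPLICAS ))

# Reserve ~1 GB per worker for Docker components and agents,
# and add ~25% headroom so tasks from a failed or drained node
# can be rebalanced onto the remaining workers.
WORKERS=4
OVERHEAD=$(( WORKERS * 1 ))
TOTAL=$(( (TASK_MEM + OVERHEAD) * 125 / 100 ))

echo "Estimated total worker memory: ${TOTAL} GB across ${WORKERS} workers"
```

The 25% headroom factor is an assumption; pick a value that matches how much of your cluster you expect to lose or drain at once.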
On production clusters, never run workloads on manager nodes. This is a configurable manager node setting in Mirantis Kubernetes Engine (MKE).
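MKE exposes this as an admin scheduler setting; at the Swarm CLI level a similar effect can be achieved by setting manager availability to drain (node names below are placeholders):

```shell
# Prevent Swarm from scheduling new tasks on the managers and
# move existing tasks off them (node names are placeholders).
for mgr in manager-1 manager-2 manager-3; do
  docker node update --availability drain "$mgr"
done
```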
If the tasks and services deployed on your cluster have very different resource profiles, or if you want to use different node types for different tasks (for example, nodes with different disk, memory, or CPU characteristics), you can use node labels and service constraints to control where Swarm schedules tasks for a particular service.
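For example (the label values, node names, and service here are illustrative):

```shell
# Tag nodes that have fast local disks...
docker node update --label-add disk=ssd worker-3
docker node update --label-add disk=ssd worker-4

# ...and constrain a disk-hungry service to those nodes.
docker service create \
  --name db \
  --constraint 'node.labels.disk == ssd' \
  postgres:latest
```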
You can also put nodes into collections and control access based on user accounts and teams. This is useful for isolating tasks managed by teams or individuals that are prone to deploying apps that consume many resources or exhibit other noisy neighbor characteristics that negatively affect tasks run by other teams. See the RBAC Knowledge Base article for examples of how to structure teams and projects with Docker Enterprise Edition.
Docker Enterprise has support for applying resource limits to containers and service tasks. Docker recommends using the `--reserve-memory=<value>` and `--limit-memory=<value>` parameters when creating services. These let Docker Enterprise better pack tasks on worker nodes based on expected memory consumption.
Further, it might be a good idea to allocate a global (1 instance per node) “ghost” service that reserves a chunk (for example 2GB) of memory on each node that can be used by non-Docker system services. This is relevant because Docker Swarm does not currently account for worker node memory consumed by workloads not managed by Docker:
```shell
docker service create \
  --name system-reservation \
  --reserve-memory 2G \
  --limit-memory 2G \
  --reserve-cpu 1 \
  --mode global \
  nginx:latest
```

(`nginx` does not actually do any work in this service. Any small image that does not consume a lot of memory or CPU can be used instead of `nginx`.)
See also
Check out the docs on container resource constraints and reserving memory or CPUs for a service.
For production clusters, there are a few factors that drive worker disk space use that you should look out for:
To determine how much space to allocate for in-use images, try putting some of your apps in containers and see how big the resulting images are. Note that Docker images consist of layers, and if the same layer is used by multiple containers (as is common for OS layers like `ubuntu` or language-framework layers like `openjdk`), only one copy of that layer is stored and used on any given node or Mirantis Secure Registry. Layer sharing also means that deploying a new version of your app typically consumes only a relatively small amount of extra space on nodes (since only the top layers that hold your app change).
Note that Docker Windows container images often end up being somewhat larger than Linux ones.
To keep in-use container image storage in check, try to ensure that app images derive from common base images. Also consider running regular scripts or cron jobs to prune unused images, especially if nodes handle many image updates (e.g. build servers or test systems that see frequent deploys). See the docs on image-pruning for details.
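On nodes that see frequent deploys, a periodic prune job along these lines (the 72-hour retention window is an example) keeps unused images in check:

```shell
# Run daily from cron: remove all images not used by any container
# that have been unused for more than 72 hours.
docker image prune --all --force --filter "until=72h"
```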
For production clusters, Docker recommends aggregating container logs using a logging driver or other third-party service. Only the `json-file` (and possibly `journald`) log drivers cause container logs to accumulate on nodes, and in that case, care should be taken to rotate or remove old container logs. See Docker Logging Design and Best Practices for details.
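If you do stay on the `json-file` driver, log rotation can be configured in the daemon configuration (typically `/etc/docker/daemon.json`); the sizes below are illustrative:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "5"
  }
}
```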
Mirantis Container Runtime logs are stored on worker and manager nodes. The amount of Mirantis Container Runtime logs generated varies with workload and engine settings. For example, the `debug` log level causes more logs to be written. Mirantis Container Runtime logs should be managed (compacted and eventually deleted) with a utility like `logrotate`.
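A minimal `logrotate` policy might look like the following (the log path is an example for systems that write engine logs to a file rather than to journald; adjust it to your distribution):

```
/var/log/docker.log {
  daily
  rotate 7
  compress
  missingok
  notifempty
}
```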
Docker Enterprise ships with a built-in, supported overlay networking driver for multi-host networking for use with Docker Swarm. Overlay networking incurs overhead associated with encapsulating network traffic and with managing IP addresses and other metadata that tracks networked tasks and services.
Docker Enterprise customers that have apps with very high network-throughput requirements or workloads that are extremely dynamic (high-frequency cluster or service updates) should consider minimizing reliance on the out-of-the-box Docker overlay networking and routing mesh. There are several ways to do that:
- Use `dnsrr` instead of `vip` service endpoints

Overlay network size should not exceed `/24` blocks (the default) with 256 IP addresses when networks are used by services created using VIP-based endpoint mode (the default). Users should not work around this by increasing the IP block size. Instead, either use `dnsrr` endpoint mode or use multiple smaller overlay networks.
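For example, you can split services across deliberately small overlay networks, or create a service with `dnsrr` endpoint mode (network names, subnets, and the image are illustrative):

```shell
# Multiple small overlay networks instead of one oversized block.
docker network create -d overlay --subnet 10.10.1.0/24 backend
docker network create -d overlay --subnet 10.10.2.0/24 frontend

# DNS round-robin endpoint mode instead of a VIP.
docker service create \
  --name api \
  --endpoint-mode dnsrr \
  --network backend \
  myorg/api:latest
```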
Also be aware that Docker Enterprise may experience IP exhaustion if many tasks are assigned to a single overlay network, for example if many services are attached to that network or if services on the network are scaled to many replicas. The problem may also manifest when tasks are rescheduled because of node failures. In case of node failure, Docker currently waits 24 hours to release overlay IP addresses. The problem can be diagnosed by looking for `failed to allocate network IP for task` messages in the Docker daemon logs.
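A quick way to check a node for this condition (log location varies by init system):

```shell
# On systemd hosts, search the daemon logs for allocation failures.
journalctl -u docker.service | grep 'failed to allocate network IP for task'
```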
Docker Enterprise Edition with Mirantis Kubernetes Engine comes with a built-in HTTP Routing Mesh feature. HTTP Routing Mesh adds some overhead from extra network hops and routing control, and should only be used for managing networking for externally exposed services. For networking and routing between services hosted on Docker, use the standard built-in Docker overlay networking for best performance.
This section covers configuration of Mirantis Secure Registry for scale and performance.
Mirantis Secure Registry supports a wide range of storage backends. For scaling purposes, backend types can be classified either as filesystem-based (NFS, bind mount, volume) or cloud/blob-based (AWS S3, Swift, Azure Blob Storage, Google Cloud Storage).
For some uses, cloud/blob-based storage is more performant than filesystem-based storage. This is because MSR can redirect layer GET requests from clients directly to the backing store. As a result, the actual image contents being pulled do not have to transit through MSR; Docker clients can fetch them directly from the backing store (once metadata has been fetched and credentials checked by MSR).
When using filesystem-based storage (like NFS), ensure that MSR performance is not constrained by infrastructure. Common bottlenecks include host network interface cards, the load balancer deployed with MSR, throughput (IOPS) and latency of the backend storage system, and the CPU/memory of the MSR replica hosts.
Docker has tested MSR performance and determined that it can handle in excess of 1400 concurrent pulls of 1 GB container images using NFS-backed storage with 3 replicas.
The best way to understand future total image storage requirements is to gather and analyze the following data:
Use Mirantis Secure Registry Garbage Collection in combination with scripts or other automation that delete old images (using the MSR API) to keep storage use in check.
The Mirantis Secure Registry write-load is likely to be high when many developers or build machines are pushing images to MSR at the same time.
Read-load is likely to be high when a new image version is pushed to MSR and is then deployed to a large Docker Enterprise cluster with many instances that are all pulling the updated image.
If the same MSR cluster instance is used for both developer/build-server artifact storage and for production image artifact storage for a large production Docker Enterprise MKE cluster, the MSR cluster instances will experience both high write and read load. For very large deployments, consider using two (or more) MSR clusters: one focused on supporting developers and build servers writing images, and another that can handle the very high instantaneous read loads generated by production deployments.
When estimating MSR performance requirements, consider average image and image update sizes, how many developers and build machines will be pushing and pulling from your MSR setup, and how many production nodes will concurrently pull updated images during deploys. Ensure that you have enough MSR instances and that your backing storage has enough read and write throughput to handle peak load.
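To make the peak-load estimate concrete, a rough calculation (with hypothetical numbers) of the backend read throughput needed during a production deploy might look like this:

```shell
# Hypothetical deploy: 200 production nodes each pull a 1 GB image
# update, and the rollout should finish within 5 minutes.
NODES=200
IMAGE_GB=1
WINDOW_SECONDS=300

# Required aggregate read throughput from MSR's backing store, in MB/s
# (integer arithmetic; good enough for capacity planning).
THROUGHPUT_MBS=$(( NODES * IMAGE_GB * 1024 / WINDOW_SECONDS ))
echo "Backing store must sustain roughly ${THROUGHPUT_MBS} MB/s of reads"
```

MSR caches placed close to the production nodes can substantially reduce the load this calculation implies for the central backing store.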
To increase image pull throughput, consider using MSR caches as an alternative to adding more replicas.
Mirantis Secure Registry maintains a quorum of replicas that store metadata about repos, images, tags, and other MSR objects. 3 replicas is the minimum number of replicas for a highly available deployment. 1-replica deployments should only be used for testing and experimentation.
When using multiple MSR replicas, configure a load balancer so that requests are distributed to all MSR replicas.
An MSR cluster with 5 or 7 replicas may take longer to commit metadata updates (such as image pushes or tag updates) than one with 3 replicas, because it takes longer for updates to propagate with a larger quorum.
If using MSR Security Scanning, note that MSR will run at most one concurrent scan per MSR replica. Adding more MSR replicas (or changing to replicas with faster hardware) will increase MSR scanning throughput. Note that MSR does not currently re-scan stored images when the vulnerability database is updated. Backlogs of queued scans are most likely to result from many images being updated in a short period.
In summary, you may want to consider using more than 3 MSR replicas to achieve:
Mirantis Secure Registry stores metadata about repos, images, tags, and other objects in a database (user data is maintained by Mirantis Kubernetes Engine). You can determine the size of the MSR database by checking the size of the `/data` directory in the `dtr-rethink` container.
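For example (the container-name filter is approximate; adjust it to how your deployment names the replica containers):

```shell
# Report the size of the MSR metadata database on a replica.
docker exec "$(docker ps --quiet --filter name=dtr-rethink)" du -sh /data
```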
The time required to complete MSR cluster operations such as replica-join, backup, and restore is determined by the amount of metadata held by MSR.
If you’re planning a large Docker Enterprise deployment that’s going to be used by multiple groups or business units, you should consider whether to run a single cluster or multiple clusters (e.g. one for each business unit). Both are valid options, but you will typically get greater benefits from consolidation by using just one or a few clusters.
Docker Enterprise Edition has strong team-based multi-tenancy controls, including assigning collections of worker nodes to only run tasks and services created by specific teams. Using these features with a single cluster (or a few clusters) lets multiple business units or groups use Docker Enterprise Edition without the overhead of configuring and operating multiple clusters.
Even so, there might be good reasons to use multiple clusters:
The same concerns apply when planning how many MSR clusters to use. Note that Docker Enterprise with Mirantis Kubernetes Engine and MSR are currently limited to a 1:1 mapping between MKE and MSR cluster instances, although multiple MKE clusters can share a single MSR cluster with some feature limitations.
Planning your Docker Enterprise deployment with scaling in mind will help maintain optimal performance, adequate disk space, and more as workloads grow. It will also allow you to perform upgrades with little to no downtime.
See also
While using this guide to plan and architect large-scale Docker Enterprise Edition deployments, also consider the recommendations in Docker Enterprise Best Practices and Design Considerations.