You can monitor the status of MKE using the web UI or the CLI. You can also use the `_ping` endpoint to build monitoring automation.
The first place to check the status of MKE is the MKE web UI, since it shows warnings for situations that require your immediate attention. Administrators might see more warnings than regular users.
You can also navigate to the Nodes page to check whether all the nodes managed by MKE are healthy. Each node has a status message explaining any problems with the node. In this example, a Windows worker node is down. Click the node to get more information on its status. In the details pane, click Actions and select Agent logs to see the log entries from that node.
You can also monitor the status of an MKE cluster using the Docker CLI client. Download an MKE client certificate bundle and then run:
docker node ls
As a rule of thumb, if the status message starts with [Pending], the current state is transient and the node is expected to correct itself back into a healthy state.
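That rule of thumb is easy to automate. The sketch below (the `check_nodes` helper name is our own, not an MKE tool) filters the machine-readable output of `docker node ls` and flags any node whose status is not Ready:

```shell
# Sketch: flag any swarm node that is not Ready, using the
# machine-readable output of `docker node ls`. check_nodes is an
# illustrative helper name, not part of MKE.
check_nodes() {
    # stdin: "hostname status" pairs; prints each unhealthy node and
    # exits non-zero if any were found.
    awk '$2 != "Ready" { print "node " $1 " is " $2; bad = 1 }
         END { exit bad }'
}

# With a client bundle loaded, pipe live output through the helper:
#   docker node ls --format '{{.Hostname}} {{.Status}}' | check_nodes
```

The exit code makes the helper usable directly in a cron job or alerting hook.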
You can use the `https://<mke-manager-url>/_ping` endpoint to check the health of a single MKE manager node. When you access this endpoint, the MKE manager validates that all its internal components are working and returns one of the following HTTP status codes:

- 200, if all components are healthy
- 500, if one or more components are not healthy
If an administrator client certificate is used as a TLS client certificate for the `_ping` endpoint, a detailed error message is returned if any component is unhealthy.
If you access the `_ping` endpoint through a load balancer, you have no way of knowing which MKE manager node is unhealthy, since any manager node might be serving your request. Make sure you connect directly to the URL of a manager node, not to a load balancer. In addition, be aware that pinging the endpoint with HEAD results in a 404 error code; use GET instead.
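As a minimal sketch of such monitoring automation, the helper below maps the HTTP status code returned by a GET on `/_ping` to a health verdict. The `classify_ping` name, example hostname, and CA file path are all placeholders:

```shell
# Sketch: classify the HTTP status code returned by GET /_ping on a
# single manager node. classify_ping is an illustrative helper name.
classify_ping() {
    case "$1" in
        200) echo "healthy" ;;
        *)   echo "unhealthy (HTTP $1)"; return 1 ;;
    esac
}

# Example probe (connect to a manager directly, not a load balancer;
# mke-manager.example.com and ca.pem are placeholders):
#   code=$(curl -s -o /dev/null -w '%{http_code}' \
#              --cacert ca.pem "https://mke-manager.example.com/_ping")
#   classify_ping "$code"
```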
For deployments with a subscription, MKE displays image vulnerability count data from the MSR image scanning feature: vulnerability counts appear for containers, Swarm services, pods, and images.
Web UI disk usage metrics, including free space, only reflect the Docker-managed portion of the filesystem: /var/lib/docker. To monitor the total space available on each filesystem of an MKE worker or manager, you must deploy a third-party monitoring solution that monitors the operating system.
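Until a full monitoring agent is in place, a quick sketch like the one below can report filesystems over a usage threshold on a node. The 80% threshold and the `check_disk` helper name are illustrative assumptions:

```shell
# Sketch: print any filesystem whose use% exceeds a threshold, read
# from POSIX `df -P` output. The 80% limit is an arbitrary example.
THRESHOLD=80

check_disk() {
    awk -v max="$THRESHOLD" 'NR > 1 {
        sub(/%/, "", $5)                      # strip the % sign
        if ($5 + 0 > max) print $6 " is " $5 "% full"
    }'
}

# On a live node: df -P | check_disk
```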
There are several cases in the lifecycle of MKE when a node is actively transitioning from one state to another, such as when a new node is joining the cluster or during node promotion and demotion. In these cases, the current step of the transition will be reported by MKE as a node message. You can view the state of each individual node by monitoring the cluster status.
The following table lists all possible node states that may be reported for an MKE node, their explanations, and the expected duration of each step.
Message | Description | Typical step duration |
---|---|---|
Completing node registration | Waiting for the node to appear in the KV node inventory. This is expected to occur when a node first joins the MKE swarm. | 5 - 30 seconds |
heartbeat failure | The node has not contacted any swarm managers in the last 10 seconds. Check the Swarm state in `docker info` on the node: `inactive` means the node has been removed from the swarm with `docker swarm leave`; `pending` means `dockerd` on the node has been attempting to contact a manager since `dockerd` started, so confirm that your network security policy allows TCP port 2377 from the node to the managers; `error` means an error prevented swarm from starting on the node, so check the Docker daemon logs on the node. | Until resolved |
Node is being reconfigured | The `ucp-reconcile` container is currently converging the current state of the node to the desired state. This process may involve issuing certificates, pulling missing images, and starting containers, depending on the current node state. | 1 - 60 seconds |
Reconfiguration pending | The target node is expected to be a manager but the `ucp-reconcile` container has not been started yet. | 1 - 10 seconds |
The ucp-agent task is `state` | The `ucp-agent` task on the target node is not yet in a running state. This message is expected when the configuration has been updated or when a new node first joins the MKE cluster. This step may take longer than expected if the MKE images need to be pulled from Docker Hub on the affected node. | 1 - 10 seconds |
Unable to determine node state | The `ucp-reconcile` container on the target node has just started running and its state cannot yet be determined. | 1 - 10 seconds |
Unhealthy MKE Controller: node is unreachable | Other manager nodes of the cluster have not received a heartbeat message from the affected node within a predetermined timeout. This usually indicates a temporary or permanent interruption in the network link to that manager node. Ensure that the underlying networking infrastructure is operational, and contact support if the symptom persists. | Until resolved |
Unhealthy MKE Controller: unable to reach controller | The controller that we are currently communicating with is not reachable within a predetermined timeout. Refresh the node listing to see if the symptom persists. If the symptom appears intermittently, this could indicate latency spikes between manager nodes, which can lead to temporary loss in the availability of MKE itself. Ensure that the underlying networking infrastructure is operational, and contact support if the symptom persists. | Until resolved |
Unhealthy MKE Controller: Docker Swarm Cluster: Local node <ip> has status Pending | The Engine ID of an engine is not unique in the swarm. When a node first joins the cluster, it is added to the node inventory and discovered as Pending by Docker Swarm. The engine is "validated" if a `ucp-swarm-manager` container can connect to it via TLS and its Engine ID is unique in the swarm. If you see this issue repeatedly, make sure your engines do not have duplicate IDs. Use `docker info` to see the Engine ID, and refresh the ID by removing the /etc/docker/key.json file and restarting the daemon. | Until resolved |
If you detect problems in your MKE cluster, you can start your troubleshooting session by checking the logs of the individual MKE components. Only administrators can see information about MKE system containers.
To see the logs of the MKE system containers, navigate to the Containers page of the MKE web UI. By default, MKE system containers are hidden. Click the Settings icon and check Show system resources to view the MKE system containers.
Click on a container to see more details, such as its configurations and logs.
You can also check the logs of MKE system containers from the CLI. This is especially useful if the MKE web application is not working.
Get a client certificate bundle.
When using the Docker CLI client, you need to authenticate using client certificates. If your client certificate bundle is for a non-admin user, you do not have permission to see the MKE system containers.
Check the logs of the MKE system containers. By default, system containers are not displayed; use the -a flag to display them.
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8b77cfa87889 docker/ucp-agent:latest "/bin/ucp-agent re..." 3 hours ago Exited (0) 3 hours ago ucp-reconcile
b844cf76a7a5 docker/ucp-agent:latest "/bin/ucp-agent agent" 3 hours ago Up 3 hours 2376/tcp ucp-agent.tahzo3m4xjwhtsn6l3n8oc2bf.xx2hf6dg4zrphgvy2eohtpns9
de5b45871acb docker/ucp-controller:latest "/bin/controller s..." 3 hours ago Up 3 hours (unhealthy) 0.0.0.0:443->8080/tcp ucp-controller
...
Get the log from an MKE container by using the `docker logs <mke container ID>` command. For example, the following command emits the log for the ucp-controller container listed above.
$ docker logs de5b45871acb
{"level":"info","license_key":"PUagrRqOXhMH02UgxWYiKtg0kErLY8oLZf1GO4Pw8M6B","msg":"/v1.22/containers/ucp/ucp-controller/json",
"remote_addr":"192.168.10.1:59546","tags":["api","v1.22","get"],"time":"2016-04-25T23:49:27Z","type":"api","username":"dave.lauper"}
{"level":"info","license_key":"PUagrRqOXhMH02UgxWYiKtg0kErLY8oLZf1GO4Pw8M6B","msg":"/v1.22/containers/ucp/ucp-controller/logs",
"remote_addr":"192.168.10.1:59546","tags":["api","v1.22","get"],"time":"2016-04-25T23:49:27Z","type":"api","username":"dave.lauper"}
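Since each log entry is a JSON object, individual fields can be pulled out for quick triage. The sketch below extracts the username and msg fields with sed; the `log_fields` name is our own, and a jq filter would be more robust than this pattern, which only handles the flat, quoted fields shown above:

```shell
# Sketch: extract the username and msg fields from JSON log lines like
# the ones above. log_fields is an illustrative helper name; this sed
# pattern assumes msg appears before username on each line, as in the
# sample output.
log_fields() {
    sed -n 's/.*"msg":"\([^"]*\)".*"username":"\([^"]*\)".*/\2 \1/p'
}

# Usage: docker logs <mke container ID> 2>&1 | log_fields
```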
Before making any changes to MKE, download a support dump. This allows you to troubleshoot problems that were already occurring before you changed the MKE configuration.
You can then increase the MKE log level to debug, making it easier to understand the status of the MKE cluster. Changing the MKE log level restarts all MKE system components and introduces a small downtime window to MKE. Your applications will not be affected by this downtime.
To increase the MKE log level, navigate to the MKE web UI, go to the Admin Settings tab, and choose Logs.
Once you change the log level to Debug, the MKE containers restart. Now that the MKE components are creating more descriptive logs, you can download a support dump and use it to troubleshoot the component causing the problem.
Depending on the problem you're experiencing, it's more likely that you'll find related messages in the logs of specific components on manager nodes:

- the `ucp-reconcile` container
- the `ucp-controller` container
- the `ucp-auth-api` and `ucp-auth-store` containers

It's normal for the `ucp-reconcile` container to be in a stopped state. This container starts only when `ucp-agent` detects that a node needs to transition to a different state. The `ucp-reconcile` container is responsible for creating and removing containers, issuing certificates, and pulling missing images.
MKE automatically tries to heal itself by monitoring its internal components and trying to bring them to a healthy state.
In most cases, if a single MKE component is persistently in a failed state, you should be able to restore the cluster to a healthy state by removing the unhealthy node from the cluster and joining it again.
MKE persists configuration data on an etcd key-value store and RethinkDB database that are replicated on all manager nodes of the MKE cluster. These data stores are for internal use only and should not be used by other applications.
In this example we'll use `curl` to make requests to the key-value store REST API, and `jq` to process the responses. You can install these tools on an Ubuntu distribution by running:
sudo apt-get update && sudo apt-get install curl jq
Use a client bundle to authenticate your requests.
Use the REST API to access the cluster configurations. The $DOCKER_HOST and $DOCKER_CERT_PATH environment variables are set when you load the client bundle.
export KV_URL="https://$(echo $DOCKER_HOST | cut -f3 -d/ | cut -f1 -d:):12379"
curl -s \
--cert ${DOCKER_CERT_PATH}/cert.pem \
--key ${DOCKER_CERT_PATH}/key.pem \
--cacert ${DOCKER_CERT_PATH}/ca.pem \
${KV_URL}/v2/keys | jq "."
The containers running the key-value store include `etcdctl`, a command-line client for etcd. You can run it using the `docker exec` command.
The examples below assume you are logged in with SSH to an MKE manager node.
docker exec -it ucp-kv etcdctl \
--endpoint https://127.0.0.1:2379 \
--ca-file /etc/docker/ssl/ca.pem \
--cert-file /etc/docker/ssl/cert.pem \
--key-file /etc/docker/ssl/key.pem \
cluster-health
member 16c9ae1872e8b1f0 is healthy: got healthy result from https://192.168.122.64:12379
member c5a24cfdb4263e72 is healthy: got healthy result from https://192.168.122.196:12379
member ca3c1bb18f1b30bf is healthy: got healthy result from https://192.168.122.223:12379
cluster is healthy
On failure, the command exits with an error code and no output.
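That exit-code behavior makes the check easy to script. The wrapper below (the `kv_health` name is our own) treats a non-zero exit or empty output as unhealthy, so it can feed an alerting hook:

```shell
# Sketch: wrap a health-check command (such as the docker exec
# etcdctl cluster-health invocation above) so a non-zero exit or
# empty output is reported as unhealthy. kv_health is an illustrative
# helper name, not part of MKE.
kv_health() {
    # "$@" is the health-check command to run.
    if out=$("$@" 2>/dev/null) && [ -n "$out" ]; then
        echo "kv store healthy"
    else
        echo "kv store UNHEALTHY"
        return 1
    fi
}

# Usage on a manager node, mirroring the etcdctl command above:
#   kv_health docker exec ucp-kv etcdctl \
#       --endpoint https://127.0.0.1:2379 \
#       --ca-file /etc/docker/ssl/ca.pem \
#       --cert-file /etc/docker/ssl/cert.pem \
#       --key-file /etc/docker/ssl/key.pem \
#       cluster-health
```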
User and organization data for Docker Enterprise Edition is stored in a RethinkDB database that is replicated across all manager nodes in the MKE cluster.
Replication and failover of this database are typically handled automatically by MKE's own configuration management processes, but detailed database status and manual reconfiguration of database replication are available through a command-line tool included with MKE.
The examples below assume you are logged in with SSH to an MKE manager node.
# NODE_ADDRESS will be the IP address of this Docker Swarm manager node
NODE_ADDRESS=$(docker info --format '{{.Swarm.NodeAddr}}')
# VERSION will be your most recent version of the docker/ucp-auth image
VERSION=$(docker image ls --format '{{.Tag}}' docker/ucp-auth | head -n 1)
# This command will output detailed status of all servers and database tables
# in the RethinkDB cluster.
docker container run --rm -v ucp-auth-store-certs:/tls docker/ucp-auth:${VERSION} --db-addr=${NODE_ADDRESS}:12383 db-status
Server Status: [
{
"ID": "ffa9cd5a-3370-4ccd-a21f-d7437c90e900",
"Name": "ucp_auth_store_192_168_1_25",
"Network": {
"CanonicalAddresses": [
{
"Host": "192.168.1.25",
"Port": 12384
}
],
"TimeConnected": "2017-07-14T17:21:44.198Z"
}
}
]
...
# NODE_ADDRESS will be the IP address of this Docker Swarm manager node
NODE_ADDRESS=$(docker info --format '{{.Swarm.NodeAddr}}')
# NUM_MANAGERS will be the current number of manager nodes in the cluster
NUM_MANAGERS=$(docker node ls --filter role=manager -q | wc -l)
# VERSION will be your most recent version of the docker/ucp-auth image
VERSION=$(docker image ls --format '{{.Tag}}' docker/ucp-auth | head -n 1)
# This reconfigure-db command will repair the RethinkDB cluster to have a
# number of replicas equal to the number of manager nodes in the cluster.
docker container run --rm -v ucp-auth-store-certs:/tls docker/ucp-auth:${VERSION} --db-addr=${NODE_ADDRESS}:12383 --debug reconfigure-db --num-replicas ${NUM_MANAGERS}
time="2017-07-14T20:46:09Z" level=debug msg="Connecting to db ..."
time="2017-07-14T20:46:09Z" level=debug msg="connecting to DB Addrs: [192.168.1.25:12383]"
time="2017-07-14T20:46:09Z" level=debug msg="Reconfiguring number of replicas to 1"
time="2017-07-14T20:46:09Z" level=debug msg="(00/16) Reconfiguring Table Replication..."
time="2017-07-14T20:46:09Z" level=debug msg="(01/16) Reconfigured Replication of Table \"grant_objects\""
...
Loss of Quorum in RethinkDB Tables
When there is a loss of quorum in any of the RethinkDB tables, run the reconfigure-db command with the --emergency-repair flag.
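Based on the reconfigure-db example above, an emergency repair would look like the sketch below. This is a reconstruction, not a verified command: the IP address, manager count, and image tag defaults are placeholders, so review the printed command and check the flag against your MKE version before running it.

```shell
# Sketch: assemble the emergency-repair invocation from the same
# variables used in the reconfigure-db example above. All defaults
# here are placeholders.
NODE_ADDRESS="${NODE_ADDRESS:-192.168.1.25}"
NUM_MANAGERS="${NUM_MANAGERS:-3}"
VERSION="${VERSION:-latest}"

repair_cmd() {
    # Print the command so it can be reviewed before execution.
    echo "docker container run --rm -v ucp-auth-store-certs:/tls" \
         "docker/ucp-auth:${VERSION} --db-addr=${NODE_ADDRESS}:12383" \
         "--debug reconfigure-db --num-replicas ${NUM_MANAGERS}" \
         "--emergency-repair"
}

repair_cmd
```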