Monitor and troubleshoot

Monitoring the cluster status

You can monitor the status of MKE using the web UI or the CLI. You can also use the _ping endpoint to build monitoring automation.

The first place to check the status of MKE is the MKE web UI, since it shows warnings for situations that require your immediate attention. Administrators might see more warnings than regular users.

You can also navigate to the Nodes page to see whether all the nodes managed by MKE are healthy.

Each node has a status message explaining any problems with the node. For example, the status message might report that a Windows worker node is down. Click the node to get more information about its status. In the details pane, click Actions and select Agent logs to see the log entries from the node.

Use the CLI to monitor the status of a cluster

You can also monitor the status of an MKE cluster using the Docker CLI client. Download an MKE client certificate bundle and then run:

docker node ls

As a rule of thumb, if the status message starts with [Pending], then the current state is transient and the node is expected to correct itself back into a healthy state.

Monitoring automation

You can use the https://<mke-manager-url>/_ping endpoint to check the health of a single MKE manager node. When you access this endpoint, the MKE manager validates that all its internal components are working and returns one of the following HTTP status codes:

  • 200, if all components are healthy
  • 500, if one or more components are not healthy

If an administrator client certificate is used as a TLS client certificate for the _ping endpoint, a detailed error message is returned if any component is unhealthy.

If you access the _ping endpoint through a load balancer, you have no way of knowing which MKE manager node is unhealthy, since any manager node might be serving your request. Make sure you connect directly to the URL of a manager node, and not to a load balancer. In addition, be aware that pinging the endpoint with a HEAD request results in a 404 error code; use GET instead.
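
As a quick check, you can query the endpoint with curl. The manager address and certificate paths below are placeholders, and the client certificate is only needed if you want the detailed error message described above:

# Returns 200 if all components on this manager are healthy, 500 otherwise
curl -sk -o /dev/null -w "%{http_code}\n" https://<mke-manager-url>/_ping

# With an admin client certificate, the response body describes any unhealthy component
curl -sk --cert cert.pem --key key.pem https://<mke-manager-url>/_ping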

Monitoring vulnerability counts

For deployments with a subscription, MKE displays image vulnerability counts from the MSR image scanning feature. MKE displays vulnerability counts for containers, Swarm services, pods, and images.

Monitoring disk usage

Web UI disk usage metrics, including free space, reflect only the Docker-managed portion of the filesystem: /var/lib/docker. To monitor the total space available on each filesystem of an MKE worker or manager node, you must deploy a third-party monitoring solution that monitors the operating system.

Troubleshooting MKE node states

There are several cases in the lifecycle of MKE when a node is actively transitioning from one state to another, such as when a new node is joining the cluster or during node promotion and demotion. In these cases, MKE reports the current step of the transition as a node message. You can view the state of each individual node by monitoring the cluster status.
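
MKE surfaces these transition messages in the web UI, but you can also watch the swarm-level node state from the CLI. A minimal sketch using the standard Docker CLI (the node name is a placeholder, and the Message field may be empty when the node is healthy):

# List all nodes with their swarm-level status and availability
docker node ls --format 'table {{.Hostname}}\t{{.Status}}\t{{.Availability}}'

# Show the state and status message that Swarm reports for a specific node
docker node inspect --format '{{.Status.State}}: {{.Status.Message}}' <node-name>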

MKE node states

The following list describes the node states that may be reported for an MKE node, along with an explanation of each state and the typical duration of the step.

Message: Completing node registration
Description: Waiting for the node to appear in the KV node inventory. This is expected to occur when a node first joins the MKE swarm.
Typical step duration: 5 - 30 seconds

Message: heartbeat failure
Description: The node has not contacted any swarm managers in the last 10 seconds. Check the Swarm state in docker info on the node: inactive means the node has been removed from the swarm with docker swarm leave; pending means dockerd on the node has been attempting to contact a manager since dockerd started, so confirm that your network security policy allows TCP port 2377 from the node to the managers; error means an error prevented swarm from starting on the node, so check the Docker daemon logs on the node.
Typical step duration: Until resolved

Message: Node is being reconfigured
Description: The ucp-reconcile container is currently converging the current state of the node to the desired state. Depending on the current node state, this process may involve issuing certificates, pulling missing images, and starting containers.
Typical step duration: 1 - 60 seconds

Message: Reconfiguration pending
Description: The target node is expected to be a manager, but the ucp-reconcile container has not been started yet.
Typical step duration: 1 - 10 seconds

Message: The ucp-agent task is [state]
Description: The ucp-agent task on the target node is not yet in a running state. This message is expected when the configuration has been updated or when a new node first joins the MKE cluster. This step may take longer than expected if the MKE images need to be pulled from Docker Hub on the affected node.
Typical step duration: 1 - 10 seconds

Message: Unable to determine node state
Description: The ucp-reconcile container on the target node has just started running, and its state cannot be determined yet.
Typical step duration: 1 - 10 seconds

Message: Unhealthy MKE Controller: node is unreachable
Description: Other manager nodes in the cluster have not received a heartbeat message from the affected node within a predetermined timeout. This usually indicates a temporary or permanent interruption in the network link to that manager node. Ensure that the underlying networking infrastructure is operational, and contact support if the symptom persists.
Typical step duration: Until resolved

Message: Unhealthy MKE Controller: unable to reach controller
Description: The controller that you are currently communicating with is not reachable within a predetermined timeout. Refresh the node listing to see if the symptom persists. If the symptom appears intermittently, this could indicate latency spikes between manager nodes, which can lead to temporary loss in the availability of MKE itself. Ensure that the underlying networking infrastructure is operational, and contact support if the symptom persists.
Typical step duration: Until resolved

Message: Unhealthy MKE Controller: Docker Swarm Cluster: Local node <ip> has status Pending
Description: The Engine ID of an engine is not unique in the swarm. When a node first joins the cluster, it is added to the node inventory and discovered as Pending by Docker Swarm. The engine is validated if a ucp-swarm-manager container can connect to it over TLS and its Engine ID is unique in the swarm. If you see this issue repeatedly, make sure that your engines do not have duplicate IDs. Use docker info to see the Engine ID, and refresh the ID by removing the /etc/docker/key.json file and restarting the daemon.
Typical step duration: Until resolved
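
For the heartbeat failure case above, a quick way to check the swarm state and manager connectivity from the affected node is sketched below (the manager IP is a placeholder, and nc must be installed on the node):

# Show only the swarm membership state reported by the local daemon
docker info --format '{{.Swarm.LocalNodeState}}'

# Verify that TCP port 2377 on a manager is reachable from this node
nc -zv <manager-ip> 2377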

Troubleshooting your cluster using logs

If you detect problems in your MKE cluster, you can start your troubleshooting session by checking the logs of the individual MKE components. Only administrators can see information about MKE system containers.

Check the logs from the web UI

To see the logs of the MKE system containers, navigate to the Containers page of the MKE web UI. By default, MKE system containers are hidden. Click the Settings icon and check Show system resources to view the MKE system containers.

Click on a container to see more details, such as its configurations and logs.

Check the logs from the CLI

You can also check the logs of MKE system containers from the CLI. This is especially useful if the MKE web UI is not working.

  1. Get a client certificate bundle.

    When using the Docker CLI client, you need to authenticate using client certificates. If your client certificate bundle is for a non-admin user, you do not have permission to see the MKE system containers.

  2. Check the logs of MKE system containers. By default, stopped containers are not displayed. Use the -a flag to include them.

    $ docker ps -a
    CONTAINER ID        IMAGE                                     COMMAND                  CREATED             STATUS                     PORTS                                                                             NAMES
    8b77cfa87889        docker/ucp-agent:latest             "/bin/ucp-agent re..."   3 hours ago         Exited (0) 3 hours ago                                                                                       ucp-reconcile
    b844cf76a7a5        docker/ucp-agent:latest             "/bin/ucp-agent agent"   3 hours ago         Up 3 hours                 2376/tcp                                                                          ucp-agent.tahzo3m4xjwhtsn6l3n8oc2bf.xx2hf6dg4zrphgvy2eohtpns9
    de5b45871acb        docker/ucp-controller:latest        "/bin/controller s..."   3 hours ago         Up 3 hours (unhealthy)     0.0.0.0:443->8080/tcp                                                             ucp-controller
    ...
    
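    If the full listing is noisy, you can narrow it to the MKE system containers by filtering on the ucp- name prefix (purely a convenience):

    $ docker ps -a --filter name=ucp- --format 'table {{.Names}}\t{{.Status}}'
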
  3. Get the logs from an MKE container by using the docker logs <mke container ID> command. For example, the following command shows the logs for the ucp-controller container listed above.

    $ docker logs de5b45871acb
    
    {"level":"info","license_key":"PUagrRqOXhMH02UgxWYiKtg0kErLY8oLZf1GO4Pw8M6B","msg":"/v1.22/containers/ucp/ucp-controller/json",
    "remote_addr":"192.168.10.1:59546","tags":["api","v1.22","get"],"time":"2016-04-25T23:49:27Z","type":"api","username":"dave.lauper"}
    {"level":"info","license_key":"PUagrRqOXhMH02UgxWYiKtg0kErLY8oLZf1GO4Pw8M6B","msg":"/v1.22/containers/ucp/ucp-controller/logs",
    "remote_addr":"192.168.10.1:59546","tags":["api","v1.22","get"],"time":"2016-04-25T23:49:27Z","type":"api","username":"dave.lauper"}
    
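    Because the controller writes JSON-formatted log lines, you can filter them with a tool such as jq, for example to show only error-level entries (this assumes jq is installed and that every emitted line is valid JSON):

    $ docker logs ucp-controller 2>&1 | jq -r 'select(.level == "error") | .msg'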

Get a support dump

Before making any changes to MKE, download a support dump. This allows you to troubleshoot problems that were already occurring before you changed the MKE configuration.

You can then increase the MKE log level to debug, making it easier to understand the status of the MKE cluster. Changing the MKE log level restarts all MKE system components and introduces a small downtime window to MKE. Your applications will not be affected by this downtime.

To increase the MKE log level, navigate to the MKE web UI, go to the Admin Settings tab, and choose Logs.

Once you change the log level to Debug, the MKE containers restart. Now that the MKE components are creating more descriptive logs, you can download a support dump and use it to troubleshoot the component causing the problem.
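
If you prefer the CLI to the web UI, support dumps can also be generated with the support command of the MKE bootstrapper image. The exact image name and tag depend on your installation, so treat the following as a sketch and verify it against the CLI reference for your MKE version:

# Run on a manager node; writes the support dump to the current directory
# (mirantis/ucp:<version> is an assumption; use the image that matches your cluster)
docker container run --rm \
  --name ucp \
  -v /var/run/docker.sock:/var/run/docker.sock \
  mirantis/ucp:<version> \
  support > support-dump.tgz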

Depending on the problem you’re experiencing, it’s more likely that you’ll find related messages in the logs of specific components on manager nodes:

  • If the problem occurs after a node was added or removed, check the logs of the ucp-reconcile container.
  • If the problem occurs in the normal state of the system, check the logs of the ucp-controller container.
  • If you are able to visit the MKE web UI but unable to log in, check the logs of the ucp-auth-api and ucp-auth-store containers.

It’s normal for the ucp-reconcile container to be in a stopped state. This container starts only when the ucp-agent detects that a node needs to transition to a different state. The ucp-reconcile container is responsible for creating and removing containers, issuing certificates, and pulling missing images.
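
To confirm this, you can check the last run of the ucp-reconcile container directly on the affected node:

# An Exited (0) status means the last reconciliation completed successfully
docker ps -a --filter name=ucp-reconcile --format 'table {{.Names}}\t{{.Status}}'

# Review what the last reconciliation actually did
docker logs ucp-reconcile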

Troubleshooting cluster configurations

MKE automatically tries to heal itself by monitoring its internal components and attempting to bring them back to a healthy state.

In most cases, if a single MKE component is persistently in a failed state, you can restore the cluster to a healthy state by removing the unhealthy node from the cluster and then joining it again.
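
For a worker node, the swarm-level steps typically look like the sketch below; manager nodes must be demoted first, and you should follow the MKE node management documentation for the full procedure (node names, addresses, and the join token are placeholders):

# On the affected node: leave the swarm
docker swarm leave

# On a healthy manager: remove the stale node entry
docker node rm <node-name>

# On the affected node: rejoin using the worker join token
# (obtain it on a manager with: docker swarm join-token worker)
docker swarm join --token <worker-join-token> <manager-ip>:2377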

Troubleshoot the etcd key-value store

MKE persists configuration data in an etcd key-value store and a RethinkDB database, both of which are replicated on all manager nodes of the MKE cluster. These data stores are for internal use only and should not be used by other applications.

With the HTTP API

In this example we’ll use curl for making requests to the key-value store REST API, and jq to process the responses.

You can install these tools on an Ubuntu distribution by running:

sudo apt-get update && sudo apt-get install curl jq

  1. Use a client bundle to authenticate your requests.

  2. Use the REST API to access the cluster configurations. The $DOCKER_HOST and $DOCKER_CERT_PATH environment variables are set when using the client bundle.

    export KV_URL="https://$(echo $DOCKER_HOST | cut -f3 -d/ | cut -f1 -d:):12379"
    
    curl -s \
         --cert ${DOCKER_CERT_PATH}/cert.pem \
         --key ${DOCKER_CERT_PATH}/key.pem \
         --cacert ${DOCKER_CERT_PATH}/ca.pem \
         ${KV_URL}/v2/keys | jq "."
    
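    You can also query the etcd health endpoint in the same way; it is expected to report a healthy status when the local member is up (this uses the same variables and certificates as above):

    curl -s \
         --cert ${DOCKER_CERT_PATH}/cert.pem \
         --key ${DOCKER_CERT_PATH}/key.pem \
         --cacert ${DOCKER_CERT_PATH}/ca.pem \
         ${KV_URL}/health | jq "."
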

With the CLI client

The containers running the key-value store include etcdctl, a command-line client for etcd. You can run it using the docker exec command.

The examples below assume you are logged in to an MKE manager node with SSH.

docker exec -it ucp-kv etcdctl \
        --endpoint https://127.0.0.1:2379 \
        --ca-file /etc/docker/ssl/ca.pem \
        --cert-file /etc/docker/ssl/cert.pem \
        --key-file /etc/docker/ssl/key.pem \
        cluster-health

member 16c9ae1872e8b1f0 is healthy: got healthy result from https://192.168.122.64:12379
member c5a24cfdb4263e72 is healthy: got healthy result from https://192.168.122.196:12379
member ca3c1bb18f1b30bf is healthy: got healthy result from https://192.168.122.223:12379
cluster is healthy

On failure, the command exits with an error code and no output.
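
To see which members make up the etcd cluster, and to match an unhealthy member to its address, you can use the member list subcommand with the same connection options:

docker exec -it ucp-kv etcdctl \
        --endpoint https://127.0.0.1:2379 \
        --ca-file /etc/docker/ssl/ca.pem \
        --cert-file /etc/docker/ssl/cert.pem \
        --key-file /etc/docker/ssl/key.pem \
        member list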

RethinkDB Database

User and organization data for MKE is stored in a RethinkDB database, which is replicated across all manager nodes in the MKE cluster.

Replication and failover of this database are typically handled automatically by MKE's own configuration management processes. However, detailed database status and manual reconfiguration of database replication are available through a command-line tool included with MKE.

The examples below assume you are logged in to an MKE manager node with SSH.

Check the status of the database

# NODE_ADDRESS will be the IP address of this Docker Swarm manager node
NODE_ADDRESS=$(docker info --format '{{.Swarm.NodeAddr}}')
# VERSION will be your most recent version of the docker/ucp-auth image
VERSION=$(docker image ls --format '{{.Tag}}' docker/ucp-auth | head -n 1)
# This command will output detailed status of all servers and database tables
# in the RethinkDB cluster.
docker container run --rm -v ucp-auth-store-certs:/tls docker/ucp-auth:${VERSION} --db-addr=${NODE_ADDRESS}:12383 db-status

Server Status: [
  {
    "ID": "ffa9cd5a-3370-4ccd-a21f-d7437c90e900",
    "Name": "ucp_auth_store_192_168_1_25",
    "Network": {
      "CanonicalAddresses": [
        {
          "Host": "192.168.1.25",
          "Port": 12384
        }
      ],
      "TimeConnected": "2017-07-14T17:21:44.198Z"
    }
  }
]
...

Manually reconfigure database replication

# NODE_ADDRESS will be the IP address of this Docker Swarm manager node
NODE_ADDRESS=$(docker info --format '{{.Swarm.NodeAddr}}')
# NUM_MANAGERS will be the current number of manager nodes in the cluster
NUM_MANAGERS=$(docker node ls --filter role=manager -q | wc -l)
# VERSION will be your most recent version of the docker/ucp-auth image
VERSION=$(docker image ls --format '{{.Tag}}' docker/ucp-auth | head -n 1)
# This reconfigure-db command will repair the RethinkDB cluster to have a
# number of replicas equal to the number of manager nodes in the cluster.
docker container run --rm -v ucp-auth-store-certs:/tls docker/ucp-auth:${VERSION} --db-addr=${NODE_ADDRESS}:12383 --debug reconfigure-db --num-replicas ${NUM_MANAGERS}

time="2017-07-14T20:46:09Z" level=debug msg="Connecting to db ..."
time="2017-07-14T20:46:09Z" level=debug msg="connecting to DB Addrs: [192.168.1.25:12383]"
time="2017-07-14T20:46:09Z" level=debug msg="Reconfiguring number of replicas to 1"
time="2017-07-14T20:46:09Z" level=debug msg="(00/16) Reconfiguring Table Replication..."
time="2017-07-14T20:46:09Z" level=debug msg="(01/16) Reconfigured Replication of Table \"grant_objects\""
...

Loss of Quorum in RethinkDB Tables

When there is loss of quorum in any of the RethinkDB tables, run the reconfigure-db command with the --emergency-repair flag.
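
Building on the reconfigure-db example above, the invocation typically looks like the following sketch; the exact combination of flags is an assumption, so verify it against the tool's help output for your MKE version:

# NODE_ADDRESS, NUM_MANAGERS, and VERSION are set as in the previous examples
docker container run --rm -v ucp-auth-store-certs:/tls docker/ucp-auth:${VERSION} --db-addr=${NODE_ADDRESS}:12383 --debug reconfigure-db --num-replicas ${NUM_MANAGERS} --emergency-repair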