Troubleshoot an MKE cluster

Troubleshooting MKE node states

There are several cases in the lifecycle of MKE when a node is actively transitioning from one state to another, such as when a new node is joining the cluster or during node promotion and demotion. In these cases, the current step of the transition will be reported by MKE as a node message. You can view the state of each individual node by monitoring the cluster status.
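
From a manager node, or from a shell configured with an admin client bundle, the Swarm view of node status can also be listed with docker node ls. The sketch below shows one way to flag nodes that need attention. The helper name flag_unready_nodes is hypothetical, and the Swarm status is only part of what MKE reports, so the node messages shown by MKE remain the place to look for the transition step itself.

```shell
# Hypothetical helper: reads "hostname status availability" lines on stdin
# and prints the hostnames whose Swarm status is anything other than Ready.
flag_unready_nodes() {
    awk '$2 != "Ready" { print $1 }'
}

# Usage (requires a manager node or an admin client bundle):
#   docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}' | flag_unready_nodes
```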

MKE node states

The following table lists all possible node states that may be reported for an MKE node, their explanation, and the expected duration of a given step.

Message

Description

Typical step duration

Completing node registration

Waiting for the node to appear in KV node inventory. This is expected to occur when a node first joins the MKE swarm.

5 - 30 seconds

heartbeat failure

The node has not contacted any swarm managers in the last 10 seconds. Check the Swarm state in the docker info output on the node: inactive means the node has been removed from the swarm with docker swarm leave; pending means dockerd on the node has been attempting to contact a manager since it started, in which case confirm that the network security policy allows TCP port 2377 from the node to the managers; error means an error prevented swarm from starting on the node, in which case check the Docker daemon logs on the node.

Until resolved

Node is being reconfigured

The ucp-reconcile container is currently converging the current state of the node to the desired state. This process may involve issuing certificates, pulling missing images, and starting containers, depending on the current node state.

1 - 60 seconds

Reconfiguration pending

The target node is expected to be a manager but the ucp-reconcile container has not been started yet.

1 - 10 seconds

The ucp-agent task is <state>

The ucp-agent task on the target node is not yet in a running state. This message is expected when the configuration has been updated or when a new node first joins the MKE cluster. This step may take longer than expected if the MKE images need to be pulled from Docker Hub on the affected node.

1 - 10 seconds

Unable to determine node state

The ucp-reconcile container on the target node has just started running, and its state cannot yet be determined.

1 - 10 seconds

Unhealthy MKE Controller: node is unreachable

Other manager nodes of the cluster have not received a heartbeat message from the affected node within a predetermined timeout. This usually indicates that there’s either a temporary or permanent interruption in the network link to that manager node. Ensure the underlying networking infrastructure is operational, and contact support if the symptom persists.

Until resolved

Unhealthy MKE Controller: unable to reach controller

The controller that we are currently communicating with is not reachable within a predetermined timeout. Please refresh the node listing to see if the symptom persists. If the symptom appears intermittently, this could indicate latency spikes between manager nodes, which can lead to temporary loss in the availability of MKE itself. Please ensure the underlying networking infrastructure is operational, and contact support if the symptom persists.

Until resolved

Unhealthy MKE Controller: Docker Swarm Cluster: Local node <ip> has status Pending

The Engine ID of MCR is not unique in the swarm. When a node first joins the cluster, it’s added to the node inventory and discovered as Pending by Docker Swarm. MCR is “validated” if a ucp-swarm-manager container can connect to it via TLS, and if its Engine ID is unique in the swarm. If you see this issue repeatedly, make sure that your MCR doesn’t have duplicate IDs. Use docker info to see the Engine ID. Refresh the ID by removing the /etc/docker/key.json file and restarting the daemon.

Until resolved
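
For the heartbeat failure message in the table above, the Swarm state can be read directly on the affected node. The following is a minimal sketch that maps each state to the follow-up described in the table. The helper name explain_swarm_state is hypothetical, and the active branch is an addition that the table does not cover.

```shell
# Hypothetical helper: maps the Swarm state reported by docker info to the
# follow-up action described for the "heartbeat failure" message.
explain_swarm_state() {
    case "$1" in
        inactive) echo "node left the swarm (docker swarm leave)" ;;
        pending)  echo "dockerd cannot reach a manager; check TCP port 2377" ;;
        error)    echo "swarm failed to start; check the Docker daemon logs" ;;
        active)   echo "swarm is running on this node" ;;  # not covered by the table
        *)        echo "unrecognized state: $1" ;;
    esac
}

# Usage, on the affected node:
#   explain_swarm_state "$(docker info --format '{{.Swarm.LocalNodeState}}')"
```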

Troubleshooting your cluster using logs

If you detect problems in your MKE cluster, you can start your troubleshooting session by checking the logs of the individual MKE components. Only administrators can see information about MKE system containers.

Check the logs from the web UI

To see the logs of the MKE system containers, navigate to the Containers page of the MKE web UI. By default, MKE system containers are hidden. Click the Settings icon and check Show system resources to view the MKE system containers.

Click on a container to see more details, such as its configurations and logs.

Check the logs from the CLI

You can also check the logs of MKE system containers from the CLI. This is especially useful if the MKE web application is not working.

  1. Get a client certificate bundle.

    When using the Docker CLI client, you need to authenticate using client certificates. If your client certificate bundle is for a non-admin user, you do not have permission to see the MKE system containers.

  2. Check the logs of MKE system containers. By default, system containers aren’t displayed. Use the -a flag to display them.

    $ docker ps -a
    CONTAINER ID        IMAGE                                     COMMAND                  CREATED             STATUS                     PORTS                                                                             NAMES
    8b77cfa87889        mirantis/ucp-agent:latest             "/bin/ucp-agent re..."   3 hours ago         Exited (0) 3 hours ago                                                                                       ucp-reconcile
    b844cf76a7a5        mirantis/ucp-agent:latest             "/bin/ucp-agent agent"   3 hours ago         Up 3 hours                 2376/tcp                                                                          ucp-agent.tahzo3m4xjwhtsn6l3n8oc2bf.xx2hf6dg4zrphgvy2eohtpns9
    de5b45871acb        mirantis/ucp-controller:latest        "/bin/controller s..."   3 hours ago         Up 3 hours (unhealthy)     0.0.0.0:443->8080/tcp                                                             ucp-controller
    ...
    
  3. Get the log from an MKE container by using the docker logs <mke container ID> command. For example, the following command emits the log for the ucp-controller container listed above.

    $ docker logs de5b45871acb
    
    {"level":"info","license_key":"PUagrRqOXhMH02UgxWYiKtg0kErLY8oLZf1GO4Pw8M6B","msg":"/v1.22/containers/ucp/ucp-controller/json",
    "remote_addr":"192.168.10.1:59546","tags":["api","v1.22","get"],"time":"2016-04-25T23:49:27Z","type":"api","username":"dave.lauper"}
    {"level":"info","license_key":"PUagrRqOXhMH02UgxWYiKtg0kErLY8oLZf1GO4Pw8M6B","msg":"/v1.22/containers/ucp/ucp-controller/logs",
    "remote_addr":"192.168.10.1:59546","tags":["api","v1.22","get"],"time":"2016-04-25T23:49:27Z","type":"api","username":"dave.lauper"}
    

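Because the controller log consists of JSON entries with a level field, it can be narrowed down with standard text tools. The sketch below assumes one JSON object per line, formatted as in the sample output above. The helper name filter_log_level is hypothetical.

```shell
# Hypothetical helper: keeps only the log lines recorded at the given level.
# Assumes each entry is a single line containing "level":"<level>".
filter_log_level() {
    grep -F "\"level\":\"$1\""
}

# Usage: docker logs writes to both stdout and stderr, so merge them first.
#   docker logs --since 1h ucp-controller 2>&1 | filter_log_level error
```
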
Get a support dump

Before making any changes to MKE, download a support dump. This allows you to troubleshoot problems which were already happening before changing MKE configurations.

You can then increase the MKE log level to debug, making it easier to understand the status of the MKE cluster. Changing the MKE log level restarts all MKE system components and introduces a small downtime window to MKE. Your applications will not be affected by this downtime.

To increase the MKE log level, navigate to the MKE web UI, go to the Admin Settings tab, and choose Logs.

Once you change the log level to Debug, the MKE containers restart. Now that the MKE components are creating more descriptive logs, you can download a support dump and use it to troubleshoot the component causing the problem.

Depending on the problem you’re experiencing, it’s more likely that you’ll find related messages in the logs of specific components on manager nodes:

  • If the problem occurs after a node was added or removed, check the logs of the ucp-reconcile container.

  • If the problem occurs in the normal state of the system, check the logs of the ucp-controller container.

  • If you are able to visit the MKE web UI but unable to log in, check the logs of the ucp-auth-api and ucp-auth-store containers.

It’s normal for the ucp-reconcile container to be in a stopped state. This container starts only when the ucp-agent detects that a node needs to transition to a different state. The ucp-reconcile container is responsible for creating and removing containers, issuing certificates, and pulling missing images.
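
The guidance above can be sketched as a small lookup that prints recent log lines from the suggested containers on a manager node. The function names and symptom keywords are hypothetical. Containers are resolved with a name filter because some of them may carry a task suffix in their names.

```shell
# Hypothetical helper: maps a symptom keyword to the containers whose logs
# the list above suggests checking first.
mke_log_targets() {
    case "$1" in
        node-change) echo "ucp-reconcile" ;;
        runtime)     echo "ucp-controller" ;;
        login)       echo "ucp-auth-api ucp-auth-store" ;;
        *)           echo "unknown symptom: $1" >&2; return 1 ;;
    esac
}

# Prints the last 50 log lines of every container matching the targets.
tail_mke_logs() {
    for name in $(mke_log_targets "$1"); do
        for id in $(docker ps -a -q --filter "name=$name"); do
            echo "== $name ($id) =="
            docker logs --tail 50 "$id" 2>&1
        done
    done
}

# Usage, on a manager node:
#   tail_mke_logs login
```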

Troubleshooting cluster configurations

MKE automatically tries to heal itself by monitoring its internal components and trying to bring them to a healthy state.

In most cases, if a single MKE component is in a failed state persistently, you should be able to restore the cluster to a healthy state by removing the unhealthy node from the cluster and joining it again.
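
At the Swarm level, removing and rejoining a node might look like the following sketch, which uses only standard docker node and docker swarm subcommands. The function names are hypothetical, and the exact procedure for your MKE version may differ, so treat this as an outline rather than a definitive runbook.

```shell
# DOCKER can be overridden (for example DOCKER=echo) to preview the commands.
DOCKER="${DOCKER:-docker}"

# Run on a healthy manager: remove the unhealthy node from the cluster.
# A manager node must be demoted before it can be removed. If the node is
# still listed as Ready, run "docker swarm leave" on that node first.
remove_node() {
    node="$1"; role="$2"
    if [ "$role" = "manager" ]; then
        $DOCKER node demote "$node" || return 1
    fi
    $DOCKER node rm "$node"
}

# Run on a healthy manager: print the join command for the given role
# (worker or manager), then run the printed command on the node to rejoin it.
print_join_command() {
    $DOCKER swarm join-token "$1"
}
```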

Troubleshoot the etcd key-value store

MKE persists configuration data in an etcd key-value store and a RethinkDB database, both of which are replicated on all manager nodes of the MKE cluster. These data stores are for internal use only and should not be used by other applications.

With the HTTP API

In this example we’ll use curl for making requests to the key-value store REST API, and jq to process the responses.

You can install these tools on an Ubuntu distribution by running:

sudo apt-get update && sudo apt-get install curl jq

  1. Use a client bundle to authenticate your requests.

  2. Use the REST API to access the cluster configurations. The $DOCKER_HOST and $DOCKER_CERT_PATH environment variables are set when using the client bundle.

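    # Derive the key-value store URL from the host part of DOCKER_HOST.
    export KV_URL="https://$(echo "$DOCKER_HOST" | cut -f3 -d/ | cut -f1 -d:):12379"
    
    # Query the key listing, authenticating with the client bundle certificates.
    curl -s \
         --cert "${DOCKER_CERT_PATH}/cert.pem" \
         --key "${DOCKER_CERT_PATH}/key.pem" \
         --cacert "${DOCKER_CERT_PATH}/ca.pem" \
         "${KV_URL}/v2/keys" | jq "."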
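
To make the KV_URL derivation easier to follow, the sketch below isolates it in a function. It assumes DOCKER_HOST has the form tcp://<host>:<port>. The function name and the address in the usage line are illustrative only.

```shell
# Hypothetical helper: extracts the host from a tcp://<host>:<port> value
# and builds the key-value store URL on port 12379.
kv_url_from_docker_host() {
    # Field 3 when splitting on "/" is "<host>:<port>"; field 1 of that
    # when splitting on ":" is the host alone.
    host=$(echo "$1" | cut -f3 -d/ | cut -f1 -d:)
    echo "https://${host}:12379"
}

# Usage:
#   kv_url_from_docker_host "$DOCKER_HOST"
```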
    

With the CLI client

The containers running the key-value store include etcdctl, a command line client for etcd. You can run it using the docker exec command.

The examples below assume you are logged in with ssh into an MKE manager node.

docker exec -it ucp-kv etcdctl \
        --endpoint https://127.0.0.1:2379 \
        --ca-file /etc/docker/ssl/ca.pem \
        --cert-file /etc/docker/ssl/cert.pem \
        --key-file /etc/docker/ssl/key.pem \
        cluster-health

member 16c9ae1872e8b1f0 is healthy: got healthy result from https://192.168.122.64:12379
member c5a24cfdb4263e72 is healthy: got healthy result from https://192.168.122.196:12379
member ca3c1bb18f1b30bf is healthy: got healthy result from https://192.168.122.223:12379
cluster is healthy

On failure, the command exits with an error code and no output.
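
When the health check is scripted, its output can be evaluated along with the exit code. The following sketch assumes output in the format shown above. The helper names are hypothetical, and the command should be run without the -t option when piping, since a TTY adds carriage returns to each line.

```shell
# Hypothetical helper: succeeds only if the output read on stdin contains
# the "cluster is healthy" summary line.
etcd_cluster_healthy() {
    grep -q '^cluster is healthy'
}

# Hypothetical helper: counts the members reported as healthy.
count_healthy_members() {
    grep -c '^member .* is healthy'
}

# Usage, on a manager node (same etcdctl options as above, without -it):
#   docker exec ucp-kv etcdctl --endpoint https://127.0.0.1:2379 \
#       --ca-file /etc/docker/ssl/ca.pem --cert-file /etc/docker/ssl/cert.pem \
#       --key-file /etc/docker/ssl/key.pem cluster-health | count_healthy_members
```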

RethinkDB Database

User and organization data for MKE is stored in a RethinkDB database which is replicated across all manager nodes in the MKE cluster.

Replication and failover of this database are typically handled automatically by MKE’s own configuration management processes, but a command line tool included with MKE provides detailed database status and manual reconfiguration of database replication.

The examples below assume you are logged in with ssh into an MKE manager node.

Check the status of the database

# NODE_ADDRESS will be the IP address of this Docker Swarm manager node
NODE_ADDRESS=$(docker info --format '{{.Swarm.NodeAddr}}')
# VERSION will be your most recent version of mirantis/ucp-auth image
VERSION=$(docker image ls --format '{{.Tag}}' mirantis/ucp-auth | head -n 1)
# This command will output detailed status of all servers and database tables
# in the RethinkDB cluster.
docker container run --rm -v ucp-auth-store-certs:/tls mirantis/ucp-auth:${VERSION} --db-addr=${NODE_ADDRESS}:12383 db-status

Server Status: [
  {
    "ID": "ffa9cd5a-3370-4ccd-a21f-d7437c90e900",
    "Name": "ucp_auth_store_192_168_1_25",
    "Network": {
      "CanonicalAddresses": [
        {
          "Host": "192.168.1.25",
          "Port": 12384
        }
      ],
      "TimeConnected": "2017-07-14T17:21:44.198Z"
    }
  }
]
...

Manually reconfigure database replication

# NODE_ADDRESS will be the IP address of this Docker Swarm manager node
NODE_ADDRESS=$(docker info --format '{{.Swarm.NodeAddr}}')
# NUM_MANAGERS will be the current number of manager nodes in the cluster
NUM_MANAGERS=$(docker node ls --filter role=manager -q | wc -l)
# VERSION will be your most recent version of the mirantis/ucp-auth image
VERSION=$(docker image ls --format '{{.Tag}}' mirantis/ucp-auth | head -n 1)
# This reconfigure-db command will repair the RethinkDB cluster to have a
# number of replicas equal to the number of manager nodes in the cluster.
docker container run --rm -v ucp-auth-store-certs:/tls mirantis/ucp-auth:${VERSION} --db-addr=${NODE_ADDRESS}:12383 --debug reconfigure-db --num-replicas ${NUM_MANAGERS}

time="2017-07-14T20:46:09Z" level=debug msg="Connecting to db ..."
time="2017-07-14T20:46:09Z" level=debug msg="connecting to DB Addrs: [192.168.1.25:12383]"
time="2017-07-14T20:46:09Z" level=debug msg="Reconfiguring number of replicas to 1"
time="2017-07-14T20:46:09Z" level=debug msg="(00/16) Reconfiguring Table Replication..."
time="2017-07-14T20:46:09Z" level=debug msg="(01/16) Reconfigured Replication of Table \"grant_objects\""
...

Loss of Quorum in RethinkDB Tables

When there is loss of quorum in any of the RethinkDB tables, run the reconfigure-db command with the --emergency-repair flag.
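
A minimal sketch of such an invocation follows, reusing the reconfigure-db command from the previous section. The wrapper name emergency_repair_db is hypothetical, and placing --emergency-repair after the reconfigure-db options is an assumption, so confirm the exact syntax for your MKE version before running it.

```shell
# DOCKER can be overridden (for example DOCKER=echo) to preview the command.
DOCKER="${DOCKER:-docker}"

# Hypothetical wrapper around the reconfigure-db invocation.
# Arguments: node address, mirantis/ucp-auth image tag, number of replicas.
# Assumption: --emergency-repair is passed as an option of reconfigure-db.
emergency_repair_db() {
    $DOCKER container run --rm -v ucp-auth-store-certs:/tls \
        "mirantis/ucp-auth:$2" --db-addr="$1:12383" --debug \
        reconfigure-db --num-replicas "$3" --emergency-repair
}

# Usage, with the variables defined as in the previous section:
#   emergency_repair_db "$NODE_ADDRESS" "$VERSION" "$NUM_MANAGERS"
```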