Troubleshoot cluster configurations

MKE regularly monitors its internal components, attempting to resolve issues as it discovers them.

In most cases where a single MKE component remains in a persistently failed state, removing and rejoining the unhealthy node restores the cluster to a healthy state.
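A minimal sketch of the general swarm-level workflow is shown below; <node-name>, the join token, and the manager address are placeholders, and you should follow the node removal procedure documented for your MKE release where it differs:

    # On a healthy manager: identify the unhealthy node and remove it.
    docker node ls
    docker node demote <node-name>     # only if the unhealthy node is a manager
    docker node rm --force <node-name>

    # On the removed node: leave the swarm, then rejoin it.
    # Obtain the join token from a manager with: docker swarm join-token worker
    docker swarm leave --force
    docker swarm join --token <join-token> <manager-address>:2377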

MKE persists configuration data in an etcd key-value store and a RethinkDB database, both of which are replicated across all MKE manager nodes. These data stores are for internal use only and should not be used by other applications.

Troubleshoot the etcd key-value store with the HTTP API

This example uses curl to make requests to the key-value store REST API and jq to process the responses.

  1. Install curl and jq on an Ubuntu distribution:

    sudo apt-get update && sudo apt-get install curl jq
    
  2. Use a client bundle to authenticate your requests. Download and configure the client bundle if you have not done so already.
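
     For example, if you downloaded the bundle as a zip archive, you can extract it and source the included env.sh script to set the required environment variables. This is a minimal sketch for a bash shell; the archive and directory names are examples:

    unzip ucp-bundle-admin.zip -d ucp-bundle   # archive name will vary
    cd ucp-bundle
    eval "$(<env.sh)"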

  3. Use the REST API to access the cluster configurations. The $DOCKER_HOST and $DOCKER_CERT_PATH environment variables are set when using the client bundle.

    export KV_URL="https://$(echo $DOCKER_HOST | cut -f3 -d/ | cut -f1 -d:):12379"
    
    curl -s \
         --cert ${DOCKER_CERT_PATH}/cert.pem \
         --key ${DOCKER_CERT_PATH}/key.pem \
         --cacert ${DOCKER_CERT_PATH}/ca.pem \
         ${KV_URL}/v2/keys | jq "."
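
     To list the full key hierarchy rather than only the top-level keys, you can add etcd's standard v2 recursive query parameter. This sketch reuses the variables set above:

    curl -s \
         --cert ${DOCKER_CERT_PATH}/cert.pem \
         --key ${DOCKER_CERT_PATH}/key.pem \
         --cacert ${DOCKER_CERT_PATH}/ca.pem \
         "${KV_URL}/v2/keys?recursive=true" | jq "."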
    

Troubleshoot the etcd key-value store with the CLI

The MKE etcd key-value store runs in containers named ucp-kv. To check the health of the etcd cluster, run commands inside these containers with docker exec and etcdctl.

  1. Log in to a manager node using SSH.

  2. Troubleshoot an etcd key-value store:

    docker exec -it ucp-kv sh -c \
    'etcdctl --cluster=true endpoint health -w table 2>/dev/null'
    

     If the command fails, only an error code is displayed.
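
     You can also review per-member details, such as leader status and database size, with the endpoint status subcommand. This is a sketch that assumes the same ucp-kv container and etcdctl v3 syntax used above:

    docker exec -it ucp-kv sh -c \
    'etcdctl --cluster=true endpoint status -w table 2>/dev/null'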

Troubleshoot your cluster configuration using the RethinkDB database

User and organization data for MKE is stored in a RethinkDB database, which is replicated across all manager nodes in the MKE cluster.

Database replication and failover are typically handled automatically by the MKE configuration management processes. However, you can use the CLI to review the status of the database and manually reconfigure database replication.

  1. Log in to a manager node using SSH.

  2. Produce a detailed status of all servers and database tables in the RethinkDB cluster:

    NODE_ADDRESS=$(docker info --format '{{.Swarm.NodeAddr}}')
    VERSION=$(docker image ls --format '{{.Tag}}' mirantis/ucp-auth | head -n 1)
    docker container run --rm -v ucp-auth-store-certs:/tls mirantis/ucp-auth:${VERSION} --db-addr=${NODE_ADDRESS}:12383 db-status
    
    • NODE_ADDRESS is the IP address of this Docker Swarm manager node.

    • VERSION is the most recent version of the mirantis/ucp-auth image.

    Expected output:

    Server Status: [
      {
        "ID": "ffa9cd5a-3370-4ccd-a21f-d7437c90e900",
        "Name": "ucp_auth_store_192_168_1_25",
        "Network": {
          "CanonicalAddresses": [
            {
              "Host": "192.168.1.25",
              "Port": 12384
            }
          ],
          "TimeConnected": "2017-07-14T17:21:44.198Z"
        }
      }
    ]
    ...
    
  3. Repair the RethinkDB cluster so that the number of replicas equals the number of manager nodes in the cluster:

    NODE_ADDRESS=$(docker info --format '{{.Swarm.NodeAddr}}')
    NUM_MANAGERS=$(docker node ls --filter role=manager -q | wc -l)
    VERSION=$(docker image ls --format '{{.Tag}}' mirantis/ucp-auth | head -n 1)
    docker container run --rm -v ucp-auth-store-certs:/tls mirantis/ucp-auth:${VERSION} --db-addr=${NODE_ADDRESS}:12383 --debug reconfigure-db --num-replicas ${NUM_MANAGERS}
    
    • NODE_ADDRESS is the IP address of this Docker Swarm manager node.

    • NUM_MANAGERS is the current number of manager nodes in the cluster.

    • VERSION is the most recent version of the mirantis/ucp-auth image.

    Example output:

    time="2017-07-14T20:46:09Z" level=debug msg="Connecting to db ..."
    time="2017-07-14T20:46:09Z" level=debug msg="connecting to DB Addrs: [192.168.1.25:12383]"
    time="2017-07-14T20:46:09Z" level=debug msg="Reconfiguring number of replicas to 1"
    time="2017-07-14T20:46:09Z" level=debug msg="(00/16) Reconfiguring Table Replication..."
    time="2017-07-14T20:46:09Z" level=debug msg="(01/16) Reconfigured Replication of Table \"grant_objects\""
    ...
    

Note

If the quorum in any of the RethinkDB tables is lost, run the reconfigure-db command with the --emergency-repair flag.
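
For example, building on the reconfigure-db command shown in the previous step (a sketch only; it assumes the flag is simply appended to that command, and you should review the state of your cluster before attempting an emergency repair):

    NODE_ADDRESS=$(docker info --format '{{.Swarm.NodeAddr}}')
    NUM_MANAGERS=$(docker node ls --filter role=manager -q | wc -l)
    VERSION=$(docker image ls --format '{{.Tag}}' mirantis/ucp-auth | head -n 1)
    docker container run --rm -v ucp-auth-store-certs:/tls mirantis/ucp-auth:${VERSION} --db-addr=${NODE_ADDRESS}:12383 --debug reconfigure-db --num-replicas ${NUM_MANAGERS} --emergency-repair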
