Monitor an MKE cluster

You can monitor the health of your MKE cluster using the MKE web UI, the CLI, and the _ping endpoint. This topic describes how to monitor your cluster health, vulnerability counts, and disk usage.

If you run MSR alongside MKE, MKE displays image vulnerability counts, obtained from MSR scans, for containers, Swarm services, Pods, and images. This feature requires that you run MSR 2.6.x or later and enable MKE single sign-on.

The MKE web UI displays disk usage metrics, including available space, only for the /var/lib/docker part of the filesystem. To monitor the total space available on each filesystem of an MKE worker or manager node, you must deploy a third-party operating system monitoring solution.
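
For a quick manual check on a single node, the standard df utility can report usage for any mounted filesystem. The following is a minimal sketch and not an MKE feature: the function name fs_use_percent is hypothetical, and it relies only on the POSIX df -P output format. It is not a substitute for a proper monitoring solution.

```shell
# Print the used-space percentage (without the % sign) of the filesystem
# that contains the given path. Hypothetical helper, not part of MKE.
fs_use_percent() {
  # df -P prints one header line, then one line per filesystem;
  # the fifth column is the capacity in use, for example "42%".
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

Run on a node, fs_use_percent /var/lib/docker reports on the filesystem that holds /var/lib/docker, and fs_use_percent / (or any other mount point) reports on a filesystem that the web UI does not cover.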

Monitor with the MKE web UI

  1. Log in to the MKE web UI.

  2. From the left-side navigation panel, navigate to the Dashboard page.

    Cluster health warnings that require your immediate attention display on the cluster dashboard. MKE administrators are likely to see more of these warnings than regular users.

  3. Navigate to Shared Resources > Nodes to inspect the health of the nodes that MKE manages. To read the node health status, hover over the colored indicator.

  4. Click a particular node to learn more about its health.

  5. Click the vertical ellipsis in the upper-right corner and select Tasks.

  6. From the left-side navigation panel, click Agent Logs to examine log entries.

Monitor with the CLI

  1. Download and configure the client bundle.

  2. Examine the health of the nodes in your cluster:

    docker node ls

    Status messages that begin with [Pending] indicate a transient state that is expected to resolve on its own, after which the node returns to a healthy state.
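
In a script, the node listing can be filtered so that only problem nodes are reported. The sketch below assumes the client bundle is already loaded in the current shell and uses the --format option of docker node ls with the .Hostname and .Status fields. The function name unready_nodes is hypothetical.

```shell
# Print "hostname: status" for every node whose status is not "Ready".
# Reads space-separated "hostname status" lines from standard input.
# Hypothetical helper, not part of MKE or Docker.
unready_nodes() {
  awk 'NF >= 2 && $2 != "Ready" { print $1 ": " $2 }'
}

# Intended use (requires a configured client bundle):
#   docker node ls --format '{{.Hostname}} {{.Status}}' | unready_nodes
```

Empty output means that every listed node reports Ready.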

Automate the monitoring process

Automate the MKE cluster monitoring process by using the https://<mke-manager-url>/_ping endpoint to evaluate the health of a single manager node. The MKE manager evaluates whether its internal components are functioning properly, and returns one of the following HTTP codes:

  • 200 - all components are healthy

  • 500 - one or more components are not healthy
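
A monitoring script can map these codes to a verdict. The following sketch assumes that curl is installed. The function names are hypothetical, and any code other than 200 or 500 is treated as unknown, including the 000 that curl reports when it receives no HTTP response.

```shell
# Send a GET request to the /_ping endpoint of one manager node and print
# only the HTTP status code. $1 is the manager address.
# If the manager certificate is not signed by a CA that this host trusts,
# curl also needs --cacert with the appropriate CA file.
ping_code() {
  curl --silent --output /dev/null --write-out '%{http_code}' \
       --max-time 10 "https://$1/_ping"
}

# Translate a status code from /_ping into a verdict.
ping_verdict() {
  case "$1" in
    200) echo "healthy" ;;
    500) echo "unhealthy" ;;
    *)   echo "unknown" ;;
  esac
}

# Intended use, with a placeholder address:
#   ping_verdict "$(ping_code mke-manager-1.example.com)"
```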

Using an administrator client certificate as a TLS client certificate for the _ping endpoint returns a detailed error message if any component is unhealthy.
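
As an illustration, the request below presents the certificate from an administrator client bundle. This is a sketch under stated assumptions: the extracted bundle is assumed to contain ca.pem, cert.pem, and key.pem, the ca.pem file is assumed to validate the manager certificate, and the function name ping_detail is hypothetical.

```shell
# Query /_ping with an administrator client certificate and print the
# response body, which carries the detailed message for unhealthy components.
# $1: manager address
# $2: directory holding the extracted client bundle
#     (assumed to contain ca.pem, cert.pem, and key.pem)
ping_detail() {
  pd_host="$1"
  pd_bundle="$2"
  # Fail early, with exit status 2, if any expected bundle file is missing.
  for pd_file in ca.pem cert.pem key.pem; do
    if [ ! -f "$pd_bundle/$pd_file" ]; then
      echo "missing $pd_bundle/$pd_file" >&2
      return 2
    fi
  done
  curl --silent --show-error --max-time 10 \
       --cacert "$pd_bundle/ca.pem" \
       --cert "$pd_bundle/cert.pem" \
       --key "$pd_bundle/key.pem" \
       "https://$pd_host/_ping"
}

# Intended use, with placeholder values:
#   ping_detail mke-manager-1.example.com "$HOME/mke-bundle"
```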

Do not access the _ping endpoint through a load balancer, as doing so does not allow you to determine which manager node is unhealthy. Instead, connect directly to the URL of a manager node. Use GET rather than HEAD to ping the endpoint, as HEAD returns a 404 error code.
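
One way to follow this guidance is to probe every manager node individually and report a result per node. The sketch below assumes that curl is installed and that the manager addresses are passed as arguments. The function names probe and check_managers are hypothetical.

```shell
# Print the HTTP status code that one manager returns for GET /_ping.
# curl sends GET by default; do not add --head, because HEAD returns 404.
probe() {
  curl --silent --output /dev/null --write-out '%{http_code}' \
       --max-time 10 "https://$1/_ping"
}

# Probe each manager address given as an argument, print one line per node,
# and return 1 if any node answered with something other than 200.
check_managers() {
  cm_failed=0
  for cm_host in "$@"; do
    cm_code="$(probe "$cm_host" || true)"
    if [ "$cm_code" = "200" ]; then
      echo "$cm_host ok"
    else
      echo "$cm_host FAILED ($cm_code)"
      cm_failed=1
    fi
  done
  return "$cm_failed"
}

# Intended use, with placeholder addresses of individual manager nodes:
#   check_managers mke-manager-1.example.com mke-manager-2.example.com
```

Because each line names the node that was probed, the output identifies the unhealthy manager directly, which a single request through a load balancer cannot do.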