MKE component metrics

Available since MKE 3.7.0

In addition to the core metrics that MKE exposes, you can use Prometheus to scrape a variety of metrics associated with MKE middleware components.

This section outlines the components that expose Prometheus metrics and offers detail on various key metrics. Note, however, that this information is not exhaustive. It is a guideline to the metrics that you may find especially useful in determining the overall health of your MKE deployment.

For specific key metrics, refer to the Usage information, which offers valuable insights on interpreting the data and using it to troubleshoot your MKE deployment.

Kube State Metrics

MKE deploys Kube State Metrics to expose metrics on the state of Kubernetes objects, such as Deployments, nodes, and Pods. These metrics are exposed in MKE on the ucp-kube-state-metrics service and can be scraped at ucp-kube-state-metrics.kube-system.svc.cluster.local:8080.

Note

Consult the documentation for Kube State Metrics for an extensive list of all the metrics exposed by Kube State Metrics.

Workqueue metrics for Kubernetes components

You can use workqueue metrics to learn how long it takes for various components to fulfill different actions and to check the level of work queue activity.

The metrics offered below are based on kube-controller-manager; however, the same metrics are available for other Kubernetes components.

Usage

Abnormal workqueue metrics can be symptomatic of issues in the specific component. For example, an increase in workqueue_depth for the Kubernetes Controller Manager can indicate that the component is being oversaturated. In such cases, review the logs of the affected component.

workqueue_queue_duration_seconds_bucket

Description

Time, in seconds, that items wait in the kube-controller-manager workqueue before the component picks them up to fulfill the actions necessary to maintain the desired cluster status.

Example query

The following query checks the 99th percentile of the time that kube-controller-manager needs to process items in the workqueue:

histogram_quantile(0.99,sum(rate(workqueue_queue_duration_seconds_bucket{job="kube_controller_manager_nodes"}[5m]))
by (instance, name, le))
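
To make the percentile calculation more concrete, the following Python sketch approximates how a quantile can be estimated from cumulative histogram buckets, by interpolating linearly inside the bucket that contains the target rank. It is a simplified illustration and not the Prometheus implementation, and the bucket boundaries and counts are made-up sample values.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified illustration: assumes 0 < q <= 1 and that the buckets
    end with a +Inf bucket, as Prometheus histograms do.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that holds the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative bucket rates for one workqueue (le, rate).
sample = [(0.001, 50.0), (0.01, 80.0), (0.1, 98.0), (1.0, 100.0), (math.inf, 100.0)]
print(f"p99 ~ {histogram_quantile(0.99, sample):.2f}s")
```

Because the result is interpolated, its precision depends on how closely the bucket boundaries bracket the true value.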

workqueue_adds_total

Description

Measures additions to the workqueue. A high value can indicate issues with the component.

Example query

The following query checks the rate at which items are added to the workqueue:

sum(rate(workqueue_adds_total{job="kube_controller_manager_nodes"}[5m]))
by (instance, name)
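
As a rough illustration of what a rate over a counter such as workqueue_adds_total represents, the following Python sketch computes an average per-second increase from timestamped samples. It is a simplification: it accounts for counter resets but omits the boundary extrapolation that Prometheus applies, and the sample values are hypothetical.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter from (timestamp, value) samples.

    Simplified: handles counter resets but skips the extrapolation that
    Prometheus applies at the window boundaries.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value means the counter was reset (for example, a restart).
        increase += cur if cur < prev else cur - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical workqueue_adds_total samples scraped every 60 seconds.
window = [(0, 1000.0), (60, 1120.0), (120, 1240.0), (180, 30.0), (240, 150.0)]
print(simple_rate(window))
```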

workqueue_depth

Description

Reports the current size of the workqueue. The larger the workqueue, the more items there are to process. A growing trend in the size of the workqueue can be indicative of issues in the cluster.

Example query

sum(workqueue_depth{job="kube_controller_manager_nodes"})
by (instance, name)
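
Because the concern with workqueue_depth is a growing trend rather than a single high reading, it can help to look at the slope of recent samples. The following Python sketch fits a least-squares line to timestamped readings, which is similar in spirit to the PromQL deriv() function. The readings are hypothetical, and the function name is chosen for illustration only.

```python
def slope_per_second(samples):
    """Least-squares slope of (timestamp, value) samples, in units per second."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var if var else 0.0


# Hypothetical workqueue_depth readings taken every 60 seconds.
steady = [(0, 4), (60, 5), (120, 4), (180, 5), (240, 4)]
growing = [(0, 4), (60, 10), (120, 16), (180, 22), (240, 28)]
print(slope_per_second(steady), slope_per_second(growing))
```

A slope that stays near zero suggests a queue that drains as fast as it fills, whereas a persistently positive slope suggests a backlog that is building up.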

Kubelet metrics

The kubelet agent runs on every node in an MKE cluster. Once you have set up the MKE client bundle, you can view the available kubelet metrics for each node in an MKE cluster using the commands detailed below:

  • Obtain the name of the first available node in your MKE cluster:

    NODE_NAME=$(kubectl get node | sed -n '2 p' | awk '{print $1}')
    
  • Query the kubelet metrics endpoint on the chosen node:

    kubectl get --raw /api/v1/nodes/${NODE_NAME}/proxy/metrics
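
The endpoint returns metrics in the Prometheus text format, one sample per line. As a sketch of how you might pick a single metric out of that output, the following Python example parses a hypothetical excerpt. The parser is deliberately minimal and assumes that label values contain no commas or spaces, which holds for the sample but not for every metric.

```python
def parse_metric_lines(text, metric_name):
    """Return (labels, value) pairs for one metric from Prometheus text output.

    Simplified parser: assumes label values contain no commas or spaces.
    """
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name, _, label_part = name_and_labels.partition("{")
        if name != metric_name:
            continue
        labels = {}
        for pair in label_part.rstrip("}").split(","):
            if pair:
                key, _, val = pair.partition("=")
                labels[key] = val.strip('"')
        results.append((labels, float(value)))
    return results


# Hypothetical excerpt of kubelet /metrics output.
sample_output = """
# TYPE kubelet_running_pods gauge
kubelet_running_pods 23
# TYPE kubelet_running_containers gauge
kubelet_running_containers{container_state="created"} 1
kubelet_running_containers{container_state="exited"} 7
kubelet_running_containers{container_state="running"} 31
"""

for labels, value in parse_metric_lines(sample_output, "kubelet_running_containers"):
    print(labels["container_state"], int(value))
```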
    

The following are a number of key kubelet metrics:

kube_node_status_condition

Description

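Reports the condition of each node, as exposed by Kube State Metrics. Summing the nodes whose Ready condition is true reflects the total number of healthy kubelet instances, which should correlate with the number of nodes in the cluster.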

Example query

sum(kube_node_status_condition{condition="Ready", status="true"})

Usage

If the number of kubelet instances decreases unexpectedly, review the nodes for connectivity issues.

kubelet_running_pods

Description

Indicates the total number of running Pods, which you can use to verify whether the number of Pods is in the expected range for your cluster.

Usage

If the number of Pods is unexpected on a node, review your Node Affinity or Node Selector rules to verify the scheduling of Pods for the appropriate nodes.

kubelet_running_containers

Description

Indicates the number of containers per node. You can query for a specific container state (running, created, exited). A high number of exited containers on a node can indicate issues on that node.

Example query

kubelet_running_containers{container_state="created"}

Usage

If the number of containers is unexpected on a node, check your Node Affinity or Node Selector rules to verify the scheduling of Pods for the appropriate nodes.

kubelet_runtime_operations_total

Description

Provides the total count of runtime operations, organized by type.

Example query

kubelet_runtime_operations_total{operation_type="create_container"}

Usage

An increase in runtime operations duration and/or runtime operations errors can indicate problems with the container runtime on the node.

kubelet_runtime_operations_errors_total

Description

Displays the number of errors in runtime operations. Monitor this metric to learn of issues on a node.

Usage

An increase in runtime operations duration and/or runtime operations errors can indicate problems with the container runtime on the node.

kubelet_runtime_operations_duration_seconds_bucket

Description

Reflects the time required for each runtime operation.

Example query

The following query checks the 99th percentile for time taken for various runtime operations.

histogram_quantile(0.99,
sum(rate(kubelet_runtime_operations_duration_seconds_bucket{instance=~".*"}[5m]))
by (instance, operation_type, le))

Usage

An increase in runtime operations duration and/or runtime operations errors can indicate problems with the container runtime on the node.

Kube Proxy

Kube Proxy runs on each node in an MKE cluster. Once you have set up the MKE client bundle, you can view the available Kube Proxy metrics for each node in an MKE cluster using the commands detailed below:

Note

The Kube Proxy metrics are only available when Kube Proxy is enabled in the MKE configuration and is running in either ipvs or iptables mode.

  • Obtain the name of the first available node in your MKE cluster:

    NODE_NAME=$(kubectl get node | sed -n '2 p' | awk '{print $1}')
    
  • Query the Kube Proxy metrics endpoint on the chosen node:

    kubectl get --raw /api/v1/nodes/${NODE_NAME}:10249/proxy/metrics
    

    Note

    Specify port 10249, as this is the port on which Kube Proxy metrics are exposed.

The following are a number of key Kube Proxy metrics:

kube_proxy_nodes

Description

Reflects the total number of Kube Proxy nodes, which should correlate with the number of nodes in the cluster.

Example query

sum(up{job="kube-proxy-nodes"})

Usage

If the number of kube-proxy instances decreases unexpectedly, check the nodes for connectivity issues.

rest_client_request_duration_seconds_bucket

Description

Reflects the latency of client requests, in seconds. Such information can be useful in determining whether your cluster is experiencing performance degradation.

Example query

The following query illustrates the latency for all POST requests.

rest_client_request_duration_seconds_bucket{verb="POST"}

Usage

Review Kube Proxy logs on affected nodes to uncover any potential errors or timeouts.

kubeproxy_sync_proxy_rules_duration_seconds_bucket

Description

Displays the latency, in seconds, of the synchronization of Kube Proxy network rules, which Kube Proxy continually keeps in sync on each node. If the measurement increases consistently, Kube Proxy can fall out of sync across the nodes.

rest_client_requests_total

Description

Monitors the HTTP response codes for all requests to Kube Proxy. An increase in 5xx response codes can indicate issues with Kube Proxy.

Example query

The following query presents the number of 5xx response codes from Kube Proxy.

rest_client_requests_total{job="kube-proxy-nodes",code=~"5.."}
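
The code=~"5.." matcher is a regular expression that must match the entire label value, so it selects exactly three-character codes that begin with 5. The following Python sketch imitates that behavior with re.fullmatch. The list of codes is sample data, and the helper name is hypothetical.

```python
import re


def matches_label(pattern, value):
    """Mimic a PromQL =~ matcher, which must match the entire label value."""
    return re.fullmatch(pattern, value) is not None


# Sample values of the code label.
codes = ["200", "201", "404", "500", "503", "5000", "<error>"]
server_errors = [c for c in codes if matches_label("5..", c)]
print(server_errors)
```

With the sample data, only 500 and 503 are selected, and a value such as 5000 is not, because the pattern is anchored at both ends.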

Usage

Review Kube Proxy logs on affected nodes to obtain details of the error responses.

Kube Controller Manager

Kube Controller Manager is a collection of different Kubernetes controllers whose primary task is to monitor changes in the state of various Kubernetes objects. It runs on all manager nodes in an MKE cluster.

Key Kube Controller Manager metrics are detailed as follows:

rest_client_request_duration_seconds_bucket

Description

Reflects the latency of calls to the API server, in seconds. Such information can be useful in determining whether your cluster is experiencing performance degradation.

Example query

The following query displays the 99th percentile latencies on requests to the API server.

histogram_quantile(0.99,
sum(rate(rest_client_request_duration_seconds_bucket{job="kube_controller_manager_nodes"}[5m]))
by (url, le))

Usage

Review the Kube Controller Manager logs on affected nodes to determine whether the metrics are abnormal.

rest_client_requests_total

Description

Presents the total number of HTTP requests that Kube Controller Manager makes to the API server, segmented by HTTP response code. A sudden increase in requests or an increase in requests with error response codes can indicate issues with the cluster.

Example query

The following query displays the rate of successful HTTP requests (those returning 2xx response codes).

sum(rate(rest_client_requests_total{job="kube_controller_manager_nodes"
,code=~"2.."}[5m]))

Usage

Review the Kube Controller Manager logs on affected nodes to determine whether the metrics are abnormal.

process_cpu_seconds_total

Description

Measures the total CPU time spent by a Kube Controller Manager instance.

Example query

rate(process_cpu_seconds_total{job="kube_controller_manager_nodes"}[5m])

process_resident_memory_bytes

Description

Measures the amount of resident memory used by each Kube Controller Manager instance.

Example query

process_resident_memory_bytes{job="kube_controller_manager_nodes"}

Kube Apiserver

The Kube API server is the core of the Kubernetes control plane. It provides a means for obtaining information on Kubernetes objects and is also used to modify the state of API objects. MKE runs an instance of the Kube API server on each manager node.

The following are a number of key Kube Apiserver metrics:

apiserver_request_duration_seconds_bucket

Description

Measures latency for each request to the Kube API server.

Example query

The following query shows how latency is distributed across different HTTP verbs.

histogram_quantile(0.99,
sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers"}[5m]))
by (verb, le))

apiserver_request_total

Description

Measures the total traffic to the API server, segmented by the resource being accessed and by whether the request is successful.

Example query

The following query measures the rate of requests that return 2xx HTTP response codes. You can modify the query to measure the rate of error requests.

sum(rate(apiserver_request_total{job="kubernetes-apiservers",code=~"2.."}[5m]))
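
When you adapt the query to track errors, a ratio is often easier to interpret than an absolute rate. The following Python sketch shows the arithmetic for a 5xx error ratio, using hypothetical per-code request rates. The function name and the values are assumptions made for illustration.

```python
def error_ratio(rates_by_code):
    """Fraction of requests whose HTTP status code is 5xx."""
    total = sum(rates_by_code.values())
    if total == 0:
        return 0.0
    errors = sum(rate for code, rate in rates_by_code.items() if code.startswith("5"))
    return errors / total


# Hypothetical per-code request rates (requests per second).
rates = {"200": 180.0, "201": 12.0, "404": 6.0, "500": 1.5, "503": 0.5}
print(f"{error_ratio(rates):.2%}")
```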

Calico

Calico is the default networking plugin for MKE. Specifically, MKE gathers metrics from both the Felix and Kube-Controllers Calico components.

Refer to the official Calico documentation on Prometheus statistics for detailed information on Felix and Kube-Controllers metrics.

RethinkDB

MKE deploys RethinkDB Exporter on all manager nodes, to allow metrics scraping from RethinkDB. The RethinkDB Exporter exports most of the statistics from the RethinkDB stats table.

You can monitor the read and write throughput for each RethinkDB replica by reviewing the following metrics:

table_docs_per_second

Description

Current number of document reads and writes per second from the table.

cluster_docs_per_second

Description

Current number of document reads and writes per second from the cluster.

server_docs_per_second

Description

Current number of document reads and writes per second from the server.

These metrics are organized into read/write categories and by replica. For example, to view all the table read metrics on a specific node you can run the following query:

table_docs_per_second{operation="read", instance="instance_name"}

NodeLocalDNS

MKE deploys NodeLocalDNS on every node, with the Prometheus plugin enabled. You can scrape NodeLocalDNS metrics on port 9253, which provides regular CoreDNS metrics that include the standard RED (Rate, Errors, Duration) metrics:

  • queries

  • durations

  • error counts

The metrics path is fixed to /metrics.

Metric

Description

coredns_build_info

Information about the CoreDNS build, such as the version and revision.

coredns_cache_entries

Number of entries in the cache.

coredns_cache_size

Cache size.

coredns_cache_hits_total

Counter of cache hits by cache type.

coredns_cache_misses_total

Counter of cache misses.

coredns_cache_requests_total

Total number of DNS resolution requests in different dimensions.

coredns_dns_request_duration_seconds_bucket

Histogram of DNS request duration (bucket).

coredns_dns_request_duration_seconds_count

Histogram of DNS request duration (count).

coredns_dns_request_duration_seconds_sum

Histogram of DNS request duration (sum).

coredns_dns_request_size_bytes_bucket

Histogram of the size of DNS request (bucket).

coredns_dns_request_size_bytes_count

Histogram of the size of DNS request (count).

coredns_dns_request_size_bytes_sum

Histogram of the size of DNS request (sum).

coredns_dns_requests_total

Number of DNS requests.

coredns_dns_response_size_bytes_bucket

Histogram of the size of DNS response (bucket).

coredns_dns_response_size_bytes_count

Histogram of the size of DNS response (count).

coredns_dns_response_size_bytes_sum

Histogram of the size of DNS response (sum).

coredns_dns_responses_total

Number of DNS responses, segmented by response code.

coredns_forward_conn_cache_hits_total

Number of cache hits for each protocol and data flow.

coredns_forward_conn_cache_misses_total

Number of cache misses for each protocol and data flow.

coredns_forward_healthcheck_broken_total

Unhealthy upstream count.

coredns_forward_healthcheck_failures_total

Count of failed health checks per upstream.

coredns_forward_max_concurrent_rejects_total

Number of requests rejected due to excessive concurrent requests.

coredns_forward_request_duration_seconds_bucket

Histogram of forward request duration (bucket).

coredns_forward_request_duration_seconds_count

Histogram of forward request duration (count).

coredns_forward_request_duration_seconds_sum

Histogram of forward request duration (sum).

coredns_forward_requests_total

Number of requests for each data flow.

coredns_forward_responses_total

Number of responses to each data flow.

coredns_health_request_duration_seconds_bucket

Histogram of health request duration (bucket).

coredns_health_request_duration_seconds_count

Histogram of health request duration (count).

coredns_health_request_duration_seconds_sum

Histogram of health request duration (sum).

coredns_health_request_failures_total

Number of health request failures.

coredns_hosts_reload_timestamp_seconds

Timestamp of the last reload of the hosts file.

coredns_kubernetes_dns_programming_duration_seconds_bucket

Histogram of DNS programming duration (bucket).

coredns_kubernetes_dns_programming_duration_seconds_count

Histogram of DNS programming duration (count).

coredns_kubernetes_dns_programming_duration_seconds_sum

Histogram of DNS programming duration (sum).

coredns_local_localhost_requests_total

Number of localhost requests.

coredns_nodecache_setup_errors_total

Number of nodecache setup errors.

coredns_dns_response_rcode_count_total

Number of responses for each Zone and Rcode.

coredns_dns_request_count_total

Number of DNS requests.

coredns_dns_request_do_count_total

Number of requests with the DNSSEC OK (DO) bit set.

coredns_dns_do_requests_total

Number of requests with the DO bit set.

coredns_dns_request_type_count_total

Number of requests for each Zone and Type.

coredns_panics_total

Total number of panics.

coredns_plugin_enabled

Whether a plugin is enabled.

coredns_reload_failed_total

Number of failed reload attempts.