MKE component metrics¶
Available since MKE 3.7.0
In addition to the core metrics that MKE exposes, you can use Prometheus to scrape a variety of metrics associated with MKE middleware components.
This section outlines the components that expose Prometheus metrics and details a number of key metrics for each. The information is not exhaustive; rather, it is a guide to the metrics that you may find especially useful in assessing the overall health of your MKE deployment.
For specific key metrics, refer to the accompanying Usage information, which offers guidance on interpreting the data and using it to troubleshoot your MKE deployment.
Kube State Metrics¶
MKE deploys Kube State Metrics to expose metrics on the state of Kubernetes objects, such as Deployments, nodes, and Pods. These metrics are exposed in MKE on the ucp-kube-state-metrics service and can be scraped at ucp-kube-state-metrics.kube-system.svc.cluster.local:8080.
Note
Consult the Kube State Metrics documentation for an extensive list of all the metrics that the component exposes.
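To inspect the raw output directly, you can query the service endpoint from inside the cluster. The following is a minimal sketch that assumes the curlimages/curl image is available in your environment and that your client bundle permits creating Pods in the kube-system namespace:

```
# Run a throwaway Pod that scrapes the Kube State Metrics endpoint, then
# filter the output for node status conditions. The Pod is removed on exit.
kubectl run ksm-check --rm -i --restart=Never -n kube-system \
  --image=curlimages/curl --command -- \
  curl -s http://ucp-kube-state-metrics.kube-system.svc.cluster.local:8080/metrics \
  | grep kube_node_status_condition
```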
Workqueue metrics for Kubernetes components¶
You can use workqueue metrics to learn how long it takes for various components to fulfill different actions and to check the level of work queue activity.
The metrics detailed below are based on kube-controller-manager; however, the same metrics are available for other Kubernetes components.
Usage
Abnormal workqueue metrics can be symptomatic of issues in the specific component. For example, an increase in workqueue_depth for the Kubernetes Controller Manager can indicate that the component is being oversaturated. In such cases, review the logs of the affected component.
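If a Prometheus server is scraping these targets, you can run the workqueue queries through its HTTP API. The sketch below assumes a reachable Prometheus endpoint, represented by the hypothetical PROMETHEUS_URL placeholder:

```
# Hedged example: check the current Controller Manager workqueue depth through
# the Prometheus HTTP API. PROMETHEUS_URL is a placeholder for your endpoint.
PROMETHEUS_URL="http://prometheus.example.local:9090"
curl -s "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=sum(workqueue_depth{job="kube_controller_manager_nodes"}) by (instance, name)'
```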
workqueue_queue_duration_seconds_bucket¶
| Description | Time, in seconds, that an item stays in the workqueue before being processed. |
|---|---|
| Example query | The following query checks the 99th percentile of the time that items wait in the workqueue: `histogram_quantile(0.99, sum(rate(workqueue_queue_duration_seconds_bucket{job="kube_controller_manager_nodes"}[5m])) by (instance, name, le))` |
workqueue_adds_total¶
| Description | Measures additions to the workqueue. A high value can indicate issues with the component. |
|---|---|
| Example query | The following query checks the rate at which items are added to the workqueue: `sum(rate(workqueue_adds_total{job="kube_controller_manager_nodes"}[5m])) by (instance, name)` |
workqueue_depth¶
| Description | Relates to the size of the workqueue. The larger the workqueue, the more material there is to process. A growing trend in the size of the workqueue can be indicative of issues in the cluster. |
|---|---|
| Example query | `sum(rate(workqueue_depth{job="kube_controller_manager_nodes"}[5m])) by (instance, name)` |
Kubelet metrics¶
The kubelet agent runs on every node in an MKE cluster. Once you have set up the MKE client bundle, you can view the available kubelet metrics for each node using the commands detailed below:
Obtain the name of the first available node in your MKE cluster:
NODE_NAME=$(kubectl get node | sed -n '2 p' | awk '{print $1}')
Ping the kubelet metrics endpoint on the chosen node:
kubectl get --raw /api/v1/nodes/${NODE_NAME}/proxy/metrics
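The full output is extensive. To focus on a single metric family, filter the response, for example:

```
# Filter the kubelet metrics output for a single metric family.
kubectl get --raw /api/v1/nodes/${NODE_NAME}/proxy/metrics | grep kubelet_running_pods
```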
The following are a number of key kubelet metrics:
kube_node_status_condition¶
| Description | Reflects the condition status reported for each node. The example query returns the number of nodes in the Ready state, which should correlate with the number of nodes in the cluster. |
|---|---|
| Example query | `sum(kube_node_status_condition{condition="Ready", status="true"})` |
| Usage | If the number of Ready nodes decreases unexpectedly, review the affected nodes for connectivity issues. |
kubelet_running_pods¶
| Description | Indicates the total number of running Pods, which you can use to verify whether the number of Pods is in the expected range for your cluster. |
|---|---|
| Usage | If the number of Pods is unexpected on a node, review your Node Affinity or Node Selector rules to verify the scheduling of Pods for the appropriate nodes. |
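The table above does not include an example query. As a sketch, assuming a Prometheus server is scraping the kubelet targets, you can aggregate the metric by node:

```
# Hedged example: total running Pods reported by each kubelet, queried through
# the Prometheus HTTP API. PROMETHEUS_URL is a placeholder for your endpoint.
curl -s "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=sum(kubelet_running_pods) by (instance)'
```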
kubelet_running_containers¶
| Description | Indicates the number of containers per node. You can query for a specific container state, for example `created`. |
|---|---|
| Example query | `kubelet_running_containers{container_state="created"}` |
| Usage | If the number of containers is unexpected on a node, check your Node Affinity or Node Selector rules to verify the scheduling of Pods for the appropriate nodes. |
kubelet_runtime_operations_total¶
| Description | Provides the total count of runtime operations, organized by type. |
|---|---|
| Example query | `kubelet_runtime_operations_total{operation_type="create_container"}` |
| Usage | An increase in runtime operations duration and/or runtime operations errors can indicate problems with the container runtime on the node. |
kubelet_runtime_operations_errors_total¶
| Description | Displays the number of errors in runtime operations. Monitor this metric to learn of issues on a node. |
|---|---|
| Usage | An increase in runtime operations duration and/or runtime operations errors can indicate problems with the container runtime on the node. |
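No example query is provided for this metric. A hedged sketch for spotting error spikes, assuming a reachable Prometheus endpoint, might look as follows:

```
# Hedged example: per-node error rate of kubelet runtime operations over 5 minutes.
# PROMETHEUS_URL is a placeholder for your Prometheus endpoint.
curl -s "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=sum(rate(kubelet_runtime_operations_errors_total[5m])) by (instance, operation_type)'
```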
kubelet_runtime_operations_duration_seconds_bucket¶
| Description | Reflects the time required for each runtime operation. |
|---|---|
| Example query | The following query checks the 99th percentile of the time taken for various runtime operations: `histogram_quantile(0.99, sum(rate(kubelet_runtime_operations_duration_seconds_bucket{instance=~".*"}[5m])) by (instance, operation_type, le))` |
| Usage | An increase in runtime operations duration and/or runtime operations errors can indicate problems with the container runtime on the node. |
Kube Proxy¶
Kube Proxy runs on each node in an MKE cluster. Once you have set up the MKE client bundle, you can view the available Kube Proxy metrics for each node using the commands detailed below:
Note
The Kube Proxy metrics are only available when Kube Proxy is enabled in the MKE configuration and is running in either ipvs or iptables mode.
Obtain the name of the first available node in your MKE cluster:
NODE_NAME=$(kubectl get node | sed -n '2 p' | awk '{print $1}')
Ping the Kube Proxy metrics endpoint on the chosen node:
kubectl get --raw /api/v1/nodes/${NODE_NAME}:10249/proxy/metrics
Note
Specify port 10249, as this is the port on which Kube Proxy metrics are exposed.
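As with the kubelet endpoint, you can filter the output for a single metric family, for example:

```
# Filter the Kube Proxy metrics output for rule-synchronization latency data.
kubectl get --raw /api/v1/nodes/${NODE_NAME}:10249/proxy/metrics | grep kubeproxy_sync_proxy_rules
```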
The following are a number of key Kube Proxy metrics:
kube_proxy_nodes¶
| Description | Reflects the total number of Kube Proxy nodes, which should correlate with the number of nodes in the cluster. |
|---|---|
| Example query | `sum(up{job="kube-proxy-nodes"})` |
| Usage | If the number of kube-proxy instances decreases unexpectedly, check the nodes for connectivity issues. |
rest_client_request_duration_seconds_bucket¶
| Description | Reflects the latency of client requests, in seconds. Such information can be useful in determining whether your cluster is experiencing performance degradation. |
|---|---|
| Example query | The following query illustrates the latency for all POST requests: `rest_client_request_duration_seconds_bucket{verb="POST"}` |
| Usage | Review Kube Proxy logs on affected nodes to uncover any potential errors or timeouts. |
kubeproxy_sync_proxy_rules_duration_seconds_bucket¶
| Description | Displays the latency, in seconds, of Kube Proxy network rule synchronization, which runs continually across the nodes. A consistently increasing measurement can result in Kube Proxy rules being out of sync across the nodes. |
|---|---|
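The table above does not include an example query. A sketch of a 99th percentile latency check, modeled on the other histogram queries in this section and assuming a reachable Prometheus endpoint, might look as follows:

```
# Hedged example: 99th percentile of Kube Proxy rule-sync latency per node.
# PROMETHEUS_URL is a placeholder; the job label mirrors the other Kube Proxy queries.
curl -s "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket{job="kube-proxy-nodes"}[5m])) by (instance, le))'
```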
rest_client_requests_total¶
| Description | Monitors the HTTP response codes for the client requests that Kube Proxy makes. An increase in 5xx response codes can indicate issues with Kube Proxy. |
|---|---|
| Example query | The following query presents the number of 5xx response codes received by Kube Proxy: `rest_client_requests_total{job="kube-proxy-nodes",code=~"5.."}` |
| Usage | Review Kube Proxy logs on affected nodes to obtain details of the error responses. |
Kube Controller Manager¶
Kube Controller Manager is a collection of different Kubernetes controllers whose primary task is to monitor changes in the state of various Kubernetes objects. It runs on all manager nodes in an MKE cluster.
Key Kube Controller Manager metrics are detailed as follows:
rest_client_request_duration_seconds_bucket¶
| Description | Reflects the latency of calls to the API server, in seconds. Such information can be useful in determining whether your cluster is experiencing degraded performance. |
|---|---|
| Example query | The following query displays the 99th percentile latencies for requests to the API server: `histogram_quantile(0.99, sum(rate(rest_client_request_duration_seconds_bucket{job="kube_controller_manager_nodes"}[5m])) by (url, le))` |
| Usage | If the metrics are abnormal, review the Kube Controller Manager logs on the affected nodes. |
rest_client_requests_total¶
| Description | Presents the total number of HTTP requests made by Kube Controller Manager, segmented by HTTP response code. A sudden increase in requests, or an increase in requests with error response codes, can indicate issues with the cluster. |
|---|---|
| Example query | The following query displays the rate of successful HTTP requests (those with 2xx response codes): `sum(rate(rest_client_requests_total{job="kube_controller_manager_nodes",code=~"2.."}[5m]))` |
| Usage | If the metrics are abnormal, review the Kube Controller Manager logs on the affected nodes. |
process_cpu_seconds_total¶
| Description | Measures the total CPU time spent by a Kube Controller Manager instance. |
|---|---|
| Example query | `rate(process_cpu_seconds_total{job="kube_controller_manager_nodes"}[5m])` |
process_resident_memory_bytes¶
| Description | Measures the amount of resident memory used by each Kube Controller Manager instance. |
|---|---|
| Example query | `rate(process_resident_memory_bytes{job="kube_controller_manager_nodes"}[5m])` |
Kube Apiserver¶
The Kube API server is the core of the Kubernetes control plane. It provides a means for obtaining information on Kubernetes objects and is also used to modify the state of API objects. MKE runs an instance of the Kube API server on each manager node.
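Because the MKE client bundle communicates with the Kube API server directly, you can also retrieve its metrics endpoint without going through a node proxy, for example:

```
# Retrieve the API server metrics directly and filter for request totals.
kubectl get --raw /metrics | grep apiserver_request_total
```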
The following are a number of key Kube Apiserver metrics:
apiserver_request_duration_seconds_bucket¶
| Description | Measures the latency for each request to the Kube API server. |
|---|---|
| Example query | The following query shows how latency is distributed across different HTTP verbs: `histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers"}[5m])) by (verb, le))` |
apiserver_request_total¶
| Description | Measures the total traffic to the API server, the resource being accessed, and whether the request is successful. |
|---|---|
| Example query | The following query measures the rate of requests that return 2xx HTTP response codes. You can modify the query to measure the rate of error requests. `sum(rate(apiserver_request_total{job="kubernetes-apiservers",code=~"2.."}[5m]))` |
Calico¶
Calico is the default networking plugin for MKE. Specifically, MKE gathers metrics from both the Felix and Kube-Controllers Calico components.
Refer to the official Calico documentation on Prometheus statistics for detailed information on Felix and kube controllers metrics.
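As a hedged sketch, once the Felix metrics are scraped you can check a simple Felix gauge through the Prometheus HTTP API. The felix_active_local_endpoints metric name comes from the Calico documentation and may vary by version, and PROMETHEUS_URL is a placeholder:

```
# Hedged example: active local endpoints reported by Felix on each node.
# PROMETHEUS_URL is a placeholder for your Prometheus endpoint.
curl -s "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=felix_active_local_endpoints'
```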
RethinkDB¶
MKE deploys RethinkDB Exporter on all manager nodes, to allow metrics scraping from RethinkDB. The RethinkDB Exporter exports most of the statistics from the RethinkDB stats table.
You can monitor the read and write throughput for each RethinkDB replica by reviewing the following metrics:
table_docs_per_second¶
| Description | Current number of document reads and writes per second from the table. |
|---|---|
cluster_docs_per_second¶
| Description | Current number of document reads and writes per second from the cluster. |
|---|---|
server_docs_per_second¶
| Description | Current number of document reads and writes per second from the server. |
|---|---|
These metrics are organized into read/write categories and by replica. For example, to view all the table read metrics on a specific node you can run the following query:
table_docs_per_second{operation="read", instance="instance_name"}
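To run such a query outside the MKE web UI, you can again go through the Prometheus HTTP API, assuming a reachable endpoint (PROMETHEUS_URL is a placeholder):

```
# Hedged example: per-replica read throughput for RethinkDB tables, queried
# through the Prometheus HTTP API. PROMETHEUS_URL is a placeholder.
curl -s "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=sum(table_docs_per_second{operation="read"}) by (instance)'
```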
NodeLocalDNS¶
MKE deploys NodeLocalDNS on every node, with the Prometheus plugin enabled. You can scrape NodeLocalDNS metrics on port 9253, which provides the regular CoreDNS metrics, including the standard RED (Rate, Errors, Duration) metrics:
- queries
- durations
- error counts
The metrics path is fixed to /metrics.
| Metric | Description |
|---|---|
|  | Information to build CoreDNS. |
|  | Number of entries in the cache. |
|  | Cache size. |
|  | Counter of cache hits by cache type. |
|  | Counter of cache misses. |
|  | Total number of DNS resolution requests in different dimensions. |
|  | Histogram of DNS request duration (bucket). |
|  | Histogram of DNS request duration (count). |
|  | Histogram of DNS request duration (sum). |
|  | Histogram of the size of DNS request (bucket). |
|  | Histogram of the size of DNS request (count). |
|  | Histogram of the size of DNS request (sum). |
|  | Number of DNS requests. |
|  | Histogram of the size of DNS response (bucket). |
|  | Histogram of the size of DNS response (count). |
|  | Histogram of the size of DNS response (sum). |
|  | DNS response codes and number of DNS response codes. |
|  | Number of cache hits for each protocol and data flow. |
|  | Number of cache misses for each protocol and data flow. |
|  | Unhealthy upstream count. |
|  | Count of failed health checks per upstream. |
|  | Number of requests rejected due to excessive concurrent requests. |
|  | Histogram of forward request duration (bucket). |
|  | Histogram of forward request duration (count). |
|  | Histogram of forward request duration (sum). |
|  | Number of requests for each data flow. |
|  | Number of responses to each data flow. |
|  | Histogram of health request duration (bucket). |
|  | Histogram of health request duration (count). |
|  | Histogram of health request duration (sum). |
|  | Number of health request failures. |
|  | Timestamp of the last reload of the host file. |
|  | Histogram of DNS programming duration (bucket). |
|  | Histogram of DNS programming duration (count). |
|  | Histogram of DNS programming duration (sum). |
|  | Number of localhost requests. |
|  | Number of nodecache setup errors. |
|  | Number of responses for each Zone and Rcode. |
|  | Number of DNS requests. |
|  | Number of requests with the DNSSEC OK (DO) bit set. |
|  | Number of requests with the DO bit set. |
|  | Number of requests for each Zone and Type. |
|  | Total number of panics. |
|  | Whether a plugin is enabled. |
|  | Number of last reload failures. |
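As a sketch, you can check the overall DNS request rate that NodeLocalDNS reports. The coredns_dns_requests_total metric name is the standard CoreDNS name and is assumed to be present here, and PROMETHEUS_URL is a placeholder for your Prometheus endpoint:

```
# Hedged example: DNS request rate reported by NodeLocalDNS on each node.
# coredns_dns_requests_total is a standard CoreDNS metric name and is assumed
# to be present; PROMETHEUS_URL is a placeholder for your Prometheus endpoint.
curl -s "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=sum(rate(coredns_dns_requests_total[5m])) by (instance)'
```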