MKE component metrics¶
Available since MKE 3.7.0
In addition to the core metrics that MKE exposes, you can use Prometheus to scrape a variety of metrics associated with MKE middleware components.
This section outlines the components that expose Prometheus metrics and details a number of key metrics for each. The information is not exhaustive; rather, it is a guide to the metrics that you may find especially useful in assessing the overall health of your MKE deployment.
For specific key metrics, refer to the accompanying Usage information, which offers guidance on interpreting the data and using it to troubleshoot your MKE deployment.
Kube State Metrics¶
MKE deploys Kube State Metrics to expose metrics on the state of Kubernetes objects, such as Deployments, nodes, and Pods. These metrics are exposed in MKE on the ucp-kube-state-metrics service and can be scraped at ucp-kube-state-metrics.kube-system.svc.cluster.local:8080.
Note
Consult the Kube State Metrics documentation for an extensive list of all the metrics that the component exposes.
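To inspect the raw output directly, you can query the service endpoint from inside the cluster. The following is a minimal sketch that assumes the curlimages/curl image is available in your environment and that your client bundle permits creating Pods in the kube-system namespace:

```
# Run a throwaway Pod that scrapes the Kube State Metrics endpoint, then
# filter the output for node status conditions. The Pod is removed on exit.
kubectl run ksm-check --rm -i --restart=Never -n kube-system \
  --image=curlimages/curl --command -- \
  curl -s http://ucp-kube-state-metrics.kube-system.svc.cluster.local:8080/metrics \
  | grep kube_node_status_condition
```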
Workqueue metrics for Kubernetes components¶
You can use workqueue metrics to learn how long it takes for various components to fulfill different actions and to check the level of work queue activity.
The metrics detailed below are based on kube-controller-manager; however, the same metrics are available for other Kubernetes components.
Usage
Abnormal workqueue metrics can be symptomatic of issues in the specific component. For example, an increase in workqueue_depth for the Kubernetes Controller Manager can indicate that the component is being oversaturated. In such cases, review the logs of the affected component.
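If a Prometheus server is scraping these targets, you can run the workqueue queries through its HTTP API. The sketch below assumes a reachable Prometheus endpoint, represented by the hypothetical PROMETHEUS_URL placeholder:

```
# Hedged example: check the current Controller Manager workqueue depth through
# the Prometheus HTTP API. PROMETHEUS_URL is a placeholder for your endpoint.
PROMETHEUS_URL="http://prometheus.example.local:9090"
curl -s "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=sum(workqueue_depth{job="kube_controller_manager_nodes"}) by (instance, name)'
```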
workqueue_queue_duration_seconds_bucket¶
| Description | Time, in seconds, that an item stays in the workqueue before being processed. |
|---|---|
| Example query | The following query checks the 99th percentile of the time that items wait in the workqueue: `histogram_quantile(0.99, sum(rate(workqueue_queue_duration_seconds_bucket{job="kube_controller_manager_nodes"}[5m])) by (instance, name, le))` |
workqueue_adds_total¶
| Description | Measures additions to the workqueue. A high value can indicate issues with the component. |
|---|---|
| Example query | The following query checks the rate at which items are added to the workqueue: `sum(rate(workqueue_adds_total{job="kube_controller_manager_nodes"}[5m])) by (instance, name)` |
workqueue_depth¶
| Description | Relates to the size of the workqueue. The larger the workqueue, the more material there is to process. A growing trend in the size of the workqueue can be indicative of issues in the cluster. |
|---|---|
| Example query | `sum(rate(workqueue_depth{job="kube_controller_manager_nodes"}[5m])) by (instance, name)` |
Kubelet metrics¶
The kubelet agent runs on every node in an MKE cluster. Once you have set up the MKE client bundle, you can view the available kubelet metrics for each node using the commands detailed below:
Obtain the name of the first available node in your MKE cluster:
NODE_NAME=$(kubectl get node | sed -n '2 p' | awk '{print $1}')
Ping the kubelet metrics endpoint on the chosen node:
kubectl get --raw /api/v1/nodes/${NODE_NAME}/proxy/metrics
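The full output is extensive. To focus on a single metric family, filter the response, for example:

```
# Filter the kubelet metrics output for a single metric family.
kubectl get --raw /api/v1/nodes/${NODE_NAME}/proxy/metrics | grep kubelet_running_pods
```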
The following are a number of key kubelet metrics:
kube_node_status_condition¶
| Description | Reflects the condition status reported for each node. The example query returns the number of nodes in the Ready state, which should correlate with the number of nodes in the cluster. |
|---|---|
| Example query | `sum(kube_node_status_condition{condition="Ready", status="true"})` |
| Usage | If the number of Ready nodes decreases unexpectedly, review the affected nodes for connectivity issues. |
kubelet_running_pods¶
| Description | Indicates the total number of running Pods, which you can use to verify whether the number of Pods is in the expected range for your cluster. |
|---|---|
| Usage | If the number of Pods is unexpected on a node, review your Node Affinity or Node Selector rules to verify the scheduling of Pods for the appropriate nodes. |
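The table above does not include an example query. As a sketch, assuming a Prometheus server is scraping the kubelet targets, you can aggregate the metric by node:

```
# Hedged example: total running Pods reported by each kubelet, queried through
# the Prometheus HTTP API. PROMETHEUS_URL is a placeholder for your endpoint.
curl -s "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=sum(kubelet_running_pods) by (instance)'
```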
kubelet_running_containers¶
| Description | Indicates the number of containers per node. You can query for a specific container state, for example `created`. |
|---|---|
| Example query | `kubelet_running_containers{container_state="created"}` |
| Usage | If the number of containers is unexpected on a node, check your Node Affinity or Node Selector rules to verify the scheduling of Pods for the appropriate nodes. |
kubelet_runtime_operations_total¶
| Description | Provides the total count of runtime operations, organized by type. |
|---|---|
| Example query | `kubelet_runtime_operations_total{operation_type="create_container"}` |
| Usage | An increase in runtime operations duration and/or runtime operations errors can indicate problems with the container runtime on the node. |
kubelet_runtime_operations_errors_total¶
| Description | Displays the number of errors in runtime operations. Monitor this metric to learn of issues on a node. |
|---|---|
| Usage | An increase in runtime operations duration and/or runtime operations errors can indicate problems with the container runtime on the node. |
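No example query is provided for this metric. A hedged sketch for spotting error spikes, assuming a reachable Prometheus endpoint, might look as follows:

```
# Hedged example: per-node error rate of kubelet runtime operations over 5 minutes.
# PROMETHEUS_URL is a placeholder for your Prometheus endpoint.
curl -s "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=sum(rate(kubelet_runtime_operations_errors_total[5m])) by (instance, operation_type)'
```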
kubelet_runtime_operations_duration_seconds_bucket¶
| Description | Reflects the time required for each runtime operation. |
|---|---|
| Example query | The following query checks the 99th percentile of the time taken for various runtime operations: `histogram_quantile(0.99, sum(rate(kubelet_runtime_operations_duration_seconds_bucket{instance=~".*"}[5m])) by (instance, operation_type, le))` |
| Usage | An increase in runtime operations duration and/or runtime operations errors can indicate problems with the container runtime on the node. |
Kube Proxy¶
Kube Proxy runs on each node in an MKE cluster. Once you have set up the MKE client bundle, you can view the available Kube Proxy metrics for each node using the commands detailed below:
Note
The Kube Proxy metrics are only available when Kube Proxy is enabled in the MKE configuration and is running in either ipvs or iptables mode.
Obtain the name of the first available node in your MKE cluster:
NODE_NAME=$(kubectl get node | sed -n '2 p' | awk '{print $1}')
Ping the Kube Proxy metrics endpoint on the chosen node:
kubectl get --raw /api/v1/nodes/${NODE_NAME}:10249/proxy/metrics
Note
Specify port 10249, as this is the port on which Kube Proxy metrics are exposed.
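As with the kubelet endpoint, you can filter the output for a single metric family, for example:

```
# Filter the Kube Proxy metrics output for rule-synchronization latency data.
kubectl get --raw /api/v1/nodes/${NODE_NAME}:10249/proxy/metrics | grep kubeproxy_sync_proxy_rules
```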
The following are a number of key Kube Proxy metrics:
kube_proxy_nodes¶
| Description | Reflects the total number of Kube Proxy nodes, which should correlate with the number of nodes in the cluster. |
|---|---|
| Example query | `sum(up{job="kube-proxy-nodes"})` |
| Usage | If the number of kube-proxy instances decreases unexpectedly, check the nodes for connectivity issues. |
rest_client_request_duration_seconds_bucket¶
| Description | Reflects the latency of client requests, in seconds. Such information can be useful in determining whether your cluster is experiencing performance degradation. |
|---|---|
| Example query | The following query illustrates the latency for all POST requests: `rest_client_request_duration_seconds_bucket{verb="POST"}` |
| Usage | Review Kube Proxy logs on affected nodes to uncover any potential errors or timeouts. |
kubeproxy_sync_proxy_rules_duration_seconds_bucket¶
| Description | Displays the latency, in seconds, of Kube Proxy network rule synchronization, which runs continually across the nodes. A consistently increasing measurement can result in Kube Proxy rules being out of sync across the nodes. |
|---|---|
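The table above does not include an example query. A sketch of a 99th percentile latency check, modeled on the other histogram queries in this section and assuming a reachable Prometheus endpoint, might look as follows:

```
# Hedged example: 99th percentile of Kube Proxy rule-sync latency per node.
# PROMETHEUS_URL is a placeholder; the job label mirrors the other Kube Proxy queries.
curl -s "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(kubeproxy_sync_proxy_rules_duration_seconds_bucket{job="kube-proxy-nodes"}[5m])) by (instance, le))'
```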
rest_client_requests_total¶
| Description | Monitors the HTTP response codes for the client requests that Kube Proxy makes. An increase in 5xx response codes can indicate issues with Kube Proxy. |
|---|---|
| Example query | The following query presents the number of 5xx response codes received by Kube Proxy: `rest_client_requests_total{job="kube-proxy-nodes",code=~"5.."}` |
| Usage | Review Kube Proxy logs on affected nodes to obtain details of the error responses. |
Kube Controller Manager¶
Kube Controller Manager is a collection of different Kubernetes controllers whose primary task is to monitor changes in the state of various Kubernetes objects. It runs on all manager nodes in an MKE cluster.
Key Kube Controller Manager metrics are detailed as follows:
rest_client_request_duration_seconds_bucket¶
| Description | Reflects the latency of calls to the API server, in seconds. Such information can be useful in determining whether your cluster is experiencing degraded performance. |
|---|---|
| Example query | The following query displays the 99th percentile latencies for requests to the API server: `histogram_quantile(0.99, sum(rate(rest_client_request_duration_seconds_bucket{job="kube_controller_manager_nodes"}[5m])) by (url, le))` |
| Usage | If the metrics are abnormal, review the Kube Controller Manager logs on the affected nodes. |
rest_client_requests_total¶
| Description | Presents the total number of HTTP requests made by Kube Controller Manager, segmented by HTTP response code. A sudden increase in requests, or an increase in requests with error response codes, can indicate issues with the cluster. |
|---|---|
| Example query | The following query displays the rate of successful HTTP requests (those with 2xx response codes): `sum(rate(rest_client_requests_total{job="kube_controller_manager_nodes",code=~"2.."}[5m]))` |
| Usage | If the metrics are abnormal, review the Kube Controller Manager logs on the affected nodes. |
process_cpu_seconds_total¶
| Description | Measures the total CPU time spent by a Kube Controller Manager instance. |
|---|---|
| Example query | `rate(process_cpu_seconds_total{job="kube_controller_manager_nodes"}[5m])` |
process_resident_memory_bytes¶
| Description | Measures the amount of resident memory used by each Kube Controller Manager instance. |
|---|---|
| Example query | `rate(process_resident_memory_bytes{job="kube_controller_manager_nodes"}[5m])` |
Kube Apiserver¶
The Kube API server is the core of the Kubernetes control plane. It provides a means for obtaining information on Kubernetes objects and is also used to modify the state of API objects. MKE runs an instance of the Kube API server on each manager node.
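Because the MKE client bundle communicates with the Kube API server directly, you can also retrieve its metrics endpoint without going through a node proxy, for example:

```
# Retrieve the API server metrics directly and filter for request totals.
kubectl get --raw /metrics | grep apiserver_request_total
```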
The following are a number of key Kube Apiserver metrics:
apiserver_request_duration_seconds_bucket¶
| Description | Measures the latency for each request to the Kube API server. |
|---|---|
| Example query | The following query shows how latency is distributed across different HTTP verbs: `histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{job="kubernetes-apiservers"}[5m])) by (verb, le))` |
apiserver_request_total¶
| Description | Measures the total traffic to the API server, the resource being accessed, and whether the request is successful. |
|---|---|
| Example query | The following query measures the rate of requests that return 2xx HTTP response codes. You can modify the query to measure the rate of error requests. `sum(rate(apiserver_request_total{job="kubernetes-apiservers",code=~"2.."}[5m]))` |
Calico¶
Calico is the default networking plugin for MKE. Specifically, MKE gathers metrics from both the Felix and Kube-Controllers Calico components.
Refer to the official Calico documentation on Prometheus statistics for detailed information on Felix and kube controllers metrics.
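As a hedged sketch, once the Felix metrics are scraped you can check a simple Felix gauge through the Prometheus HTTP API. The felix_active_local_endpoints metric name comes from the Calico documentation and may vary by version, and PROMETHEUS_URL is a placeholder:

```
# Hedged example: active local endpoints reported by Felix on each node.
# PROMETHEUS_URL is a placeholder for your Prometheus endpoint.
curl -s "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=felix_active_local_endpoints'
```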
RethinkDB¶
MKE deploys RethinkDB Exporter on all manager nodes, to allow metrics scraping from RethinkDB. The RethinkDB Exporter exports most of the statistics from the RethinkDB stats table.
You can monitor the read and write throughput for each RethinkDB replica by reviewing the following metrics:
table_docs_per_second¶
| Description | Current number of document reads and writes per second from the table. |
|---|---|
cluster_docs_per_second¶
| Description | Current number of document reads and writes per second from the cluster. |
|---|---|
server_docs_per_second¶
| Description | Current number of document reads and writes per second from the server. |
|---|---|
These metrics are organized into read/write categories and by replica. For example, to view all the table read metrics on a specific node you can run the following query:
table_docs_per_second{operation="read", instance="instance_name"}
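To run such a query outside the MKE web UI, you can again go through the Prometheus HTTP API, assuming a reachable endpoint (PROMETHEUS_URL is a placeholder):

```
# Hedged example: per-replica read throughput for RethinkDB tables, queried
# through the Prometheus HTTP API. PROMETHEUS_URL is a placeholder.
curl -s "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=sum(table_docs_per_second{operation="read"}) by (instance)'
```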
NodeLocalDNS¶
MKE deploys NodeLocalDNS on every node, with the Prometheus plugin enabled. You can scrape NodeLocalDNS metrics on port 9253, which provides the regular CoreDNS metrics, including the standard RED (Rate, Errors, Duration) metrics:
- queries
- durations
- error counts
The metrics path is fixed to /metrics.
| Metric | Description |
|---|---|
|  | Information to build CoreDNS. |
|  | Number of entries in the cache. |
|  | Cache size. |
|  | Counter of cache hits by cache type. |
|  | Counter of cache misses. |
|  | Total number of DNS resolution requests in different dimensions. |
|  | Histogram of DNS request duration (bucket). |
|  | Histogram of DNS request duration (count). |
|  | Histogram of DNS request duration (sum). |
|  | Histogram of the size of DNS request (bucket). |
|  | Histogram of the size of DNS request (count). |
|  | Histogram of the size of DNS request (sum). |
|  | Number of DNS requests. |
|  | Histogram of the size of DNS response (bucket). |
|  | Histogram of the size of DNS response (count). |
|  | Histogram of the size of DNS response (sum). |
|  | DNS response codes and number of DNS response codes. |
|  | Number of cache hits for each protocol and data flow. |
|  | Number of cache misses for each protocol and data flow. |
|  | Unhealthy upstream count. |
|  | Count of failed health checks per upstream. |
|  | Number of requests rejected due to excessive concurrent requests. |
|  | Histogram of forward request duration (bucket). |
|  | Histogram of forward request duration (count). |
|  | Histogram of forward request duration (sum). |
|  | Number of requests for each data flow. |
|  | Number of responses to each data flow. |
|  | Histogram of health request duration (bucket). |
|  | Histogram of health request duration (count). |
|  | Histogram of health request duration (sum). |
|  | Number of health request failures. |
|  | Timestamp of the last reload of the host file. |
|  | Histogram of DNS programming duration (bucket). |
|  | Histogram of DNS programming duration (count). |
|  | Histogram of DNS programming duration (sum). |
|  | Number of localhost requests. |
|  | Number of nodecache setup errors. |
|  | Number of responses for each Zone and Rcode. |
|  | Number of DNS requests. |
|  | Number of requests with the DNSSEC OK (DO) bit set. |
|  | Number of requests with the DO bit set. |
|  | Number of requests for each Zone and Type. |
|  | Total number of panics. |
|  | Whether a plugin is enabled. |
|  | Number of last reload failures. |
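As a sketch, you can check the overall DNS request rate that NodeLocalDNS reports. The coredns_dns_requests_total metric name is the standard CoreDNS name and is assumed to be present here, and PROMETHEUS_URL is a placeholder for your Prometheus endpoint:

```
# Hedged example: DNS request rate reported by NodeLocalDNS on each node.
# coredns_dns_requests_total is a standard CoreDNS metric name and is assumed
# to be present; PROMETHEUS_URL is a placeholder for your Prometheus endpoint.
curl -s "${PROMETHEUS_URL}/api/v1/query" \
  --data-urlencode 'query=sum(rate(coredns_dns_requests_total[5m])) by (instance)'
```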