MSR metrics exposed for Prometheus

This section describes every metric that MSR exposes. For key metrics, the Usage information explains how to interpret the data and how to use it to troubleshoot your MSR deployment.

Registry metrics

Registry metrics capture essential MSR functionality, such as repository count, tag count, push events, and pull events.

Metrics often use labels to distinguish specific attributes of the measured item. The following table lists the possible values for the labels associated with registry metrics:

Label         Possible values

namespace     Namespace name

repository    Repository name
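
To illustrate how such labels appear on a scraped sample, the following minimal Python sketch parses one line in the Prometheus text exposition format. The sample lines, including the admin namespace and alpine repository, are made-up values. The sketch also assumes that the metric names are exposed exactly as they are listed on this page, which may not hold for your deployment. It ignores optional timestamps and does not unescape label values.

```python
import re

# Matches: name{label="value",...} number   (the label part is optional)
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition-format line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    # A metric without labels has no {...} part, so fall back to "".
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


# Hypothetical sample lines, for illustration only.
name, labels, value = parse_sample(
    'pull_count_per_repo{namespace="admin",repository="alpine"} 42'
)
print(name, labels, value)
```

A metric that has no labels, such as repos, yields an empty label dictionary.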

repos

Description

Current number of repositories

Metric type

Gauge

Labels

None

public_repos

Description

Current number of public repositories

Metric type

Gauge

Labels

None

private_repos

Description

Current number of private repositories

Metric type

Gauge

Labels

None

pull_count

Description

Running total of image pulls

Metric type

Counter

Labels

None

pull_count_per_repo

Description

Running total of image pulls per repository

Metric type

Counter

Labels

namespace, repository

push_count

Description

Running total of image pushes

Metric type

Counter

Labels

None

push_count_per_repo

Description

Running total of image pushes per repository

Metric type

Counter

Labels

namespace, repository

tags

Description

Current number of image tags

Metric type

Gauge

Labels

None

Usage

If your tag count increases beyond your needs, you can enable tag pruning policies on individual repositories to manage the growth effectively.

Note

Tag pruning removes image tags but does not delete the associated data blobs. To completely remove unwanted image tags and free up cluster resources, you must also schedule garbage collection.

tags_per_repo

Description

Current number of image tags per repository

Metric type

Gauge

Labels

namespace, repository

Usage

If an individual repository tag count increases beyond your needs, you can enable tag pruning policies to manage the growth effectively.

Note

Tag pruning removes image tags but does not delete the associated data blobs. To completely remove unwanted image tags and free up cluster resources, you must also schedule garbage collection.

pruning_policy_enabled_repos

Description

Current number of repositories for which at least one pruning policy is enabled

Metric type

Gauge

Labels

None

Usage

To assess whether pruning policy usage should be increased across your cluster, compare this number with the total number of repositories.
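
The comparison described above can be expressed as a simple ratio. The following Python sketch assumes that you have already read the current values of pruning_policy_enabled_repos and repos from your monitoring system. The function name and the input values are hypothetical.

```python
def pruning_coverage(pruning_enabled_repos, total_repos):
    """Return the fraction of repositories with at least one pruning policy."""
    if total_repos <= 0:
        return None  # No repositories, so coverage is undefined.
    return pruning_enabled_repos / total_repos


# Made-up values standing in for pruning_policy_enabled_repos and repos.
coverage = pruning_coverage(25, 100)
print(f"{coverage:.0%} of repositories have a pruning policy enabled")
```

A low ratio suggests that many repositories rely on manual tag cleanup. What counts as low depends on your own retention requirements.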

Mirroring metrics

Mirroring metrics track the number of push mirroring and poll mirroring jobs, categorized by job status.

Taken together, these metrics show how your mirroring jobs are performing. For example, a decrease in poll_mirror_running accompanied by an increase in poll_mirror_done indicates that poll mirroring jobs are completing and that your poll mirroring configuration is working.
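
As a rough illustration of that check, the following Python sketch compares two consecutive scrapes of the poll mirroring metrics. It assumes that each scrape has already been collected into a dictionary keyed by metric name. The function name and the values are hypothetical.

```python
def poll_mirroring_progress(previous, current):
    """Compare two scrapes of the poll mirroring metrics.

    Both arguments map metric names to values, for example
    {"poll_mirror_running": 4, "poll_mirror_done": 10}.
    """
    running_change = current["poll_mirror_running"] - previous["poll_mirror_running"]
    done_change = current["poll_mirror_done"] - previous["poll_mirror_done"]
    return {
        "running_change": running_change,
        "done_change": done_change,
        # Jobs left the running state while the done counter advanced.
        "jobs_completing": running_change < 0 and done_change > 0,
    }


# Made-up values from two consecutive scrapes.
earlier = {"poll_mirror_running": 4, "poll_mirror_done": 10}
later = {"poll_mirror_running": 2, "poll_mirror_done": 12}
print(poll_mirroring_progress(earlier, later))
```

The same comparison applies to the push mirroring metrics if you substitute the push_mirror_running and push_mirror_done names.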

poll_mirror_waiting

Description

Current number of poll mirroring jobs with a ‘waiting’ status

Metric type

Gauge

Labels

None

Usage

If a significant number of poll mirroring jobs are in the waiting state, consider updating the Jobrunner capacity configuration so that more mirroring jobs can run in parallel.

poll_mirror_running

Description

Current number of poll mirroring jobs with a ‘running’ status

Metric type

Gauge

Labels

None

poll_mirror_done

Description

Running total of poll mirroring jobs with a ‘done’ status

Metric type

Counter

Labels

None

poll_mirror_errored

Description

Running total of poll mirroring jobs with an ‘errored’ status

Metric type

Counter

Labels

None

Usage

If there is a sudden surge in the number of poll mirroring jobs in the errored state, investigate the Jobrunner logs to troubleshoot the issue.

push_mirror_waiting

Description

Current number of push mirroring jobs with a ‘waiting’ status

Metric type

Gauge

Labels

None

Usage

If a significant number of push mirroring jobs are in the waiting state, consider updating the Jobrunner capacity configuration so that more mirroring jobs can run in parallel.

push_mirror_running

Description

Current number of push mirroring jobs with a ‘running’ status

Metric type

Gauge

Labels

None

push_mirror_done

Description

Running total of push mirroring jobs with a ‘done’ status

Metric type

Counter

Labels

None

push_mirror_errored

Description

Running total of push mirroring jobs with an ‘errored’ status

Metric type

Counter

Labels

None

Usage

If there is a sudden surge in the number of push mirroring jobs in the errored state, investigate the Jobrunner logs to troubleshoot the issue.
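
Because poll_mirror_errored and push_mirror_errored are counters, a surge shows up as an unusually large increase between scrapes rather than as a large absolute value. The following Python sketch shows one way to detect this, assuming you have a chronological list of counter values. The function names and both thresholds (factor and min_increase) are hypothetical and should be tuned to your environment.

```python
def counter_increases(samples):
    """Return the increase between consecutive counter samples.

    A drop in value is treated as a counter reset (for example, after a
    restart), in which case the new value itself is taken as the increase.
    """
    increases = []
    for previous, current in zip(samples, samples[1:]):
        increases.append(current if current < previous else current - previous)
    return increases


def is_surge(samples, factor=3.0, min_increase=5):
    """Report whether the latest increase far exceeds the earlier average."""
    increases = counter_increases(samples)
    if len(increases) < 2:
        return False  # Not enough history to compare against.
    *history, latest = increases
    baseline = sum(history) / len(history)
    return latest >= min_increase and latest > factor * baseline


# Made-up counter values from four consecutive scrapes.
print(is_surge([10, 11, 12, 30]))
```

The min_increase floor keeps a quiet cluster from raising an alarm when the counter moves from one error to two.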

Authentication metrics

Authentication metrics monitor the count of CLI logins and active web UI sessions.

cli_login_count

Description

Running total of CLI logins made

Metric type

Counter

Labels

None

Usage

If you observe a sharp decline in CLI logins, investigate the Garant logs to troubleshoot the issue.

ui_sessions

Description

Current number of active user interface sessions

Metric type

Gauge

Labels

None

Usage

If you observe a sharp decline in active UI sessions, investigate the eNZi logs to troubleshoot the issue.
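
What qualifies as a sharp decline depends on your environment. As a minimal sketch, the following Python function flags a drop in a gauge such as ui_sessions when the value falls by more than a chosen fraction between two scrapes. The function name and the 50 percent default are assumptions for illustration.

```python
def sharp_decline(previous, current, max_drop=0.5):
    """Return True if a gauge fell by more than max_drop (a fraction)."""
    if previous <= 0:
        return False  # Nothing to decline from.
    return (previous - current) / previous > max_drop


# Made-up ui_sessions values from two consecutive scrapes.
print(sharp_decline(40, 10))
```

For a counter such as cli_login_count, apply the same comparison to the increase per interval rather than to the raw value, because a counter only grows.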

RethinkDB metrics

The metrics for RethinkDB are extracted from the system statistics and current issues tables, providing a broad range of information about your RethinkDB deployment.

Metrics often use labels to distinguish specific attributes of the measured item. The following table lists the possible values for the labels associated with RethinkDB metrics:

Label        Possible values

db           Database name

table        Table name

server       Server name

operation    read, written

cluster_client_connections

Description

Current number of connections from the cluster

Metric type

Gauge

Labels

None

cluster_docs_per_second

Description

Current number of document reads and writes per second from the cluster

Metric type

Gauge

Labels

operation

server_client_connections

Description

Current number of client connections to the server

Metric type

Gauge

Labels

server

server_queries_per_second

Description

Current number of queries per second from the server

Metric type

Gauge

Labels

server

server_docs_per_second

Description

Current number of document reads and writes per second from the server

Metric type

Gauge

Labels

server, operation

table_docs_per_second

Description

Current number of document reads and writes per second from the table

Metric type

Gauge

Labels

db, table, operation

Usage

If certain tables show a high volume of reads or writes, distribute the primary replicas of those tables evenly across the RethinkDB servers. Doing so balances the cluster load and can improve performance across the system.
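
To find the tables that carry the most load, you can sum the read and written series of table_docs_per_second for each table. The following Python sketch assumes the samples have already been collected as pairs of labels and value. The function name, the exampledb database name, and all values are made up for illustration.

```python
from collections import defaultdict


def busiest_tables(samples, top=3):
    """Rank tables by combined document reads and writes per second.

    Each sample is a (labels, value) pair, where labels holds the db,
    table, and operation labels of one table_docs_per_second series.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        totals[(labels["db"], labels["table"])] += value
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


# Hypothetical samples; the database name and values are invented.
samples = [
    ({"db": "exampledb", "table": "tags", "operation": "read"}, 120.0),
    ({"db": "exampledb", "table": "tags", "operation": "written"}, 30.0),
    ({"db": "exampledb", "table": "blobs", "operation": "read"}, 40.0),
    ({"db": "exampledb", "table": "blobs", "operation": "written"}, 5.0),
    ({"db": "exampledb", "table": "repositories", "operation": "read"}, 10.0),
]
for (db, table), total in busiest_tables(samples):
    print(f"{db}.{table}: {total} docs/s")
```

The tables at the top of this ranking are the first candidates for redistributing primary replicas.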

table_rows_count

Description

Current number of rows in the table

Metric type

Gauge

Labels

db, table

tablereplica_docs_per_second

Description

Current number of document reads and writes per second from the table replica

Metric type

Gauge

Labels

db, table, server, operation

tablereplica_cache_bytes

Description

Table replica cache size, in bytes

Metric type

Gauge

Labels

db, table, server

tablereplica_io

Description

Table replica byte reads and writes per second

Metric type

Gauge

Labels

db, table, server, operation

tablereplica_data_bytes

Description

Table replica size, in stored bytes

Metric type

Gauge

Labels

db, table, server

log_write_issues

Description

Current number of log write issues

Metric type

Gauge

Labels

None

Usage

Log write issues occur when RethinkDB fails to write to its log file. Refer to the System current issues table in the official RethinkDB documentation for more information.

name_collision_issues

Description

Current number of name collision issues

Metric type

Gauge

Labels

None

Usage

Name collision issues arise when multiple servers, databases, or tables within the same database are assigned identical names. Refer to the System current issues table in the official RethinkDB documentation for more information.

outdated_index_issues

Description

Current number of outdated index issues

Metric type

Gauge

Labels

None

Usage

Outdated index issues occur when indexes created with an older version of RethinkDB must be rebuilt because of changes in how the RethinkDB Query Language (ReQL) handles indexing. Refer to the System current issues table in the official RethinkDB documentation for more information.

total_availability_issues

Description

Current number of total availability issues

Metric type

Gauge

Labels

None

Usage

Total availability issues occur when a table within the RethinkDB cluster is missing at least one replica. Refer to the System current issues table in the official RethinkDB documentation for more information.

memory_availability_issues

Description

Current number of memory availability issues

Metric type

Gauge

Labels

None

Usage

Memory availability issues arise when a page fault occurs on a RethinkDB server and the system starts using swap space. Refer to the System current issues table in the official RethinkDB documentation for more information.

connectivity_issues

Description

Current number of connectivity issues

Metric type

Gauge

Labels

None

Usage

Connectivity issues occur when some servers in a RethinkDB cluster cannot connect to or communicate with all of the other servers in the cluster. Refer to the System current issues table in the official RethinkDB documentation for more information.

other_issues

Description

Current number of uncategorized issues

Metric type

Gauge

Labels

None

Usage

Refer to your RethinkDB logs to diagnose the issue.

Note

A value of other_issues greater than zero indicates that the existing set of metrics needs to be expanded to cover additional issue types. Contact Mirantis to report that other_issues are being tracked in your cluster.

table_size

Description

Table size in MB

Metric type

Gauge

Labels

db, table

Usage

Unchecked growth of a specific table in your MSR deployment can indicate a problem with the corresponding functionality. For example, if the tags table grows beyond expectations, your pruning policies, which manage tag retention, may not be functioning properly. Similarly, if the blobs table grows more than anticipated, there may be a problem with garbage collection, which removes unused data blobs.
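
One way to judge whether a table is growing beyond expectations is to convert successive table_size samples into an average growth rate. The following Python sketch assumes a chronological list of pairs, each holding a Unix timestamp in seconds and a size in MB, taken from a single table_size series. The function name and the sample values are hypothetical.

```python
SECONDS_PER_DAY = 86400


def growth_mb_per_day(samples):
    """Return average table growth in MB per day.

    samples is a list of (unix_timestamp_seconds, size_mb) pairs in
    chronological order, taken from one table_size series.
    """
    if len(samples) < 2:
        return None  # A single sample carries no trend.
    (start_time, start_size), (end_time, end_size) = samples[0], samples[-1]
    elapsed_days = (end_time - start_time) / SECONDS_PER_DAY
    if elapsed_days <= 0:
        return None
    return (end_size - start_size) / elapsed_days


# Made-up samples taken one day apart.
print(growth_mb_per_day([(0, 100.0), (86400, 150.0), (172800, 200.0)]))
```

Comparing this rate before and after enabling pruning policies or scheduling garbage collection can help confirm whether those processes are having the intended effect.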

Prometheus scrape metrics

Prometheus scrape metrics capture the duration of each metrics scrape and the number of errors returned during the process.

scrape_latency

Description

Duration of metrics collection

Metric type

Gauge

Labels

None

Usage

Elevated scrape latency can indicate that your Prometheus server needs additional resources.

scrape_errors

Description

Current number of errors that occurred during metrics collection

Metric type

Gauge

Labels

None

Usage

Because MSR metrics depend heavily on RethinkDB, scrape errors are most likely caused by issues with RethinkDB itself. To diagnose and troubleshoot the problem, refer to the logs of your RethinkDB deployment.
