MSR metrics exposed for Prometheus¶

This section details every metric that MSR exposes for Prometheus. For key metrics, a Usage entry explains how to interpret the data and use it to troubleshoot your MSR deployment.

Registry metrics¶

Registry metrics capture essential MSR functionality, such as repository count, tag count, push events, and pull events.

Metrics often incorporate labels to differentiate specific attributes of the measured item. The table below provides a list of possible values for the labels associated with registry metrics:

Label	Possible values
`namespace`	Namespace name
`repository`	Repository name

repos¶

Description	Current number of repositories
Metric type	Gauge
Labels	None

public_repos¶

Description	Current number of public repositories
Metric type	Gauge
Labels	None

private_repos¶

Description	Current number of private repositories
Metric type	Gauge
Labels	None

pull_count¶

Description	Running total of image pulls
Metric type	Counter
Labels	None

pull_count_per_repo¶

Description	Running total of image pulls per repository
Metric type	Counter
Labels	`namespace`, `repository`
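
To illustrate how the `namespace` and `repository` labels appear in scraped output, the following minimal Python sketch parses a single sample line in the Prometheus text exposition format. The sample line and its values are invented for the example, and the sketch assumes the metric is exposed under the bare name used in this reference; your deployment may add a prefix.

```python
import re

# One sample line: metric name, optional {labels}, then a numeric value.
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# One label pair inside the braces, for example namespace="admin".
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Return (metric_name, labels, value) for one exposition-format line.

    Escaped characters inside label values are kept as written, which is
    enough for plain namespace and repository names.
    """
    match = _SAMPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(_LABEL.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


# Hypothetical sample line; the namespace, repository, and value are made up.
name, labels, value = parse_sample(
    'pull_count_per_repo{namespace="admin",repository="alpine"} 42'
)
```

Metrics without labels, such as `repos`, parse the same way and yield an empty label dictionary.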

push_count¶

Description	Running total of image pushes
Metric type	Counter
Labels	None

push_count_per_repo¶

Description	Running total of image pushes per repository
Metric type	Counter
Labels	`namespace`, `repository`
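
Because the push and pull metrics are counters, the raw value matters less than how much it changed between two scrapes. The sketch below shows one way to derive an increase and a per-second rate from two readings. The function names are hypothetical, and the reset handling is an assumption modelled on the usual behaviour of Prometheus counters, which restart from zero when the exporting process restarts.

```python
def counter_increase(previous, current):
    """Increase of a counter between two scrapes, tolerating one reset."""
    if current < previous:
        # The counter restarted from zero, so its whole current value is new.
        return current
    return current - previous


def per_second_rate(previous, current, interval_seconds):
    """Average per-second rate of a counter over one scrape interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return counter_increase(previous, current) / interval_seconds
```

For example, two readings of `push_count` taken 60 seconds apart, 100 and then 130, correspond to 30 pushes, or 0.5 pushes per second.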

tags¶

Description	Current number of image tags
Metric type	Gauge
Labels	None
Usage	If your tag count grows beyond your needs, enable tag pruning policies on individual repositories to manage the growth. Note: Tag pruning removes image tags, but it does not remove the associated data blobs. To fully remove unwanted images and free up cluster resources, you must also schedule garbage collection.

tags_per_repo¶

Description	Current number of image tags per repository
Metric type	Gauge
Labels	`namespace`, `repository`
Usage	If the tag count of an individual repository grows beyond your needs, enable tag pruning policies on that repository to manage the growth. Note: Tag pruning removes image tags, but it does not remove the associated data blobs. To fully remove unwanted images and free up cluster resources, you must also schedule garbage collection.
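
As a rough illustration of how `tags_per_repo` could be used to find pruning candidates, the sketch below lists the repositories whose tag count exceeds a limit. The function name, the input shape (a dictionary keyed by namespace and repository), and the limit itself are assumptions chosen for the example, not MSR defaults.

```python
def repos_over_tag_limit(tags_per_repo, limit):
    """Return (namespace, repository, tag_count) tuples above the limit.

    tags_per_repo maps (namespace, repository) to the current gauge value.
    The result is sorted with the largest tag count first.
    """
    over = [
        (namespace, repository, count)
        for (namespace, repository), count in tags_per_repo.items()
        if count > limit
    ]
    return sorted(over, key=lambda item: item[2], reverse=True)


# Hypothetical readings; the names and counts are made up.
readings = {
    ("admin", "alpine"): 12,
    ("ci", "nightly"): 950,
    ("ci", "release"): 310,
}
candidates = repos_over_tag_limit(readings, limit=300)
```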

pruning_policy_enabled_repos¶

Description	Current number of repositories for which at least one pruning policy is enabled
Metric type	Gauge
Labels	None
Usage	To assess whether pruning policy usage should be increased across your cluster, compare this number with the total number of repositories.
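
The comparison described above amounts to a simple ratio of `pruning_policy_enabled_repos` to `repos`. A minimal sketch, with a hypothetical function name and made-up readings:

```python
def pruning_coverage(pruning_policy_enabled_repos, repos):
    """Fraction of repositories that have at least one pruning policy enabled."""
    if repos == 0:
        # No repositories exist, so there is nothing to cover.
        return 0.0
    return pruning_policy_enabled_repos / repos


# Hypothetical readings: 25 of 100 repositories have a pruning policy.
coverage = pruning_coverage(25, 100)
```

What counts as sufficient coverage depends on how your repositories are used, so any target value is a local decision rather than something this reference prescribes.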

Mirroring metrics¶

Mirroring metrics track the number of push and pull mirroring jobs, categorized by job status.

Considered as a whole, these metrics offer real-time insight into the performance of your mirroring jobs. For example, a decrease in `poll_mirror_running` that coincides with an increase in `poll_mirror_done` indicates that running poll mirroring jobs are completing and that your poll mirroring configuration is functioning properly.
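
The check described above can be sketched as a comparison of two snapshots of the poll mirroring metrics. The function name and the snapshot dictionaries are hypothetical; only the metric names come from this reference.

```python
def poll_mirroring_progressing(earlier, later):
    """True when running jobs went down while completed jobs went up.

    Both arguments map metric names to the values read in one scrape.
    """
    running_dropped = later["poll_mirror_running"] < earlier["poll_mirror_running"]
    done_grew = later["poll_mirror_done"] > earlier["poll_mirror_done"]
    return running_dropped and done_grew


# Hypothetical snapshots taken a few minutes apart.
before = {"poll_mirror_running": 8, "poll_mirror_done": 120}
after = {"poll_mirror_running": 3, "poll_mirror_done": 125}
healthy = poll_mirroring_progressing(before, after)
```

A `False` result is not necessarily a problem, for instance when no jobs were running in either snapshot, so it is best read together with the `waiting` and `errored` metrics.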

poll_mirror_waiting¶

Description	Current number of poll mirroring jobs with a `waiting` status
Metric type	Gauge
Labels	None
Usage	If a significant number of poll mirroring jobs are in the `waiting` state, consider updating the Jobrunner capacity configuration to allow more mirroring jobs to run in parallel.

poll_mirror_running¶

Description	Current number of poll mirroring jobs with a `running` status
Metric type	Gauge
Labels	None

poll_mirror_done¶

Description	Running total of poll mirroring jobs with a `done` status
Metric type	Counter
Labels	None

poll_mirror_errored¶

Description	Running total of poll mirroring jobs with an `errored` status
Metric type	Counter
Labels	None
Usage	If there is a sudden surge in the number of poll mirroring jobs in the `errored` state, investigate the Jobrunner logs to troubleshoot the issue.

push_mirror_waiting¶

Description	Current number of push mirroring jobs with a `waiting` status
Metric type	Gauge
Labels	None
Usage	If a significant number of push mirroring jobs are in the `waiting` state, consider updating the Jobrunner capacity configuration to allow more mirroring jobs to run in parallel.

push_mirror_running¶

Description	Current number of push mirroring jobs with a `running` status
Metric type	Gauge
Labels	None

push_mirror_done¶

Description	Running total of push mirroring jobs with a `done` status
Metric type	Counter
Labels	None

push_mirror_errored¶

Description	Running total of push mirroring jobs with an `errored` status
Metric type	Counter
Labels	None
Usage	If there is a sudden surge in the number of push mirroring jobs in the `errored` state, investigate the Jobrunner logs to troubleshoot the issue.
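
This reference does not define what qualifies as a sudden surge. One possible heuristic, sketched below, compares the most recent increase of an `errored` counter with the average of the earlier increases. The function name, the factor of 3, and the floor of one error per interval are all assumptions made for illustration and should be tuned to your own workload.

```python
def is_error_surge(samples, factor=3.0):
    """Flag a surge in an errored-jobs counter.

    samples holds counter values from consecutive scrapes, oldest first.
    The latest increase is compared with the mean of the earlier increases.
    """
    if len(samples) < 3:
        # At least two increases are needed: one as baseline, one to judge.
        return False
    # Negative differences come from counter resets and are treated as zero.
    increases = [max(b - a, 0) for a, b in zip(samples, samples[1:])]
    *history, latest = increases
    baseline = sum(history) / len(history)
    # The floor of 1.0 avoids flagging a lone error after a quiet period.
    return latest > factor * max(baseline, 1.0)
```

The same function applies to `poll_mirror_errored` and `push_mirror_errored`, since both are counters.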

Authentication metrics¶

Authentication metrics monitor the count of CLI logins and active web UI sessions.

cli_login_count¶

Description	Running total of CLI logins
Metric type	Counter
Labels	None
Usage	If you observe a sharp decline in CLI logins, investigate the Garant logs to troubleshoot the issue.

ui_sessions¶

Description	Current number of active user interface sessions
Metric type	Gauge
Labels	None
Usage	If you observe a sharp decline in active UI sessions, investigate the eNZi logs to troubleshoot the issue.
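
A sharp decline in a gauge such as `ui_sessions` can be expressed as a relative drop between two readings. The sketch below is one hypothetical way to do that; the function names and the 50 percent threshold are assumptions, not values taken from MSR.

```python
def relative_drop(previous, current):
    """Fraction by which a gauge fell between two readings (0.0 if it rose)."""
    if previous <= 0:
        # Nothing to fall from, so report no drop.
        return 0.0
    return max(previous - current, 0) / previous


def is_sharp_decline(previous, current, threshold=0.5):
    """True when the gauge lost at least the threshold fraction of its value."""
    return relative_drop(previous, current) >= threshold
```

For `cli_login_count`, which is a counter, the same comparison would need to be applied to the login rate per interval rather than to the raw value, because a counter never decreases in normal operation.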

RethinkDB metrics¶

The metrics for RethinkDB are extracted from the system statistics and current issues tables, providing a broad range of information about your RethinkDB deployment.

Metrics often incorporate labels to differentiate specific attributes of the measured item. The table below provides a list of possible values for the labels associated with RethinkDB metrics:

Label	Possible values
`db`	Database name
`table`	Table name
`server`	Server name
`operation`	`read`, `written`

cluster_client_connections¶

Description	Current number of client connections to the cluster
Metric type	Gauge
Labels	None

cluster_docs_per_second¶

Description	Current number of document reads and writes per second from the cluster
Metric type	Gauge
Labels	`operation`

server_client_connections¶

Description	Current number of client connections to the server
Metric type	Gauge
Labels	`server`

server_queries_per_second¶

Description	Current number of queries per second from the server
Metric type	Gauge
Labels	`server`

server_docs_per_second¶

Description	Current number of document reads and writes per second from the server
Metric type	Gauge
Labels	`server`, `operation`

table_docs_per_second¶

Description	Current number of document reads and writes per second from the table
Metric type	Gauge
Labels	`db`, `table`, `operation`
Usage	If certain tables show a high volume of reads or writes, distribute the primary replicas of those tables evenly across the RethinkDB servers. This balances the cluster load and improves performance across the system.
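
To spot the tables with the highest volume, the `read` and `written` samples of `table_docs_per_second` can be summed per table and ranked. The sketch below assumes the samples have already been parsed into label and value pairs; the function name, the database name `exampledb`, and all values are hypothetical.

```python
from collections import defaultdict


def busiest_tables(samples, top=3):
    """Rank tables by combined document reads and writes per second.

    samples is an iterable of (labels, value) pairs, where labels contains
    the db, table, and operation labels of one table_docs_per_second sample.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        totals[(labels["db"], labels["table"])] += value
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


# Hypothetical samples; the database name and rates are made up.
samples = [
    ({"db": "exampledb", "table": "tags", "operation": "read"}, 120.0),
    ({"db": "exampledb", "table": "tags", "operation": "written"}, 30.0),
    ({"db": "exampledb", "table": "blobs", "operation": "read"}, 10.0),
    ({"db": "exampledb", "table": "blobs", "operation": "written"}, 5.0),
]
ranking = busiest_tables(samples)
```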

table_rows_count¶

Description	Current number of rows in the table
Metric type	Gauge
Labels	`db`, `table`

tablereplica_docs_per_second¶

Description	Current number of document reads and writes per second from the table replica
Metric type	Gauge
Labels	`db`, `table`, `server`, `operation`

tablereplica_cache_bytes¶

Description	Table replica cache size, in bytes
Metric type	Gauge
Labels	`db`, `table`, `server`

tablereplica_io¶

Description	Table replica byte reads and writes per second
Metric type	Gauge
Labels	`db`, `table`, `server`, `operation`

tablereplica_data_bytes¶

Description	Table replica size, in stored bytes
Metric type	Gauge
Labels	`db`, `table`, `server`

log_write_issues¶

Description	Current number of log write issues
Metric type	Gauge
Labels	None
Usage	Log write issues occur when RethinkDB fails to write to its log file. Refer to the System current issues table in the official RethinkDB documentation for more information.

name_collision_issues¶

Description	Current number of name collision issues
Metric type	Gauge
Labels	None
Usage	Name collision issues arise when multiple servers, databases, or tables within the same database are assigned identical names. Refer to the System current issues table in the official RethinkDB documentation for more information.

outdated_index_issues¶

Description	Current number of outdated index issues
Metric type	Gauge
Labels	None
Usage	Outdated index issues occur when indexes that were created using an older version of RethinkDB need to be rebuilt due to changes in the indexing mechanism employed by the RethinkDB Query Language (ReQL). Refer to the System current issues table in the official RethinkDB documentation for more information.

total_availability_issues¶

Description	Current number of total availability issues
Metric type	Gauge
Labels	None
Usage	Total availability issues occur when a table within the RethinkDB cluster is missing at least one replica. Refer to the System current issues table in the official RethinkDB documentation for more information.

memory_availability_issues¶

Description	Current number of memory availability issues
Metric type	Gauge
Labels	None
Usage	Memory availability issues arise when a page fault occurs on a RethinkDB server and the system starts using swap space. Refer to the System current issues table in the official RethinkDB documentation for more information.

connectivity_issues¶

Description	Current number of connectivity issues
Metric type	Gauge
Labels	None
Usage	Connectivity issues occur when certain servers within a RethinkDB cluster are unable to connect to or communicate with all other servers in the cluster. Refer to the System current issues table in the official RethinkDB documentation for more information.

other_issues¶

Description	Current number of uncategorized issues
Metric type	Gauge
Labels	None
Usage	Refer to your RethinkDB logs to diagnose the issue. Note: An `other_issues` value greater than zero indicates that the existing set of metrics needs to be expanded to cover additional issue types. Please contact Mirantis and report that `other_issues` are being tracked in your cluster.

table_size¶

Description	Table size in MB
Metric type	Gauge
Labels	`db`, `table`
Usage	When a specific table in your MSR deployment grows unchecked, it may indicate a potential issue with the corresponding functionality. For instance, if the size of the `tags` table is increasing beyond expectations, it could be a sign that your pruning policies, which are responsible for managing tag retention, are not functioning properly. Similarly, if the `blobs` table is growing more than anticipated, it could suggest a problem with the garbage collection process, which is responsible for removing unused data blobs.
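
Whether a table is growing unchecked is easier to judge from a growth rate than from a single reading. The sketch below estimates the average growth of `table_size` in MB per day from the first and last of a series of readings. The function name and the sample values are hypothetical, and what counts as excessive growth depends on your deployment.

```python
SECONDS_PER_DAY = 86400


def growth_mb_per_day(samples):
    """Average growth of a table in MB per day.

    samples is a list of (unix_timestamp_seconds, size_mb) pairs, oldest
    first. Only the first and last readings are used.
    """
    (start_time, start_size), (end_time, end_size) = samples[0], samples[-1]
    elapsed_days = (end_time - start_time) / SECONDS_PER_DAY
    if elapsed_days <= 0:
        raise ValueError("samples must span a positive amount of time")
    return (end_size - start_size) / elapsed_days


# Hypothetical readings of table_size for one table, two days apart.
rate = growth_mb_per_day([(0, 100.0), (86400, 135.0), (172800, 160.0)])
```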

Prometheus scrape metrics¶

Prometheus scrape metrics capture the duration of each metrics scrape and the number of errors returned during the process.

scrape_latency¶

Description	Duration of metrics collection
Metric type	Gauge
Labels	None
Usage	Elevated metrics scrape latency can serve as an indicator that additional resources should be allocated to your Prometheus server.

scrape_errors¶

Description	Current number of errors that occurred during metrics collection
Metric type	Gauge
Labels	None
Usage	Since MSR metrics depend heavily on the use of RethinkDB, any scrape errors encountered are likely to be caused by issues related to RethinkDB itself. To diagnose and troubleshoot the problem, refer to the logs of your RethinkDB deployment.