Monitor and troubleshoot

Monitor and troubleshoot

Monitor MSR

Mirantis Secure Registry is a Dockerized application. To monitor it, you can use the same tools and techniques you’re already using to monitor other containerized applications running on your cluster. One way to monitor MSR is using the monitoring capabilities of Docker Universal Control Plane.

In your browser, log in to Mirantis Kubernetes Engine (MKE), and navigate to the Stacks page. If you have MSR set up for high-availability, then all the MSR replicas are displayed.

To check the containers for the MSR replica, click the replica you want to inspect, click Inspect Resource, and choose Containers.

Now you can drill into each MSR container to see its logs and find the root cause of the problem.

Health check endpoints

MSR also exposes several endpoints you can use to assess if a MSR replica is healthy or not:

  • /_ping: Checks if the MSR replica is healthy, and returns a simple json response. This is useful for load balancing or other automated health check tasks.

  • /nginx_status: Returns the number of connections being handled by the NGINX front-end used by MSR.

  • /api/v0/meta/cluster_status: Returns extensive information about all MSR replicas.

Cluster status

The /api/v0/meta/cluster_status endpoint requires administrator credentials, and returns a JSON object for the entire cluster as observed by the replica being queried. You can authenticate your requests using HTTP basic auth.

curl -ksL -u <user>:<pass> https://<msr-domain>/api/v0/meta/cluster_status
{
  "current_issues": [
   {
    "critical": false,
    "description": "... some replicas are not ready. The following servers are
                    not reachable: dtr_rethinkdb_f2277ad178f7",
  }],
  "replica_health": {
    "f2277ad178f7": "OK",
    "f3712d9c419a": "OK",
    "f58cf364e3df": "OK"
  },
}

You can find health status on the current_issues and replica_health arrays. If this endpoint doesn’t provide meaningful information when trying to troubleshoot, try troubleshooting using logs.

Check notary audit logs

Docker Content Trust (DCT) keeps audit logs of changes made to trusted repositories. Every time you push a signed image to a repository, or delete trust data for a repository, DCT logs that information.

These logs are only available from the MSR API.

Get an authentication token

To access the audit logs you need to authenticate your requests using an authentication token. You can get an authentication token for all repositories, or one that is specific to a single repository.

curl --insecure --silent \
  --user <user>:<password> \
  "https://<dtr-url>/auth/token?realm=dtr&service=dtr&scope=registry:catalog:*"
  

curl --insecure --silent \
  --user <user>:<password> \
  "https://<dtr-url>/auth/token?realm=dtr&service=dtr&scope=repository:<dtr-url>/<repository>:pull"
  

MSR returns a JSON file with a token, even when the user doesn’t have access to the repository to which they requested the authentication token. This token doesn’t grant access to MSR repositories.

The JSON file returned has the following structure:

{
  "token": "<token>",
  "access_token": "<token>",
  "expires_in": "<expiration in seconds>",
  "issued_at": "<time>"
}

Changefeed API

Once you have an authentication token you can use the following endpoints to get audit logs:

URL

Description

Authorization

GET /v2/_trust/changefeed

Get audit logs for all repositories.

Global scope token

GET /v2/<msr-url>/<repository>/_trust/changefeed

Get audit logs for a specific repository.

Repositorhy-specific token

Both endpoints have the following query string parameters:

Field name

Required

Type

Description

change_id

Yes

String

A non-inclusive starting change ID from which to start returning results. This will typically be the first or last change ID from the previous page of records requested, depending on which direction your are paging in.

The value 0 indicates records should be returned starting from the beginning of time.

The value 1 indicates records should be returned starting from the most recent record. If 1 is provided, the implementation will also assume the records value is meant to be negative, regardless of the given sign.

records

Yes

String integer

The number of records to return. A negative value indicates the number of records preceding the change_id should be returned. Records are always returned sorted from oldest to newest.

Response

The response is a JSON like:

{
  "count": 1,
  "records": [
    {
      "ID": "0a60ec31-d2aa-4565-9b74-4171a5083bef",
      "CreatedAt": "2017-11-06T18:45:58.428Z",
      "GUN": "msr.example.org/library/wordpress",
      "Version": 1,
      "SHA256": "a4ffcae03710ae61f6d15d20ed5e3f3a6a91ebfd2a4ba7f31fc6308ec6cc3e3d",
      "Category": "update"
    }
  ]
}

Below is the description for each of the fields in the response:

Field name

Description

count

The number of records returned.

ID

The ID of the change record. Should be used in the change_id field of requests to provide a non-exclusive starting index. It should be treated as an opaque value that is guaranteed to be unique within an instance of notary.

CreatedAt

The time the change happened.

GUN

The MSR repository that was changed.

Version

The version that the repository was updated to. This increments every time there’s a change to the trust repository.

This is always 0 for events representing trusted data being removed from the repository.

SHA256

The checksum of the timestamp being updated to. This can be used with the existing notary APIs to request said timestamp.

This is always an empty string for events representing trusted data being removed from the repository

Category

The kind of change that was made to the trusted repository. Can be update, or deletion.

The results only include audit logs for events that happened more than 60 seconds ago, and are sorted from oldest to newest.

Even though the authentication API always returns a token, the changefeed API validates if the user has access to see the audit logs or not:

  • If the user is an admin they can see the audit logs for any repositories,

  • All other users can only see audit logs for repositories they have read access.

Troubleshooting MSR

This guide contains tips and tricks for troubleshooting MSR problems.

Troubleshoot overlay networks

High availability in MSR depends on swarm overlay networking. One way to test if overlay networks are working correctly is to deploy containers to the same overlay network on different nodes and see if they can ping one another.

Use SSH to log into a node and run:

docker run -it --rm \
  --net dtr-ol --name overlay-test1 \
  --entrypoint sh mirantis/dtr

Then use SSH to log into another node and run:

docker run -it --rm \
  --net dtr-ol --name overlay-test2 \
  --entrypoint ping mirantis/dtr -c 3 overlay-test1

If the second command succeeds, it indicates overlay networking is working correctly between those nodes.

You can run this test with any attachable overlay network and any Docker image that has sh and ping.

Access RethinkDB directly

MSR uses RethinkDB for persisting data and replicating it across replicas. It might be helpful to connect directly to the RethinkDB instance running on a MSR replica to check the MSR internal state.

Warning

Modifying RethinkDB directly is not supported and may cause problems.

via RethinkCLI

The RethinkCLI can be run from a separate image in the mirantis organization. Note that the commands below are using separate tags for non-interactive and interactive modes.

Non-interactive

Use SSH to log into a node that is running a MSR replica, and run the following:

# List problems in the cluster detected by the current node.
REPLICA_ID=$(docker container ls --filter=name=dtr-rethink --format '{{.Names}}' | cut -d'/' -f2 | cut -d'-' -f3 | head -n 1) && echo 'r.db("rethinkdb").table("current_issues")' | docker run --rm -i --net dtr-ol -v "dtr-ca-${REPLICA_ID}:/ca" -e MSR_REPLICA_ID=$REPLICA_ID mirantis/rethinkcli:v2.2.0-ni non-interactive

On a healthy cluster the output will be [].

Interactive

Starting in DTR 2.5.5, you can run RethinkCLI from a separate image. First, set an environment variable for your MSR replica ID:

REPLICA_ID=$(docker inspect -f '{{.Name}}' $(docker ps -q -f name=dtr-rethink) | cut -f 3 -d '-')

RethinkDB stores data in different databases that contain multiple tables. Run the following command to get into interactive mode and query the contents of the DB:

docker run -it --rm --net dtr-ol -v dtr-ca-$REPLICA_ID:/ca mirantis/rethinkcli:v2.3.0 $REPLICA_ID
# List problems in the cluster detected by the current node.
> r.db("rethinkdb").table("current_issues")
[]

# List all the DBs in RethinkDB
> r.dbList()
[ 'dtr2',
  'jobrunner',
  'notaryserver',
  'notarysigner',
  'rethinkdb' ]

# List the tables in the dtr2 db
> r.db('dtr2').tableList()
[ 'blob_links',
  'blobs',
  'client_tokens',
  'content_caches',
  'events',
  'layer_vuln_overrides',
  'manifests',
  'metrics',
  'namespace_team_access',
  'poll_mirroring_policies',
  'promotion_policies',
  'properties',
  'pruning_policies',
  'push_mirroring_policies',
  'repositories',
  'repository_team_access',
  'scanned_images',
  'scanned_layers',
  'tags',
  'user_settings',
  'webhooks' ]

# List the entries in the repositories table
> r.db('dtr2').table('repositories')
[ { enableManifestLists: false,
    id: 'ac9614a8-36f4-4933-91fa-3ffed2bd259b',
    immutableTags: false,
    name: 'test-repo-1',
    namespaceAccountID: 'fc3b4aec-74a3-4ba2-8e62-daed0d1f7481',
    namespaceName: 'admin',
    pk: '3a4a79476d76698255ab505fb77c043655c599d1f5b985f859958ab72a4099d6',
    pulls: 0,
    pushes: 0,
    scanOnPush: false,
    tagLimit: 0,
    visibility: 'public' },
  { enableManifestLists: false,
    id: '9f43f029-9683-459f-97d9-665ab3ac1fda',
    immutableTags: false,
    longDescription: '',
    name: 'testing',
    namespaceAccountID: 'fc3b4aec-74a3-4ba2-8e62-daed0d1f7481',
    namespaceName: 'admin',
    pk: '6dd09ac485749619becaff1c17702ada23568ebe0a40bb74a330d058a757e0be',
    pulls: 0,
    pushes: 0,
    scanOnPush: false,
    shortDescription: '',
    tagLimit: 1,
    visibility: 'public' } ]

Individual DBs and tables are a private implementation detail and may change in MSR from version to version, but you can always use dbList() and tableList() to explore the contents and data structure.

Learn more about RethinkDB queries.

via API

To check on the overall status of your MSR cluster without interacting with RethinkCLI, run the following API request:

curl -u admin:$TOKEN -X GET "https://<msr-url>/api/v0/meta/cluster_status" -H "accept: application/json"
Example API Response
{
  "rethink_system_tables": {
    "cluster_config": [
      {
        "heartbeat_timeout_secs": 10,
        "id": "heartbeat"
      }
    ],
    "current_issues": [],
    "db_config": [
      {
        "id": "339de11f-b0c2-4112-83ac-520cab68d89c",
        "name": "notaryserver"
      },
      {
        "id": "aa2e893f-a69a-463d-88c1-8102aafebebc",
        "name": "dtr2"
      },
      {
        "id": "bdf14a41-9c31-4526-8436-ab0fed00c2fd",
        "name": "jobrunner"
      },
      {
        "id": "f94f0e35-b7b1-4a2f-82be-1bdacca75039",
        "name": "notarysigner"
      }
    ],
    "server_status": [
      {
        "id": "9c41fbc6-bcf2-4fad-8960-d117f2fdb06a",
        "name": "dtr_rethinkdb_5eb9459a7832",
        "network": {
          "canonical_addresses": [
            {
              "host": "dtr-rethinkdb-5eb9459a7832.dtr-ol",
              "port": 29015
            }
          ],
          "cluster_port": 29015,
          "connected_to": {
            "dtr_rethinkdb_56b65e8c1404": true
          },
          "hostname": "9e83e4fee173",
          "http_admin_port": "<no http admin>",
          "reql_port": 28015,
          "time_connected": "2019-02-15T00:19:22.035Z"
        },
       }
     ...
    ]
  }
}

Recover from an unhealthy replica

When a MSR replica is unhealthy or down, the MSR web UI displays a warning:

Warning: The following replicas are unhealthy: 59e4e9b0a254; Reasons: Replica reported health too long ago: 2017-02-18T01:11:20Z; Replicas 000000000000, 563f02aba617 are still healthy.

To fix this, you should remove the unhealthy replica from the MSR cluster, and join a new one. Start by running:

docker run -it --rm \
  mirantis/dtr:2.8.2 remove \
  --ucp-insecure-tls

And then:

docker run -it --rm \
  mirantis/dtr:2.8.2 join \
  --ucp-node <mke-node-name> \
  --ucp-insecure-tls