Slow response time from JMX Exporter

The third-party jmx-exporter service that StackLight uses to export the OpenContrail Cassandra metrics may respond slowly on the ntw node where the Cassandra backup is enabled. Usually, this is the ntw01 node.

You may also detect the following symptoms of the issue:

  • Grafana does not display metrics for Cassandra.

  • The PrometheusTargetDown alert for the jmx_cassandra_exporter job appears in the FIRING state for the ntw0x node.

  • The contrail-database-nodemgr service status is initializing.
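To tell a slow exporter from an unreachable one, you can time a single request to its metrics endpoint. The following is a minimal sketch and not part of StackLight: the check_scrape_time helper is hypothetical, and the host and port in the usage comment are placeholders that you replace with the values of the jmx_cassandra_exporter target in your deployment. It uses only standard curl options.

```shell
# Time one request to a metrics endpoint and report whether it completes
# within the given timeout (in seconds).
# Hypothetical helper, shown for illustration only.
# Usage (placeholders, adjust to your deployment):
#   check_scrape_time "http://<ntw01_address>:<exporter_port>/metrics" 10
check_scrape_time() {
    local url="$1" timeout="$2" elapsed
    # -s: silent, -o /dev/null: discard the body,
    # --max-time: abort after the timeout, -w: print the total request time.
    elapsed=$(curl -s -o /dev/null --max-time "$timeout" -w '%{time_total}' "$url") || {
        echo "FAIL: no complete response from $url within ${timeout}s"
        return 1
    }
    echo "OK: $url answered in ${elapsed}s"
}
```

If the request fails with the scrape timeout that Prometheus currently uses for this job but succeeds with a larger value, the exporter is responding slowly rather than being down.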

Workaround:

  1. Log in to the ntw01 node.

  2. Verify that the Cassandra snapshots are automatically backed up to /var/backups/cassandra/. If they are not, manually back up the snapshots stored under /var/lib/cassandra/data.

  3. Clear the Cassandra snapshots. For example:

    • For OpenContrail 4.x:

    doctrail controller nodetool -h localhost -p 7198 clearsnapshot
    
    • For OpenContrail 3.2:

    nodetool -h localhost -p 7198 clearsnapshot
    
  4. If clearing the snapshots does not resolve the issue, increase the scrape_interval and scrape_timeout values for jmx_cassandra_exporter:

    1. Open your Git project repository with the Reclass model on the cluster level.

    2. In cluster/<cluster_name>/stacklight/server.yml, modify the scrape parameters. For example:

      prometheus:
        server:
          target:
            static:
              jmx_cassandra_exporter:
                scheme: http
                metrics_path: /metrics
                honor_labels: False
                scrape_interval: 60s
                scrape_timeout: 60s
      
    3. Log in to the Salt Master node.

    4. Apply the changes to the Reclass model:

      salt 'cfg01*' state.apply reclass.storage
      salt '*' saltutil.sync_all
      
    5. Apply the following state:

      salt -C 'I@prometheus:server' state.sls prometheus
      
    6. Connect to Grafana as described in Connect to Grafana.

    7. Navigate to the Cassandra dashboard.

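    8. Verify that the rate_interval value is greater than 1m. With a 60s scrape interval, a shorter rate window may contain fewer than the two samples needed to compute a rate.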
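For steps 2 and 3 of the workaround, it can help to see how much space the snapshots occupy before and after you clear them. The following is a minimal sketch that assumes the standard Cassandra layout, in which each table directory under the data root contains a snapshots subdirectory. The list_snapshot_dirs helper is hypothetical. Run it where the Cassandra data directory is visible, which may be inside the container on OpenContrail 4.x.

```shell
# Print the size of every Cassandra "snapshots" directory under a data root.
# The data root defaults to /var/lib/cassandra/data, the path from step 2.
# Hypothetical helper, shown for illustration only.
# Usage:
#   list_snapshot_dirs
#   list_snapshot_dirs /path/to/other/data/root
list_snapshot_dirs() {
    local data_root="${1:-/var/lib/cassandra/data}"
    # Match only directories named "snapshots"; du prints one size line each.
    find "$data_root" -type d -name snapshots -exec du -sh {} + 2>/dev/null
}
```

Empty output means that no snapshots directories exist under the given data root. After a successful clearsnapshot, any directories that remain should report a much smaller size than before.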