The third-party jmx-exporter service used by StackLight
for exporting the OpenContrail Cassandra metrics
may have a slow response time on the ntw node where the Cassandra backup
is enabled. Usually, this is the ntw01 node.
You may also detect the following symptoms of the issue:
The PrometheusTargetDown alert for the jmx_cassandra_exporter job
appears in the FIRING state for the ntw0x node.
The contrail-database-nodemgr service status is initializing.

Workaround:
Log in to the affected ntw node, typically ntw01.
Verify that the Cassandra snapshots are automatically backed up to
/var/backups/cassandra/. Otherwise, manually back up the snapshots
located under /var/lib/cassandra/data, as shown in the sketch below.
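If the snapshots are not backed up automatically, the copy can be
scripted. The following is a minimal sketch, assuming the standard
Cassandra on-disk layout (<data_dir>/<keyspace>/<table>/snapshots/<name>)
under /var/lib/cassandra/data; verify the paths on your node before
running it:
# Minimal sketch: copy any snapshots found under the Cassandra data
# directory to the backup location, preserving the keyspace/table layout.
SRC=/var/lib/cassandra/data
DST=/var/backups/cassandra
find "$SRC" -type d -name snapshots | while read -r snapdir; do
  rel=${snapdir#"$SRC"/}            # e.g. keyspace/table/snapshots
  mkdir -p "$DST/$rel"
  cp -a "$snapdir"/. "$DST/$rel/"   # -a preserves permissions and timestamps
done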
Clear the Cassandra snapshots. For example, through the doctrail
wrapper on the host:
doctrail controller nodetool -h localhost -p 7198 clearsnapshot
Or, if you are already inside the controller container, run nodetool
directly:
nodetool -h localhost -p 7198 clearsnapshot
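To confirm that the snapshots were removed, you can list the remaining
ones. This assumes your nodetool version provides the listsnapshots
subcommand (available since Cassandra 2.1); the wrapper and port match
the command above:
# The output should report no snapshots after clearsnapshot completes.
doctrail controller nodetool -h localhost -p 7198 listsnapshots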
If clearing the snapshots does not resolve the issue, increase the
scrape_interval and scrape_timeout values for
jmx_cassandra_exporter to give the slow exporter more time to respond:
Open your Git project repository with the Reclass model on the cluster level.
In cluster/<cluster_name>/stacklight/server.yml,
modify the scrape parameters. For example:
prometheus:
  server:
    target:
      static:
        jmx_cassandra_exporter:
          scheme: http
          metrics_path: /metrics
          honor_labels: False
          scrape_interval: 60s
          scrape_timeout: 60s
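Note that Prometheus requires scrape_timeout to be less than or equal
to scrape_interval, so setting both to 60s gives the exporter the
maximum time allowed per scrape cycle.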
Log in to the Salt Master node.
Apply the changes to the Reclass model:
salt 'cfg01*' state.apply reclass.storage
salt '*' saltutil.sync_all
Apply the following state:
salt -C 'I@prometheus:server' state.sls prometheus
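Once the state is applied, you can verify that Prometheus scrapes the
target successfully. This is a sketch against the Prometheus query API;
<prometheus_host> is a placeholder for your Prometheus server address,
and 9090 is the default Prometheus port, which may differ in your
deployment:
# up{job="jmx_cassandra_exporter"} returns 1 for each target that is
# scraped successfully and 0 for targets that are down.
curl -s -G 'http://<prometheus_host>:9090/api/v1/query' \
  --data-urlencode 'query=up{job="jmx_cassandra_exporter"}'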
Connect to Grafana as described in Connect to Grafana.
Navigate to the Cassandra dashboard.
Verify that the rate_interval value is more than 1m. With a 60s
scrape_interval, a shorter range may contain only one sample, in which
case the rate() functions behind the dashboard panels return no data.