The third-party jmx-exporter service used by StackLight to export the OpenContrail Cassandra metrics may have a slow response time on the ntw node where the Cassandra backup is enabled. Usually, this is the ntw01 node.
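Before applying the workaround, the slow response can be confirmed by timing a single scrape of the exporter's metrics endpoint. The sketch below is illustrative and not part of the original procedure: the scrape_time helper is an assumption, and the demo URL must be replaced with the jmx-exporter metrics URL configured in your deployment.

```shell
# Hypothetical check: time one scrape of the exporter's metrics endpoint.
scrape_time() {
  # -w '%{time_total}' prints the total transfer time in seconds
  curl -o /dev/null -s -w '%{time_total}' "$1"
}

# Demo against a throwaway local file so the sketch runs anywhere;
# on the ntw node, pass the real jmx-exporter metrics URL instead.
demo_url="file://$(mktemp)"
t=$(scrape_time "$demo_url")
echo "scrape took ${t}s"
```

If the reported time approaches or exceeds the Prometheus scrape timeout, the workaround below applies.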
You may also detect the following symptoms of the issue:

- The PrometheusTargetDown alert for the jmx_cassandra_exporter job appears in the FIRING state for the ntw0x node.
- The contrail-database-nodemgr service status is initializing.

Workaround:
Log in to the ntw01 node.
Verify that the Cassandra snapshots are automatically backed up in /var/backups/cassandra/. Otherwise, manually back them up from /var/lib/cassandra/data.
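The manual backup step above can be sketched as follows. This is a minimal, illustrative script, not the documented procedure: it assumes snapshots live under the usual <keyspace>/<table>/snapshots/ layout in the data directory, and it demonstrates on a throwaway directory so it runs anywhere. On the ntw node, point DATA_DIR and BACKUP_DIR at /var/lib/cassandra/data and /var/backups/cassandra and remove the demo fixture.

```shell
# Sketch: copy Cassandra snapshot directories into a backup path,
# preserving the keyspace/table layout. Paths default to a throwaway
# demo layout; override DATA_DIR and BACKUP_DIR for real use.
DATA_DIR="${DATA_DIR:-$(mktemp -d)}"
BACKUP_DIR="${BACKUP_DIR:-$(mktemp -d)/cassandra}"

# Demo fixture so the sketch runs outside a real node; remove in real use.
mkdir -p "$DATA_DIR/demo_ks/demo_table/snapshots/tag1"
echo data > "$DATA_DIR/demo_ks/demo_table/snapshots/tag1/demo.db"

# Snapshots live under <keyspace>/<table>/snapshots/<tag> in the data dir.
find "$DATA_DIR" -type d -name snapshots | while read -r snap; do
  rel="${snap#"$DATA_DIR"/}"          # e.g. demo_ks/demo_table/snapshots
  mkdir -p "$BACKUP_DIR/$rel"
  cp -a "$snap/." "$BACKUP_DIR/$rel/"
done

echo "backed up to $BACKUP_DIR"
```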
Clear the Cassandra snapshots. For example:

doctrail controller nodetool -h localhost -p 7198 clearsnapshot
If clearing the snapshots does not resolve the issue, increase the scrape_interval and scrape_timeout values for jmx_cassandra_exporter. Note that Prometheus requires scrape_timeout to be no greater than scrape_interval:
Open your Git project repository with the Reclass model on the cluster level.
In cluster/<cluster_name>/stacklight/server.yml, modify the scrape parameters. For example:
prometheus:
  server:
    target:
      static:
        jmx_cassandra_exporter:
          scheme: http
          metrics_path: /metrics
          honor_labels: False
          scrape_interval: 60s
          scrape_timeout: 60s
Log in to the Salt Master node.
Apply the changes to the Reclass model:
salt 'cfg01*' state.apply reclass.storage
salt '*' saltutil.sync_all
Apply the following state:
salt -C 'I@prometheus:server' state.sls prometheus
Connect to Grafana as described in Connect to Grafana.
Navigate to the Cassandra dashboard.
Verify that the rate_interval value is more than 1m.