Verify StackLight after configuration

This section describes how to verify StackLight after configuring its parameters as described in StackLight configuration procedure and StackLight configuration parameters. Perform the verification procedure for the particular StackLight key that you modified.

Verify StackLight configuration of an OpenStack cluster

Key

Verification procedure

  • externalFQDNs.enabled

  • openstack.insecure

  1. In the Prometheus web UI, navigate to Status > Targets.

  2. Verify that the blackbox-external-endpoint target contains the configured domains (URLs).

  • openstack.enabled

  • openstack.namespace

  1. In the Grafana web UI, verify that the OpenStack dashboards are present and not empty.

  2. In the Prometheus web UI, click Alerts and verify that the OpenStack alerts are present in the list of alerts.
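
As an additional CLI cross-check, you can list the OpenStack dashboard ConfigMaps in the stacklight namespace. This is a sketch that assumes the dashboards follow the grafana-dashboards-default-* ConfigMap naming shown for Gnocchi below:

kubectl get cm -n stacklight | grep grafana-dashboards-default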

openstack.gnocchi.enabled

  1. In the Grafana web UI, verify that the Gnocchi dashboard is present and not empty. Alternatively, verify that the Gnocchi dashboard ConfigMap is present:

    kubectl get cm -n stacklight \
    grafana-dashboards-default-gnocchi
    
  2. In the OpenSearch Dashboards web UI, verify that logs for the gnocchi-metricd and gnocchi-api loggers are present.

openstack.ironic.enabled

  1. In the Grafana web UI, verify that the Ironic dashboard is present and not empty.

  2. In the Prometheus web UI, click Alerts and verify that the Ironic* alerts are present in the list of alerts.

  • openstack.rabbitmq.credentialsConfig

  • openstack.rabbitmq.credentialsDiscovery

In the OpenSearch Dashboards web UI, click Discover and verify that the audit-* and notifications-* indexes contain documents.
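
Alternatively, you can query OpenSearch directly. The following sketch assumes the opensearch-master-0 pod name and that curl is available in the container; a non-zero docs.count for the audit-* and notifications-* indexes indicates that documents are being ingested:

kubectl exec -n stacklight opensearch-master-0 -- \
curl -s 'http://localhost:9200/_cat/indices/audit-*,notifications-*?v'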

  • openstack.telegraf.credentialsConfig

  • openstack.telegraf.credentialsDiscovery

  • openstack.telegraf.interval

  • openstack.telegraf.insecure

  • openstack.telegraf.skipPublicEndpoints

In the Grafana web UI, verify that the OpenStack dashboards are present and not empty.

tungstenFabricMonitoring.enabled

  1. In the Grafana web UI, verify that the Tungsten Fabric dashboards are present and not empty.

  2. In the Prometheus web UI, click Alerts and verify that the Tungsten Fabric alerts are present in the list of alerts.

Verify StackLight configuration of a MOSK cluster

Key

Verification procedure

alerta.enabled

Verify that Alerta is present in the list of StackLight resources. An empty output indicates that Alerta is disabled.

kubectl get all -n stacklight -l app=alerta

  • alertmanagerSimpleConfig.email

  • alertmanagerSimpleConfig.email.enabled

  • alertmanagerSimpleConfig.email.route

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the Email receiver and route.
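
You can also inspect the rendered Alertmanager configuration from the CLI. This is a sketch that assumes the email receiver is rendered under the standard email_configs key of the prometheus-alertmanager ConfigMap:

kubectl get cm -n stacklight prometheus-alertmanager -o \
yaml | grep -A 6 email_configs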

alertmanagerSimpleConfig.genericReceivers

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the intended receiver(s).

alertmanagerSimpleConfig.genericRoutes

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the intended route(s).

alertmanagerSimpleConfig.inhibitRules.enabled

Run the following command. An empty output indicates either a failure or that the feature is disabled.

kubectl get cm -n stacklight prometheus-alertmanager -o \
yaml | grep -A 6 inhibit_rules

  • alertmanagerSimpleConfig.msteams.enabled

  • alertmanagerSimpleConfig.msteams.url

  • alertmanagerSimpleConfig.msteams.route

  1. Verify that the Prometheus Microsoft Teams pod is up and running:

    kubectl get pods -n stacklight -l \
    'app=prometheus-msteams'
    
  2. Verify that the Prometheus Microsoft Teams pod logs have no errors:

    kubectl logs -f -n stacklight -l \
    'app=prometheus-msteams'
    
  3. Verify that notifications are being sent to the Microsoft Teams channel.

  • alertmanagerSimpleConfig.salesForce.enabled

  • alertmanagerSimpleConfig.salesForce.auth

  • alertmanagerSimpleConfig.salesForce.route

  1. Verify that sf-notifier is enabled. The output must include the sf-notifier pod name, 1/1 in the READY field and Running in the STATUS field.

    kubectl get pods -n stacklight
    
  2. Verify that sf-notifier successfully authenticates to Salesforce. The output must include the Salesforce authentication successful line.

    kubectl logs -f -n stacklight <sf-notifier-pod-name>
    
  3. In the Alertmanager web UI, navigate to Status and verify that the Config section contains the HTTP-salesforce receiver and route.

alertmanagerSimpleConfig.salesForce.feed_enabled

  • Verify that the sf-notifier pod logs include Creating feed item messages. For such messages to appear in the logs, the logging level must be set to DEBUG.

  • Verify through Salesforce:

    1. Log in to the Salesforce web UI.

    2. Click the Feed tab for a case created by sf-notifier.

    3. Verify that All Messages gets updated.

alertmanagerSimpleConfig.salesForce.link_prometheus

Verify that SF_NOTIFIER_ADD_LINKS has changed to true or false according to your customization:

kubectl get deployment sf-notifier -n stacklight \
-o=jsonpath='{.spec.template.spec.containers[0].env}' | jq .

alertmanagerSimpleConfig.serviceNow

  1. Verify that the alertmanager-webhook-servicenow pod is up and running:

    kubectl get pods -n stacklight -l \
    'app=alertmanager-webhook-servicenow'
    
  2. Verify that authentication to ServiceNow was successful. The output should include ServiceNow authentication successful. In case of an authentication failure, the ServiceNowAuthFailure alert is raised.

    kubectl logs -f -n stacklight \
    <alertmanager-webhook-servicenow-pod-name>
    
  3. In your ServiceNow instance, verify that the Watchdog alert appears in the Incident table. Once the incident is created, the pod logs should include a line similar to Created Incident: bef260671bdb2010d7b540c6cc4bcbed.
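
To confirm the incident creation from the CLI, you can grep the pod logs for the line mentioned above:

kubectl logs -n stacklight <alertmanager-webhook-servicenow-pod-name> | \
grep 'Created Incident'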

In case of any failure:

  • Verify that your ServiceNow instance is not in hibernation.

  • Verify that the service user credentials, table name, and alert_id_field are correct.

  • Verify that the ServiceNow user has access to the table with permission to read, create, and update records.

  • alertmanagerSimpleConfig.slack.enabled

  • alertmanagerSimpleConfig.slack.api_url

  • alertmanagerSimpleConfig.slack.channel

  • alertmanagerSimpleConfig.slack.route

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the HTTP-slack receiver and route.

blackboxExporter.customModules

  1. Verify that your module is present in the list of modules. It can take up to 10 minutes for the module to appear in the ConfigMap.

    kubectl get cm prometheus-blackbox-exporter -n stacklight \
    -o=jsonpath='{.data.blackbox\.yaml}'
    
  2. Review the configmap-reload container logs to verify that the reload completed successfully. It can take up to 1 minute for the reload to happen after the module appears in the ConfigMap.

    kubectl logs -l app=prometheus-blackbox-exporter -n stacklight -c \
    configmap-reload
    

blackboxExporter.timeoutOffset

Verify that the args parameter of the blackbox-exporter container contains the specified --timeout-offset:

kubectl get deployment.apps/prometheus-blackbox-exporter -n stacklight \
-o=jsonpath='{.spec.template.spec.containers[?(@.name=="blackbox-exporter")].args}'

For example, for blackboxExporter.timeoutOffset set to 0.1, the output should include ["--config.file=/config/blackbox.yaml","--timeout-offset=0.1"]. It can take up to 10 minutes for the parameter to be populated.

ceph.enabled

  1. In the Grafana web UI, verify that Ceph dashboards are present in the list of dashboards and are populated with data.

  2. In the Prometheus web UI, click Alerts and verify that the list of alerts contains Ceph* alerts.
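
As an additional check, run the following PromQL query in the Prometheus web UI. This is a sketch that assumes the cluster exposes the standard ceph_health_status metric of the Ceph exporter; the value 0 corresponds to HEALTH_OK:

ceph_health_status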

  • clusterSize

  • resourcesPerClusterSize Deprecated

  • resources

  1. Obtain the list of pods:

    kubectl get po -n stacklight
    
  2. Verify that the desired resource limits or requests are set in the resources section of every container in the pod:

    kubectl get po <pod_name> -n stacklight -o yaml
    
elasticsearch.logstashRetentionTime
Removed in MCC 2.26.0 (17.1.0, 16.1.0)

Verify that the unit_count parameter contains the desired number of days:

kubectl get cm elasticsearch-curator-config -n \
stacklight -o=jsonpath='{.data.action_file\.yml}'

elasticsearch.persistentVolumeClaimSize

Verify that the capacity of the PVC(s) is equal to the specified size, or higher in the case of statically provisioned volumes:

kubectl get pvc -n stacklight -l "app=opensearch-master"

  • elasticsearch.retentionTime

  • logging.retentionTime
    Removed in MCC 2.26.0 (17.1.0, 16.1.0)
  1. Verify that the ConfigMap includes the new data. The output should include the changed values.

    kubectl get cm elasticsearch-curator-config -n stacklight --kubeconfig=<pathToKubeconfig> -o yaml
    
  2. Verify that the elasticsearch-curator-{JOB_ID}-{POD_ID} job has successfully completed:

    kubectl logs elasticsearch-curator-<jobID>-<podID> -n stacklight --kubeconfig=<pathToKubeconfig>
    
  • externalEndpointMonitoring.enabled

  • externalEndpointMonitoring.domains

  1. In the Prometheus web UI, navigate to Status > Targets.

  2. Verify that the blackbox-external-endpoint target contains the configured domains (URLs).

grafana.homeDashboard

In the Grafana web UI, verify that the desired dashboard is set as a home dashboard.

grafana.renderer.enabled
Removed in MCC 2.27.0 (17.2.0, 16.2.0)

Verify the Grafana Image Renderer. If the key is set to true, the output should include HTTP Server started, listening at http://localhost:8081.

kubectl logs -f -n stacklight -l app=grafana \
--container grafana-renderer

highAvailabilityEnabled

Verify the number of service replicas for the HA or non-HA StackLight mode. For details, see Deployment architecture.

kubectl get sts -n stacklight

  • ironic.endpoint

  • ironic.insecure

In the Grafana web UI, verify that the Ironic BM dashboard displays valuable data (no false-positive or empty panels).

logging.dashboardsExtraConfig

Verify that the customization has applied:

kubectl -n stacklight get cm opensearch-dashboards -o=jsonpath='{.data}'

Example of system response:

{"opensearch_dashboards.yml":"opensearch.hosts: http://opensearch-master:9200\
\nopensearch.requestTimeout: 60000\
\nopensearchDashboards.defaultAppId: dashboard/2d53aa40-ad1f-11e9-9839-052bda0fdf49\
\nserver:\
\n  host: 0.0.0.0\
\n  name: opensearch-dashboards\n"}

logging.enabled

Verify that OpenSearch, Fluentd, and OpenSearch Dashboards are present in the list of StackLight resources. An empty output indicates that the StackLight logging stack is disabled.

kubectl get all -n stacklight -l 'app in
(opensearch-master,opensearchDashboards,fluentd-logs)'

logging.externalOutputs

  1. Verify the fluentd-logs Kubernetes configmap in the stacklight namespace:

    kubectl get cm -n stacklight fluentd-logs -o \
    "jsonpath={.data['output-logs\.conf']}"
    

    The output must contain an additional output stream according to the configured external outputs.

  2. After restart of the fluentd-logs pods, verify that their logs do not contain any delivery error messages. For example:

    kubectl logs -n stacklight -f <fluentd-logs-pod-name> | grep '\[error\]'
    

    Example output with a missing parameter:

    [...]
    2023-07-25 09:39:33 +0000 [error]: config error file="/etc/fluentd/fluent.conf" error_class=Fluent::ConfigError error="host or host_with_port is required"
    

    If a parameter is missing, verify the configuration as described in Enable log forwarding to external destinations.

  3. Verify that the log messages are appearing in the external server database.

To troubleshoot issues with Splunk, refer to No logs are forwarded to Splunk.

logging.externalOutputSecretMounts

Verify that files were created for the specified path in the Fluentd container:

kubectl get pods -n stacklight -o name | grep fluentd-logs | \
xargs -I{} kubectl exec -i {} -c fluentd-logs -n stacklight -- \
ls <logging.externalOutputSecretMounts.mountPath>

logging.extraConfig

Verify that the customization has applied:

kubectl -n stacklight get cm opensearch-master-config -o=jsonpath='{.data}'

Example of system response:

{"opensearch.yml":"cluster.name: opensearch\
\nnetwork.host: 0.0.0.0\
\nplugins.security.disabled: true\
\nplugins.index_state_management.enabled: false\
\npath.data: /usr/share/opensearch/data\
\ncompatibility.override_main_response_version: true\
\ncluster.max_shards_per_node: 5000\n"}

logging.level
Removed in MCC 2.26.0 (17.1.0, 16.1.0)
  1. Inspect the fluentd-logs Kubernetes configmap in the stacklight namespace:

    kubectl get cm -n stacklight fluentd-logs \
    -o "jsonpath={.data['output-logs\.conf']}"
    
  2. Grep the output for the following pattern. The pattern should contain all logging levels below the expected one.

    @type grep
    <exclude>
     key severity_label
     pattern /^<pattern>$/
    </exclude>
    

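For example, one possible way to extract the relevant filter from the ConfigMap obtained in step 1 (a sketch; adjust the grep context to your configuration):

kubectl get cm -n stacklight fluentd-logs \
-o "jsonpath={.data['output-logs\.conf']}" | grep -B 2 -A 2 'key severity_label'
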
logging.metricQueries

For details, see steps 4.2 and 4.3 in Create logs-based metrics.

logging.syslog.enabled

  1. Verify the fluentd-logs Kubernetes configmap in the stacklight namespace:

    kubectl get cm -n stacklight fluentd-logs -o \
    "jsonpath={.data['output-logs\.conf']}"
    

    The output must contain an additional output section with the remote syslog configuration.

  2. After restart of the fluentd-logs pods, verify that their logs do not contain any delivery error messages.

  3. Verify that the log messages are appearing in the remote syslog database.

logging.syslog.packetSize

Verify that the packetSize has changed according to your customization:

kubectl get cm -n stacklight fluentd-logs -o \
yaml | grep packet_size

metricFilter

  1. In the Prometheus web UI, navigate to Status > Configuration.

  2. Verify that the following fields in the metric_relabel_configs section for the kubernetes-nodes-cadvisor and prometheus-kube-state-metrics scrape jobs have the required configuration:

    • action is set to keep or drop

    • regex contains a regular expression with configured namespaces delimited by |

    • source_labels is set to [namespace]
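
You can also inspect the rendered scrape configuration from the CLI. This is a sketch that assumes the Prometheus server configuration is stored in the prometheus-server ConfigMap under the prometheus.yml key:

kubectl get cm -n stacklight prometheus-server \
-o=jsonpath='{.data.prometheus\.yml}' | grep -A 6 metric_relabel_configs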

mke.dockerdDataRoot

In the Prometheus web UI, navigate to Alerts and verify that the MKEAPIDown alert is not firing false-positively due to a missing certificate.

mke.enabled

  1. In the Grafana web UI, verify that the MKE Cluster and MKE Containers dashboards are present and not empty.

  2. In the Prometheus web UI, navigate to Alerts and verify that the MKE* alerts are present in the list of alerts.

nodeExporter.extraCollectorsEnabled

In the Prometheus web UI, run the following PromQL queries. The result should not be empty.

node_scrape_collector_duration_seconds{collector="<COLLECTOR_NAME>"}
node_scrape_collector_success{collector="<COLLECTOR_NAME>"}

nodeExporter.netDeviceExclude

  1. Verify the DaemonSet configuration of the Node Exporter:

    kubectl get daemonset -n stacklight prometheus-node-exporter \
    -o=jsonpath='{.spec.template.spec.containers[0].args}' | jq .
    

    Expected system response:

    [
      "--path.procfs=/host/proc",
      "--path.sysfs=/host/sys",
      "--collector.netclass.ignored-devices=<paste_your_excluding_regexp_here>",
      "--collector.netdev.device-blacklist=<paste_your_excluding_regexp_here>",
      "--no-collector.ipvs"
    ]
    
  2. In the Prometheus web UI, run the following PromQL query. The expected result is 1.

    absent(node_network_transmit_bytes_total{device=~"<paste_your_excluding_regexp_here>"})
    
  • nodeSelector.component

  • nodeSelector.default

  • tolerations.component

  • tolerations.default

Verify that the pods of the appropriate components are located on the intended nodes:

kubectl get pod -o=custom-columns=NAME:.metadata.name,\
STATUS:.status.phase,NODE:.spec.nodeName -n stacklight

  • prometheusRelay.clientTimeout

  • prometheusRelay.responseLimitBytes

  1. Verify that the Prometheus Relay pod is up and running:

    kubectl get pods -n stacklight -l 'component=relay'
    
  2. Verify that the values have changed according to your customization, substituting the Prometheus Relay pod name obtained in the previous step:

    kubectl get pods -n stacklight <prometheus-relay-pod-name> \
    -o=jsonpath='{.spec.containers[0].env}' | jq .
    

prometheusServer.alertsCommonLabels

  1. In the Prometheus web UI, navigate to Status > Configuration.

  2. Verify that the alerting.alert_relabel_configs section contains the customization for common labels that you added in prometheusServer.alertsCommonLabels during StackLight configuration.

prometheusServer.customAlerts

In the Prometheus web UI, navigate to Alerts and verify that the list of alerts has changed according to your customization.

prometheusServer.customRecordingRules

  1. In the Prometheus web UI, navigate to Status > Rules.

  2. Verify that the list of Prometheus recording rules has changed according to your customization.

prometheusServer.customScrapeConfigs

  1. In the Prometheus web UI, navigate to Status > Targets.

  2. Verify that the required target has appeared in the list of targets.

It may take up to 10 minutes for the change to apply.

prometheusServer.persistentVolumeClaimSize

Verify that the capacity of the PVC(s) is equal to the specified size, or higher in the case of statically provisioned volumes:

kubectl get pvc -n stacklight -l "app=prometheus,component=server"

  • prometheusServer.alertResendDelay

  • prometheusServer.queryConcurrency

  • prometheusServer.retentionSize

  • prometheusServer.retentionTime

  1. In the Prometheus web UI, navigate to Status > Command-Line Flags.

  2. Verify the values for the following flags:

    • rules.alert.resend-delay

    • query.max-concurrency

    • storage.tsdb.retention.size

    • storage.tsdb.retention.time
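
You can cross-check the same values from the CLI by inspecting the arguments of the prometheus-server container, assuming these options are passed as container arguments:

kubectl get pod prometheus-server-0 -n stacklight \
-o=jsonpath='{.spec.containers[?(@.name=="prometheus-server")].args}' | jq .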

prometheusServer.remoteWrites

  1. Inspect the remote_write configuration in the Status > Configuration section of the Prometheus web UI.

  2. Inspect the Prometheus server logs for errors:

    kubectl logs prometheus-server-0 prometheus-server -n stacklight
    

prometheusServer.remoteWriteSecretMounts

Verify that files were created for the specified path in the Prometheus container:

kubectl exec -it prometheus-server-0 -c prometheus-server -n \
stacklight -- ls <remoteWriteSecretMounts.mountPath>

prometheusServer.watchDogAlertEnabled

In the Prometheus web UI, navigate to Alerts and verify that the list of alerts contains the Watchdog alert.
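
Alternatively, run the following PromQL query in the Prometheus web UI. ALERTS is a built-in Prometheus metric, so the result should not be empty while the Watchdog alert is enabled:

ALERTS{alertname="Watchdog", alertstate="firing"}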

  • sfReporter.cronjob

  • sfReporter.enabled

  • sfReporter.salesForce

  1. Verify that Salesforce reporter is enabled. The SUSPEND field in the output must be False.

    kubectl get cronjob -n stacklight
    
  2. Verify that the Salesforce reporter configuration includes all expected queries:

    kubectl get configmap -n stacklight \
    sf-reporter-config -o yaml
    
  3. After cron job execution (by default, at midnight server time), obtain the Salesforce reporter pod name. The output should include the Salesforce reporter pod name and STATUS must be Completed.

    kubectl get pods -n stacklight
    
  4. Verify that Salesforce reporter successfully authenticates to Salesforce and creates records. The output must include the Salesforce authentication successful line and either the Created record or the Duplicate record and Updated record lines.

    kubectl logs -n stacklight <sf-reporter-pod-name>
    
  • sslCertificateMonitoring.domains

  • sslCertificateMonitoring.enabled

  1. In the Prometheus web UI, navigate to Status > Targets.

  2. Verify that the blackbox target contains the configured domains (URLs).

  • storage.componentStorageClasses

  • storage.defaultStorageClass

Verify that the PVCs of the appropriate components have been created according to the configured StorageClass:

kubectl get pvc -n stacklight
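
To see the storage class and capacity of each PVC at a glance, you can also use custom columns, for example:

kubectl get pvc -n stacklight -o=custom-columns=NAME:.metadata.name,\
STORAGECLASS:.spec.storageClassName,CAPACITY:.status.capacity.storage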