Verify StackLight after configuration

This section describes how to verify StackLight after configuring its parameters as described in Configure StackLight and StackLight configuration parameters. Perform the verification procedure for the particular StackLight key that you modified.

To verify StackLight after configuration:

Key

Verification procedure

alerta.enabled

Verify that Alerta is present in the list of StackLight resources. An empty output indicates that Alerta is disabled.

kubectl get all -n stacklight -l app=alerta

  • elasticsearch.retentionTime

  • logging.retentionTime
    Removed in 2.26.0 (17.1.0, 16.1.0)
  1. Verify that the ConfigMap includes the new data. The output should include the changed values.

    kubectl get cm elasticsearch-curator-config -n stacklight --kubeconfig=<pathToKubeconfig> -o yaml
    
  2. Verify that the elasticsearch-curator-<jobID>-<podID> job has successfully completed:

    kubectl logs elasticsearch-curator-<jobID>-<podID> -n stacklight --kubeconfig=<pathToKubeconfig>
    
elasticsearch.logstashRetentionTime
Removed in 2.26.0 (17.1.0, 16.1.0)

Verify that the unit_count parameter contains the desired number of days:

kubectl get cm elasticsearch-curator-config -n \
stacklight -o=jsonpath='{.data.action_file\.yml}'

elasticsearch.persistentVolumeClaimSize

Verify that the capacity of the PVC(s) equals the specified size or exceeds it (in the case of statically provisioned volumes):

kubectl get pvc -n stacklight -l "app=opensearch-master"

grafana.renderer.enabled

Verify the Grafana Image Renderer. If the parameter is set to true, the output should include HTTP Server started, listening at http://localhost:8081.

kubectl logs -f -n stacklight -l app=grafana \
--container grafana-renderer

grafana.homeDashboard

In the Grafana web UI, verify that the desired dashboard is set as a home dashboard.
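
As an optional CLI cross-check, you can grep the Grafana resources for the home dashboard setting. This is a minimal sketch that assumes the home dashboard is wired through Grafana's default_home_dashboard_path option (either in grafana.ini or the corresponding GF_ environment variable); the exact wiring may differ in your StackLight version.

# Sketch: assumes the home dashboard is set through default_home_dashboard_path
kubectl get deployment,cm -n stacklight -l app=grafana -o yaml | \
grep -i home_dashboard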

logging.enabled

Verify that OpenSearch, Fluentd, and OpenSearch Dashboards are present in the list of StackLight resources. An empty output indicates that the StackLight logging stack is disabled.

kubectl get all -n stacklight -l 'app in
(opensearch-master,opensearchDashboards,fluentd-logs)'

logging.level
Removed in 2.26.0 (17.1.0, 16.1.0)
  1. Inspect the fluentd-logs Kubernetes configmap in the stacklight namespace:

    kubectl get cm -n stacklight fluentd-logs \
    -o "jsonpath={.data['output-logs\.conf']}"
    
  2. Grep the output for the following block. The pattern should contain all logging levels below the expected one.

    @type grep
    <exclude>
     key severity_label
     pattern /^<pattern>$/
    </exclude>
    

logging.externalOutputs

  1. Verify the fluentd-logs Kubernetes configmap in the stacklight namespace:

    kubectl get cm -n stacklight fluentd-logs -o \
    "jsonpath={.data['output-logs\.conf']}"
    

    The output must contain an additional output stream according to the configured external outputs.

  2. After the fluentd-logs pods restart, verify that their logs do not contain any delivery error messages. For example:

    kubectl logs -n stacklight -f <fluentd-logs-pod-name> | grep '\[error\]'
    

    Example output with a missing parameter:

    [...]
    2023-07-25 09:39:33 +0000 [error]: config error file="/etc/fluentd/fluent.conf" error_class=Fluent::ConfigError error="host or host_with_port is required"
    

    If a parameter is missing, verify the configuration as described in Enable log forwarding to external destinations.

  3. Verify that the log messages are appearing in the external server database.

To troubleshoot issues with Splunk, refer to No logs are forwarded to Splunk.

logging.externalOutputSecretMounts

Verify that files were created at the specified path in the Fluentd container:

kubectl get pods -n stacklight -o name | grep fluentd-logs | \
xargs -I{} kubectl exec -i {} -c fluentd-logs -n stacklight -- \
ls <logging.externalOutputSecretMounts.mountPath>

logging.syslog.enabled

  1. Verify the fluentd-logs Kubernetes configmap in the stacklight namespace:

    kubectl get cm -n stacklight fluentd-logs -o \
    "jsonpath={.data['output-logs\.conf']}"
    

    The output must contain an additional output section with the remote syslog configuration.

  2. After the fluentd-logs pods restart, verify that their logs do not contain any delivery error messages.

  3. Verify that the log messages are appearing in the remote syslog database.

logging.syslog.packetSize

Verify that the packet_size value has changed according to your customization:

kubectl get cm -n stacklight fluentd-logs -o \
yaml | grep packet_size

logging.metricQueries

For details, see steps 4.2 and 4.3 in Create logs-based metrics.

logging.extraConfig

Verify that the customization has been applied:

kubectl -n stacklight get cm opensearch-master-config -o=jsonpath='{.data}'

Example of system response:

{"opensearch.yml":"cluster.name: opensearch\
\nnetwork.host: 0.0.0.0\
\nplugins.security.disabled: true\
\nplugins.index_state_management.enabled: false\
\npath.data: /usr/share/opensearch/data\
\ncompatibility.override_main_response_version: true\
\ncluster.max_shards_per_node: 5000\n"}

logging.dashboardsExtraConfig

Verify that the customization has been applied:

kubectl -n stacklight get cm opensearch-dashboards -o=jsonpath='{.data}'

Example of system response:

{"opensearch_dashboards.yml":"opensearch.hosts: http://opensearch-master:9200\
\nopensearch.requestTimeout: 60000\
\nopensearchDashboards.defaultAppId: dashboard/2d53aa40-ad1f-11e9-9839-052bda0fdf49\
\nserver:\
\n  host: 0.0.0.0\
\n  name: opensearch-dashboards\n"}

highAvailabilityEnabled

Verify the number of service replicas for the HA or non-HA StackLight mode. For details, see Deployment architecture.

kubectl get sts -n stacklight

  • prometheusServer.queryConcurrency

  • prometheusServer.retentionTime

  • prometheusServer.retentionSize

  • prometheusServer.alertResendDelay

  1. In the Prometheus web UI, navigate to Status > Command-Line Flags.

  2. Verify the values for the following flags:

    • query.max-concurrency

    • storage.tsdb.retention.time

    • storage.tsdb.retention.size

    • rules.alert.resend-delay
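
As an alternative to the web UI, you can inspect the command-line arguments of the prometheus-server container directly, mirroring the jsonpath pattern used elsewhere in this section. This is a sketch that assumes the flags are passed as container arguments; the pod and container names match those used in the prometheusServer.remoteWriteSecretMounts verification below.

# Sketch: print the args of the prometheus-server container
kubectl get pod prometheus-server-0 -n stacklight \
-o=jsonpath='{.spec.containers[?(@.name=="prometheus-server")].args}' | jq .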

prometheusServer.alertsCommonLabels

  1. In the Prometheus web UI, navigate to Status > Configuration.

  2. Verify that the alerting.alert_relabel_configs section contains the customization for common labels that you added in prometheusServer.alertsCommonLabels during StackLight configuration.
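
You can also fetch the running configuration through the Prometheus HTTP API. The commands below are a sketch and assume that the Prometheus container listens on its default port 9090.

# Terminal 1: forward the Prometheus API port (9090 is the Prometheus default)
kubectl port-forward -n stacklight pod/prometheus-server-0 9090:9090
# Terminal 2: print the alert_relabel_configs section of the running configuration
curl -s http://localhost:9090/api/v1/status/config | jq -r '.data.yaml' | \
grep -A 10 alert_relabel_configs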

prometheusServer.persistentVolumeClaimSize

Verify that the capacity of the PVC(s) equals the specified size or exceeds it (in the case of statically provisioned volumes):

kubectl get pvc -n stacklight -l "app=prometheus,component=server"

prometheusServer.customRecordingRules

  1. In the Prometheus web UI, navigate to Status > Rules.

  2. Verify that the list of Prometheus recording rules has changed according to your customization.
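
Alternatively, list the loaded recording rules through the Prometheus HTTP API. This sketch assumes a port-forward to the prometheus-server-0 pod on the default port 9090, as shown for prometheusServer.alertsCommonLabels above.

# Assumes: kubectl port-forward -n stacklight pod/prometheus-server-0 9090:9090
curl -s http://localhost:9090/api/v1/rules | \
jq -r '.data.groups[].rules[] | select(.type=="recording") | .name'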

prometheusServer.customScrapeConfigs

  1. In the Prometheus web UI, navigate to Status > Targets.

  2. Verify that the required target has appeared in the list of targets.

It may take up to 10 minutes for the change to apply.
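
As a CLI alternative, you can list the active scrape pools through the Prometheus HTTP API and verify that your custom scrape job appears among them. This sketch assumes a port-forward to the prometheus-server-0 pod on the default port 9090, as shown for prometheusServer.alertsCommonLabels above.

# Assumes: kubectl port-forward -n stacklight pod/prometheus-server-0 9090:9090
curl -s http://localhost:9090/api/v1/targets | \
jq -r '.data.activeTargets[].scrapePool' | sort -u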

prometheusServer.remoteWriteSecretMounts

Verify that files were created at the specified path in the Prometheus container:

kubectl exec -it prometheus-server-0 -c prometheus-server -n \
stacklight -- ls <remoteWriteSecretMounts.mountPath>

prometheusServer.remoteWrites

  1. Inspect the remote_write configuration in the Status > Configuration section of the Prometheus web UI.

  2. Inspect the Prometheus server logs for errors:

    kubectl logs prometheus-server-0 prometheus-server -n stacklight
    
  • prometheusRelay.clientTimeout

  • prometheusRelay.responseLimitBytes

  1. Verify that the Prometheus Relay pod is up and running:

    kubectl get pods -n stacklight -l 'component=relay'
    
  2. Verify that the values have changed according to your customization:

    kubectl get pods -n stacklight <prometheus-relay-pod-name> \
    -o=jsonpath='{.spec.containers[0].env}' | jq .
    
  • clusterSize

  • resourcesPerClusterSize

  • resources

  1. Obtain the list of pods:

    kubectl get po -n stacklight
    
  2. Verify that the desired resource limits or requests are set in the resources section of every container in the pod:

    kubectl get po <pod_name> -n stacklight -o yaml
    
  • nodeSelector.default

  • nodeSelector.component

  • tolerations.default

  • tolerations.component

Verify that the pods of the appropriate components are located on the intended nodes:

kubectl get pod -o=custom-columns=NAME:.metadata.name,\
STATUS:.status.phase,NODE:.spec.nodeName -n stacklight

nodeExporter.netDeviceExclude

  1. Verify the DaemonSet configuration of the Node Exporter:

    kubectl get daemonset -n stacklight prometheus-node-exporter \
    -o=jsonpath='{.spec.template.spec.containers[0].args}' | jq .
    

    Expected system response:

    [
      "--path.procfs=/host/proc",
      "--path.sysfs=/host/sys",
      "--collector.netclass.ignored-devices=<paste_your_excluding_regexp_here>",
      "--collector.netdev.device-blacklist=<paste_your_excluding_regexp_here>",
      "--no-collector.ipvs"
    ]
    
  2. In the Prometheus web UI, run the following PromQL query. The expected result is 1.

    absent(node_network_transmit_bytes_total{device=~"<paste_your_excluding_regexp_here>"})
    

nodeExporter.extraCollectorsEnabled

In the Prometheus web UI, run the following PromQL queries. The result should not be empty.

node_scrape_collector_duration_seconds{collector="<COLLECTOR_NAME>"}
node_scrape_collector_success{collector="<COLLECTOR_NAME>"}

blackboxExporter.customModules

  1. Verify that your module is present in the list of modules. It can take up to 10 minutes for the module to appear in the ConfigMap.

    kubectl get cm prometheus-blackbox-exporter -n stacklight \
    -o=jsonpath='{.data.blackbox\.yaml}'
    
  2. Review the configmap-reload container logs to verify that the reload happened successfully. It can take up to one minute for the reload to happen after the module appears in the ConfigMap.

    kubectl logs -l app=prometheus-blackbox-exporter -n stacklight -c \
    configmap-reload
    

blackboxExporter.timeoutOffset

Verify that the args parameter of the blackbox-exporter container contains the specified --timeout-offset:

kubectl get deployment.apps/prometheus-blackbox-exporter -n stacklight \
-o=jsonpath='{.spec.template.spec.containers[?(@.name=="blackbox-exporter")].args}'

For example, for blackboxExporter.timeoutOffset set to 0.1, the output should include ["--config.file=/config/blackbox.yaml","--timeout-offset=0.1"]. It can take up to 10 minutes for the parameter to be populated.

  • storage.defaultStorageClass

  • storage.componentStorageClasses

Verify that the PVCs of the appropriate components have been created according to the configured StorageClass:

kubectl get pvc -n stacklight

  • refapp.enabled
    Available since 2.21.0 for non-MOSK clusters

  • refapp.workload.storageClassName

  1. In the Grafana web UI, verify that the Reference Application dashboard exists and that its graphs display data.

  2. Verify that MariaDB PVCs are allocated according to the configured StorageClass:

    kubectl get pvc -n stacklight
    
  • sfReporter.enabled

  • sfReporter.salesForce

  • sfReporter.cronjob

  1. Verify that Salesforce reporter is enabled. The SUSPEND field in the output must be False.

    kubectl get cronjob -n stacklight
    
  2. Verify that the Salesforce reporter configuration includes all expected queries:

    kubectl get configmap -n stacklight \
    sf-reporter-config -o yaml
    
  3. After the cron job executes (by default, at midnight server time), obtain the Salesforce reporter pod name. The output should include the Salesforce reporter pod name, and its STATUS must be Completed.

    kubectl get pods -n stacklight
    
  4. Verify that the Salesforce reporter successfully authenticates to Salesforce and creates records. The output must include the Salesforce authentication successful line, as well as either the Created record line or the Duplicate record and Updated record lines.

    kubectl logs -n stacklight <sf-reporter-pod-name>
    

ceph.enabled

  1. In the Grafana web UI, verify that Ceph dashboards are present in the list of dashboards and are populated with data.

  2. In the Prometheus web UI, click Alerts and verify that the list of alerts contains Ceph* alerts.
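
Optionally, confirm in the Prometheus web UI that Ceph metrics are being collected by running a query such as the following. A non-empty result indicates that Ceph monitoring data is available (ceph_health_status is a metric commonly exposed by the Ceph Prometheus exporter; the exact metric set may vary by Ceph release).

ceph_health_status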

  • externalEndpointMonitoring.enabled

  • externalEndpointMonitoring.domains

  1. In the Prometheus web UI, navigate to Status > Targets.

  2. Verify that the blackbox-external-endpoint target contains the configured domains (URLs).
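
You can also run the following PromQL query in the Prometheus web UI for each configured domain. The expected result is 1 for every successfully probed endpoint. Note that the label carrying the probed URL (instance in this sketch) is an assumption and may differ depending on the scrape configuration.

probe_success{instance=~".*<your_configured_domain>.*"}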

  • ironic.endpoint

  • ironic.insecure

In the Grafana web UI, verify that the Ironic BM dashboard displays valid data (no false-positive or empty panels).

metricFilter

  1. In the Prometheus web UI, navigate to Status > Configuration.

  2. Verify that the following fields in the metric_relabel_configs section for the kubernetes-nodes-cadvisor and prometheus-kube-state-metrics scrape jobs have the required configuration:

    • action is set to keep or drop

    • regex contains a regular expression with configured namespaces delimited by |

    • source_labels is set to [namespace]

  • sslCertificateMonitoring.enabled

  • sslCertificateMonitoring.domains

  1. In the Prometheus web UI, navigate to Status > Targets.

  2. Verify that the blackbox target contains the configured domains (URLs).
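
In addition, you can query the certificate expiry metric exposed by the Blackbox Exporter in the Prometheus web UI. If present, the result is the expiry time of the earliest certificate as a Unix timestamp. The instance label used below is an assumption about how the probed domain is labeled.

probe_ssl_earliest_cert_expiry{instance=~".*<your_configured_domain>.*"}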

mke.enabled

  1. In the Grafana web UI, verify that the MKE Cluster and MKE Containers dashboards are present and not empty.

  2. In the Prometheus web UI, navigate to Alerts and verify that the MKE* alerts are present in the list of alerts.

mke.dockerdDataRoot

In the Prometheus web UI, navigate to Alerts and verify that the MKEAPIDown alert is not firing as a false positive due to a missing certificate.

prometheusServer.customAlerts

In the Prometheus web UI, navigate to Alerts and verify that the list of alerts has changed according to your customization.

prometheusServer.watchDogAlertEnabled

In the Prometheus web UI, navigate to Alerts and verify that the list of alerts contains the Watchdog alert.
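
You can also confirm that the alert is active by running the following PromQL query in the Prometheus web UI. The Watchdog alert fires continuously by design, so the expected result is a series with alertstate="firing".

ALERTS{alertname="Watchdog"}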

alertmanagerSimpleConfig.genericReceivers

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the intended receiver(s).

alertmanagerSimpleConfig.genericRoutes

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the intended route(s).
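
As a CLI alternative for both alertmanagerSimpleConfig.genericReceivers and alertmanagerSimpleConfig.genericRoutes, you can grep the rendered Alertmanager configuration in the same ConfigMap that is used for the inhibit_rules verification below. The -A value is only a suggestion; adjust it to see the full section.

kubectl get cm -n stacklight prometheus-alertmanager -o \
yaml | grep -A 10 -E 'receivers:|routes:'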

alertmanagerSimpleConfig.inhibitRules.enabled

Run the following command. An empty output indicates either a failure or that the feature is disabled.

kubectl get cm -n stacklight prometheus-alertmanager -o \
yaml | grep -A 6 inhibit_rules

  • alertmanagerSimpleConfig.email.enabled

  • alertmanagerSimpleConfig.email

  • alertmanagerSimpleConfig.email.route

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the Email receiver and route.

  • alertmanagerSimpleConfig.salesForce.enabled

  • alertmanagerSimpleConfig.salesForce.auth

  • alertmanagerSimpleConfig.salesForce.route

  1. Verify that sf-notifier is enabled. The output must include the sf-notifier pod name, 1/1 in the READY field and Running in the STATUS field.

    kubectl get pods -n stacklight
    
  2. Verify that sf-notifier successfully authenticates to Salesforce. The output must include the Salesforce authentication successful line.

    kubectl logs -f -n stacklight <sf-notifier-pod-name>
    
  3. In the Alertmanager web UI, navigate to Status and verify that the Config section contains the HTTP-salesforce receiver and route.

alertmanagerSimpleConfig.salesForce.feed_enabled

  • Verify that the sf-notifier pod logs include Creating feed item messages. For such messages to appear in the logs, the DEBUG logging level must be enabled.

  • Verify through Salesforce:

    1. Log in to the Salesforce web UI.

    2. Click the Feed tab for a case created by sf-notifier.

    3. Verify that All Messages gets updated.

alertmanagerSimpleConfig.salesForce.link_prometheus

Verify that SF_NOTIFIER_ADD_LINKS has changed to true or false according to your customization:

kubectl get deployment sf-notifier -n stacklight \
-o=jsonpath='{.spec.template.spec.containers[0].env}' | jq .

  • alertmanagerSimpleConfig.slack.enabled

  • alertmanagerSimpleConfig.slack.api_url

  • alertmanagerSimpleConfig.slack.channel

  • alertmanagerSimpleConfig.slack.route

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the HTTP-slack receiver and route.

  • alertmanagerSimpleConfig.msteams.enabled

  • alertmanagerSimpleConfig.msteams.url

  • alertmanagerSimpleConfig.msteams.route

  1. Verify that the Prometheus Microsoft Teams pod is up and running:

    kubectl get pods -n stacklight -l \
    'app=prometheus-msteams'
    
  2. Verify that the Prometheus Microsoft Teams pod logs have no errors:

    kubectl logs -f -n stacklight -l \
    'app=prometheus-msteams'
    
  3. Verify that notifications are being sent to the Microsoft Teams channel.

alertmanagerSimpleConfig.serviceNow

  1. Verify that the alertmanager-webhook-servicenow pod is up and running:

    kubectl get pods -n stacklight -l \
    'app=alertmanager-webhook-servicenow'
    
  2. Verify that authentication to ServiceNow was successful. The output should include ServiceNow authentication successful. In case of authentication failure, the ServiceNowAuthFailure alert is raised.

    kubectl logs -f -n stacklight \
    <alertmanager-webhook-servicenow-pod-name>
    
  3. In your ServiceNow instance, verify that the Watchdog alert appears in the Incident table. Once the incident is created, the pod logs should include a line similar to Created Incident: bef260671bdb2010d7b540c6cc4bcbed.

In case of any failure:

  • Verify that your ServiceNow instance is not in hibernation.

  • Verify that the service user credentials, table name, and alert_id_field are correct.

  • Verify that the ServiceNow user has access to the table with permission to read, create, and update records.