alerta.enabled
|
Verify that Alerta is present in the list of StackLight resources. An
empty output indicates that Alerta is disabled.
kubectl get all -n stacklight -l app=alerta
|
alertmanagerSimpleConfig.email
alertmanagerSimpleConfig.email.enabled
alertmanagerSimpleConfig.email.route
|
In the Alertmanager web UI, navigate to Status and verify
that the Config section contains the Email receiver and
route.
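Alternatively, a quick CLI check, assuming the rendered Alertmanager
configuration is stored in the prometheus-alertmanager ConfigMap (as in
the inhibitRules check below) and uses the standard email_configs key:
kubectl get cm -n stacklight prometheus-alertmanager -o \
yaml | grep -A 6 email_configs
|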
alertmanagerSimpleConfig.genericReceivers
|
In the Alertmanager web UI, navigate to Status and verify
that the Config section contains the intended receiver(s). |
alertmanagerSimpleConfig.genericRoutes
|
In the Alertmanager web UI, navigate to Status and verify
that the Config section contains the intended route(s). |
alertmanagerSimpleConfig.inhibitRules.enabled
|
Run the following command. An empty output indicates either a failure
or that the feature is disabled.
kubectl get cm -n stacklight prometheus-alertmanager -o \
yaml | grep -A 6 inhibit_rules
|
alertmanagerSimpleConfig.msteams.enabled
alertmanagerSimpleConfig.msteams.url
alertmanagerSimpleConfig.msteams.route
|
Verify that the Prometheus Microsoft Teams pod is up and running:
kubectl get pods -n stacklight -l \
'app=prometheus-msteams'
Verify that the Prometheus Microsoft Teams pod logs have no errors:
kubectl logs -f -n stacklight -l \
'app=prometheus-msteams'
Verify that notifications are being sent to the Microsoft Teams
channel.
|
alertmanagerSimpleConfig.salesForce.enabled
alertmanagerSimpleConfig.salesForce.auth
alertmanagerSimpleConfig.salesForce.route
|
Verify that sf-notifier is enabled. The output must include the
sf-notifier pod name, 1/1 in the READY field and
Running in the STATUS field.
kubectl get pods -n stacklight
Verify that sf-notifier successfully authenticates to Salesforce.
The output must include the Salesforce authentication successful
line.
kubectl logs -f -n stacklight <sf-notifier-pod-name>
In the Alertmanager web UI, navigate to Status and verify
that the Config section contains the HTTP-salesforce
receiver and route.
|
alertmanagerSimpleConfig.salesForce.feed_enabled
|
Verify that the sf-notifier pod logs include Creating feed item
messages. For such messages to appear in the logs, the DEBUG logging
level must be enabled.
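For example, to check the logs for these messages directly:
kubectl logs -n stacklight <sf-notifier-pod-name> | grep 'Creating feed item'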
Verify through Salesforce:
Log in to the Salesforce web UI.
Click the Feed tab for a case created by
sf-notifier.
Verify that All Messages gets updated.
|
alertmanagerSimpleConfig.salesForce.link_prometheus
|
Verify that SF_NOTIFIER_ADD_LINKS has changed to true or
false according to your customization:
kubectl get deployment sf-notifier \
-o=jsonpath='{.spec.template.spec.containers[0].env}' | jq .
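To narrow the output to this single variable, you can, for example,
extend the jq filter:
kubectl get deployment sf-notifier \
-o=jsonpath='{.spec.template.spec.containers[0].env}' | \
jq '.[] | select(.name=="SF_NOTIFIER_ADD_LINKS")'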
|
alertmanagerSimpleConfig.serviceNow
|
Verify that the alertmanager-webhook-servicenow pod is up and
running:
kubectl get pods -n stacklight -l \
'app=alertmanager-webhook-servicenow'
Verify that authentication to ServiceNow was successful. The output
should include ServiceNow authentication successful. In case of an
authentication failure, the ServiceNowAuthFailure alert is raised.
kubectl logs -f -n stacklight \
<alertmanager-webhook-servicenow-pod-name>
In your ServiceNow instance, verify that the Watchdog
alert appears in the Incident table. Once the incident is
created, the pod logs should include a line similar to
Created Incident: bef260671bdb2010d7b540c6cc4bcbed.
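For example, to search the pod logs for this confirmation line:
kubectl logs -n stacklight \
<alertmanager-webhook-servicenow-pod-name> | grep 'Created Incident'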
In case of any failure:
Verify that your ServiceNow instance is not in hibernation.
Verify that the service user credentials, table name, and
alert_id_field are correct.
Verify that the ServiceNow user has access to the table with
permission to read, create, and update records.
|
alertmanagerSimpleConfig.slack.enabled
alertmanagerSimpleConfig.slack.api_url
alertmanagerSimpleConfig.slack.channel
alertmanagerSimpleConfig.slack.route
|
In the Alertmanager web UI, navigate to Status and verify
that the Config section contains the HTTP-slack receiver
and route. |
blackboxExporter.customModules
|
Verify that your module is present in the list of modules. It can
take up to 10 minutes for the module to appear in the ConfigMap.
kubectl get cm prometheus-blackbox-exporter -n stacklight \
-o=jsonpath='{.data.blackbox\.yaml}'
Review the configmap-reload container logs to verify that the
reload happened successfully. It can take up to 1 minute for the
reload to happen after the module appears in the ConfigMap.
kubectl logs -l app=prometheus-blackbox-exporter -n stacklight -c \
configmap-reload
|
blackboxExporter.timeoutOffset
|
Verify that the args parameter of the blackbox-exporter
container contains the specified --timeout-offset:
kubectl get deployment.apps/prometheus-blackbox-exporter -n stacklight \
-o=jsonpath='{.spec.template.spec.containers[?(@.name=="blackbox-exporter")].args}'
For example, for blackboxExporter.timeoutOffset set to 0.1, the
output should include
["--config.file=/config/blackbox.yaml","--timeout-offset=0.1"].
It can take up to 10 minutes for the parameter to be populated.
|
ceph.enabled
|
In the Grafana web UI, verify that Ceph dashboards are present in the
list of dashboards and are populated with data.
In the Prometheus web UI, click Alerts and verify that
the list of alerts contains Ceph* alerts.
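As an additional check, run a PromQL query in the Prometheus web UI;
the example below assumes the standard ceph_health_status metric
exposed by the Ceph Manager Prometheus module, and a non-empty result
indicates that Ceph metrics are being collected:
ceph_health_status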
|
|
Obtain the list of pods:
kubectl get po -n stacklight
Verify that the desired resource limits or requests are set in the
resources section of every container in the pod:
kubectl get po <pod_name> -n stacklight -o yaml
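For example, to print only the resources section of every container
in a pod (a jsonpath variant of the command above):
kubectl get po <pod_name> -n stacklight \
-o=jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources}{"\n"}{end}'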
|
elasticsearch.logstashRetentionTime
Removed in MCC 2.26.0 (17.1.0, 16.1.0)
|
Verify that the unit_count parameter contains the desired number of
days:
kubectl get cm elasticsearch-curator-config -n \
stacklight -o=jsonpath='{.data.action_file\.yml}'
|
elasticsearch.persistentVolumeClaimSize
|
Verify that the capacity of the PVC(s) equals the specified size or
exceeds it (the latter is possible for statically provisioned volumes):
kubectl get pvc -n stacklight -l "app=opensearch-master"
|
|
Verify that the ConfigMap includes the new data. The output should
include the changed values.
kubectl get cm elasticsearch-curator-config -n stacklight --kubeconfig=<pathToKubeconfig> -o yaml
Verify that the elasticsearch-curator-<jobID>-<podID> job has
successfully completed:
kubectl logs elasticsearch-curator-<jobID>-<podID> -n stacklight --kubeconfig=<pathToKubeconfig>
|
|
In the Prometheus web UI, navigate to Status -> Targets.
Verify that the blackbox-external-endpoint target contains the
configured domains (URLs).
|
grafana.homeDashboard
|
In the Grafana web UI, verify that the desired dashboard is set as a
home dashboard. |
grafana.renderer.enabled
Removed in MCC 2.27.0 (17.2.0, 16.2.0)
|
Verify the Grafana Image Renderer. If set to true, the output should
include HTTP Server started, listening at http://localhost:8081.
kubectl logs -f -n stacklight -l app=grafana \
--container grafana-renderer
|
highAvailabilityEnabled
|
Verify the number of service replicas for the HA or non-HA StackLight
mode. For details, see Deployment architecture.
kubectl get sts -n stacklight
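For example, to display only the configured replica counts:
kubectl get sts -n stacklight \
-o=custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas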
|
ironic.endpoint
ironic.insecure
|
In the Grafana web UI, verify that the Ironic BM dashboard
displays valuable data (no false-positive or empty panels). |
logging.dashboardsExtraConfig
|
Verify that the customization has applied:
kubectl -n stacklight get cm opensearch-dashboards -o=jsonpath='{.data}'
Example of system response:
{"opensearch_dashboards.yml":"opensearch.hosts: http://opensearch-master:9200\
\nopensearch.requestTimeout: 60000\
\nopensearchDashboards.defaultAppId: dashboard/2d53aa40-ad1f-11e9-9839-052bda0fdf49\
\nserver:\
\n host: 0.0.0.0\
\n name: opensearch-dashboards\n"}
|
logging.enabled
|
Verify that OpenSearch, Fluentd, and OpenSearch Dashboards are present
in the list of StackLight resources. An empty output indicates that the
StackLight logging stack is disabled.
kubectl get all -n stacklight -l 'app in
(opensearch-master,opensearchDashboards,fluentd-logs)'
|
logging.externalOutputs
|
Verify the fluentd-logs Kubernetes configmap in the
stacklight namespace:
kubectl get cm -n stacklight fluentd-logs -o \
"jsonpath={.data['output-logs\.conf']}"
The output must contain an additional output stream according to the
configured external outputs.
After the fluentd-logs pods restart, verify that
their logs do not contain any delivery error messages. For example:
kubectl logs -n stacklight -f <fluentd-logs-pod-name> | grep '\[error\]'
Example output with a missing parameter:
[...]
2023-07-25 09:39:33 +0000 [error]: config error file="/etc/fluentd/fluent.conf" error_class=Fluent::ConfigError error="host or host_with_port is required"
If a parameter is missing, verify the configuration as described in
Enable log forwarding to external destinations.
Verify that the log messages are appearing in the external server
database.
To troubleshoot issues with Splunk, refer to No logs are forwarded to Splunk.
|
logging.externalOutputSecretMounts
|
Verify that files were created for the specified path in the Fluentd
container:
kubectl get pods -n stacklight -o name | grep fluentd-logs | \
xargs -I{} kubectl exec -i {} -c fluentd-logs -n stacklight -- \
ls <logging.externalOutputSecretMounts.mountPath>
|
logging.extraConfig
|
Verify that the customization has applied:
kubectl -n stacklight get cm opensearch-master-config -o=jsonpath='{.data}'
Example of system response:
{"opensearch.yml":"cluster.name: opensearch\
\nnetwork.host: 0.0.0.0\
\nplugins.security.disabled: true\
\nplugins.index_state_management.enabled: false\
\npath.data: /usr/share/opensearch/data\
\ncompatibility.override_main_response_version: true\
\ncluster.max_shards_per_node: 5000\n"}
|
logging.level
Removed in MCC 2.26.0 (17.1.0, 16.1.0)
|
Inspect the fluentd-logs Kubernetes configmap in the
stacklight namespace:
kubectl get cm -n stacklight fluentd-logs \
-o "jsonpath={.data['output-logs\.conf']}"
Verify that the output contains a grep filter similar to the
following. The pattern should contain all logging levels below the
expected one.
@type grep
<exclude>
key severity_label
pattern /^<pattern>$/
</exclude>
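For example, to extract this filter from the ConfigMap output (the
-B and -A values are an assumption about the filter layout):
kubectl get cm -n stacklight fluentd-logs \
-o "jsonpath={.data['output-logs\.conf']}" | grep -B 2 -A 2 severity_label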
|
logging.metricQueries
|
For details, see steps 4.2 and 4.3 in Create logs-based metrics. |
logging.syslog.enabled
|
Verify the fluentd-logs Kubernetes configmap in the
stacklight namespace:
kubectl get cm -n stacklight fluentd-logs -o \
"jsonpath={.data['output-logs\.conf']}"
The output must contain an additional output section with the remote
syslog configuration.
After the fluentd-logs pods restart, verify that
their logs do not contain any delivery error messages.
Verify that the log messages are appearing in the remote syslog
database.
|
logging.syslog.packetSize
|
Verify that the packetSize has changed according to your
customization:
kubectl get cm -n stacklight fluentd-logs -o \
yaml | grep packet_size
|
metricFilter
|
In the Prometheus web UI, navigate to
Status > Configuration.
Verify that the following fields in the metric_relabel_configs
section for the kubernetes-nodes-cadvisor and
prometheus-kube-state-metrics scrape jobs have the required
configuration:
action is set to keep or drop
regex contains a regular expression with configured namespaces
delimited by |
source_labels is set to [namespace]
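For illustration only, such an entry might look as follows; the
namespaces shown are hypothetical and depend on your metricFilter
configuration:
metric_relabel_configs:
- source_labels: [namespace]
  regex: kube-system|stacklight
  action: keep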
|
mke.dockerdDataRoot
|
In the Prometheus web UI, navigate to Alerts and verify that
the MKEAPIDown alert is not firing false-positively due to a missing
certificate. |
mke.enabled
|
In the Grafana web UI, verify that the MKE Cluster and
MKE Containers dashboards are present and not empty.
In the Prometheus web UI, navigate to Alerts and verify
that the MKE* alerts are present in the list of alerts.
|
nodeExporter.extraCollectorsEnabled
|
In the Prometheus web UI, run the following PromQL queries. The result
should not be empty.
node_scrape_collector_duration_seconds{collector="<COLLECTOR_NAME>"}
node_scrape_collector_success{collector="<COLLECTOR_NAME>"}
|
nodeExporter.netDeviceExclude
|
Verify the DaemonSet configuration of the Node Exporter:
kubectl get daemonset -n stacklight prometheus-node-exporter \
-o=jsonpath='{.spec.template.spec.containers[0].args}' | jq .
Expected system response:
[
"--path.procfs=/host/proc",
"--path.sysfs=/host/sys",
"--collector.netclass.ignored-devices=<paste_your_excluding_regexp_here>",
"--collector.netdev.device-blacklist=<paste_your_excluding_regexp_here>",
"--no-collector.ipvs"
]
In the Prometheus web UI, run the following PromQL query. The
expected result is 1.
absent(node_network_transmit_bytes_total{device=~"<paste_your_excluding_regexp_here>"})
|
nodeSelector.component
nodeSelector.default
tolerations.component
tolerations.default
|
Verify that the pods of the appropriate components are located on the
intended nodes:
kubectl get pod -o=custom-columns=NAME:.metadata.name,\
STATUS:.status.phase,NODE:.spec.nodeName -n stacklight
|
|
Verify that the Prometheus Relay pod is up and running:
kubectl get pods -n stacklight -l 'component=relay'
Verify that the values have changed according to your customization:
kubectl get pods -n stacklight <prometheus-relay-pod-name> \
-o=jsonpath='{.spec.containers[0].env}' | jq .
|
prometheusServer.alertsCommonLabels
|
In the Prometheus web UI, navigate to Status > Configuration.
Verify that the alerting.alert_relabel_configs section contains
the customization for common labels that you added in
prometheusServer.alertsCommonLabels during StackLight configuration.
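For illustration only, an alert_relabel_configs entry that adds a
static common label might look as follows; the label name and value
are hypothetical and depend on your customization:
alerting:
  alert_relabel_configs:
  - action: replace
    target_label: cluster
    replacement: my-cluster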
|
prometheusServer.customAlerts
|
In the Prometheus web UI, navigate to Alerts and verify that
the list of alerts has changed according to your customization. |
prometheusServer.customRecordingRules
|
In the Prometheus web UI, navigate to Status > Rules.
Verify that the list of Prometheus recording rules has changed
according to your customization.
|
prometheusServer.customScrapeConfigs
|
In the Prometheus web UI, navigate to Status > Targets.
Verify that the required target has appeared in the list of targets.
It may take up to 10 minutes for the change to apply.
|
prometheusServer.persistentVolumeClaimSize
|
Verify that the capacity of the PVC(s) equals the specified size or
exceeds it (the latter is possible for statically provisioned volumes):
kubectl get pvc -n stacklight -l "app=prometheus,component=server"
|
prometheusServer.alertResendDelay
prometheusServer.queryConcurrency
prometheusServer.retentionSize
prometheusServer.retentionTime
|
In the Prometheus web UI, navigate to
Status > Command-Line Flags.
Verify the values for the following flags:
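The flag names below assume the standard Prometheus server options
for these parameters:
--rules.alert.resend-delay
--query.max-concurrency
--storage.tsdb.retention.size
--storage.tsdb.retention.time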
|
prometheusServer.remoteWrites
|
Inspect the remote_write configuration in the
Status > Configuration section of the Prometheus web UI.
Inspect the Prometheus server logs for errors:
kubectl logs prometheus-server-0 prometheus-server -n stacklight
|
prometheusServer.remoteWriteSecretMounts
|
Verify that files were created for the specified path in the Prometheus
container:
kubectl exec -it prometheus-server-0 -c prometheus-server -n \
stacklight -- ls <remoteWriteSecretMounts.mountPath>
|
prometheusServer.watchDogAlertEnabled
|
In the Prometheus web UI, navigate to Alerts and verify that
the list of alerts contains the Watchdog alert. |
sfReporter.cronjob
sfReporter.enabled
sfReporter.salesForce
|
Verify that Salesforce reporter is enabled. The SUSPEND field in
the output must be False.
kubectl get cronjob -n stacklight
Verify that the Salesforce reporter configuration includes all
expected queries:
kubectl get configmap -n stacklight \
sf-reporter-config -o yaml
After the cron job executes (by default, at midnight server time),
obtain the Salesforce reporter pod name. The output should include
the Salesforce reporter pod name with Completed in the
STATUS field.
kubectl get pods -n stacklight
Verify that Salesforce reporter successfully authenticates to
Salesforce and creates records. The output must include the
Salesforce authentication successful, Created record or
Duplicate record and Updated record lines.
kubectl logs -n stacklight <sf-reporter-pod-name>
|
|
In the Prometheus web UI, navigate to Status -> Targets.
Verify that the blackbox target contains the configured domains
(URLs).
|
|
Verify that the PVCs of the appropriate components have been created
according to the configured StorageClass:
kubectl get pvc -n stacklight
|