StackLight configuration parameters

This section describes the StackLight configuration keys that you can specify in the values section to change StackLight settings as required. Prior to making any changes to StackLight configuration, perform the steps described in StackLight configuration procedure. After changing StackLight configuration, verify the changes as described in Verify StackLight after configuration.

Important

Some parameters are marked as mandatory. Failure to specify values for such parameters causes the Admission Controller to reject cluster creation.


Alerta

Key

Description

Example values

alerta.enabled (bool)

Enables or disables Alerta. Set to true by default.

true or false

Elasticsearch

Key

Description

Example values

elasticsearch.retentionTime (map)

Specifies the retention time per index. Includes the following parameters:

  • logstash - specifies the logstash-* index retention time.

  • events - specifies the kubernetes_events-* index retention time.

  • notifications - specifies the notification-* index retention time.

The allowed values include integers (days) and numbers with suffixes: y, m, w, d, h, including capital letters.

By default, values set in elasticsearch.logstashRetentionTime are used. However, the elasticsearch.retentionTime parameters, if defined, take precedence over elasticsearch.logstashRetentionTime.

elasticsearch:
  retentionTime:
    logstash: 3
    events: "2w"
    notifications: "1M"

elasticsearch.logstashRetentionTime (int) Deprecated since 2.16.0

Defines the Elasticsearch logstash-* index retention time in days. The logstash-* index stores all logs gathered from all nodes and containers. Set to 1 by default.

Note

Due to the known issue 27732-2, a custom setting for this parameter is dismissed during cluster deployment and changes to one day (default). Refer to the known issue description for the affected Cluster releases and available workaround.

1, 5, 15

elasticsearch.persistentVolumeClaimSize (string) Mandatory

Specifies the OpenSearch (Elasticsearch) PVC(s) size. The number of PVCs depends on the StackLight database mode. For HA, three PVCs are created, each of the size specified in this parameter. For non-HA, one PVC of the specified size is created.

Important

You cannot modify this parameter after cluster creation.

Note

Due to the known issue 27732-1, the OpenSearch PVC size configuration is dismissed during a cluster deployment. Refer to the known issue description for affected Cluster releases and available workarounds.

elasticsearch:
  persistentVolumeClaimSize: 30Gi

Grafana

Key

Description

Example values

grafana.renderer.enabled (bool)

Enables or disables Grafana Image Renderer. Enabled by default. Disabling it may be useful, for example, in resource-limited environments.

true or false

grafana.homeDashboard (string)

Defines the home dashboard. Set to kubernetes-cluster by default. You can define any of the available dashboards.

kubernetes-cluster
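
As an illustration, the two Grafana keys above might be combined as follows. This is a minimal sketch that assumes the dotted key names map to nested YAML keys in the values section, as in the other examples in this section:

```yaml
grafana:
  renderer:
    enabled: false                     # disable Grafana Image Renderer, for example, on a resource-limited cluster
  homeDashboard: kubernetes-cluster    # the documented default home dashboard
```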

Logging

Key

Description

Example values

logging.enabled (bool) Mandatory

Enables or disables the StackLight logging stack. For details about the logging components, see Deployment architecture. Set to true by default. On management and regional clusters, true is mandatory.

true or false

logging.level (string)

Sets the least important level of log messages to send to OpenSearch. Requires logging.enabled set to true.

The default logging level is INFO, meaning that StackLight will drop log messages for the lower DEBUG and TRACE levels. Levels from WARNING to EMERGENCY require attention.

Note

The FLUENTD_ERROR logs are of special type and cannot be dropped.

  • TRACE - the most verbose logs. Such level generates large amounts of data.

  • DEBUG - messages typically of use only for debugging purposes.

  • INFO - informational messages describing common processes such as service starting or stopping. Can be ignored during normal system operation but may provide additional input for investigation.

  • NOTICE - normal but significant conditions that may require special handling.

  • WARNING - messages on unexpected conditions that may require attention.

  • ERROR - messages on error conditions that prevent normal system operation and require action.

  • CRITICAL - messages on critical conditions indicating that a service is not working or working incorrectly.

  • ALERT - messages on severe events indicating that action is needed immediately.

  • EMERGENCY - messages indicating that a service is unusable.
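
For example, to send only messages of the WARNING level and higher to OpenSearch, a configuration along the following lines could be used. This is a sketch only: the nesting is assumed from the dotted key names, and the levels are assumed to be listed above in order of increasing importance:

```yaml
logging:
  enabled: true     # logging.level requires the logging stack to be enabled
  level: WARNING    # messages of the NOTICE, INFO, DEBUG, and TRACE levels are dropped
```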

logging.cerebro (bool)

Enables or disables Cerebro, a web UI for interacting with the OpenSearch cluster that stores logs. To access the Cerebro web UI, see Access OpenSearch clusters using Cerebro.

Note

Prior to enabling Cerebro, verify that your Container Cloud cluster has a minimum of 0.5-1 GB of free RAM and 1 vCPU available.

true or false

logging.metricQueries (map)

Allows configuring OpenSearch queries for the data present in OpenSearch. Prometheus Elasticsearch Exporter then queries the OpenSearch database and exposes such metrics in the Prometheus format. For details, see Create logs-based metrics. Includes the following parameters:

  • indices - specifies the index pattern

  • interval and timeout - specify in seconds how often to send the query to OpenSearch and how long it can last before timing out

  • onError and onMissing - modify the prometheus-es-exporter behavior on query error and missing index. For details, see Prometheus Elasticsearch Exporter.

For usage example, see Create logs-based metrics.

logging.retentionTime (map)

Specifies the retention time per index. Includes the following parameters:

  • logstash - specifies the logstash-* index retention time.

  • events - specifies the kubernetes_events-* index retention time.

  • notifications - specifies the notification-* index retention time.

The allowed values include integers (days) and numbers with suffixes: y, m, w, d, h, including capital letters.

logging:
  retentionTime:
    logstash: 3
    events: "2w"
    notifications: "1M"

Log verbosity

Key

Description

Example values

stacklightLogLevels.default (string)

Defines the log verbosity level for all StackLight components unless a level is defined for a component using stacklightLogLevels.component. To use the component default log verbosity level, leave the string empty.

  • trace - most verbose log messages, generates large amounts of data

  • debug - messages typically of use only for debugging purposes

  • info - informational messages describing common processes such as service starting or stopping; can be ignored during normal system operation but may provide additional input for investigation

  • warn - messages about conditions that may require attention

  • error - messages on error conditions that prevent normal system operation and require action

  • crit - messages on critical conditions indicating that a service is not working, working incorrectly or is unusable, requiring immediate attention

stacklightLogLevels.component (map)

Defines the log verbosity level for an individual StackLight component, overriding the stacklightLogLevels.default value. To use the component default log verbosity, leave the string empty.

component:
  kubeStateMetrics: ""
  prometheusAlertManager: ""
  prometheusBlackboxExporter: ""
  prometheusNodeExporter: ""
  prometheusServer: ""
  alerta: ""
  alertmanagerWebhookServicenow: ""
  elasticsearchCurator: ""
  postgresql: ""
  prometheusEsExporter: ""
  sfNotifier: ""
  sfReporter: ""
  fluentd: ""
  # fluentdElasticsearch: ""
  fluentdLogs: ""
  telemeterClient: ""
  telemeterServer: ""
  tfControllerExporter: ""
  tfVrouterExporter: ""
  telegrafDs: ""
  telegrafS: ""
  # elasticsearch: ""
  opensearch: ""
  # kibana: ""
  grafana: ""
  opensearchDashboards: ""
  metricbeat: ""
  prometheusMsTeams: ""
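
The two keys above can be used together. The following sketch assumes that default and component nest under a top-level stacklightLogLevels key, as the dotted key names suggest. The chosen levels are illustrative only:

```yaml
stacklightLogLevels:
  default: warn          # applies to components that have no level of their own
  component:
    fluentdLogs: debug   # more verbose output for a single component
    opensearch: ""       # empty string: use the component default log verbosity
```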

Logging to external outputs

Available since 2.23.0 and 2.23.1 for MOSK 23.1

Key

Description

Example values

logging.externalOutputs (map)

Specifies external Elasticsearch, OpenSearch, and syslog destinations as fluentd-logs outputs. Requires logging.enabled: true. For configuration procedure, see Enable log forwarding to external destinations.

logging:
  externalOutputs:
    elasticsearch:
      # disabled: false
      type: elasticsearch
      level: info
      plugin_log_level: info
      tag_exclude: '{fluentd-logs,systemd}'
      host: elasticsearch-host
      port: 9200
      logstash_date_format: '%Y.%m.%d'
      logstash_format: true
      logstash_prefix: logstash
      ...
      buffer:
        # disabled: false
        chunk_limit_size: 16m
        flush_interval: 15s
        flush_mode: interval
        overflow_action: block
        ...
    opensearch:
      disabled: true
      type: opensearch
      ...

Secrets for external log outputs

Available since 2.23.0 and 2.23.1 for MOSK 23.1

Key

Description

Example values

logging.externalOutputSecretMounts (map)

Specifies authentication secret mounts for external log destinations. Requires logging.externalOutputs to be enabled and a Kubernetes secret to be created under the stacklight namespace. Contains the following values:

  • secretName

    Mandatory. Kubernetes secret name.

  • mountPath

    Mandatory. Mount path of the Kubernetes secret defined in secretName.

  • defaultMode

    Optional. Decimal number defining secret permissions, 420 by default.

Secret mount configuration:

logging:
  externalOutputSecretMounts:
  - secretName: elasticsearch-certs
    mountPath: /tmp/elasticsearch-certs
    defaultMode: 420
  - secretName: opensearch-certs
    mountPath: /tmp/opensearch-certs

Elasticsearch configuration for the above secret mount:

logging:
  externalOutputs:
    elasticsearch:
      ...
      ca_file: /tmp/elasticsearch-certs/ca.pem
      client_cert: /tmp/elasticsearch-certs/client.pem
      client_key: /tmp/elasticsearch-certs/client.key
      client_key_pass: password

Logging to syslog

Deprecated since 2.23.0

Note

Since Container Cloud 2.23.0, logging.syslog is deprecated in favor of logging.externalOutputs. For details, see Logging to external outputs.

Key

Description

Example values

logging.syslog.enabled (bool)

Enables or disables remote logging to syslog. Disabled by default. Requires logging.enabled set to true. For details and configuration example, see Enable remote logging to syslog.

true or false

logging.syslog.host (string)

Specifies the remote syslog host.

remote-syslog.svc

logging.syslog.port (string)

Specifies the remote syslog port.

514

logging.syslog.packetSize (string)

Defines the packet size in bytes for the syslog logging output. Set to 1024 by default. May be useful for syslog setups allowing packet size larger than 1 kB. Mirantis recommends that you tune this parameter to allow sending full log lines.

1024

logging.syslog.protocol (string)

Specifies the remote syslog protocol. Set to udp by default.

tcp or udp

logging.syslog.tls.enabled (bool)

Optional. Disabled by default. Enables or disables TLS. Use TLS only for the TCP protocol. TLS will not be enabled if you set a protocol other than TCP.

true or false

logging.syslog.tls.verify_mode (int)

Optional. Configures TLS verification.

  • 0 for OpenSSL::SSL::VERIFY_NONE

  • 1 for OpenSSL::SSL::VERIFY_PEER

  • 2 for OpenSSL::SSL::VERIFY_FAIL_IF_NO_PEER_CERT

  • 4 for OpenSSL::SSL::VERIFY_CLIENT_ONCE

logging.syslog.tls.certificate (string)

Defines how to pass the certificate. secret takes precedence over hostPath.

  • secret - specifies the name of the secret holding the certificate.

  • hostPath - specifies an absolute host path to the PEM certificate.

certificate:
  secret: ""
  hostPath: "/etc/ssl/certs/ca-bundle.pem"
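
Taken together, the logging.syslog keys above could be combined as in the following sketch of remote logging over TCP with TLS. The nesting is assumed from the dotted key names, port and packetSize are quoted because the table lists them as strings, and the host and certificate path are the example values from this table rather than recommendations:

```yaml
logging:
  enabled: true                # remote syslog requires the logging stack
  syslog:
    enabled: true
    host: remote-syslog.svc
    port: "514"
    packetSize: "1024"         # default, increase to send full log lines if the receiver allows it
    protocol: tcp              # TLS is applied only with TCP
    tls:
      enabled: true
      verify_mode: 1           # OpenSSL::SSL::VERIFY_PEER
      certificate:
        secret: ""             # if set, takes precedence over hostPath
        hostPath: "/etc/ssl/certs/ca-bundle.pem"
```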

tag_exclude (string) Since 2.23.0

Optional. Overrides tag_include. Specifies, by tag, the logs to exclude from the destination output. For example, to exclude all logs with the test tag, set tag_exclude: '/.*test.*/'.

How to obtain tags for logs

Select from the following options:

  • In the main OpenSearch output, use the logger field that equals the tag.

  • Use logs of a particular Pod or container by following the below order, with the first match winning:

    1. The value of the app Pod label. For example, for app=opensearch-master, use opensearch-master as the log tag.

    2. The value of the k8s-app Pod label.

    3. The value of the app.kubernetes.io/name Pod label.

    4. If a release_group Pod label exists and the component Pod label starts with app, use the value of the component label as the tag. Otherwise, the tag is the application label joined to the component label with a -.

    5. The name of the container from which the log is taken.

The values for tag_exclude and tag_include are placed into <match> directives of Fluentd and only accept regex types that are supported by the <match> directive of Fluentd. For details, refer to the Fluentd official documentation.

'{fluentd-logs,systemd}'

tag_include (string) Since 2.23.0

Optional. Is overridden by tag_exclude. Specifies, by tag, the logs to include in the destination output. For example, to include all logs with the auth tag, set tag_include: '/.*auth.*/'.

'/.*auth.*/'

OpenSearch extra settings

Key

Description

Example values

logging.extraConfig (map)

Additional configuration for opensearch.yml.

logging:
  extraConfig:
    cluster.max_shards_per_node: 5000

OpenSearch Dashboards extra settings

Key

Description

Example values

logging.dashboardsExtraConfig (map)

Additional configuration for opensearch_dashboards.yml.

logging:
  dashboardsExtraConfig:
    opensearch.requestTimeout: 60000

High availability

Key

Description

Example values

highAvailabilityEnabled (bool) Mandatory

Enables or disables StackLight multiserver mode. For details, see StackLight database modes in Deployment architecture. On managed clusters, set to false by default. On management and regional clusters, true is mandatory.

true or false

Prometheus

Key

Description

Example values

prometheusServer.retentionTime (string)

Defines the Prometheus database retention period. Passed to the --storage.tsdb.retention.time flag. Set to 15d by default.

15d, 1000h, 10d12h

prometheusServer.retentionSize (string)

Defines the Prometheus database retention size. Passed to the --storage.tsdb.retention.size flag. Set to 15GB by default.

15GB, 512MB

prometheusServer.alertResendDelay (string)

Defines the minimum amount of time for Prometheus to wait before resending an alert to Alertmanager. Passed to the --rules.alert.resend-delay flag. Set to 2m by default.

2m, 90s
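
As a sketch, the three Prometheus keys above might be set together as follows. The nesting under prometheusServer is assumed from the dotted key names, and the values are taken from the example values in this table:

```yaml
prometheusServer:
  retentionTime: "10d12h"     # passed to --storage.tsdb.retention.time
  retentionSize: "512MB"      # passed to --storage.tsdb.retention.size
  alertResendDelay: "90s"     # passed to --rules.alert.resend-delay
```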

prometheusServer.persistentVolumeClaimSize (string) Mandatory

Specifies the Prometheus PVC(s) size. The number of PVCs depends on the StackLight database mode. For HA, three PVCs are created, each of the size specified in this parameter. For non-HA, one PVC of the specified size is created.

Important

You cannot modify this parameter after cluster creation.

prometheusServer:
  persistentVolumeClaimSize: 16Gi
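
The total storage that the PVCs consume depends on the database mode. As a rough illustration that uses the example sizes from this section and assumes an HA cluster:

```yaml
highAvailabilityEnabled: true
prometheusServer:
  persistentVolumeClaimSize: 16Gi    # HA: 3 PVCs x 16Gi = 48Gi in total
elasticsearch:
  persistentVolumeClaimSize: 30Gi    # HA: 3 PVCs x 30Gi = 90Gi in total
```

In a non-HA setup, the same values would result in a single 16Gi PVC and a single 30Gi PVC. Because neither size can be modified after cluster creation, it may be worth estimating these totals beforehand.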

Prometheus remote write

Allows sending of metrics from Prometheus to a custom monitoring endpoint. For details, see Prometheus Documentation: remote_write.

Key

Description

Example values

prometheusServer.remoteWriteSecretMounts (slice)

Skip this parameter if your remote server does not require authorization. Defines additional mounts for remoteWrites secrets. Secret objects with credentials needed to access the remote endpoint must be precreated in the stacklight namespace. For details, see Kubernetes Secrets.

Note

To create more than one file for the same remote write endpoint, for example, to configure TLS connections, use a single secret object with multiple keys in the data field. Using the following example configuration, two files will be created, cert_file and key_file:

...
  data:
    cert_file: aWx1dnRlc3Rz
    key_file: dGVzdHVzZXI=
...
remoteWriteSecretMounts:
- secretName: prom-secret-files
  mountPath: /etc/config/remote_write

prometheusServer.remoteWrites (slice)

Defines the configuration of a custom remote_write endpoint for sending Prometheus samples.

Note

If the remote server uses authorization, first create secret(s) in the stacklight namespace and mount them to Prometheus through prometheusServer.remoteWriteSecretMounts. Then define the created secret in the authorization field.

remoteWrites:
-  url: http://remote_url/push
   authorization:
     credentials_file: /etc/config/remote_write/key_file
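
The relationship between the two remote write keys can be sketched as follows. The fragment reuses the prom-secret-files secret and the paths from the examples above and assumes that both keys nest under prometheusServer, as the dotted key names suggest:

```yaml
prometheusServer:
  remoteWriteSecretMounts:
  - secretName: prom-secret-files          # precreated in the stacklight namespace
    mountPath: /etc/config/remote_write    # each key of the secret becomes a file in this directory
  remoteWrites:
  - url: http://remote_url/push
    authorization:
      # <mountPath>/<key in the secret data field>
      credentials_file: /etc/config/remote_write/key_file
```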

Prometheus Relay

Note

Prometheus Relay is set up as an endpoint in the Prometheus datasource in Grafana. Therefore, all requests from Grafana are sent to Prometheus through Prometheus Relay. If Prometheus Relay reports request timeouts or exceeds the response size limits, you can configure the parameters below. In this case, Prometheus Relay resource limits may also require tuning.

Key

Description

Example values

prometheusRelay.clientTimeout (string)

Specifies the client timeout in seconds.

10

prometheusRelay.responseLimitBytes (string)

Specifies the response size limit in bytes.

1048576
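
For instance, if Prometheus Relay reports request timeouts or exceeds the response size limit, both values could be raised as in the following sketch. The numbers are purely illustrative, and the nesting is assumed from the dotted key names. The values are quoted because the table lists both keys as strings:

```yaml
prometheusRelay:
  clientTimeout: "20"              # seconds
  responseLimitBytes: "2097152"    # 2 MiB
```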

Custom Prometheus recording rules

Key

Description

Example values

prometheusServer.customRecordingRules (slice)

Defines custom Prometheus recording rules. Overriding of existing recording rules is not supported.

customRecordingRules:
- name: ExampleRule.http_requests_total
  rules:
  - expr: sum by(job) (rate(http_requests_total[5m]))
    record: job:http_requests:rate5m
  - expr: avg_over_time(job:http_requests:rate5m[1w])
    record: job:http_requests:rate5m:avg_over_time_1w

Custom Prometheus scrape configurations

Key

Description

Example values

prometheusServer.customScrapeConfigs (map)

Defines custom Prometheus scrape configurations. For details, see Prometheus documentation: scrape_config. The names of default StackLight scrape configurations, which you can view in the Status -> Targets tab of the Prometheus web UI, are reserved for internal usage and any overrides will be discarded. Therefore, provide unique names to avoid overrides.

customScrapeConfigs:
  custom-grafana:
    scrape_interval: 10s
    scrape_timeout: 5s
    kubernetes_sd_configs:
    - role: endpoints
    relabel_configs:
    - source_labels:
      - __meta_kubernetes_service_label_app
      - __meta_kubernetes_endpoint_port_name
      regex: grafana;service
      action: keep
    - source_labels:
      - __meta_kubernetes_pod_name
      target_label: pod

Cluster size

Key

Description

Example values

clusterSize (string)

Specifies the approximate expected cluster size. Set to small by default. Other possible values include medium and large. Depending on the choice, appropriate resource limits are passed according to the resourcesPerClusterSize parameter. The values differ by the OpenSearch and Prometheus resource limits:

  • small (default) - 2 CPU, 6 Gi RAM for OpenSearch, 1 CPU, 8 Gi RAM for Prometheus. Use small only for testing and evaluation purposes with no workloads expected.

  • medium - 4 CPU, 16 Gi RAM for OpenSearch, 3 CPU, 16 Gi RAM for Prometheus.

  • large - 8 CPU, 32 Gi RAM for OpenSearch, 6 CPU, 32 Gi RAM for Prometheus. Set to large only in case of lack of resources for OpenSearch and Prometheus.

small, medium, or large
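
As a brief sketch, selecting the medium size is a single top-level key. The comments repeat the limits listed above:

```yaml
clusterSize: medium
# Resulting limits according to the list above:
#   OpenSearch: 4 CPU, 16 Gi RAM
#   Prometheus: 3 CPU, 16 Gi RAM
```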

Resource limits

Key

Description

Example values

resourcesPerClusterSize (map)

Provides the capability to override the default resource requests or limits for any StackLight component for the predefined cluster sizes.

StackLight components for resource limits customization

Note

The below list has the componentName: <podNamePrefix>/<containerName> format.

alerta: alerta/alerta
alertmanager: prometheus-alertmanager/prometheus-alertmanager
alertmanagerWebhookServicenow: alertmanager-webhook-servicenow/alertmanager-webhook-servicenow
blackboxExporter: prometheus-blackbox-exporter/blackbox-exporter
cerebro: cerebro/cerebro
elasticsearch: opensearch-master/opensearch # Deprecated
elasticsearchCurator: elasticsearch-curator/elasticsearch-curator
elasticsearchExporter: elasticsearch-exporter/elasticsearch-exporter
fluentdElasticsearch: fluentd-logs/fluentd-logs # Deprecated
fluentdLogs: fluentd-logs/fluentd-logs
fluentdNotifications: fluentd-notifications/fluentd # for MOSK
grafana: grafana/grafana
grafanaRenderer: grafana/grafana-renderer
iamProxy: iam-proxy/iam-proxy # Deprecated
iamProxyAlerta: iam-proxy-alerta/iam-proxy
iamProxyAlertmanager: iam-proxy-alertmanager/iam-proxy
iamProxyGrafana: iam-proxy-grafana/iam-proxy
iamProxyKibana: iam-proxy-kibana/iam-proxy # Deprecated
iamProxyOpenSearchDashboards: iam-proxy-kibana/iam-proxy
iamProxyPrometheus: iam-proxy-prometheus/iam-proxy
kibana: opensearch-dashboards/opensearch-dashboards # Deprecated
kubeStateMetrics: prometheus-kube-state-metrics/prometheus-kube-state-metrics
libvirtExporter: prometheus-libvirt-exporter/prometheus-libvirt-exporter # for MOSK
metricCollector: metric-collector/metric-collector
metricbeat: metricbeat/metricbeat
nodeExporter: prometheus-node-exporter/prometheus-node-exporter
opensearch: opensearch-master/opensearch
opensearchDashboards: opensearch-dashboards/opensearch-dashboards
patroniExporter: patroni/patroni-patroni-exporter
pgsqlExporter: patroni/patroni-pgsql-exporter
postgresql: patroni/patroni
prometheusEsExporter: prometheus-es-exporter/prometheus-es-exporter
prometheusMsTeams: prometheus-msteams/prometheus-msteams
prometheusRelay: prometheus-relay/prometheus-relay
prometheusServer: prometheus-server/prometheus-server
refapp: refapp/refapp
refappCleanup: refapp-cleanup/refapp-cleanup
refappInit: db-init/db-init
sfNotifier: sf-notifier/sf-notifier
sfReporter: sf-reporter/sf-reporter
stacklightHelmControllerController: stacklight-helm-controller/controller
telegrafDockerSwarm: telegraf-docker-swarm/telegraf-docker-swarm
telegrafDs: telegraf-ds-smart/telegraf-ds-smart # Deprecated
telegrafDsSmart: telegraf-ds-smart/telegraf-ds-smart
telegrafOpenstack: telegraf-openstack/telegraf-openstack # for MOSK
telegrafS: telegraf-docker-swarm/telegraf-docker-swarm # deprecated, telegraf-openstack/telegraf-openstack # for MOSK, deprecated
telemeterClient: telemeter-client/telemeter-client
telemeterServer: telemeter-server/telemeter-server
telemeterServerAuthServer: telemeter-server/telemeter-server-authorization-server
tfControllerExporter: prometheus-tf-controller-exporter/prometheus-tungstenfabric-exporter # for MOSK
tfVrouterExporter: prometheus-tf-vrouter-exporter/prometheus-tungstenfabric-exporter # for MOSK

resourcesPerClusterSize:
  # elasticsearch:
  opensearch:
    small:
      limits:
        cpu: "1000m"
        memory: "4Gi"
    medium:
      limits:
        cpu: "2000m"
        memory: "8Gi"
      requests:
        cpu: "1000m"
        memory: "4Gi"
    large:
      limits:
        cpu: "4000m"
        memory: "16Gi"

resources (map)

Provides the capability to override the containers resource requests or limits for any StackLight component.

StackLight components for resource limits customization

Note

The below list has the componentName: <podNamePrefix>/<containerName> format.

alerta: alerta/alerta
alertmanager: prometheus-alertmanager/prometheus-alertmanager
alertmanagerWebhookServicenow: alertmanager-webhook-servicenow/alertmanager-webhook-servicenow
blackboxExporter: prometheus-blackbox-exporter/blackbox-exporter
cerebro: cerebro/cerebro
elasticsearch: opensearch-master/opensearch # Deprecated
elasticsearchCurator: elasticsearch-curator/elasticsearch-curator
elasticsearchExporter: elasticsearch-exporter/elasticsearch-exporter
fluentdElasticsearch: fluentd-logs/fluentd-logs # Deprecated
fluentdLogs: fluentd-logs/fluentd-logs
fluentdNotifications: fluentd-notifications/fluentd # for MOSK
grafana: grafana/grafana
grafanaRenderer: grafana/grafana-renderer
iamProxy: iam-proxy/iam-proxy # Deprecated
iamProxyAlerta: iam-proxy-alerta/iam-proxy
iamProxyAlertmanager: iam-proxy-alertmanager/iam-proxy
iamProxyGrafana: iam-proxy-grafana/iam-proxy
iamProxyKibana: iam-proxy-kibana/iam-proxy # Deprecated
iamProxyOpenSearchDashboards: iam-proxy-kibana/iam-proxy
iamProxyPrometheus: iam-proxy-prometheus/iam-proxy
kibana: opensearch-dashboards/opensearch-dashboards # Deprecated
kubeStateMetrics: prometheus-kube-state-metrics/prometheus-kube-state-metrics
libvirtExporter: prometheus-libvirt-exporter/prometheus-libvirt-exporter # for MOSK
metricCollector: metric-collector/metric-collector
metricbeat: metricbeat/metricbeat
nodeExporter: prometheus-node-exporter/prometheus-node-exporter
opensearch: opensearch-master/opensearch
opensearchDashboards: opensearch-dashboards/opensearch-dashboards
patroniExporter: patroni/patroni-patroni-exporter
pgsqlExporter: patroni/patroni-pgsql-exporter
postgresql: patroni/patroni
prometheusEsExporter: prometheus-es-exporter/prometheus-es-exporter
prometheusMsTeams: prometheus-msteams/prometheus-msteams
prometheusRelay: prometheus-relay/prometheus-relay
prometheusServer: prometheus-server/prometheus-server
refapp: refapp/refapp
refappCleanup: refapp-cleanup/refapp-cleanup
refappInit: db-init/db-init
sfNotifier: sf-notifier/sf-notifier
sfReporter: sf-reporter/sf-reporter
stacklightHelmControllerController: stacklight-helm-controller/controller
telegrafDockerSwarm: telegraf-docker-swarm/telegraf-docker-swarm
telegrafDs: telegraf-ds-smart/telegraf-ds-smart # Deprecated
telegrafDsSmart: telegraf-ds-smart/telegraf-ds-smart
telegrafOpenstack: telegraf-openstack/telegraf-openstack # for MOSK
telegrafS: telegraf-docker-swarm/telegraf-docker-swarm # deprecated, telegraf-openstack/telegraf-openstack # for MOSK, deprecated
telemeterClient: telemeter-client/telemeter-client
telemeterServer: telemeter-server/telemeter-server
telemeterServerAuthServer: telemeter-server/telemeter-server-authorization-server
tfControllerExporter: prometheus-tf-controller-exporter/prometheus-tungstenfabric-exporter # for MOSK
tfVrouterExporter: prometheus-tf-vrouter-exporter/prometheus-tungstenfabric-exporter # for MOSK

resources:
  alerta:
    requests:
      cpu: "50m"
      memory: "200Mi"
    limits:
      memory: "500Mi"

With the example above, each pod of the alerta service requests 50 millicores of CPU and 200 MiB of memory and is hard-limited to 500 MiB of memory. Each configuration key is optional.

Note

The performance of the logging mechanism depends on the cluster log load. If the cluster components send an excessive amount of logs, the default resource requests and limits for fluentdLogs (or fluentdElasticsearch) may be insufficient, which may cause its pods to be OOMKilled and trigger the KubePodCrashLooping alert. In such a case, increase the default resource requests and limits for fluentdLogs. For example:

resources:
  # fluentdElasticsearch:
  fluentdLogs:
    requests:
      memory: "500Mi"
    limits:
      memory: "1500Mi"

Byte limit for Telemeter client

For internal StackLight use only

Key

Description

Example values

telemetry.telemeterClient.limitBytes (string)

Specifies the size limit of the incoming data length in bytes for the Telemeter client. Defaults to 1048576.

4194304

Kubernetes tolerations

Key

Description

Example values

tolerations.default (slice)

Kubernetes tolerations to add to all StackLight components.

default:
- key: "com.docker.ucp.manager"
  operator: "Exists"
  effect: "NoSchedule"

tolerations.component (map)

Defines Kubernetes tolerations (overrides the default ones) for any StackLight component.

component:
  # elasticsearch:
  opensearch:
  - key: "com.docker.ucp.manager"
    operator: "Exists"
    effect: "NoSchedule"
  postgresql:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"

Storage class

In an HA StackLight setup, when highAvailabilityEnabled is set to true, all StackLight Persistent Volumes (PVs) use the Local Volume Provisioner (LVP) storage class so as not to rely on dynamic provisioners such as Ceph, which are not available in every Container Cloud deployment. In a non-HA StackLight setup, when no storage class is specified, PVs use the default storage class of the cluster.

Key

Description

Example values

storage.defaultStorageClass (string)

Defines the StorageClass to use for all StackLight Persistent Volume Claims (PVCs) if a component StorageClass is not defined using the componentStorageClasses. To use the default storage class, leave the string empty.

lvp, standard

storage.componentStorageClasses (map)

Defines (overrides the defaultStorageClass value) the storage class for any StackLight component separately. To use the default storage class, leave the string empty.

componentStorageClasses:
  elasticsearch: ""
  opensearch: ""
  fluentd: ""
  postgresql: ""
  prometheusAlertManager: ""
  prometheusServer: ""
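
The two storage keys can be combined. The following sketch assumes that both nest under a top-level storage key, as the dotted key names suggest, and uses the lvp and standard class names from the example values above. Whether these classes exist depends on the cluster:

```yaml
storage:
  defaultStorageClass: lvp        # used by every component without its own setting
  componentStorageClasses:
    prometheusServer: standard    # overrides defaultStorageClass for Prometheus only
    opensearch: ""                # empty string: use the default storage class
```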

NodeSelector

Key

Description

Example values

nodeSelector.default (map)

Defines the NodeSelector to use for most StackLight pods (except some pods that belong to DaemonSets) if the NodeSelector of a component is not defined.

default:
  role: stacklight

nodeSelector.component (map)

Defines the NodeSelector to use for particular StackLight component pods. Overrides nodeSelector.default.

component:
  alerta:
    role: stacklight
    component: alerta
  # kibana:
  #   role: stacklight
  #   component: kibana
  opensearchDashboards:
    role: stacklight
    component: opensearchdashboards

Prometheus Node Exporter

Key

Description

Example values

nodeExporter.netDeviceExclude (string)

Excludes network devices that match the specified regular expression from monitoring. The number of network interface-related metrics is significant and may cause increased Prometheus RAM usage in large clusters. Therefore, Prometheus Node Exporter only collects information on a basic set of interfaces (both host and container) and excludes the following interfaces from monitoring:

  • veth/cali - the host-side part of the container-host Ethernet tunnel

  • o-hm0 - the OpenStack Octavia management interface for communication with the amphora machine

  • tap, qg-, qr-, ha- - the Open vSwitch virtual bridge ports

  • br-(ex|int|tun) - the Open vSwitch virtual bridges

  • docker0, br- - the Docker bridge (master for the veth interfaces)

  • ovs-system - the Open vSwitch interface (mapping interfaces to bridges)

To enable information collecting for the interfaces above, edit the list of blacklisted devices as needed.

nodeExporter:
  netDeviceExclude: "^(veth.+|cali.+|o-hm0|tap.+|qg-.+|qr-.+|ha-.+|br-.+|ovs-system|docker0)$"

nodeExporter.extraCollectorsEnabled (slice)

Enables Node Exporter collectors. For a list of available collectors, see Node Exporter Collectors. The following collectors are enabled by default in StackLight:

  • arp

  • conntrack

  • cpu

  • diskstats

  • entropy

  • filefd

  • filesystem

  • hwmon

  • loadavg

  • meminfo

  • netdev

  • netstat

  • nfs

  • stat

  • sockstat

  • textfile

  • time

  • timex

  • uname

  • vmstat

extraCollectorsEnabled:
  - bcache
  - bonding
  - softnet

Prometheus Blackbox Exporter

Key

Description

Example values

blackboxExporter.customModules (map)

Specifies a set of custom Blackbox Exporter modules. For details, see Blackbox Exporter configuration: module. The http_2xx, http_2xx_verify, http_openstack, http_openstack_insecure, tls, tls_verify names are reserved for internal usage and any overrides will be discarded.

customModules:
  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{}'

blackboxExporter.timeoutOffset (string)

Specifies the offset to subtract from timeout in seconds (--timeout-offset), upper bounded by 5.0 to comply with the built-in StackLight functionality. If nothing is specified, the Blackbox Exporter default value is used. For example, for Blackbox Exporter v0.19.0, the default value is 0.5.

timeoutOffset: "0.1"

Reference Application

Available since 2.21.0 for non-MOSK managed clusters

Note

For the feature support on MOSK deployments, refer to MOSK documentation: Deploy RefApp using automation tools.

Key

Description

Example values

refapp.enabled (bool)

Enables or disables Reference Application, a small microservice application that enables workload monitoring on non-MOSK managed clusters. Disabled by default.

true or false

refapp.workload.persistentVolumeEnabled (bool)

Available since Container Cloud 2.23.0.
Enables or disables persistent volumes for Reference Application. Enabled by default. Disabling is not recommended for production clusters. Once set, the value cannot be changed.

true or false

refapp.workload.storageClassName (string)

Defines StorageClass to use for Reference Application persistent volumes. Empty by default. If empty, uses the default storage class. Once set, the value cannot be changed. Takes effect only if persistent volumes are enabled.

refapp:
  workload:
    storageClassName: kubernetes-ssd

refapp.workload.persistentVolumeSize (string)

Available since Container Cloud 2.23.0.
Defines the size of persistent volumes for the Reference Application. Default is 1Gi. Applies only if persistent volumes are enabled.

refapp:
  workload:
    persistentVolumeSize: 1Gi
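
The Reference Application keys above could be combined as in the following sketch. The kubernetes-ssd StorageClass name is only an assumption taken from the earlier example. Keep in mind that persistentVolumeEnabled and storageClassName cannot be changed once set:

```yaml
refapp:
  enabled: true
  workload:
    persistentVolumeEnabled: true
    # Assumed StorageClass name; leave empty to use the default storage class
    storageClassName: kubernetes-ssd
    persistentVolumeSize: 1Gi
```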

Salesforce reporter

On managed clusters with limited Internet access, a proxy is required for the StackLight components that use HTTP or HTTPS and, although disabled by default, need external access once enabled. The Salesforce reporter depends on Internet access through HTTPS.

Key

Description

Example values

clusterId (string)

Unique cluster identifier clusterId="<Cluster Project>/<Cluster Name>/<UID>", generated for each cluster using Cluster Project, Cluster Name, and cluster UID, separated by a slash. Used for both sf-reporter and sf-notifier services.

The clusterId key is automatically defined for each cluster. Do not set or modify it manually.

Do not modify clusterId.

sfReporter.enabled (bool)

Enables or disables reporting of Prometheus metrics to Salesforce. For details, see Deployment architecture. Disabled by default.

true or false

sfReporter.salesForceAuth (map)

Salesforce parameters and credentials for the metrics reporting integration.

Note

Modify this parameter if sf-notifier is not configured or if you want to use a different Salesforce user account to send reports to.

salesForceAuth:
  url: "<SF instance URL>"
  username: "<SF account email address>"
  password: "<SF password>"
  environment_id: "<Cloud identifier>"
  organization_id: "<Organization identifier>"
  sandbox_enabled: "<Set to true or false>"

sfReporter.cronjob (map)

Defines the Kubernetes cron job for sending metrics to Salesforce. By default, reports are sent at midnight server time.

cronjob:
  schedule: "0 0 * * *"
  concurrencyPolicy: "Allow"
  failedJobsHistoryLimit: ""
  successfulJobsHistoryLimit: ""
  startingDeadlineSeconds: 200
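
As an illustration only, the following sketch enables the reporter and moves the report to a weekly schedule. The schedule and the Forbid policy are arbitrary choices for the example, not recommended values:

```yaml
sfReporter:
  enabled: true
  cronjob:
    # Illustrative: 03:00 every Monday instead of daily at midnight
    schedule: "0 3 * * 1"
    # Standard Kubernetes CronJob policy: skip a new run while the previous one is still active
    concurrencyPolicy: "Forbid"
    startingDeadlineSeconds: 200
```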

Ceph monitoring

Key

Description

Example values

ceph.enabled (bool)

Enables or disables Ceph monitoring on baremetal-based managed clusters. Set to false by default.

true or false

External endpoint monitoring

Key

Description

Example values

externalEndpointMonitoring.enabled (bool)

Enables or disables HTTP endpoints monitoring. If enabled, the monitoring tool performs the probes against the defined endpoints every 15 seconds. Set to false by default.

true or false

externalEndpointMonitoring.certificatesHostPath (string)

Defines the path to the directory on the host that contains the certificates of external endpoints.

/etc/ssl/certs/

externalEndpointMonitoring.domains (slice)

Defines the list of HTTP endpoints to monitor. The endpoints must successfully respond to a liveness probe. For success, a request to a specific endpoint must result in a 2xx HTTP response code.

domains:
- https://prometheus.io/health
- http://example.com:8080/status
- http://example.net:8080/pulse
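
Put together under the full key path, the three parameters above might look as follows. The endpoint URLs are the placeholders from the example, not real targets to monitor:

```yaml
externalEndpointMonitoring:
  enabled: true
  certificatesHostPath: /etc/ssl/certs/
  domains:
  # Each endpoint must return a 2xx HTTP response code
  - https://prometheus.io/health
  - http://example.com:8080/status
```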

Ironic monitoring

Key

Description

Example values

ironic.endpoint (string)

Enables or disables monitoring of bare metal Ironic on baremetal-based clusters. To enable, specify the Ironic API URL.

http://ironic-api-http.kaas.svc:6385/v1

ironic.insecure (bool)

Defines whether to skip the certificate chain and host name verification. Set to false by default.

true or false
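
A minimal sketch of both Ironic keys together, assuming the example URL above matches the Ironic API of the cluster:

```yaml
ironic:
  # Specifying the endpoint enables Ironic monitoring
  endpoint: http://ironic-api-http.kaas.svc:6385/v1
  # Keep certificate verification on (the default)
  insecure: false
```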

SSL certificates monitoring

Key

Description

Example values

sslCertificateMonitoring.enabled (bool)

Enables or disables StackLight monitoring of, and alerting on, the expiration date of the TLS certificates of HTTPS endpoints. If enabled, the monitoring tool probes the defined endpoints every hour. Set to false by default.

true or false

sslCertificateMonitoring.domains (slice)

Defines the list of HTTPS endpoints to monitor the certificates from.

domains:
- https://prometheus.io
- https://example.com:8080
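
With the full key path, and using the same placeholder endpoints, the two parameters could be combined as in this sketch:

```yaml
sslCertificateMonitoring:
  enabled: true
  domains:
  - https://prometheus.io
  - https://example.com:8080
```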

Workload monitoring

Key

Description

Example values

metricFilter (map)

On clusters that run large-scale workloads, workload monitoring generates a large number of resource-consuming metrics. To prevent the generation of excessive metrics, you can disable workload monitoring in the StackLight metrics and monitor only the infrastructure.

The metricFilter parameter enables the cAdvisor (Container Advisor) and kubeStateMetrics metric ingestion filters for Prometheus. The filter is disabled by default (enabled is set to false). If enabled is set to true, you can define the namespaces to which the filter applies. The parameter is designed for managed clusters.

metricFilter:
  enabled: true
  action: keep
  namespaces:
  - kaas
  - kube-system
  - stacklight

  • enabled - enable or disable metricFilter using true or false

  • action - action to take by Prometheus:

    • keep - keep only metrics from namespaces that are defined in the namespaces list

    • drop - ignore metrics from namespaces that are defined in the namespaces list

  • namespaces - list of namespaces to keep or drop metrics from regardless of the boolean value for every namespace
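
For contrast with the keep example, a sketch of the drop action follows. The namespace names workload-a and workload-b are hypothetical:

```yaml
metricFilter:
  enabled: true
  # Ignore metrics from the listed namespaces and ingest everything else
  action: drop
  namespaces:
  - workload-a
  - workload-b
```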

Mirantis Kubernetes Engine monitoring

Key

Description

Example values

mke.enabled (bool)

Enables or disables Mirantis Kubernetes Engine (MKE) monitoring. Set to true by default.

true or false

mke.dockerdDataRoot (string)

Defines the dockerd data root directory of persistent Docker state. For details, see Docker documentation: Daemon CLI (dockerd).

/var/lib/docker

Alerts configuration

Key

Description

Example values

prometheusServer.customAlerts (slice)

Defines custom alerts. Also, modifies or disables existing alert configurations. For the list of predefined alerts, see Available StackLight alerts. While adding or modifying alerts, follow the Alerting rules.

customAlerts:
# To add a new alert:
- alert: ExampleAlert
  annotations:
    description: Alert description
    summary: Alert summary
  expr: example_metric > 0
  for: 5m
  labels:
    severity: warning
# To modify an existing alert expression:
- alert: AlertmanagerFailedReload
  expr: alertmanager_config_last_reload_successful == 5
# To disable an existing alert:
- alert: TargetDown
  enabled: false

To disable an existing alert, set the optional enabled field in the alert body to false. All fields specified in the customAlerts definition override the predefined defaults in the chart values.
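
The following sketch shows the same structure under its full key path. The alert name and the threshold are hypothetical, and node_load15 is assumed to be available through the Node Exporter loadavg collector:

```yaml
prometheusServer:
  customAlerts:
  # Hypothetical alert; the threshold is arbitrary
  - alert: ExampleNodeLoadHigh
    annotations:
      description: The 15-minute load average of a node is above 10.
      summary: High node load
    expr: node_load15 > 10
    for: 10m
    labels:
      severity: warning
```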

Watchdog alert

Key

Description

Example values

prometheusServer.watchDogAlertEnabled (bool)

Enables or disables the Watchdog alert that constantly fires as long as the entire alerting pipeline is functional. You can use this alert to verify that Alertmanager notifications properly flow to the Alertmanager receivers. Set to true by default.

true or false

Alertmanager integrations

On managed clusters with limited Internet access, a proxy is required for the StackLight components that use HTTP or HTTPS and, although disabled by default, need external access once enabled, for example, the Salesforce integration and the external rules for Alertmanager notifications.

Key

Description

Example values

alertmanagerSimpleConfig.genericReceivers (slice)

Provides a generic template for notifications receiver configurations. For a list of supported receivers, see Prometheus Alertmanager documentation: Receiver.

For example, to enable notifications to OpsGenie:

alertmanagerSimpleConfig:
  genericReceivers:
  - name: HTTP-opsgenie
    enabled: true # optional
    opsgenie_configs:
    - api_url: "https://example.app.eu.opsgenie.com/"
      api_key: "secret-key"
      send_resolved: true

alertmanagerSimpleConfig.genericRoutes (slice)

Provides a template for notifications route configuration. For details, see Prometheus Alertmanager documentation: Route.

genericRoutes:
- receiver: HTTP-opsgenie
  enabled: true # optional
  matchers:
  - severity=~"major|critical"
  continue: true
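
A receiver and its route are typically defined together. The sketch below assumes the standard Alertmanager webhook_configs receiver. The receiver name and the URL are placeholders:

```yaml
alertmanagerSimpleConfig:
  genericReceivers:
  # Hypothetical receiver name and placeholder URL
  - name: HTTP-webhook
    enabled: true
    webhook_configs:
    - url: "https://example.com/alert-hook"
      send_resolved: true
  genericRoutes:
  # The receiver value must match the receiver name above
  - receiver: HTTP-webhook
    enabled: true
    matchers:
    - severity="critical"
    continue: true
```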

alertmanagerSimpleConfig.inhibitRules.enabled (bool)

Enables or disables alert inhibition rules. If enabled, Alertmanager decreases alert noise by suppressing notifications for dependent alerts, which provides a clearer view of the cloud status and simplifies troubleshooting. Enabled by default. For details, see Alert dependencies. For details on inhibition rules, see Prometheus documentation.

true or false

Notifications to email

Key

Description

Example values

alertmanagerSimpleConfig.email.enabled (bool)

Enables or disables Alertmanager integration with email. Set to false by default.

true or false

alertmanagerSimpleConfig.email (map)

Defines the notification parameters for Alertmanager integration with email. For details, see Prometheus Alertmanager documentation: Email configuration.

email:
  enabled: false
  send_resolved: true
  to: "to@test.com"
  from: "from@test.com"
  smarthost: smtp.gmail.com:587
  auth_username: "from@test.com"
  auth_password: password
  auth_identity: "from@test.com"
  require_tls: true

alertmanagerSimpleConfig.email.route (map)

Defines the route for Alertmanager integration with email. For details, see Prometheus Alertmanager documentation: Route.

route:
  matchers: []
  routes: []
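
As a sketch, the route could limit email notifications to specific severities. The matcher syntax follows the Alertmanager route examples in this section, and the remaining email parameters are omitted for brevity:

```yaml
alertmanagerSimpleConfig:
  email:
    enabled: true
    # to, from, smarthost, and authentication parameters omitted for brevity
    route:
      matchers:
      - severity=~"major|critical"
      routes: []
```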

Notifications to Salesforce

On managed clusters with limited Internet access, a proxy is required for the StackLight components that use HTTP or HTTPS and, although disabled by default, need external access once enabled. The Salesforce integration depends on Internet access through HTTPS.

Key

Description

Example values

clusterId (string)

Unique cluster identifier clusterId="<Cluster Project>/<Cluster Name>/<UID>", generated for each cluster using Cluster Project, Cluster Name, and cluster UID, separated by a slash. Used for both sf-notifier and sf-reporter services.

The clusterId is automatically defined for each cluster. Do not set or modify it manually.

Do not modify clusterId.

alertmanagerSimpleConfig.salesForce.enabled (bool)

Enables or disables Alertmanager integration with Salesforce using the sf-notifier service. Disabled by default.

true or false

alertmanagerSimpleConfig.salesForce.auth (map)

Defines the Salesforce parameters and credentials for integration with Alertmanager.

auth:
  url: "<SF instance URL>"
  username: "<SF account email address>"
  password: "<SF password>"
  environment_id: "<Cloud identifier>"
  organization_id: "<Organization identifier>"
  sandbox_enabled: "<Set to true or false>"

alertmanagerSimpleConfig.salesForce.route (map)

Defines the notifications route for Alertmanager integration with Salesforce. For details, see Prometheus Alertmanager documentation: Route.

route:
  matchers:
  - severity="critical"
  routes: []

Note

By default, only Critical alerts will be sent to Salesforce.

alertmanagerSimpleConfig.salesForce.feed_enabled (bool)

Enables or disables feed update in Salesforce. To save API calls, this parameter is set to false by default.

true or false

alertmanagerSimpleConfig.salesForce.link_prometheus (bool)

Enables or disables links to the Prometheus web UI in alerts sent to Salesforce. To simplify troubleshooting, set to true by default.

true or false
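
Combined under the full key path, the Salesforce parameters above might look as in the following sketch. All values in angle brackets are placeholders to replace with your own data:

```yaml
alertmanagerSimpleConfig:
  salesForce:
    enabled: true
    auth:
      url: "<SF instance URL>"
      username: "<SF account email address>"
      password: "<SF password>"
      environment_id: "<Cloud identifier>"
      organization_id: "<Organization identifier>"
      sandbox_enabled: "<Set to true or false>"
    route:
      # Default behavior: only critical alerts are sent
      matchers:
      - severity="critical"
      routes: []
    feed_enabled: false
    link_prometheus: true
```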

Notifications to Slack

On managed clusters with limited Internet access, a proxy is required for the StackLight components that use HTTP or HTTPS and, although disabled by default, need external access once enabled. The Slack integration depends on Internet access through HTTPS.

Key

Description

Example values

alertmanagerSimpleConfig.slack.enabled (bool)

Enables or disables Alertmanager integration with Slack. For details, see Prometheus Alertmanager documentation: Slack configuration. Set to false by default.

true or false

alertmanagerSimpleConfig.slack.api_url (string)

Defines the Slack webhook URL.

http://localhost:8888

alertmanagerSimpleConfig.slack.channel (string)

Defines the Slack channel or user to send notifications to.

monitoring

alertmanagerSimpleConfig.slack.route (map)

Defines the notifications route for Alertmanager integration with Slack. For details, see Prometheus Alertmanager documentation: Route.

route:
  matchers: []
  routes: []
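
A sketch of a complete Slack configuration, assuming a placeholder webhook URL and the monitoring channel from the example above. The matcher is illustrative:

```yaml
alertmanagerSimpleConfig:
  slack:
    enabled: true
    api_url: "<Slack webhook URL>"
    channel: monitoring
    route:
      # Illustrative: notify only on major and critical alerts
      matchers:
      - severity=~"major|critical"
      routes: []
```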

Notifications to Microsoft Teams

On managed clusters with limited Internet access, a proxy is required for the StackLight components that use HTTP or HTTPS and, although disabled by default, need external access once enabled. The Microsoft Teams integration depends on Internet access through HTTPS.

Key

Description

Example values

alertmanagerSimpleConfig.msteams.enabled (bool)

Enables or disables Alertmanager integration with Microsoft Teams. Requires a configured Microsoft Teams channel and a channel connector. Set to false by default.

true or false

alertmanagerSimpleConfig.msteams.url (string)

Defines the URL of an Incoming Webhook connector of a Microsoft Teams channel. For details about channel connectors, see Microsoft documentation.

https://example.webhook.office.com/webhookb2/UUID
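
With the full key path, the two Microsoft Teams parameters could be set as in this sketch. The URL is the placeholder from the example above:

```yaml
alertmanagerSimpleConfig:
  msteams:
    enabled: true
    # Placeholder Incoming Webhook URL
    url: https://example.webhook.office.com/webhookb2/UUID
```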

Notifications to ServiceNow

Caution

Prior to configuring the integration with ServiceNow, perform the following prerequisite steps using the ServiceNow documentation of the required version.

  1. In a new or existing Incident table, add the Alert ID field as described in Add fields to a table. To avoid alert duplication, select Unique.

  2. Create an Access Control List (ACL) with read/write permissions for the Incident table as described in Securing table records.

  3. Set up a service account.

Key

Description

Example values

alertmanagerSimpleConfig.serviceNow.enabled (bool)

Enables or disables Alertmanager integration with ServiceNow. Set to false by default. Requires a configured ServiceNow account and compliance with the Incident table requirements above.

true or false

alertmanagerSimpleConfig.serviceNow (map)

Defines the ServiceNow parameters and credentials for integration with Alertmanager:

  • incident_table - name of the table created in ServiceNow. Do not confuse with the table label.

  • api_version - version of the ServiceNow HTTP API. By default, v1.

  • alert_id_field - name of the unique string field configured in ServiceNow to hold Prometheus alert IDs. Do not confuse with the table label.

  • auth.instance - URL of the instance.

  • auth.username - name of the ServiceNow user account with access to the Incident table.

  • auth.password - password of the ServiceNow user account.

serviceNow:
  enabled: true
  incident_table: "incident"
  api_version: "v1"
  alert_id_field: "u_alert_id"
  auth:
    instance: "https://dev00001.service-now.com"
    username: "testuser"
    password: "testpassword"