StackLight configuration parameters

This section describes the StackLight configuration keys that you can specify in the values section to change StackLight settings as required. Prior to making any changes to StackLight configuration, perform the steps described in StackLight configuration procedure. After changing StackLight configuration, verify the changes as described in Verify StackLight after configuration.

Important

Some parameters are marked as mandatory. Failure to specify values for such parameters causes the Admission Controller to reject cluster creation.

OpenStack cluster configuration parameters

This section describes the OpenStack-related StackLight configuration keys. For MOSK cluster configuration keys, see MOSK cluster configuration parameters.

General

  • openstack.enabled (bool)

    Enables OpenStack monitoring. Defaults to true.

  • openstack.namespace (string)

    Defines the namespace within which the OpenStack virtualized control plane is installed. Defaults to openstack.

openstack:
  enabled: true
  namespace: openstack

Gnocchi

openstack.gnocchi.enabled (bool)

Enables Gnocchi monitoring. Defaults to false.

openstack:
  gnocchi:
    enabled: false

Ironic

openstack.ironic.enabled (bool)

Enables Ironic monitoring. Defaults to false.

openstack:
  ironic:
    enabled: false

RabbitMQ

  • openstack.rabbitmq.credentialsConfig (map)

    Defines the RabbitMQ credentials to use if credentials discovery is disabled or some required parameters were not found during the discovery.

  • openstack.rabbitmq.credentialsDiscovery (map)

    Enables the credentials discovery to obtain the username and password from the secret object.

openstack:
  rabbitmq:
    credentialsConfig:
      username: "stacklight"
      password: "stacklight"
      host: "rabbitmq.openstack.svc"
      queue: "notifications"
      vhost: "openstack"
    credentialsDiscovery:
      enabled: true
      namespace: openstack
      secretName: os-rabbitmq-user-credentials
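
When credentials discovery is disabled, StackLight uses the values from credentialsConfig, as described above. The following is a minimal sketch of that case that reuses only the keys from the previous example. All values are placeholders, not recommended settings:

```yaml
openstack:
  rabbitmq:
    # Discovery is disabled, so the explicit credentials below are used.
    credentialsDiscovery:
      enabled: false
    credentialsConfig:
      username: "stacklight"             # placeholder
      password: "<RabbitMQ password>"    # placeholder, replace with the real value
      host: "rabbitmq.openstack.svc"
      queue: "notifications"
      vhost: "openstack"
```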

SSL certificates

  • openstack.externalFQDN (string) Deprecated

    External FQDN used to communicate with OpenStack services for certificates monitoring. For example, https://os.ssl.mirantis.net/. This option is deprecated; use openstack.externalFQDNs.enabled instead.

  • openstack.externalFQDNs.enabled (bool)

    Enables certificates monitoring for the external FQDNs used to communicate with OpenStack services. Defaults to false.

  • openstack.insecure (map)

    Defines whether to verify the trust chain of the OpenStack endpoint SSL certificates during monitoring.

openstack:
  externalFQDNs:
    enabled: false
  insecure:
    internal: true
    external: false

Tungsten Fabric

  • tungstenFabricMonitoring.enabled (bool)

    Enables Tungsten Fabric monitoring. Defaults to true if Tungsten Fabric is deployed.

  • tungstenFabricMonitoring.exportersTimeout (string)

    Defines the timeout of the tungstenfabric-exporter client requests. Defaults to 5s.

  • tungstenFabricMonitoring.analyticsEnabled (bool)

    Enables or disables monitoring of the Tungsten Fabric analytics services. The default value is set automatically based on the real state of the Tungsten Fabric analytics services (enabled or disabled) in the Tungsten Fabric cluster.

tungstenFabricMonitoring:
  enabled: true
  exportersTimeout: "5s"
  analyticsEnabled: true

MOSK cluster configuration parameters

This section describes the MOSK cluster StackLight configuration keys. For OpenStack cluster configuration keys, see OpenStack cluster configuration parameters.

Alert configuration

prometheusServer.customAlerts (slice)

Defines custom alerts. Also, modifies or disables existing alert configurations. For the list of predefined alerts, see StackLight alerts. While adding or modifying alerts, follow the Alerting rules.

prometheusServer:
  customAlerts:
  # To add a new alert:
  - alert: ExampleAlert
    annotations:
      description: Alert description
      summary: Alert summary
    expr: example_metric > 0
    for: 5m
    labels:
      severity: warning
  # To modify an existing alert expression:
  - alert: AlertmanagerFailedReload
    expr: alertmanager_config_last_reload_successful == 5
  # To disable an existing alert:
  - alert: TargetDown
    enabled: false

The alert body accepts an optional enabled field. To disable an existing alert, set this field to false. All fields specified using the customAlerts definition override the default predefined definitions in the charts’ values.
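
Because all fields specified in customAlerts override the predefined definitions, the same mechanism should also apply to fields other than expr. The following sketch assumes this and changes only the for duration of an existing alert. The 10m value is an arbitrary illustration:

```yaml
prometheusServer:
  customAlerts:
  # Keep the predefined expression but change how long the condition
  # must hold before the alert fires (illustrative value).
  - alert: TargetDown
    for: 10m
```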

Alerta

alerta.enabled (bool)

Enables or disables Alerta. Using the Alerta web UI, you can view the most recent or watched alerts, as well as group and filter alerts. Defaults to true.

alerta:
  enabled: true

Alertmanager integrations

On MOSK clusters with limited internet access, a proxy is required for StackLight components that use HTTP and HTTPS, are disabled by default, and need external access once enabled, for example, the Salesforce integration and Alertmanager notifications external rules.

  • alertmanagerSimpleConfig.genericReceivers (slice)

    Provides a generic template for notifications receiver configurations. For a list of supported receivers, see Prometheus Alertmanager documentation: Receiver.

    For example, to enable notifications to OpsGenie:

    alertmanagerSimpleConfig:
      genericReceivers:
      - name: HTTP-opsgenie
        enabled: true # optional
        opsgenie_configs:
        - api_url: "https://example.app.eu.opsgenie.com/"
          api_key: "secret-key"
          send_resolved: true
    
  • alertmanagerSimpleConfig.genericRoutes (slice)

    Provides a template for notifications route configuration. For details, see Prometheus Alertmanager documentation: Route.

    alertmanagerSimpleConfig:
      genericRoutes:
      - receiver: HTTP-opsgenie
        enabled: true # optional
        matchers:
        - severity=~"major|critical"
        continue: true
    
  • alertmanagerSimpleConfig.inhibitRules.enabled (bool)

    Disables or enables alert inhibition rules. If enabled, Alertmanager decreases alert noise by suppressing dependent alerts notifications to provide a clearer view on the cloud status and simplify troubleshooting. Enabled by default. For details, see Alert dependencies. For details on inhibition rules, see Prometheus documentation.

    alertmanagerSimpleConfig:
      inhibitRules:
        enabled: true
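
A generic receiver and a generic route are typically defined together, with the route referring to the receiver by name. The following sketch pairs a webhook receiver with a route for critical alerts. The receiver name and the endpoint URL are hypothetical. The webhook_configs, url, and send_resolved fields come from the upstream Alertmanager receiver configuration:

```yaml
alertmanagerSimpleConfig:
  genericReceivers:
  - name: HTTP-webhook                            # hypothetical receiver name
    enabled: true
    webhook_configs:
    - url: "https://webhook.example.com/alerts"   # hypothetical endpoint
      send_resolved: true
  genericRoutes:
  - receiver: HTTP-webhook                        # must match the receiver name above
    enabled: true
    matchers:
    - severity="critical"
    continue: true
```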
    

Alertmanager: notifications to email

  • alertmanagerSimpleConfig.email.enabled (bool)

    Enables or disables Alertmanager integration with email. Defaults to false.

    alertmanagerSimpleConfig:
      email:
        enabled: false
    
  • alertmanagerSimpleConfig.email (map)

    Defines the notification parameters for Alertmanager integration with email. For details, see Prometheus Alertmanager documentation: Email configuration.

    alertmanagerSimpleConfig:
      email:
        enabled: false
        send_resolved: true
        to: "to@test.com"
        from: "from@test.com"
        smarthost: smtp.gmail.com:587
        auth_username: "from@test.com"
        auth_password: password
        auth_identity: "from@test.com"
        require_tls: true
    
  • alertmanagerSimpleConfig.email.customTemplates (slice)

    Defines custom notification templates for email alerts. You can override subject, html, and text templates while keeping the define and end blocks unchanged. You can customize each template independently.

    The following example is provided for demonstration purposes. For production environments, use the official Prometheus documentation to create relevant templates that fit your deployment.

    Simple example template
    alertmanagerSimpleConfig:
      email:
        customTemplates:
          subject: |
            {{- define "email.notification.subject" -}}
            {{ if (index .Alerts 0).Labels.severity }}{{ (index .Alerts 0).Labels.severity | toUpper }}: {{ end }}{{ (index .Alerts 0).Labels.alertname }}{{ if (index .Alerts 0).Labels.cluster }} [{{ (index .Alerts 0).Labels.cluster }}]{{ end }}
            {{- end -}}
          html: |
            {{- define "email.notification.html" -}}
            {{ range .Alerts }}
            <b>{{ .Labels.alertname }}</b> ({{ .Labels.severity }}) [{{ .Labels.cluster }}]<br>
            {{ .Annotations.description }}<br>
            {{ end }}
            {{- end -}}
          text: |
            {{- define "email.notification.text" -}}
            {{ range .Alerts }}
            Alert: {{ .Labels.alertname }} ({{ .Labels.severity }}) [{{ .Labels.cluster }}]
            {{ .Annotations.description }}
            {{ end }}
            {{- end -}}
    
    Example notification
    Subject: CRITICAL: KernelIOErrorsDetectedFake [west-cluster]
    
    Text:
    KernelIOErrorsDetectedFake (critical) [west-cluster]
    The kaas-node-0f724961-95e8-483f-a466-b2759ff9c5ea node kernel reports IO errors. Investigate kernel logs.
    KernelIOErrorsDetectedFake (critical) [west-cluster]
    The kaas-node-c7b4982c-ae86-4055-92db-dc9a8afd3bed node kernel reports IO errors. Investigate kernel logs.
    
  • alertmanagerSimpleConfig.email.route (map)

    Defines the route for Alertmanager integration with email. For details, see Prometheus Alertmanager documentation: Route.

    alertmanagerSimpleConfig:
      email:
        route:
          matchers: []
          routes: []
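
As an illustration, the following sketch enables the email integration and restricts it to critical alerts. It assumes that the email route accepts the same matcher syntax as the other Alertmanager routes in this section:

```yaml
alertmanagerSimpleConfig:
  email:
    enabled: true
    route:
      # Only alerts with the critical severity label are routed to email.
      matchers:
      - severity="critical"
      routes: []
```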
    

Alertmanager: notifications to Microsoft Teams

On MOSK clusters with limited internet access, a proxy is required for StackLight components that use HTTP and HTTPS, are disabled by default, and need external access once enabled. The Microsoft Teams integration depends on internet access through HTTPS.

  • alertmanagerSimpleConfig.msteams.enabled (bool)

    Enables or disables Alertmanager integration with Microsoft Teams. Requires a configured Microsoft Teams channel and a channel connector. Defaults to false.

    alertmanagerSimpleConfig:
      msteams:
        enabled: false
    
  • alertmanagerSimpleConfig.msteams.url (string)

    Defines the URL of an Incoming Webhook connector of a Microsoft Teams channel. For details about channel connectors, see Microsoft documentation.

    alertmanagerSimpleConfig:
      msteams:
        url: "https://example.webhook.office.com/webhookb2/UUID"
    
  • alertmanagerSimpleConfig.msteams.route (map)

    Defines the notifications route for Alertmanager integration with MS Teams. For details, see Prometheus Alertmanager documentation: Route.

    alertmanagerSimpleConfig:
      msteams:
        route:
          matchers: []
          routes: []
    

Alertmanager: notifications to Salesforce

On MOSK clusters with limited internet access, a proxy is required for StackLight components that use HTTP and HTTPS, are disabled by default, and need external access once enabled. The Salesforce integration depends on internet access through HTTPS.

  • clusterId (string)

    Unique cluster identifier in the clusterId="<Cluster Project>/<Cluster Name>/<UID>" format, generated for each cluster from the Cluster Project, Cluster Name, and cluster UID, separated by slashes. Used for both sf-notifier and sf-reporter services.

    The clusterId is automatically defined for each cluster. Do not set or modify it manually.

  • alertmanagerSimpleConfig.salesForce.enabled (bool)

    Enables or disables Alertmanager integration with Salesforce using the sf-notifier service. Disabled by default.

    alertmanagerSimpleConfig:
      salesForce:
        enabled: false
    
  • alertmanagerSimpleConfig.salesForce.auth (map)

    Defines the Salesforce parameters and credentials for integration with Alertmanager.

    alertmanagerSimpleConfig:
      salesForce:
        auth:
          url: "<SF instance URL>"
          username: "<SF account email address>"
          password: "<SF password>"
          environment_id: "<Cloud identifier>"
          organization_id: "<Organization identifier>"
          sandbox_enabled: "<Set to true or false>"
    
  • alertmanagerSimpleConfig.salesForce.route (map)

    Defines the notifications route for Alertmanager integration with Salesforce. For details, see Prometheus Alertmanager documentation: Route.

    alertmanagerSimpleConfig:
      salesForce:
        route:
          matchers:
          - severity="critical"
          routes: []
    

    Note

    By default, only Critical alerts are sent to Salesforce.

  • alertmanagerSimpleConfig.salesForce.feed_enabled (bool)

    Enables or disables feed update in Salesforce. To save API calls, this defaults to false.

    alertmanagerSimpleConfig:
      salesForce:
        feed_enabled: false
    
  • alertmanagerSimpleConfig.salesForce.link_prometheus (bool)

    Enables or disables links to the Prometheus web UI in alerts sent to Salesforce. To simplify troubleshooting, defaults to true.

    alertmanagerSimpleConfig:
      salesForce:
        link_prometheus: true
    

Alertmanager: notifications to ServiceNow

Caution

Prior to configuring the integration with ServiceNow, perform the following prerequisite steps using the ServiceNow documentation of the required version.

  1. In a new or existing Incident table, add the Alert ID field as described in Add fields to a table. To avoid alerts duplication, select Unique.

  2. Create an Access Control List (ACL) with read/write permissions for the Incident table as described in Securing table records.

  3. Set up a service account.

  • alertmanagerSimpleConfig.serviceNow.enabled (bool)

    Enables or disables Alertmanager integration with ServiceNow. Defaults to false. Requires a configured ServiceNow account and compliance with the Incident table requirements above.

    alertmanagerSimpleConfig:
      serviceNow:
        enabled: false
    
  • alertmanagerSimpleConfig.serviceNow (map)

    Defines the ServiceNow parameters and credentials for integration with Alertmanager:

    • incident_table - name of the table created in ServiceNow. Do not confuse with the table label.

    • api_version - version of the ServiceNow HTTP API. By default, v1.

    • alert_id_field - name of the unique string field configured in ServiceNow to hold Prometheus alert IDs. Do not confuse with the table label.

    • auth.instance - URL of the instance.

    • auth.username - name of the ServiceNow user account with access to Incident table.

    • auth.password - password of the ServiceNow user account.

    alertmanagerSimpleConfig:
      serviceNow:
        enabled: true
        incident_table: "incident"
        api_version: "v1"
        alert_id_field: "u_alert_id"
        auth:
          instance: "https://dev00001.service-now.com"
          username: "testuser"
          password: "testpassword"
    

Alertmanager: notifications to Slack

On MOSK clusters with limited internet access, a proxy is required for StackLight components that use HTTP and HTTPS, are disabled by default, and need external access once enabled. The Slack integration depends on internet access through HTTPS.

  • alertmanagerSimpleConfig.slack.enabled (bool)

    Enables or disables Alertmanager integration with Slack. For details, see Prometheus Alertmanager documentation: Slack configuration. Defaults to false.

    alertmanagerSimpleConfig:
      slack:
        enabled: false
    
  • alertmanagerSimpleConfig.slack.api_url (string)

    Defines the Slack webhook URL.

    alertmanagerSimpleConfig:
      slack:
        api_url: "http://localhost:8888"
    
  • alertmanagerSimpleConfig.slack.channel (string)

    Defines the Slack channel or user to send notifications to.

    alertmanagerSimpleConfig:
      slack:
        channel: "monitoring"
    
  • alertmanagerSimpleConfig.slack.customTemplates (slice)

    Defines custom notification templates for Slack alerts. You can override title and text templates while keeping the define and end blocks unchanged. You can customize each template independently.

    The following example is provided for demonstration purposes. For production environments, use the official Prometheus documentation to create relevant templates that fit your deployment.

    Simple example template
    alertmanagerSimpleConfig:
      slack:
        customTemplates:
          title: |
            {{- define "slack.notification.title" -}}
            {{ if (index .Alerts 0).Labels.severity }}{{ (index .Alerts 0).Labels.severity | toUpper }}: {{ end }}{{ (index .Alerts 0).Labels.alertname }}{{ if (index .Alerts 0).Labels.cluster }} [{{ (index .Alerts 0).Labels.cluster }}]{{ end }}
            {{- end -}}
          text: |
            {{- define "slack.notification.text" -}}
            {{ range .Alerts }}
            *{{ .Labels.alertname }}* ({{ .Labels.severity }}) [{{ .Labels.cluster }}]
            {{ .Annotations.description }}
            {{ end }}
            {{- end -}}
    
    Example notification
    Title: CRITICAL: KernelIOErrorsDetectedFake [west-cluster]
    
    Text:
    KernelIOErrorsDetectedFake (critical) [west-cluster]
    The kaas-node-0f724961-95e8-483f-a466-b2759ff9c5ea node kernel reports IO errors. Investigate kernel logs.
    
    KernelIOErrorsDetectedFake (critical) [west-cluster]
    The kaas-node-c7b4982c-ae86-4055-92db-dc9a8afd3bed node kernel reports IO errors. Investigate kernel logs.
    
  • alertmanagerSimpleConfig.slack.route (map)

    Defines the notifications route for Alertmanager integration with Slack. For details, see Prometheus Alertmanager documentation: Route.

    alertmanagerSimpleConfig:
      slack:
        route:
          matchers: []
          routes: []
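
The Slack parameters above are usually set together. The following sketch combines them into one configuration that sends only major and critical alerts to a single channel. The webhook URL is a placeholder, and the matcher reuses the syntax shown for generic routes:

```yaml
alertmanagerSimpleConfig:
  slack:
    enabled: true
    api_url: "<Slack webhook URL>"   # placeholder
    channel: "monitoring"
    route:
      matchers:
      - severity=~"major|critical"
      routes: []
```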
    

Alertmanager: Watchdog alert

prometheusServer.watchDogAlertEnabled (bool)

Enables or disables the Watchdog alert that constantly fires as long as the entire alerting pipeline is functional. You can use this alert to verify that Alertmanager notifications properly flow to the Alertmanager receivers. Defaults to true.

prometheusServer:
  watchDogAlertEnabled: true

Bond interface monitoring

bondInterfaceMonitoring.enabled (bool)

Enables the bond interface monitoring. Defaults to true.

bondInterfaceMonitoring:
  enabled: true

Byte limit for Telemeter client

For internal StackLight use only

telemetry.telemeterClient.limitBytes (string)

Specifies the size limit of the incoming data length in bytes for the Telemeter client. Defaults to 1048576.

telemetry:
  telemeterClient:
    limitBytes: "1048576"

Cluster size

clusterSize (string)

Specifies the approximate expected cluster size. Possible values: small, medium, large. Defaults to small. Depending on the choice, appropriate resource limits are passed according to the resources or resourcesPerClusterSize parameter.

Caution

The resourcesPerClusterSize parameter is deprecated and will be overridden by the resources parameter. Therefore, use the resources parameter instead.

The values differ by the OpenSearch and Prometheus resource limits:

  • small (default) - 2 CPU, 6 Gi RAM for OpenSearch, 1 CPU, 8 Gi RAM for Prometheus. Use small only for testing and evaluation purposes with no workloads expected.

  • medium - 4 CPU, 16 Gi RAM for OpenSearch, 3 CPU, 16 Gi RAM for Prometheus.

  • large - 8 CPU, 32 Gi RAM for OpenSearch, 6 CPU, 32 Gi RAM for Prometheus. Set to large only in case of lack of resources for OpenSearch and Prometheus.

clusterSize: small

Grafana

grafana.homeDashboard (string)

Defines the home dashboard. Defaults to kubernetes-cluster. You can define any of the available dashboards.

grafana:
  homeDashboard: kubernetes-cluster

High availability

highAvailabilityEnabled (bool) Mandatory

Enables or disables StackLight multiserver mode. For details, see StackLight database modes in Deployment architecture. On MOSK clusters, defaults to false. On management clusters, true is mandatory.

highAvailabilityEnabled: true

Kubernetes network policies

networkPolicies.enabled (bool)

Enables or disables the Kubernetes Network Policy resource that allows controlling network connections to and from Pods deployed in the stacklight namespace. Enabled by default.

For the list of network policy rules, refer to StackLight rules for Kubernetes network policies. Customization of network policies is not supported.

networkPolicies:
  enabled: true

Kubernetes tolerations

  • tolerations.default (slice)

    Kubernetes tolerations to add to all StackLight components.

    tolerations:
      default:
      - key: "com.docker.ucp.manager"
        operator: "Exists"
        effect: "NoSchedule"
    
  • tolerations.component (map)

    Defines Kubernetes tolerations (overrides the default ones) for any StackLight component.

    tolerations:
      component:
        # elasticsearch:
        opensearch:
        - key: "com.docker.ucp.manager"
          operator: "Exists"
          effect: "NoSchedule"
        postgresql:
        - key: "node-role.kubernetes.io/master"
          operator: "Exists"
          effect: "NoSchedule"
    

Log filtering for namespaces

  • logging.namespaceFiltering.logs.enabled (bool)

    Limits the number of namespaces for Pods log collection. Enabled by default with the following list of monitored Kubernetes namespaces:

    Kubernetes namespaces monitored by default

    • ceph (if Ceph is enabled)
    • ceph-lcm-mirantis (if Ceph is enabled)
    • default
    • kaas
    • kube-node-lease
    • kube-public
    • kube-system
    • lcm-system
    • local-path-storage
    • metallb
    • metallb-system
    • node-feature-discovery
    • openstack
    • openstack-ceph-shared (if Ceph is enabled)
    • openstack-lma-shared
    • openstack-provider-system
    • openstack-redis
    • openstack-tf-share (if Tungsten Fabric is enabled)
    • openstack-vault
    • osh-system
    • rook-ceph (if Ceph is enabled)
    • stacklight
    • system
    • tf (if Tungsten Fabric is enabled)

    logging:
      namespaceFiltering:
        logs:
          enabled: true
    
  • logging.namespaceFiltering.logs.extraNamespaces (map)

    Adds extra namespaces to collect Kubernetes Pod logs from. Requires logging.enabled and logging.namespaceFiltering.logs.enabled set to true. Defines a YAML-formatted list of namespaces, which is empty by default.

    logging:
      namespaceFiltering:
        logs:
          enabled: true
          extraNamespaces:
          - custom-ns-1
    
  • logging.namespaceFiltering.events.enabled (bool)

    Limits the number of namespaces for Kubernetes events collection. Disabled by default for two reasons: the sysdig scanner is present on some MOSK clusters, and cluster-scoped objects produce events to the default namespace by default, which is not passed to the StackLight configuration anyhow. Requires logging.enabled set to true.

    logging:
      namespaceFiltering:
        events:
          enabled: false
    
  • logging.namespaceFiltering.events.extraNamespaces (map)

    Adds extra namespaces to collect Kubernetes events from. Requires logging.enabled and logging.namespaceFiltering.events.enabled set to true. Defines a YAML-formatted list of namespaces, which is empty by default.

    logging:
      namespaceFiltering:
        events:
          enabled: true
          extraNamespaces:
          - custom-ns-1
    

Log verbosity

  • stacklightLogLevels.default (string)

    Defines the log verbosity level for all StackLight components if not defined using component. To use the component default log verbosity level, leave the string empty. Possible values:

    • trace - most verbose log messages, generates large amounts of data

    • debug - messages typically of use only for debugging purposes

    • info - informational messages describing common processes such as service starting or stopping; can be ignored during normal system operation but may provide additional input for investigation

    • warn - messages about conditions that may require attention

    • error - messages on error conditions that prevent normal system operation and require action

    • crit - messages on critical conditions indicating that a service is not working, working incorrectly or is unusable, requiring immediate attention

    The NO_SEVERITY severity label is automatically added to a log with no severity label in the message. This enables greater control over determining which logs Fluentd processes and which ones are skipped by mistake.

    stacklightLogLevels:
      default: ""
    
  • stacklightLogLevels.component (map)

    Defines the log verbosity level for an individual StackLight component, overriding the default value. To use the component default log verbosity, leave the string empty.

    stacklightLogLevels:
      component:
        kubeStateMetrics: ""
        prometheusAlertManager: ""
        prometheusBlackboxExporter: ""
        prometheusNodeExporter: ""
        prometheusServer: ""
        alerta: ""
        alertmanagerWebhookServicenow: ""
        elasticsearchCurator: ""
        postgresql: ""
        prometheusEsExporter: ""
        sfNotifier: ""
        sfReporter: ""
        fluentd: ""
        # fluentdElasticsearch: ""
        fluentdLogs: ""
        telemeterClient: ""
        telemeterServer: ""
        tfControllerExporter: ""
        tfVrouterExporter: ""
        telegrafDs: ""
        telegrafS: ""
        # elasticsearch: ""
        opensearch: ""
        # kibana: ""
        grafana: ""
        opensearchDashboards: ""
        metricbeat: ""
        prometheusMsTeams: ""
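
For example, the default and component keys can be combined to keep most components quiet while collecting verbose logs from a few of them. The following sketch uses only the keys and levels listed above. The choice of components is arbitrary:

```yaml
stacklightLogLevels:
  # Applies to every component without an explicit override.
  default: "warn"
  component:
    # More verbose logs for these two components only.
    fluentdLogs: "debug"
    opensearch: "debug"
```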
    

Logging

  • logging.enabled (bool) Mandatory

    Enables or disables the StackLight logging stack. For details about the logging components, see Deployment architecture. Defaults to true. On management clusters, true is mandatory.

    logging:
      enabled: true
    
  • logging.metricQueries (map)

    Allows configuring OpenSearch queries for the data present in OpenSearch. Prometheus Elasticsearch Exporter then queries the OpenSearch database and exposes such metrics in the Prometheus format. For details, see Create log-based metrics. Includes the following parameters:

    • indices - specifies the index pattern

    • interval and timeout - specify in seconds how often to send the query to OpenSearch and how long it can last before timing out

    • onError and onMissing - modify the prometheus-es-exporter behavior on query error and missing index. For details, see Prometheus Elasticsearch Exporter.

    For usage example, see Create log-based metrics.
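
The parameters above can be sketched as follows. This fragment is an assumption-heavy illustration, not a verified configuration: the nesting under a per-query name, the exampleQuery name, the index pattern, and the onError and onMissing values are all assumptions. The query definition itself is omitted because its key is not described in this section:

```yaml
logging:
  metricQueries:
    exampleQuery:             # hypothetical query name (assumed nesting)
      indices: "logstash-*"   # hypothetical index pattern
      interval: 600           # send the query every 600 seconds
      timeout: 60             # let the query run for at most 60 seconds
      onError: preserve       # assumed value, verify in the exporter documentation
      onMissing: zero         # assumed value, verify in the exporter documentation
      # Query definition omitted; see Create log-based metrics.
```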

Logging: Enforce OOPS compression

logging.enforceOopsCompression (bool)

Enforces 32 GB of heap size, unless the defined memory limit allows using 50 GB of heap. Requires logging.enabled set to true. Enabled by default. When disabled, StackLight computes heap as ⅘ of the set memory limit for any resulting heap value. For more details, see Tune OpenSearch performance.

logging:
  enforceOopsCompression: true
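
To illustrate the disabled case, the following sketch works through the ⅘ rule for two memory limits. The limits are arbitrary examples, and the heap values follow only from the arithmetic described above:

```yaml
logging:
  # Enforcement disabled: heap = 4/5 of the memory limit.
  #   memory limit 40 GB -> 4/5 * 40 = 32 GB heap
  #   memory limit 80 GB -> 4/5 * 80 = 64 GB heap
  enforceOopsCompression: false
```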

Logging to external outputs

logging.externalOutputs (map)

Specifies external Elasticsearch, OpenSearch, and syslog destinations as fluentd-logs outputs. Requires logging.enabled: true. For configuration procedure, see Enable log forwarding to external destinations.

logging:
  externalOutputs:
    elasticsearch:
      # disabled: false
      type: elasticsearch
      level: info
      plugin_log_level: info
      tag_exclude: '{fluentd-logs,systemd}'
      host: elasticsearch-host
      port: 9200
      logstash_date_format: '%Y.%m.%d'
      logstash_format: true
      logstash_prefix: logstash
      ...
      buffer:
        # disabled: false
        chunk_limit_size: 16m
        flush_interval: 15s
        flush_mode: interval
        overflow_action: block
        ...
    opensearch:
      disabled: true
      type: opensearch
      ...

Logging to external outputs: secrets

logging.externalOutputSecretMounts (map)

Specifies authentication secret mounts for external log destinations. Requires logging.externalOutputs to be enabled and a Kubernetes secret to be created under the stacklight namespace. Contains the following values:

  • secretName

    Mandatory. Kubernetes secret name.

  • mountPath

    Mandatory. Mount path of the Kubernetes secret defined in secretName.

  • defaultMode

    Optional. Decimal number defining secret permissions. Defaults to 420, which equals octal 0644.

Secret mount configuration:

logging:
  externalOutputSecretMounts:
  - secretName: elasticsearch-certs
    mountPath: /tmp/elasticsearch-certs
    defaultMode: 420
  - secretName: opensearch-certs
    mountPath: /tmp/opensearch-certs

Elasticsearch configuration for the above secret mount:

logging:
  externalOutputs:
    elasticsearch:
      ...
      ca_file: /tmp/elasticsearch-certs/ca.pem
      client_cert: /tmp/elasticsearch-certs/client.pem
      client_key: /tmp/elasticsearch-certs/client.key
      client_key_pass: password
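
The secret referenced in secretName must exist in the stacklight namespace. The following is a sketch of a standard Kubernetes Secret manifest that would match the elasticsearch-certs example above. The key names are inferred from the file paths in that example, and the certificate bodies are placeholders:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: elasticsearch-certs    # must match secretName
  namespace: stacklight
type: Opaque
stringData:
  # Each key becomes a file under mountPath,
  # for example, /tmp/elasticsearch-certs/ca.pem
  ca.pem: |
    <PEM-encoded CA certificate>
  client.pem: |
    <PEM-encoded client certificate>
  client.key: |
    <PEM-encoded client key>
```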

Logging to syslog

Deprecated

Note

The logging.syslog parameter is deprecated in favor of logging.externalOutputs. For details, see Logging to external outputs.

  • logging.syslog.enabled (bool)

    Enables or disables remote logging to syslog. Disabled by default. Requires logging.enabled set to true. For details and configuration example, see Enable remote logging to syslog.

    logging:
      syslog:
        enabled: true
    
  • logging.syslog.host (string)

    Specifies the remote syslog host.

    logging:
      syslog:
        host: remote-syslog.svc
    
  • logging.syslog.port (string)

    Specifies the remote syslog port.

    logging:
      syslog:
        port: "514"
    
  • logging.syslog.packetSize (string)

    Defines the packet size in bytes for the syslog logging output. Defaults to 1024. May be useful for syslog setups allowing packet size larger than 1 kB. Mirantis recommends that you tune this parameter to allow sending full log lines.

    logging:
      syslog:
        packetSize: "1024"
    
  • logging.syslog.protocol (string)

    Specifies the remote syslog protocol. Defaults to udp. Possible values: tcp or udp.

    logging:
      syslog:
        protocol: udp
    
  • logging.syslog.tls.enabled (bool)

    Optional. Disabled by default. Enables or disables TLS. Use TLS only for the TCP protocol. TLS will not be enabled if you set a protocol other than TCP.

    logging:
      syslog:
        tls:
          enabled: true
    
  • logging.syslog.tls.verify_mode (int)

    Optional. Configures TLS verification. Possible values:

    • 0 for OpenSSL::SSL::VERIFY_NONE

    • 1 for OpenSSL::SSL::VERIFY_PEER

    • 2 for OpenSSL::SSL::VERIFY_FAIL_IF_NO_PEER_CERT

    • 4 for OpenSSL::SSL::VERIFY_CLIENT_ONCE

    logging:
      syslog:
        tls:
          verify_mode: 1
    
  • logging.syslog.tls.certificate (string)

    Defines how to pass the certificate. secret takes precedence over hostPath.

    • secret - specifies the name of the secret holding the certificate.

    • hostPath - specifies an absolute host path to the PEM certificate.

    logging:
      syslog:
        tls:
          certificate:
            secret: ""
            hostPath: "/etc/ssl/certs/ca-bundle.pem"
    
  • tag_exclude (string)

    Optional. Overrides tag_include. Sets logs by tags to exclude from the destination output. For example, to exclude all logs with the test tag, set tag_exclude: '/.*test.*/'.

    How to obtain tags for logs

    Select from the following options:

    • In the main OpenSearch output, use the logger field that equals the tag.

    • Use logs of a particular Pod or container by following the below order, with the first match winning:

      1. The value of the app Pod label. For example, for app=opensearch-master, use opensearch-master as the log tag.

      2. The value of the k8s-app Pod label.

      3. The value of the app.kubernetes.io/name Pod label.

      4. If a release_group Pod label exists and the component Pod label starts with app, use the value of the component label as the tag. Otherwise, the tag is the application label joined to the component label with a -.

      5. The name of the container from which the log is taken.

    The values for tag_exclude and tag_include are placed into <match> directives of Fluentd and only accept regex types that are supported by the <match> directive of Fluentd. For details, refer to the Fluentd official documentation.

    logging:
      syslog:
        tag_exclude: '{fluentd-logs,systemd}'
    
  • tag_include (string)

    Optional. Is overridden by tag_exclude. Specifies, by tag, the logs to include in the destination output. For example, to include all logs with the auth tag, set tag_include: '/.*auth.*/'.

    logging:
      syslog:
        tag_include: '/.*auth.*/'
    

Monitoring of Ceph

ceph.enabled (bool)

Enables or disables Ceph monitoring on MOSK clusters. Defaults to false.

ceph:
  enabled: false

Monitoring of external endpoint

  • externalEndpointMonitoring.enabled (bool)

    Enables or disables HTTP endpoints monitoring. If enabled, the monitoring tool performs the probes against the defined endpoints every 15 seconds. Defaults to false.

  • externalEndpointMonitoring.certificatesHostPath (string)

    Defines the path to the host directory that contains the certificates of the external endpoints.

  • externalEndpointMonitoring.domains (slice)

    Defines the list of HTTP endpoints to monitor. The endpoints must successfully respond to a liveness probe. For success, a request to a specific endpoint must result in a 2xx HTTP response code.

externalEndpointMonitoring:
  enabled: false
  certificatesHostPath: /etc/ssl/certs/
  domains:
  - https://prometheus.io/health
  - http://example.com:8080/status
  - http://example.net:8080/pulse

Monitoring of Ironic

  • ironic.endpoint (string)

    Enables or disables monitoring of Ironic. To enable, specify the Ironic API URL.

  • ironic.insecure (bool)

    Defines whether to skip the chain and host verification. Defaults to false.

ironic:
  endpoint: http://ironic-api-http.kaas.svc:6385/v1
  insecure: false

Monitoring of Mirantis Kubernetes Engine

  • mke.enabled (bool)

    Enables or disables Mirantis Kubernetes Engine (MKE) monitoring. Defaults to true.

  • mke.dockerdDataRoot (string)

    Defines the dockerd data root directory of persistent Docker state. For details, see Docker documentation: Daemon CLI (dockerd).

mke:
  enabled: true
  dockerdDataRoot: /var/lib/docker

Monitoring of SSL certificates

  • sslCertificateMonitoring.enabled (bool)

    Enables or disables monitoring of and alerting on the expiration date of the TLS certificate of an HTTPS endpoint. If enabled, the monitoring tool probes the defined endpoints every hour. Defaults to false.

  • sslCertificateMonitoring.domains (slice)

    Defines the list of HTTPS endpoints whose certificates to monitor.

sslCertificateMonitoring:
  enabled: false
  domains:
  - https://prometheus.io
  - https://example.com:8080

Monitoring of workload

metricFilter (map)

On clusters that run large-scale workloads, workload monitoring generates a large number of resource-consuming metrics. To prevent generation of excessive metrics, you can disable workload monitoring in the StackLight metrics and monitor only the infrastructure.

The metricFilter parameter enables the cAdvisor (Container Advisor) and kubeStateMetrics metric ingestion filters for Prometheus. Disabled by default. If enabled, you can define the namespaces to which the filter applies. The parameter is designed for MOSK clusters.

  • enabled - enable or disable metricFilter using true or false

  • action - action to take by Prometheus:

    • keep - keep only metrics from namespaces that are defined in the namespaces list

    • drop - ignore metrics from namespaces that are defined in the namespaces list

  • namespaces - list of namespaces to keep or drop metrics from regardless of the boolean value for every namespace

metricFilter:
  enabled: true
  action: keep
  namespaces:
  - kaas
  - kube-system
  - stacklight
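
The example above keeps only the metrics from infrastructure namespaces. As a complementary sketch, the following configuration uses the drop action to ignore metrics from selected workload namespaces while keeping everything else. The tenant-a and tenant-b namespace names are hypothetical placeholders and do not come from StackLight.

```yaml
metricFilter:
  enabled: true
  # drop: ignore metrics from the listed namespaces and keep all others
  action: drop
  namespaces:
  # Hypothetical workload namespaces; replace with your own
  - tenant-a
  - tenant-b
```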

NodeSelector

  • nodeSelector.default (map)

    Defines the NodeSelector to use for most StackLight pods (except some pods that belong to DaemonSets) if the NodeSelector of a component is not defined.

  • nodeSelector.component (map)

    Defines the NodeSelector to use for particular StackLight component pods. Overrides nodeSelector.default.

nodeSelector:
  default:
    role: stacklight
  component:
    alerta:
      role: stacklight
      component: alerta
    # kibana:
    #   role: stacklight
    #   component: kibana
    opensearchDashboards:
      role: stacklight
      component: opensearchdashboards

OpenSearch

  • elasticsearch.persistentVolumeClaimSize (string) Mandatory

    Specifies the OpenSearch (Elasticsearch) Persistent Volume Claim(s) (PVC) size. The number of PVCs depends on the StackLight database mode. For HA, three PVCs are created, each of the size specified in this parameter. For non-HA, one PVC of the specified size is created.

    In HA mode, Local Volume Provisioner (LVP) acts as the storage provisioner. All PVCs located on the same node share the same storage pool and perceive the total available capacity as that of the entire LVP disk.

    Important

    You cannot modify this parameter after cluster creation.

  • elasticsearch.persistentVolumeUsableStorageSizeGB (integer)

    Optional. Specifies the number of gigabytes that is exclusively available for the OpenSearch data. Elasticsearch Curator uses this value as the available storage per node, multiplies it by the node count to calculate the total usable space of the cluster, and removes older indices when this capacity is reached. However, this setting does not enforce a per-node storage limit; it serves only for retention size calculation.

    Note

    To limit the maximum storage capacity usage per node, consider using watermark parameters described in OpenSearch extra settings.

    This parameter defines the ceiling for storage-based retention, though only a portion of this storage will be available for indices, depending on the total size and cluster configuration.

    If not set (by default), the number of gigabytes from elasticsearch.persistentVolumeClaimSize is used.

    This parameter is useful in the following cases for HA mode using LVP:

    • The real storage behind the volume is shared between multiple consumers. As a result, OpenSearch cannot use all elasticsearch.persistentVolumeClaimSize.

    • The real volume size is bigger than elasticsearch.persistentVolumeClaimSize. As a result, OpenSearch can use more than elasticsearch.persistentVolumeClaimSize.

elasticsearch:
  persistentVolumeClaimSize: 30Gi
  persistentVolumeUsableStorageSizeGB: 160

OpenSearch Dashboards extra settings

logging.dashboardsExtraConfig (map)

Additional configuration for opensearch_dashboards.yml.

logging:
  dashboardsExtraConfig:
    opensearch.requestTimeout: 60000

OpenSearch extra settings

logging.extraConfig (map)

Additional configuration for opensearch.yml that allows setting various OpenSearch parameters, including logging settings, node watermarks, and other cluster-level configurations.

By default, StackLight manages watermarks efficiently: the low, high, and flood thresholds are set to 15%, 10%, and 5% of the node’s usable storage, respectively. The usable storage is defined by persistentVolumeUsableStorageSizeGB. If this parameter is not set, StackLight uses persistentVolumeClaimSize instead. These percentage values are converted to absolute GB values and capped at 150 GB, 100 GB, and 50 GB, respectively.

If logging.extraConfig sets any watermark, StackLight stops managing them. In this case, explicitly set all three watermarks, not only one or two, preferably using absolute GB values. If you decide to use percentage values, calculate them based on the total size of the available volume pool (in the LVP case, the volume capacity can be larger than the defined size). For details on watermark settings, refer to the official OpenSearch documentation.

logging:
  extraConfig:
    cluster.max_shards_per_node: 5000
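
As a minimal sketch of setting all three watermarks explicitly, the following example uses the standard OpenSearch disk watermark settings with absolute values. The values are illustrative only: they assume 160 GB of usable storage per node and mirror the default 15%, 10%, and 5% ratios described above, which gives 24 GB, 16 GB, and 8 GB. With absolute values, OpenSearch treats each threshold as the minimum free space that must remain on the node.

```yaml
logging:
  extraConfig:
    # Set all three watermarks together and in the same unit.
    # Low: 15% of 160 GB = 24 GB of free space
    cluster.routing.allocation.disk.watermark.low: 24gb
    # High: 10% of 160 GB = 16 GB of free space
    cluster.routing.allocation.disk.watermark.high: 16gb
    # Flood stage: 5% of 160 GB = 8 GB of free space
    cluster.routing.allocation.disk.watermark.flood_stage: 8gb
```

Adjust the numbers to the real capacity of the volume pool before applying them.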

Prometheus

  • prometheusServer.alertResendDelay (string)

    Defines the minimum amount of time for Prometheus to wait before resending an alert to Alertmanager. Passed to the --rules.alert.resend-delay flag. For example, 2m or 90s. Defaults to 2m.

  • prometheusServer.alertsCommonLabels (dict)

    Defines the list of labels to inject into firing alerts when they are sent to Alertmanager. Empty by default.

    The following labels are reserved for internal purposes and cannot be overridden: cluster_id, service, severity.

    Caution

    When new labels are injected, Prometheus sends alert updates with a new set of labels, which can potentially cause Alertmanager to have duplicated alerts for a short period of time if the cluster currently has firing alerts.

    Warning

    Before MOSK management 2.31.0 and MOSK 26.1, do not use the environment label. Otherwise, Alerta may reject alert notifications.

  • prometheusServer.persistentVolumeClaimSize (string) Mandatory

    Specifies the Prometheus PVC(s) size. The number of PVCs depends on the StackLight database mode. For HA, three PVCs are created, each of the size specified in this parameter. For non-HA, one PVC of the specified size is created.

    Important

    You cannot modify this parameter after cluster creation.

  • prometheusServer.queryConcurrency (string)

    Defines the maximum number of concurrent queries. Passed to the --query.max-concurrency flag. Defaults to 20.

  • prometheusServer.retentionSize (string)

    Defines the Prometheus database retention size. Passed to the --storage.tsdb.retention.size flag. For example, 15GB or 512MB. Defaults to 15GB.

  • prometheusServer.retentionTime (string)

    Defines the Prometheus database retention period. Passed to the --storage.tsdb.retention.time flag. For example, 15d, 1000h, or 10d12h. Defaults to 15d.

prometheusServer:
  alertResendDelay: 2m
  alertsCommonLabels:
    region: west
    env: prod
  persistentVolumeClaimSize: 16Gi
  queryConcurrency: 20
  retentionSize: 15GB
  retentionTime: 15d

Prometheus Blackbox Exporter

  • blackboxExporter.customModules (map)

    Specifies a set of custom Blackbox Exporter modules. For details, see Blackbox Exporter configuration: module. The http_2xx, http_2xx_verify, http_openstack, http_openstack_insecure, tls, tls_verify names are reserved for internal usage and any overrides will be discarded.

  • blackboxExporter.timeoutOffset (string)

    Specifies the offset to subtract from timeout in seconds (--timeout-offset), upper bounded by 5.0 to comply with the built-in StackLight functionality. If nothing is specified, the Blackbox Exporter default value is used. For example, for Blackbox Exporter v0.19.0, the default value is 0.5.

blackboxExporter:
  customModules:
    http_post_2xx:
      prober: http
      timeout: 5s
      http:
        method: POST
        headers:
          Content-Type: application/json
        body: '{}'
  timeoutOffset: "0.1"

Prometheus custom recording rules

prometheusServer.customRecordingRules (slice)

Defines custom Prometheus recording rules. Overriding of existing recording rules is not supported.

prometheusServer:
  customRecordingRules:
  - name: ExampleRule.http_requests_total
    rules:
    - expr: sum by(job) (rate(http_requests_total[5m]))
      record: job:http_requests:rate5m
    - expr: avg_over_time(job:http_requests:rate5m[1w])
      record: job:http_requests:rate5m:avg_over_time_1w

Prometheus custom scrape configurations

prometheusServer.customScrapeConfigs (map)

Defines custom Prometheus scrape configurations. For details, see Prometheus documentation: scrape_config. The names of default StackLight scrape configurations, which you can view in the Status -> Targets tab of the Prometheus web UI, are reserved for internal usage and any overrides will be discarded. Therefore, provide unique names to avoid overrides.

prometheusServer:
  customScrapeConfigs:
    custom-grafana:
      scrape_interval: 10s
      scrape_timeout: 5s
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels:
        - __meta_kubernetes_service_label_app
        - __meta_kubernetes_endpoint_port_name
        regex: grafana;service
        action: keep
      - source_labels:
        - __meta_kubernetes_pod_name
        target_label: pod

Prometheus experimental features

Available since MOSK 25.2.2 and MOSK management 2.30.2

prometheusServer.enabledFeatures (slice)

Defines the list of experimental features to enable in Prometheus server. For a list of available features, see Prometheus documentation: Feature Flags.

The memory-snapshot-on-shutdown feature is enabled by default in StackLight.

prometheusServer:
  enabledFeatures:
    - memory-snapshot-on-shutdown
    - use-uncached-io

To disable all experimental features:

prometheusServer:
  enabledFeatures: []

Prometheus metrics filtering

  • prometheusServer.metricsFiltering.enabled (bool)

    Configuration for managing Prometheus metrics filtering. When enabled (default), only actively used and explicitly white-listed metrics get scraped by Prometheus.

  • prometheusServer.metricsFiltering.extraMetricsInclude (map)

    List of extra metrics to whitelist, which are dropped by default. Contains the following parameters:

    • <job name> - the name of the scrape job, used as the key under which to list the extra metrics to white-list. For the list of job names, see White list of Prometheus scrape jobs. If a job name is not present in this list, its target metrics are not dropped and are collected by Prometheus by default.

      You can also use group key names to add metrics to more than one job using _group-<key name>. The following list combines jobs by groups:

      List of jobs by groups
      _group-blackbox-metrics
       - blackbox
       - blackbox-external-endpoint
       - kubernetes-master-api
       - mcc-blackbox
       - mke-manager-api
       - openstack-blackbox-ext
       - openstack-dns-probe
       - refapp
      
      _group-controller-runtime-metrics
       - helm-controller
       - kaas-exporter
       - kubelet
       - kubernetes-apiservers
       - mcc-controllers
       - mcc-providers
       - rabbitmq-operator-metrics
      
      _group-etcd-metrics
       - etcd-server
       - ucp-kv
      
      _group-go-collector-metrics
       - cadvisor
       - calico
       - etcd-server
       - helm-controller
       - ironic
       - kaas-exporter
       - kubelet
       - kubernetes-apiservers
       - mcc-cache
       - mcc-controllers
       - mcc-providers
       - mke-metrics-controller
       - mke-metrics-engine
       - openstack-ingress-controller
       - postgresql
       - prometheus-alertmanager
       - prometheus-elasticsearch-exporter
       - prometheus-grafana
       - prometheus-libvirt-exporter
       - prometheus-memcached-exporter
       - prometheus-msteams
       - prometheus-mysql-exporter
       - prometheus-node-exporter
       - prometheus-rabbitmq-exporter # Removed in MOSK 25.2
       - prometheus-relay
       - prometheus-server
       - rabbitmq-operator-metrics
       - telegraf-docker-swarm
       - telemeter-client
       - telemeter-server
       - tf-control
       - tf-redis
       - tf-vrouter
       - ucp-kv
      
      _group-process-collector-metrics
       - alertmanager-webhook-servicenow
       - cadvisor
       - calico
       - etcd-server
       - helm-controller
       - ironic
       - kaas-exporter
       - kubelet
       - kubernetes-apiservers
       - mcc-cache
       - mcc-controllers
       - mcc-providers
       - mke-metrics-controller
       - mke-metrics-engine
       - openstack-ingress-controller
       - patroni
       - postgresql
       - prometheus-alertmanager
       - prometheus-elasticsearch-exporter
       - prometheus-grafana
       - prometheus-libvirt-exporter
       - prometheus-memcached-exporter
       - prometheus-msteams
       - prometheus-mysql-exporter
       - prometheus-node-exporter
       - prometheus-rabbitmq-exporter # Removed in MOSK 25.2
       - prometheus-relay
       - prometheus-server
       - rabbitmq-operator-metrics
       - sf-notifier
       - telegraf-docker-swarm
       - telemeter-client
       - telemeter-server
       - tf-control
       - tf-redis
       - tf-vrouter
       - tf-zookeeper
       - ucp-kv
      
      _group-rest-client-metrics
       - helm-controller
       - kaas-exporter
       - mcc-controllers
       - mcc-providers
      
      _group-service-handler-metrics
       - mcc-controllers
       - mcc-providers
      
      _group-service-http-metrics
       - mcc-cache
       - mcc-controllers
      
      _group-service-reconciler-metrics
       - mcc-controllers
       - mcc-providers
      
    • <list of metrics to collect> - extra metrics of <job name> to be white-listed.

prometheusServer:
  metricsFiltering:
    enabled: true
    extraMetricsInclude:
      cadvisor:
        - container_memory_failcnt
        - container_network_transmit_errors_total
      calico:
        - felix_route_table_per_iface_sync_seconds_sum
        - felix_bpf_dataplane_endpoints
      _group-go-collector-metrics:
        - go_gc_heap_goal_bytes
        - go_gc_heap_objects_objects

Prometheus Node Exporter

  • nodeExporter.netDeviceExclude (string) Deprecated

    Deprecated in favor of extraArgs. Excludes the network devices that match the specified regular expression from monitoring. The number of network interface-related metrics is significant and may cause increased Prometheus RAM usage in large clusters. Therefore, Prometheus Node Exporter only collects information about a basic set of interfaces (both host and container) and excludes the following interfaces from monitoring:

    • veth/cali - the host-side part of the container-host Ethernet tunnel

    • o-hm0 - the OpenStack Octavia management interface for communication with the amphora machine

    • tap, qg-, qr-, ha- - the Open vSwitch virtual bridge ports

    • br-(ex|int|tun) - the Open vSwitch virtual bridges

    • docker0, br- - the Docker bridge (master for the veth interfaces)

    • ovs-system - the Open vSwitch interface (mapping interfaces to bridges)

    • vxlan_sys (since 2.31.2 and 25.2.7) - the shared kernel VXLAN interface used internally by Open vSwitch

    To collect information about any of the interfaces above, remove them from the list of excluded devices as needed.

    • Since 2.31.2 and 25.2.7:

      nodeExporter:
        netDeviceExclude: "^(veth.+|cali.+|o-hm0|tap.+|qg-.+|qr-.+|ha-.+|br-.+|ovs-system|docker0|vxlan_sys)$"
      
    • Before 2.31.2 and 25.2.7:

      nodeExporter:
        netDeviceExclude: "^(veth.+|cali.+|o-hm0|tap.+|qg-.+|qr-.+|ha-.+|br-.+|ovs-system|docker0)$"
      
  • nodeExporter.extraCollectorsEnabled (slice) Deprecated

    Deprecated in favor of extraArgs. Enables Node Exporter collectors. For a list of available collectors, see Node Exporter Collectors. The following collectors are enabled by default in StackLight:

    • arp

    • conntrack

    • cpu

    • cpu.info

    • diskstats

    • entropy

    • filefd

    • filesystem

    • hwmon

    • loadavg

    • meminfo

    • netdev

    • netstat

    • nfs

    • stat

    • sockstat

    • textfile

    • time

    • timex

    • uname

    • vmstat

    nodeExporter:
      extraCollectorsEnabled:
        - bcache
        - bonding
        - softnet
    
  • nodeExporter.extraArgs (map)

    Additional command-line arguments passed to Node Exporter. This field has the highest priority when merging with default and other user-provided arguments, including nodeExporter.netDeviceExclude and nodeExporter.extraCollectorsEnabled (both are deprecated).

    The value should be a map of flags to values. For boolean flags, use an empty string ("") as the value.

    nodeExporter:
      extraArgs:
        collector.filesystem.ignored-mount-points: "^/(dev|proc|sys|var/lib/docker/.+)($|/)"
        collector.netdev.device-exclude: "lo"
        collector.cpu: ""
    
  • nodeExporter.extraVolumes (slice)

    Additional volumes to be mounted into the Node Exporter pod. Do not define volumes using the following names, as they are already defined by default: rootfs, proc, and sys. Duplicating them will break subsequent Helm upgrades.

    nodeExporter:
      extraVolumes:
        - name: custom-mount
          hostPath:
            path: /custom
            type: Directory
    
  • nodeExporter.extraVolumeMounts (slice)

    Additional volume mounts to be added to the Node Exporter container. Do not define mounts using the following mountPaths, as they are already included by default. Adding them here will result in duplicate definitions and cause Helm upgrade failures:

    • /host/rootfs

    • /host/proc

    • /host/sys

    nodeExporter:
      extraVolumeMounts:
        - name: custom-mount
          mountPath: /host/custom
          readOnly: true
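
Because nodeExporter.netDeviceExclude and nodeExporter.extraCollectorsEnabled are deprecated in favor of nodeExporter.extraArgs, the deprecated examples above can be expressed through extraArgs instead. The following is a sketch only. It assumes that the upstream Node Exporter flags --collector.netdev.device-exclude and --collector.<name> act as the equivalents of the deprecated keys, so verify the flag names against the Node Exporter version running in your cluster.

```yaml
nodeExporter:
  extraArgs:
    # Assumed equivalent of the deprecated netDeviceExclude key
    collector.netdev.device-exclude: "^(veth.+|cali.+|o-hm0|tap.+|qg-.+|qr-.+|ha-.+|br-.+|ovs-system|docker0|vxlan_sys)$"
    # Assumed equivalent of the deprecated extraCollectorsEnabled list;
    # boolean flags take an empty string as the value
    collector.bcache: ""
    collector.bonding: ""
    collector.softnet: ""
```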
    

Prometheus Relay

Note

Prometheus Relay is set up as an endpoint in the Prometheus datasource in Grafana. Therefore, all requests from Grafana are sent to Prometheus through Prometheus Relay. If Prometheus Relay reports request timeouts or exceeds the response size limits, you can configure the parameters below. In this case, Prometheus Relay resource limits may also require tuning.

  • prometheusRelay.clientTimeout (string)

    Specifies the client timeout in seconds. If empty, defaults to a value determined by the cluster size: 10 for small, 30 for medium, 60 for large.

  • prometheusRelay.responseLimitBytes (string)

    Specifies the response size limit in bytes. If empty, defaults to a value determined by the cluster size: 6291456 for small, 18874368 for medium, 37748736 for large.

prometheusRelay:
  clientTimeout: 10
  responseLimitBytes: 1048576

Prometheus remote write

Allows sending of metrics from Prometheus to a custom monitoring endpoint. For details, see Prometheus Documentation: remote_write.

  • prometheusServer.remoteWriteSecretMounts (slice)

    Skip this parameter if the remote server does not require authorization. Defines additional mounts for the remoteWrites secrets. Secret objects with the credentials needed to access the remote endpoint must be precreated in the stacklight namespace. For details, see Kubernetes Secrets.

    Note

    To create more than one file for the same remote write endpoint, for example, to configure TLS connections, use a single secret object with multiple keys in the data field. Using the following example configuration, two files will be created, cert_file and key_file:

    ...
      data:
        cert_file: aWx1dnRlc3Rz
        key_file: dGVzdHVzZXI=
    ...
    
  • prometheusServer.remoteWrites (slice)

    Defines the configuration of a custom remote_write endpoint for sending Prometheus samples.

    Note

    If the remote server uses authorization, first create secret(s) in the stacklight namespace and mount them to Prometheus through prometheusServer.remoteWriteSecretMounts. Then define the created secret in the authorization field.

prometheusServer:
  remoteWriteSecretMounts:
  - secretName: prom-secret-files
    mountPath: /etc/config/remote_write
  remoteWrites:
  - url: http://remote_url/push
    authorization:
      credentials_file: /etc/config/remote_write/key_file
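
For the TLS case mentioned in the note above, a single secret with the cert_file and key_file keys can be mounted once and referenced from the tls_config block of the Prometheus remote_write configuration. This is a hedged sketch: it assumes that StackLight passes each remoteWrites entry to Prometheus unchanged, so that the upstream tls_config fields are available. The secret name and URL reuse the placeholders from the example above.

```yaml
prometheusServer:
  remoteWriteSecretMounts:
  # Secret with the cert_file and key_file keys,
  # precreated in the stacklight namespace
  - secretName: prom-secret-files
    mountPath: /etc/config/remote_write
  remoteWrites:
  - url: https://remote_url/push
    tls_config:
      # Each key of the secret becomes a file under mountPath
      cert_file: /etc/config/remote_write/cert_file
      key_file: /etc/config/remote_write/key_file
```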

Resource limits

  • resourcesPerClusterSize (map) Deprecated

    Provides the capability to override the default resource requests or limits for any StackLight component for the predefined cluster sizes.

    Caution

    The resourcesPerClusterSize parameter is deprecated and is overridden by the resources parameter. Therefore, use the resources parameter instead.

    StackLight components for resource limits customization

    Note

    The below list has the componentName: <podNamePrefix>/<containerName> format.

    alerta: alerta/alerta
    alertmanager: prometheus-alertmanager/prometheus-alertmanager
    alertmanagerWebhookServicenow: alertmanager-webhook-servicenow/alertmanager-webhook-servicenow
    blackboxExporter: prometheus-blackbox-exporter/blackbox-exporter
    elasticsearch: opensearch-master/opensearch # Deprecated
    elasticsearchCurator: elasticsearch-curator/elasticsearch-curator
    elasticsearchExporter: elasticsearch-exporter/elasticsearch-exporter
    fluentdElasticsearch: fluentd-logs/fluentd-logs # Deprecated
    fluentdLogs: fluentd-logs/fluentd-logs
    fluentdNotifications: fluentd-notifications/fluentd
    grafana: grafana/grafana
    iamProxy: iam-proxy/iam-proxy # Deprecated
    iamProxyAlerta: iam-proxy-alerta/iam-proxy
    iamProxyAlertmanager: iam-proxy-alertmanager/iam-proxy
    iamProxyGrafana: iam-proxy-grafana/iam-proxy
    iamProxyKibana: iam-proxy-kibana/iam-proxy # Deprecated
    iamProxyOpenSearchDashboards: iam-proxy-kibana/iam-proxy
    iamProxyPrometheus: iam-proxy-prometheus/iam-proxy
    kibana: opensearch-dashboards/opensearch-dashboards # Deprecated
    kubeStateMetrics: prometheus-kube-state-metrics/prometheus-kube-state-metrics
    libvirtExporter: prometheus-libvirt-exporter/prometheus-libvirt-exporter
    metricCollector: metric-collector/metric-collector
    metricbeat: metricbeat/metricbeat
    nodeExporter: prometheus-node-exporter/prometheus-node-exporter
    opensearch: opensearch-master/opensearch
    opensearchDashboards: opensearch-dashboards/opensearch-dashboards
    patroniExporter: patroni/patroni-patroni-exporter
    pgsqlExporter: patroni/patroni-pgsql-exporter
    postgresql: patroni/patroni
    prometheusEsExporter: prometheus-es-exporter/prometheus-es-exporter
    prometheusMsTeams: prometheus-msteams/prometheus-msteams
    prometheusRelay: prometheus-relay/prometheus-relay
    prometheusServer: prometheus-server/prometheus-server
    sfNotifier: sf-notifier/sf-notifier
    sfReporter: sf-reporter/sf-reporter
    stacklightHelmControllerController: stacklight-helm-controller/controller
    telegrafDockerSwarm: telegraf-docker-swarm/telegraf-docker-swarm
    telegrafDs: telegraf-ds-smart/telegraf-ds-smart # Deprecated
    telegrafDsSmart: telegraf-ds-smart/telegraf-ds-smart
    telegrafS: telegraf-docker-swarm/telegraf-docker-swarm # Deprecated
    telemeterClient: telemeter-client/telemeter-client
    telemeterServer: telemeter-server/telemeter-server
    telemeterServerAuthServer: telemeter-server/telemeter-server-authorization-server
    tfControllerExporter: prometheus-tf-controller-exporter/prometheus-tungstenfabric-exporter
    tfVrouterExporter: prometheus-tf-vrouter-exporter/prometheus-tungstenfabric-exporter
    
    resourcesPerClusterSize:
      # elasticsearch:
      opensearch:
        small:
          limits:
            cpu: "1000m"
            memory: "4Gi"
        medium:
          limits:
            cpu: "2000m"
            memory: "8Gi"
          requests:
            cpu: "1000m"
            memory: "4Gi"
        large:
          limits:
            cpu: "4000m"
            memory: "16Gi"
    
  • resources (map)

    Provides the capability to override the container resource requests or limits for any StackLight component.

    StackLight components for resource limits customization

    Note

    The below list has the componentName: <podNamePrefix>/<containerName> format.

    alerta: alerta/alerta
    alertmanager: prometheus-alertmanager/prometheus-alertmanager
    alertmanagerWebhookServicenow: alertmanager-webhook-servicenow/alertmanager-webhook-servicenow
    blackboxExporter: prometheus-blackbox-exporter/blackbox-exporter
    elasticsearch: opensearch-master/opensearch # Deprecated
    elasticsearchCurator: elasticsearch-curator/elasticsearch-curator
    elasticsearchExporter: elasticsearch-exporter/elasticsearch-exporter
    fluentdElasticsearch: fluentd-logs/fluentd-logs # Deprecated
    fluentdLogs: fluentd-logs/fluentd-logs
    fluentdNotifications: fluentd-notifications/fluentd
    grafana: grafana/grafana
    iamProxy: iam-proxy/iam-proxy # Deprecated
    iamProxyAlerta: iam-proxy-alerta/iam-proxy
    iamProxyAlertmanager: iam-proxy-alertmanager/iam-proxy
    iamProxyGrafana: iam-proxy-grafana/iam-proxy
    iamProxyKibana: iam-proxy-kibana/iam-proxy # Deprecated
    iamProxyOpenSearchDashboards: iam-proxy-kibana/iam-proxy
    iamProxyPrometheus: iam-proxy-prometheus/iam-proxy
    kibana: opensearch-dashboards/opensearch-dashboards # Deprecated
    kubeStateMetrics: prometheus-kube-state-metrics/prometheus-kube-state-metrics
    libvirtExporter: prometheus-libvirt-exporter/prometheus-libvirt-exporter
    metricCollector: metric-collector/metric-collector
    metricbeat: metricbeat/metricbeat
    nodeExporter: prometheus-node-exporter/prometheus-node-exporter
    opensearch: opensearch-master/opensearch
    opensearchDashboards: opensearch-dashboards/opensearch-dashboards
    patroniExporter: patroni/patroni-patroni-exporter
    pgsqlExporter: patroni/patroni-pgsql-exporter
    postgresql: patroni/patroni
    prometheusEsExporter: prometheus-es-exporter/prometheus-es-exporter
    prometheusMsTeams: prometheus-msteams/prometheus-msteams
    prometheusRelay: prometheus-relay/prometheus-relay
    prometheusServer: prometheus-server/prometheus-server
    sfNotifier: sf-notifier/sf-notifier
    sfReporter: sf-reporter/sf-reporter
    stacklightHelmControllerController: stacklight-helm-controller/controller
    telegrafDockerSwarm: telegraf-docker-swarm/telegraf-docker-swarm
    telegrafDs: telegraf-ds-smart/telegraf-ds-smart # Deprecated
    telegrafDsSmart: telegraf-ds-smart/telegraf-ds-smart
    telegrafS: telegraf-docker-swarm/telegraf-docker-swarm # Deprecated
    telemeterClient: telemeter-client/telemeter-client
    telemeterServer: telemeter-server/telemeter-server
    telemeterServerAuthServer: telemeter-server/telemeter-server-authorization-server
    tfControllerExporter: prometheus-tf-controller-exporter/prometheus-tungstenfabric-exporter
    tfVrouterExporter: prometheus-tf-vrouter-exporter/prometheus-tungstenfabric-exporter
    
    resources:
      alerta:
        requests:
          cpu: "50m"
          memory: "200Mi"
        limits:
          memory: "500Mi"
    

    Using the example above, each pod in the alerta service requests 50 millicores of CPU and 200 MiB of memory and is hard-limited to 500 MiB of memory usage. Each configuration key is optional.

    Note

    The performance of the logging mechanism depends on the cluster log load. If the cluster components send an excessive amount of logs, the default resource requests and limits for fluentdLogs (or fluentdElasticsearch) may be insufficient, which may cause its pods to be OOMKilled and trigger the KubePodCrashLooping alert. In this case, increase the default resource requests and limits for fluentdLogs. For example:

    resources:
      # fluentdElasticsearch:
      fluentdLogs:
        requests:
          memory: "500Mi"
        limits:
          memory: "1500Mi"
    

Salesforce reporter

On MOSK clusters with limited internet access, a proxy is required for the StackLight components that use HTTP or HTTPS and need external access. Such components are disabled by default. The Salesforce reporter requires internet access through HTTPS.

  • clusterId (string)

    Unique cluster identifier clusterId="<Cluster Project>/<Cluster Name>/<UID>", generated for each cluster using Cluster Project, Cluster Name, and cluster UID, separated by a slash. Used for both sf-reporter and sf-notifier services.

    The clusterId key is automatically defined for each cluster. Do not set or modify it manually.

  • sfReporter.enabled (bool)

    Enables or disables reporting of Prometheus metrics to Salesforce. For details, see Deployment architecture. Disabled by default.

  • sfReporter.salesForceAuth (map)

    Salesforce parameters and credentials for the metrics reporting integration.

    Note

    Modify this parameter if sf-notifier is not configured or if you want to use a different Salesforce user account to send reports to.

  • sfReporter.cronjob (map)

    Defines the Kubernetes cron job for sending metrics to Salesforce. By default, reports are sent at midnight server time.

sfReporter:
  enabled: false
  salesForceAuth:
    url: "<SF instance URL>"
    username: "<SF account email address>"
    password: "<SF password>"
    environment_id: "<Cloud identifier>"
    organization_id: "<Organization identifier>"
    sandbox_enabled: "<Set to true or false>"
  cronjob:
    schedule: "0 0 * * *"
    concurrencyPolicy: "Allow"
    failedJobsHistoryLimit: ""
    successfulJobsHistoryLimit: ""
    startingDeadlineSeconds: 200
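
As an illustration of the cronjob map, the following sketch moves the report to 06:00 server time and prevents overlapping runs. It assumes that keys omitted from the map keep their defaults and that concurrencyPolicy accepts the standard Kubernetes CronJob values (Allow, Forbid, Replace). The chosen schedule is an example only, not a recommendation.

```yaml
sfReporter:
  enabled: true
  cronjob:
    # Standard cron syntax: minute hour day-of-month month day-of-week.
    # "0 6 * * *" runs once a day at 06:00 server time (example value).
    schedule: "0 6 * * *"
    # Forbid skips a new run while the previous one is still active
    # (assumed to map to the Kubernetes CronJob concurrencyPolicy field).
    concurrencyPolicy: "Forbid"
```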

Storage class

In an HA StackLight setup, when highAvailabilityEnabled is set to true, all StackLight Persistent Volumes (PVs) use the Local Volume Provisioner (LVP) storage class so that they do not rely on dynamic provisioners such as Ceph, which are not available in every deployment. In a non-HA StackLight setup, when no storage class is specified, PVs use the default storage class of the cluster.

  • storage.defaultStorageClass (string)

    Defines the StorageClass to use for all StackLight Persistent Volume Claims (PVCs) unless a component-specific StorageClass is defined in componentStorageClasses. For example, lvp or standard. To use the default storage class, leave the string empty.

  • storage.componentStorageClasses (map)

    Defines the storage class for an individual StackLight component, overriding the defaultStorageClass value. To use the default storage class, leave the string empty.

storage:
  defaultStorageClass: ""
  componentStorageClasses:
    elasticsearch: ""
    opensearch: ""
    fluentd: ""
    postgresql: ""
    prometheusAlertManager: ""
    prometheusServer: ""
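
To show how the two keys interact, here is a minimal sketch that sets a default class for all StackLight PVCs and overrides it for a single component. The class names lvp and standard are taken from the examples above and are assumed to exist in the cluster. Components left as empty strings fall back to defaultStorageClass.

```yaml
storage:
  # Applied to every StackLight PVC that has no component override.
  defaultStorageClass: "lvp"
  componentStorageClasses:
    # Only the Prometheus server uses a different class (example choice).
    prometheusServer: "standard"
    # An empty string keeps the defaultStorageClass value for this component.
    postgresql: ""
```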

Telegraf S.M.A.R.T

  • telegrafSmart.enabled (bool)

    Enables the Telegraf S.M.A.R.T input plugin. Enabled by default.

  • telegrafSmart.configParameters (map)

    Configuration block for the S.M.A.R.T input plugin. The list of the supported parameters includes:

    • enable_extensions

    • tag_with_device_type

    • attributes (default: true)

    • interval (default: "30s")

    • excludes

    • devices

    • timeout

    Warning

    When the devices parameter is defined, all specified devices are selected on each node. If a listed device is not present on a given node, Telegraf reports the smart_device_exit_status{device=...} metric with a value of 2, which indicates an unsuccessful exit status.

    Refer to the official Telegraf S.M.A.R.T Input Plugin documentation for details and expected formats.

telegrafSmart:
  enabled: true
  configParameters:
    enable_extensions:
      - "auto-on"
    excludes:
      - "/dev/sdb"
    attributes: true
    tag_with_device_type: false
    interval: "30s"
    timeout: "25s"
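
The devices parameter is not used in the example above. The following sketch shows one way it might look, assuming that it takes a list of device paths in the same form as excludes. The device names are hypothetical. Check the Telegraf S.M.A.R.T Input Plugin documentation for the exact format before use.

```yaml
telegrafSmart:
  enabled: true
  configParameters:
    # Hypothetical device paths. Every listed device is queried on each node,
    # so a node without /dev/sdb reports smart_device_exit_status with value 2.
    devices:
      - "/dev/sda"
      - "/dev/sdb"
    interval: "30s"
    timeout: "25s"
```

Because the same list applies to all nodes, this approach fits best when the nodes have a uniform disk layout.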