Create log-based metrics

StackLight provides a wide variety of metrics for MOSK components. However, you may need to create a custom log-based metric and use it for alert notifications, for example, in the following cases:

If a component producing logs does not expose scraping targets. In this case, component-specific metrics may be missing.
If a scraping target lacks information that can be collected by aggregating the log messages.
If alerting reasons are more explicitly presented in log messages.

For example, suppose you want to receive alert notifications when 10 or more cases are created in Salesforce within an hour. The sf-notifier scraping endpoint does not expose such information. However, sf-notifier logs are stored in OpenSearch, so you can use prometheus-es-exporter to perform the following:

Configure a query using Query DSL (Domain Specific Language) and test it in Dev Tools in OpenSearch Dashboards.
Configure Prometheus Elasticsearch Exporter to expose the result as a Prometheus metric showing the total number of Salesforce cases created daily, for example, salesforce_cases_daily_total_value.
Configure StackLight to send a notification once the value of this metric increases by 10 or more within an hour.

Caution

StackLight logging must be enabled and functional.
Prometheus-es-exporter uses OpenSearch Search API. Therefore, configured queries must be tuned for this specific API and must include:
- The query part to filter documents
- The aggregation part to combine filtered documents into a metric-oriented result
For details, see Supported Aggregations.
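
As a rough illustration of the requirement above, the following Python sketch checks that a request body contains both a query part and an aggregation part before you add it to the StackLight configuration. The helper name `check_metric_query` is hypothetical and is not part of StackLight or prometheus-es-exporter. It inspects only the top-level `query` and `aggs` (or `aggregations`) keys and does not validate the query itself.

```python
import json


def check_metric_query(body: str) -> list:
    """Return a list of problems found in a Search API request body.

    Hypothetical helper for illustration only: it checks the two parts
    that the caution above requires and nothing else.
    """
    request = json.loads(body)  # raises ValueError on malformed JSON
    problems = []
    if "query" not in request:
        problems.append("missing the query part that filters documents")
    if "aggs" not in request and "aggregations" not in request:
        problems.append("missing the aggregation part")
    return problems


if __name__ == "__main__":
    body = (
        '{"query":{"term":{"event.provider":{"value":"sf-notifier"}}},'
        '"aggs":{"daily_total":{"value_count":{"field":"event.provider"}}}}'
    )
    print(check_metric_query(body) or "query and aggregation parts are present")
```

An empty list means that both parts are present. Testing the query in Dev Tools remains the authoritative check.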

The following procedure is based on the salesforce_cases_daily_total_value metric described in the example above.

Create a custom log-based metric¶

Perform steps 1-2 as described in Configure StackLight.
In the manifest that opens, verify that StackLight logging is enabled:
```
logging:
  enabled: true
```

Create a query using Query DSL:

Select one of the following options:
Since Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0)
In the OpenSearch Dashboards web UI, select an index to query. StackLight stores logs in hourly OpenSearch indices.

Note

Optimize the query time by limiting the number of results. For example, we will use the OpenSearch event.provider field set to sf-notifier to limit the number of logs to search.

For example:
GET system/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "event.provider": {
              "value": "sf-notifier"
            }
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now/d"
            }
          }
        }
      ]
    }
  }
}
Before Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0)
In the OpenSearch Dashboards web UI, select an index to query. StackLight stores logs in hourly OpenSearch indices. To select all indices for a day, use the <logstash-{now/d}*> index pattern, which stands for %3Clogstash-%7Bnow%2Fd%7D*%3E when URL-encoded.
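
The URL-encoded form of the index pattern can be reproduced with a few lines of Python. This is a minimal sketch that uses only the standard library; the function name `encode_index_pattern` is illustrative and does not come from StackLight.

```python
from urllib.parse import quote, unquote


def encode_index_pattern(pattern: str) -> str:
    """Percent-encode a date math index pattern for use in a request path."""
    # Keep "*" literal, as in the example above; "/" must be encoded,
    # so it is deliberately left out of the safe characters.
    return quote(pattern, safe="*")


if __name__ == "__main__":
    encoded = encode_index_pattern("<logstash-{now/d}*>")
    print(encoded)
    print(unquote(encoded))
```

Decoding the result with `unquote` returns the original pattern, which is a quick way to check an encoded path copied from logs.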

Note

Optimize the query time by limiting the number of results. For example, we will use the OpenSearch logger field set to sf-notifier to limit the number of logs to search.

For example:
GET /%3Clogstash-%7Bnow%2Fd%7D*%3E/_search
{
  "query": {
    "bool": {
      "must": {
        "term": {
          "logger": {
            "value": "sf-notifier"
          }
        }
      }
    }
  }
}
Test the query in Dev Tools in OpenSearch Dashboards.
Select the log lines that include information about Salesforce case creation. At the info logging level, sf-notifier indicates case creation with log messages similar to the following one:
```
[2021-07-02 12:35:28,596] INFO in client: Created case: OrderedDict([('id', '5007h000007iqmKAAQ'), ('success', True), ('errors', [])]).
```
Such log messages include the Created case phrase. Use it in the query to filter log messages for created cases:
```
"filter": {
  "match_phrase_prefix" : {
    "message" : "Created case"
  }
}
```

Combine the query result into a single value that prometheus-es-exporter will expose as a metric. Use the value_count aggregation:

Since Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0)

GET system/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "event.provider": {
              "value": "sf-notifier"
            }
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now/d"
            }
          }
        },
        {
          "match_phrase_prefix" : {
            "message" : "Created case"
          }
        }
      ]
    }
  },
  "aggs" : {
    "daily_total": {
      "value_count": {
        "field" : "event.provider"
      }
    }
  }
}

Before Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0)

GET /%3Clogstash-%7Bnow%2Fd%7D*%3E/_search
{
  "query": {
    "bool": {
      "must": {
        "term": {
          "logger": {
            "value": "sf-notifier"
          }
        }
      },
      "filter": {
        "match_phrase_prefix" : {
          "message" : "Created case"
        }
      }
    }
  },
  "aggs" : {
    "daily_total": {
      "value_count": {
        "field" : "logger"
      }
    }
  }
}

The aggregation result in Dev Tools should look as follows:

"aggregations" : {
  "daily_total" : {
    "value" : 19
  }
}

Note

The metric name is suffixed with the aggregation name and the result field name: salesforce_cases_daily_total_value.
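
The naming rule from the note can be sketched in Python as follows. This is a simplified illustration, not the exporter's actual implementation: the function `metric_names` is hypothetical, it assumes `salesforce_cases` as the query name, and it handles only single-level aggregations whose result fields are plain numbers.

```python
def metric_names(query_name: str, aggregations: dict) -> dict:
    """Map <query_name>_<aggregation_name>_<field> names to numeric values."""
    metrics = {}
    for agg_name, result in aggregations.items():
        for field, value in result.items():
            # Only plain numeric result fields are considered in this sketch.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                metrics[f"{query_name}_{agg_name}_{field}"] = float(value)
    return metrics


if __name__ == "__main__":
    aggregations = {"daily_total": {"value": 19}}
    for name, value in metric_names("salesforce_cases", aggregations).items():
        print(name, value)
```

Applied to the aggregation result shown above, the sketch yields a single entry named salesforce_cases_daily_total_value.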

Configure Prometheus Elasticsearch Exporter:

In StackLight values of the cluster resource, specify the new metric using the logging.metricQueries parameter and configure the query parameters as described in StackLight configuration parameters: logging.metricQueries.

In the example below, salesforce_cases is the query name. The final metric name can be generalized using the <query_name>_<aggregation_name>_<aggregation_result_field_name> template.

Since Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0)

logging:
  metricQueries:
    salesforce_cases:
      indices: system
      interval: 600
      timeout: 60
      onError: preserve
      onMissing: zero
      body: '{"query":{"bool":{"filter":[{"term":{"event.provider":{"value":"sf-notifier"}}},{"range":{"@timestamp":{"gte":"now/d"}}},{"match_phrase_prefix":{"message":"Created case"}}]}},"aggs":{"daily_total":{"value_count":{"field":"event.provider"}}}}'

Before Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0)

logging:
  metricQueries:
    salesforce_cases:
      indices: '<logstash-{now/d}*>'
      interval: 600
      timeout: 60
      onError: preserve
      onMissing: zero
      body: '{"query":{"bool":{"must":{"term":{"logger":{"value":"sf-notifier"}}},"filter":{"match_phrase_prefix":{"message":"Created case"}}}},"aggs":{"daily_total":{"value_count":{"field":"logger"}}}}'

Note

Convert your query into a one-liner and wrap it in single quotes when adding to body.
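
One way to produce such a one-liner is to parse the tested query and serialize it again without whitespace. The following Python sketch uses only the standard `json` module; the function name `to_body_one_liner` is illustrative. It assumes the value is placed in a single-quoted YAML scalar, where a literal single quote must be doubled.

```python
import json


def to_body_one_liner(query: str) -> str:
    """Compact a multi-line JSON query and wrap it in single quotes."""
    compact = json.dumps(json.loads(query), separators=(",", ":"))
    # Inside a single-quoted YAML scalar, a literal single quote is doubled.
    return "'" + compact.replace("'", "''") + "'"


if __name__ == "__main__":
    query = """
    {
      "query": {"bool": {"filter": [
        {"term": {"event.provider": {"value": "sf-notifier"}}},
        {"range": {"@timestamp": {"gte": "now/d"}}},
        {"match_phrase_prefix": {"message": "Created case"}}
      ]}},
      "aggs": {"daily_total": {"value_count": {"field": "event.provider"}}}
    }
    """
    print("body: " + to_body_one_liner(query))
```

Copy only the JSON part of the Dev Tools request into the function, without the `GET <index>/_search` line. Because the query is parsed first, malformed JSON is reported before it reaches the cluster configuration.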

Verify that the prometheus-es-exporter ConfigMap has been updated:

kubectl describe cm -n stacklight prometheus-es-exporter

Example of system response:

Since Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0)

QueryOnError = preserve
QueryOnMissing = zero
QueryJson = "{\"aggs\":{\"component\":{\"terms\":{\"field\":\"event.provider\"}}},\"query\":{\"match_all\":{}},\"size\":0}"
[query_salesforce_cases]
QueryIntervalSecs = 600
QueryTimeoutSecs = 60
QueryIndices = system
QueryOnError = preserve
QueryOnMissing = zero
QueryJson = "{\"query\":{\"bool\":{\"filter\":[{\"term\":{\"event.provider\":{\"value\":\"sf-notifier\"}}},{\"range\":{\"@timestamp\":{\"gte\":\"now/d\"}}},{\"match_phrase_prefix\":{\"message\":\"Created case\"}}]}},\"aggs\":{\"daily_total\":{\"value_count\":{\"field\":\"event.provider\"}}}}"

Events:  <none>

Before Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0)

QueryOnError = preserve
QueryOnMissing = zero
QueryJson = "{\"aggs\":{\"component\":{\"terms\":{\"field\":\"logger\"}}},\"query\":{\"match_all\":{}},\"size\":0}"
[query_salesforce_cases]
QueryIntervalSecs = 600
QueryTimeoutSecs = 60
QueryIndices = <logstash-{now/d}*>
QueryOnError = preserve
QueryOnMissing = zero
QueryJson = "{\"query\":{\"bool\":{\"must\":{\"term\":{\"logger\":{\"value\":\"sf-notifier\"}}},\"filter\":{\"match_phrase_prefix\":{\"message\":\"Created case\"}}}},\"aggs\":{\"daily_total\":{\"value_count\":{\"field\":\"logger\"}}}}"

Events:  <none>

The ConfigMap update triggers a restart of the prometheus-es-exporter pod.

Verify that the newly configured query has been executed.

kubectl logs -f -n stacklight <prometheus-es-exporter-pod-id>

Example of system response:

[...]
[2021-08-04 12:08:51,989] opensearch.info MainThread POST http://opensearch-master:9200/%3Cnotification-%7Bnow%2Fd%7D%3E/_search [status:200 request:0.040s]
[2021-08-04 12:08:52,089] opensearch.info MainThread POST http://opensearch-master:9200/%3Cnotification-%7Bnow%2Fd%7D%3E/_search [status:200 request:0.100s]
[2021-08-04 12:08:54,469] opensearch.info MainThread POST http://opensearch-master:9200/%3Csystem-%7Bnow%2Fd%7D*%3E/_search [status:200 request:2.278s]

Once done, prometheus-es-exporter will expose the new metric on its scraping endpoint for Prometheus to collect. You can view the new metric in the Prometheus web UI.

Optional. Configure StackLight notifications:

Add a new alert as described in Alert configuration. For example:

prometheusServer:
  customAlerts:
  - alert: SalesforceCasesDailyWarning
    annotations:
      description: The number of cases created today in Salesforce increased by 10 or more within the last hour.
      summary: Too many cases in Salesforce
    expr: increase(salesforce_cases_daily_total_value[1h]) >= 10
    labels:
      severity: warning
      service: custom
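
To reason about when this expression fires, the following Python sketch approximates `increase()` over the samples collected in one window. It is a simplification under stated assumptions: the function names are hypothetical, the extrapolation that Prometheus applies at the window edges is not modeled, and the daily total is assumed to restart from zero at the start of each day, as the now/d filter and the daily index pattern suggest. A drop in value is treated as a reset, which is how Prometheus handles counters.

```python
def simple_increase(samples: list) -> float:
    """Approximate increase() over the samples of one time window.

    A drop in value is treated as a reset to zero, which is how Prometheus
    handles counters. Extrapolation to the window edges is not modeled.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # reset: count everything since zero
    return total


def should_fire(samples: list, threshold: float = 10) -> bool:
    """Mirror the `>= 10` comparison from the alert expression."""
    return simple_increase(samples) >= threshold


if __name__ == "__main__":
    print(should_fire([3, 5, 9, 14]))   # 11 new cases within the window
    print(should_fire([18, 19, 2, 6]))  # daily total restarted at midnight
```

With the 600-second query interval used in this example, a one-hour window holds only a handful of samples, so the real result may differ slightly from this approximation.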

Configure receivers as described in StackLight configuration parameters. For example, to send alert notifications to Slack only:

alertmanagerSimpleConfig:
  slack:
    enabled: true
    api_url: https://hooks.slack.com/services/i45f3k3/w3bh00kU9L/06vi0u5ly
    channel: Slackbot
    route:
      match:
        alertname: SalesforceCasesDailyWarning
  salesForce:
    enabled: true
    route:
      routes:
        - receiver: HTTP-slack
          match:
          - alertname: SalesforceCasesDailyWarning

Parse and extract fields from log messages

For complex monitoring scenarios, you may need to extract specific information from unstructured log messages and use it as metric labels. This is particularly useful when log messages contain important identifiers or context that is not available as structured fields.

Example: Extracting instance UUIDs from nova-compute warning logs¶

Consider a scenario where you want to monitor PCI slot warnings per instance and host. The raw log messages contain instance UUIDs embedded in the text, but these are not available as structured fields. For example:

2025-11-06 14:00:58.443 1 WARNING nova.compute.manager [None req-2647ed33-54be-4cbc-843e-ef2afbe1f393 18559bc87f394d8fbdcbc4aae5ca4565 5809824ec0524f7ea0c0a633eefcc372 - - default default] [instance: c15ed8f2-327f-49c1-8a6e-e6bbbb38f67d] attach interface failed , try to deallocate port 1038bf80-1002-438b-aa64-117124e36e79, reason: Instance c15ed8f2-327f-49c1-8a6e-e6bbbb38f67d has no free PCI slots available: nova.exception.NoPciSlots: Instance c15ed8f2-327f-49c1-8a6e-e6bbbb38f67d has no free PCI slots available

Example query to extract instance UUIDs¶

The advanced query that extracts instance UUIDs from the example log above is included in the body parameter of the prometheus-es-exporter configuration example below.

The query breakdown is as follows:

| Query section | Description |
|---|---|
| Message content filtering | Uses `multi_match` with phrase matching to find log messages containing "has no free PCI slots available". This targets specific PCI slot error messages. |
| Container filtering | Filters to only the `nova-compute` container logs using `match_phrase` on the `container.name` field. |
| Log level filtering | Restricts results to `warning` level logs using `match_phrase` on the `log.level` field. |
| Hostname validation | Ensures that the log has a valid `host.hostname` field using the `exists` filter. |
| Time range filtering | Limits results to the last hour using the `range` filter on `@timestamp`. |
| UUID extraction | Uses a Painless script with the regex pattern `\\[instance: ([0-9a-f-]{36})\\]` to extract instance UUIDs from log messages such as `[instance: c15ed8f2-327f-49c1-8a6e-e6bbbb38f67d]`. |
| Aggregation | Groups results by extracted instance UUID, then by hostname, creating hierarchical metric labels for monitoring per instance and node. |
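
To try the extraction pattern outside OpenSearch, the following Python sketch applies the same regular expression to a log message. Python's `re` module stands in for the Painless regex engine here, so treat this as a way to check the pattern rather than as the script that OpenSearch runs. The function name `extract_instance_uuid` is illustrative.

```python
import re
from typing import Optional

# Same pattern as in the table above, written as a Python raw string.
INSTANCE_RE = re.compile(r"\[instance: ([0-9a-f-]{36})\]")


def extract_instance_uuid(message: Optional[str]) -> Optional[str]:
    """Return the UUID from the first "[instance: <uuid>]" marker, if any."""
    if message is None:
        return None
    match = INSTANCE_RE.search(message)
    return match.group(1) if match else None


if __name__ == "__main__":
    message = (
        "[instance: c15ed8f2-327f-49c1-8a6e-e6bbbb38f67d] attach interface "
        "failed , try to deallocate port 1038bf80-1002-438b-aa64-117124e36e79"
    )
    print(extract_instance_uuid(message))
```

The port UUID in the message is not captured because the pattern requires the surrounding `[instance: ...]` marker.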

Example configuration for prometheus-es-exporter¶

The following configuration example for prometheus-es-exporter creates metrics with labels for each instance UUID and hostname combination. This allows you to monitor per-instance and per-node PCI slot warnings.

Configuration example:

logging:
  metricQueries:
    pci_no_free_slots_by_instance:
      indices: system
      interval: 300
      timeout: 60
      onError: preserve
      onMissing: zero
      body: '{"size":0,"timeout":"30s","query":{"bool":{"filter":[{"multi_match":{"type":"phrase","query":"has no free PCI slots available","lenient":true}},{"match_phrase":{"container.name":"nova-compute"}},{"match_phrase":{"log.level":"warning"}},{"exists":{"field":"host.hostname"}},{"range":{"@timestamp":{"gte":"now-1h"}}}]}},"aggs":{"instance_uuid":{"terms":{"script":{"source":"if (params._source.message != null) { String message = params._source.message; def pattern = /\\[instance: ([0-9a-f-]{36})\\]/; def matcher = pattern.matcher(message); if (matcher.find()) { return matcher.group(1); } } return null;","lang":"painless"},"size":100,"min_doc_count":1},"aggs":{"node":{"terms":{"field":"host.hostname","size":20}}}}}}'

Note

Convert your query into a one-liner and wrap it in single quotes when adding to body.

Example metrics¶

pci_no_free_slots_by_instance_hits 1.0
pci_no_free_slots_by_instance_took_milliseconds 2450.0
pci_no_free_slots_by_instance_instance_uuid_doc_count_error_upper_bound 0.0
pci_no_free_slots_by_instance_instance_uuid_sum_other_doc_count 0.0
pci_no_free_slots_by_instance_instance_uuid_doc_count{instance_uuid="c15ed8f2-327f-49c1-8a6e-e6bbbb38f67d"} 1.0
pci_no_free_slots_by_instance_instance_uuid_node_doc_count_error_upper_bound{instance_uuid="c15ed8f2-327f-49c1-8a6e-e6bbbb38f67d"} 0.0
pci_no_free_slots_by_instance_instance_uuid_node_sum_other_doc_count{instance_uuid="c15ed8f2-327f-49c1-8a6e-e6bbbb38f67d"} 0.0
pci_no_free_slots_by_instance_instance_uuid_node_doc_count{instance_uuid="c15ed8f2-327f-49c1-8a6e-e6bbbb38f67d",node="ag-ps-jvoz2dv7voyt-0-xbrnwdxaxymd-server-ck7ro5cbfmxj"} 1.0
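
To post-process such output, for example to list the affected instance and node pairs, the metric lines can be split into a name, labels, and a value. The following Python sketch is a minimal parser for simple lines like the ones above. It is not a complete implementation of the Prometheus text format: comments, timestamps, and escaped quotes in label values are not handled, and the function name is illustrative.

```python
import re

# Matches simple exposition lines: name, optional {labels}, then a value.
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_metric_line(line: str):
    """Split one metric line into (name, labels, value)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a metric line: {line!r}")
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)


if __name__ == "__main__":
    line = (
        "pci_no_free_slots_by_instance_instance_uuid_node_doc_count"
        '{instance_uuid="c15ed8f2-327f-49c1-8a6e-e6bbbb38f67d",'
        'node="ag-ps-jvoz2dv7voyt-0-xbrnwdxaxymd-server-ck7ro5cbfmxj"} 1.0'
    )
    name, labels, value = parse_metric_line(line)
    print(labels["instance_uuid"], labels["node"], value)
```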
