You can enable exposing metrics that are based on log events. This allows monitoring of various activities, such as disk failures through the hdd_errors_total metric. By default, Fluentd generates metrics from the logs it gathers. However, you must configure Fluentd to expose such metrics to Prometheus, which gathers them as a static Prometheus endpoint. For details, see Add a custom monitoring endpoint. To generate metrics from logs, StackLight LMA uses the fluent-plugin-prometheus plugin.
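Under the hood, the Reclass parameters shown in this section are rendered into native fluent-plugin-prometheus filter sections in the Fluentd configuration. As a rough, hand-written sketch (not the exact rendered output), a counter metric definition in plain Fluentd syntax looks approximately like this:

```
# Count every event tagged metric.out_of_memory as out_of_memory_total
<filter metric.out_of_memory>
  @type prometheus
  <metric>
    name out_of_memory_total
    type counter
    desc The total number of OOM.
    <labels>
      host ${hostname}
    </labels>
  </metric>
</filter>
```

The Reclass model in the procedure below generates equivalent configuration for you, so you normally do not edit Fluentd configuration files directly.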
To configure Fluentd to expose metrics generated from logs:

1. Log in to the Salt Master node.
2. Add the following class to the cluster/<cluster_name>/init.yml file of the Reclass model:

       system.fluentd.label.default_metric.prometheus

   This class creates a new label default_metric that is used as a generic interface to expose new metrics to Prometheus.
3. (Optional) Create a filter for metric.<metric_name> to generate the metric.

   Example:
```yaml
reclass:
  fluentd:
    agent:
      label:
        default_metric:
          filter:
            metric_out_of_memory:
              tag: metric.out_of_memory
              type: prometheus
              metric:
                - name: out_of_memory_total
                  type: counter
                  desc: The total number of OOM.
                  label:
                    - name: host
                      value: ${Hostname}
            metric_hdd_errors_parse:
              tag: metric.hdd_errors
              type: parser
              key_name: Payload
              parser:
                type: regexp
                format: '/(?<device>[sv]d[a-z]+\d*)/'
            metric_hdd_errors:
              tag: metric.hdd_errors
              require:
                - metric_hdd_errors_parse
              type: prometheus
              metric:
                - name: hdd_errors_total
                  type: counter
                  desc: The total number of hdd errors.
                  label:
                    - name: host
                      value: ${Hostname}
                    - name: device
                      value: ${device}
        systemd:
          output:
            push_to_default:
              tag: '*.systemd'
              type: copy
              store:
                - type: relabel
                  label: default_output
                - type: rewrite_tag_filter
                  rule:
                    - name: Payload
                      regexp: '^Out of memory'
                      result: metric.out_of_memory
                    - name: Payload
                      regexp: 'error.+[sv]d[a-z]+\d*'
                      result: metric.hdd_errors
                    - name: Payload
                      regexp: '[sv]d[a-z]+\d*.+error'
                      result: metric.hdd_errors
            push_to_metric:
              tag: 'metric.**'
              type: relabel
              label: default_metric
```
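In this example, the rewrite_tag_filter rules route systemd log lines to metric.* tags based on regular expressions, and the parser filter extracts the failing device name into the device label. The routing logic can be sanity-checked outside Fluentd with a small standalone sketch (the sample log lines below are made up for illustration):

```python
import re

# Fluentd/Ruby named groups use (?<device>...); the Python equivalent
# is (?P<device>...). The patterns mirror the example above.
device_re = re.compile(r'(?P<device>[sv]d[a-z]+\d*)')   # metric_hdd_errors_parse
oom_re = re.compile(r'^Out of memory')                  # -> metric.out_of_memory
hdd_re_1 = re.compile(r'error.+[sv]d[a-z]+\d*')         # -> metric.hdd_errors
hdd_re_2 = re.compile(r'[sv]d[a-z]+\d*.+error')         # -> metric.hdd_errors

def route(payload):
    """Return the metric tag a Payload would be rewritten to, if any."""
    if oom_re.search(payload):
        return 'metric.out_of_memory'
    if hdd_re_1.search(payload) or hdd_re_2.search(payload):
        return 'metric.hdd_errors'
    return None

print(route('Out of memory: Kill process 1234 (java)'))      # metric.out_of_memory
print(route('blk_update_request: I/O error, dev sda1, sector 42'))  # metric.hdd_errors
print(route('sdb2: unrecovered read error'))                 # metric.hdd_errors

# The parser filter then extracts the device label from the payload:
line = 'blk_update_request: I/O error, dev sda1, sector 42'
print(device_re.search(line).group('device'))                # sda1
```

Each payload that matches a rule is re-tagged and counted by the corresponding prometheus filter, which is why the hdd_errors_total counter carries both the host and device labels.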