Enable Fluentd to expose metrics generated from logs

You can configure Fluentd to expose metrics that are derived from log events. This allows you to monitor activities such as disk failures, for example, through the hdd_errors_total metric. By default, Fluentd generates metrics from the logs it gathers. However, you must explicitly configure Fluentd to expose these metrics to Prometheus. Prometheus then collects the Fluentd metrics through a static Prometheus endpoint. For details, see Add a custom monitoring endpoint. To generate metrics from logs, StackLight LMA uses the fluent-plugin-prometheus plugin.

To configure Fluentd to expose metrics generated from logs:

  1. Log in to the Salt Master node.

  2. Add the following class to the cluster/<cluster_name>/init.yml file of the Reclass model:

    system.fluentd.label.default_metric.prometheus
    

    This class creates a new label, default_metric, which serves as a generic interface for exposing new metrics to Prometheus.

  3. (Optional) Create a filter for the metric.<metric_name> tag to generate the metric.

    Example:

    reclass:
    fluentd:
      agent:
        label:
          default_metric:
            filter:
              metric_out_of_memory:
                tag: metric.out_of_memory
                type: prometheus
                metric:
                  - name: out_of_memory_total
                    type: counter
                    desc: The total number of OOM.
                label:
                  - name: host
                    value: ${Hostname}
              metric_hdd_errors_parse:
                tag: metric.hdd_errors
                type: parser
                key_name: Payload
                parser:
                  type: regexp
                  format: '/(?<device>[sv]d[a-z]+\d*)/'
              metric_hdd_errors:
                tag: metric.hdd_errors
                require:
                  - metric_hdd_errors_parse
                type: prometheus
                metric:
                  - name: hdd_errors_total
                    type: counter
                    desc: The total number of hdd errors.
                label:
                  - name: host
                    value: ${Hostname}
                  - name: device
                    value: ${device}
          systemd:
            output:
              push_to_default:
                tag: '*.systemd'
                type: copy
                store:
                  - type: relabel
                    label: default_output
                  - type: rewrite_tag_filter
                    rule:
                      - name: Payload
                        regexp: '^Out of memory'
                        result: metric.out_of_memory
                      - name: Payload
                        regexp: >-
                          'error.+[sv]d[a-z]+\d*'
                        result: metric.hdd_errors
                      - name: Payload
                        regexp: >-
                          '[sv]d[a-z]+\d*.+error'
                        result: metric.hdd_errors
              push_to_metric:
                tag: 'metric.**'
                type: relabel
                label: default_metric
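
As a minimal sketch of step 2, the class can be appended to the list of classes in the cluster-level init.yml file. This sketch assumes the usual Reclass layout, in which a top-level classes key lists the classes to include. The comment that stands in for existing entries is a placeholder, not part of any real model.

```yaml
# cluster/<cluster_name>/init.yml (fragment)
# Assumption: the file already has a top-level "classes" list, as is usual in Reclass.
classes:
  # Keep the classes that your model already lists here.
  - system.fluentd.label.default_metric.prometheus
```

With the class included, records that are rerouted to the default_metric label, as the push_to_metric output does for the metric.** tags in the example above, are processed by the filters defined under that label.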