Deployment architecture

Mirantis Container Cloud deploys the StackLight stack as a release of a Helm chart that contains the helm-controller and helmbundles.lcm.mirantis.com (HelmBundle) custom resources. The StackLight HelmBundle consists of a set of Helm charts with the StackLight components that include:

StackLight components overview

StackLight component

Description

Alerta

Receives, consolidates, and deduplicates the alerts sent by Alertmanager and visually represents them through a simple web UI. Using the Alerta web UI, you can view the most recent or watched alerts and group or filter them.

Alertmanager

Handles the alerts sent by client applications such as Prometheus: deduplicates and groups the alerts, then routes them to receiver integrations. Using the Alertmanager web UI, you can view the most recently fired alerts, silence them, or view the Alertmanager configuration.
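The deduplicate-then-group behavior can be pictured with a minimal sketch. This is an illustration under stated assumptions, not Alertmanager code: the alert dictionaries, the label values, and the function names `deduplicate` and `group_by` are hypothetical.

```python
from collections import defaultdict


def deduplicate(alerts):
    """Keep one alert per unique label set (used here as a fingerprint)."""
    seen = {}
    for alert in alerts:
        fingerprint = tuple(sorted(alert["labels"].items()))
        seen.setdefault(fingerprint, alert)  # first occurrence wins
    return list(seen.values())


def group_by(alerts, label_names):
    """Group alerts by the values of the given label names."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in label_names)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts: two identical HighCPU alerts for node n1.
alerts = [
    {"labels": {"alertname": "HighCPU", "node": "n1"}},
    {"labels": {"alertname": "HighCPU", "node": "n1"}},
    {"labels": {"alertname": "HighCPU", "node": "n2"}},
    {"labels": {"alertname": "DiskFull", "node": "n1"}},
]
groups = group_by(deduplicate(alerts), ["alertname"])
```

Each resulting group would then be sent to a receiver as a single notification rather than as one message per alert.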

Elasticsearch curator

Maintains the data (indexes) in Elasticsearch by performing operations such as creating, closing, or opening an index and deleting a snapshot. Also manages the data retention policy in Elasticsearch.

Elasticsearch exporter

The Prometheus exporter that gathers internal Elasticsearch metrics.

Grafana

Builds and visually represents metric graphs based on time series databases. Grafana supports querying of Prometheus using the PromQL language.

Database back ends

StackLight uses PostgreSQL for Alerta and Grafana. PostgreSQL reduces the data storage fragmentation while enabling high availability. High availability is achieved using Patroni, the PostgreSQL cluster manager that monitors for node failures and manages failover of the primary node. StackLight also uses Patroni to manage major version upgrades of PostgreSQL clusters, so that new database engine functionality and improvements can be adopted as they are released upstream, without version lock-in or loss of functional continuity.

Logging stack

Responsible for collecting, processing, and persisting logs and Kubernetes events. By default, when deploying through the Container Cloud web UI, only the metrics stack is enabled on managed clusters. To enable StackLight to gather managed cluster logs, enable the logging stack during deployment. On management clusters, the logging stack is enabled by default. The logging stack components include:

  • Elasticsearch, which stores logs and notifications.

  • Fluentd-elasticsearch, which collects logs, sends them to Elasticsearch, generates metrics based on analysis of incoming log entries, and exposes these metrics to Prometheus.

  • Kibana, which provides real-time visualization of the data stored in Elasticsearch and enables you to detect issues.

  • Metricbeat, which collects Kubernetes events and sends them to Elasticsearch for storage.

  • Prometheus-es-exporter, which presents the Elasticsearch data as Prometheus metrics by periodically sending configured queries to the Elasticsearch cluster and exposing the results to a scrapable HTTP endpoint like other Prometheus targets.

  • Optional. Cerebro, a web UI for managing the Elasticsearch cluster. Using the Cerebro web UI, you can get a detailed view of your Elasticsearch cluster and debug issues. Cerebro is disabled by default.

Note

The performance of the logging mechanism depends on the cluster log load. Under high load, you may need to increase the default resource requests and limits for fluentdElasticsearch. For details, see StackLight configuration parameters: Resource limits.

Metric collector

Collects telemetry data (CPU or memory usage, number of active alerts, and so on) from Prometheus and sends the data to centralized cloud storage for further processing and analysis. Metric collector runs on the management cluster.

Prometheus

Gathers metrics and automatically discovers and monitors endpoints. Using the Prometheus web UI, you can view simple visualizations and debug issues. By default, the Prometheus database stores metrics for the past 15 days or up to 15 GB of data, whichever limit is reached first.
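The interplay of the two retention limits can be sketched as follows. This is an illustration only, not the actual Prometheus storage logic: the `Block` type, the `blocks_to_delete` function, and the sample sizes are hypothetical, and only the 15-day and 15 GB defaults come from the description above.

```python
from dataclasses import dataclass

# Defaults from the description above; everything else is hypothetical.
RETENTION_DAYS = 15
RETENTION_BYTES = 15 * 1024**3  # 15 GB, treated as GiB here (an assumption)


@dataclass
class Block:
    """A hypothetical chunk of stored metrics."""
    name: str
    age_days: float
    size_bytes: int


def blocks_to_delete(blocks, max_days=RETENTION_DAYS, max_bytes=RETENTION_BYTES):
    """Return the names of blocks to drop so that both limits hold.

    Blocks older than max_days are dropped first. Then the oldest remaining
    blocks are dropped until the total size fits within max_bytes.
    """
    doomed = [b.name for b in blocks if b.age_days > max_days]
    kept = sorted((b for b in blocks if b.age_days <= max_days),
                  key=lambda b: b.age_days, reverse=True)  # oldest first
    total = sum(b.size_bytes for b in kept)
    for b in kept:
        if total <= max_bytes:
            break
        doomed.append(b.name)
        total -= b.size_bytes
    return doomed
```

In this sketch, a cluster that produces more than 15 GB of metrics in under 15 days loses its oldest data early, so the effective time window is shorter than 15 days.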

Prometheus-es-exporter

Presents the Elasticsearch data as Prometheus metrics by periodically sending configured queries to the Elasticsearch cluster and exposing the results to a scrapable HTTP endpoint like other Prometheus targets.
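As a rough sketch of the last step, the results of the configured queries could be rendered in the Prometheus text exposition format before being served over HTTP. The function name `to_exposition` and the metric name in the usage line are hypothetical and are not taken from the exporter itself.

```python
def to_exposition(query_results):
    """Render {metric_name: value} as Prometheus text exposition lines."""
    lines = []
    for name, value in sorted(query_results.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    return "\n".join(lines) + "\n"


# Hypothetical result of one configured Elasticsearch query.
payload = to_exposition({"es_error_logs_hits": 3})
```

Prometheus would then scrape this payload from the exporter endpoint in the same way as from any other target.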

Prometheus node exporter

Gathers hardware and operating system metrics exposed by the kernel.

Prometheus Relay

Adds a proxy layer to Prometheus that merges the results from the underlying Prometheus servers to prevent gaps in case some data is missing on some servers. Available only in the HA StackLight mode.
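The gap-filling idea can be illustrated with a minimal sketch. It is not the Prometheus Relay implementation: the function name `merge_series`, the sample data, and the rule that the first replica wins on conflicts are assumptions made for this example.

```python
def merge_series(*replicas):
    """Merge samples from several replicas into one gap-free series.

    Each replica is a dict that maps a timestamp to a sample value. When
    several replicas hold a sample for the same timestamp, the first
    replica listed wins (an arbitrary choice made for this sketch).
    """
    merged = {}
    for samples in replicas:
        for timestamp, value in samples.items():
            merged.setdefault(timestamp, value)
    return dict(sorted(merged.items()))


# Hypothetical data: each replica missed a different scrape.
replica_a = {0: 1.0, 15: 1.1, 45: 1.3}
replica_b = {0: 1.0, 30: 1.2, 45: 1.3}
merged = merge_series(replica_a, replica_b)
```

Here `merged` contains samples for all four timestamps even though neither replica holds the complete series.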

Pushgateway

Enables ephemeral and batch jobs to expose their metrics to Prometheus. Since these jobs may not exist long enough to be scraped, they can instead push their metrics to Pushgateway, which then exposes these metrics to Prometheus. Pushgateway is not an aggregator or a distributed counter but rather a metrics cache. The pushed metrics are exactly the same as those scraped from a permanently running program.
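A batch job could push its metrics with a plain HTTP request, as in the following sketch. It relies on the upstream Pushgateway convention of sending metrics in the text exposition format to the `/metrics/job/<job>` path. The gateway address, job name, and metric name are hypothetical, and the request is only built here, not sent.

```python
import urllib.request


def build_push_request(gateway_url, job, metrics):
    """Build (but do not send) an HTTP PUT that pushes metrics for a job.

    metrics maps a metric name to a numeric value. Each entry becomes one
    line of the Prometheus text exposition format.
    """
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return urllib.request.Request(
        url=f"{gateway_url.rstrip('/')}/metrics/job/{job}",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical address and job name.
request = build_push_request(
    "http://pushgateway.example:9091", "nightly_backup",
    {"backup_duration_seconds": 42.5},
)
# Sending requires a reachable Pushgateway:
# urllib.request.urlopen(request)
```

Because Pushgateway acts as a cache, the pushed value stays available for scraping after the job exits, until it is overwritten or deleted.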

Salesforce notifier

Enables sending Alertmanager notifications to Salesforce to allow creating Salesforce cases and closing them once the alerts are resolved. Disabled by default.

Salesforce reporter

Queries Prometheus for the data about the amount of vCPU, vRAM, and vStorage used and available, combines the data, and sends it to Salesforce daily. Mirantis uses the collected data for further analysis and reports to improve the quality of customer support. Disabled by default.

Telegraf

Collects metrics from the system. Telegraf is plugin-driven and has two distinct sets of plugins: input plugins collect metrics from the system, services, or third-party APIs; output plugins write and expose metrics to various destinations.

The Telegraf agents used in Container Cloud include:

  • telegraf-ds-smart monitors SMART disks and runs on both management and managed clusters.

  • telegraf-ironic monitors Ironic on the baremetal-based management clusters. The ironic input plugin collects and processes data from the Ironic HTTP API, while the http_response input plugin checks the availability of the Ironic HTTP API. To expose the collected data as a Prometheus target, Telegraf uses the prometheus output plugin.

  • telegraf-docker-swarm gathers metrics from the Mirantis Container Runtime API about the Docker nodes, networks, and Swarm services. This is a Docker Telegraf input plugin with downstream additions.

Telemeter

Enables a multi-cluster view through a Grafana dashboard of the management cluster. Telemeter includes a Prometheus federation push server and clients to enable isolated Prometheus instances, which cannot be scraped from a central Prometheus instance, to push metrics to the central location.

The Telemeter services are distributed as follows:

  • The management cluster hosts the Telemeter server.

  • Regional clusters host the Telemeter server and the Telemeter client.

  • Managed clusters host the Telemeter client.

The metrics from managed clusters are aggregated on regional clusters. The regional clusters then send both their own metrics and the managed cluster metrics to the management cluster.
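The push path described above can be summarized in a small sketch. The `NEXT_HOP` mapping and the `push_path` function are hypothetical names that only model the direction of the metric flow, not the Telemeter API.

```python
# Next hop for the metrics pushed from each cluster type, as described
# above. The names are hypothetical and used only for this illustration.
NEXT_HOP = {
    "managed": "regional",
    "regional": "management",
    "management": None,  # final destination
}


def push_path(cluster_type):
    """Return the chain of cluster types that a cluster's metrics traverse."""
    path = [cluster_type]
    while NEXT_HOP[path[-1]] is not None:
        path.append(NEXT_HOP[path[-1]])
    return path
```

For example, `push_path("managed")` yields the managed, regional, and management hops in that order, which matches the aggregation on regional clusters before the data reaches the management cluster.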

Every Helm chart contains a default values.yaml file. The custom values defined in the StackLight Helm chart partially override these defaults.
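A partial override means that only the keys present in the custom values change, while all other defaults stay in effect. The following sketch shows such a recursive overlay in general terms. The function name `merge_values` and the keys in the sample dictionaries are hypothetical and do not reflect the actual StackLight chart values.

```python
import copy


def merge_values(defaults, overrides):
    """Recursively overlay overrides on defaults without mutating either."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars, lists, and new keys replace the default outright.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical chart defaults and custom values.
defaults = {"prometheus": {"retentionTime": "15d", "replicas": 1},
            "logging": {"enabled": False}}
overrides = {"prometheus": {"replicas": 2}}
effective = merge_values(defaults, overrides)
```

In this example, only `replicas` changes in `effective`, while `retentionTime` and the `logging` section keep their default values.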

Before deploying a management or managed cluster, you can select the HA or non-HA StackLight architecture type. The non-HA mode is set by default. The following table lists the differences between the HA and non-HA modes:

StackLight database modes

Non-HA StackLight mode (default)

  • One Prometheus instance

  • One Elasticsearch instance

  • One PostgreSQL instance

One persistent volume is provided for storing data. In case of a service or node failure, a new pod is redeployed and the volume is reattached to provide the existing data. Such a setup has a reduced hardware footprint but provides lower performance.

HA StackLight mode

  • Two Prometheus instances

  • Three Elasticsearch instances

  • Three PostgreSQL instances

Local Volume Provisioner is used to provide local host storage. In case of a service or node failure, the traffic is automatically redirected to any other running Prometheus or Elasticsearch server. For better performance, Mirantis recommends that you deploy StackLight in the HA mode.