Patroni replication lag

PostgreSQL replication in a Patroni cluster relies on syncing the Write-Ahead Log (WAL) from the cluster leader to the replicas. Occasionally, this mechanism may lag because of networking issues, missing WAL segments (after rotation or recycling), increased CPU usage by the Patroni Pods, or a hardware failure.
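As a rough illustration of what a lag value in megabytes means, the sketch below converts two WAL positions (LSNs) into byte offsets and subtracts them. It assumes the standard PostgreSQL LSN text format of two hexadecimal numbers separated by a slash. The function names `lsn_to_bytes` and `lag_mb` and the sample LSN values are hypothetical and are not part of Patroni or StackLight.

```shell
# Convert a PostgreSQL LSN of the form "HI/LO" (two hexadecimal numbers)
# into an absolute byte position: HI * 2^32 + LO.
lsn_to_bytes() {
  local hi lo
  hi="${1%%/*}"
  lo="${1##*/}"
  echo $(( (0x$hi << 32) + 0x$lo ))
}

# Lag in whole megabytes between a leader LSN ($1) and a replica LSN ($2).
lag_mb() {
  local leader replica
  leader="$(lsn_to_bytes "$1")"
  replica="$(lsn_to_bytes "$2")"
  echo $(( (leader - replica) / 1024 / 1024 ))
}

# Made-up positions: the replica is 0x2000000 bytes behind the leader.
lag_mb "0/5000000" "0/3000000"   # prints 32
```

A lag that stays at zero or returns to zero means that the replica keeps up with the leader. A value that keeps growing means that the replica receives or replays WAL slower than the leader produces it.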

In StackLight, the PostgresqlReplicationSlowWalDownload alert indicates that a Patroni cluster replica is out of sync. This alert has the Warning severity because under such conditions the Patroni cluster is still operational and the issue may disappear without intervention. However, a persisting replication lag may impact cluster availability if another Pod in the cluster fails, leaving the leader alone to serve requests. In this case, the Patroni leader becomes read-only and cannot serve write requests, which can cause an outage of Alerta, which is backed by Patroni. Grafana, which also uses Patroni, remains operational, but dashboard changes are not saved.

Therefore, if PostgresqlReplicationSlowWalDownload fires, observe the cluster and fix it if the issue persists or if the lag grows.
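To observe the cluster over time, it can help to filter the `patronictl list` output down to the members that report a lag. The following is a minimal sketch, assuming the table layout shown in the example at the end of this procedure, with Lag in MB as the last column. The `lagging_members` helper is hypothetical and is not shipped with Patroni or StackLight.

```shell
# Print "<member> <lag>" for every member whose "Lag in MB" cell holds a
# non-zero number. Reads the `patronictl list` table from standard input.
lagging_members() {
  awk -F'|' '
    # Data rows contain at least 7 pipe-separated fields. Border lines
    # contain no pipes, and the header row is skipped by its "Member" label.
    NF >= 7 && $2 !~ /Member/ {
      member = $2
      lag = $(NF - 1)
      gsub(/ /, "", member)
      gsub(/ /, "", lag)
      # The leader row has an empty lag cell and is ignored here.
      if (lag ~ /^[0-9]+$/ && lag + 0 > 0) print member, lag
    }'
}

# Possible usage from a workstation (Pod and container names as used in
# this procedure):
#   kubectl exec -n stacklight patroni-13-2 -c patroni -- \
#     patronictl -c postgres.yml list | lagging_members
```

Running the helper a few times, several minutes apart, shows whether the lag shrinks on its own or keeps growing. Empty output means that no member reports a non-zero lag.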

To apply the issue resolution:

  1. Enter the Patroni cluster Pod:

    kubectl exec -it -n stacklight patroni-13-2 -c patroni -- bash

  2. Verify the current cluster state:

    patronictl -c postgres.yml list

    In the Lag in MB column of the output table, the out-of-sync replica Pod shows a non-zero value.

  3. Enter the leader Pod if it is not the current one.

  4. From the leader Pod, resync the replica Pod:

    patronictl -c postgres.yml reinit patroni-13 <REPLICA-MEMBER-NAME>

  5. In the Alertmanager or Alerta web UI, verify that no new alerts are firing for Patroni. The PostgresqlInsufficientWorkingMemory alert may become pending during the operation but should not fire.

  6. Verify that the replication is in sync:

    patronictl -c postgres.yml list

    Example of a positive system response:

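    + Cluster: patroni-13 (6974829572195451235)---+---------+-----+-----------+
    | Member       | Host          | Role         | State   |  TL | Lag in MB |
    +--------------+---------------+--------------+---------+-----+-----------+
    | patroni-13-0 |               | Replica      | running | 875 |         0 |
    | patroni-13-1 |               | Leader       | running | 875 |           |
    | patroni-13-2 |               | Sync Standby | running | 875 |         0 |
    +--------------+---------------+--------------+---------+-----+-----------+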
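
The final verification can also be scripted. The following is a minimal sketch, assuming the table layout from the example above, where healthy replicas report the `running` state and `0` in the Lag in MB column. The `replicas_in_sync` helper is hypothetical and is not part of Patroni or StackLight.

```shell
# Exit with status 0 only if at least one non-leader member is listed and
# every non-leader member is in the "running" state with a lag of 0.
# Reads the `patronictl list` table from standard input.
replicas_in_sync() {
  awk -F'|' '
    NF >= 7 && $2 !~ /Member/ {
      role = $4
      state = $5
      lag = $(NF - 1)
      gsub(/^ +| +$/, "", role)   # trim only the ends: "Sync Standby" has a space
      gsub(/ /, "", state)
      gsub(/ /, "", lag)
      if (role == "Leader") next  # the leader has no lag value
      seen++
      if (state != "running" || lag != "0") bad++
    }
    END {
      ok = (seen > 0 && bad == 0)
      exit ok ? 0 : 1
    }'
}

# Possible usage inside the Patroni container:
#   patronictl -c postgres.yml list | replicas_in_sync && echo "in sync"
```

The check deliberately fails when no replica rows are found, so that an empty or unexpected output is not mistaken for a healthy cluster.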