Patroni replication lag
PostgreSQL replication in a Patroni cluster is based on syncing the Write-Ahead Log (WAL) between the cluster leader and its replicas. Occasionally, this mechanism may lag due to networking issues, missing WAL segments (on rotation or recycle), increased CPU usage by the Patroni Pods, or a hardware failure.
In StackLight, the PostgresqlReplicationSlowWalDownload alert indicates that a Patroni cluster replica is out of sync. This alert has the Warning severity because under such conditions the Patroni cluster is still operational and the issue may disappear without intervention. However, a persisting replication lag may impact the cluster availability if another Pod in the cluster fails, leaving the leader alone to serve requests. In this case, the Patroni leader will become read-only and unable to serve write requests, which can cause an outage of Alerta, which is backed by Patroni. Grafana, which also uses Patroni, will still be operational but any dashboard changes will not be saved. Therefore, if PostgresqlReplicationSlowWalDownload fires, observe the cluster and fix it if the issue persists or if the lag grows.
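The "persists or grows" decision can be sketched as a small shell helper. This is only an illustration: the function name lag_trend and the three labels are hypothetical, and the samples are assumed to be integer Lag in MB values that you noted for one replica over time, oldest first.

```shell
# Hypothetical helper: classify a series of "Lag in MB" samples for one replica.
# Arguments are integer samples, oldest first.
lag_trend() {
  first=$1
  last=$1
  # Walk through all samples to find the most recent one.
  for sample in "$@"; do
    last=$sample
  done
  if [ "$last" -eq 0 ]; then
    echo "recovered"    # lag is gone, no action needed
  elif [ "$last" -gt "$first" ]; then
    echo "growing"      # lag increased since the first sample
  else
    echo "persisting"   # lag is still non-zero but not larger than at the start
  fi
}
```

For example, lag_trend 2 8 16 reports a growing lag, which under the guidance above is a reason to proceed with the resolution steps.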
To resolve the issue:
Enter the Patroni cluster Pod:
kubectl exec -it -n stacklight patroni-13-2 -c patroni -- bash
Verify the current cluster state:
patronictl -c postgres.yml list
In the Lag in MB column of the output table, the replica Pod will indicate a non-zero value.
Enter the leader Pod if it is not the current one.
From the leader Pod, resync the replica Pod:
patronictl -c postgres.yml reinit patroni-13 <REPLICA-MEMBER-NAME>
In the Alertmanager or Alerta web UI, verify that no new alerts are firing for Patroni. The PostgresqlInsufficientWorkingMemory alert may become pending during the operation but should not fire.
Verify that the replication is in sync:
patronictl -c postgres.yml list
Example of a positive system response:
+ Cluster: patroni-13 (6974829572195451235)---+---------+-----+-----------+
| Member       | Host          | Role         | State   | TL  | Lag in MB |
+--------------+---------------+--------------+---------+-----+-----------+
| patroni-13-0 | 10.233.96.11  | Replica      | running | 875 |         0 |
| patroni-13-1 | 10.233.108.39 | Leader       | running | 875 |           |
| patroni-13-2 | 10.233.64.113 | Sync Standby | running | 875 |         0 |
+--------------+---------------+--------------+---------+-----+-----------+
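If you prefer not to read the Lag in MB column by eye, the check can be approximated with a small filter. The following is a minimal sketch, not part of StackLight or Patroni: the function name lagging_members is hypothetical, and it assumes that awk is available where you run it and that the table has the six columns shown in the example above.

```shell
# Hypothetical helper: reads the table printed by `patronictl list` on stdin
# and prints "<member> <lag>" for every member whose "Lag in MB" is not 0.
lagging_members() {
  awk -F'|' '
    # Border and title lines contain no "|" and are skipped by the NF check;
    # the column header row is skipped by matching "Member".
    NF >= 7 && $2 !~ /Member/ {
      member = $2
      lag = $7
      gsub(/ /, "", member)
      gsub(/ /, "", lag)
      # The leader row has an empty lag cell, so it is ignored.
      if (lag != "" && lag != "0") print member, lag
    }
  '
}
```

A possible usage is to pipe the command output into the filter, for example patronictl -c postgres.yml list | lagging_members. Empty output then corresponds to the positive system response shown above, while any printed line names a replica that is still behind. Any non-numeric value in the lag cell is printed as well, so that it is not mistaken for an in-sync replica.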