etcd maintenance service

To facilitate etcd maintenance, MKE 4k deploys the etcd maintenance host service on all controller nodes. This service is currently exposed only on localhost.

To configure the etcd maintenance service, edit the etcd section of the mke4.yaml configuration file:

etcd:
  maintenanceService:
    httpPort: <desired_http_port>
    grpcPort: <desired_grpc_port>

User communication with the etcd maintenance service occurs through the HTTP port, which defaults to port 18088. Internal communication for the service occurs through the gRPC port, which defaults to port 5557.

The etcd maintenance service provides automated maintenance operations for the underlying etcd database in your cluster. Detailed in the table below, these operations serve to prevent etcd from running out of storage space and help maintain optimal cluster performance.

Operation                                                                      Detail
Kubernetes event cleanup and etcd compaction
  • Cleanup requests are forwarded to the etcd leader node.
  • The leader node performs all cleanup operations.
  • A lock prevents concurrent cleanup operations.
etcd defragmentation
  • If enabled, runs automatically once the cleanup operation completes.
  • Processes etcd members in sequential order.
  • Pauses between members to avoid cluster disruption.
  • A lock prevents concurrent defragmentation operations.
Maintenance operation scheduling
  • You can schedule the cleanup and compaction operation and the etcd defragmentation operation to run automatically at specific times.
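As a sketch, such a schedule could be expressed in the maintenanceService section of mke4.yaml. The key names below (cleanupSchedule, defragEnabled) are illustrative assumptions, not confirmed configuration keys; consult the MKE 4 configuration reference for the exact schema:

```yaml
etcd:
  maintenanceService:
    # Hypothetical keys for illustration only; verify the actual
    # field names against the MKE 4 configuration reference.
    cleanupSchedule: "0 2 * * 6"  # cron-style: Saturdays at 02:00
    defragEnabled: true           # defragmentation runs once cleanup completes
```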

The etcd maintenance service runs on all control plane nodes and ensures that only one maintenance operation runs at a time.

Best Practices

  • Set etcd maintenance to take place during periods when cluster usage is minimal, such as weekends or early morning hours.
  • Run etcd maintenance at weekly or monthly intervals. The minimum allowed interval is 72 hours (3 days), and daily schedules are not permitted.
  • Enable both cleanup and defragmentation operations to ensure optimal etcd health.
  • Monitor the first few maintenance runs to verify successful completion.
  • Set appropriate timeout intervals. Configure the defragTimeoutSeconds parameter based on your cluster size and etcd database size, taking into account that larger clusters may need longer timeout intervals.
  • Retain recent events with minTTLToKeepSeconds, as needed for troubleshooting. For instance, set it to 86400 to retain 24 hours of events.
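Taken together, the best practices above might translate into a maintenanceService configuration such as the following sketch. The defragTimeoutSeconds and minTTLToKeepSeconds parameters are described above; the cleanupSchedule key name is an illustrative assumption:

```yaml
etcd:
  maintenanceService:
    httpPort: 18088
    grpcPort: 5557
    # The schedule key name is an assumption; the interval must be at
    # least 72 hours (a weekly schedule is shown here).
    cleanupSchedule: "0 3 * * 0"   # Sundays at 03:00, a low-usage window
    defragTimeoutSeconds: 300      # increase for larger clusters and databases
    minTTLToKeepSeconds: 86400     # retain the last 24 hours of events
```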