Kubernetes event cleanup and etcd compaction

Kubernetes events are generated in response to changes within Kubernetes resources, such as nodes, Pods, or containers. These events are created with a time to live, or TTL, after which they are automatically cleaned up. If, however, a large number of Kubernetes events are generated or other cluster issues arise, it may be necessary to manually clean up the Kubernetes events to prevent etcd from exceeding its quota.

MKE 4k provides event cleanup and compaction utilities that you can use to directly clean up Kubernetes event objects within your cluster. Using these utilities, you can delete all of the Kubernetes events or only those whose TTL is below a minimum that you specify.

The etcd cleanup operation performs the following key tasks:

Deletion of Kubernetes events

Removal of event objects that are stored in etcd under /registry/events/. By default, all events are deleted; however, you can configure a minimum TTL to retain recent events.

Revision compaction

Following key deletion, etcd compacts everything up to, but not including, the latest revision to mark the deleted keys as available for garbage collection. This is a critical step, as deleted keys remain in previous etcd revisions until compaction occurs.

Notes

Cleanup operations are forwarded to the etcd leader node to ensure that they only run once per cluster. A lock mechanism prevents concurrent cleanup operations.
Defragmentation is a separate operation that you should perform following the cleanup operation, if you want to reclaim physical disk space. Refer to etcd defragmentation for detailed information.

Prerequisites

SSH access to a controller node
Access to the HTTP port of the etcd maintenance service. The default port is 18088, which you can change through the httpPort parameter in the mke4.yaml configuration file.

Run the etcd cleanup

Use SSH to connect to any controller node in your MKE 4k cluster:
```
ssh -i <path_to_ssh_key> user@<controller_node_ip>
```
Verify that the etcd maintenance service is running:
```
curl http://localhost:18088/health
```
Expected response:
```
{"status":"OK"}
```
If the service is not running, check the systemd service status:
```
sudo systemctl status etcd-maintenance
```
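If you script the procedure, it can help to gate the later steps on the health check. The following is a minimal sketch, assuming the response format shown above; the MAINT_URL variable and the is_healthy and wait_for_health function names are illustrative and are not part of MKE 4k.

```shell
# Base URL of the maintenance service (MAINT_URL is an illustrative variable).
MAINT_URL="${MAINT_URL:-http://localhost:18088}"

# Succeeds only if the given response body reports "status":"OK".
is_healthy() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"OK"'
}

# Polls /health up to $1 times (default 5), waiting $2 seconds (default 2)
# between attempts. The body runs in a subshell so its variables stay private.
wait_for_health() (
  attempts="${1:-5}"
  delay="${2:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    body="$(curl -fsS --max-time 5 "$MAINT_URL/health" 2>/dev/null)" || body=""
    if is_healthy "$body"; then
      exit 0
    fi
    # Sleep only when another attempt follows.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  exit 1
)
```

For example, `wait_for_health 10 3 || echo "etcd maintenance service is not healthy" >&2` waits for up to ten checks before reporting a problem.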
Perform a dry run to determine what would be deleted, without making any changes:
```
curl -X POST 'http://localhost:18088/cluster/cleanup?dryRun=true'
```
Example response:
```
{
"message": "Cluster cleanup initiated by leader 172.31.0.153: cleanup dryrun completed successfully, 1273 keys would be deleted"
}
```
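When you automate the dry run, you may want the key count as a number, for example to log it or to require confirmation above a threshold. The sketch below extracts the count from a response body. It assumes that the message always has the wording shown in the example response, which is not a documented contract, and the keys_in_response function name is illustrative.

```shell
# Prints the number that precedes "keys" in a cleanup response body.
# Prints nothing if the body does not follow the example message wording.
keys_in_response() {
  printf '%s\n' "$1" | sed -n 's/.*[^0-9]\([0-9][0-9]*\) keys.*/\1/p'
}

# Usage (on a controller node):
#   body="$(curl -fsS -X POST 'http://localhost:18088/cluster/cleanup?dryRun=true')"
#   keys_in_response "$body"
```

Because the function relies on free-form message text, treat an empty result as "unknown" rather than as zero keys.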
To view a list of the keys that will be deleted, include the showKeys=true parameter in the command:
```
curl -X POST 'http://localhost:18088/cluster/cleanup?dryRun=true&showKeys=true'
```
Note

With showKeys=true, the response to the dry run command lists at most the first 1000 keys that a true run would delete. Even with this limit, the response can be voluminous for clusters with many events.

Perform the cleanup by setting dryRun=false or by omitting the parameter altogether:

```
curl -X POST 'http://localhost:18088/cluster/cleanup?dryRun=false'
```

Example response:

```
{
"message": "Cluster cleanup initiated by leader 172.31.0.153: cleanup completed successfully, 1273 keys deleted"
}
```

The cleanup operation deletes the specified Kubernetes events from etcd and then automatically compacts the etcd revision history up to the latest revision, which marks the deleted keys as available for garbage collection.

Advanced cleanup option: Retain events by minimum TTL

Use the MinTTLToKeepSeconds parameter to retain recent events, that is, events whose TTL is at least the specified number of seconds:

```
# Retain events whose TTL is at least 24 hours (86400 seconds)
curl -X POST 'http://localhost:18088/cluster/cleanup?dryRun=false&MinTTLToKeepSeconds=86400'
```

This command deletes only the events whose TTL is lower than the specified value.

Summary of cleanup options

| Parameter | Description | Data type | Default |
|---|---|---|---|
| `dryRun` | If set to `true`, performs a dry run without actual key deletion. | Boolean | `false` |
| `MinTTLToKeepSeconds` | Sets the minimum TTL, in seconds, for events to retain. Only events with a lower TTL are deleted. If not specified, all events are deleted. | Integer | None (optional) |
| `showKeys` | If set to `true`, includes in the response the list of deleted keys or, in a dry run, the keys that would be deleted. Limited to the first 1000 keys. | Boolean | `false` |
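The parameters can be combined in a single query string. The following sketch assembles the request URL from the three documented parameters. The cleanup_url function name and the MAINT_URL variable are illustrative. Unlike the service itself, the helper defaults dryRun to true so that a call without arguments cannot delete anything.

```shell
# Builds a cleanup URL from the three documented query parameters.
#   $1  dryRun (true|false); this helper defaults to true as a safety measure
#   $2  showKeys (true|false); defaults to false
#   $3  MinTTLToKeepSeconds; optional, left out of the URL when empty
cleanup_url() {
  case "${3:-}" in
    *[!0-9]*)
      echo "MinTTLToKeepSeconds must be a non-negative integer" >&2
      return 1
      ;;
  esac
  url="${MAINT_URL:-http://localhost:18088}/cluster/cleanup?dryRun=${1:-true}&showKeys=${2:-false}"
  if [ -n "${3:-}" ]; then
    url="${url}&MinTTLToKeepSeconds=$3"
  fi
  printf '%s\n' "$url"
}

# Usage (on a controller node):
#   curl -X POST "$(cleanup_url true true 86400)"
```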

Troubleshooting

Cleanup failure

If cleanup operations fail:

Check the error message in the response.
Verify that you are connected to a controller node.
Ensure that the etcd cluster is healthy.
Check whether another cleanup operation is in progress, as cleanup operations are locked to prevent concurrent execution.
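If a failure is caused by another cleanup that is still holding the lock, waiting and trying again is often enough. The following generic helper is a sketch of that approach. The retry function name is illustrative, and the usage example assumes that the service signals a failed cleanup with an HTTP error status that curl -f can detect, which this document does not confirm.

```shell
# Runs a command up to $1 times, waiting $2 seconds after each failure.
# The body runs in a subshell so its variables stay private.
retry() (
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while :; do
    "$@" && exit 0
    [ "$i" -ge "$attempts" ] && exit 1
    sleep "$delay"
    i=$((i + 1))
  done
)

# Usage (on a controller node), three attempts spaced 30 seconds apart:
#   retry 3 30 curl -fsS -X POST 'http://localhost:18088/cluster/cleanup?dryRun=false'
```

If every attempt fails, inspect the error message in the last response before trying again.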

No keys deleted

If the cleanup reports 0 keys deleted:

All events may have already been cleaned up automatically when their TTL expired.
The MinTTLToKeepSeconds value may be low enough that every remaining event meets the minimum TTL and is therefore retained.
Perform a dry run with the showKeys=true option to determine what will be deleted in a true run.