Known issues

MKE 3.7.6 known issues with available workarounds include:

[MKE-10152] Upgrading large Windows clusters can initiate a rollback

Upgrades can roll back on a cluster with a large number of Windows worker nodes.

Workaround:

Invoke the --manual-worker-upgrade option and then manually upgrade the workers.
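For example, a minimal sketch of the upgrade invocation, assuming the standard method of running the MKE upgrade through the mirantis/ucp image (substitute the tag for your target version):

# Run the upgrade with the manual worker upgrade option; workers are then
# upgraded manually as described above.
docker container run --rm -it --name ucp \
  -v /var/run/docker.sock:/var/run/docker.sock \
  mirantis/ucp:3.7.6 \
  upgrade --manual-worker-upgrade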

[MKE-10017] ucp-pause containers use incorrect version after upgrade rollback

After rolling back to MKE 3.6.0 - 3.6.4 during an upgrade to 3.7.0, some MKE nodes may run ucp-pause containers built from the upgraded version of the MKE images.

Workaround:

Perform the following steps on each Linux node where the ucp-pause containers are built from the upgraded MKE image version.

  1. Verify that the ucp-pause containers are using the MKE image version to which you tried to upgrade:

    docker ps -a | grep ucp-pause
    

    Example output:

    01a80dd175de   mirantiseng/ucp-pause:3.7.0   "/pause"   17 minutes ago   Up 16 minutes   k8s_POD_ucp-node-feature-discovery-9bwsj_node-feature-discovery_0a601160-ecf7-412f-bff8-e421a4f1712d_0
    498371f35994   mirantiseng/ucp-pause:3.7.0   "/pause"   20 minutes ago   Up 18 minutes   k8s_POD_coredns-7fb76597fc-k2q2k_kube-system_83fee771-dc1d-4e34-ae45-f0ab9dee5942_0
    a94cfcfb18f6   mirantiseng/ucp-pause:3.7.0   "/pause"   22 minutes ago   Up 21 minutes   k8s_POD_calico-kube-controllers-58c64b9976-mg5dn_kube-system_0b80ed92-be02-40de-827e-6a6b6e7f27da_0
    0a2cf203f77c   mirantiseng/ucp-pause:3.7.0   "/pause"   22 minutes ago   Up 21 minutes   k8s_POD_calico-node-f2xhl_kube-system_3c4a27c5-b832-417d-bc30-b6a7ca8f7627_0
    

    If the ucp-pause containers are using the correct image version, proceed to the next node.

  2. Copy the cri-dockerd-mke.service configuration file from the tmp directory to /usr/lib/systemd/system:

    sudo cp /tmp/cri-dockerd-mke.service /usr/lib/systemd/system
    
  3. Restart kubelet to load the most recent configuration file:

    docker rm -f ucp-kubelet
    
  4. Delete all ucp-pause containers that are on the node:

    docker rm -f <pause-container-id-1> <pause-container-id-n>
    
  5. Verify that the ucp-pause containers are using the correct MKE image version:

    docker ps -a | grep ucp-pause
    

    Example output:

    236b3dfb1bf6   mirantiseng/ucp-pause:3.6.4   "/pause"   12 seconds ago   Up 11 seconds   k8s_POD_calico-node-dp7hd_kube-system_d59d9004-5a59-46f8-8281-3c917c62fe20_0
    56994306b181   mirantiseng/ucp-pause:3.6.4   "/pause"   12 seconds ago   Up 11 seconds   k8s_POD_calico-kube-controllers-64844db68f-br9dh_kube-system_5ea39708-231a-45f5-aa7c-f7b842131941_0
    e62ae3c2a871   mirantiseng/ucp-pause:3.6.4   "/pause"   12 seconds ago   Up 11 seconds   k8s_POD_ucp-node-feature-discovery-rdrb7_node-feature-discovery_848cda05-74ec-4db2-825f-05afa53b2502_0
    d51eba420f34   mirantiseng/ucp-pause:3.6.4   "/pause"   12 seconds ago   Up 11 seconds   k8s_POD_coredns-78c7f4f4c7-lljzc_kube-system_92936b7c-6a7c-4eb5-a83f-22514acac636_0
    

[MKE-9699] Ingress Controller with external load balancer can enter crashloop

Due to the upstream Kubernetes issue 73140, rapid toggling of the Ingress Controller with an external load balancer in use can cause the resource to become stuck in a crashloop.

Workaround:

  1. Log in to the MKE web UI as an administrator.

  2. In the left-side navigation panel, navigate to <user name> > Admin Settings > Ingress.

  3. Click the Kubernetes tab to display the HTTP Ingress Controller for Kubernetes pane.

  4. Toggle the HTTP Ingress Controller for Kubernetes enabled control to the left to disable the Ingress Controller.

  5. Use the CLI to delete the Ingress Controller resources:

    kubectl delete service ingress-nginx-controller-admission --namespace ingress-nginx
    kubectl delete deployment ingress-nginx-controller --namespace ingress-nginx
    
  6. Verify the successful deletion of the resources:

    kubectl get all --namespace ingress-nginx
    

    Example output:

    No resources found in ingress-nginx namespace.
    
  7. Return to the HTTP Ingress Controller for Kubernetes pane in the MKE web UI and change the NodePort numbers for HTTP Port, HTTPS Port, and TCP Port.

  8. Toggle the HTTP Ingress Controller for Kubernetes enabled control to the right to re-enable the Ingress Controller.

[MKE-8662] Swarm-only manager nodes are labeled as mixed mode

When MKE is installed in Swarm-only mode, manager nodes start off labeled as mixed mode. Because the Kubernetes installation is skipped altogether, however, these nodes should be labeled as Swarm mode.

Workaround:

Change the node labels following installation.
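A minimal sketch of such a label change, assuming the usual MKE orchestrator node labels (com.docker.ucp.orchestrator.swarm and com.docker.ucp.orchestrator.kubernetes); confirm the exact labels on your nodes with docker node inspect before modifying them:

# Mark a manager node as Swarm-only: ensure the Swarm orchestrator label is set
# and remove the Kubernetes orchestrator label that makes the node read as mixed.
docker node update \
  --label-add com.docker.ucp.orchestrator.swarm=true \
  --label-rm com.docker.ucp.orchestrator.kubernetes \
  <manager-node-name>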

[MKE-8914] Windows Server Core with Containers images incompatible with GCP

The use of Windows Server Core with Containers images prevents kubelet from starting up, as these images are not compatible with GCP.

Workaround:

Use Windows Server or Windows Server Core images.

[MKE-8814] Mismatched MTU values cause Swarm overlay network issues on GCP

Communication between GCP VPCs and Docker networks that use Swarm overlay networks will fail if their MTU values are not manually aligned. By default, the MTU value for GCP VPCs is 1460, while the default MTU value for Docker networks is 1500.

Workaround:

Select from the following options:

  • Create a new VPC and set the MTU value to 1500.

  • Set the MTU value of the existing VPC to 1500.

For more information, refer to the Google Cloud Platform documentation, Change the MTU setting of a VPC network.
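A minimal sketch of both options using the gcloud CLI, with a placeholder network name:

# Option 1: create a new VPC whose MTU matches the Docker network default (1500).
gcloud compute networks create example-vpc --subnet-mode=custom --mtu=1500

# Option 2: raise the MTU of an existing VPC to 1500.
gcloud compute networks update example-vpc --mtu=1500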

[FIELD-6785] Reinstallation can fail following cluster CA rotation

If MKE 3.7.x is uninstalled soon after a cluster CA rotation, reinstalling MKE 3.7.x or 3.6.x on the existing Docker Swarm can fail with the following error message:

unable to sign cert: {\"code\":1000,\"message\":\"x509: provided PrivateKey doesn't match parent's PublicKey\"}

Workaround:

  1. Forcefully trigger a Swarm snapshot:

    # Save the current Raft snapshot interval, set it to 1 so that a snapshot is
    # taken almost immediately, then restore the original value.
    old_val=$(docker info --format '{{.Swarm.Cluster.Spec.Raft.SnapshotInterval}}')
    docker swarm update --snapshot-interval 1
    docker swarm update --snapshot-interval ${old_val}
    
  2. Reattempt to install MKE.

[FIELD-6782] 0.0.0.0 node IP causes ucp-cluster-agent to continually restart

A Docker Swarm node IP address of 0.0.0.0 causes ucp-cluster-agent to restart continually with the following errors:

ucp-cluster-agent.1.l6r0qw5axgjn@ub2004-4    |
{"level":"debug","msg":"errored kubectl container stderr: The Endpoints
\"ucp-controller\" is invalid: subsets[0].addresses[0].ip: Invalid value:
\"0.0.0.0\": may not be unspecified
(0.0.0.0)\n","time":"2024-01-19T06:56:39Z"}

ucp-cluster-agent.1.l6r0qw5axgjn@ub2004-4    |
{"level":"error","msg":"Unable to handle cluster component
ucp-controller-service: unable to reconcile cluster component state: unable
to `kubectl apply` UCP Controller Service Endpoints: could not run kubectl
container: container ucp-kubectl exited with
1","time":"2024-01-19T06:56:39Z"}

Workaround:

Fix the node IP 0.0.0.0:

  1. Stop the Docker daemon on the node whose .Status.Addr is 0.0.0.0.

  2. In the /var/lib/docker/swarm/docker-state.json file, apply the correct node IP to AdvertiseAddr and LocalAddr.

  3. Start the Docker daemon.

For example, the docker-state.json content would change from:

`{"LocalAddr":"","RemoteAddr":"10.200.200.10:2377","ListenAddr":"0.0.0.0:2377","AdvertiseAddr":"","DataPathAddr":"","DefaultAddressPool":null,"SubnetSize":0,"DataPathPort":0,"JoinInProgress":false,"FIPS":false}`

to:

`{"LocalAddr":"10.200.200.13","RemoteAddr":"","ListenAddr":"0.0.0.0:2377","AdvertiseAddr":"10.200.200.13:2377","DataPathAddr":"","DefaultAddressPool":null,"SubnetSize":0,"DataPathPort":0,"JoinInProgress":false,"FIPS":false}`

[FIELD-6402] Default metric collection memory settings may be insufficient

In MKE 3.7, ucp-metrics collects more metrics than in previous versions of MKE. As a result, for large clusters with many nodes, the following ucp-metrics component default settings may be insufficient:

  • memory request: 1Gi

  • memory limit: 2Gi

Workaround:

Administrators can modify the MKE configuration file to increase the default memory request and memory limit values for the ucp-metrics component. Both settings are under the cluster section, as shown in the sketch after this list:

  • For memory request, modify the prometheus_memory_request setting

  • For memory limit, modify the prometheus_memory_limit setting
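A hedged sketch of the corresponding fragment of the MKE configuration file (TOML), assuming the cluster section is the [cluster_config] table and using illustrative values only; choose values appropriate to your cluster size:

[cluster_config]
  # Illustrative values: raise the ucp-metrics memory request and limit above
  # the defaults of 1Gi and 2Gi.
  prometheus_memory_request = "2Gi"
  prometheus_memory_limit = "4Gi"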

[MKE-11281] cAdvisor Pods on Windows nodes cannot enter ‘Running’ state

When you enable cAdvisor, Pods are deployed to every node in the cluster. These cAdvisor Pods only work on Linux nodes, however, so the Pods that are inadvertently targeted to Windows nodes remain perpetually suspended and never actually run.

Workaround:

Patch the ucp-cadvisor DaemonSet to include a node selector so that only Linux nodes are targeted:

kubectl patch daemonset ucp-cadvisor -n kube-system --type='json' \
  -p='[{"op": "replace", "path": "/spec/template/spec/nodeSelector", "value": {"kubernetes.io/os": "linux"}}]'

[MKE-11282] --swarm-only upgrade fails due to ‘unavailable’ manager ports

Upgrades of Swarm-only clusters that were originally installed using the --swarm-only option fail pre-upgrade checks at the Check 7 of 8: [Port Requirements] step.

Workaround:

Include the --force-port-check upgrade option when upgrading a Swarm-only cluster.
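For example, a minimal sketch of the upgrade invocation, assuming the standard method of running the MKE upgrade through the mirantis/ucp image (substitute the tag for your target version and add any other options your upgrade normally requires):

# Work around the Port Requirements check failure described above.
docker container run --rm -it --name ucp \
  -v /var/run/docker.sock:/var/run/docker.sock \
  mirantis/ucp:3.7.6 \
  upgrade --force-port-check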