Inspection error on bare metal hosts after dnsmasq restart¶
If the dnsmasq pod is restarted during the bootstrap of newly added nodes, those nodes may fail to undergo inspection. This can result in the inspection error in the corresponding BareMetalHost objects.
The issue can occur when:

- The dnsmasq pod was moved to another node.
- DHCP subnets were changed, including addition or removal. In this case, the dhcpd container of the dnsmasq pod is restarted.

Caution

If you need to change or add DHCP subnets to bootstrap new nodes, wait until the dnsmasq pod becomes ready after the change, and only then create the BareMetalHost objects.
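For example, a minimal sketch of such a readiness check using kubectl wait; the app=dnsmasq label selector is an assumption, adjust it to match the labels used in your deployment:

kubectl -n kaas wait pod -l app=dnsmasq --for=condition=Ready --timeout=300s   # blocks until the dnsmasq pod reports Ready or the timeout expires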
To verify whether the nodes are affected:
Verify whether the BareMetalHost objects contain the inspection error:

kubectl get bmh -n <managed-cluster-namespace-name>

Example of system response:

NAME            STATE         CONSUMER        ONLINE   ERROR              AGE
test-master-1   provisioned   test-master-1   true                        9d
test-master-2   provisioned   test-master-2   true                        9d
test-master-3   provisioned   test-master-3   true                        9d
test-worker-1   provisioned   test-worker-1   true                        9d
test-worker-2   provisioned   test-worker-2   true                        9d
test-worker-3   inspecting                    true     inspection error   19h
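To list only the hosts that currently report the inspection error, you can optionally filter the output with a JSONPath expression; this is a convenience sketch, not part of the original procedure:

kubectl get bmh -n <managed-cluster-namespace-name> -o jsonpath='{range .items[?(@.status.errorType=="inspection error")]}{.metadata.name}{"\n"}{end}'   # prints the names of affected BareMetalHost objects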
Verify whether the dnsmasq pod was in the Ready state when the inspection of the affected bare metal hosts (test-worker-3 in the example above) was started:

kubectl -n kaas get pod <dnsmasq-pod-name> -oyaml

Example of system response:

...
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-10-10T15:37:34Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-10-11T07:38:54Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-10-11T07:38:54Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-10-10T15:37:34Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://6dbcf2fc4b36ce4c549c9191ab01f72d0236c51d42947675302675e4bfaf4cdf
    image: docker-dev-kaas-virtual.artifactory-eu.mcp.mirantis.net/bm/baremetal-dnsmasq:base-2-28-alpine-20240812132650
    imageID: docker-dev-kaas-virtual.artifactory-eu.mcp.mirantis.net/bm/baremetal-dnsmasq@sha256:3dad3e278add18e69b2608e462691c4823942641a0f0e25e6811e703e3c23b3b
    lastState:
      terminated:
        containerID: containerd://816fcf079cd544acd74e312065de5b5ed4dbf1dc6159fefffff4f644b5e45987
        exitCode: 0
        finishedAt: "2024-10-11T07:38:35Z"
        reason: Completed
        startedAt: "2024-10-10T15:37:45Z"
    name: dhcpd
    ready: true
    restartCount: 2
    started: true
    state:
      running:
        startedAt: "2024-10-11T07:38:37Z"
...
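To extract only the relevant timestamps instead of reading the full YAML, a sketch using JSONPath expressions whose field paths match the response above:

kubectl -n kaas get pod <dnsmasq-pod-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'   # when the pod last became Ready
kubectl -n kaas get pod <dnsmasq-pod-name> -o jsonpath='{.status.containerStatuses[?(@.name=="dhcpd")].lastState.terminated.finishedAt}'   # when the previous dhcpd container stopped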
In the system response above, the dhcpd container was not ready between "2024-10-11T07:38:35Z" and "2024-10-11T07:38:54Z".

Verify the affected bare metal host. For example:
kubectl get bmh -n managed-ns test-worker-3 -oyaml
Example of system response:
...
status:
  errorCount: 15
  errorMessage: Introspection timeout
  errorType: inspection error
  ...
  operationHistory:
    deprovision:
      end: null
      start: null
    inspect:
      end: null
      start: "2024-10-11T07:38:19Z"
    provision:
      end: null
      start: null
    register:
      end: "2024-10-11T07:38:19Z"
      start: "2024-10-11T07:37:25Z"
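The inspection start time can also be read directly; a sketch using the operationHistory field shown above:

kubectl get bmh -n managed-ns test-worker-3 -o jsonpath='{.status.operationHistory.inspect.start}'   # compare this timestamp against the dhcpd downtime window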
In the system response above, inspection was started at "2024-10-11T07:38:19Z", immediately before the period of the dhcpd container downtime. Therefore, this node is most likely affected by the issue.
To apply the issue resolution:
Reboot the node using the IPMI reset or cycle command.
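For example, a minimal sketch using ipmitool; the BMC address and credentials below are placeholders that you must replace with the values for the affected host:

ipmitool -I lanplus -H <bmc-ip-address> -U <bmc-user> -P <bmc-password> chassis power cycle   # or 'chassis power reset' for a hard reset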
If the node fails to boot, remove the failed BareMetalHost object and create it again:

Remove the BareMetalHost object. For example:

kubectl delete bmh -n managed-ns test-worker-3

Verify that the BareMetalHost object is removed:

kubectl get bmh -n managed-ns test-worker-3

Create a BareMetalHost object from the template. For example:

kubectl create -f bmhc-test-worker-3.yaml
kubectl create -f bmh-test-worker-3.yaml