Inspection error on bare metal hosts after dnsmasq restart

If the dnsmasq pod is restarted during the bootstrap of newly added nodes, those nodes may fail to undergo inspection. This can result in an inspection error in the corresponding BareMetalHost objects.

The issue can occur when:

  • The dnsmasq pod was moved to another node.

  • DHCP subnets were added, changed, or removed. In this case, the dhcpd container of the dnsmasq pod is restarted.

    Caution

    If you need to add or change DHCP subnets to bootstrap new nodes, wait until the dnsmasq pod becomes Ready after the change, and only then create the BareMetalHost objects, as in the example below.
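
    For example, to wait until the dnsmasq pod becomes Ready before creating BareMetalHost objects, you can use kubectl wait. This is a minimal sketch: the app=dnsmasq label selector is an assumption, verify the actual labels of the dnsmasq pod in your environment.

      # the app=dnsmasq label selector is an assumption; adjust it to the actual pod labels
      kubectl -n kaas wait --for=condition=Ready pod -l app=dnsmasq --timeout=300s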

To verify whether the nodes are affected:

  1. Verify whether the BareMetalHost objects contain the inspection error:

    kubectl get bmh -n <managed-cluster-namespace-name>
    

    Example of system response:

    NAME            STATE         CONSUMER        ONLINE   ERROR              AGE
    test-master-1   provisioned   test-master-1   true                        9d
    test-master-2   provisioned   test-master-2   true                        9d
    test-master-3   provisioned   test-master-3   true                        9d
    test-worker-1   provisioned   test-worker-1   true                        9d
    test-worker-2   provisioned   test-worker-2   true                        9d
    test-worker-3   inspecting                    true     inspection error   19h
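
    To list only the hosts that currently report an error, you can filter the same output. For example:

    kubectl get bmh -n <managed-cluster-namespace-name> | grep "inspection error"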
    
  2. Verify whether the dnsmasq pod was in the Ready state when the inspection of the affected bare metal host (test-worker-3 in the example above) started:

    kubectl -n kaas get pod <dnsmasq-pod-name> -oyaml
    

    Example of system response:

    ...
    status:
      conditions:
      - lastProbeTime: null
        lastTransitionTime: "2024-10-10T15:37:34Z"
        status: "True"
        type: Initialized
      - lastProbeTime: null
        lastTransitionTime: "2024-10-11T07:38:54Z"
        status: "True"
        type: Ready
      - lastProbeTime: null
        lastTransitionTime: "2024-10-11T07:38:54Z"
        status: "True"
        type: ContainersReady
      - lastProbeTime: null
        lastTransitionTime: "2024-10-10T15:37:34Z"
        status: "True"
        type: PodScheduled
      containerStatuses:
      - containerID: containerd://6dbcf2fc4b36ce4c549c9191ab01f72d0236c51d42947675302675e4bfaf4cdf
        image: docker-dev-kaas-virtual.artifactory-eu.mcp.mirantis.net/bm/baremetal-dnsmasq:base-2-28-alpine-20240812132650
        imageID: docker-dev-kaas-virtual.artifactory-eu.mcp.mirantis.net/bm/baremetal-dnsmasq@sha256:3dad3e278add18e69b2608e462691c4823942641a0f0e25e6811e703e3c23b3b
        lastState:
          terminated:
            containerID: containerd://816fcf079cd544acd74e312065de5b5ed4dbf1dc6159fefffff4f644b5e45987
            exitCode: 0
            finishedAt: "2024-10-11T07:38:35Z"
            reason: Completed
            startedAt: "2024-10-10T15:37:45Z"
        name: dhcpd
        ready: true
        restartCount: 2
        started: true
        state:
          running:
            startedAt: "2024-10-11T07:38:37Z"
      ...
    

    In the system response above, the dhcpd container was not ready between "2024-10-11T07:38:35Z" and "2024-10-11T07:38:54Z".
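
    To extract only the relevant timestamps instead of reading the full pod YAML, you can use jsonpath queries. For example (the pod name is a placeholder; the assumption that the pod name contains "dnsmasq" applies only to the first command):

    # list candidate pods; the dnsmasq pod name is assumed to contain "dnsmasq"
    kubectl -n kaas get pods | grep dnsmasq
    # Ready condition transition time of the dnsmasq pod
    kubectl -n kaas get pod <dnsmasq-pod-name> \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'
    # time when the previous dhcpd container instance terminated
    kubectl -n kaas get pod <dnsmasq-pod-name> \
      -o jsonpath='{.status.containerStatuses[?(@.name=="dhcpd")].lastState.terminated.finishedAt}'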

  3. Verify the status of the affected bare metal host. For example:

    kubectl get bmh -n managed-ns test-worker-3 -oyaml
    

    Example of system response:

    ...
    status:
      errorCount: 15
      errorMessage: Introspection timeout
      errorType: inspection error
      ...
      operationHistory:
        deprovision:
          end: null
          start: null
        inspect:
          end: null
          start: "2024-10-11T07:38:19Z"
        provision:
          end: null
          start: null
        register:
          end: "2024-10-11T07:38:19Z"
          start: "2024-10-11T07:37:25Z"
    

    In the system response above, inspection was started at "2024-10-11T07:38:19Z", immediately before the period of the dhcpd container downtime. Therefore, this node is most likely affected by the issue.
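
    To retrieve only the inspection start time for comparison with the dhcpd downtime window, you can query the corresponding status field directly. For example:

    kubectl get bmh -n managed-ns test-worker-3 \
      -o jsonpath='{.status.operationHistory.inspect.start}'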

To apply the issue resolution:

  1. Reboot the node using the IPMI reset or cycle command.
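
    For example, using ipmitool (one possible tool; the BMC address and credentials are placeholders):

    # power cycle the affected host through its BMC
    ipmitool -I lanplus -H <bmc-ip-address> -U <bmc-user> -P <bmc-password> chassis power cycle
    # alternatively, issue a hard reset
    ipmitool -I lanplus -H <bmc-ip-address> -U <bmc-user> -P <bmc-password> chassis power reset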

  2. If the node fails to boot, remove the failed BareMetalHost object and create it again:

    1. Remove the BareMetalHost object. For example:

      kubectl delete bmh -n managed-ns test-worker-3
      
    2. Verify that the BareMetalHost object is removed:

      kubectl get bmh -n managed-ns test-worker-3
      
    3. Create a BareMetalHost object from the template. For example:

      kubectl create -f bmhc-test-worker-3.yaml
      kubectl create -f bmh-test-worker-3.yaml
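
      After re-creating the object, you can monitor its state to verify that registration and inspection complete. For example:

      kubectl get bmh -n managed-ns test-worker-3 -w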