Inspection error on bare metal hosts after dnsmasq restart

If the dnsmasq pod is restarted during the bootstrap of newly added nodes, those nodes may fail to undergo inspection. This can result in an inspection error in the corresponding BareMetalHost objects.

The issue can occur when:

  • The dnsmasq pod was moved to another node.

  • DHCP subnets were added, changed, or removed. In this case, the dhcpd container of the dnsmasq pod is restarted.

    Caution

    If you need to add or change DHCP subnets to bootstrap new nodes, wait until the dnsmasq pod becomes Ready after the change, and only then create the BareMetalHost objects, as in the example below.
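
    For example, to wait until the dnsmasq pod becomes Ready before creating BareMetalHost objects, you can use kubectl wait. This is a minimal sketch: the app=dnsmasq label selector is an assumption, verify the actual labels of the dnsmasq pod in your environment.

      # the app=dnsmasq label selector is an assumption; adjust it to the actual pod labels
      kubectl -n kaas wait --for=condition=Ready pod -l app=dnsmasq --timeout=300s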

To verify whether the nodes are affected:

  1. Verify whether the BareMetalHost objects contain the inspection error:

    kubectl get bmh -n <managed-cluster-namespace-name>
    

    Example of system response:

    NAME            STATE         CONSUMER        ONLINE   ERROR              AGE
    test-master-1   provisioned   test-master-1   true                        9d
    test-master-2   provisioned   test-master-2   true                        9d
    test-master-3   provisioned   test-master-3   true                        9d
    test-worker-1   provisioned   test-worker-1   true                        9d
    test-worker-2   provisioned   test-worker-2   true                        9d
    test-worker-3   inspecting                    true     inspection error   19h
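
    To list only the hosts that currently report an error, you can filter the same output. For example:

    kubectl get bmh -n <managed-cluster-namespace-name> | grep "inspection error"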
    
  2. Verify whether the dnsmasq pod was in the Ready state when the inspection of the affected bare metal host (test-worker-3 in the example above) started:

    kubectl -n kaas get pod <dnsmasq-pod-name> -oyaml
    

    Example of system response:

    ...
    status:
      conditions:
      - lastProbeTime: null
        lastTransitionTime: "2024-10-10T15:37:34Z"
        status: "True"
        type: Initialized
      - lastProbeTime: null
        lastTransitionTime: "2024-10-11T07:38:54Z"
        status: "True"
        type: Ready
      - lastProbeTime: null
        lastTransitionTime: "2024-10-11T07:38:54Z"
        status: "True"
        type: ContainersReady
      - lastProbeTime: null
        lastTransitionTime: "2024-10-10T15:37:34Z"
        status: "True"
        type: PodScheduled
      containerStatuses:
      - containerID: containerd://6dbcf2fc4b36ce4c549c9191ab01f72d0236c51d42947675302675e4bfaf4cdf
        image: docker-dev-kaas-virtual.artifactory-eu.mcp.mirantis.net/bm/baremetal-dnsmasq:base-2-28-alpine-20240812132650
        imageID: docker-dev-kaas-virtual.artifactory-eu.mcp.mirantis.net/bm/baremetal-dnsmasq@sha256:3dad3e278add18e69b2608e462691c4823942641a0f0e25e6811e703e3c23b3b
        lastState:
          terminated:
            containerID: containerd://816fcf079cd544acd74e312065de5b5ed4dbf1dc6159fefffff4f644b5e45987
            exitCode: 0
            finishedAt: "2024-10-11T07:38:35Z"
            reason: Completed
            startedAt: "2024-10-10T15:37:45Z"
        name: dhcpd
        ready: true
        restartCount: 2
        started: true
        state:
          running:
            startedAt: "2024-10-11T07:38:37Z"
      ...
    

    In the system response above, the dhcpd container was not ready between "2024-10-11T07:38:35Z" and "2024-10-11T07:38:54Z".
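
    To extract only the relevant timestamps instead of reading the full pod YAML, you can use jsonpath queries. For example (the pod name is a placeholder; the assumption that the pod name contains "dnsmasq" applies only to the first command):

    # list candidate pods; the dnsmasq pod name is assumed to contain "dnsmasq"
    kubectl -n kaas get pods | grep dnsmasq
    # Ready condition transition time of the dnsmasq pod
    kubectl -n kaas get pod <dnsmasq-pod-name> \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'
    # time when the previous dhcpd container instance terminated
    kubectl -n kaas get pod <dnsmasq-pod-name> \
      -o jsonpath='{.status.containerStatuses[?(@.name=="dhcpd")].lastState.terminated.finishedAt}'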

  3. Verify the status of the affected bare metal host. For example:

    kubectl get bmh -n managed-ns test-worker-3 -oyaml
    

    Example of system response:

    ...
    status:
      errorCount: 15
      errorMessage: Introspection timeout
      errorType: inspection error
      ...
      operationHistory:
        deprovision:
          end: null
          start: null
        inspect:
          end: null
          start: "2024-10-11T07:38:19Z"
        provision:
          end: null
          start: null
        register:
          end: "2024-10-11T07:38:19Z"
          start: "2024-10-11T07:37:25Z"
    

    In the system response above, inspection was started at "2024-10-11T07:38:19Z", immediately before the period of the dhcpd container downtime. Therefore, this node is most likely affected by the issue.
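
    To retrieve only the inspection start time for comparison with the dhcpd downtime window, you can query the corresponding status field directly. For example:

    kubectl get bmh -n managed-ns test-worker-3 \
      -o jsonpath='{.status.operationHistory.inspect.start}'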

To apply the issue resolution:

  1. Reboot the node using the IPMI reset or cycle command.
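
    For example, using ipmitool (one possible tool; the BMC address and credentials are placeholders):

    # power cycle the affected host through its BMC
    ipmitool -I lanplus -H <bmc-ip-address> -U <bmc-user> -P <bmc-password> chassis power cycle
    # alternatively, issue a hard reset
    ipmitool -I lanplus -H <bmc-ip-address> -U <bmc-user> -P <bmc-password> chassis power reset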

  2. If the node fails to boot, remove the failed BareMetalHost object and create it again:

    1. Remove the BareMetalHost object. For example:

      kubectl delete bmh -n managed-ns test-worker-3
      
    2. Verify that the BareMetalHost object is removed:

      kubectl get bmh -n managed-ns test-worker-3
      
    3. Create a BareMetalHost object from the template. For example:

      kubectl create -f bmhc-test-worker-3.yaml
      kubectl create -f bmh-test-worker-3.yaml
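
      After re-creating the object, you can monitor its state to verify that registration and inspection complete. For example:

      kubectl get bmh -n managed-ns test-worker-3 -w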