Machine provisioning issues during cluster deployment with private network

A node with Intel NICs may randomly get stuck in the Provisioning state during deployment of an Equinix Metal-based management or managed cluster with a private network. In this case, the affected machine does not respond to ping on its internal IP, for example, 192.168.0.53.

The issue affects hardware that has Intel Boot Agent (IBA) installed and configured as the first boot option on the server. An affected server continues booting from iPXE instead of the hard drive even after successful provisioning. As a result, the machine becomes inaccessible and the cluster deployment gets stuck.

Verify whether the cluster is affected

  1. Verify that the BareMetalHost object of the affected machine is in the provisioned state:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <clusterProjectName> get bmh <machineName>
    

    Example of system response:

    NAME                  STATE         CONSUMER              BOOTMODE   ONLINE   ERROR   REGION
    mgmt-controlplane-0   provisioned   mgmt-controlplane-0              true             region-one
    
  2. Obtain the internal IP of the affected machine:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <clusterProjectName> get machine <machineName> -o jsonpath='{.status.providerStatus.privateIp}'
    
  3. Ping the affected machine.

If the BareMetalHost state of the affected machine is provisioned but the machine does not respond to ping on its internal IP for an extended period, the cluster is affected. Proceed to the issue resolution below.
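
The three verification steps above can be combined into a single check. The following is a minimal sketch, not product tooling: is_affected is a hypothetical helper name, and the script assumes that STATE is the second column of the kubectl get bmh output, as in the example response above. The exit codes (0 for possibly affected, 1 for not affected, 2 for undetermined) are also an arbitrary choice made for this example.

```shell
# Hypothetical helper (not part of the product): combines the three checks above.
# Usage: is_affected <pathToManagementClusterKubeconfig> <clusterProjectName> <machineName>
# Exit status: 0 = possibly affected, 1 = not affected, 2 = cannot be determined.
is_affected() {
  local kubeconfig="$1" ns="$2" machine="$3"
  local state ip

  # Step 1: read the STATE column (assumed to be column 2) of the BareMetalHost.
  state="$(kubectl --kubeconfig "$kubeconfig" -n "$ns" get bmh "$machine" --no-headers \
    | awk '{print $2}')"
  if [ "$state" != "provisioned" ]; then
    echo "BareMetalHost state is '$state', not 'provisioned'" >&2
    return 2
  fi

  # Step 2: obtain the internal IP of the machine.
  ip="$(kubectl --kubeconfig "$kubeconfig" -n "$ns" get machine "$machine" \
    -o jsonpath='{.status.providerStatus.privateIp}')"
  if [ -z "$ip" ]; then
    echo "No private IP reported for $machine" >&2
    return 2
  fi

  # Step 3: ping the machine on its internal IP.
  if ping -c 3 "$ip" >/dev/null 2>&1; then
    echo "$machine ($ip) responds to ping: not affected"
    return 1
  fi
  echo "$machine ($ip) is provisioned but does not respond to ping: possibly affected"
  return 0
}
```

A single failed ping can be transient, so rerun the check over several minutes before concluding that the machine is affected.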

Warning

The issue resolution below requires access to the Equinix Metal out-of-band console, Serial Over SSH (SOS), which allows debugging hardware provisioning failures. To access SOS, the affected machine must have at least one Equinix Metal Project SSH Key attached.

The default management cluster bootstrap procedure includes creation and attachment of this key to each cluster machine. If you skipped this step during cluster configuration, proceed to the following section to attach the project SSH key to the affected machine. Otherwise, skip this section.

The managed cluster deployment procedure does not include creation and attachment of the SSH key to cluster machines. Therefore, if a managed cluster is affected, proceed to the following section unless you have previously completed these steps.

Note

To apply the issue resolution, you can also use the workaround provided in Release Notes.

Attach the Equinix Metal project SSH key to the affected machine

  1. Obtain the ID of the affected machine:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <clusterProjectName> get machine <machineName> -o jsonpath='{.status.providerStatus.providerInstanceState.id}'
    
  2. Log in to the Equinix Metal console.

  3. Access the affected machine menu in the Equinix Metal console by navigating to https://console.equinix.com/devices/<affectedMachineID>.

  4. In the Overview tab, copy the affected server name in the kaas-node-<UID> format.

  5. Repeat the above steps for all affected machines.

  6. In the upper menu, navigate to Project Settings.

  7. In the Project SSH Keys tab, select Add New Key.

  8. Enter the unique Key Name and Public Key values.

    Caution

    Do not use already existing project keys.

  9. In the Associate key with these instances section, select the previously copied affected server names and click Add to associate the new key with all affected servers.

    Caution

    If you add SSH keys to the affected servers one by one, each server will require a new SSH key.
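
When several machines are affected, the command-line parts of this procedure can be scripted. The sketch below is illustrative only: device_urls and new_project_key are hypothetical helper names, the ed25519 key type and the empty passphrase are assumptions made for brevity (set a passphrase if your security policy requires one), and the console steps themselves still have to be done manually.

```shell
# Hypothetical helper: print "<machineName> <consoleURL>" for every machine given.
# Usage: device_urls <pathToManagementClusterKubeconfig> <clusterProjectName> <machineName>...
device_urls() {
  local kubeconfig="$1" ns="$2" machine id
  shift 2
  for machine in "$@"; do
    # Same query as in step 1 of the procedure above.
    id="$(kubectl --kubeconfig "$kubeconfig" -n "$ns" get machine "$machine" \
      -o jsonpath='{.status.providerStatus.providerInstanceState.id}')"
    echo "$machine https://console.equinix.com/devices/$id"
  done
}

# Hypothetical helper: create a new key pair and print the public key,
# which can then be pasted into the Public Key field of the console.
# Usage: new_project_key <pathToNewPrivateKey> <keyComment>
new_project_key() {
  local path="$1" comment="$2"
  # Never reuse or overwrite an existing key file.
  if [ -e "$path" ]; then
    echo "Refusing to overwrite existing file: $path" >&2
    return 1
  fi
  # Assumption: ed25519 key with an empty passphrase.
  ssh-keygen -q -t ed25519 -N "" -C "$comment" -f "$path" && cat "$path.pub"
}
```

Generating a dedicated key pair is one way to satisfy the caution against reusing existing project keys. Keep the private key file, because the SOS connection requires it later.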

Apply the issue resolution

  1. Obtain the ID of the affected machine:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <clusterProjectName> get machine <machineName> -o jsonpath='{.status.providerStatus.providerInstanceState.id}'
    
  2. Log in to the Equinix Metal console.

  3. Access the affected machine menu in the Equinix Metal console by navigating to https://console.equinix.com/devices/<affectedMachineID>.

  4. Click OUT-OF-BAND CONSOLE and copy the provided command.

  5. Access the out-of-band console of the affected machine using the command obtained in the previous step.

    Add the -i <pathToProjectPrivateSSHKey> flag to specify the path to the private SSH key associated with the affected instance. For example:

    ssh -i /home/ubuntu/.ssh/ssh-key-demo-name ec123d60-e11e-4b49-v8a0-105a4df41bf4@sos.fr2.platformequinix.com
    
  6. In the Equinix Metal console of the affected machine, navigate to SERVER ACTIONS > REBOOT to reboot the machine.

  7. In the out-of-band console, wait for the machine to start the reboot. Once the machine starts powering on:

    1. Press F2 or the proposed key combination to enter the BIOS Setup menu.

      Note

      • The key combination depends on the BIOS version. The exact combination is displayed when the machine powers on.

      • If the BIOS Setup menu option disappears before you press F2, reboot the machine and try again.

    2. Navigate to the Boot menu of the BIOS Setup.

    3. Change the boot order by moving the SSD hard drive to the top and Intel Boot Agent (network booting) to the bottom.

  8. In the Equinix Metal console of the affected machine, navigate to SERVER ACTIONS > REBOOT and reboot the machine again.
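
After the final reboot, the machine is expected to boot from the hard drive and become reachable on its internal IP. The following is a minimal sketch of a polling check, assuming that a ping reply is an acceptable sign of recovery. wait_for_ping is a hypothetical helper name, and the default of 30 attempts at 10-second intervals is an arbitrary choice, not a documented timeout.

```shell
# Hypothetical helper: poll the internal IP until the machine answers ping.
# Usage: wait_for_ping <internalIP> [attempts] [intervalSeconds]
wait_for_ping() {
  local ip="$1" attempts="${2:-30}" interval="${3:-10}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # One echo request per attempt; any reply counts as success.
    if ping -c 1 "$ip" >/dev/null 2>&1; then
      echo "$ip responded after $i attempt(s)"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "$ip did not respond after $attempts attempts" >&2
  return 1
}
```

For example, wait_for_ping 192.168.0.53 polls the example address used earlier in this section. If the machine still does not respond, recheck the boot order in the BIOS Setup menu through the out-of-band console.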