Machine provisioning issues during cluster deployment with private network
The issue can occur on a node with Intel NICs or on a node with incorrect BIOS settings, for example, an incorrect first boot option in the BIOS boot order.
A node may randomly get stuck in the Provisioning state during the Equinix Metal-based management or managed cluster deployment with a private network. In this case, the affected machine does not respond to ping on its internal IP, for example, 192.168.0.53.
The issue relates to particular hardware that has Intel Boot Agent (IBA) installed and configured as the first boot option on the server. An affected server continues booting from iPXE instead of booting from a hard drive even after successful provisioning. As a result, the machine becomes inaccessible and the cluster deployment gets stuck.
Verify whether the cluster is affected
Verify that the BareMetalHost object status of the affected machine is Provisioned:
kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <clusterProjectName> get bmh <machineName>
Example of system response:
NAME                  STATE         CONSUMER              BOOTMODE   ONLINE   ERROR   REGION
mgmt-controlplane-0   provisioned   mgmt-controlplane-0              true             region-one
Obtain the internal IP of the affected machine:
kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <clusterProjectName> get machine <machineName> -o=jsonpath={.status.providerStatus.privateIp}
Ping the affected machine.
If the BareMetalHost state of the affected machine is provisioned but the machine is non-pingable using the internal IP for a long time, the cluster is affected. Proceed to the issue resolution below.
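The three checks above can be combined into one helper. The following is a minimal sketch, not part of the product tooling: the function name `is_machine_affected` and its exit codes are hypothetical, the state is read from the second column of the `get bmh` output shown above, and the `ping` flags assume Linux iputils. A single failed ping is not conclusive, since the machine is considered affected only if it stays unreachable for a long time.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustrative only).
# Usage: is_machine_affected <kubeconfig> <namespace> <machineName>
# Exit status: 0 = looks affected, 1 = does not look affected,
#              2 = state or IP could not be determined.
is_machine_affected() {
  local kubeconfig="$1" namespace="$2" machine="$3"
  local state ip

  # STATE is the second column of the `get bmh` output.
  state="$(kubectl --kubeconfig "$kubeconfig" -n "$namespace" \
    get bmh "$machine" --no-headers | awk '{print $2}')"

  # Internal IP, using the same jsonpath as in the procedure.
  ip="$(kubectl --kubeconfig "$kubeconfig" -n "$namespace" \
    get machine "$machine" \
    -o=jsonpath='{.status.providerStatus.privateIp}')"

  if [ -z "$state" ] || [ -z "$ip" ]; then
    echo "$machine: cannot determine state or internal IP" >&2
    return 2
  fi

  if [ "$state" != "provisioned" ]; then
    echo "$machine: state is $state, not the symptom described here"
    return 1
  fi

  # Three probes with a 2-second timeout each (Linux iputils flags).
  if ping -c 3 -W 2 "$ip" >/dev/null 2>&1; then
    echo "$machine: provisioned and reachable at $ip"
    return 1
  fi

  echo "$machine: provisioned but $ip does not answer ping"
  return 0
}
```

For example, `is_machine_affected <pathToManagementClusterKubeconfig> <clusterProjectName> <machineName>` prints a one-line verdict. Rerun it after a few minutes before concluding that the machine is affected.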
Warning
The issue resolution below requires access to the Equinix Metal out-of-band console Serial Over SSH (SOS) that allows debugging hardware provisioning failures. To access SOS, the affected machine should have at least one Equinix Metal Project SSH Key attached.
The default management cluster bootstrap procedure includes creation and attachment of this key to each cluster machine. If you skipped this step during cluster configuration, proceed to the following section to attach the project SSH key to the affected machine. Otherwise, skip this section.
The managed cluster deployment procedure does not include creation and attachment of the SSH key to cluster machines. Therefore, if a managed cluster is affected, proceed to the following section unless you have previously completed these steps.
Note
To apply the issue resolution, you can also use the workaround provided in Release Notes.
Attach the Equinix Metal project SSH key to the affected machine
Obtain the ID of the affected machine:
kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <clusterProjectName> get machine <machineName> -o=jsonpath={.status.providerStatus.providerInstanceState.id}
Log in to the Equinix Metal console.
Access the affected machine menu in the Equinix Metal console by navigating to https://console.equinix.com/devices/<affectedMachineID>.
In the Overview tab, copy the affected server name in the kaas-node-<UID> format.
Repeat the above steps for all affected machines.
In the upper menu, navigate to Project Settings.
In the Project SSH Keys tab, select Add New Key.
Enter the unique Key Name and Public Key values.
Caution
Do not use already existing project keys.
In the Associate key with these instances section, select the previously copied affected server names and click Add to associate the new key with all affected servers.
Caution
If you add SSH keys to the affected servers one by one, each server will require a new SSH key.
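When several machines are affected, the ID lookup and the key preparation can be scripted before you open the console. The following is a minimal sketch under stated assumptions: both function names are hypothetical, the RSA 4096 key type is an assumption rather than a documented requirement, and the key is created without a passphrase only to keep the example non-interactive.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (illustrative only).

# Print "<machineName> <console URL>" for every machine name given.
# Usage: device_console_urls <kubeconfig> <namespace> <machineName>...
device_console_urls() {
  local kubeconfig="$1" namespace="$2" machine id
  shift 2
  for machine in "$@"; do
    id="$(kubectl --kubeconfig "$kubeconfig" -n "$namespace" \
      get machine "$machine" \
      -o=jsonpath='{.status.providerStatus.providerInstanceState.id}')"
    if [ -z "$id" ]; then
      echo "$machine: no instance ID found" >&2
      continue
    fi
    echo "$machine https://console.equinix.com/devices/$id"
  done
}

# Generate a new key pair and print the public key to paste into the
# Public Key field. Refuses to overwrite an existing file so that an
# existing key is never reused by accident.
# Usage: new_project_key <privateKeyPath> <keyName>
new_project_key() {
  local key_path="$1" key_name="$2"
  if [ -e "$key_path" ] || [ -e "$key_path.pub" ]; then
    echo "refusing to overwrite $key_path" >&2
    return 1
  fi
  # -N "" sets an empty passphrase; omit it to be prompted instead.
  ssh-keygen -q -t rsa -b 4096 -N "" -C "$key_name" -f "$key_path" || return 1
  cat "$key_path.pub"
}
```

Keep the private key file. Its path is the value to pass with the `-i` flag when you access the out-of-band console.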
Apply the issue resolution
Obtain the ID of the affected machine:
kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <clusterProjectName> get machine <machineName> -o=jsonpath={.status.providerStatus.providerInstanceState.id}
Log in to the Equinix Metal console.
Access the affected machine menu in the Equinix Metal console by navigating to https://console.equinix.com/devices/<affectedMachineID>.
Click OUT-OF-BAND CONSOLE and copy the provided command.
Access the out-of-band console of the affected machine using the command obtained in the previous step. Add the -i <pathToProjectPrivateSSHKey> flag to specify the path to the private SSH key associated with the affected instance. For example:
ssh -i /home/ubuntu/.ssh/ssh-key-demo-name ec123d60-e11e-4b49-v8a0-105a4df41bf4@sos.fr2.platformequinix.com
In the Equinix Metal console of the affected machine, navigate to SERVER ACTIONS > REBOOT to run the machine reboot.
In the out-of-band console, wait for the machine to start the reboot. Once the machine starts powering on:
Press F2 or the proposed key combination to enter the BIOS Setup menu.
Note
The key combination depends on the BIOS version. The exact combination is displayed when the machine powers on.
If the BIOS Setup menu option disappears before you press F2, reboot the machine and try again.
Navigate to the Boot BIOS menu.
Change the boot order by moving the hard drive where the system was installed to the top and Intel Boot Agent (network booting) to the bottom.
To obtain the drive where the system was installed, use the BareMetalHost object of the affected machine:
kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <clusterProjectName> get bmh <machineName> -o=jsonpath='{.spec.ansibleExtra.target_system.grub.to_devices_detailed[0].by_id}'
In the Equinix Metal console of the affected machine, navigate to SERVER ACTIONS > REBOOT and reboot the machine again.
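The two command-line lookups in this procedure can be prepared before the reboot, so that the installed disk and the console command are at hand while the BIOS Setup prompt is on screen. The following is a minimal sketch: the function names `installed_disk` and `sos_command` are hypothetical, and the jsonpath is quoted so that the shell does not treat `[0]` as a glob pattern.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (illustrative only).

# Print the by_id value of the disk the system was installed on.
# Usage: installed_disk <kubeconfig> <namespace> <machineName>
installed_disk() {
  kubectl --kubeconfig "$1" -n "$2" get bmh "$3" \
    -o=jsonpath='{.spec.ansibleExtra.target_system.grub.to_devices_detailed[0].by_id}'
}

# Add the -i flag to the command copied from OUT-OF-BAND CONSOLE.
# Usage: sos_command <privateKeyPath> ssh <target>
sos_command() {
  local key="$1"
  shift
  # The copied command is expected to be "ssh <target>".
  if [ "${1:-}" != "ssh" ] || [ "$#" -lt 2 ]; then
    echo "expected the copied command in the form: ssh <target>" >&2
    return 1
  fi
  shift
  echo "ssh -i $key $*"
}
```

For example, `sos_command <pathToProjectPrivateSSHKey> ssh <target>`, where `ssh <target>` is the command copied from the console, prints the command to run. `installed_disk` prints the identifier to look for in the BIOS boot order.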