Troubleshooting

This section describes issues you may encounter while working with day-2 operations and approaches to addressing them.

Troubleshoot the HostOSConfigurationModules object

In .status.modules, verify that all modules have been loaded and verified successfully. Each module must have the available value in the state field. Otherwise, the error field contains the reason for the issue.

Example of a hocm object status with different erroneous states:

status:
  modules:
  # error state: hashes mismatched
  - error: 'hashes are not the same: got ''d78352e51792bbe64e573b841d12f54af089923c73bc185bac2dc5d0e6be84cd''
      want ''c726ab9dfbfae1d1ed651bdedd0f8b99af589e35cb6c07167ce0ac6c970129ac'''
    name: sysctl
    sha256sum: d78352e51792bbe64e573b841d12f54af089923c73bc185bac2dc5d0e6be84cd
    state: error
    url: <url-to-package>
    version: 1.0.0
  # error state: an archive is not available because of misconfigured proxy
  - error: 'failed to perform request to fetch the module archive: Get "<url-to-package>": Forbidden'
    name: custom-module
    state: error
    url: <url-to-package>
    version: 0.0.1
  # successfully loaded and verified module
  - description: Module for package installation
    docURL: https://docs.mirantis.com
    name: package
    playbookName: main.yaml
    sha256sum: 2c7c91206ce7a81a90e0068cd4ce7ca05eab36c4da1893555824b5ab82c7cc0e
    state: available
    url: <url-to-package>
    valuesValidationSchema: <gzip+base64 encoded data>
    version: 1.0.0
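The check described above can be automated. The following Python sketch is illustrative only: it assumes you have already exported the hocm object as JSON (for example, with kubectl get and the -o json output option), parsed it with the json module, and passed its .status section to the hypothetical unavailable_modules() function. The function lists every module whose state is not available, together with its error message.

```python
from typing import Any


def unavailable_modules(status: dict[str, Any]) -> list[tuple[str, str, str]]:
    """Return (name, version, error) for every module that is not available."""
    problems = []
    for module in status.get("modules", []):
        # Only modules in the "available" state are loaded and verified.
        if module.get("state") != "available":
            problems.append(
                (
                    module.get("name", "<unnamed>"),
                    module.get("version", "<unknown>"),
                    module.get("error", "<no error reported>"),
                )
            )
    return problems


if __name__ == "__main__":
    # Abbreviated version of the status example above.
    example_status = {
        "modules": [
            {"name": "sysctl", "version": "1.0.0", "state": "error",
             "error": "hashes are not the same: ..."},
            {"name": "package", "version": "1.0.0", "state": "available"},
        ]
    }
    for name, version, error in unavailable_modules(example_status):
        print(f"{name} {version}: {error}")
```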

If a module is in the error state, it might affect the corresponding hoc object that contains the module configuration.

Example of an erroneous status in a hoc object:

status:
  configs:
  - moduleName: sysctl
    moduleVersion: 1.0.0
    modulesReference: mcc-modules
    error: module is not found or not verified in any HostOSConfigurationModules object
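To see which hoc configurations are affected by module errors, you can cross-check the hoc and hocm statuses. This Python sketch is an assumption-based illustration, not the controller's logic: the hypothetical unresolved_configs() function treats a configuration as resolvable only if some hocm object lists a module with the same name and version in the available state, which mirrors the wording of the error above. The real controller may also take modulesReference into account.

```python
from typing import Any


def unresolved_configs(
    hoc_configs: list[dict[str, Any]],
    hocm_statuses: list[dict[str, Any]],
) -> list[str]:
    """Return "<moduleName> <moduleVersion>" for configs without an available module.

    hoc_configs is the configs list of a hoc object; hocm_statuses holds
    the .status sections of all HostOSConfigurationModules objects.
    """
    # Collect (name, version) pairs of modules that loaded and verified successfully.
    available = {
        (module.get("name"), module.get("version"))
        for status in hocm_statuses
        for module in status.get("modules", [])
        if module.get("state") == "available"
    }
    return [
        f"{config.get('moduleName')} {config.get('moduleVersion')}"
        for config in hoc_configs
        if (config.get("moduleName"), config.get("moduleVersion")) not in available
    ]
```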

To resolve an issue described in the error field:

  1. Address the root cause. For example, ensure that the package has the correct hash sum or adjust the proxy configuration so that the package can be fetched.

  2. Recreate the hocm object with the correct settings.
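For the hash mismatch case, you can check locally whether a package archive matches the expected SHA-256 sum before recreating the hocm object. A minimal Python sketch using the standard hashlib module; the function names are hypothetical, and the archive path and expected hash are values you supply.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks (1 MiB by default)."""
    digest = hashlib.sha256()
    with path.open("rb") as archive:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: archive.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_matches(path: Path, expected: str) -> bool:
    """Compare the file digest with an expected hex digest, ignoring case and whitespace."""
    return sha256_of_file(path) == expected.strip().lower()
```

For example, hash_matches(Path("<path-to-archive>"), "<expected-sha256>") returns False when the archive would trigger the "hashes are not the same" error. The sha256sum command-line utility produces the same hex digest.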

Setting syncPeriod for debug sessions

During test or debug sessions where errors are inevitable, you can set a reasonable sync period for host-os-modules-controller so that you do not have to recreate hocm objects manually.

To enable the option, set the syncPeriod parameter in the spec:providerSpec:value:kaas:regional:helmReleases: section of the management Cluster object:

spec:
  providerSpec:
    value:
      kaas:
        regional:
        - provider: baremetal
          helmReleases:
          - name: host-os-modules-controller
            values:
              syncPeriod: 2m

Normally, syncPeriod is not required in the cluster settings. Therefore, you can remove this option after completing a debug session.
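If you prefer to edit a local copy of the management Cluster object programmatically, the following Python sketch shows where syncPeriod sits in the nested structure. It assumes the object has already been loaded into a dict (for example, with a YAML parser) and that you apply the modified object to the cluster yourself; the helper names are hypothetical.

```python
from typing import Any


def helm_release_values(cluster: dict[str, Any], provider: str, release: str) -> dict[str, Any]:
    """Return the values dict of a Helm release under spec.providerSpec.value.kaas.regional.

    Missing levels of the structure are created, so the dict is modified in place.
    """
    kaas = (
        cluster.setdefault("spec", {})
        .setdefault("providerSpec", {})
        .setdefault("value", {})
        .setdefault("kaas", {})
    )
    regional = kaas.setdefault("regional", [])
    region = next((r for r in regional if r.get("provider") == provider), None)
    if region is None:
        region = {"provider": provider, "helmReleases": []}
        regional.append(region)
    releases = region.setdefault("helmReleases", [])
    entry = next((h for h in releases if h.get("name") == release), None)
    if entry is None:
        entry = {"name": release, "values": {}}
        releases.append(entry)
    return entry.setdefault("values", {})


def set_sync_period(cluster: dict[str, Any], period: str = "2m") -> None:
    """Enable periodic sync of host-os-modules-controller for a debug session."""
    helm_release_values(cluster, "baremetal", "host-os-modules-controller")["syncPeriod"] = period


def clear_sync_period(cluster: dict[str, Any]) -> None:
    """Remove syncPeriod again once the debug session is over."""
    helm_release_values(cluster, "baremetal", "host-os-modules-controller").pop("syncPeriod", None)
```

Call set_sync_period() before the debug session and clear_sync_period() afterwards. Other Helm releases and values already present in the regional section are left untouched.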

Troubleshoot the HostOSConfiguration object

After creating a hoc object with various configurations, verify its HostOSConfiguration status:

  • Verify that the .status.isValid field has the true value.

  • Verify that the .status.configs[*].error fields are absent.

  • Verify that all .status.machinesStates.<machineName>.configStateItemsStatuses have no Failed status.
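The three status checks above can be expressed as a single function. This Python sketch assumes the .status section of the hoc object is available as a parsed dict. Because the exact shape of configStateItemsStatuses is not described here, the hypothetical hoc_status_problems() function searches it recursively for the Failed value instead of relying on a specific layout.

```python
from typing import Any


def _contains_failed(value: Any) -> bool:
    """Recursively look for the string "Failed" anywhere in a nested value."""
    if isinstance(value, dict):
        return any(_contains_failed(v) for v in value.values())
    if isinstance(value, list):
        return any(_contains_failed(v) for v in value)
    return value == "Failed"


def hoc_status_problems(status: dict[str, Any]) -> list[str]:
    """Apply the three HostOSConfiguration status checks and describe each violation."""
    problems = []
    # Check 1: .status.isValid must be true.
    if status.get("isValid") is not True:
        problems.append("status.isValid is not true")
    # Check 2: no .status.configs[*].error field may be present.
    for index, config in enumerate(status.get("configs", [])):
        if "error" in config:
            problems.append(f"configs[{index}] ({config.get('moduleName')}): {config['error']}")
    # Check 3: no machine may report a Failed state item.
    for machine, state in status.get("machinesStates", {}).items():
        if _contains_failed(state.get("configStateItemsStatuses")):
            problems.append(f"machine {machine}: a state item has the Failed status")
    return problems
```

An empty list means that all three checks pass.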

Also, verify the LCM-related objects:

  • Verify that the corresponding LCMCluster object has all related StateItems.

  • Verify that all selected LCMMachines have the .spec.stateItemsOverwrites field, in which all StateItems from the previous step are present.

  • Verify that all StateItems from the previous step have been successfully processed by lcm-agent. Otherwise, manual intervention is required.
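The second LCM check is essentially a set comparison. In this Python sketch, the hypothetical missing_state_items() function takes plain StateItem names rather than full objects, because extracting the names depends on the LCMCluster and LCMMachine schemas, which this section does not describe. It reports, per machine, the StateItems that are missing from .spec.stateItemsOverwrites.

```python
def missing_state_items(
    cluster_items: set[str],
    machine_overwrites: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Map each machine to the LCMCluster StateItems absent from its stateItemsOverwrites."""
    result = {}
    for machine, present in machine_overwrites.items():
        missing = cluster_items - present
        if missing:
            result[machine] = missing
    return result
```

Machines that are not in the returned dict have all StateItems present. Whether lcm-agent processed them successfully still has to be checked separately.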

To address an issue with a specific StateItem for which the lcm-agent is reporting an error, log in to the corresponding node and inspect Ansible execution logs:

ssh -i <path-to-ssh-key> mcc-user@<ip-addr-of-the-node>
sudo -i
cd /var/log/lcm/runners/
# Of the two directories here, select the one
# whose subdirectories have the 'host-os-' prefix
cd <selected-dir>/<name-of-the-erroneous-state-item>
less <logs-file>
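Selecting the right runner directory can also be scripted. This Python sketch mirrors the comments in the commands above: the hypothetical find_host_os_runner_dir() function returns the first directory under /var/log/lcm/runners that contains subdirectories with the host-os- prefix, and list_state_item_dirs() lists those subdirectories. Run it on the node as root, as in the commands above.

```python
from pathlib import Path
from typing import Optional

RUNNERS_DIR = Path("/var/log/lcm/runners")


def find_host_os_runner_dir(base: Path = RUNNERS_DIR) -> Optional[Path]:
    """Return the runner directory whose subdirectories carry the host-os- prefix."""
    for candidate in sorted(p for p in base.iterdir() if p.is_dir()):
        if any(c.is_dir() and c.name.startswith("host-os-") for c in candidate.iterdir()):
            return candidate
    return None


def list_state_item_dirs(runner_dir: Path) -> list[str]:
    """List the host-os- state item directories that hold the Ansible execution logs."""
    return sorted(
        c.name for c in runner_dir.iterdir() if c.is_dir() and c.name.startswith("host-os-")
    )
```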

After the inspection, either resolve the issue manually or escalate the issue to Mirantis support.

Enable log debugging

The day-2 operations API allows enabling debug-level logging, which is integrated into the baremetal-provider controller and host-os-modules-controller. Debug logs of both controllers may be helpful during debug sessions.

To enable log debugging in host-os-modules-controller, add the following snippet to the Cluster object:

providerSpec:
# ...
  value:
  # ...
    kaas:
      regional:
      - helmReleases:
        - name: host-os-modules-controller
          values:
            logLevel: 2

To enable log debugging in baremetal-provider, add the following snippet to the Cluster object:

providerSpec:
# ...
  value:
  # ...
    kaas:
      regional:
      - helmReleases:
        - name: baremetal-provider
          values:
            cluster_api_provider_baremetal:
              log:
                verbosity: 3

To obtain the logs related to day-2 operations in baremetal-provider, filter them by the .host-os. key:

kubectl logs -n kaas <baremetal-provider-pod> | grep ".host-os."
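If you save the logs to a file for offline analysis, the same filter can be applied in Python. Note that in the grep pattern, . matches any single character, so the pattern matches host-os surrounded by any two characters rather than only by literal dots. The sketch below keeps that regular expression behavior; the function name is hypothetical.

```python
import re
from typing import Iterable, Iterator

# Same pattern as in the grep command above: "." matches any single character.
HOST_OS_PATTERN = re.compile(r".host-os.")


def day2_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the log lines that match the day-2 operations pattern."""
    for line in lines:
        if HOST_OS_PATTERN.search(line):
            yield line
```

For example, redirect the kubectl logs output to a file, open it, and pass the file object to day2_lines() to iterate over the matching lines.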