Troubleshoot DriveTrain

This section describes how to troubleshoot known DriveTrain issues.

Glusterd service restart failure

[32334] The Glusterd service does not restart automatically after its child processes fail or are unexpectedly killed.

To troubleshoot the issue:

  1. Log in to a KVM node.

  2. In the /lib/systemd/system/glusterd.service file, set the Restart option in the [Service] section:

    [Service]
    ...
    Restart=on-abort
    ...
    

    The recommended values include:

    • on-abort

      The service restarts only if the service process exits due to an uncaught signal not specified as a clean exit status.

    • on-failure

      The service restarts when the process exits with a non-zero exit code, is terminated by a signal (including on core dump, but excluding the SIGHUP, SIGINT, SIGTERM, and SIGPIPE signals, which count as a clean exit), when an operation such as a service reload times out, or when the configured watchdog timeout is triggered.

  3. Apply the changes:

    systemctl daemon-reload
    

    Note

    Re-apply the provided workaround if any of the GlusterFS packages has been re-installed or upgraded.

  4. Perform the steps above on the remaining KVM nodes.
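
To confirm the edit on each node, you can read the configured value back from the unit file. The following is a minimal sketch and not part of the documented workaround: get_restart_policy is a hypothetical helper name, and its default path is the unit file named in step 2.

```shell
# Hypothetical helper: print the Restart= value found in the [Service]
# section of a systemd unit file (the last assignment found is printed).
get_restart_policy() {
    unit_file="${1:-/lib/systemd/system/glusterd.service}"
    awk -F= '
        /^\[/ { in_service = ($0 == "[Service]") }
        in_service && $1 == "Restart" { value = $2 }
        END { if (value != "") print value }
    ' "$unit_file"
}
```

Running get_restart_policy without arguments on a KVM node should print the value set in step 2. After systemctl daemon-reload, the systemctl show glusterd -p Restart command reports the value that systemd has loaded.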

Failure to connect to Jenkins master

[34848] Jenkins slaves may be unable to connect to the Jenkins master during an update from an MCP version prior to 2019.2.4 to the MCP maintenance update 2019.2.8 or newer. The issue may be related to the Cross Site Request Forgery (CSRF) protection configuration.

To verify whether your deployment is affected:

  1. From the Salt Master node, obtain the Jenkins credentials:

    salt -C 'I@jenkins:client and not I@salt:master' config.get jenkins
    
  2. Run the following command:

    curl -u <jenkins_admin_user>:<jenkins_admin_password> '<jenkins_url>/crumbIssuer/api/xml?xpath=concat(//crumbRequestField,":",//crumb)'
    

    If the system response is 404 Not Found, proceed with the issue resolution below.
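
The check in step 2 can also be scripted by looking only at the HTTP status code. The sketch below is an illustration under assumptions: crumb_status and needs_csrf_fix are hypothetical helper names, and the three arguments stand for the <jenkins_url>, <jenkins_admin_user>, and <jenkins_admin_password> placeholders used above.

```shell
# Hypothetical helper: print only the HTTP status code returned by the
# crumb issuer endpoint. Arguments: Jenkins URL, admin user, admin password.
crumb_status() {
    curl -s -o /dev/null -w '%{http_code}' -u "$2:$3" \
        "$1/crumbIssuer/api/xml"
}

# Hypothetical helper: succeed only when the status code is 404, which is
# the affected case described above.
needs_csrf_fix() {
    [ "$1" = "404" ]
}
```

For example, needs_csrf_fix "$(crumb_status <jenkins_url> <jenkins_admin_user> <jenkins_admin_password>)" returns success on an affected deployment. Any other status code, such as an authentication failure, needs to be investigated separately.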

To apply the issue resolution:

  1. Log in to the Jenkins web UI.
  2. Navigate to Manage Jenkins > Configure Global Security.
  3. Under CSRF Protection, select Prevent Cross Site Request Forgery exploits and Default Crumb Issuer.
  4. Click Save.

Once done, Jenkins slaves automatically reconnect in a few seconds.

NGINX gateway timeout during the DriveTrain upgrade

[34798] The Deploy - upgrade MCP DriveTrain Jenkins pipeline job may fail with the Error with request: HTTP Error 504: Gateway Time-out error message. The issue may occur in large environments when applying Salt states on all nodes because the NGINX timeout configured on the Salt Master node is too short.

To apply the issue resolution, select one of the following options:

  • In the Deploy - upgrade MCP DriveTrain Jenkins pipeline job, set the SALT_MASTER_URL parameter to the Salt API endpoint http://<cfg_node_ip>:6969.

  • Increase the NGINX timeout:

    1. Open your Git project repository with the Reclass model on the cluster level.

    2. In classes/cluster/<cluster_name>/infra/config/init.yml, increase the NGINX timeout for the Salt API site. By default, the timeout is set to 600 seconds.

      parameters:
        nginx:
          server:
            site:
              nginx_proxy_salt_api:
                proxy:
                  timeout: <timeout>
      
    3. Refresh Salt pillars and apply the nginx state on the Salt Master node:

      salt -C 'I@salt:master' saltutil.refresh_pillar
      salt -C 'I@salt:master' state.apply nginx
      
    4. Commit the changes to your local repository.
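
After the pillar refresh in step 3 of the second option, you can read the value back to confirm that the Salt Master node sees the new timeout. This is a sketch only: show_salt_api_timeout is a hypothetical helper name, and the pillar key is assumed to mirror the nesting of the YAML snippet above.

```shell
# Hypothetical helper: query the Salt API site timeout from the pillar of
# the Salt Master node. The colon-separated key is an assumption derived
# from the YAML nesting shown in the procedure.
show_salt_api_timeout() {
    salt -C 'I@salt:master' pillar.get \
        nginx:server:site:nginx_proxy_salt_api:proxy:timeout
}
```

If the command still returns the previous value, the model change has most likely not reached the Salt Master node yet.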

SaltReqTimeoutError during the DriveTrain upgrade

[34114] The Deploy - upgrade MCP DriveTrain Jenkins pipeline job may fail with the SaltReqTimeoutError in master zmq thread error message when executing the salt.minion state on several minions at the same time. Adjust the Salt Master configuration to improve its performance.

To adjust the Salt Master configuration:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In classes/cluster/<cluster_name>/infra/config/init.yml, increase the values for the following parameters as required:

    gather_job_timeout

    The number of seconds to wait when the client requests information about the running jobs.

    worker_threads

    The number of threads to start to receive commands and replies from minions.

    sock_pool_size

    The pool size of Unix sockets. Salt applications support a socket pool to avoid blocking while writing data to a socket. For example, a job or state that targets a large number of hosts can otherwise block for a long time.

    zmq_backlog

    The number of messages in the ZeroMQ backlog queue.

    For example:

    parameters:
      salt:
        master:
          worker_threads: 40
          opts:
            gather_job_timeout: 100
            sock_pool_size: 15
            zmq_backlog: 3000
    
  3. Log in to the Salt Master node.

  4. Refresh Salt pillars and apply the salt.master state:

    salt-call saltutil.refresh_pillar
    salt-call state.apply salt.master
    
  5. Commit the changes to your local repository.
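
To check that the refreshed pillar carries the new values, you can query each key on the Salt Master node. The sketch below assumes that the pillar keys follow the nesting of the example above; show_master_tuning is a hypothetical helper name.

```shell
# Hypothetical helper: print the pillar value of each tuning parameter
# from the example. The key paths are assumptions derived from the YAML
# nesting (worker_threads under salt:master, the rest under opts).
show_master_tuning() {
    for key in worker_threads opts:gather_job_timeout \
               opts:sock_pool_size opts:zmq_backlog; do
        salt-call pillar.get "salt:master:$key"
    done
}
```

Run show_master_tuning after step 4 and compare the output with the values committed to the Reclass model.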