Troubleshoot DriveTrain

This section describes how to troubleshoot DriveTrain issues.

Glusterd service restart failure
Failure to connect to Jenkins master
NGINX gateway timeout during the DriveTrain upgrade
SaltReqTimeoutError during the DriveTrain upgrade

Glusterd service restart failure

[32334] The Glusterd service does not restart automatically after its child processes fail or are unexpectedly killed.

To troubleshoot the issue:

Log in to a KVM node.
In the /lib/systemd/system/glusterd.service file, set the Restart option in the [Service] section:
```
[Service]
...
Restart=on-abort
...
```
The recommended values include:
- on-abort
  The service restarts only if the service process exits due to an uncaught signal not specified as a clean exit status.
- on-failure
  The service restarts when the process exits with a non-zero exit code, is terminated by a signal (including on core dump, but excluding SIGHUP, SIGINT, SIGTERM, and SIGPIPE), when an operation such as a service reload times out, or when the configured watchdog timeout is triggered.
Apply the changes:
```
systemctl daemon-reload
```
Note

Re-apply the provided workaround if any of the GlusterFS packages has been re-installed or upgraded.
Perform the steps above on the remaining KVM nodes.
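
The unit file edit above can be scripted so that it is easy to repeat on every KVM node and after a GlusterFS package upgrade. The following is a minimal sketch, not part of the documented procedure: `set_restart_policy` is a hypothetical helper name, and the sketch assumes that the unit file already contains a `[Service]` section.

```shell
# Sketch only: set_restart_policy is a hypothetical helper, not a systemd or GlusterFS tool.
# Usage: set_restart_policy <unit_file> <policy>
# Example (as root): set_restart_policy /lib/systemd/system/glusterd.service on-failure \
#                        && systemctl daemon-reload
set_restart_policy() {
    unit_file=$1
    policy=$2
    tmp_file=$(mktemp) || return 1
    if awk -v policy="$policy" '
        /^Restart=/ { next }                        # drop any existing Restart= line
        { print }                                   # keep every other line unchanged
        /^\[Service\]/ { print "Restart=" policy }  # add the option under [Service]
    ' "$unit_file" > "$tmp_file"; then
        # Overwrite the contents in place so that file ownership and mode are kept.
        cat "$tmp_file" > "$unit_file"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp_file"
    return "$status"
}
```

After `systemctl daemon-reload`, running `systemctl show -p Restart glusterd` should report the configured value.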

Failure to connect to Jenkins master

[34848] Jenkins slaves may be unable to connect to the Jenkins master when you update from an MCP version earlier than 2019.2.4 to the MCP maintenance update 2019.2.8 or newer. The issue may be related to the Cross Site Request Forgery (CSRF) protection configuration.

To verify whether your deployment is affected:

From the Salt Master node, obtain the Jenkins credentials:

salt -C 'I@jenkins:client and not I@salt:master' config.get jenkins

Run the following command:

curl -u <jenkins_admin_user>:<jenkins_admin_password> '<jenkins_url>/crumbIssuer/api/xml?xpath=concat(//crumbRequestField,":",//crumb)'

If the system response is 404 Not Found, proceed with the issue resolution below.
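
The check above can also be reduced to the HTTP status code alone. The following is a minimal sketch under stated assumptions: `crumb_issuer_status` and `is_affected` are hypothetical helper names, `<jenkins_url>` is assumed to have no trailing slash, and the credentials are the ones obtained from the Salt Master node in the previous step.

```shell
# Sketch only: both function names are hypothetical helpers.

# Print the HTTP status code returned by the Jenkins crumb issuer endpoint.
# Arguments: <jenkins_url> <jenkins_admin_user> <jenkins_admin_password>
crumb_issuer_status() {
    curl -s -o /dev/null -w '%{http_code}' -u "$2:$3" "$1/crumbIssuer/api/xml"
}

# Succeed (exit code 0) when the status code matches the affected configuration.
is_affected() {
    [ "$1" = "404" ]
}

# Example:
#   status=$(crumb_issuer_status "<jenkins_url>" "<jenkins_admin_user>" "<jenkins_admin_password>")
#   if is_affected "$status"; then echo "CSRF protection needs to be configured"; fi
```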

To apply the issue resolution:

Log in to the Jenkins web UI.
Navigate to Manage Jenkins > Configure Global Security.
Under CSRF Protection, select Prevent Cross Site Request Forgery exploits and Default Crumb Issuer.
Click Save.

Once done, Jenkins slaves automatically reconnect in a few seconds.

NGINX gateway timeout during the DriveTrain upgrade

[34798] The Deploy - upgrade MCP DriveTrain Jenkins pipeline job may fail with the "Error with request: HTTP Error 504: Gateway Time-out" error message. The issue may occur in large environments when Salt states are applied on all nodes, because the NGINX timeout configured on the Salt Master node is too small.

To apply the issue resolution:

Select one of the following options:

In the Deploy - upgrade MCP DriveTrain Jenkins pipeline job, set the SALT_MASTER_URL parameter to the Salt API endpoint http://<cfg_node_ip>:6969.
Increase the NGINX timeout:
1. Open your Git project repository with the Reclass model on the cluster level.
2. In classes/cluster/<cluster_name>/infra/config/init.yml, increase the NGINX timeout for the Salt API site. By default, the timeout is set to 600 seconds.
```
parameters:
  nginx:
    server:
      site:
        nginx_proxy_salt_api:
          proxy:
            timeout: <timeout>
```
3. Refresh Salt pillars and apply the nginx state on the Salt Master node:
```
salt -C 'I@salt:master' saltutil.refresh_pillar
salt -C 'I@salt:master' state.apply nginx
```
4. Commit the changes to your local repository.
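
Before committing, it can help to confirm the value that was actually written to the model. The following is a naive sketch, not part of the documented procedure: `proxy_timeout` is a hypothetical helper that prints the first `timeout:` value found after the `nginx_proxy_salt_api:` key, and it assumes the layout shown in the snippet above, with no other `timeout:` key placed between the two.

```shell
# Sketch only: proxy_timeout is a hypothetical helper and does no real YAML parsing.
# Usage: proxy_timeout classes/cluster/<cluster_name>/infra/config/init.yml
proxy_timeout() {
    awk '
        /nginx_proxy_salt_api:/ { in_site = 1; next }        # entered the Salt API site block
        in_site && /^[[:space:]]*timeout:/ { print $2; exit } # first timeout value after it
    ' "$1"
}
```

Once the nginx state is applied, `salt -C 'I@salt:master' pillar.get nginx:server:site:nginx_proxy_salt_api:proxy:timeout` should return the same value.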

SaltReqTimeoutError during the DriveTrain upgrade

[34114] The Deploy - upgrade MCP DriveTrain Jenkins pipeline job may fail with the "SaltReqTimeoutError in master zmq thread" error message when the salt.minion state is executed on several minions at the same time. To work around the issue, adjust the Salt Master configuration to improve its performance.

To adjust the Salt Master configuration:

Open your Git project repository with the Reclass model on the cluster level.
In cluster/<cluster_name>/infra/config/init.yml, increase the values for the following parameters as required:
gather_job_timeout
The number of seconds to wait when the client requests information about the running jobs.

worker_threads
The number of threads to start to receive commands and replies from minions.

sock_pool_size
The pool size of Unix sockets. Salt applications support a socket pool to avoid blocking while data is written to a socket. For example, a job or state with a large target host list can cause a long blocking wait.

zmq_backlog
The number of messages in the ZeroMQ backlog queue.

For example:
parameters:
  salt:
    master:
      worker_threads: 40
      opts:
        gather_job_timeout: 100
        sock_pool_size: 15
        zmq_backlog: 3000
Log in to the Salt Master node.

Refresh Salt pillars and apply the salt.master state:

salt-call saltutil.refresh_pillar
salt-call state.apply salt.master

Commit the changes to your local repository.
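
After the salt.master state is applied, the rendered Salt Master configuration can be inspected to confirm that the new values took effect. The following is a minimal sketch: `show_master_tuning` is a hypothetical helper, and the file locations in the usage comment are an assumption based on the default Salt layout, so adjust them if your deployment writes the configuration elsewhere.

```shell
# Sketch only: show_master_tuning is a hypothetical helper.
# It prints the top-level tuning options discussed above from the given files.
# Example (paths are an assumption):
#   show_master_tuning /etc/salt/master /etc/salt/master.d/*.conf
show_master_tuning() {
    grep -hE '^(gather_job_timeout|worker_threads|sock_pool_size|zmq_backlog):' "$@"
}
```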

updated: 2025-01-10 08:56
