DriveTrain


15644

The network driver may fail to allocate kernel memory. You may also detect the following symptoms of the issue:

  • Traces in kern.log related to the BNX driver
  • Ceph OSD flapping in the Ceph cluster during a rebalance

To prevent the issue, calculate the minimum amount of memory that the kernel must keep reserved and set it through the vm.min_free_kbytes sysctl parameter for each node type, depending on your cluster model.
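
A minimal, purely illustrative sketch of how a candidate value could be computed on a node, assuming for the sake of the example that the reservation chosen for this node type is 2% of total RAM; the actual percentage and formula must come from your cluster model:

    # Illustrative only: print 2% of the total memory in kB as a candidate
    # vm.min_free_kbytes value (2% is an assumed figure, not a recommendation)
    awk '/^MemTotal/ {printf "%d\n", $2 * 0.02}' /proc/meminfo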

Caution

For performance reasons, verify that the value set for vm.min_free_kbytes does not exceed 5% of the total node memory.

Warning

Perform the steps below before deploying an OpenStack environment. For existing environments, first perform the procedure on a staging environment. If no staging environment exists, adapt the exact cluster model and launch it inside the cloud as a Heat stack, which will act as a staging environment.

Workaround:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. Specify the following pillar, which configures vm.min_free_kbytes in /etc/sysctl.conf on the target nodes:

    linux:
      system:
        kernel:
          sysctl:
            vm.min_free_kbytes: <calculated_value>
    
  3. Choose from the following options:

    • If you are making changes before the deployment, proceed with further configuration as required.

    • If you are making changes to an existing environment, apply the changes:

      1. Log in to the Salt Master node.

      2. Apply the following state:

        salt '*' state.apply linux.system.kernel
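
After applying the state, you can optionally verify the effective value on the nodes (an extra check, not part of the original procedure):

    salt '*' cmd.run 'sysctl vm.min_free_kbytes'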
        

21033

The Salt Master CA does not provide the Certificate Revocation List (CRL) and index files to identify the revoked or expired certificates.

Workaround:

To list all currently issued certificates, follow step 3 of the Replace the Salt Master CA certificates procedure.


24868

During the upgrade of an MCP cluster, after the installation of the salt-master, salt-common, salt-api, and salt-minion packages, the Deploy - update cloud pipeline may hang with the Connection refused error message while trying to connect to salt-api.

Workaround:

  1. Log in to the Salt Master node.

  2. Restart the salt-api service:

    systemctl restart salt-api.service
    
  3. Rerun the Deploy - update cloud pipeline.
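
Optionally, before rerunning the pipeline, you can confirm that the salt-api service is active again (a minimal check, not part of the original workaround):

    systemctl status salt-api.service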


25172

When changing any network settings (routes, up_cmds commands, MTU), the linux.network formula restarts the target interface and all related interfaces. For example, when changes affect a bridge interface, all of its member interfaces are restarted, which leads to VM failures. Therefore, Mirantis recommends configuring all required bridge interfaces on KVM nodes before a cluster deployment.

The workaround is to apply all required settings manually without restarting the bridge. If a bridge restart on a KVM node is unavoidable, perform the following steps (an illustrative command sequence follows the list):

  1. Plan a maintenance window for your MCP cluster.
  2. Stop all VMs of a node that requires a bridge restart.
  3. Apply the required settings changes.
  4. Restart the bridge interface.
  5. Start all VMs.
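
A rough sketch of such a maintenance sequence on the affected KVM node; the <vm_name> and <bridge_name> values are placeholders, and the exact commands depend on how networking is managed on your nodes:

    # Stop the VMs running on this KVM node (repeat for each domain)
    virsh shutdown <vm_name>
    # Apply the required settings changes, then restart the bridge
    ifdown <bridge_name> && ifup <bridge_name>
    # Start the VMs again
    virsh start <vm_name>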

26113

Fixed in 2019.2.3

Occasionally, the deployment of OpenContrail v4.x with OpenStack Pike may fail due to the duplication of the salt-minion services.

Workaround:

  1. Log in to the Salt Master node.

  2. Run the following command to terminate the duplicated salt-minion processes:

    salt -t 10 "rgw*" cmd.run 'pkill -9 salt-minion'
    

The service restarts automatically in a few minutes.
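
Optionally, once the service has restarted, you can confirm that the affected minions respond again (an extra check, not part of the original procedure):

    salt -t 10 "rgw*" test.ping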


26330

The CVP - Sanity checks Jenkins pipeline may fail if the TEST_REPO parameter is not empty.

Workaround:

Leave the TEST_REPO parameter empty. This option is deprecated starting with the MCP Build ID 2019.2.0.


26417

When commissioning nodes with Intel X520-2 10 GbE network interface cards (NICs), such cards may not be discovered.

Workaround:

Do not use Intel X520-2 10 GbE NICs with firmware version 0x30030001.


27010

When upgrading from the MCP Build ID 2018.11.0 to 2019.2.0, the Deploy - upgrade MCP DriveTrain Jenkins pipeline job fails because the mirror jobs are not triggered for the newest version.

Workaround:

  1. Log in to the Jenkins web UI.
  2. Run the git-mirror-downstream-mk-pipelines and git-mirror-downstream-pipeline-library Jenkins pipeline jobs with BRANCHES set to release/2019.2.0.
  3. Rerun the Deploy - upgrade MCP DriveTrain Jenkins pipeline job with UPDATE_PIPELINES set to false.

27135

Fixed in 2019.2.3

Creating instant backups using Backupninja, Xtrabackup, ZooKeeper, or Cassandra may fail due to an issue with permissions.

Workaround:

  1. Log in to the Salt Master node.

  2. Obtain the SSH RSA key specified in /root/.ssh/id_rsa.pub (see the example command after this procedure).

  3. On the system level of the Reclass model, add the obtained SSH RSA key to system/<service_name>/server.yml for Backupninja or Xtrabackup, or to system/<service_name>/backup/server.yml for Cassandra or ZooKeeper. For example, for Backupninja, add the following pillar to system/backupninja/server.yml:

    parameters:
      backupninja:
        server:
          key:
            backupninja_pub_key:
              enabled: true
              key: <key_from_/root/.ssh/id_rsa.pub>
    
  4. Apply the corresponding service state. For example, for Backupninja apply the following state on the nodes with the Backupninja pillar defined:

    salt -C 'I@backupninja:client or I@backupninja:server' state.sls backupninja
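
In step 2 above, you can print the key, for example, as follows:

    cat /root/.ssh/id_rsa.pub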
    

Warning

Because the steps above involve manual changes to the system level of the Reclass model, these changes will be overwritten during a system upgrade, and you may need to apply them again.


27638

When performing operations through Jenkins that require the Salt Minion package update and restart, for example, an MCP DriveTrain upgrade, a cloud environment update, or a packages update, Jenkins pipeline jobs may fail due to a known community dbus-daemon issue.

Workaround:

  1. On the Salt Master node, run:

    systemctl daemon-reexec
    systemctl restart salt-minion
    
  2. Log in to the Jenkins web UI.

  3. Re-run the failed Jenkins pipeline job.


32633

Occasionally, the application of Salt states across all nodes during the execution of the deployment pipelines fails with the Pepper error: Server error message. The issue affects large deployments with a large number of Salt Minions and may affect the deployment of services during the later deployment steps.

To work around the issue, select from the following options:

  • Enable the Salt batching for the affected Salt states. For example, if the linux.system state fails, apply the following patch to the pipeline-library repository:

    diff --git a/src/com/mirantis/mk/Orchestrate.groovy b/src/com/mirantis/mk/Orchestrate.groovy
    index 509fe87..575d6ca 100644
    --- a/src/com/mirantis/mk/Orchestrate.groovy
    +++ b/src/com/mirantis/mk/Orchestrate.groovy
    @@ -44,7 +44,7 @@ def installFoundationInfra(master, staticMgmtNet=false, extra_tgt = '') {
         } catch (Throwable e) {
             common.warningMsg('Salt state salt.minion.base is not present in the Salt-formula yet.')
         }
    -    salt.enforceState([saltId: master, target: "* ${extra_tgt}", state: ['linux.system'], retries: 2])
    +    salt.enforceState([saltId: master, target: "* ${extra_tgt}", state: ['linux.system'], batch: '15', retries: 2])
         if (staticMgmtNet) {
             salt.runSaltProcessStep(master, "* ${extra_tgt}", 'cmd.shell', ["salt-call state.sls linux.network; salt-call service.restart salt-minion"], null, true, 60)
         }
    

    The patch sets the batch size to 15% of the target nodes matched by "* ${extra_tgt}". With no additional conditions, the state is applied in batches of 15% of the total number of these nodes at a time (see also the CLI example after this list).

  • Manually re-run the failed state. For example, if the salt.minion state fails, perform the following steps:

    1. Log in to the Salt Master node.

    2. Re-apply the failed state on the affected nodes manually:

      salt '*' state.sls salt.minion
      
    3. Restart the salt-minion service and refresh the Salt data manually:

      salt '*' cmd.run 'salt-call service.restart salt-minion'
      salt '*' saltutil.clear_cache
      salt '*' saltutil.refresh_pillar
      salt '*' saltutil.sync_all
      

      During the restart of the salt-minion service, verify that the Salt Master node does not report an exception about a lost minion.

    4. Restart the failed pipeline to proceed with update, deployment, or another required operation.
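
As a complement to the first option above, the same batching can also be applied manually from the Salt Master CLI; this is an illustrative command, not part of the pipeline patch:

    salt -b '15%' '*' state.sls linux.system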


32079

The values of the net.ipv4.neigh.default.gc_thresh1, net.ipv4.neigh.default.gc_thresh2, and net.ipv4.neigh.default.gc_thresh3 kernel parameters in pillars may differ from the ones in the output of the sysctl command on the mon* and ctl* nodes because of the specific values hardcoded in Docker.

Workaround:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In classes/cluster/<cluster_name>/cicd/control/init.yml and classes/cluster/<cluster_name>/infra/config/docker.yml, add the following pillar:

    linux:
      system:
        kernel:
          # hardcoded in overlay network driver https://github.com/docker/libnetwork/pull/1789/files
          sysctl:
            net.ipv4.neigh.default.gc_thresh1: 8192
            net.ipv4.neigh.default.gc_thresh2: 49152
            net.ipv4.neigh.default.gc_thresh3: 65536
    
  3. If you have StackLight enabled, also add the same pillar to classes/cluster/<cluster_name>/stacklight/server.yml.

  4. Synchronize the Salt data:

    salt '*' saltutil.sync_all
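
If the new values do not take effect after the synchronization, you can additionally refresh the pillars and apply the kernel state, similarly to the procedure for issue 15644 (a hedged follow-up, not part of the original steps):

    salt '*' saltutil.refresh_pillar
    salt '*' state.apply linux.system.kernel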
    

28046

When the Open vSwitch (OVS) network interfaces have the same MAC address, for example, when a bond interface is split into several VLANs with tags, OVS prior to version 2.10 may not add flow rules to some OVS bridges.

Workaround:

Choose from the following options:

  • Add a unique MAC address to the port description. For example:

    bond1.${_param:aint_public_vlan}:
      name: bond1.${_param:aint_public_vlan}
      enabled: true
      proto: manual
      type: ovs_port
      bridge: br-aint_public
      ovs_bridge: br-aint_public
      hwaddress: <unique_mac>
      ovs_port_type: OVSPort
      use_interfaces:
      - bond1
    
  • Use the following configuration order:

    1. Plug the tagged interfaces into the Linux bridges.
    2. Connect the Linux bridges into the OVS bridges.
  • Use external networks:

    1. Pass the entire interface to the OVS bridge and map it to a single physical network.
    2. Split the interface on VLANs by setting provider:segmentation_id for each Neutron network (see the example command below).
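
For the last option, an illustrative command for creating such a Neutron network with an explicit segmentation ID; <network_name>, <physnet_name>, and <vlan_id> are placeholders:

    openstack network create --provider-network-type vlan \
      --provider-physical-network <physnet_name> \
      --provider-segment <vlan_id> <network_name>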

34308

The Deploy - upgrade control VMs Jenkins pipeline job may fail with the HTTP Error 504: Gateway Time-out error message. The workaround is to increase the timeout for NGINX.

Workaround:

  1. Open your Git project repository with the Reclass model on the cluster level.

  2. In classes/cluster/<cluster_name>/infra/config/init.yml, increase the timeout for NGINX:

    nginx:
      server:
        site:
          nginx_proxy_salt_api:
            proxy:
              timeout: 1000
    
  3. Refresh the Salt data and apply the nginx.server state:

    salt -C 'I@salt:master' saltutil.sync_all
    salt -C 'I@salt:master' state.sls nginx.server
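
Optionally, you can verify the resulting NGINX configuration on the Salt Master node (an extra check, not part of the original workaround):

    salt -C 'I@salt:master' cmd.run 'nginx -t'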
    

35060

The service.enable Salt module does not enable the nova-novncproxy service if it was disabled using systemd.

Workaround:

  • Enable the service using systemd
  • Disable the service using Salt and then enable it again using Salt (see the example commands below)
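
A minimal sketch of both options, assuming the service runs on the OpenStack controller nodes; the 'I@nova:controller' target is illustrative and may differ in your deployment:

    # Option 1: enable the service directly with systemd on the affected node
    systemctl enable nova-novncproxy
    # Option 2: toggle the service through Salt from the Salt Master node
    salt -C 'I@nova:controller' service.disable nova-novncproxy
    salt -C 'I@nova:controller' service.enable nova-novncproxy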

34708

Due to the specifics of SaltStack version 2017.7, when using the x509.py module, Salt ignores the test=True option and applies the changes.