The network driver may fail to allocate kernel memory. You may also detect the following symptom of the issue: messages related to the BNX driver in kern.log.
To prevent the issue, calculate the minimum reserved memory and set it using the vm.min_free_kbytes sysctl parameter for each type of node depending on your cluster model.
Caution
For performance reasons, verify that the value set for vm.min_free_kbytes does not exceed 5% of the total memory.
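For illustration only, the following command computes 5% of the total memory in kB as an upper bound for vm.min_free_kbytes; derive the actual value from your cluster model calculation:
# Upper bound only: 5% of MemTotal, which /proc/meminfo reports in kB
awk '/MemTotal/ {print "vm.min_free_kbytes upper bound:", int($2/20), "kB"}' /proc/meminfo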
Warning
Perform the steps below before the deployment of an OpenStack environment. For existing environments, first perform the procedure on a staging environment. If a staging environment does not exist, adapt the exact cluster model and launch it inside the cloud as a Heat stack, which will act as a staging environment.
Workaround:
Open your Git project repository with the Reclass model on the cluster level.
Specify the following pillar, which defines the parameter in /etc/sysctl.conf:
linux:
  system:
    kernel:
      sysctl:
        vm.min_free_kbytes: <calculated_value>
Choose from the following options:
If you are making changes before the deployment, proceed with further configuration as required.
If you are making changes to an existing environment, apply the changes:
Log in to the Salt Master node.
Apply the following state:
salt '*' state.apply linux.system.kernel
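To verify that the new value is in effect, you can, for example, query the parameter on all nodes (a sanity check, not part of the documented procedure):
salt '*' cmd.run 'sysctl vm.min_free_kbytes'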
The Salt Master CA does not provide the Certificate Revocation List (CRL) and index files to identify the revoked or expired certificates.
Workaround:
To list all currently issued certificates, follow step 3 of the Replace the Salt Master CA certificates procedure.
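For reference, you can also inspect the validity period of an individual certificate with openssl; the path below assumes the default Salt Master CA location and may differ in your deployment:
openssl x509 -in /etc/pki/ca/salt_master_ca/certs/<cert_name>.crt -noout -dates -subject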
During the upgrade of an MCP cluster, after the installation of the salt-master, salt-common, salt-api, and salt-minion packages, the Deploy - update cloud pipeline may hang with the Connection refused error message while trying to connect to salt-api.
Workaround:
Log in to the Salt Master node.
Restart the salt-api service:
systemctl restart salt-api.service
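Optionally, verify that the service is active again before rerunning the pipeline:
systemctl status salt-api.service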
Rerun the Deploy - update cloud pipeline.
When changing any network settings (routes, up_cmds commands, MTU), the linux.network formula restarts the target interface and all related interfaces. For example, when the changes are related to a bridge interface, all its interfaces will be restarted, which leads to VM failures. Therefore, Mirantis recommends configuring all required bridge interfaces on KVM nodes before a cluster deployment.
The workaround is to apply all required settings manually without a bridge restart. If a bridge restart on a KVM node is crucial:
Fixed in 2019.2.3
Occasionally, the deployment of OpenContrail v4.x with OpenStack Pike may fail due to the duplication of the salt-minion services.
Workaround:
Log in to the Salt Master node.
Run the following command:
salt -t 10 "rgw*" cmd.run 'pkill -9 salt-minion'
The service restarts automatically in a few minutes.
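To confirm that the duplicate processes are gone after the service comes back, you can, for example, list the salt-minion processes on the affected nodes (an optional check):
salt -t 10 "rgw*" cmd.run 'pgrep -a salt-minion'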
The CVP - Sanity checks Jenkins pipeline may fail if the TEST_REPO parameter is not empty.
Workaround:
Leave the TEST_REPO parameter empty. This option is deprecated starting from the MCP Build ID 2019.2.0.
When commissioning nodes with Intel X520-2 10 GB Ethernet Network Interface Cards (NICs), such cards may not be discovered.
Workaround:
Do not use Intel X520-2 10 GB NICs with firmware version 0x30030001.
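To check the firmware version of an installed NIC, you can, for example, run ethtool on a node where the card is visible; the interface name is a placeholder:
ethtool -i <interface_name> | grep firmware-version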
When upgrading from the MCP Build ID 2018.11.0 to 2019.2.0, the Deploy - upgrade MCP DriveTrain Jenkins pipeline job fails due to the mirror jobs failing to trigger the newest version.
Workaround:
Set the corresponding mirror job parameters to release/2019.2.0 and false as applicable.
Fixed in 2019.2.3
Creating instant backups using Backupninja, Xtrabackup, ZooKeeper, or Cassandra may fail due to an issue with permissions.
Workaround:
Log in to the Salt Master node.
Obtain the SSH RSA key specified in /root/.ssh/id_rsa.pub.
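For example, display the key:
cat /root/.ssh/id_rsa.pub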
On the system level of the Reclass model, add the obtained SSH RSA key to system/<service_name>/server.yml for Backupninja or Xtrabackup, or to system/<service_name>/backup/server.yml for Cassandra or ZooKeeper.
For example, for Backupninja, add the following pillar to system/backupninja/server.yml:
parameters:
  backupninja:
    server:
      key:
        backupninja_pub_key:
          enabled: true
          key: <key_from_/root/id_rsa.pub>
Apply the corresponding service state. For example, for Backupninja apply the following state on the nodes with the Backupninja pillar defined:
salt -C 'I@backupninja:client or I@backupninja:server' state.sls backupninja
Warning
Because the steps above involve manual changes to the system level of the Reclass model, these changes will be removed during a system upgrade and you may need to apply them again.
When performing operations through Jenkins that require a Salt Minion package update and restart, for example, an MCP DriveTrain upgrade, a cloud environment update, or a packages update, Jenkins pipeline jobs may fail due to the known community dbus-daemon issue.
Workaround:
On the Salt Master node, run:
systemctl daemon-reexec
systemctl restart salt-minion
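Optionally, verify that the local Salt Minion responds before proceeding (a sanity check, not part of the documented steps):
salt-call test.ping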
Log in to the Jenkins web UI.
Re-run the failed Jenkins pipeline job.
Occasionally, the application of the Salt states across all nodes during the deployment pipeline execution fails with Pepper error: Server error. The issue affects large deployments with a big number of Salt Minions and may affect the services deployment during later deployment steps.
To work around the issue, select one of the following options:
Enable Salt batching for the affected Salt states. For example, if the linux.system state fails, apply the following patch to the pipeline-library repository:
diff --git a/src/com/mirantis/mk/Orchestrate.groovy b/src/com/mirantis/mk/Orchestrate.groovy
index 509fe87..575d6ca 100644
--- a/src/com/mirantis/mk/Orchestrate.groovy
+++ b/src/com/mirantis/mk/Orchestrate.groovy
@@ -44,7 +44,7 @@ def installFoundationInfra(master, staticMgmtNet=false, extra_tgt = '') {
} catch (Throwable e) {
common.warningMsg('Salt state salt.minion.base is not present in the Salt-formula yet.')
}
- salt.enforceState([saltId: master, target: "* ${extra_tgt}", state: ['linux.system'], retries: 2])
+ salt.enforceState([saltId: master, target: "* ${extra_tgt}", state: ['linux.system'], batch: '15', retries: 2])
if (staticMgmtNet) {
salt.runSaltProcessStep(master, "* ${extra_tgt}", 'cmd.shell', ["salt-call state.sls linux.network; salt-call service.restart salt-minion"], null, true, 60)
}
The patch sets the batch size to 15% of the target nodes matched by "* ${extra_tgt}". In the absence of additional conditions, the state is applied to 15% of the total number of these nodes at a time.
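For comparison, a similar batched run can be performed ad hoc from the Salt CLI; this is shown only to illustrate the batching behavior and is not part of the pipeline:
salt -b '15%' '*' state.sls linux.system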
Manually re-run the failed state. For example, if the salt.minion state fails, perform the following steps:
Log in to the Salt Master node.
Re-apply the failed state on the affected nodes manually:
salt '*' state.sls salt.minion
Restart the salt-minion service manually:
salt '*' cmd.run 'salt-call service.restart salt-minion'
salt '*' saltutil.clear_cache
salt '*' saltutil.refresh_pillar
salt '*' saltutil.sync_all
During the restart of the salt-minion service, verify that the Salt Master node does not catch an exception related to a lost minion.
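For example, you can confirm that all minions reconnected before restarting the pipeline (a basic check, not part of the documented steps):
salt '*' test.ping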
Restart the failed pipeline to proceed with update, deployment, or another required operation.
The values of the net.ipv4.neigh.default.gc_thresh1, net.ipv4.neigh.default.gc_thresh2, and net.ipv4.neigh.default.gc_thresh3 kernel parameters in pillars may differ from the ones in the output of the sysctl command on the mon* and ctl* nodes because of the specific values hardcoded in Docker.
Workaround:
Open your Git project repository with the Reclass model on the cluster level.
In classes/cluster/<cluster_name>/cicd/control/init.yml and classes/cluster/<cluster_name>/infra/config/docker.yml, add the following pillar:
linux:
  system:
    kernel:
      # hardcoded in overlay network driver https://github.com/docker/libnetwork/pull/1789/files
      sysctl:
        net.ipv4.neigh.default.gc_thresh1: 8192
        net.ipv4.neigh.default.gc_thresh2: 49152
        net.ipv4.neigh.default.gc_thresh3: 65536
If you have StackLight enabled, also add the same pillar to classes/cluster/<cluster_name>/stacklight/server.yml.
Apply the changes:
salt '*' saltutil.sync_all
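To verify the resulting values on the affected nodes, you can, for example, run the following check; adjust the target expression to your environment:
salt 'ctl*' cmd.run 'sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3'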
When the Open vSwitch (OVS) network interfaces have the same MAC address, for example, when a bond interface is split into several vLANs with tags, OVS prior to version 2.10 may not add flow rules to some OVS bridges.
Workaround:
Choose from the following options:
Add a unique MAC address to the ports description. For example:
bond1.${_param:aint_public_vlan}:
  name: bond1.${_param:aint_public_vlan}
  enabled: true
  proto: manual
  type: ovs_port
  bridge: br-aint_public
  ovs_bridge: br-aint_public
  hwaddress: <unique_mac>
  ovs_port_type: OVSPort
  use_interfaces:
  - bond1
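After adding the unique MAC address, you can verify that flow rules are present on the suspect bridge, for example (run on the affected node; the bridge name is taken from the example above):
ovs-ofctl dump-flows br-aint_public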
Use the following configuration order: use external networks and specify provider:segmentation_id for each Neutron network.
The Deploy - upgrade control VMs Jenkins pipeline job may fail with the HTTP Error 504: Gateway Time-out error message. The workaround is to increase the timeout for NGINX.
Workaround:
Open your Git project repository with the Reclass model on the cluster level.
In classes/cluster/<cluster_name>/infra/config/init.yml, increase the timeout for NGINX:
nginx:
  server:
    site:
      nginx_proxy_salt_api:
        proxy:
          timeout: 1000
Apply the following state:
salt -C 'I@salt:master' saltutil.sync_all
salt -C 'I@salt:master' state.sls nginx.server
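To confirm that the new timeout is rendered into the generated NGINX configuration, you can, for example, grep for it; the exact file layout and directive names depend on the formula version:
salt -C 'I@nginx:server' cmd.run 'grep -R timeout /etc/nginx/'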
The service.enable Salt module does not enable the nova-novncproxy service if it was disabled using systemd.
Workaround:
Due to the specifics of SaltStack version 2017.7, when using the x509.py module, Salt ignores the test=True option and applies the changes.
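Because a dry run is not possible in this case, you can preview what a certificate-related state would render without applying it by using state.show_sls; the state name below is only an example and may differ in your model:
salt '<target>' state.show_sls salt.minion.cert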