This section describes how to enable a watchdog in your MCP cluster and applies to both existing and new MCP deployments.
Note
This feature is available as technical preview. Use such configuration for testing and evaluation purposes only.
The watchdog detects and recovers servers from serious malfunctions which can include hardware faults as well as program errors. While operating normally, the server resets the watchdog preventing it from generating a timeout signal. Otherwise, the watchdog initiates corrective actions to restore the normal operation of a system.
This functionality can be implemented through either a watchdog timer, which is a hardware device, or a software-only softdog driver.
To install and configure the watchdog:
Log in to the Salt Master node.
In the classes/cluster/<cluster_name>/init.yml
or
classes/cluster/<cluster_name>/init/init.yml
file of your Reclass model,
include the following class:
classes:
- system.watchdog.server
In the classes/cluster/<cluster_name>/infra/config.yml
file of your
Reclass model, add the watchdog server configuration. For example:
watchdog:
server:
admin: root
enabled: true
interval: 1
log_dir: /var/log/watchdog
realtime: yes
timeout: 60
device: /dev/watchdog
# Salt Stack will automatically detect the necessary kernel module
# which needs to be loaded (ex. hpwdt, iTCO_wdt).
# If the hardware model is not predefined in map.jinja, the default
# watchdog driver is used: softdog
# You may specify the kernel module parameters if needed:
kernel:
parameter:
soft_panic: 1
parameter: value
parameter_only_without_value: none
Select from the following options:
If you are performing the initial deployment of your environment, the watchdog service will be installed during the Finalize stage of the Deploy - OpenStack pipeline. See Deploy an OpenStack environment for details.
If you are enabling the watchdog service in an existing environment, apply the changes to the deployment model to install the service:
salt \* state.sls watchdog
Verify that the watchdog service is enabled in your deployment:
salt \* cmd.run "service watchdog status"