Enable a watchdog


This section describes how to enable a watchdog in your MCP cluster and applies to both existing and new MCP deployments.


This feature is available as a technical preview. Use this configuration for testing and evaluation purposes only.

The watchdog detects serious server malfunctions, including hardware faults and program errors, and recovers the server from them. During normal operation, the server regularly resets the watchdog, which prevents it from generating a timeout signal. If the server stops resetting the watchdog, the timeout expires and the watchdog initiates corrective actions to restore normal operation of the system.

This functionality can be implemented through either a watchdog timer, which is a hardware device, or a software-only softdog driver.
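
The reset-or-timeout behavior described above can be sketched in a few lines of shell. This is only an illustration, not how MCP or the watchdog service is implemented: `pet_watchdog` is a hypothetical helper name, and the sketch assumes the standard Linux watchdog device interface, where any write to the device resets the countdown and, on drivers that support it, writing the character `V` just before closing asks the driver to disarm the timer. The demonstration writes to a temporary file so that it is safe to run anywhere.

```shell
#!/bin/sh
# Illustrative sketch only: pet_watchdog is a hypothetical helper, not part of MCP.
# WARNING: opening the real /dev/watchdog arms the timer. If the writes stop
# without a clean close, the node reboots once the timeout expires.

# Usage: pet_watchdog <device> <count> <interval_seconds>
pet_watchdog() (
    dev=$1 count=$2 interval=$3
    i=0
    {
        while [ "$i" -lt "$count" ]; do
            printf '.'            # any write resets the countdown
            sleep "$interval"
            i=$((i + 1))
        done
        printf 'V'                # "magic close": ask the driver to disarm
    } > "$dev"                    # the device stays open for the whole loop
)

# Safe demonstration against a temporary file instead of a real device.
demo_dev=$(mktemp)
pet_watchdog "$demo_dev" 3 0
cat "$demo_dev"; echo             # prints: ...V
rm -f "$demo_dev"
```

In a real deployment the watchdog service performs these writes itself, so a loop like this is useful only for understanding the mechanism or for testing on a disposable node.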

To install and configure the watchdog:

  1. Log in to the Salt Master node.

  2. In the classes/cluster/<cluster_name>/init.yml or classes/cluster/<cluster_name>/init/init.yml file of your Reclass model, include the following class:

    - system.watchdog.server

  3. In the classes/cluster/<cluster_name>/infra/config.yml file of your Reclass model, add the watchdog server configuration. For example:

        admin: root
        enabled: true
        interval: 1
        log_dir: /var/log/watchdog
        realtime: yes
        timeout: 60
        device: /dev/watchdog
        # Salt automatically detects the kernel module that needs to be
        # loaded (for example, hpwdt or iTCO_wdt).
        # If the hardware model is not predefined in map.jinja, the default
        # watchdog driver, softdog, is used.
        # You can specify kernel module parameters if needed:
            soft_panic: 1
            parameter: value
            parameter_only_without_value: none

  4. Select from the following options:

    • If you are performing the initial deployment of your environment, the watchdog service will be installed during the Finalize stage of the Deploy - OpenStack pipeline. See Deploy an OpenStack environment for details.

    • If you are enabling the watchdog service in an existing environment, apply the changes to the deployment model to install the service:

      salt \* state.sls watchdog

  5. Verify that the watchdog service is enabled in your deployment:

    salt \* cmd.run "service watchdog status"
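
As an optional extra check on an individual node, you may also want to confirm which watchdog kernel module was actually loaded. The following is a minimal sketch under stated assumptions: `loaded_watchdog_module` is a hypothetical helper, it looks only for the three module names mentioned in this section (softdog, hpwdt, iTCO_wdt), and it assumes the usual `/proc/modules` layout in which the module name is the first field of each line.

```shell
#!/bin/sh
# Illustrative sketch only: loaded_watchdog_module is a hypothetical helper.
# Prints the first known watchdog module found in a /proc/modules-style file
# and returns non-zero if none of them is listed.

# Usage: loaded_watchdog_module [modules_file]
loaded_watchdog_module() {
    modules_file=${1:-/proc/modules}
    [ -r "$modules_file" ] || return 2
    awk '
        $1 == "softdog" || $1 == "hpwdt" || $1 == "iTCO_wdt" {
            print $1
            found = 1
            exit            # stop at the first match; END still runs
        }
        END { exit !found }
    ' "$modules_file"
}

# Example: report the result for the local node.
if module=$(loaded_watchdog_module); then
    echo "watchdog module loaded: $module"
else
    echo "no known watchdog module loaded"
fi
```

If the function reports softdog on hardware that you expected to use a hardware watchdog timer, the hardware model is probably not predefined in map.jinja, as described in the configuration comments in step 3.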