NTP

This section describes the alerts for the NTP service.

NtpOffsetTooHigh

Severity Warning
Summary The NTP offset on the {{ $labels.host }} node is more than 200 milliseconds for 2 minutes.
Raise condition ntpq_offset >= 200
Description Raises when the NTP offset on a node stays at or above the threshold of 200 milliseconds for 2 minutes, which typically indicates that the host fails to synchronize its time with the NTP server or that the NTP server is malfunctioning. An excessive offset affects metrics collection and queries to the time series database. The host label in the raised alert contains the host name of the affected node.
Troubleshooting

Synchronize the time with a properly operating NTP server:

  1. Enter the NTP CLI by running ntpq on the affected node.
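  2. List the NTP peers by running the peers command, then exit the NTP CLI.
  3. Check the offset against a chosen peer using ntpdate -q <peer_from_list>. The -q option only queries the server and does not change the clock. To set the date and time, run ntpdate <peer_from_list> without -q.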
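
The peer check above can also be run non-interactively. The following is a minimal sketch, assuming that ntpq -pn prints the usual ten-column peer table with the offset, in milliseconds, in the ninth column. The high_offset_peers helper name is hypothetical and not part of any NTP tooling.

```shell
# Print peers whose absolute offset (ms) meets or exceeds a threshold.
# Reads `ntpq -pn` style output on stdin; the threshold defaults to 200.
high_offset_peers() {
  awk -v limit="${1:-200}" '
    NR > 2 {                        # skip the two header lines
      off = $9 + 0                  # ninth column: offset in milliseconds
      if (off < 0) off = -off       # compare the absolute value
      if (off >= limit) print $1, $9
    }'
}

# Typical use on the affected node (not executed here):
#   ntpq -pn | high_offset_peers 200
```

Each printed line contains the peer address, including its tally character, and the reported offset. A peer with a small offset is a reasonable candidate for <peer_from_list>.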

If the issue persists:

  1. Enter the NTP CLI by running ntpq on the affected node.
  2. List the associations by running the as command.
  3. Investigate the reason for the server rejection by running rv <association_id> with a chosen association ID.
  4. Inspect the output for the flash code and the rootdispersion, dispersion, and jitter values. Avoid synchronizing with servers that have a large dispersion.
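
To make the relevant variables easier to spot, the rv output can be filtered. This is only a sketch, assuming that rv prints comma-separated name=value pairs and that the root dispersion appears as either rootdisp or rootdispersion, depending on the NTP version. The rv_fields helper name is hypothetical.

```shell
# Keep only the variables of interest from `rv` output read on stdin.
rv_fields() {
  tr ',' '\n' |                     # one name=value pair per line
    sed 's/^[[:space:]]*//' |       # drop leading whitespace
    grep -E '^(flash|rootdisp|rootdispersion|dispersion|jitter)='
}

# Typical use on the affected node (not executed here):
#   ntpq -c as
#   ntpq -c "rv <association_id>" | rv_fields
```
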
Tuning

For example, to change the threshold of the NTP offset to 500 milliseconds:

  1. On the cluster level of the Reclass model, create a common file for all alert customizations. Skip this step if such a file is already defined.

    1. Create a file for alert customizations:

      touch cluster/<cluster_name>/stacklight/custom/alerts.yml
      
    2. Define the new file in cluster/<cluster_name>/stacklight/server.yml:

      classes:
      - cluster.<cluster_name>.stacklight.custom.alerts
      ...
      
  2. In the defined alert customizations file, modify the alert threshold by overriding the if parameter:

    parameters:
      prometheus:
        server:
          alert:
            NtpOffsetTooHigh:
              if: >-
                ntpq_offset >= 500
    
  3. From the Salt Master node, apply the changes:

    salt 'I@prometheus:server' state.sls prometheus.server
    
  4. Verify the updated alert definition in the Prometheus web UI.
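
Before applying the changes, the override from step 2 can be sanity-checked locally. The sketch below assumes that the if parameter is written as a block scalar (if: >-) with the expression on the following line, as in the example above. The alert_expr helper name is hypothetical.

```shell
# Print the expression that overrides the `if` parameter of an alert.
# Usage: alert_expr <alert_name> < alerts.yml
alert_expr() {
  awk -v name="$1" '
    $1 == (name ":")        { in_alert = 1; next }   # alert key found
    in_alert && $1 == "if:" { want = 1; next }       # block scalar starts
    want && NF              { sub(/^[ \t]+/, ""); print; exit }
  '
}

# Typical use from the Reclass model directory (not executed here):
#   alert_expr NtpOffsetTooHigh < cluster/<cluster_name>/stacklight/custom/alerts.yml
```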