Verify the Instance HA service

This section explains how cloud administrators can verify that the Instance HA (Masakari) service is correctly configured and recovers an instance from process and compute node failures.

Verify recovery from an instance process failure

  1. Confirm that your instance is protected by the Instance HA service as detailed in Enable High Availability for an instance. After applying the HA metadata, you can verify the service by simulating a process failure.

  2. Identify the compute node where the instance is running. For example:

    openstack server show DemoInstance01 |grep host
    

    Example output:

    +----------------------+-------------------------------------------------------+
    | Field                | Value                                                 |
    +----------------------+-------------------------------------------------------+
    | OS-EXT-SRV-ATTR:host | vs-ps-vyvsrkrdpusv-1-w2mtagbeyhel-server-cgpejthzbztt |
    +----------------------+-------------------------------------------------------+
    
  3. Verify the maintenance status of the compute node. If the Masakari host object corresponding to your compute node is in maintenance mode, the service ignores all failure events for instances running on it:

    openstack segment host show <SEGMENT-ID> <HOST>
    

    If the on_maintenance field is set to True, move the host out of maintenance to resume HA monitoring:

    openstack segment host update <SEGMENT-ID> <HOST> --on_maintenance False
    
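  4. Log in to the compute node where your instance is running and find the PID of the qemu process for your instance. The guest= value in the qemu command line is the libvirt instance name (for example, instance-00000002), which Nova reports to administrators as OS-EXT-SRV-ATTR:instance_name, not the display name of the instance: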

    ps xafu |grep qemu
    

    Example output:

    nova      5231 34.3  1.1 5459452 184712 ?      Sl   07:39   0:18  |   \_ /usr/bin/qemu-system-x86_64 -name guest=instance-00000002....
    

    In this example, the PID is 5231.

  5. Simulate a sudden crash of the instance using the kill command to terminate the process:

    kill -9 <PID>
    
  6. Verify that the notification about the failure was generated and successfully processed:

    openstack notification list
    

    Initially, the status will appear as running:

    +--------------------------------------+----------------------------+---------+------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
    | notification_uuid                    | generated_time             | status  | type | source_host_uuid                     | payload                                                                                                               |
    +--------------------------------------+----------------------------+---------+------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
    | 2fb82a5c-9a8b-4cef-a06e-a737e1b565a0 | 2021-07-06T07:40:40.000000 | running | VM   | 6f1bd5aa-0c21-446a-b6dd-c1b4d09759be | {'event': 'LIFECYCLE', 'instance_uuid': '165cdfaf-b9e5-42b2-bbb9-af9283a789ae', 'vir_domain_event': 'STOPPED_FAILED'} |
    +--------------------------------------+----------------------------+---------+------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
    

    Repeat the command until the status changes to finished:

    +--------------------------------------+----------------------------+----------+------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
    | notification_uuid                    | generated_time             | status   | type | source_host_uuid                     | payload                                                                                                               |
    +--------------------------------------+----------------------------+----------+------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
    | 2fb82a5c-9a8b-4cef-a06e-a737e1b565a0 | 2021-07-06T07:40:40.000000 | finished | VM   | 6f1bd5aa-0c21-446a-b6dd-c1b4d09759be | {'event': 'LIFECYCLE', 'instance_uuid': '165cdfaf-b9e5-42b2-bbb9-af9283a789ae', 'vir_domain_event': 'STOPPED_FAILED'} |
    +--------------------------------------+----------------------------+----------+------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
    
  7. Verify that a new process has been started for your instance:

    ps xafu |grep qemu
    

    Example output:

    root      8800  0.0  0.0  11488  1104 pts/1    S+   07:41   0:00  |   |   \_ grep --color=auto qemu
    nova      8323  104  0.7 1262628 128936 ?      Sl   07:40   0:09  |   \_ /usr/bin/qemu-system-x86_64 -name guest=instance-00000002
    

Successfully verifying the recovery process confirms that your instance is correctly registered with the Instance HA service.
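
The manual checks in steps 4 to 6 can also be scripted. The following is a minimal sketch under stated assumptions, not a supported tool: the function names are hypothetical, the process list is assumed to come from ps -eo pid=,args= on the compute node, and any notification status other than finished that looks terminal (failed, error, ignored) is treated as an unsuccessful recovery.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not shipped with Instance HA.

# Print the PID of the qemu process whose "-name guest=<name>" matches $1.
# Reads "<pid> <command line>" lines on stdin, for example from:
#   ps -eo pid=,args=
qemu_pid_for_guest() {
    awk -v want="guest=$1" '
        $2 ~ /qemu-system/ {
            for (i = 3; i <= NF; i++) {
                split($i, part, ",")   # drop suffixes such as ",debug-threads=on"
                if (part[1] == want) { print $1; exit }
            }
        }'
}

# Print "<uuid> <status>" for every Masakari notification.
list_notifications() {
    openstack notification list -f value -c notification_uuid -c status
}

# Poll until notification $1 is finished.
# Optional: $2 = number of attempts, $3 = delay between attempts in seconds.
# Returns 0 on "finished", 1 on an unsuccessful terminal status or a timeout.
wait_for_finished() {
    local uuid="$1"
    local attempts="${2:-30}"
    local delay="${3:-10}"
    local status=""
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        status=$(list_notifications | awk -v u="$uuid" '$1 == u { print $2 }')
        case "$status" in
            finished) return 0 ;;
            failed|error|ignored)
                echo "notification $uuid ended as: $status" >&2
                return 1 ;;
        esac
        sleep "$delay"
        i=$((i + 1))
    done
    echo "timed out; last status: ${status:-unknown}" >&2
    return 1
}
```

On the compute node, ps -eo pid=,args= | qemu_pid_for_guest instance-00000002 prints the PID to pass to kill -9. From a host with the OpenStack client configured, wait_for_finished <NOTIFICATION-UUID> then blocks until the notification is processed or the attempts run out.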

Verify recovery from a compute node failure

Destructive test: Compute node shutdown

This verification requires a complete shutdown of a physical bare metal server, which disrupts all instances running on it.

Impact (data loss):

The resulting instance evacuation is similar to a rebuild:

  • Instance memory state is permanently lost

  • Instance disk state is permanently lost if local storage is used

Operational prerequisite: You must have the management access needed to power the compute node back up after the test. This access lets you recover the node; it does not mitigate data loss.

  1. Confirm that your instance is protected by the Instance HA service as detailed in Enable High Availability for an instance.

  2. Identify the compute node where the instance is running. For example:

    openstack server show DemoInstance01 |grep host
    

    Example output:

    +----------------------+-------------------------------------------------------+
    | Field                | Value                                                 |
    +----------------------+-------------------------------------------------------+
    | OS-EXT-SRV-ATTR:host | vs-ps-vyvsrkrdpusv-1-w2mtagbeyhel-server-cgpejthzbztt |
    +----------------------+-------------------------------------------------------+
    
  3. Verify the maintenance status of the compute node. If the Masakari host object corresponding to your compute node is in maintenance mode, the service ignores all failure events for instances running on it:

    openstack segment host show <SEGMENT-ID> <HOST>
    

    If the on_maintenance field is set to True, move the host out of maintenance to resume HA monitoring:

    openstack segment host update <SEGMENT-ID> <HOST> --on_maintenance False
    
  4. Log in to the compute node and power it off.

  5. After a while, verify that the instance has been evacuated:

    openstack server show DemoInstance01 |grep host
    

    Example output:

    +----------------------+-------------------------------------------------------+
    | Field                | Value                                                 |
    +----------------------+-------------------------------------------------------+
    | OS-EXT-SRV-ATTR:host | vs-ps-vyvsrkrdpusv-0-ukqbpy2pkcuq-server-s4u2thvgxdfi |
    +----------------------+-------------------------------------------------------+
    
  6. Power the compute node back up.

  7. Wait for the operating system to boot and confirm the compute node is up and functioning correctly.

  8. Resume HA monitoring. When a compute node fails, the Instance HA service often places the corresponding Masakari host object into maintenance mode to prevent recovery loops during hardware instability. You must manually move the host out of maintenance so that the Instance HA service can begin processing failure events for this host again:

    openstack segment host update <SEGMENT-ID> <HOST> --on_maintenance False
    

Successfully verifying the recovery process confirms that your instance is correctly registered with the Instance HA service.
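
The evacuation check in step 5 and the maintenance check that precedes step 8 can be scripted in the same way. This is a minimal sketch, not a supported tool: the function names are hypothetical, and it assumes that the host object returned by openstack segment host show exposes an on_maintenance field whose value prints as True or False.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not shipped with Instance HA.

# Print the compute host that currently runs server $1.
current_host() {
    openstack server show "$1" -f value -c OS-EXT-SRV-ATTR:host
}

# Poll until server $1 runs on a host other than $2, then print the new host.
# Optional: $3 = number of attempts, $4 = delay between attempts in seconds.
# Returns 1 if the host has not changed after all attempts.
wait_for_evacuation() {
    local server="$1"
    local old_host="$2"
    local attempts="${3:-60}"
    local delay="${4:-10}"
    local host=""
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        host=$(current_host "$server")
        if [ -n "$host" ] && [ "$host" != "$old_host" ]; then
            echo "$host"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Print one field ($3) of host $2 in failover segment $1.
segment_host_field() {
    openstack segment host show "$1" "$2" -f value -c "$3"
}

# Succeed (exit 0) when host $2 in segment $1 is in maintenance mode.
is_on_maintenance() {
    [ "$(segment_host_field "$1" "$2" on_maintenance)" = "True" ]
}
```

Record the original host with current_host DemoInstance01 before powering the node off, then pass it as the second argument to wait_for_evacuation. After the node is back up, is_on_maintenance <SEGMENT-ID> <HOST> indicates whether the host still has to be moved out of maintenance.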