Verify the Instance HA service¶
This section provides instructions for cloud administrators on how to verify that the Instance HA (Masakari) service has been correctly configured and will recover an instance from process and compute node failures.
Verify recovery from an instance process failure¶
Confirm that your instance is protected by the Instance HA service as detailed in Enable High Availability for an instance. After applying the HA metadata, you can verify the service by simulating a process failure.
Identify the compute node where the instance is running. For example:
openstack server show DemoInstance01 | grep host
Example output:
+----------------------+-------------------------------------------------------+
| Field                | Value                                                 |
+----------------------+-------------------------------------------------------+
| OS-EXT-SRV-ATTR:host | vs-ps-vyvsrkrdpusv-1-w2mtagbeyhel-server-cgpejthzbztt |
+----------------------+-------------------------------------------------------+
Verify the maintenance status of the compute node. If the Masakari host object corresponding to your compute node is in maintenance mode, the service will ignore all failure events for instances running on it:
openstack segment host show <SEGMENT-ID> <HOST>
If the on_maintenance field is set to True, move the host out of maintenance to resume HA monitoring:
openstack segment host update <SEGMENT-ID> <HOST> --on_maintenance False
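The check-and-update sequence above can be wrapped in a small helper. This is a minimal sketch, not part of the official tooling: the function name ensure_host_monitored is hypothetical, and it assumes the openstack CLI supports the usual -f value -c <column> output options and that the field is named on_maintenance in the host object.

```shell
# Sketch: move a Masakari host out of maintenance mode if it is set.
# ensure_host_monitored is a hypothetical helper; pass your segment ID and host.
ensure_host_monitored() {
    segment=$1
    host=$2
    # Read only the on_maintenance field of the host object (assumed field name).
    state=$(openstack segment host show "$segment" "$host" -f value -c on_maintenance)
    if [ "$state" = "True" ]; then
        openstack segment host update "$segment" "$host" --on_maintenance False
        echo "host moved out of maintenance"
    else
        echo "host already monitored"
    fi
}
```

For example: ensure_host_monitored <SEGMENT-ID> <HOST>.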
Log in to the compute node where your instance is running and obtain the qemu process associated with your instance name:
ps xafu | grep qemu
Example output:
nova 5231 34.3 1.1 5459452 184712 ? Sl 07:39 0:18 | \_ /usr/bin/qemu-system-x86_64 -name guest=instance-00000002....
In this example, the PID is 5231.
Simulate a sudden crash of the instance by using the kill command to terminate the process:
kill -9 <PID>
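If you prefer not to read the PID off the process listing by hand, the two steps above can be combined by matching the libvirt instance name on the qemu command line. This is a sketch: crash_instance is a hypothetical helper, and it assumes the qemu command line contains guest=<instance-name>, as in the example output above.

```shell
# Sketch: simulate a sudden crash by killing the qemu process of an instance.
# crash_instance is a hypothetical helper; $1 is the libvirt instance name,
# for example instance-00000002 from the listing above.
crash_instance() {
    name=$1
    # Match the full command line, e.g. ".../qemu-system-x86_64 -name guest=instance-00000002"
    pid=$(pgrep -f "qemu.*guest=${name}")
    if [ -n "$pid" ]; then
        kill -9 $pid
        echo "killed qemu process ${pid} for ${name}"
    else
        echo "no qemu process found for ${name}" >&2
        return 1
    fi
}
```

For example: crash_instance instance-00000002.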
Verify that the notification about the failure was generated and successfully processed:
openstack notification list
Initially, the status will appear as running:
+--------------------------------------+----------------------------+---------+------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
| notification_uuid                    | generated_time             | status  | type | source_host_uuid                     | payload                                                                                                               |
+--------------------------------------+----------------------------+---------+------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
| 2fb82a5c-9a8b-4cef-a06e-a737e1b565a0 | 2021-07-06T07:40:40.000000 | running | VM   | 6f1bd5aa-0c21-446a-b6dd-c1b4d09759be | {'event': 'LIFECYCLE', 'instance_uuid': '165cdfaf-b9e5-42b2-bbb9-af9283a789ae', 'vir_domain_event': 'STOPPED_FAILED'} |
+--------------------------------------+----------------------------+---------+------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
Wait a moment and check again until the status is finished:
+--------------------------------------+----------------------------+----------+------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
| notification_uuid                    | generated_time             | status   | type | source_host_uuid                     | payload                                                                                                               |
+--------------------------------------+----------------------------+----------+------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
| 2fb82a5c-9a8b-4cef-a06e-a737e1b565a0 | 2021-07-06T07:40:40.000000 | finished | VM   | 6f1bd5aa-0c21-446a-b6dd-c1b4d09759be | {'event': 'LIFECYCLE', 'instance_uuid': '165cdfaf-b9e5-42b2-bbb9-af9283a789ae', 'vir_domain_event': 'STOPPED_FAILED'} |
+--------------------------------------+----------------------------+----------+------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------+
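Rather than re-running the list command by hand, you can poll until the notification reaches the finished status. A minimal sketch, assuming the openstack notification show command from the Masakari OSC plugin accepts the usual -f value -c <column> output options; wait_for_notification is a hypothetical helper and the polling interval is arbitrary.

```shell
# Sketch: poll a Masakari notification until it reaches the "finished" status.
# wait_for_notification is a hypothetical helper; $1 is the notification UUID.
# Polls every 10 seconds, up to 30 attempts.
wait_for_notification() {
    uuid=$1
    attempt=0
    while [ "$attempt" -lt 30 ]; do
        status=$(openstack notification show "$uuid" -f value -c status)
        if [ "$status" = "finished" ]; then
            echo "finished"
            return 0
        fi
        attempt=$((attempt + 1))
        sleep 10
    done
    echo "timed out in status: ${status}" >&2
    return 1
}
```

For example: wait_for_notification 2fb82a5c-9a8b-4cef-a06e-a737e1b565a0.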
Verify that a new process has been started for your instance:
ps xafu | grep qemu
Example output:
root      8800  0.0  0.0   11488   1104 pts/1   S+   07:41   0:00  |       |   \_ grep --color=auto qemu
nova      8323  104  0.7 1262628 128936 ?       Sl   07:40   0:09  |   \_ /usr/bin/qemu-system-x86_64 -name guest=instance-00000002
Successfully verifying the recovery process confirms that your instance is correctly registered with the Instance HA service.
Verify recovery from a compute node failure¶
Destructive test: Compute node shutdown
This verification requires a complete shutdown of a physical bare-metal server, which inevitably disrupts all instances running on it.
Impact (data loss):
The resulting instance evacuation is similar to a rebuild:
Instance memory state is permanently lost
Instance disk state is permanently lost if local storage is used
Operational prerequisite: You must have the necessary management access to power the compute node back on after the test. This access is required for recovery, not for data loss mitigation.
Confirm that your instance is protected by the Instance HA service as detailed in Enable High Availability for an instance.
Identify the compute node where the instance is running. For example:
openstack server show DemoInstance01 | grep host
Example output:
+----------------------+-------------------------------------------------------+
| Field                | Value                                                 |
+----------------------+-------------------------------------------------------+
| OS-EXT-SRV-ATTR:host | vs-ps-vyvsrkrdpusv-1-w2mtagbeyhel-server-cgpejthzbztt |
+----------------------+-------------------------------------------------------+
Verify the maintenance status of the compute node. If the Masakari host object corresponding to your compute node is in maintenance mode, the service will ignore all failure events for instances running on it:
openstack segment host show <SEGMENT-ID> <HOST>
If the on_maintenance field is set to True, move the host out of maintenance to resume HA monitoring:
openstack segment host update <SEGMENT-ID> <HOST> --on_maintenance False
Log in to the compute node and power it off.
Allow some time for failover processing, then verify that the instance has been evacuated to another compute node:
openstack server show DemoInstance01 | grep host
Example output:
+----------------------+-------------------------------------------------------+
| Field                | Value                                                 |
+----------------------+-------------------------------------------------------+
| OS-EXT-SRV-ATTR:host | vs-ps-vyvsrkrdpusv-0-ukqbpy2pkcuq-server-s4u2thvgxdfi |
+----------------------+-------------------------------------------------------+
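The evacuation check can likewise be scripted: record the host field before powering the node off, then poll until it changes. A sketch with DemoInstance01 as the example instance name; wait_for_evacuation is a hypothetical helper, and it assumes the usual -f value -c <column> output options of the openstack CLI.

```shell
# Sketch: wait until an instance has been evacuated to a different host.
# wait_for_evacuation is a hypothetical helper; $1 is the server name,
# $2 the host it ran on before the failure (captured before the shutdown).
wait_for_evacuation() {
    server=$1
    old_host=$2
    for _ in $(seq 1 60); do
        host=$(openstack server show "$server" -f value -c OS-EXT-SRV-ATTR:host)
        if [ -n "$host" ] && [ "$host" != "$old_host" ]; then
            echo "$host"
            return 0
        fi
        sleep 10
    done
    return 1
}
```

For example, capture ORIG_HOST=$(openstack server show DemoInstance01 -f value -c OS-EXT-SRV-ATTR:host) before the shutdown, then run wait_for_evacuation DemoInstance01 "$ORIG_HOST".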
Power the compute node back up.
Wait for the operating system to boot and confirm the compute node is up and functioning correctly.
Resume HA monitoring. When a compute node fails, the Instance HA service often places the corresponding Masakari host object into maintenance mode to prevent recovery loops during hardware instability. You must manually move the host out of maintenance so that the Instance HA service can begin processing failure events for this host again:
openstack segment host update <SEGMENT-ID> <HOST> --on_maintenance False
Successfully verifying the recovery process confirms that your instance is correctly registered with the Instance HA service.