Troubleshoot orphaned resource allocations¶
Available since MOSK 24.3
Orphaned resource allocations are entries in the Placement database that track resource consumption, but the corresponding consumer (instance) no longer exists on the compute nodes. As a result, the Nova scheduler mistakenly believes that compute nodes have more resources allocated than they actually have.
For example, orphaned resource allocations may occur when an instance is evacuated from a hypervisor while the related nova-compute service is down.
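For illustration, the allocations that Placement has recorded for a particular consumer can be inspected directly. The sketch below assumes that the osc-placement CLI plugin is available in the keystone-client pod, as in the commands used later in this section, and <CONSUMER_UUID> is a placeholder for the instance UUID:

# A minimal sketch: show what Placement has recorded for a given consumer.
# For an orphaned allocation, the consumer UUID no longer corresponds to an
# existing instance, yet the allocation is still present in the output.
kubectl -n openstack exec -t deployment/keystone-client -- \
    openstack resource provider allocation show <CONSUMER_UUID>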
This section provides instructions on how to resolve orphaned resource allocations in Nova if they are detected on compute nodes.
Detect orphaned allocations¶
Orphaned allocations are detected by the nova-placement-audit CronJob that runs every four hours.
The osdpl-exporter service processes the nova-placement-audit CronJob output and exports the current number of orphaned allocations to StackLight as the osdpl_nova_audit_orphaned_allocations metric. If the value of this metric is greater than 0, StackLight raises the major alert NovaOrphanedAllocationsDetected.
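Instead of waiting for the next scheduled run, you can trigger an ad-hoc audit from the existing CronJob. The sketch below assumes that the CronJob resides in the openstack namespace, that the job name nova-placement-audit-manual is free, and that the job pod uses the same placement-audit-report container name as the log collection commands below:

# Create a one-off job from the nova-placement-audit CronJob
kubectl -n openstack create job --from=cronjob/nova-placement-audit nova-placement-audit-manual

# Follow its audit output
kubectl -n openstack logs -f job/nova-placement-audit-manual -c placement-audit-report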
Collect logging data from the cluster¶
Obtain the mapping of resource provider IDs to the related orphaned consumers:
kubectl -n openstack logs -l application=nova,component=placement-audit -c placement-audit-report | \
    jq .orphaned_allocations.detected
Example of a system response:
{ "12ed66d0-00d8-40e5-a28b-19cdecd2211d": [ { "consumer": "1e616d60-bc5b-436d-8d71-503d15de5c55", "resources": { "DISK_GB": 5, "MEMORY_MB": 512, "VCPU": 1 } } ] }
Obtain the list of the nova-compute services that have issues with orphaned allocations:

Obtain the UUIDs of the resource providers containing orphaned allocations:
rp_uuids=$(kubectl -n openstack logs -l application=nova,component=placement-audit -c placement-audit-report | \
    jq -c '.orphaned_allocations.detected|keys')
Obtain the hostnames of the compute nodes that correspond to the resource providers obtained in the previous step:
cmp_fqdns=$(kubectl -n openstack exec -t deployment/keystone-client -- \
    openstack resource provider list -f json | \
    jq --argjson rp_uuids "$rp_uuids" -c '.[] | select( [.uuid] | inside($rp_uuids) ) | .name')
cmp_hostnames=$(for n in $(echo ${cmp_fqdns} | tr -d \"); do echo ${n%%.*}; done)
List the nova-compute services that contain orphaned allocations:

kubectl -n openstack exec -t deployment/keystone-client -- \
    openstack compute service list --service nova-compute --long -f json | \
    jq --arg hosts "$cmp_hostnames" -r '.[] | select( .Host | inside($hosts) )'
Example of a system response:
[{ "ID": "14a1685a-798e-40f1-b490-a09a5c8f6f66", "Binary": "nova-compute", "Host": "mk-ps-7bqjjdq7o53q-0-rlocum3rumf4-server-ir4n3ag33erk", "Zone": "nova", "Status": "enabled", "State": "down", "Updated At": "2024-08-12T07:52:46.000000", "Disabled Reason": null, "Forced Down": true }]
Analyze the list of the nova-compute services obtained in the previous step:

For the nova-compute services in the down state, instances were most probably evacuated from the corresponding nodes while the services were down. If this is the case, proceed directly to Remove orphaned allocations. Otherwise, proceed with collecting the logs.

To verify whether the evacuations were performed:
openstack server migration list --type evacuation --host <CMP_HOSTNAME> -f json
Example of a system response:
[{ "Id": 3, "UUID": "d7c29e99-2f69-4f85-80ed-72e1ef71c099", "Source Node": "mk-ps-7bqjjdq7o53q-1-snwv3ahiip6i-server-3axzquptwsao.cluster.local", "Dest Node": "mk-ps-7bqjjdq7o53q-0-rlocum3rumf4-server-ir4n3ag33erk.cluster.local", "Source Compute": "mk-ps-7bqjjdq7o53q-1-snwv3ahiip6i-server-3axzquptwsao", "Dest Compute": "mk-ps-7bqjjdq7o53q-0-rlocum3rumf4-server-ir4n3ag33erk", "Dest Host": "10.10.0.61", "Status": "completed", "Server UUID": "1e616d60-bc5b-436d-8d71-503d15de5c55", "Old Flavor": null, "New Flavor": null, "Type": "evacuation", "Created At": "2024-08-07T09:01:08.000000", "Updated At": "2024-08-07T16:11:54.000000" }]
For the nova-compute services in the up state, proceed with collecting the logs.
Collect the following logs from the environment:
Caution

The log data can be significant in size. Ensure that there is sufficient space available in the /tmp/ directory of the OpenStack controller pod. Create a separate report for each node.

Logs from compute nodes for a 3-day period around the time of the alert:
From the node with the orphaned allocation
From the node with the actual allocation (where the instance exists, if any)
kubectl -n osh-system exec -it deployment/openstack-controller -- bash
osctl sos --between <REPORT_PERIOD_TIMESTAMPS> \
    --host <CMP_HOSTNAME> \
    --component nova \
    --collector elastic \
    --collector nova \
    --workspace /tmp/ report
For example, if the alert was raised on 2024-08-12, set <REPORT_PERIOD_TIMESTAMPS> to 2024-08-11,2024-08-13.
Logs from the nova-scheduler, nova-api, nova-conductor, and placement-api pods for a 3-day period around the time of the alert:

ctl_nodes=$(kubectl get nodes -l openstack-control-plane=enabled -o name)
kubectl -n osh-system exec -it deployment/openstack-controller -- bash
# For each node in ctl_nodes, execute:
osctl sos --between <REPORT_PERIOD_TIMESTAMPS> \
    --host <CTL_HOSTNAME> \
    --component nova \
    --component placement \
    --collector elastic \
    --workspace /tmp/ report
Logs from the Kubernetes objects:
kubectl -n osh-system exec -it deployment/openstack-controller -- bash
osctl sos --collector k8s --workspace /tmp/ report
Nova service data from the API:
kubectl -n openstack exec -it deployment/keystone-client -- bash
openstack server migration list
openstack compute service list --long
openstack resource provider list
# Get the server event list for each orphaned consumer ID
openstack server event list <SERVER_ID>
Note

SERVER_ID is the orphaned consumer ID from the nova-placement-audit logs.
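If several orphaned consumers are reported, the server event lists can be collected in one pass. The loop below is a sketch that reuses the audit output and runs from a host with kubectl access rather than from inside the keystone-client pod:

# Collect the server event list for every orphaned consumer UUID
for server_id in $(kubectl -n openstack logs -l application=nova,component=placement-audit -c placement-audit-report | \
    jq -r '.orphaned_allocations.detected[][].consumer'); do
  echo "=== ${server_id} ==="
  kubectl -n openstack exec -t deployment/keystone-client -- \
      openstack server event list "${server_id}"
done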
Create a support case and attach the obtained information.
Remove orphaned allocations¶
Log in to the nova-api-osapi pod:

kubectl -n openstack exec -it deployment/nova-api-osapi -- bash
Remove orphaned allocations:
To remove all found orphaned allocations:
nova-manage placement audit --verbose --delete
To remove orphaned allocations on a specific resource provider:
nova-manage placement audit --verbose --delete --resource_provider <RESOURCE_PROVIDER_UUID>
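If several resource providers are affected, the removal can be scripted by reusing the rp_uuids variable collected in Collect logging data from the cluster. The loop below is a sketch that runs nova-manage through kubectl instead of an interactive shell in the pod:

# Remove orphaned allocations for every affected resource provider
for rp in $(echo "${rp_uuids}" | jq -r '.[]'); do
  kubectl -n openstack exec -t deployment/nova-api-osapi -- \
      nova-manage placement audit --verbose --delete --resource_provider "${rp}"
done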
Verify that no orphaned allocations exist:
nova-manage placement audit --verbose