Troubleshoot orphaned resource allocations

Available since MOSK 24.3

Orphaned resource allocations are entries in the Placement database that track resource consumption for a consumer (instance) that no longer exists on the compute nodes. As a result, the Nova scheduler mistakenly believes that compute nodes have more resources allocated than they actually do.

For example, orphaned resource allocations may occur when an instance is evacuated from a hypervisor while the related nova-compute service is down.

This section provides instructions on how to resolve orphaned resource allocations in Nova if they are detected on compute nodes.
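
To see what an allocation entry looks like, you can query Placement for a specific consumer. The command below is a minimal illustration that assumes the keystone-client deployment used later in this section and the osc-placement CLI plugin that provides the resource provider commands; substitute a real consumer (instance) UUID for the placeholder:

   kubectl -n openstack exec -t deployment/keystone-client -- \
      openstack resource provider allocation show <CONSUMER_UUID>

For an orphaned allocation, the consumer UUID no longer corresponds to an existing instance, yet the allocation entry remains in the database.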

Detect orphaned allocations

Orphaned allocations are detected by the nova-placement-audit CronJob that runs every four hours.

The osdpl-exporter service processes the nova-placement-audit CronJob output and exports the current number of orphaned allocations to StackLight as the osdpl_nova_audit_orphaned_allocations metric. If the value of this metric is greater than 0, StackLight raises the major NovaOrphanedAllocationsDetected alert.
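
The audit runs on a schedule, so up to four hours may pass between the appearance of orphaned allocations and the corresponding alert. If you need fresh audit data sooner, you can trigger the CronJob manually. The following is a sketch that assumes the CronJob resides in the openstack namespace, as the audit pods above suggest; the job name placement-audit-manual is arbitrary:

   kubectl -n openstack create job --from=cronjob/nova-placement-audit placement-audit-manual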

Collect logging data from the cluster

  1. Obtain the mapping of resource provider IDs to the related orphaned consumers:

    kubectl -n openstack logs -l application=nova,component=placement-audit -c placement-audit-report | \
       jq .orphaned_allocations.detected
    

    Example of a system response:

    {
       "12ed66d0-00d8-40e5-a28b-19cdecd2211d": [
          {
             "consumer": "1e616d60-bc5b-436d-8d71-503d15de5c55",
             "resources": {
                "DISK_GB": 5,
                "MEMORY_MB": 512,
                "VCPU": 1
             }
          }
       ]
    }
    
  2. Obtain the list of the nova-compute services affected by orphaned allocations:

    1. Obtain the UUIDs of the resource providers containing orphaned allocations:

      rp_uuids=$(kubectl -n openstack logs -l application=nova,component=placement-audit -c placement-audit-report | \
         jq -c '.orphaned_allocations.detected|keys')
      
    2. Obtain the hostnames of the compute nodes that correspond to the resource providers obtained in the previous step:

      cmp_fqdns=$(kubectl -n openstack exec -t deployment/keystone-client -- \
         openstack resource provider list -f json | \
         jq --argjson rp_uuids "$rp_uuids" -c ' .[] | select( [.uuid] | inside($rp_uuids) ) | .name')
      cmp_hostnames=$(for n in $(echo ${cmp_fqdns} | tr -d \"); do echo ${n%%.*}; done)
      
    3. List the nova-compute services that contain orphaned allocations:

      kubectl -n openstack exec -t deployment/keystone-client -- \
         openstack compute service list --service nova-compute --long -f json | \
         jq --arg hosts "$cmp_hostnames" -r '.[] | select( .Host | inside($hosts) )'
      

      Example of a system response:

      [{
         "ID": "14a1685a-798e-40f1-b490-a09a5c8f6f66",
         "Binary": "nova-compute",
         "Host": "mk-ps-7bqjjdq7o53q-0-rlocum3rumf4-server-ir4n3ag33erk",
         "Zone": "nova",
         "Status": "enabled",
         "State": "down",
         "Updated At": "2024-08-12T07:52:46.000000",
         "Disabled Reason": null,
         "Forced Down": true
      }]
      
  3. Analyze the list of the nova-compute services obtained during the previous step:

    • For the nova-compute services in the down state, the most likely cause is that instances were evacuated from the corresponding nodes while the services were down. If this is the case, proceed directly to Remove orphaned allocations. Otherwise, proceed with collecting the logs.

      To verify whether evacuations were performed, run the following command for each affected compute node (a sketch that loops over all affected nodes follows this procedure):

      openstack server migration list --type evacuation --host <CMP_HOSTNAME> -f json
      

      Example of a system response:

      [{
         "Id": 3,
         "UUID": "d7c29e99-2f69-4f85-80ed-72e1ef71c099",
         "Source Node": "mk-ps-7bqjjdq7o53q-1-snwv3ahiip6i-server-3axzquptwsao.cluster.local",
         "Dest Node": "mk-ps-7bqjjdq7o53q-0-rlocum3rumf4-server-ir4n3ag33erk.cluster.local",
         "Source Compute": "mk-ps-7bqjjdq7o53q-1-snwv3ahiip6i-server-3axzquptwsao",
         "Dest Compute": "mk-ps-7bqjjdq7o53q-0-rlocum3rumf4-server-ir4n3ag33erk",
         "Dest Host": "10.10.0.61",
         "Status": "completed",
         "Server UUID": "1e616d60-bc5b-436d-8d71-503d15de5c55",
         "Old Flavor": null,
         "New Flavor": null,
         "Type": "evacuation",
         "Created At": "2024-08-07T09:01:08.000000",
         "Updated At": "2024-08-07T16:11:54.000000"
      }]
      
    • For the nova-compute services in the up state, proceed with collecting the logs.

  4. Collect the following logs from the environment:

    Caution

    The log data can be significant in size. Ensure that there is sufficient space available in the /tmp/ directory of the OpenStack controller pod. Create a separate report for each node.

    • Logs from compute nodes for a 3-day period around the time of the alert:

      • From the node with the orphaned allocation

      • From the node with the actual allocation (where the instance exists, if any)

      kubectl -n osh-system exec -it deployment/openstack-controller -- bash
      
      osctl sos --between <REPORT_PERIOD_TIMESTAMPS> \
         --host <CMP_HOSTNAME> \
         --component nova \
         --collector elastic \
         --collector nova \
         --workspace /tmp/ report
      

      For example, if the alert was raised on 2024-08-12, set <REPORT_PERIOD_TIMESTAMPS> to 2024-08-11,2024-08-13.

    • Logs from the nova-scheduler, nova-api, nova-conductor, placement-api pods for a 3-day period around the time of the alert:

      ctl_nodes=$(kubectl get nodes -l openstack-control-plane=enabled -o name)
      
      kubectl -n osh-system exec -it deployment/openstack-controller -- bash
      
       # For each node in ctl_nodes, execute the following command (a loop sketch is provided after this procedure):
      osctl sos --between <REPORT_PERIOD_TIMESTAMPS> \
         --host <CTL_HOSTNAME> \
         --component nova \
         --component placement \
         --collector elastic \
         --workspace /tmp/ report
      
    • Logs from the Kubernetes objects:

      kubectl -n osh-system exec -it deployment/openstack-controller -- bash
      osctl sos --collector k8s --workspace /tmp/ report
      
    • Nova service data from the API:

      kubectl -n openstack exec -it deployment/keystone-client -- bash
      
      openstack server migration list
      openstack compute service list --long
      openstack resource provider list
      # Get the server event list for each orphaned consumer id
      openstack server event list <SERVER_ID>
      

      Note

       <SERVER_ID> is the orphaned consumer ID from the nova-placement-audit logs.

  5. Create a support case and attach the obtained information.
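
If several compute nodes are affected, the evacuation check from the analysis step above can be wrapped in a loop. The following sketch assumes that the cmp_hostnames variable from step 2 is still set in your shell and reuses the same commands:

   # Check evacuations for every compute node that has orphaned allocations.
   # Assumes cmp_hostnames is still set from step 2.
   for host in ${cmp_hostnames}; do
      echo "Evacuations from ${host}:"
      kubectl -n openstack exec -t deployment/keystone-client -- \
         openstack server migration list --type evacuation --host "${host}" -f json
   done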
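
Similarly, the control plane log collection from step 4 can be scripted instead of being repeated manually for each node. This sketch assumes that the node names returned by kubectl match the host names that osctl expects:

   # Collect nova and placement logs from every control plane node in ctl_nodes.
   for node in ${ctl_nodes}; do
      host=${node#node/}
      kubectl -n osh-system exec -t deployment/openstack-controller -- \
         osctl sos --between <REPORT_PERIOD_TIMESTAMPS> \
         --host "${host}" \
         --component nova \
         --component placement \
         --collector elastic \
         --workspace /tmp/ report
   done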

Remove orphaned allocations

  1. Log in to the nova-api-osapi pod:

    kubectl -n openstack exec -it deployment/nova-api-osapi -- bash
    
  2. Remove orphaned allocations:

    • To remove all found orphaned allocations:

      nova-manage placement audit --verbose --delete
      
    • To remove orphaned allocations on a specific resource provider:

      nova-manage placement audit --verbose --delete --resource_provider <RESOURCE_PROVIDER_UUID>
      
  3. Verify that no orphaned allocations exist:

    nova-manage placement audit --verbose
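
After the next scheduled run of the nova-placement-audit CronJob, the osdpl_nova_audit_orphaned_allocations metric should return to 0 and the NovaOrphanedAllocationsDetected alert should clear. To double-check from outside the pod, you can count the resource providers that still have orphaned allocations in the latest audit report. This reuses the log command from the detection steps and assumes the report structure shown there:

   kubectl -n openstack logs -l application=nova,component=placement-audit -c placement-audit-report | \
      jq '.orphaned_allocations.detected | length'

A result of 0 means that no orphaned allocations remain.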