Troubleshoot orphaned resource allocations

Available since MOSK 24.3

Orphaned resource allocations are entries in the Placement database that track resource consumption for a consumer (instance) that no longer exists on the compute nodes. As a result, the Nova scheduler mistakenly believes that compute nodes have more resources allocated than they actually do.

For example, orphaned resource allocations may occur when an instance is evacuated from a hypervisor while the related nova-compute service is down.

This section provides instructions on how to resolve orphaned resource allocations in Nova if they are detected on compute nodes.
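
To see what an allocation entry looks like, you can query Placement for a specific consumer. The command below is a minimal illustration that assumes the keystone-client deployment used later in this section and the osc-placement CLI plugin that provides the resource provider commands; substitute a real consumer (instance) UUID for the placeholder:

   kubectl -n openstack exec -t deployment/keystone-client -- \
      openstack resource provider allocation show <CONSUMER_UUID>

For an orphaned allocation, the consumer UUID no longer corresponds to an existing instance, yet the allocation entry remains in the database.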

Detect orphaned allocations

Orphaned allocations are detected by the nova-placement-audit CronJob that runs every four hours.

The osdpl-exporter service processes the nova-placement-audit CronJob output and exports the current number of orphaned allocations to StackLight as the osdpl_nova_audit_orphaned_allocations metric. If the value of this metric is greater than 0, StackLight raises the major NovaOrphanedAllocationsDetected alert.
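
The audit runs on a schedule, so up to four hours may pass between the appearance of orphaned allocations and the corresponding alert. If you need fresh audit data sooner, you can trigger the CronJob manually. The following is a sketch that assumes the CronJob resides in the openstack namespace, as the audit pods above suggest; the job name placement-audit-manual is arbitrary:

   kubectl -n openstack create job --from=cronjob/nova-placement-audit placement-audit-manual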

Collect logging data from the cluster

  1. Obtain the mapping of resource provider IDs to the related orphaned consumers:

    kubectl -n openstack logs -l application=nova,component=placement-audit -c placement-audit-report | \
       jq .orphaned_allocations.detected
    

    Example of a system response:

    {
       "12ed66d0-00d8-40e5-a28b-19cdecd2211d": [
          {
             "consumer": "1e616d60-bc5b-436d-8d71-503d15de5c55",
             "resources": {
                "DISK_GB": 5,
                "MEMORY_MB": 512,
                "VCPU": 1
             }
          }
       ]
    }
    
  2. Obtain the list of the nova-compute services affected by orphaned allocations:

    1. Obtain the UUIDs of the resource providers containing orphaned allocations:

      rp_uuids=$(kubectl -n openstack logs -l application=nova,component=placement-audit -c placement-audit-report | \
         jq -c '.orphaned_allocations.detected|keys')
      
    2. Obtain the hostnames of the compute nodes that correspond to the resource providers obtained in the previous step:

      cmp_fqdns=$(kubectl -n openstack exec -t deployment/keystone-client -- \
         openstack resource provider list -f json | \
         jq --argjson rp_uuids "$rp_uuids" -c ' .[] | select( [.uuid] | inside($rp_uuids) ) | .name')
      cmp_hostnames=$(for n in $(echo ${cmp_fqdns} | tr -d \"); do echo ${n%%.*}; done)
      
    3. List the nova-compute services that contain orphaned allocations:

      kubectl -n openstack exec -t deployment/keystone-client -- \
         openstack compute service list --service nova-compute --long -f json | \
         jq --arg hosts "$cmp_hostnames" -r '.[] | select( .Host | inside($hosts) )'
      

      Example of a system response:

      [{
         "ID": "14a1685a-798e-40f1-b490-a09a5c8f6f66",
         "Binary": "nova-compute",
         "Host": "mk-ps-7bqjjdq7o53q-0-rlocum3rumf4-server-ir4n3ag33erk",
         "Zone": "nova",
         "Status": "enabled",
         "State": "down",
         "Updated At": "2024-08-12T07:52:46.000000",
         "Disabled Reason": null,
         "Forced Down": true
      }]
      
  3. Analyze the list of the nova-compute services obtained during the previous step:

    • For the nova-compute services in the down state, the most likely cause is that instances were evacuated from the corresponding nodes while the services were down. If this is the case, proceed directly to Remove orphaned allocations. Otherwise, proceed with collecting the logs.

      To verify whether evacuations were performed, run the following command for each affected compute node (a sketch that loops over all affected nodes follows this procedure):

      openstack server migration list --type evacuation --host <CMP_HOSTNAME> -f json
      

      Example of a system response:

      [{
         "Id": 3,
         "UUID": "d7c29e99-2f69-4f85-80ed-72e1ef71c099",
         "Source Node": "mk-ps-7bqjjdq7o53q-1-snwv3ahiip6i-server-3axzquptwsao.cluster.local",
         "Dest Node": "mk-ps-7bqjjdq7o53q-0-rlocum3rumf4-server-ir4n3ag33erk.cluster.local",
         "Source Compute": "mk-ps-7bqjjdq7o53q-1-snwv3ahiip6i-server-3axzquptwsao",
         "Dest Compute": "mk-ps-7bqjjdq7o53q-0-rlocum3rumf4-server-ir4n3ag33erk",
         "Dest Host": "10.10.0.61",
         "Status": "completed",
         "Server UUID": "1e616d60-bc5b-436d-8d71-503d15de5c55",
         "Old Flavor": null,
         "New Flavor": null,
         "Type": "evacuation",
         "Created At": "2024-08-07T09:01:08.000000",
         "Updated At": "2024-08-07T16:11:54.000000"
      }]
      
    • For the nova-compute services in the up state, proceed with collecting the logs.

  4. Collect the following logs from the environment:

    Caution

    The log data can be significant in size. Ensure that there is sufficient space available in the /tmp/ directory of the OpenStack controller pod. Create a separate report for each node.

    • Logs from compute nodes for a 3-day period around the time of the alert:

      • From the node with the orphaned allocation

      • From the node with the actual allocation (where the instance exists, if any)

      kubectl -n osh-system exec -it deployment/openstack-controller -- bash
      
      osctl sos --between <REPORT_PERIOD_TIMESTAMPS> \
         --host <CMP_HOSTNAME> \
         --component nova \
         --collector elastic \
         --collector nova \
         --workspace /tmp/ report
      

      For example, if the alert was raised on 2024-08-12, set <REPORT_PERIOD_TIMESTAMPS> to 2024-08-11,2024-08-13.

    • Logs from the nova-scheduler, nova-api, nova-conductor, placement-api pods for a 3-day period around the time of the alert:

      ctl_nodes=$(kubectl get nodes -l openstack-control-plane=enabled -o name)
      
      kubectl -n osh-system exec -it deployment/openstack-controller -- bash
      
       # For each node in ctl_nodes, execute the following command (a loop sketch is provided after this procedure):
      osctl sos --between <REPORT_PERIOD_TIMESTAMPS> \
         --host <CTL_HOSTNAME> \
         --component nova \
         --component placement \
         --collector elastic \
         --workspace /tmp/ report
      
    • Logs from the Kubernetes objects:

      kubectl -n osh-system exec -it deployment/openstack-controller -- bash
      osctl sos --collector k8s --workspace /tmp/ report
      
    • Nova service data from the API:

      kubectl -n openstack exec -it deployment/keystone-client -- bash
      
      openstack server migration list
      openstack compute service list --long
      openstack resource provider list
      # Get the server event list for each orphaned consumer id
      openstack server event list <SERVER_ID>
      

      Note

       <SERVER_ID> is the orphaned consumer ID from the nova-placement-audit logs.

  5. Create a support case and attach the obtained information.
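
If several compute nodes are affected, the evacuation check from the analysis step above can be wrapped in a loop. The following sketch assumes that the cmp_hostnames variable from step 2 is still set in your shell and reuses the same commands:

   # Check evacuations for every compute node that has orphaned allocations.
   # Assumes cmp_hostnames is still set from step 2.
   for host in ${cmp_hostnames}; do
      echo "Evacuations from ${host}:"
      kubectl -n openstack exec -t deployment/keystone-client -- \
         openstack server migration list --type evacuation --host "${host}" -f json
   done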
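
Similarly, the control plane log collection from step 4 can be scripted instead of being repeated manually for each node. This sketch assumes that the node names returned by kubectl match the host names that osctl expects:

   # Collect nova and placement logs from every control plane node in ctl_nodes.
   for node in ${ctl_nodes}; do
      host=${node#node/}
      kubectl -n osh-system exec -t deployment/openstack-controller -- \
         osctl sos --between <REPORT_PERIOD_TIMESTAMPS> \
         --host "${host}" \
         --component nova \
         --component placement \
         --collector elastic \
         --workspace /tmp/ report
   done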

Remove orphaned allocations

  1. Log in to the nova-api-osapi pod:

    kubectl -n openstack exec -it deployment/nova-api-osapi -- bash
    
  2. Remove orphaned allocations:

    • To remove all found orphaned allocations:

      nova-manage placement audit --verbose --delete
      
    • To remove orphaned allocations on a specific resource provider:

      nova-manage placement audit --verbose --delete --resource_provider <RESOURCE_PROVIDER_UUID>
      
  3. Verify that no orphaned allocations exist:

    nova-manage placement audit --verbose
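
After the next scheduled run of the nova-placement-audit CronJob, the osdpl_nova_audit_orphaned_allocations metric should return to 0 and the NovaOrphanedAllocationsDetected alert should clear. To double-check from outside the pod, you can count the resource providers that still have orphaned allocations in the latest audit report. This reuses the log command from the detection steps and assumes the report structure shown there:

   kubectl -n openstack logs -l application=nova,component=placement-audit -c placement-audit-report | \
      jq '.orphaned_allocations.detected | length'

A result of 0 means that no orphaned allocations remain.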