Commit Graph

61661 Commits

Author SHA1 Message Date
Zuul 9efabaa993 Merge "Reproducer for bug 2114951" 2025-09-08 17:48:28 +00:00
Zuul 0dd7cb1fb0 Merge "libvirt: Disable VMCoreInfo device for SEV-encrypted instances" 2025-09-05 16:32:24 +00:00
Zuul 7d5521ac84 Merge "[pci]Keep used dev in Placement regardless of dev_spec" 2025-09-05 15:36:30 +00:00
Zuul ab744e5040 Merge "[PCI tracker]Remove non configured devs when freed" 2025-09-05 15:36:13 +00:00
Zuul cb6ed0e3c0 Merge "Reproduce bug/2115905" 2025-09-05 14:40:57 +00:00
Balazs Gibizer 4495f1f019 [pci]Keep used dev in Placement regardless of dev_spec
This changes the PCI Placement translator edge case handling logic to
resolve a bug preventing VM deletion.

If a device is allocated but removed from the dev_spec, then we need to
keep the device in Placement; otherwise the Placement update will be
rejected, as we would be trying to delete an RP that has allocations.
Such a rejection prevents the deletion of the VM that is using the
removed device.

The alternative would be to not allow the nova-compute service to
start if it detects this situation. However, this situation can
happen in at least two very different cases:
1. The admin removed a dev_spec. In this case adding the dev_spec back,
   removing the VM, then removing the dev_spec again is the right course
   of action, and nova-compute failing to start would be an acceptable
   way to enforce this.

2. A device disappeared because the HW has died. In this case not
   allowing nova-compute to start up would prevent the admin from
   migrating the other VMs away from the host before doing a HW
   replacement.

Note that this is a fairly complex change, because based purely on the
PciDevice object we cannot differentiate between two cases:

1. A PciDevice object is being removed because the related device spec
   was removed from the configuration or the device has disappeared
   from the hypervisor.

2. A PciDevice object was held back for a while because the device spec
   was removed (or the device disappeared from the hypervisor) while the
   device was allocated to a VM, and now that VM is undergoing deletion.

In both cases the PCI in Placement logic sees a PciDevice object in the
REMOVED status with dev.instance_uuid = None. However, the two cases
require different handling:

1. The related inventory can be removed from Placement

2. The related inventory cannot be removed from Placement, as it is
   still allocated to the VM that is undergoing deletion.

The second case is due to the sequence of events during a VM deletion:
* We destroy the VM on the hypervisor.
* We update the PCI tracker to free the device. As the device was held
  back, the tracker not only frees the device but also removes it,
  because it is no longer configured in the dev_spec and therefore must
  not go back to the AVAILABLE state.
* When the PCI tracker is updated it calls the PCI in Placement logic
  to update Placement inventories as well. At this point the VM deletion
  is still in progress and the VM's allocation hasn't been deleted in
  Placement, so the Placement inventory cannot be removed as it is still
  allocated.
* After the resource tracker update is finished the compute manager
  deletes the VM's allocation in Placement.

So in this edge case we temporarily keep the Placement inventory and
only remove it in a subsequent periodic run, once we are sure the
VM's allocation is gone. This means there is a time window when
the Placement inventory shows an extra resource even though that
resource has already been removed from the PCI tracker. During this
window the scheduler might select a host based on this ghost inventory,
and the compute resource tracker will reject the boot request, forcing
a normal re-schedule.
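The keep-then-remove behavior described above can be sketched roughly like this (a minimal illustration; the function name and data shapes are assumptions, not the actual nova PCI-in-Placement translator code):

```python
def surviving_inventories(configured_devs, allocated_devs):
    """Decide which device inventories must stay in Placement.

    configured_devs: addresses still present in the PCI tracker / dev_spec
    allocated_devs: addresses whose RPs still hold allocations in Placement
    """
    # A removed device whose VM deletion is still in flight keeps its
    # inventory; a later periodic run drops it once the allocation is gone.
    return set(configured_devs) | set(allocated_devs)
```

Once the compute manager deletes the VM's allocation, the address drops out of the allocated set and the next periodic run removes the ghost inventory.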

Closes-Bug: #2115905
Change-Id: Ie9d311ea9f59ff49593003e3773b690dd36fdeb2
Signed-off-by: Balazs Gibizer <gibi@redhat.com>
2025-09-04 10:05:20 +00:00
Balazs Gibizer f37cdf0c41 [PCI tracker]Remove non configured devs when freed
The PCI tracker handles the case when a device spec is removed from
the configuration while a device is still being allocated. It keeps the
device until the VM is deleted to avoid inconsistencies.

However, fully removing such a device requires not just the VM
deletion but also a nova-compute restart. The PCI tracker only frees
the device during VM deletion and does not remove it until the next
nova-compute startup. This allows the device to be re-allocated to
another VM even though the device is no longer allowed by a device_spec.

This change adds yet another in-memory dict to the PCI tracker to
track the devices that are only kept until they are freed. Then during
free() this dict is consulted, and if the device is found there, the
device is marked for removal as well.

This kills two birds with one stone:

* We prevent the re-allocation of the device, as the state of the
  device will be set to REMOVED, not AVAILABLE, during VM deletion.

* As PCI in Placement relies on the state of the device to decide what
  to track in Placement, this change makes sure that a device that
  needs to be removed is now removed from Placement too. Note that
  another bug currently prevents this removal, but at least the
  reproducers of that bug now behave the same regardless of how many
  devices belong to the same RP in Placement.
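The free-time removal can be sketched as follows (the class, dict, and method names here are illustrative assumptions, not the actual nova PCI tracker implementation):

```python
class PciTrackerSketch:
    """Toy model of tracking devices whose dev_spec entry is gone."""

    def __init__(self):
        self.devices = {}              # address -> device dict
        # devices kept only until their current VM frees them
        self.dev_pending_removal = {}  # address -> device dict

    def track_unconfigured_allocated(self, dev):
        # the device spec is gone but a VM still holds the device
        self.devices[dev['address']] = dev
        self.dev_pending_removal[dev['address']] = dev

    def free(self, address):
        dev = self.devices[address]
        if address in self.dev_pending_removal:
            # no longer whitelisted: remove instead of making it
            # AVAILABLE, so it can never be re-allocated
            dev['status'] = 'REMOVED'
            del self.dev_pending_removal[address]
        else:
            dev['status'] = 'AVAILABLE'
        dev['instance_uuid'] = None
        return dev
```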

Related-Bug: #2115905
Change-Id: I63c8fb2669a3c6b3adb77d210c0f9b39d3657c80
Signed-off-by: Balazs Gibizer <gibi@redhat.com>
2025-09-04 10:30:19 +02:00
Balazs Gibizer d86aa2d15a Reproduce bug/2115905
Both the PCI tracker and the PCI in Placement logic handle the case
when a device spec is removed from the configuration while a device
is still allocated.

However, there are edge cases in PCI in Placement that are not handled
well. Namely, if the VM with this allocation is deleted, then
depending on the number of VFs the PF originally had, the logic might
try to delete the RP before the allocation is removed. That is
rejected by Placement. This prevents the deletion of such a VM and
therefore blocks one of the ways the original inconsistency can be
resolved.

Note that with this patch we see two additional behaviors worth
mentioning:
* When the VM is successfully deleted (in the single VF or PF case) the
  PCI tracker still keeps the now-free device in the DB, and therefore
  PCI in Placement also keeps the RP. This keeps the non-whitelisted
  device available for allocation until the next nova-compute restart.

* The PCI in Placement logic behaves differently depending on whether
  the last device is removed from an RP, or other devices remain on the
  RP, some of which can be removed and some of which cannot due to
  allocations.

Related-Bug: #2115905
Change-Id: Ib3febb77299da65ada24ed49849c04cbf3c41af1
Signed-off-by: Balazs Gibizer <gibi@redhat.com>
2025-09-04 10:19:31 +02:00
Zuul 9f156aa954 Merge "Fix 'nova-manage image_property set' command" 2025-09-03 17:29:24 +00:00
Zuul 74e4ff46db Merge "Do not yield in threading mode" 2025-09-03 16:59:21 +00:00
René Ribaud aa59133626 Reproducer for bug 2114951
An ambiguous regexp prevents using a device_filename like
'mkwinimage-cdrom'. The regexp matches a single character in the range
between '_' (code point 95) and 'r' (code point 114), case sensitively.
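The pitfall can be illustrated with a small standalone example (the patterns below are illustrative, not the actual nova regexp):

```python
import re

# An unescaped '-' between two characters inside a character class
# creates a RANGE, here from '_' (code point 95) to 'r' (code point 114).
ambiguous = re.compile(r'[_-r]')

# Escaping the hyphen (or placing it first/last in the class) makes it
# a literal '-' instead of a range operator.
literal = re.compile(r'[_\-r]')
```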

Related-Bug: #2114951
Change-Id: I5c7ce18eb635a75d5aadc889e730ed77c9a10dc3
Signed-off-by: René Ribaud <rribaud@redhat.com>
2025-09-03 18:38:52 +02:00
Zuul fa31983299 Merge "[CI]Make nova-tox-py312-threading voting" 2025-09-03 10:40:39 +00:00
Zuul 6fa7f807ad Merge "Fix duplicate words" 2025-09-03 10:24:05 +00:00
Zuul a4df1dea8c Merge "Fix pci_tracker.save to delete all removed devs" 2025-09-02 20:20:45 +00:00
Zuul ba2d41e463 Merge "Add service version for Flamingo" 2025-09-02 20:20:10 +00:00
René Ribaud 60ba6afc49 Add service version for Flamingo
We agreed by I2dd906f34118da02783bb7755e0d6c2a2b88eb5d on the support
envelope.
Pre-RC1, we need to add a service version in the object.
Post-RC1, depending on whether it is a SLURP release or not, we need to
bump the minimum version or not.

This patch only focuses on the pre-RC1 stage.
Given that Gazpacho won't be skippable, we won't need a post-RC1 patch
to update the minimum version, which will continue to support Epoxy.

HTH.

Signed-off-by: René Ribaud <rribaud@redhat.com>
Change-Id: I5bf6ad1077fe62e6ff628d211b745857167280fb
2025-09-02 15:51:00 +02:00
René Ribaud 73724fef9a doc: mark the maximum microversion for 2025.2 Flamingo
Change-Id: I4158fc072ebeda7709bc08eb7d0b924cbc99ca5a
Signed-off-by: René Ribaud <rribaud@redhat.com>
2025-09-02 15:37:02 +02:00
Rajesh Tailor 68fbace8af Fix duplicate words
This change removes duplicated consecutive words from the docs
as well as from the code.

Signed-off-by: Rajesh Tailor <ratailor@redhat.com>
Change-Id: I236ff41fccf831023b6f85840097148a30e84743
2025-09-02 18:06:31 +05:30
Zuul 9c1d971f01 Merge "Reproduce that only half of the PCI devs are removed" 2025-09-02 11:08:42 +00:00
Rajesh Tailor 19f206f58c Fix 'nova-manage image_property set' command
Currently, if an operator wants to set traits using the 'nova-manage
image_property set' command, it fails with the error below, because
ImageMetaProps does not store traits as individual fields but in the
'traits_required' field, which is of type list.

'Invalid image property name trait:CUSTOM_XYZ'

The setting of traits is handled by the _set_attr_from_trait_names
method here [1].

This change handles the issue by continuing the loop if the property
name starts with 'trait'.

[1] https://opendev.org/openstack/nova/src/commit/725a307693806e6e32834198e23be75f771bebc1/nova/objects/image_meta.py#L708-L714
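A minimal sketch of the fix (the function shape here is an assumption; the real logic lives in nova-manage and delegates trait handling to _set_attr_from_trait_names):

```python
def set_properties(stored_props, updates, known_fields):
    """Apply image property updates, routing trait:* names into the
    list-typed 'traits_required' field instead of validating them as
    individual fields."""
    for name, value in updates.items():
        if name.startswith('trait'):
            # the fix: skip per-field validation and aggregate the trait
            stored_props.setdefault('traits_required', []).append(
                name.split(':', 1)[1])
            continue
        if name not in known_fields:
            raise ValueError('Invalid image property name %s' % name)
        stored_props[name] = value
    return stored_props
```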

Closes-Bug: #2096341
Change-Id: Ifc20894801f723627726e3c9bed7076144542660
Signed-off-by: Rajesh Tailor <ratailor@redhat.com>
2025-09-02 12:22:55 +05:30
Zuul 539e971126 Merge "Follow-up of AMD SEV-ES support" 2025-09-01 11:59:27 +00:00
Zuul aed238c064 Merge "Drop CentOS 8 Stream" 2025-09-01 11:30:40 +00:00
Zuul e700b18f2b Merge "Replace remaining usage of Ubuntu Jammy" 2025-09-01 11:30:28 +00:00
Zuul 8ddf918a0b Merge "[test]RPC using threading or eventlet selectively" 2025-09-01 10:11:38 +00:00
Zuul 023c1eab47 Merge "Run unit test with threading mode" 2025-09-01 10:11:11 +00:00
Zuul 29eaf28acc Merge "Update min support for Flamingo" 2025-08-31 18:13:06 +00:00
Zuul 4301fc390e Merge "api: Fix validators for hw:cpu_max_* extra specs" 2025-08-31 18:12:45 +00:00
Takashi Kajinami 583d88308f Replace remaining usage of Ubuntu Jammy
Ubuntu Jammy is no longer supported since 2025.2. Replace it with
Ubuntu Noble, which is used in the other jobs.

Change-Id: I790fb06ede2c41cb80b3d2e8ff7faa7315c84016
Signed-off-by: Takashi Kajinami <kajinamit@oss.nttdata.com>
2025-08-31 16:36:44 +09:00
Zuul 7b8e054bd2 Merge "api: Correct expected errors" 2025-08-29 21:12:29 +00:00
Takashi Kajinami 79846eb0d0 libvirt: Disable VMCoreInfo device for SEV-encrypted instances
When the VMCoreInfo device is enabled, the QEMU fw_cfg device in the
guest OS requires DMA between the host OS and the guest OS through the
device. However, DMA is prohibited when guest memory is encrypted using
SEV, and the attempt results in a kernel crash.

Do not add the VMCoreInfo device when memory encryption is enabled.
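The guard amounts to something like the following (the function name is an assumption; in nova the check sits in the libvirt driver when building the guest configuration):

```python
def should_add_vmcoreinfo(mem_encryption_enabled):
    """Decide whether the vmcoreinfo device may be attached to a guest.

    fw_cfg needs DMA between host and guest, which SEV-encrypted guest
    memory forbids, so skip the device for encrypted guests.
    """
    return not mem_encryption_enabled
```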

Closes-Bug: #2117170
Change-Id: I05c7b1ae46ccd8d9aa42456b493ac6ee7ddd8bae
Signed-off-by: Takashi Kajinami <kajinamit@oss.nttdata.com>
2025-08-29 21:19:10 +09:00
Zuul 07ab08aa69 Merge "Allow to start unit test without eventlet" 2025-08-29 04:57:32 +00:00
Takashi Kajinami 87385d2411 Follow-up of AMD SEV-ES support
Address a few improvements we agreed to cover in follow-ups.

Also fix a few problems detected during the code update:
 - Fix the SEV-ES RP not being purged when SEV and SEV-ES are disabled
   at the same time. The previous logic required 2 cycles, which is
   unnecessary.
 - Fix the lack of the NOKS policy in SEV-ES.

Change-Id: I59866d39fcc6720e338c6736dffab4fd56b853da
Signed-off-by: Takashi Kajinami <kajinamit@oss.nttdata.com>
2025-08-29 13:54:19 +09:00
Zuul dcf90dbb25 Merge "Ask for pre-prod testing for native threading" 2025-08-29 04:35:24 +00:00
Zuul ce9dcea024 Merge "Purge nested SEV RPs when SEV is disabled" 2025-08-28 23:27:04 +00:00
Zuul c6aa3a9fa9 Merge "Add functional test scenario for mixed SEV RPs" 2025-08-28 23:25:14 +00:00
Zuul 32d76d08cb Merge "libvirt: Launch instances with SEV-ES memory encryption" 2025-08-28 23:24:30 +00:00
Zuul f4ca2e3ef9 Merge "Add hw_mem_encryption_model image property" 2025-08-28 21:03:27 +00:00
Zuul d5134798de Merge "Detect AMD SEV-ES support" 2025-08-28 20:36:36 +00:00
Zuul a5670dc442 Merge "Migrate MEM_ENCRYPTION_CONTEXT from root provider" 2025-08-28 20:36:20 +00:00
Takashi Kajinami a8386bdab3 Purge nested SEV RPs when SEV is disabled
We can determine the exact names of these RPs from the compute node
name, independently of how nova is configured, so we can easily purge
these RPs.

Change-Id: I0a18e3a3750137061e04765f2feaf4889c6f5606
Signed-off-by: Takashi Kajinami <kajinamit@oss.nttdata.com>
2025-08-28 08:50:42 +09:00
Takashi Kajinami af287b71c4 Add functional test scenario for mixed SEV RPs
As a follow-up to change Iad51c32d0f64ef52513bd2f2b517c91f29c63787,
add a functional test scenario to ensure that new instances can be
created even when a cluster has both a compute node with the old SEV RP
and another with the reshaped SEV RP, simulating a real-world upgrade
of an existing cluster with the SEV feature enabled.

Change-Id: I2c576f8de05b69ab51743db53acf52bc2a35eb59
Signed-off-by: Takashi Kajinami <kajinamit@oss.nttdata.com>
2025-08-28 08:50:15 +09:00
Takashi Kajinami 4f5a3f3c00 libvirt: Launch instances with SEV-ES memory encryption
This is the last piece to allow users to request AMD SEV-ES for memory
encryption instead of AMD SEV. The CPU feature for memory encryption
can now be requested via the hw:mem_encryption_model flavor extra spec
or via the hw_mem_encryption_model image property.

Implements: blueprint amd-sev-es-libvirt-support
Change-Id: Ifc9b86ad7db887cc22b2cd252fe8adc81fdc29c6
Signed-off-by: Takashi Kajinami <kajinamit@oss.nttdata.com>
2025-08-28 08:47:49 +09:00
Takashi Kajinami dc6641baad Add hw_mem_encryption_model image property
This is prep work to support launching instances with AMD SEV-ES memory
encryption and adds the object field to select the CPU feature to
encrypt and protect memory data of instances.

Partially-Implements: blueprint amd-sev-es-libvirt-support
Change-Id: I71fde5438d4e22c9e2566f8a684c5a965a7f3dd3
Signed-off-by: Takashi Kajinami <kajinamit@oss.nttdata.com>
2025-08-28 08:47:49 +09:00
Takashi Kajinami 6c0a689d80 Detect AMD SEV-ES support
Detect AMD SEV-ES support by kernel/qemu/libvirt and generate a nested
RP for ASID slots for SEV-ES under the compute node RP.

Deprecate the [libvirt] num_memory_encryption_guests option because
the option is effective only for SEV, and now the maximum numbers for
SEV/SEV-ES guests can be detected by domain capabilities presented by
libvirt.

Note that creating an instance with memory encryption enabled now
requires the AMD SEV trait, because these instances can't run with the
SEV-ES slots, which are added by this change.

Partially-Implements: blueprint amd-sev-es-libvirt-support
Change-Id: I5968e75325b989225ed1fc6921257751ae227a0b
Signed-off-by: Takashi Kajinami <kajinamit@oss.nttdata.com>
2025-08-28 08:47:45 +09:00
Ghanshyam Maan f914cb185c Add service role in Nova policy
The RBAC community-wide goal phase 2 [1] is to add the 'service'
role to the policy rules of the service APIs. This commit
defaults the service APIs to the 'service' role, so that
service APIs are allowed only for service users.

Tempest tests are also modified to simulate service-to-service
communication: Tempest sends a user with the service role to the nova
API.
- https://review.opendev.org/c/openstack/tempest/+/892639

Partially-Implements: blueprint policy-service-role-default

[1] https://governance.openstack.org/tc/goals/selected/consistent-and-secure-rbac.html#phase-2

Change-Id: I1565ea163fa2c8212f71c9ba375654d2aab28330
Signed-off-by: Ghanshyam Maan <gmaan@ghanshyammann.com>
2025-08-27 19:34:04 +00:00
Balazs Gibizer ea50365cce Do not yield in threading mode
If a service runs in threading mode, nova.utils.cooperative_yield is a
no-op, as yielding is only necessary under eventlet.
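A rough sketch of this behavior (detecting threading mode via the env variable below is an assumption; nova's real helper is nova.utils.cooperative_yield and its mode detection differs in detail):

```python
import os

def cooperative_yield(threading_mode=None):
    """Yield to other greenthreads under eventlet; no-op under threading."""
    if threading_mode is None:
        # assumed signal: the same env variable the series uses to
        # disable eventlet monkey-patching
        threading_mode = os.environ.get(
            'OS_NOVA_DISABLE_EVENTLET_PATCHING', '').lower() == 'true'
    if threading_mode:
        return  # native threads are preempted by the OS scheduler
    import eventlet
    eventlet.sleep(0)  # give other greenthreads a chance to run
```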

Change-Id: I72a52262f5c501f77d23ed56cbcd1a9c2be72fa7
Signed-off-by: Balazs Gibizer <gibi@redhat.com>
2025-08-27 19:03:34 +02:00
Balazs Gibizer 350cdd1b5e [CI]Make nova-tox-py312-threading voting
Change-Id: I6a220d03f7c879af0d714740102b2d84ce61ca69
Signed-off-by: Balazs Gibizer <gibi@redhat.com>
2025-08-27 19:03:34 +02:00
Balazs Gibizer 1318cd48a1 [test]RPC using threading or eventlet selectively
The nova test code was hardcoded to run the RPC servers with the
eventlet executor. We change that to be dynamic: based on how the tests
were started, they can use either eventlet or threading.

This makes some of the so-far-hanging RPC-dependent unit tests pass.
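The selection can be sketched like this (the helper name is assumed; the env variable is the one this series uses to disable eventlet patching):

```python
import os

def rpc_executor():
    """Pick the RPC server executor matching how the process started."""
    disabled = os.environ.get(
        'OS_NOVA_DISABLE_EVENTLET_PATCHING', '').lower() == 'true'
    return 'threading' if disabled else 'eventlet'
```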

Change-Id: I5012122fe66d41459b68202e750391a1939d70d9
Signed-off-by: Balazs Gibizer <gibi@redhat.com>
2025-08-27 19:03:30 +02:00
Balazs Gibizer 83eed99a9f Run unit test with threading mode
The py312-threading tox target runs the currently working unit tests
in threading mode. We keep an exclude list of tests that are failing or
hanging, and the current test list might still contain unstable tests.

This also adds a non-voting zuul job to run the new target.

Change-Id: Ibf41fede996fbf2ebaf6ae83df8cfde35acb2b7e
Signed-off-by: Balazs Gibizer <gibi@redhat.com>
2025-08-27 19:01:35 +02:00
Balazs Gibizer b278240370 Allow to start unit test without eventlet
The end goal is to be able to run at least some of the unit tests
without eventlet, but a few things prevent that for now.

We need to make sure that the oslo.service backend is not initialized
to eventlet by any early import code, before our monkey_patch module
can do the selective backend selection based on the env variable.

The nova.tests.unit module had some import-time code execution that
forces imports that initialize the oslo.service backend too early,
way before nova would do it in normal execution. We could remove
objects.register_all() from nova/tests/unit/__init__.py, as tests seem
to pass without it; still, that alone would not be enough, so I
eventually decided to keep it.

The other issue is that unit test discovery imports all modules
under nova.tests.unit, which eventually imports oslo.messaging, and
that also forces oslo.service backend selection.

So we injected an early call to our smart monkey_patch module to
preempt that. This does not change the set of imported modules, as the
monkey_patch module is imported anyhow via the nova.test module; it
just changes the order so that the oslo.service backend selection
happens explicitly.

After this patch the unit test can be run via

  OS_NOVA_DISABLE_EVENTLET_PATCHING=true tox -e py312

Most of the tests will pass, but a bunch of tests time out or hang.

Change-Id: I210cb6a30deaee779d55f88f0f57584c65b0dc05
Signed-off-by: Balazs Gibizer <gibi@redhat.com>
2025-08-27 18:54:26 +02:00