The tl;dr is to 1) avoid trying to disconnect volumes on the
destination if they were never connected in the first place and
2) avoid trying to disconnect volumes on the destination using block
device info for the source.
Details:
* Only remotely disconnect volumes on the destination if the failure
was not during pre_live_migration(). When pre_live_migration() fails,
its exception handling deletes the Cinder attachment that was created
before re-raising and returning from the RPC call. And the BDM
connection_info in the database is not guaranteed to reference the
destination because a failure could have happened after the Cinder
attachment was created but before the new connection_info was saved
back to the database. In this scenario, there is no way to reliably
disconnect volumes in the destination remotely from the source because
the destination connection_info needed to do it might not be
available.
* Due to the first point, this adds exception handling to disconnect
the volumes while still on the destination, while the destination
connection_info is still available instead of trying to do it
remotely from the source afterward.
* Do not pass Cinder volume block_device_info when calling
rollback_live_migration_on_destination() because volume BDM records
have already been rolled back to contain info for the source by
that point. Not passing volume block_device_info will prevent
driver.destroy() and subsequently driver.cleanup() from attempting to
disconnect volumes on the destination using connection_info for the
source.
Closes-Bug: #1899835
Change-Id: Ia62b99a16bfc802b8ba895c31780e9956aa74c2d
This adds mocking of ComputeManager._live_migration_cleanup_flags()
to simulate no shared storage. Otherwise the test detects shared
storage and skips a second call to _disconnect_volume() that occurs in
the bug scenario when storage is local.
Related-Bug: #1899835
Change-Id: I06b19044876aab9b4585384352f8dccc39984526
This adds a cacheonly flag to invalidate_resource_provider(). When
set, only our cache is invalidated and the provider is not removed from
the tree, which is more efficient.
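
A minimal sketch of the idea, not the actual report client code. The
class name and both attributes (tree, refresh_time) are hypothetical
stand-ins; only the method name and the cacheonly flag come from this
change.

```python
class ProviderCacheSketch:
    """Hypothetical holder of a provider tree plus cache timestamps."""

    def __init__(self):
        self.tree = {}          # provider uuid -> provider data
        self.refresh_time = {}  # provider uuid -> last cache refresh time

    def invalidate_resource_provider(self, uuid, cacheonly=False):
        # With cacheonly=True the provider stays in the tree and only the
        # cached freshness marker is dropped, forcing a later refresh.
        if not cacheonly:
            self.tree.pop(uuid, None)
        self.refresh_time.pop(uuid, None)
```

Keeping the provider in the tree avoids rebuilding its entry when only
stale cached data needs to be refreshed.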
Related to blueprint one-time-use-devices
Change-Id: I04dd5e984c5671d866804c258422e4230fce37b7
This patch enhances the libvirt fixture to better align with the real
libvirt output when handling hostdevs.
It adds the alias tag, which libvirt provides to specify the hostdev
name, and the address tag, which indicates the address seen by
the guest.
These two fields will be used in a subsequent patch to improve the
comparison between source and destination XMLs during migration.
Example:
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x82' slot='0x00' function='0x1'/>
    </source>
    <alias name='hostdev0'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x05'
             function='0x0'/>
  </hostdev>
The goal of this series of patches is to enable VFIO device
migration with kernel variant drivers.
Implements: blueprint migrate-vfio-devices-using-kernel-variant-drivers
Change-Id: I3ee3923f990dd6522a11849551a9d49c9fad426c
This patch adds the necessary documentation identified in:
- pci-passthrough: Explaining live migration and known issues.
- virtual-gpu: Updating the caveats section to clarify what to do
when VF devices are available instead of `mdev`.
The goal of this series of patches is to enable VFIO device
migration with kernel variant drivers.
Implements: blueprint migrate-vfio-devices-using-kernel-variant-drivers
Change-Id: I41271a8af5687fb1d18f9d0852492756e096720d
Today, when a user does not request live-migratable devices, the
migration should fail.
However, this failure is hard to detect because the end result is a
NoValidHost error when Nova exhausts its reschedule attempts. As a
result, it is difficult to determine why scheduling failed.
This patch adds a warning to aid in debugging and identifying the
root cause more easily.
The goal of this series of patches is to enable VFIO device
migration with kernel variant drivers.
Implements: blueprint migrate-vfio-devices-using-kernel-variant-drivers
Change-Id: I64448f30e5d692396c129d9239679e74051cde7f
This patch updates an incorrect comment to reflect the correct
behavior. It also improves variable naming for tags that need to be
removed from the device specifications.
The goal of this series of patches is to enable VFIO device
migration with kernel variant drivers.
Implements: blueprint migrate-vfio-devices-using-kernel-variant-drivers
Change-Id: I0ae1da59014725aa0065a7f4cfa629367fa5eaeb
This patch removes the _test_pci() method, which is no longer necessary
since flavor-based requests can now be live migrated.
The related tests have also been removed.
This fixes a bug where a user requests a live migration with a
flavor-based request and NUMA constraints (e.g., CPU affinity). In
this case, the code encounters the _test_pci() method and fails because
the check was originally designed to enforce port-based requests only,
causing an unnecessary failure.
Notes: This issue was discovered through functional tests that involve
a mix of port-based and flavor-based requests. The failure in this
scenario highlighted the unnecessary constraint.
A functional test reproducing this issue in a mixed-mode scenario
(port request + flavor-based request) will be provided in a subsequent
follow-up patch.
The _test_pci() check was redundant, as a similar verification
is already performed earlier in the migration process.
Closes-Bug: 2103636
Implements: blueprint migrate-vfio-devices-using-kernel-variant-drivers
Change-Id: Icbeaadd94658ed44917d724446d484f6497f29e5
This change adds a latch_error_on_raise decorator, which
is applied to the init_application function in our
common wsgi_app module.
This decorator will catch all non-retryable exceptions
and cause future invocations of the function to always
raise that same exception.
A reset function is also added to the decorated function,
which should be called in our base test class to
prevent cross-test interactions.
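
The latching behaviour described here could look roughly like the
sketch below. It is an illustration under assumptions, not the code in
this change: RetryableError and the retryable parameter are
hypothetical, and the real decorator may decide what is retryable
differently.

```python
import functools


class RetryableError(Exception):
    """Hypothetical stand-in for an error that is safe to retry."""


def latch_error_on_raise(retryable=(RetryableError,)):
    """Remember the first non-retryable error and re-raise it afterwards."""

    def outer(func):
        latched = None  # first non-retryable exception seen, if any

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal latched
            if latched is not None:
                # Do not run the function again once an error is latched.
                raise latched
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                if not isinstance(exc, retryable):
                    latched = exc
                raise

        def reset():
            # Clear the latched error, e.g. between unit tests.
            nonlocal latched
            latched = None

        wrapper.reset = reset
        return wrapper

    return outer
```

In this sketch a retryable error is re-raised without being latched, so
the next call runs the wrapped function again, while calling reset()
restores the decorated function to its initial state.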
Closes-Bug: #2103811
Related-Bug: #1882094
Change-Id: I44b1f7e2acc36a5b557d6d8788f6099f52bbdfb8
The stats module uses the _find_pool() call to find a matching pool for
a new device or a device that is being deallocated. If no existing pool
matches the dev then a new pool is created for it. The pool matching
logic was faulty as it did not remove all the metadata keys, such as
rp_uuid, from the pool. So if the dev did not have that key but the
pool did then the dev did not match.
On the other hand the PCI allocation logic (when PCI in Placement is
enabled) assumed that devices from a single rp_uuid are always in a
single pool. As this assumption was broken by the above bug the PCI
allocation blindly tried to allocate resources for an rp_uuid from each
matching pool causing overallocation.
The main fix in this patch is to ignore the metadata tags in
_find_pool(). In addition, two safety nets are added to the allocation
logic. The logic now asserts that the assumption is correct and if not
(i.e. it found multiple pools with the same rp_uuid) then it bails out.
It also never blindly allocates the same rp_uuid request from multiple
pools.
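
The matching fix can be illustrated with a small sketch. The pool and
device dictionaries, the helper names, and the exact set of ignored
keys are assumptions made for illustration; only rp_uuid is named in
this message as a key that was not being ignored.

```python
# Hypothetical set of keys that describe the pool, not the device.
IGNORED_POOL_KEYS = frozenset({"count", "devices", "rp_uuid"})


def pool_matches(pool, dev_spec, ignored=IGNORED_POOL_KEYS):
    """Compare a pool to a device spec, skipping pool-only metadata."""
    pool_spec = {k: v for k, v in pool.items() if k not in ignored}
    return pool_spec == dev_spec


def find_pool(pools, dev_spec):
    """Return the first pool matching the device spec, or None."""
    for pool in pools:
        if pool_matches(pool, dev_spec):
            return pool
    return None
```

If rp_uuid were left out of the ignored set, a device spec without that
key would never match a pool that has it, and a second pool would be
created for the same resource provider.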
Closes-Bug: #2098496
Change-Id: I9678230397fa1a3c735ee01ed756d5af3b4e1191
Add file to the reno documentation build to show release notes for
stable/2025.1.
Use pbr instruction to increment the minor version number
automatically so that master versions are higher than the versions on
stable/2025.1.
Sem-Ver: feature
Change-Id: Iba42aa129140dc494d99dede17f5ea7b44062d62
Since switching to the SDK, a failed node validation displays the
address of the ValidationResult object instead of the actual message.
This has been broken since patch
Ibb5b168ee0944463b996e96f033bd3dfb498e304.
Closes-Bug: 2100009
Change-Id: I8fbdaadd125ece6a3050b2fbb772a7bd5d7e5304
Signed-off-by: Doug Goldstein <cardoe@cardoe.com>
We agreed on the support envelope in
I2dd906f34118da02783bb7755e0d6c2a2b88eb5d.
Pre-RC1, we need to add a service version in the object.
Post-RC1, depending on whether the release is SLURP or not, we need to
bump the minimum version or not.
This patch only focuses on the pre-RC1 stage.
Given Flamingo will be skippable, we will need a post-RC1 patch that
bumps the minimum version to Epoxy.
HTH.
Change-Id: Id74ebfeaaac7bd116b11ff7bdd86674feb825f0f
In oslo.limit 2.6.0 service endpoint discovery was added, provided by
three new config options:
[oslo_limit]
endpoint_service_type = ...
endpoint_service_name = ...
endpoint_region_name = ...
We can use the same config options, if they are present, to look up the
service ID and region ID we need when calling the
GET /registered_limits API as part of the resource limit enforcement
strategy. This way, the user will not have to configure endpoint_id.
This will look for [oslo_limit]endpoint_id first and if it is not set,
it will do the discovery.
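
The lookup order can be sketched as follows. This is an illustration
only: conf is a plain dict standing in for the [oslo_limit] group, and
discover is a hypothetical callable standing in for whatever performs
the service and region lookup.

```python
def resolve_endpoint(conf, discover):
    """Prefer an explicit endpoint_id, else fall back to discovery."""
    endpoint_id = conf.get("endpoint_id")
    if endpoint_id:
        return {"endpoint_id": endpoint_id}
    # No endpoint_id configured: use the discovery options instead.
    return discover(
        service_type=conf.get("endpoint_service_type"),
        service_name=conf.get("endpoint_service_name"),
        region_name=conf.get("endpoint_region_name"),
    )
```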
Closes-Bug: #1931875
Change-Id: Ida14303115e00a1460e6bef4b6d25fc68f343a4e
I have chosen to do a bit of a cleanup of the lookup of minimum
compute manager versions, because I didn't like how we looked up the
minimum version several times for a single parent call for both create
and resize.
Change-Id: Ifc52d73b1328d3785e72be2c5cf741962c2b95da
There's a TODO to prevent passing random query strings to the
'/os-console-auth-tokens' API that should be addressed while we are
updating the API. Do it now.
Change-Id: Ic19f75b1e26ae048df110f6cd9217b706bf3c0a4
Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
These are *super* annoying (and useless to boot, since there is nothing
we can do about them in the near term). Shut them ⬇️⬇️⬇️ down ⬇️⬇️⬇️.
Change-Id: I469dafa243b95749b34503c1f3e905d9d8c780d4
Signed-off-by: Stephen Finucane <stephenfin@redhat.com>