Two traits, COMPUTE_ADDRESS_SPACE_PASSTHROUGH and
COMPUTE_ADDRESS_SPACE_EMULATED, are controlled based on the Libvirt and
QEMU versions. Since both are supported from the same Libvirt and QEMU
versions, Nova handles them in the same way.
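As a rough illustration (the version numbers and helper below are
assumptions, not the driver's actual minimums), both traits can be
derived from a single support check:

    def address_space_traits(libvirt_version, qemu_version,
                             min_libvirt=(8, 0, 0), min_qemu=(6, 2, 0)):
        # Both traits share the same minimum versions, so one check
        # enables or disables them together.
        supported = (libvirt_version >= min_libvirt and
                     qemu_version >= min_qemu)
        return {
            'COMPUTE_ADDRESS_SPACE_PASSTHROUGH': supported,
            'COMPUTE_ADDRESS_SPACE_EMULATED': supported,
        }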
Blueprint: libvirt-maxphysaddr-support
Depends-On: https://review.opendev.org/c/openstack/os-traits/+/871226
Signed-off-by: Nobuhiro MIKI <nmiki@yahoo-corp.jp>
Change-Id: If6c7169b7b8f43ad15a8992831824fb546e85aab
QEMU >= 5.0.0 bumped the default tb-cache size to 1 GiB (from 32 MiB),
which made it difficult to run multiple guest VMs on systems with less
memory. With Libvirt >= 8.0.0 it is possible to configure a lower
tb-cache size.
The config option below is introduced to allow configuring the TB cache
size as the environment needs; it only applies to 'virt_type=qemu':
[libvirt]tb_cache_size
Also enable this option in the nova-next job.
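For example, an operator might set something like the following in
nova.conf (the value is only illustrative and assumed to be in MiB):

    [libvirt]
    virt_type = qemu
    tb_cache_size = 128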
[1] https://github.com/qemu/qemu/commit/600e17b26
[2] https://gitlab.com/libvirt/libvirt/-/commit/58bf03f85
Closes-Bug: #1949606
Implements: blueprint libvirt-tb-cache-size
Change-Id: I49d2276ff3d3cc5d560a1bd96f13408e798b256a
As of I8ca059a4702471d4d30ea5a06079859eba3f5a81 validations
are now required for test_rebuild_volume_backed_server.
Validations are also required for any volume attach/detach based test
in general due to known QEMU issues.
This patch just turns them back on to unblock the gate.
Closes-Bug: #2025813
Change-Id: Ia198f712e2ad277743aed08e27e480208f463ac7
We add a new, specific policy that applies when a host value is provided
for cold-migrate, but by default it is an admin-only rule in order not to
change the behaviour.
Change-Id: I128242d5f689fdd08d74b1dcba861177174753ff
Implements: blueprint cold-migrate-to-host-policy
We are about to drop Fedora support as the latest image upstream has
been transitioned to EOL. CentOS 9 Stream has evolved as the replacement
platform for new features. The patch which removes the Fedora jobs and
nodeset from devstack:
https://review.opendev.org/c/openstack/devstack/+/885467
Change-Id: Ib7d3dd93602c94fd801f8fe5daa26353b04f589b
Previously, we archived deleted rows in batches of max_rows parents +
their child rows in a single database transaction. Doing it that way
limited how high a value of max_rows could be specified by the caller
because of the size of the database transaction it could generate.
For example, in a large scale deployment with hundreds of thousands of
deleted rows and constant server creation and deletion activity, a
value of max_rows=1000 might exceed the database's configured maximum
packet size or timeout due to a database deadlock, forcing the operator
to use a much lower max_rows value like 100 or 50.
And when the operator has e.g. 500,000 deleted instances rows (and
millions of deleted rows total) they are trying to archive, being
forced to use a max_rows value several orders of magnitude lower than
the number of rows they need to archive was a poor user experience.
This changes the logic to archive one parent row and its foreign key
related child rows at a time, each in a single database transaction,
stopping per table as soon as the total number of archived rows reaches
max_rows. Doing this will allow operators to choose more predictable
values for max_rows and get more progress per invocation of
archive_deleted_rows.
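A minimal sketch of the batching idea (purely illustrative, with made-up
data structures rather than Nova's real code):

    def archive_in_small_batches(deleted_parents, children_of, max_rows):
        # One parent row plus its child rows would be archived in its own
        # small database transaction, so the transaction size stays bounded
        # no matter how large max_rows is.
        archived = 0
        for parent in deleted_parents:
            if archived >= max_rows:
                break
            batch = children_of.get(parent, []) + [parent]
            archived += len(batch)
        return archived

With this shape, max_rows bounds the total progress per call rather than
the size of any single transaction.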
Closes-Bug: #2024258
Change-Id: I2209bf1b3320901cf603ec39163cf923b25b0359
The regexes in test_archive_deleted_rows for multiple cells were
incorrect in that they were not isolating the search pattern and could
therefore also match other rows in the result table, resulting in false
positives.
This fixes the regexes and also adds one more server to the test
scenario in order to make sure archive_deleted_rows iterates at least
once to expose bugs that may be present in its internal iteration.
This patch is in preparation for a future patch that will change the
logic in archive_deleted_rows. Making this test more robust will more
thoroughly test for regression.
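A hypothetical example of the kind of false positive an unanchored
pattern can produce (the rows shown are made up, not the test's real
output):

    import re

    output = ("| shadow_instances | 5 |\n"
              "| instances        | 0 |")
    # Unanchored: accidentally matches the shadow_instances row.
    print(bool(re.search(r"instances\s*\|\s*5", output)))            # True
    # Isolated to the start of its own row: no false positive.
    print(bool(re.search(r"^\| instances\s*\|\s*5", output, re.M)))  # False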
Change-Id: If39f6afb6359c67aa38cf315ec90ffa386d5c142
In Icb913ed9be8d508de35e755a9c650ba25e45aca2 we forgot to add a privsep
decorator for the set_offline() method.
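For context, a privsep entrypoint looks roughly like the sketch below
(the context definition and sysfs path are illustrative, not the exact
Nova code):

    from oslo_privsep import capabilities, priv_context

    sys_admin_pctxt = priv_context.PrivContext(
        'example',
        cfg_section='example_privsep',
        pypath=__name__ + '.sys_admin_pctxt',
        capabilities=[capabilities.CAP_SYS_ADMIN])

    @sys_admin_pctxt.entrypoint
    def set_offline(device_name):
        # Without the decorator this write runs unprivileged and fails.
        with open('/sys/block/%s/device/state' % device_name, 'w') as f:
            f.write('offline')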
Change-Id: I769d35907ab9466fe65b942295fd7567a757087a
Closes-Bug: #2022955
The late anti-affinity check runs in the compute manager to prevent
parallel scheduling requests from violating the anti-affinity server
group policy. When the check fails the instance is re-scheduled.
However, this failure is counted as a real instance boot failure of the
compute host and can lead to de-prioritization of the compute host
in the scheduler via the BuildFailureWeigher. As the late anti-affinity
check does not indicate any fault of the compute host itself, it
should not be counted towards the build failure counter.
This patch adds new build results to handle this case.
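A simplified sketch of the intent (the result names here are
placeholders, not necessarily the exact values added):

    FAILED = 'failed'
    RESCHEDULED = 'rescheduled'
    # New, placeholder name: a re-schedule caused by the late policy check.
    RESCHEDULED_BY_POLICY = 'rescheduled_by_policy'

    def update_failed_builds(result, failed_builds):
        # Only a genuine build failure bumps the host's failure counter.
        if result == FAILED:
            return failed_builds + 1
        return failed_builds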
Closes-Bug: #1996732
Change-Id: I2ba035c09ace20e9835d9d12a5c5bee17d616718
Signed-off-by: Yusuke Okada <okada.yusuke@fujitsu.com>
The ComputeNode object already has a service_id field that we stopped
using a while ago. This moves us back to the point where we set it when
creating new ComputeNode records, and also migrates existing records
when they are loaded.
The resource tracker is created before we may have created the
service record, but is updated afterwards in the pre_start_hook().
So this adds a way for us to pass the service_ref to the resource
tracker during that hook so that it is present before the first time
we update all of our ComputeNode records. It also makes sure to pass
the Service through from the actual Service manager instead of looking
it up again to make sure we maintain the tight relationship and avoid
any name-based ambiguity.
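Conceptually the flow looks something like this sketch (attribute and
method names are assumptions for illustration only):

    class ResourceTracker(object):
        def __init__(self):
            # Not known yet when the tracker is constructed.
            self.service_ref = None

        def update_compute_node(self, node):
            # By now the hook has run, so new ComputeNode records can
            # carry service_id from the start.
            node.service_id = self.service_ref.id

    class ComputeManager(object):
        def pre_start_hook(self, service_ref):
            # Pass the Service object through instead of looking it up
            # again by name.
            self.rt.service_ref = service_ref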
Related to blueprint compute-object-ids
Change-Id: I5e060d674b6145c9797c2251a2822106fc6d4a71
When [quota]count_usage_from_placement = true or
[quota]driver = nova.quota.UnifiedLimitsDriver, cores and ram quota
usage are counted from placement. When an instance is SHELVED_OFFLOADED,
it will not have allocations in placement, so its cores and ram should
not count against quota during that time.
This means however that when an instance is unshelved, there is a
possibility of going over quota if the cores and ram it needs were
allocated by some other instance(s) while it was SHELVED_OFFLOADED.
This fixes a bug where quota was not being properly enforced during
unshelve of a SHELVED_OFFLOADED instance when quota usage is counted
from placement. Test coverage is also added for the "recheck" quota
cases.
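For reference, counting usage from placement is enabled with a setting
like the following in nova.conf:

    [quota]
    count_usage_from_placement = true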
Closes-Bug: #2003991
Change-Id: I4ab97626c10052c7af9934a80ff8db9ddab82738
This adds test coverage for:
* Shelve/unshelve offloaded with legacy quota usage
* Shelve/unshelve offloaded with quota usage from placement
* Shelve/unshelve offloaded with unified limits
* Shelve/unshelve with legacy quota usage
* Shelve/unshelve with quota usage from placement
* Shelve/unshelve with unified limits
Related-Bug: #2003991
Change-Id: Icc9b6366aebba2f8468e2127da7b7e099098513a
This logging would be helpful in debugging issues when
OrphanedObjectError is raised by an instance. Currently, there is
not a way to identify which instance is attempting to lazy-load a
field while orphaned. Being able to locate the instance in the
database could also help with recovery/cleanup when a problematic
record is disrupting operation of a deployment.
Change-Id: I093de2839c1bb7c949a0812e07b63de4cc5ed167
We are still having some issues in the gate where greenlets from
previous tests continue to run while the next test starts, causing
false negative failures in unit or functional test jobs.
This adds a new fixture that will ensure
GreenThreadPoolExecutor.shutdown() is called with wait=True, to wait
for greenlets in the pool to finish running before moving on.
In local testing, doing this does not appear to adversely affect test
run times, which was my primary concern.
As a baseline, I ran a subset of functional tests in a loop
until failure without the patch and after 11 hours, I got a failure
reproducing the bug. With the patch, running the same subset of
functional tests in a loop has been running for 24 hours and has not
failed yet.
Based on this, I think it may be worth trying this out to see if it
will help stability of our unit and functional test jobs. And if it
ends up impacting test run times or causes other issues, we can
revert it.
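The core of the fixture is roughly the following (a simplified sketch,
not the exact implementation):

    import fixtures
    import futurist

    class GreenThreadPoolShutdownWait(fixtures.Fixture):
        """Make executor shutdown wait for its greenthreads to finish."""

        def _setUp(self):
            real_shutdown = futurist.GreenThreadPoolExecutor.shutdown

            def wait_shutdown(executor, wait=True):
                # Always wait so greenlets from one test cannot keep
                # running while the next test starts.
                return real_shutdown(executor, wait=True)

            self.useFixture(fixtures.MonkeyPatch(
                'futurist.GreenThreadPoolExecutor.shutdown', wait_shutdown))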
Partial-Bug: #1946339
Change-Id: Ia916310522b007061660172fa4d63d0fde9a55ac
The validate-backport job started to fail as only the old stable branch
naming was accepted. This patch extends the script to also allow numbers
and dots in the branch names (like stable/2023.1).
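For illustration, a pattern along these lines accepts both the old and
the new naming (the exact expression used by the script may differ):

    import re

    branch_re = re.compile(r'^stable/[a-z0-9.]+$')
    assert branch_re.match('stable/wallaby')
    assert branch_re.match('stable/2023.1')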
Change-Id: Icbdcd5d124717e195d55d9e42530611ed812fadd
The recent change(s) to enable a lot more SSHABLE checks puts the
runtime of the ceph job really close to the 2h timeout even when
things are working. Sometimes it times out before it finishes even
though things are progressing. Bump the timeout to avoid that.
Also bump us to 8G swap to match what is set on the parent ceph job
when we upgraded to jammy. We could just unset this, but better to
pin it high in case that job (defined elsewhere) changes. Our job
is the largest ceph job, so it makes sense that it keeps its own
swap level high.
Change-Id: I6cefd87671614d87d92e4675fbc989fc9453c8b9
When the [service_user] section is configured in nova.conf, nova will
have the ability to send a service user token alongside the user's
token. The service user token is sent when nova calls other services'
REST APIs to authenticate as a service, and service calls can sometimes
have elevated privileges.
However, nova does not currently have the ability to send a service user
token with an admin context. This means that when nova makes REST API
calls to other services with an anonymous admin RequestContext (such as
in nova-manage or periodic tasks), it will not be authenticated as a
service.
This adds a keyword argument to service_auth.get_auth_plugin() to
enable callers to provide a user_auth object instead of attempting to
extract the user_auth from the RequestContext.
The cinder and neutron client modules are also adjusted to make use of
the new user_auth keyword argument so that nova calls made with
anonymous admin request contexts can authenticate as a service when
configured.
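Callers can then do something along these lines (a usage sketch, with the
config group and helper name chosen for illustration):

    from keystoneauth1 import loading as ks_loading
    from oslo_config import cfg

    from nova import service_auth

    CONF = cfg.CONF

    def get_admin_service_auth_plugin(context):
        # With an anonymous admin RequestContext there is no user token to
        # wrap, so load a user auth explicitly (here from the [cinder]
        # options) and hand it to get_auth_plugin().
        user_auth = ks_loading.load_auth_from_conf_options(CONF, 'cinder')
        return service_auth.get_auth_plugin(context, user_auth=user_auth)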
Related-Bug: #2004555
Change-Id: I14df2d55f4b2f0be58f1a6ad3f19e48f7a6bfcb4
The 'force' parameter of os-brick's disconnect_volume() method allows
callers to ignore flushing errors and ensure that devices are being
removed from the host.
We should use force=True when we are going to delete an instance to
avoid leaving leftover devices connected to the compute host which
could then potentially be reused to map volumes to an instance that
should not have access to those volumes.
We can use force=True even when disconnecting a volume that will not be
deleted on termination because os-brick will always attempt to flush
and disconnect gracefully before forcefully removing devices.
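A rough sketch of the call site shape (the connector setup and arguments
are illustrative only, not Nova's actual volume driver code):

    from os_brick.initiator import connector

    def disconnect_volume_from_host(connection_info, device_info):
        conn = connector.InitiatorConnector.factory('ISCSI', root_helper=None)
        # force=True tells os-brick to ignore flushing errors and make sure
        # the devices are removed from the host; it still attempts a
        # graceful flush and disconnect first.
        conn.disconnect_volume(connection_info, device_info, force=True)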
Closes-Bug: #2004555
Change-Id: I3629b84d3255a8fe9d8a7cea8c6131d7c40899e8
Make the host class look under '/sys/fs/cgroup/cgroup.controllers' for
support of the cpu controller. The host will try searching through
cgroups v1 first, just like up until now, and if that fails it will then
try cgroups v2. The host will not support the feature if both checks
fail.
This new check needs to be mocked by all tests that focus on this piece
of code, as it touches a system file that requires privileges. For that,
the CGroupsFixture is defined to easily add such mocking to all test
cases that require it.
I also removed the old mocking in test_driver.py in favor of the fixture
above.
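The cgroups v2 part of the check amounts to something like this sketch
(simplified, not the exact helper):

    def has_cgroupsv2_cpu_controller(path='/sys/fs/cgroup/cgroup.controllers'):
        # On a cgroups v2 host this file lists the available controllers,
        # e.g. "cpuset cpu io memory pids".
        try:
            with open(path) as f:
                return 'cpu' in f.read().split()
        except FileNotFoundError:
            return False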
Partial-Bug: #2008102
Change-Id: I99b57c27c8a4425389bec2b7f05af660bab85610
Unfortunately, when we merged Ie166f3b51fddeaf916cda7c5ac34bbcdda0fd17a
we forgot that subnets can have no segment_id field.
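The fix boils down to guarding the attribute access, roughly like this
(shown on plain dicts for illustration):

    def get_segment_ids(subnets):
        # Subnets are not guaranteed to carry a segment_id key, so use
        # .get() instead of assuming the field is present.
        return {subnet.get('segment_id') for subnet in subnets
                if subnet.get('segment_id') is not None}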
Change-Id: Idb35b7e3c69fe8efe498abe4ebcc6cad8918c4ed
Closes-Bug: #2018375