Commit Graph

60281 Commits

Author SHA1 Message Date
Nobuhiro MIKI 2fd034ec48 libvirt: Add 'COMPUTE_ADDRESS_SPACE_*' traits support
Two traits, COMPUTE_ADDRESS_SPACE_PASSTHROUGH and
COMPUTE_ADDRESS_SPACE_EMULATED, are reported based on the libvirt and
QEMU versions. Since both are supported starting from the same libvirt
and QEMU versions, Nova handles them in the same way.

Blueprint: libvirt-maxphysaddr-support
Depends-On: https://review.opendev.org/c/openstack/os-traits/+/871226
Signed-off-by: Nobuhiro MIKI <nmiki@yahoo-corp.jp>
Change-Id: If6c7169b7b8f43ad15a8992831824fb546e85aab
2023-07-24 17:09:19 +09:00
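
For illustration, a minimal sketch of version-gated trait reporting in a
libvirt-style driver; the version thresholds below and the
has_min_version() helper are assumptions, not taken from the patch:

    # Hedged sketch: both traits are gated on the same minimum versions,
    # so a single capability check drives both (thresholds assumed).
    MIN_LIBVIRT_MAXPHYSADDR = (8, 7, 0)
    MIN_QEMU_MAXPHYSADDR = (7, 1, 0)

    def address_space_traits(host):
        supported = host.has_min_version(lv_ver=MIN_LIBVIRT_MAXPHYSADDR,
                                         hv_ver=MIN_QEMU_MAXPHYSADDR)
        return {
            'COMPUTE_ADDRESS_SPACE_PASSTHROUGH': supported,
            'COMPUTE_ADDRESS_SPACE_EMULATED': supported,
        }
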
Zuul 7e25b672ef Merge "Add a new policy for cold-migrate with host" 2023-07-21 16:52:51 +00:00
Zuul b6f4c57b43 Merge "Drop Fedora support" 2023-07-18 17:49:07 +00:00
Zuul 9f77ba3b63 Merge "Add config option to configure TB cache size" 2023-07-17 23:44:45 +00:00
Zuul a0559af692 Merge "db: Avoid relying on autobegin" 2023-07-17 14:37:39 +00:00
Zuul e02c5f0e7a Merge "Populate ComputeNode.service_id" 2023-07-14 22:41:39 +00:00
yatinkarel 3f7cc63d94 Add config option to configure TB cache size
QEMU >= 5.0.0 bumped the default tb-cache size from 32MiB to 1GiB,
which made it difficult to run multiple guest VMs on systems with
less memory. With libvirt >= 8.0.0 it is possible to configure a
lower tb-cache size.

The config option below is introduced to allow configuring the TB
cache size to suit the environment's needs; it only applies to
'virt_type=qemu':

[libvirt]tb_cache_size

Also enable this flag in the nova-next job.

[1] https://github.com/qemu/qemu/commit/600e17b26
[2] https://gitlab.com/libvirt/libvirt/-/commit/58bf03f85

Closes-Bug: #1949606
Implements: blueprint libvirt-tb-cache-size
Change-Id: I49d2276ff3d3cc5d560a1bd96f13408e798b256a
2023-07-13 19:35:52 +05:30
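
As a usage sketch, assuming the option takes a size in MiB, a
deployment could set it in nova.conf like this:

    [libvirt]
    virt_type = qemu
    # Example value only: cap the TB cache well below QEMU's 1GiB default.
    tb_cache_size = 128
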
Amit Uniyal f7ce4df51c Refactor CinderFixture
Replaced stub with MockPatch.

Change-Id: Iaf4a9182b79ec4d1c2d3436b3dc9a6c760cd48f9
2023-07-12 14:16:03 +00:00
Sean Mooney 6f56c5c9fd enable validations in nova-lvm
As of I8ca059a4702471d4d30ea5a06079859eba3f5a81, validations are now
required for test_rebuild_volume_backed_server. Validations are also
required for any volume attach/detach based test in general, due to
known qemu issues.

This patch just turns them back on to unblock the gate.

Closes-Bug: #2025813
Change-Id: Ia198f712e2ad277743aed08e27e480208f463ac7
2023-07-04 15:49:11 +00:00
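
For reference, re-enabling validations in a devstack-based Zuul job is
typically a one-variable change; a sketch, with the variable name
assumed from devstack conventions:

    # .zuul.yaml sketch
    - job:
        name: nova-lvm
        vars:
          devstack_localrc:
            # Assumed devstack knob; turns tempest validation (SSH
            # checks) back on for volume attach/detach tests.
            TEMPEST_RUN_VALIDATION: true
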
Zuul 4b454febf7 Merge "database: Archive parent and child rows "trees" one at a time" 2023-07-02 08:08:52 +00:00
Zuul d56d1a828d Merge "Verify a move operation for cross_az_attach=False" 2023-06-26 11:44:46 +00:00
Sylvain Bauza 2d320f9b00 Add a new policy for cold-migrate with host
We add a new specific policy for when a host value is provided to
cold-migrate; by default it is an admin-only rule, so as not to change
the existing behaviour.

Change-Id: I128242d5f689fdd08d74b1dcba861177174753ff
Implements: blueprint cold-migrate-to-host-policy
2023-06-26 11:34:12 +02:00
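
A hedged sketch of overriding the new rule in policy.yaml; the rule
name below is an assumption, not confirmed from the patch:

    # policy.yaml sketch (rule name assumed)
    # Default stays admin-only; deployments can open it up deliberately.
    "os_compute_api:os-migrate-server:migrate:host": "rule:context_is_admin"
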
jskunda 86c542c56a Drop Fedora support
We are about to drop Fedora support, as the latest upstream image has
transitioned to EOL. CentOS 9 Stream has emerged as the replacement
platform for new features. The patch that removes the Fedora jobs and
nodeset from devstack:
https://review.opendev.org/c/openstack/devstack/+/885467

Change-Id: Ib7d3dd93602c94fd801f8fe5daa26353b04f589b
2023-06-21 23:58:44 +02:00
melanie witt 697fa3c000 database: Archive parent and child rows "trees" one at a time
Previously, we archived deleted rows in batches of max_rows parents +
their child rows in a single database transaction. Doing it that way
limited how high a value of max_rows could be specified by the caller
because of the size of the database transaction it could generate.

For example, in a large scale deployment with hundreds of thousands of
deleted rows and constant server creation and deletion activity, a
value of max_rows=1000 might exceed the database's configured maximum
packet size or timeout due to a database deadlock, forcing the operator
to use a much lower max_rows value like 100 or 50.

And when the operator has e.g. 500,000 deleted instances rows (and
millions of deleted rows total) they are trying to archive, being
forced to use a max_rows value several orders of magnitude lower than
the number of rows they need to archive was a poor user experience.

This changes the logic to archive one parent row and its foreign-key
related child rows at a time, each in a single database transaction,
stopping for a table as soon as the number of archived rows reaches
>= max_rows. This allows operators to choose more predictable values
for max_rows and to get more progress per invocation of
archive_deleted_rows.

Closes-Bug: #2024258

Change-Id: I2209bf1b3320901cf603ec39163cf923b25b0359
2023-06-20 20:04:46 +00:00
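
Conceptually, the new loop looks like the sketch below; all helper
names are illustrative, not Nova's actual internals:

    def archive_deleted_rows(max_rows):
        total = 0
        while total < max_rows:
            parent = next_deleted_parent_row()        # illustrative helper
            if parent is None:
                break
            # One parent plus its FK-related child rows per transaction:
            # each transaction stays small no matter how large max_rows is.
            with database_transaction() as tx:
                total += archive_child_rows(tx, parent)
                total += archive_row(tx, parent)
        return total
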
melanie witt f6620d48c8 testing: Fix and robustify archive_deleted_rows test
The regexes in test_archive_deleted_rows for multiple cells were
incorrect in that they did not isolate the search pattern and so could
also match other rows in the result table, producing false positives.

This fixes the regexes and also adds one more server to the test
scenario in order to make sure archive_deleted_rows iterates at least
once to expose bugs that may be present in its internal iteration.

This patch is in preparation for a future patch that will change the
logic in archive_deleted_rows. Making this test more robust will guard
more thoroughly against regressions.

Change-Id: If39f6afb6359c67aa38cf315ec90ffa386d5c142
2023-06-16 06:13:49 +00:00
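
To illustrate the kind of fix described (the result-table format shown
here is assumed), anchoring the pattern to a whole row prevents it from
matching fragments of other rows:

    import re

    # Loose: 'cell1.*instances.*2' can also match across unrelated rows.
    loose = re.compile(r'cell1.*instances.*2')
    # Isolated: match exactly one table row, anchored at line boundaries.
    strict = re.compile(r'^\| cell1\.instances \| 2 \|$', re.MULTILINE)
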
Zuul 308633f93a Merge "cpu: fix the privsep issue when offlining the cpu" 2023-06-07 21:37:30 +00:00
Zuul a4ca440ed8 Merge "tests: Add missing args to sqlalchemy.Table" 2023-06-07 16:54:24 +00:00
Zuul c7b77aa17f Merge "tests: Pass parameters to sqlalchemy.text() as bindparams" 2023-06-07 16:54:17 +00:00
Zuul 0e997a428c Merge "db: Remove unnecessary 'insert()' argument" 2023-06-07 16:54:08 +00:00
Zuul c472d829fa Merge "db: Don't rely on branched connections" 2023-06-07 16:54:01 +00:00
Zuul 1fe8c4becb Merge "Fix failed count for anti-affinity check" 2023-06-07 14:35:52 +00:00
Zuul fc8951efb9 Merge "Process unlimited exceptions raised by unplug_vifs" 2023-06-07 14:16:10 +00:00
Zuul 86b1f1505a Merge "Add debug logging when Instance raises OrphanedObjectError" 2023-06-06 20:07:47 +00:00
Sylvain Bauza 3fab43786b cpu: fix the privsep issue when offlining the cpu
In Icb913ed9be8d508de35e755a9c650ba25e45aca2 we forgot to add a privsep
decorator for the set_offline() method.

Change-Id: I769d35907ab9466fe65b942295fd7567a757087a
Closes-Bug: #2022955
2023-06-06 16:26:05 +02:00
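
The fix is essentially a one-line decorator so the method runs via the
privsep daemon; a sketch, with the exact privsep context assumed:

    import nova.privsep

    @nova.privsep.sys_admin_pctxt.entrypoint
    def set_offline(core_id):
        # Writing 0 to /sys/devices/system/cpu/cpuN/online needs root,
        # hence the privsep entrypoint.
        ...
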
Yusuke Okada 56d320a203 Fix failed count for anti-affinity check
The late anti-affinity check runs in the compute manager to catch
parallel scheduling requests that would violate the anti-affinity
server group policy. When the check fails, the instance is
re-scheduled. However, this failure was counted as a real instance
boot failure of the compute host and could lead to de-prioritization
of the compute host in the scheduler via the BuildFailureWeigher. As
the late anti-affinity check does not indicate any fault of the
compute host itself, it should not count towards the build failure
counter. This patch adds new build results to handle this case.

Closes-Bug: #1996732
Change-Id: I2ba035c09ace20e9835d9d12a5c5bee17d616718
Signed-off-by: Yusuke Okada <okada.yusuke@fujitsu.com>
2023-06-06 10:15:16 +02:00
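
Schematically, a dedicated build result lets the compute manager
reschedule without touching the failure counter; the constant names
below are hypothetical:

    # Hypothetical names; the real patch adds results of this shape.
    FAILED = 'failed'                                # host is penalized
    RESCHEDULED = 'rescheduled'
    RESCHEDULED_BY_POLICY = 'rescheduled_by_policy'  # host not at fault

    def record_build_result(result, stats):
        if result == FAILED:
            stats.failed_builds += 1   # feeds the BuildFailureWeigher
        elif result == RESCHEDULED_BY_POLICY:
            pass                       # anti-affinity race: no penalty
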
Dan Smith afad847e4d Populate ComputeNode.service_id
The ComputeNode object already has a service_id field that we stopped
using a while ago. This moves us back to the point where we set it when
creating new ComputeNode records, and also migrates existing records
when they are loaded.

The resource tracker is created before we may have created the
service record, but is updated afterwards in the pre_start_hook().
So this adds a way for us to pass the service_ref to the resource
tracker during that hook so that it is present before the first time
we update all of our ComputeNode records. It also passes the Service
through from the actual Service manager, instead of looking it up
again, so that we maintain the tight relationship and avoid any
name-based ambiguity.

Related to blueprint compute-object-ids

Change-Id: I5e060d674b6145c9797c2251a2822106fc6d4a71
2023-05-31 07:06:34 -07:00
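
A rough sketch of the wiring described above; the attribute and hook
names are assumptions:

    # Sketch: the service manager hands its own Service record down so
    # the resource tracker never has to look it up again by (possibly
    # ambiguous) name.
    def pre_start_hook(self, service_ref):
        self.rt.service_ref = service_ref
        # The first update of the ComputeNode records can now populate
        # ComputeNode.service_id from service_ref.id directly.
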
Zuul 971809b4d4 Merge "db: Remove the legacy 'migration_version' table" 2023-05-30 20:56:59 +00:00
melanie witt 6f79d6321e Enforce quota usage from placement when unshelving
When [quota]count_usage_from_placement = true or
[quota]driver = nova.quota.UnifiedLimitsDriver, cores and ram quota
usage are counted from placement. When an instance is SHELVED_OFFLOADED,
it will not have allocations in placement, so its cores and ram should
not count against quota during that time.

This means however that when an instance is unshelved, there is a
possibility of going over quota if the cores and ram it needs were
allocated by some other instance(s) while it was SHELVED_OFFLOADED.

This fixes a bug where quota was not being properly enforced during
unshelve of a SHELVED_OFFLOADED instance when quota usage is counted
from placement. Test coverage is also added for the "recheck" quota
cases.

Closes-Bug: #2003991

Change-Id: I4ab97626c10052c7af9934a80ff8db9ddab82738
2023-05-23 01:02:05 +00:00
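
The enforcement point reduces to a pre-unshelve check of this shape;
helper names are illustrative:

    def check_unshelve_quota(context, instance):
        # A SHELVED_OFFLOADED instance holds no placement allocations,
        # so placement-based usage excludes it until it is unshelved.
        usage = count_cores_ram_from_placement(context.project_id)
        limits = get_quota_limits(context.project_id)
        if (usage['cores'] + instance.vcpus > limits['cores']
                or usage['ram'] + instance.memory_mb > limits['ram']):
            raise OverQuota()
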
melanie witt 427b2cb4d6 Reproducer for bug 2003991 unshelving offloaded instance
This adds test coverage for:

  * Shelve/unshelve offloaded with legacy quota usage
  * Shelve/unshelve offloaded with quota usage from placement
  * Shelve/unshelve offloaded with unified limits
  * Shelve/unshelve with legacy quota usage
  * Shelve/unshelve with quota usage from placement
  * Shelve/unshelve with unified limits

Related-Bug: #2003991

Change-Id: Icc9b6366aebba2f8468e2127da7b7e099098513a
2023-05-22 22:19:01 +00:00
Zuul 71b105a4cf Merge "Fixes a typo in availability-zone doc" 2023-05-22 19:49:34 +00:00
Zuul 2dde4538bc Merge "vmwareapi: Mark driver as experimental" 2023-05-18 15:08:04 +00:00
Amit Uniyal e2264d7657 Fixes a typo in availability-zone doc
Change-Id: Ic1bb8abaf2cbdac31a4503b12f38e5e2d5aadcfd
2023-05-18 06:25:18 +00:00
melanie witt e0fbb6fc06 Add debug logging when Instance raises OrphanedObjectError
This logging would be helpful in debugging issues when
OrphanedObjectError is raised by an instance. Currently, there is
no way to identify which instance is attempting to lazy-load a
field while orphaned. Being able to locate the instance in the
database could also help with recovery/cleanup when a problematic
record is disrupting operation of a deployment.

Change-Id: I093de2839c1bb7c949a0812e07b63de4cc5ed167
2023-05-17 23:54:46 +00:00
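
The added logging is roughly of this shape (exact wording and call
site assumed):

    # Sketch: record enough to locate the offending DB record before
    # raising, since the exception itself does not identify the instance.
    LOG.debug('Orphaned instance attempted to lazy-load %(attr)s '
              '(uuid=%(uuid)s)', {'attr': attrname, 'uuid': self.uuid})
    raise exception.OrphanedObjectError(
        method='obj_load_attr', objtype=self.obj_name())
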
Zuul 02a63e0f1b Merge "tests: Use GreenThreadPoolExecutor.shutdown(wait=True)" 2023-05-17 09:20:59 +00:00
melanie witt c095cfe04e tests: Use GreenThreadPoolExecutor.shutdown(wait=True)
We are still having some issues in the gate where greenlets from
previous tests continue to run while the next test starts, causing
false negative failures in unit or functional test jobs.

This adds a new fixture that will ensure
GreenThreadPoolExecutor.shutdown() is called with wait=True, to wait
for greenlets in the pool to finish running before moving on.

In local testing, doing this does not appear to adversely affect test
run times, which was my primary concern.

As a baseline, I ran a subset of functional tests in a loop
until failure without the patch and after 11 hours, I got a failure
reproducing the bug. With the patch, running the same subset of
functional tests in a loop has been running for 24 hours and has not
failed yet.

Based on this, I think it may be worth trying this out to see if it
will help stability of our unit and functional test jobs. And if it
ends up impacting test run times or causes other issues, we can
revert it.

Partial-Bug: #1946339

Change-Id: Ia916310522b007061660172fa4d63d0fde9a55ac
2023-05-17 00:57:37 +00:00
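
A minimal sketch of such a fixture, assuming futurist's
GreenThreadPoolExecutor and the fixtures library; the real fixture's
name and details may differ:

    import fixtures
    import futurist

    class GreenPoolShutdownWait(fixtures.Fixture):
        """Force GreenThreadPoolExecutor.shutdown() to wait."""

        def _setUp(self):
            real_shutdown = futurist.GreenThreadPoolExecutor.shutdown

            def shutdown(executor, wait=True, *args, **kwargs):
                # Always wait, so stray greenlets cannot run on into the
                # next test and cause false-negative failures.
                return real_shutdown(executor, wait=True)

            self.useFixture(fixtures.MonkeyPatch(
                'futurist.GreenThreadPoolExecutor.shutdown', shutdown))
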
Zuul b3fdd7ccf0 Merge "doc: Update version info" 2023-05-11 23:34:27 +00:00
Elod Illes fe125da63b CI: fix backport validator for new branch naming
The validate-backport job started to fail because only the old stable
branch naming was accepted. This patch extends the script to also
allow numbers and dots in branch names (like stable/2023.1).

Change-Id: Icbdcd5d124717e195d55d9e42530611ed812fadd
2023-05-11 16:23:53 +02:00
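
The change boils down to widening the accepted branch-name pattern; a
sketch:

    import re

    # Before: only classic names like stable/zed were accepted.
    OLD = re.compile(r'^stable/[a-z]+$')
    # After: release-number names like stable/2023.1 are accepted too.
    NEW = re.compile(r'^stable/[a-z0-9.]+$')

    assert NEW.match('stable/2023.1') and NEW.match('stable/zed')
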
Zuul e9a54ff350 Merge "Bump nova-ceph-multistore timeout" 2023-05-11 10:51:49 +00:00
Dan Smith 6ff3237149 Bump nova-ceph-multistore timeout
The recent changes enabling many more SSHABLE checks put the runtime
of the ceph job very close to the 2h timeout even when things are
working. Sometimes it times out before finishing even though things
are progressing. Bump the timeout to avoid that.

Also bump us to 8G swap to match what is set on the parent ceph job
when we upgraded to jammy. We could just unset this, but better to
pin it high in case that job (defined elsewhere) changes. Our job
is the largest ceph job, so it makes sense that it keeps its own
swap level high.

Change-Id: I6cefd87671614d87d92e4675fbc989fc9453c8b9
2023-05-10 17:54:38 -07:00
melanie witt 41c64b94b0 Enable use of service user token with admin context
When the [service_user] section is configured in nova.conf, nova will
have the ability to send a service user token alongside the user's
token. The service user token is sent when nova calls other services'
REST APIs to authenticate as a service, and service calls can sometimes
have elevated privileges.

However, nova does not currently have the ability to send a service
user token with an admin context. This means that when nova makes REST API
calls to other services with an anonymous admin RequestContext (such as
in nova-manage or periodic tasks), it will not be authenticated as a
service.

This adds a keyword argument to service_auth.get_auth_plugin() to
enable callers to provide a user_auth object instead of attempting to
extract the user_auth from the RequestContext.

The cinder and neutron client modules are also adjusted to make use of
the new user_auth keyword argument so that nova calls made with
anonymous admin request contexts can authenticate as a service when
configured.

Related-Bug: #2004555

Change-Id: I14df2d55f4b2f0be58f1a6ad3f19e48f7a6bfcb4
2023-05-10 14:52:59 +00:00
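
Caller-side, the new keyword argument is used like this sketch (the
helper shown is illustrative):

    # Sketch: with an anonymous admin context there is no user token to
    # extract, so the caller supplies the auth object directly.
    admin_auth = build_admin_auth_plugin()   # illustrative helper
    auth = service_auth.get_auth_plugin(context, user_auth=admin_auth)
    # 'auth' now sends the [service_user] service token alongside
    # admin_auth when calling cinder/neutron.
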
melanie witt db455548a1 Use force=True for os-brick disconnect during delete
The 'force' parameter of os-brick's disconnect_volume() method allows
callers to ignore flushing errors and ensure that devices are being
removed from the host.

We should use force=True when we are going to delete an instance to
avoid leaving leftover devices connected to the compute host which
could then potentially be reused to map to volumes to an instance that
should not have access to those volumes.

We can use force=True even when disconnecting a volume that will not be
deleted on termination because os-brick will always attempt to flush
and disconnect gracefully before forcefully removing devices.

Closes-Bug: #2004555

Change-Id: I3629b84d3255a8fe9d8a7cea8c6131d7c40899e8
2023-05-10 07:09:05 -07:00
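
In os-brick terms the call is simply the following; connection details
are elided:

    # os-brick still tries a graceful flush-and-disconnect first;
    # force=True only guarantees the device is removed from the host
    # even if flushing fails.
    connector.disconnect_volume(connection_properties, device_info,
                                force=True)
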
Zuul 105afb338b Merge "Revert "Debug Nova APIs call failures"" 2023-05-09 23:54:04 +00:00
Zuul 0c397d60e7 Merge "Handle zero pinned CPU in a cell with mixed policy" 2023-05-09 12:46:37 +00:00
Zuul 07b7db090d Merge "Reproduce asym NUMA mixed CPU policy bug" 2023-05-05 10:07:51 +00:00
Zuul 8e4a7290f8 Merge "Fix get_segments_id with subnets without segment_id" 2023-05-04 10:31:26 +00:00
Zuul cb2cdee4c9 Merge "Have host look for CPU controller of cgroupsv2 location." 2023-05-04 10:27:32 +00:00
Zuul deac3a2f8a Merge "Save cell socket correctly when updating host NUMA topology" 2023-05-04 10:27:24 +00:00
Zuul ad3b3681b6 Merge "add hypervisor version weigher" 2023-05-04 01:29:06 +00:00
Jorge San Emeterio 973ff4fc1a Have host look for CPU controller of cgroupsv2 location.
Make the host class look under '/sys/fs/cgroup/cgroup.controllers' for
support of the cpu controller. The host will first search through
cgroupsv1, just as it has until now, and if that fails it will then try
cgroupsv2. The host will not support the feature if both checks fail.

This new check needs to be mocked by all tests that exercise this piece
of code, as it touches a system file that requires privileges. For that
purpose, the CGroupsFixture is defined to easily add such mocking to all
test cases that require it.

I also removed the old mocking in test_driver.py in favor of the fixture
above.

Partial-Bug: #2008102
Change-Id: I99b57c27c8a4425389bec2b7f05af660bab85610
2023-05-03 15:03:07 -07:00
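
The cgroupsv2 side of the check reduces to a file read; a sketch:

    def has_cgroupsv2_cpu_controller():
        # On a unified-hierarchy (cgroupsv2) host, cgroup.controllers
        # lists the available controllers, e.g. "cpuset cpu io memory".
        try:
            with open('/sys/fs/cgroup/cgroup.controllers') as f:
                return 'cpu' in f.read().split()
        except FileNotFoundError:
            return False   # not a cgroupsv2 host
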
Sylvain Bauza 6d7bd6a034 Fix get_segments_id with subnets without segment_id
Unfortunately, when we merged Ie166f3b51fddeaf916cda7c5ac34bbcdda0fd17a
we forgot that subnets may have no segment_id field.

Change-Id: Idb35b7e3c69fe8efe498abe4ebcc6cad8918c4ed
Closes-Bug: #2018375
2023-05-03 17:00:14 +02:00
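
The defensive fix is the usual dict-access change; a sketch:

    # Before: raises KeyError for subnets without a segment_id field.
    segment_ids = [subnet['segment_id'] for subnet in subnets]
    # After: skip subnets that carry no segment_id.
    segment_ids = [s['segment_id'] for s in subnets if s.get('segment_id')]
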