As stated in the forced-down API [1]:
> Setting a service forced down without completely fencing it will
> likely result in the corruption of VMs on that host.
Previously only the libvirtd service was stopped on the subnode prior to
calling this API, allowing n-cpu, q-agt and the underlying guest domains
to continue running on the host.
This change now ensures all devstack services are stopped on the subnode
and all active domains destroyed.
It is hoped that this will resolve bug #1813789 where evacuations have
timed out due to VIF plugging issues on the new destination host.
[1] https://docs.openstack.org/api-ref/compute/?expanded=update-forced-down-detail#update-forced-down
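For illustration, destroying the active domains can be done with the
libvirt-python bindings along these lines (a minimal sketch, not the
actual devstack change):

  import libvirt

  conn = libvirt.open('qemu:///system')
  for dom in conn.listAllDomains():
      if dom.isActive():
          # Hard power-off; the host is about to be fenced anyway.
          dom.destroy()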
Related-Bug: #1813789
Change-Id: I8af2ad741ca08c3d88efb9aa817c4d1470491a23
Reject live migration if there are virtual persistent memory resources.
Otherwise, if the dest host has a vpmem backend file with the same
name as the one used by the instance, live migration will succeed and
those files will be used but not tracked in Nova; if the dest host has
no such vpmems, an error will be triggered.
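A minimal sketch of such a pre-check (the helper name and the
resource class prefix are illustrative, not the exact Nova code):

  from nova import exception

  def _reject_vpmem_live_migration(instance):
      # Reject the migration if the instance claims any vPMEM
      # resources.
      resources = instance.resources or []
      if any(r.resource_class.startswith('CUSTOM_PMEM_NAMESPACE')
             for r in resources):
          raise exception.MigrationPreCheckError(
              reason='Live migration with virtual persistent memory '
                     'is not supported')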
Change-Id: I900f74d482fc87da5b1b5ec9db2ad5aefcfcfe7a
Closes-Bug: #1863605
Implements: blueprint support-live-migration-with-virtual-persistent-memory
'sockets_per_cell', 'cores_per_socket' and 'threads_per_core'
are not initialization arguments of the 'NUMATopology' object.
In the test test_get_guest_config_numa_host_instance_topo_cpu_pinning
the code path does not depend on the correctness of the host NUMA
topology, since the instance uses the 'DEDICATED' CPU policy, but we
should still refine the test and create a correct NUMA topology to
avoid misleading readers of the code.
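A correct host NUMA topology is instead built from per-cell objects,
roughly like this (a sketch; field names may differ between Nova
versions):

  from nova import objects

  host_topology = objects.NUMATopology(cells=[
      objects.NUMACell(
          id=0, cpuset=set([0, 1]), memory=1024,
          cpu_usage=0, memory_usage=0, mempages=[],
          siblings=[set([0]), set([1])], pinned_cpus=set()),
  ])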
Change-Id: Ic1c9cf16c482939d0761d0cdab66c8eac07cad7b
This patch poisons the synchronized decorator in the unit tests
to prevent adding new synchronized methods without the fair=True
flag.
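A minimal sketch of how the poisoning can work (the wrapper is
illustrative, not the exact test fixture):

  from oslo_concurrency import lockutils

  _original_synchronized = lockutils.synchronized

  def _poisoned_synchronized(*args, **kwargs):
      # Fail loudly if any code path creates an unfair lock.
      if not kwargs.get('fair', False):
          raise AssertionError(
              'synchronized() must be called with fair=True')
      return _original_synchronized(*args, **kwargs)

  lockutils.synchronized = _poisoned_synchronized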
Change-Id: I739025dacbcaa0f7adbe612c064f979bf6390880
Related-Bug: #1864122
The test test_unshelve_offloaded_fails_due_to_neutron could fail due
to a race condition. The test case only waits for the first
instance.save() call at [1], but the allocation delete happens after
it. As a result, the test case can still see the allocation of the
offloaded server in placement.
The fix makes sure that the test waits for the second instance.save() by
checking for the host of the instance.
[1] https://github.com/openstack/nova/blob/stable/rocky/nova/compute/manager.py#L5274-L5288
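The general shape of such a wait helper is shown below (hypothetical;
Nova's functional tests provide a similar _wait_for_server_parameter
helper):

  import time

  def wait_for_server_parameter(api, server_id, key, expected,
                                timeout=10.0):
      # Poll the API until the attribute reaches the expected value,
      # rather than racing against a single notification.
      deadline = time.time() + timeout
      while time.time() < deadline:
          server = api.get_server(server_id)
          if server.get(key) == expected:
              return server
          time.sleep(0.5)
      raise AssertionError('%s never became %s' % (key, expected))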
Related-Bug: #1862633
Change-Id: Ic1c3d35749fbdc7f5b6f6ec1e16b8fcf37c10de8
Previously the ceph.sh script used during the nova-live-migration job
would only grep for a `compute` process when checking if the services
had been restarted. This check was bogus and would always return 0 as it
would always match itself. For example:
  2020-03-13 21:06:47.682073 | primary | 2020-03-13 21:06:47.681 | root 29529 0.0 0.0 4500 736 pts/0 S+ 21:06 0:00 /bin/sh -c ps aux | grep compute
  2020-03-13 21:06:47.683964 | primary | 2020-03-13 21:06:47.683 | root 29531 0.0 0.0 14616 944 pts/0 S+ 21:06 0:00 grep compute
Failures of this job were seen on the stable/pike branch, where slower
CI nodes appeared to struggle to allow Libvirt to report to n-cpu in
time before Tempest was started. This in turn caused instance build
failures and the overall failure of the job.
This change resolves this issue by switching to pgrep and ensuring
n-cpu services are reported as fully up after a cold restart before
starting the Tempest test run.
Closes-Bug: #1867380
Change-Id: Icd7ab2ca4ddbed92c7e883a63a23245920d961e7
This is mostly code motion from the nova.virt.images module into privsep
to allow for both privileged and unprivileged calls to be made.
A privileged_qemu_img_info function is introduced, allowing qemu-img
to inspect devices that require root privileges, such as host block
devices.
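The shape of the new entry point is roughly as follows (a simplified
sketch; only the function name comes from this change):

  from oslo_concurrency import processutils
  from nova import privsep

  @privsep.sys_admin_pctxt.entrypoint
  def privileged_qemu_img_info(path, format=None):
      # The entrypoint decorator reruns this function in the privsep
      # daemon, which has the elevated privileges needed to read
      # host block devices.
      cmd = ['qemu-img', 'info', '--output=json', path]
      if format is not None:
          cmd += ['-f', format]
      out, _err = processutils.execute(*cmd)
      return out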
Change-Id: I5ac03f923d9d181d22d44d8ec8fbc31eb0c3999e
While introducing PROJECT_READER_OR_SYSTEM_READER in policy
- https://review.opendev.org/#/c/706672
I added those rules to the system_reader list in test_policy. Let's
maintain them in a separate list for easier reading; there are going
to be more policies in this list.
Partially Implements: blueprint policy-defaults-refresh
Change-Id: Ice3d79f0803efad60236e55b55bebd681056564c
The SDK API tests provide another set of verification of nova
behavior. In a perfect world, as we add new microversions to nova,
we'd first add support for the microversion to the SDK, along with a
test that works even when the microversion isn't available; then, once
the microversion lands, the test should still work. This would allow
us to make sure the SDK is always up to date with the very latest and
greatest nova API.
Change-Id: I2406bd6d9e69e33e57b715ff0812c5770b1b53d8
As of now, questions on the mailing list are going unanswered, and the
Nova team does not have a clear representative owner for the driver to
whom bugs and other reports can be directed. There does not appear
to be a CI system running tests for the driver anymore, and the latest
indication from the community[1] points to it being potentially broken with
devstack.
This patch starts the deprecation timer for the driver and/or serves
as a flare to gauge interest (or lack thereof) in continuing to maintain
the driver.
1: http://lists.openstack.org/pipermail/openstack-discuss/2020-March/013066.html
Change-Id: Ie39e9605dc8cebff3795a29ea91dc08ee64a21eb
When the resource tracker has to lock a compute host for updates or
inspection, it uses a single semaphore. In most cases this is fine, as
a compute process only tracks one hypervisor. However, with Ironic it
is possible for one compute process to track many hypervisors. In this
case, wait queues for instance claims can get "stuck" briefly behind
longer processing loops such as the update_resources periodic job. This
is possible because the oslo.concurrency lockutils 'synchronized'
decorator does not use fair locks by default. When a lock is released,
one of the threads waiting for the lock is randomly allowed to take
the lock next. A fair lock instead ensures that waiters take the lock
in the order they requested it.
This should ensure that instance claim requests do not lose the lock
contest to later arrivals, so that instance build requests do not
queue unnecessarily behind long-running tasks.
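A minimal sketch of the decorator usage, assuming oslo.concurrency >=
3.29.0 (the function here is illustrative, not Nova's actual resource
tracker code):

  from oslo_concurrency import lockutils

  @lockutils.synchronized('compute_resources', fair=True)
  def update_usage():
      # With fair=True, waiting threads take the lock in FIFO order,
      # so a claim request cannot be starved by later arrivals.
      pass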
This includes bumping the oslo.concurrency dependency; fair locks were
added in 3.29.0 (I37577becff4978bf643c65fa9bc2d78d342ea35a).
Change-Id: Ia5e521e0f0c7a78b5ace5de9f343e84d872553f9
Related-Bug: #1864122
In some tests, we were doing an import with a full module path. This has
the side effect of importing every submodule on that path, which led to
some confusing side effects. Use 'from bar import foo' syntax instead
and clean up the damage.
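For illustration:

  # Before: full-path import; the top-level package is bound and the
  # whole dotted chain appears at every use site.
  import nova.virt.libvirt.driver
  driver_cls = nova.virt.libvirt.driver.LibvirtDriver

  # After: the submodule is bound directly and the dependency is
  # explicit at the top of the file.
  from nova.virt.libvirt import driver
  driver_cls = driver.LibvirtDriver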
Change-Id: I91a289630f31674dec1d785d67b5acda173b7d7e
Signed-off-by: Stephen Finucane <sfinucan@redhat.com>
Similar to change I8ae352ff3eeb760c97d1a6fa9d7a59e881d7aea1, if
we're processing a network-vif-deleted event while an instance
is being deleted, the asynchronous interface detach could fail
because the guest is gone from the hypervisor. The existing code
for handling this case was using a stale guest object, so this
change tries to refresh the guest from the hypervisor; if the
guest is gone, the Host.get_guest() method should raise an
InstanceNotFound exception, which we just trap, log and return.
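The resulting handler looks roughly like this (a simplified sketch;
_get_interface_cfg is a hypothetical helper standing in for the real
config lookup):

  def _process_vif_deleted(self, instance, vif_id):
      try:
          # Refresh the guest from the hypervisor instead of reusing
          # a stale guest object.
          guest = self._host.get_guest(instance)
          guest.detach_device(self._get_interface_cfg(vif_id),
                              live=True)
      except exception.InstanceNotFound:
          # The guest is already gone; nothing left to detach.
          LOG.info('Instance disappeared while detaching interface',
                   instance=instance)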
Change-Id: Ic4c870cc5078d3f7ac6b2f96f8904c2a47de418e
Closes-Bug: #1797966
Recently we had a customer case where attempts to add new ironic nodes
to an existing undercloud resulted in half of the nodes failing to be
detected and added to nova. Ironic API returned all of the newly added
nodes when called by the driver, but half of the nodes were not
returned to the compute manager by the driver.
There was only one nova-compute service managing all of the ironic
nodes of the typical all-in-one undercloud deployment.
After days of investigation and examination of a database dump from the
customer, we noticed that at some point the customer had changed the
hostname of the machine from something containing uppercase letters to
the same name but all lowercase. The nova-compute service record had
the mixed case name and the CONF.host (socket.gethostname()) had the
lowercase name.
The hash ring logic adds all of the nova-compute service hostnames
plus CONF.host to the hash ring, then the ironic driver reports only
the nodes it owns by retrieving a service hostname from the ring based
on a hash of each ironic node UUID.
Because of the machine hostname change, the hash ring contained, for
example: {'MachineHostName', 'machinehostname'} when it should have
contained only one hostname. And because the hash ring contained two
hostnames, the driver was able to retrieve only half of the nodes as
nodes that it owned. So half of the new nodes were excluded and not
added as new compute nodes.
This adds lowercasing of hosts that are added to the hash ring and
ignores case when comparing the CONF.host to the hash ring members
to avoid unnecessary pain and confusion for users that make hostname
changes that are otherwise functionally harmless.
This also adds logging of the set of hash ring members at DEBUG level
to make debugging of hash ring related situations easier.
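A sketch of the normalization (a minimal illustration, not the exact
driver code):

  from oslo_log import log as logging
  from tooz import hashring

  LOG = logging.getLogger(__name__)

  def build_hash_ring(service_hostnames, conf_host):
      # Lowercase everything so 'MachineHostName' and
      # 'machinehostname' map to a single ring member.
      members = {host.lower() for host in service_hostnames}
      members.add(conf_host.lower())
      LOG.debug('Hash ring members are %s', members)
      return hashring.HashRing(members)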
Closes-Bug: #1866380
Change-Id: I617fd59de327de05a198f12b75a381f21945afb0