Adding more tests for graceful shutdown:
- shut down the destination compute and see how live and cold migration
progress
- start building an instance and, once the compute starts building it,
shut down the compute service and see whether the build finishes
- revert resize server
Partial implement blueprint nova-services-graceful-shutdown-part1
Change-Id: I57132fb7b7fa614dfc138508581ff5a67aaed906
Signed-off-by: Ghanshyam Maan <gmaan.os14@gmail.com>
During graceful shutdown, the compute service keeps a 2nd RPC
server active which can be used to finish the in-progress
operations. Like live migration, resize and cold migration
also perform RPC calls between the source and destination computes.
For those operations too, we can use the 2nd RPC server and make
sure they complete during graceful shutdown.
A quick overview of which RPC methods are involved in
resize/cold migration and which of them will use the 2nd RPC server:
Resize/cold migration:
- prep_resize: No, resize/migration has not started yet.
- resize_instance: Yes, this is where the resize/migration starts.
- finish_resize: Yes
- cross-cell resize case:
  - prep_snapshot_based_resize_at_dest: No, this is an initial check and
    the migration has not started.
  - prep_snapshot_based_resize_at_source: Yes, this starts the migration.
Confirm resize:
- confirm_resize: No
- cross-cell confirm resize case:
  - confirm_snapshot_based_resize: No
Revert resize:
- revert_resize: No
- check_instance_shared_storage: Yes. This is called from the destination
  to the source, so we need the source to respond to it so that the
  revert can continue.
- finish_revert_resize on source: Yes. At this stage, the revert resize is
  in progress and abandoning it here can leave the migration in an
  unrecoverable state.
- cross-cell revert case:
  - revert_snapshot_based_resize_at_dest: No
  - finish_revert_snapshot_based_resize_at_source: Yes
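The decision table above can be sketched as a simple lookup. The method
names mirror the compute RPC API, but the dict and helper below are
purely illustrative, not Nova's actual code:

```python
# Illustrative encoding of the resize/cold-migration decision table:
# which RPC methods should be dispatched to the 2nd ("alt") RPC server
# so they survive a graceful shutdown.

USES_ALT_RPC_SERVER = {
    # resize / cold migration
    'prep_resize': False,                               # not started yet
    'resize_instance': True,                            # migration starts here
    'finish_resize': True,
    'prep_snapshot_based_resize_at_dest': False,        # initial check only
    'prep_snapshot_based_resize_at_source': True,       # starts the migration
    # confirm resize
    'confirm_resize': False,
    'confirm_snapshot_based_resize': False,
    # revert resize
    'revert_resize': False,
    'check_instance_shared_storage': True,              # dest -> source call
    'finish_revert_resize': True,
    'revert_snapshot_based_resize_at_dest': False,
    'finish_revert_snapshot_based_resize_at_source': True,
}


def rpc_topic_for(method, default_topic='compute', alt_topic='compute-alt'):
    """Return the topic a call should target during graceful shutdown."""
    return alt_topic if USES_ALT_RPC_SERVER.get(method, False) else default_topic
```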
Partial implement blueprint nova-services-graceful-shutdown-part1
Change-Id: If08b698d012a75b587144501d829403ec616f685
Signed-off-by: Ghanshyam Maan <gmaan.os14@gmail.com>
For graceful shutdown of the compute service, it will have two RPC
servers. One RPC server is used for new requests and will be stopped
during graceful shutdown; the 2nd RPC server (listening on the
'compute-alt' topic) will be used to complete the in-progress operations.
We select the operations (case by case) and their RPC methods to use
the 2nd RPC server so that they will not be interrupted when shutdown
is initiated, and graceful shutdown will keep the 2nd RPC server active
for graceful_shutdown_timeout. A new method 'prepare_for_alt_rpcserver'
is added which falls back to the first RPC server if it detects an old
compute.
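The fallback behaviour of 'prepare_for_alt_rpcserver' could look
roughly like this; the version constant and the client class are
hypothetical stand-ins for illustration, not Nova's implementation:

```python
# Sketch of the 'prepare_for_alt_rpcserver' fallback idea: target the
# 'compute-alt' topic only when the remote compute is new enough;
# otherwise keep using the primary RPC server. All names here are
# assumptions, not Nova's real objects.

ALT_RPC_MIN_VERSION = 66  # hypothetical minimum service version


class FakeClient:
    """Minimal stand-in for an RPC client that can be re-targeted."""

    def __init__(self, topic='compute'):
        self.topic = topic

    def prepare(self, topic):
        # Return a new client pinned to the given topic.
        return FakeClient(topic=topic)


def prepare_for_alt_rpcserver(client, remote_service_version):
    """Pin the client to 'compute-alt', or fall back to the first
    RPC server when the remote compute is too old."""
    if remote_service_version < ALT_RPC_MIN_VERSION:
        return client  # old compute: keep the primary RPC server
    return client.prepare(topic='compute-alt')
```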
As this has an upgrade impact, it bumps the compute/service version and
adds release notes for the same.
The list of operations that should use the 2nd RPC server will
eventually grow; this commit moves the operations below to use the 2nd
RPC server:
* Live migration
- Live migration: it uses the 2nd RPC server and will try to complete
  the operation during shutdown.
- live_migration_force_complete does not need to use the 2nd RPC server.
  It is a direct RPC request from the API to the compute, and if it is
  rejected during shutdown, that is fine; it can be initiated again
  once the compute is up.
- live_migration_abort does not need to use the 2nd RPC server. Ditto,
  it is a direct RPC request from the API to the compute. It cancels a
  queued live migration, but if the migration has already started, the
  driver cancels it. If it is rejected during shutdown because
  RPC is stopped, that is fine and it can be initiated again.
* server external event
* Get server console
As graceful shutdown cannot be tested in tempest, this adds a new job
to test it. Currently it tests the live migration operation, and it can
be extended to other operations that will use the 2nd RPC server.
Partial implement blueprint nova-services-graceful-shutdown-part1
Change-Id: I4de3afbcfaefbed909a29a831ac18060c4a73246
Signed-off-by: Ghanshyam Maan <gmaan.os14@gmail.com>
On Debian 13 (Trixie), libvirt packaging is modularized and
the libvirt-daemon-lock package (providing virtlockd) is
optional. The evacuate hook previously assumed all libvirt
services were installed and failed when stopping/starting
missing units.
Extract a reusable manage_libvirt_service.yaml task file that
checks if a service exists via systemctl list-unit-files
before managing its units. This prevents failures when
optional libvirt packages are not installed and future-proofs
against further packaging changes.
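The existence check can be illustrated in Python (the real change is an
Ansible task file; the parsing of `systemctl list-unit-files` output
below is a sketch, with invocation made injectable for testing):

```python
# Sketch of the check done before managing a unit: only act on units
# that systemd actually knows about, so optional packages such as
# libvirt-daemon-lock (virtlockd) do not cause failures when absent.
import subprocess


def unit_exists(unit, list_output=None):
    """Return True if `unit` appears in systemd's unit-file listing.

    `list_output` can be injected for testing; otherwise
    `systemctl list-unit-files <unit>` is invoked directly.
    """
    if list_output is None:
        proc = subprocess.run(['systemctl', 'list-unit-files', unit],
                              capture_output=True, text=True)
        list_output = proc.stdout
    return any(line.split()[:1] == [unit]
               for line in list_output.splitlines())
```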
Generated-By: claude-code
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change-Id: Ie84e2e8ab2d3065b1562ee5e256fa163541955f7
Signed-off-by: Sean Mooney <work@seanmooney.info>
A while back, change I52046e6f7acdfb20eeba67dda59cbb5169e5d17e disabled
cinder in the nova-ovs-hybrid-plug job and added checks for cinder
before attempting to run evacuate BFV tests.
Resource setup for BFV was, however, not bypassed, and the attempt to
set up a BFV server resource fails with:
keystoneauth1.exceptions.catalog.EndpointNotFound: publicURL endpoint
for volumev3 service not found
This adds a bypass to avoid attempting to create a BFV server when
cinder is not available.
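The bypass amounts to a guard like the following; the catalog shape and
function names are simplified assumptions for illustration:

```python
# Hypothetical sketch of the bypass: skip boot-from-volume (BFV) server
# setup when no volumev3 (cinder) service is in the service catalog,
# avoiding the EndpointNotFound failure seen in the job.

def cinder_available(catalog):
    """Return True if a volumev3 service is present in the catalog."""
    return any(svc.get('type') == 'volumev3' for svc in catalog)


def maybe_create_bfv_server(catalog, create_fn):
    """Create the BFV server only when cinder is available."""
    if not cinder_available(catalog):
        return None  # bypass: cinder is disabled in this job
    return create_fn()
```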
Change-Id: I52c7e5ce268bb38cee16c18c5523fe0e224970aa
Modifies the nova-ovs-hybrid-plug job to disable cinder and swift to
ensure we test for this going forward.
Change-Id: I52046e6f7acdfb20eeba67dda59cbb5169e5d17e
Recently a change landed in devstack [1] to install packages into a
global venv by default and the "nova" command was not symlinked for
compat, so jobs using run-evacuate-hook are failing with:
nova: command not found
We had intended to switch away from using novaclient CLI commands in our
scripts anyway, so we can just use this opportunity to switch to OSC.
[1]: If9bc7ba45522189d03f19b86cb681bb150ee2f25
Change-Id: Ifd969b84a99a9c0460bceb1a28fcee6e51cbb4ae
There's no q-agt service in an OVN deployment.
Change-Id: Ia25c966c70542bcd02f5540b5b94896c17e49888
Signed-off-by: Lucas Alvares Gomes <lucasagomes@gmail.com>
libvirtd was being restarted on the controller during negative
evacuation tests that rely on the service being stopped to cause an
evacuation failure.
This change adds various virt services to the list of services stopped
and now disabled on the host to ensure these don't cause systemd to
restart libvirtd:
* virtlogd.service
* virtlogd-admin.socket
* virtlogd.socket
* virtlockd.service
* virtlockd-admin.socket
* virtlockd.socket
Closes-Bug: #1903979
Change-Id: Ic83252bbda76c205bcbf0eef184ce0b201e224fc
The recent switch to Focal introduced a change in behaviour for the
libvirtd service that can now be restarted through new systemd socket
services associated with it once stopped. As we need it to remain
stopped during the initial negative evacuation tests on the controller
we now need to also stop these socket services and then later restart
them.
Change-Id: I2333872670e9e6c905efad7461af4d149f8216b6
This change reworks the evacuation parts of the original
nova-live-migration job into a zuulv3 native ansible role and initial
job covering local ephemeral and iSCSI/LVM volume attached instance
evacuation. Future jobs will cover ceph and other storage backends.
Change-Id: I380e9ca1e6a84da2b2ae577fb48781bf5c740e23
This changes the nova-multi-cell job to essentially
force cross-cell resize and cold migration. By "force"
I mean there is only one compute in each cell and
resize to the same host is disabled, so the scheduler
has no option but to move the server to the other cell.
This adds a new role to write the nova policy.yaml file
to enable cross-cell resize and a pre-run playbook so
that the policy file is set up before tempest runs.
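The policy file the role writes might look like this; the rule name
follows the cross-cell-resize blueprint's policy, but treat the exact
name and value as assumptions:

```yaml
# Hypothetical policy.yaml for the job; rule name/value assumed.
"compute:servers:allow_cross_cell_resize": ""
```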
Part of blueprint cross-cell-resize
Change-Id: Ia4f3671c40e69674afc7a96b5d9b198dabaa4224
For the most part this should be a pretty straight-forward
port of the run.yaml. The most complicated thing is executing
the post_test_hook.sh script. For that, a new post-run playbook
and role are added.
The relative path to devstack scripts in post_test_hook.sh itself
had to drop the 'new' directory: since we are no longer executing
the script through devstack-gate, the 'new' path does not
exist.
Change-Id: Ie3dc90862c895a8bd9bff4511a16254945f45478