Use 2nd RPC server in compute operations

For graceful shutdown, the compute service will have two RPC servers.
One RPC server handles new requests and is stopped during graceful
shutdown; the 2nd RPC server (listening on the 'compute-alt' topic)
is used to complete the in-progress operations.

We select the operations (case by case) and the RPC methods that use
the 2nd RPC server so that they are not interrupted when shutdown is
initiated; graceful shutdown keeps the 2nd RPC server active for
graceful_shutdown_timeout. A new method 'prepare_for_alt_rpcserver'
is added, which falls back to the first RPC server if it detects an
old compute.
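
The fallback described above can be pictured with a small sketch. This is
not the Nova implementation: the function name pick_topic, the version
threshold, and the wants_alt flag are hypothetical and chosen only for
illustration. Only the 'compute-alt' topic name comes from this commit
message; 'compute' is assumed to be the first server's topic.

```python
# Illustrative sketch only; not code from this change.
PRIMARY_TOPIC = "compute"      # assumed topic of the first RPC server
ALT_TOPIC = "compute-alt"      # topic named in the commit message
MIN_ALT_RPC_VERSION = 70       # hypothetical service version threshold


def pick_topic(target_service_version: int, wants_alt: bool) -> str:
    """Return the topic an RPC request should be sent to.

    wants_alt marks an operation selected to survive graceful shutdown.
    An old compute (below the hypothetical threshold) has no 2nd RPC
    server, so the request falls back to the first server's topic.
    """
    if wants_alt and target_service_version >= MIN_ALT_RPC_VERSION:
        return ALT_TOPIC
    return PRIMARY_TOPIC


if __name__ == "__main__":
    # An upgraded compute receives the selected operation on the 2nd server.
    print(pick_topic(MIN_ALT_RPC_VERSION, wants_alt=True))
    # An old compute receives the same operation on the first server.
    print(pick_topic(MIN_ALT_RPC_VERSION - 1, wants_alt=True))
```

Keeping the decision per request, rather than per deployment, is one way
to let old and new computes coexist during a rolling upgrade.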

As this has upgrade impact, it bumps the compute/service version and
adds a release note for the same.

The list of operations that should use the 2nd RPC server will grow
eventually; this commit moves the operations below to the 2nd RPC
server:

* Live migration

  - Live migration: It uses the 2nd RPC server and will try to complete
    the operation during shutdown.
  - live_migration_force_complete does not need to use the 2nd RPC
    server. It is a direct RPC request from the API to compute; if it
    is rejected during shutdown, that is fine and it can be initiated
    again once compute is up.
  - live_migration_abort does not need to use the 2nd RPC server.
    Ditto, it is a direct RPC request from the API to compute. It
    cancels the queued live migration, but if the migration has already
    started, the driver cancels the migration. If it is rejected during
    shutdown because RPC is stopped, that is fine and it can be
    initiated again.

* Server external event
* Get server console

As graceful shutdown cannot be tested in tempest, this adds a new job
to test it. Currently it tests the live migration operation, and it can
be extended to other operations that will use the 2nd RPC server.

Partial implement blueprint nova-services-graceful-shutdown-part1

Change-Id: I4de3afbcfaefbed909a29a831ac18060c4a73246
Signed-off-by: Ghanshyam Maan <gmaan.os14@gmail.com>
Ghanshyam Maan
2026-02-19 02:48:45 +00:00
parent 4bce4480b9
commit d5ffb58a8d
14 changed files with 510 additions and 22 deletions
@@ -0,0 +1 @@
Run Nova graceful shutdown tests and verify the operations.
@@ -0,0 +1,47 @@
#!/bin/bash
source /opt/stack/devstack/openrc admin
set -x
set -e

confirm_resize() {
    local server=$1

    echo "Confirming resize on ${server}"
    openstack server resize confirm "${server}"
    count=0
    while true; do
        status=$(openstack server show "${server}" -f value -c status 2>/dev/null || echo "NOT_FOUND")
        if [ "${status}" == "ACTIVE" ] || [ "${status}" == "ERROR" ]; then
            break
        fi
        sleep 5
        count=$((count+1))
        if [ ${count} -eq 10 ]; then
            echo "Timed out waiting for ${server} to be ACTIVE or Error after confirm resize"
            break
        fi
    done
}

cleanup_server() {
    local server=$1

    status=$(openstack server show "${server}" -f value -c status 2>/dev/null || echo "NOT_FOUND")
    if [ "${status}" == "VERIFY_RESIZE" ]; then
        confirm_resize "${server}"
    fi
    status=$(openstack server show "${server}" -f value -c status 2>/dev/null || echo "NOT_FOUND")
    if [ "${status}" == "ACTIVE" ] || [ "${status}" == "ERROR" ]; then
        echo "Deleting ${server} (status: ${status})"
        openstack server delete --wait "${server}"
    else
        echo "Skipping ${server} deletion (status: ${status})"
    fi
}

for server in "$@"; do
    cleanup_server "${server}"
done
@@ -0,0 +1,39 @@
#!/bin/bash
set -x
set -e

COMPUTE_HOST=$1
EXPECTED_STATE=${2:-active}

get_service_status() {
    local host=$1
    local status

    status=$(ssh "${host}" systemctl is-active devstack@n-cpu || true)
    echo "${status}"
}

wait_for_service_state() {
    local host=$1
    local expected=$2
    local timeout=${3:-30}
    local count=0
    local status

    status=$(get_service_status "${host}")
    while [ "${status}" != "${expected}" ]; do
        sleep 5
        count=$((count+1))
        if [ ${count} -eq ${timeout} ]; then
            echo "Timed out waiting for compute service on ${host} to be ${expected} (current: ${status})"
            exit 5
        fi
        status=$(get_service_status "${host}")
    done
    echo "Compute service on ${host} is ${expected}"
}

if [ "${EXPECTED_STATE}" == "active" ] && [ "$(get_service_status "${COMPUTE_HOST}")" != "active" ]; then
    ssh "${COMPUTE_HOST}" sudo systemctl start devstack@n-cpu
fi
wait_for_service_state "${COMPUTE_HOST}" "${EXPECTED_STATE}"
@@ -0,0 +1,49 @@
#!/bin/bash
source /opt/stack/devstack/openrc admin
set -x
set -e

timeout=196
server_lm=$1

image_id=$(openstack image list -f value -c ID | awk 'NR==1{print $1}')
flavor_id=$(openstack flavor list -f value -c ID | awk 'NR==1{print $1}')
network_id=$(openstack network list --no-share -f value -c ID | awk 'NR==1{print $1}')

echo "Creating test server on subnode for graceful shutdown live migration test"
openstack --os-compute-api-version 2.74 server create --image ${image_id} --flavor ${flavor_id} \
    --nic net-id=${network_id} --host ${SUBNODE_HOSTNAME} --wait ${server_lm}

echo "Starting live migration of ${server_lm} to ${CONTROLLER_HOSTNAME}"
openstack server migrate --live-migration \
    --host ${CONTROLLER_HOSTNAME} ${server_lm}

# Wait for the migration to be in progress before returning so that the
# SIGTERM can be sent while the migrations are in progress.
count=0
while true; do
    migration_status=$(openstack server migration list ${server_lm} \
        -f value -c Status 2>/dev/null | head -1)
    server_status=$(openstack server show ${server_lm} \
        -f value -c status 2>/dev/null)
    task_state=$(openstack server show ${server_lm} \
        -f value -c OS-EXT-STS:task_state 2>/dev/null)
    if [ "${migration_status}" == "preparing" ] || \
       [ "${migration_status}" == "running" ] || \
       [ "${task_state}" == "migrating" ]; then
        echo "Live migration is in progress (status: ${migration_status}, task_state: ${task_state})"
        break
    elif [ "${migration_status}" == "completed" ] || \
         { [ "${server_status}" == "ACTIVE" ] && \
           { [ "${task_state}" == "None" ] || [ -z "${task_state}" ]; }; }; then
        echo "Live migration has already completed"
        exit 2
    fi
    count=$((count+1))
    if [ ${count} -eq ${timeout} ]; then
        echo "Timed out waiting for migrations to start"
        exit 2
    fi
done
@@ -0,0 +1,45 @@
#!/bin/bash
source /opt/stack/devstack/openrc admin
set -x
set -e

server=$1

# Wait for the server to finish live migration and become ACTIVE with
# no task_state, which indicates the migration has completed.
timeout=360
count=0
migration_start=$(date +%s)
while true; do
    status=$(openstack server show ${server} -f value -c status)
    task_state=$(openstack server show ${server} -f value -c OS-EXT-STS:task_state)
    if [ "${status}" == "ACTIVE" ] && { [ "${task_state}" == "None" ] || [ -z "${task_state}" ]; }; then
        migration_end=$(date +%s)
        migration_duration=$((migration_end - migration_start))
        echo "Migration is completed in ${migration_duration} seconds."
        break
    fi
    if [ "${status}" == "ERROR" ]; then
        echo "Server went to ERROR status during live migration"
        exit 3
    fi
    sleep 5
    count=$((count+1))
    if [ ${count} -eq ${timeout} ]; then
        echo "Timed out waiting for live migration to complete"
        exit 5
    fi
done

# Make sure the server moved to the controller.
host=$(openstack server show ${server} -f value -c OS-EXT-SRV-ATTR:host)
if [[ ${host} != ${CONTROLLER_HOSTNAME} ]]; then
    echo "Unexpected host ${host} for server after live migration during graceful shutdown."
    exit 4
fi
echo "Live migration during graceful shutdown completed successfully"
echo "Server ${server} is ACTIVE on ${host}"
@@ -0,0 +1,56 @@
- name: Graceful shutdown source compute live migration
  block:
    - name: Start live migrations of test servers
      become: true
      become_user: stack
      script: "start_live_migration.sh server-lm1"
      environment:
        SUBNODE_HOSTNAME: "{{ hostvars['compute1']['ansible_hostname'] }}"
        CONTROLLER_HOSTNAME: "{{ hostvars['controller']['ansible_hostname'] }}"
      register: start_live_migrations_result
      failed_when: start_live_migrations_result.rc not in [0, 2]

    - name: Set fact if migrations completed or timed out before SIGTERM to source compute
      set_fact:
        live_migrations_completed_or_timeout: "{{ start_live_migrations_result.rc == 2 }}"

    - name: Run graceful shutdown tests
      when: not live_migrations_completed_or_timeout
      block:
        - name: Send SIGTERM to source compute to start the source compute graceful shutdown
          delegate_to: compute1
          become: true
          shell: "kill -15 $(systemctl show devstack@n-cpu -p MainPID --value)"

        - name: Verify live migration is completed during graceful shutdown
          become: true
          become_user: stack
          script: "verify_live_migration.sh server-lm1"
          environment:
            CONTROLLER_HOSTNAME: "{{ hostvars['controller']['ansible_hostname'] }}"

        # Sleep for 180 sec: default graceful_shutdown_timeout
        - name: Sleep for 180 seconds to allow source compute graceful shutdown to complete
          pause:
            seconds: 180

        - name: Verify compute service is stopped after graceful shutdown
          become: true
          become_user: stack
          script: "start_and_verify_compute_service.sh {{ hostvars['compute1']['ansible_hostname'] }} inactive"

    - name: Start and verify subnode compute service is running
      become: true
      become_user: stack
      script: "start_and_verify_compute_service.sh {{ hostvars['compute1']['ansible_hostname'] }}"

    - name: Cleanup test servers
      become: true
      become_user: stack
      script: "cleanup_test_servers.sh server-lm1"
      ignore_errors: true

- name: Fail if any test is skipped
  fail:
    msg: "One or more test is skipped due to operation is either completed or timed out before SIGTERM signal."
  when: live_migrations_completed_or_timeout