nova/releasenotes/notes/nova-services-graceful-shutdown-564a321e2769152d.yaml
Ghanshyam Maan 996c4ff9e8 Prepare resize/cold migration for graceful shutdown
During graceful shutdown, the compute service keeps a 2nd RPC
server active, which can be used to finish in-progress
operations. Like live migration, resize and cold migration
also make RPC calls between the source and destination computes.
These operations can also use the 2nd RPC server to make
sure they complete during graceful shutdown.

A quick overview of the RPC methods involved in resize/cold
migration and which of them use the 2nd RPC server:

Resize/cold migration:
- prep_resize: No, resize/migration is not started yet.
- resize_instance: Yes, here the resize/migration starts.
- finish_resize: Yes
- cross cell resize case:
  - prep_snapshot_based_resize_at_dest: No, this is the initial check and
    the migration is not started yet
  - prep_snapshot_based_resize_at_source: Yes, this starts the migration

Confirm resize: No
- confirm_resize: No
- cross cell confirm resize case:
  - confirm_snapshot_based_resize: No

Revert resize:
- revert_resize: No
- check_instance_shared_storage: Yes. This is called from dest to source,
  so the source must respond to it for the revert to continue.
- finish_revert_resize on source: Yes. At this stage, the revert resize is
  in progress and abandoning it here can leave the migration in an
  unrecoverable state.
- cross cell revert case:
  - revert_snapshot_based_resize_at_dest: No
  - finish_revert_snapshot_based_resize_at_source: Yes
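
The overview above can be read as a simple lookup table. The following is a minimal sketch of that table, not Nova's actual implementation: the names `USES_SECOND_RPC_SERVER` and `uses_second_rpc_server` are hypothetical, and the values only restate the Yes/No entries listed in this commit message.

```python
# Hypothetical lookup table restating the overview above; this is not Nova code.
USES_SECOND_RPC_SERVER = {
    # Resize / cold migration
    "prep_resize": False,  # resize/migration has not started yet
    "resize_instance": True,  # the resize/migration starts here
    "finish_resize": True,
    "prep_snapshot_based_resize_at_dest": False,  # initial check only
    "prep_snapshot_based_resize_at_source": True,  # starts the migration
    # Confirm resize
    "confirm_resize": False,
    "confirm_snapshot_based_resize": False,
    # Revert resize
    "revert_resize": False,
    "check_instance_shared_storage": True,  # source must answer the dest
    "finish_revert_resize": True,  # revert is already in progress
    "revert_snapshot_based_resize_at_dest": False,
    "finish_revert_snapshot_based_resize_at_source": True,
}


def uses_second_rpc_server(method: str) -> bool:
    """Return True if the overview marks ``method`` as using the 2nd RPC server.

    Raises KeyError for methods the overview does not mention.
    """
    return USES_SECOND_RPC_SERVER[method]
```

For example, `uses_second_rpc_server("resize_instance")` is true, while `uses_second_rpc_server("prep_resize")` is false because the migration has not started at that point.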

Partial implement blueprint nova-services-graceful-shutdown-part1

Change-Id: If08b698d012a75b587144501d829403ec616f685
Signed-off-by: Ghanshyam Maan <gmaan.os14@gmail.com>
2026-02-25 20:36:07 +00:00


---
features:
  - |
    Nova services now support graceful shutdown on ``SIGTERM``. When a service
    receives ``SIGTERM``, it stops accepting new RPC requests and waits for
    in-progress tasks to reach a safe termination point.

    The compute service creates a second RPC server on a ``compute-alt`` topic
    which remains active during graceful shutdown, allowing the compute service
    to finish in-progress tasks.
    Currently, the following operations use the second RPC server:

    * Live migration
    * Cold migration
    * Resize
    * Revert resize
    * Server external event
    * Get console output

    Two configuration options control this behavior:

    * ``[DEFAULT]/graceful_shutdown_timeout`` - The overall time the service
      waits before forcefully exiting. It defaults to 180 seconds for every
      Nova service.
    * ``[DEFAULT]/manager_shutdown_timeout`` - The time the service manager
      waits for in-progress tasks to complete during graceful shutdown. It
      defaults to 160 seconds for each service manager and must be less
      than ``graceful_shutdown_timeout``.

    You can increase these timeouts based on the traffic and on how long
    long-running tasks (e.g. live migrations) take in your deployment.

    We plan to improve graceful shutdown in future releases by adding task
    tracking and transitioning resources to a recoverable state. Until then,
    this feature is experimental.
upgrade:
  - |
    The default value of ``[DEFAULT]/graceful_shutdown_timeout`` has been
    changed from 60 to 180 seconds for all Nova services. This means that
    when a Nova service receives ``SIGTERM``, it will now wait up to 180
    seconds for a graceful shutdown before being forcefully terminated.
    Operators using an external system (e.g. k8s, systemd) to manage the
    Nova services should ensure that their service stop timeouts are set
    to at least ``graceful_shutdown_timeout`` to avoid forcefully killing
    a service before Nova finishes its graceful shutdown. For example, the
    systemd ``TimeoutStopSec`` should be set to at least 180 seconds for
    Nova services.
  - |
    A new configuration option ``[DEFAULT]/manager_shutdown_timeout`` has been
    added with a default value of 160 seconds. This controls how long the
    service manager waits for in-progress tasks to finish during graceful
    shutdown. Operators may want to tune this value based on how long their
    typical long-running operations (e.g. live migrations) take to complete.
  - |
    The compute service now creates a second RPC server on the ``compute-alt``
    topic. This means each compute worker will create an additional RabbitMQ
    queue.
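
The release note above states two ordering constraints: ``manager_shutdown_timeout`` must be less than ``graceful_shutdown_timeout``, and the external stop timeout (for example the systemd ``TimeoutStopSec``) should be at least ``graceful_shutdown_timeout``. A deployment check for these constraints could be sketched as follows. This is an illustrative helper, not part of Nova: the function name `check_shutdown_timeouts` and its arguments are assumptions, and it parses the configuration with Python's standard `configparser` rather than `oslo.config`.

```python
import configparser

# Defaults as stated in the release note above.
DEFAULT_GRACEFUL_SHUTDOWN_TIMEOUT = 180
DEFAULT_MANAGER_SHUTDOWN_TIMEOUT = 160


def check_shutdown_timeouts(nova_conf_text: str, stop_timeout_sec: int) -> list[str]:
    """Return a list of problems; an empty list means the values are consistent.

    ``nova_conf_text`` is the content of a nova.conf-style INI file.
    ``stop_timeout_sec`` is the stop timeout of the external service manager
    in seconds (hypothetical input, e.g. the systemd TimeoutStopSec value).
    """
    # Disable interpolation so '%' characters in other options are harmless.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(nova_conf_text)

    graceful = parser.getint(
        "DEFAULT", "graceful_shutdown_timeout",
        fallback=DEFAULT_GRACEFUL_SHUTDOWN_TIMEOUT)
    manager = parser.getint(
        "DEFAULT", "manager_shutdown_timeout",
        fallback=DEFAULT_MANAGER_SHUTDOWN_TIMEOUT)

    problems = []
    if manager >= graceful:
        problems.append(
            f"manager_shutdown_timeout ({manager}) must be less than "
            f"graceful_shutdown_timeout ({graceful})")
    if stop_timeout_sec < graceful:
        problems.append(
            f"external stop timeout ({stop_timeout_sec}) should be at least "
            f"graceful_shutdown_timeout ({graceful})")
    return problems
```

With both options left at their defaults, a stop timeout of 180 seconds passes the check, while the previous default of 60 seconds would be reported as too short.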