Use 2nd RPC server in compute operations

For graceful shutdown, the compute service will have two RPC servers. One RPC server is used for new requests and is stopped during graceful shutdown. The 2nd RPC server (listening on the 'compute-alt' topic) is used to complete in-progress operations.

We select the operations (case by case) and their RPC methods to use the 2nd RPC server so that they are not interrupted when shutdown is initiated. Graceful shutdown keeps the 2nd RPC server active for graceful_shutdown_timeout.

A new method 'prepare_for_alt_rpcserver' is added, which falls back to the first RPC server if it detects an old compute. As this has upgrade impact, it bumps the compute service version and adds release notes for the same.

The list of operations that should use the 2nd RPC server will grow over time. This commit moves the operations below to use the 2nd RPC server:

* Live migration
  - Live migration: it uses the 2nd RPC server and will try to complete the operation during shutdown.
  - live_migration_force_complete does not need to use the 2nd RPC server. It is a direct RPC request from the API to compute; if it is rejected during shutdown, that is fine and it can be initiated again once compute is up.
  - live_migration_abort does not need to use the 2nd RPC server. Ditto, it is a direct RPC request from the API to compute. It cancels a queued live migration, but if the migration has already started, the driver cancels the migration. If it is rejected during shutdown because RPC is stopped, that is fine and it can be initiated again.
* Server external event
* Get server console

As graceful shutdown cannot be tested in tempest, this adds a new job to test it. Currently it tests the live migration operation, and it can be extended to other operations that will use the 2nd RPC server.

Partial implement blueprint nova-services-graceful-shutdown-part1

Change-Id: I4de3afbcfaefbed909a29a831ac18060c4a73246
Signed-off-by: Ghanshyam Maan <gmaan.os14@gmail.com>
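
The fallback to the first RPC server for old computes can be illustrated with a minimal sketch. This is not the code from this change: the function name `select_compute_topic`, the constant names, and the version number are all hypothetical and only show the idea of choosing the topic from the target compute's service version.

```python
# Illustrative sketch only. All names and the version number below are
# assumptions made for this example, not Nova's actual implementation.

BASE_TOPIC = "compute"
ALT_TOPIC = "compute-alt"

# Hypothetical minimum service version at which a compute is known to
# run the 2nd RPC server.
MIN_VERSION_WITH_ALT_RPC_SERVER = 100


def select_compute_topic(target_service_version: int) -> str:
    """Pick the RPC topic for an operation that should survive shutdown.

    New enough computes listen on the alternate topic. Older computes
    only have the first RPC server, so the caller falls back to it.
    """
    if target_service_version >= MIN_VERSION_WITH_ALT_RPC_SERVER:
        return ALT_TOPIC
    return BASE_TOPIC
```

With these assumed values, a compute reporting version 100 or later would receive the request on `compute-alt`, while an older compute would still receive it on `compute`.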
2026-02-19 02:48:45 +00:00
parent 4bce4480b9
commit d5ffb58a8d
14 changed files with 510 additions and 22 deletions
@@ -0,0 +1,55 @@
+---
+features:
+  - |
+    Nova services now support graceful shutdown on ``SIGTERM``. When a service
+    receives ``SIGTERM``, it will stop accepting new RPC requests and wait for
+    in-progress tasks to reach a safe termination point.
+
+    The compute service creates a second RPC server on the ``compute-alt``
+    topic, which remains active during graceful shutdown, allowing the compute
+    service to finish in-progress tasks.
+
+    Currently, the following operations use the second RPC server:
+
+    * Live migration
+    * Server external event
+    * Get console output
+
+    Two configuration options control this behavior:
+
+    * ``[DEFAULT]/graceful_shutdown_timeout`` - The overall time the service
+      waits before forcefully exiting. This defaults to 180 seconds for each
+      Nova service.
+    * ``[DEFAULT]/manager_shutdown_timeout`` - The time the service manager
+      waits for in-progress tasks to complete during graceful shutdown. This
+      defaults to 160 seconds for each service manager. This must be less
+      than ``graceful_shutdown_timeout``.
+
+    You can increase these timeouts based on the traffic and how long
+    long-running tasks (e.g. live migrations) take in your deployment.
+
+    We plan to improve graceful shutdown in future releases by adding task
+    tracking and transitioning resources to a recoverable state. Until then,
+    this feature is experimental.
+upgrade:
+  - |
+    The default value of ``[DEFAULT]/graceful_shutdown_timeout`` has been
+    changed from 60 to 180 seconds for all Nova services. This means that
+    when a Nova service receives ``SIGTERM``, it will now wait up to 180
+    seconds for a graceful shutdown before being forcefully terminated.
+    Operators using an external system (e.g. k8s, systemd) to manage the
+    Nova services should ensure that their service stop timeouts are set
+    to at least ``graceful_shutdown_timeout`` to avoid forcefully killing
+    a service before Nova finishes its graceful shutdown. For example, the
+    systemd ``TimeoutStopSec`` should be set to at least 180 seconds for
+    Nova services.
+  - |
+    A new configuration option ``[DEFAULT]/manager_shutdown_timeout`` has been
+    added with a default value of 160 seconds. This controls how long the
+    service manager waits for in-progress tasks to finish during graceful
+    shutdown. Operators may want to tune this value based on how long their
+    typical long-running operations (e.g. live migrations) take to complete.
+  - |
+    The compute service now creates a second RPC server on the ``compute-alt``
+    topic. This means each compute worker will create an additional RabbitMQ
+    queue.
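
The release note above states two constraints between the timeouts: ``manager_shutdown_timeout`` must be less than ``graceful_shutdown_timeout``, and an external service manager's stop timeout should be at least ``graceful_shutdown_timeout``. A deployment sanity check for these could look roughly like the sketch below. The helper `check_shutdown_timeouts` and its parameter `external_stop_timeout` are hypothetical names for illustration and are not part of Nova. The defaults mirror the values quoted in the note.

```python
from typing import List, Optional


def check_shutdown_timeouts(
    graceful_shutdown_timeout: int = 180,
    manager_shutdown_timeout: int = 160,
    external_stop_timeout: Optional[int] = None,
) -> List[str]:
    """Return human-readable problems with the given timeouts (seconds).

    external_stop_timeout stands for whatever an external service
    manager uses, for example systemd's TimeoutStopSec. Pass None to
    skip that check.
    """
    problems = []
    # The manager must finish waiting before the overall timeout fires.
    if manager_shutdown_timeout >= graceful_shutdown_timeout:
        problems.append(
            "manager_shutdown_timeout must be less than "
            "graceful_shutdown_timeout"
        )
    # The external manager must not kill the service before Nova is done.
    if (
        external_stop_timeout is not None
        and external_stop_timeout < graceful_shutdown_timeout
    ):
        problems.append(
            "external stop timeout should be at least "
            "graceful_shutdown_timeout"
        )
    return problems
```

For example, `check_shutdown_timeouts(external_stop_timeout=90)` reports one problem, because a 90 second stop timeout would terminate the service before the 180 second graceful shutdown window ends.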