996c4ff9e8
During graceful shutdown, the compute service keeps a second RPC
server active, which can be used to finish in-progress
operations. Like live migration, resize and cold migration
also make RPC calls between the source and destination computes.
For those operations too, we can use the second RPC server to make
sure they complete during graceful shutdown.

A quick overview of which RPC methods are involved in
resize/cold migration and which of them use the second RPC server:
Resize/cold migration:
- prep_resize: No, the resize/migration has not started yet.
- resize_instance: Yes, this is where the resize/migration starts.
- finish_resize: Yes
- cross-cell resize case:
  - prep_snapshot_based_resize_at_dest: No, this is the initial check
    and the migration has not started.
  - prep_snapshot_based_resize_at_source: Yes, this starts the
    migration.

Confirm resize: No
- confirm_resize: No
- cross-cell confirm resize case:
  - confirm_snapshot_based_resize: No

Revert resize:
- revert_resize: No
- check_instance_shared_storage: Yes. This is called from the
  destination to the source, so the source must respond to it for the
  revert to continue.
- finish_revert_resize on source: Yes. At this stage the revert resize
  is in progress, and abandoning it here can leave the migration in an
  unrecoverable state.
- cross-cell revert case:
  - revert_snapshot_based_resize_at_dest: No
  - finish_revert_snapshot_based_resize_at_source: Yes
Partial implement blueprint nova-services-graceful-shutdown-part1
Change-Id: If08b698d012a75b587144501d829403ec616f685
Signed-off-by: Ghanshyam Maan <gmaan.os14@gmail.com>
59 lines
2.7 KiB
YAML
---
features:
  - |
    Nova services now support graceful shutdown on ``SIGTERM``. When a service
    receives ``SIGTERM``, it stops accepting new RPC requests and waits for
    in-progress tasks to reach a safe termination point.

    The compute service creates a second RPC server on the ``compute-alt``
    topic, which remains active during graceful shutdown, allowing the
    compute service to finish in-progress tasks.

    Currently, the following operations use the second RPC server:

    * Live migration
    * Cold migration
    * Resize
    * Revert resize
    * Server external event
    * Get console output

    Nova provides two configuration options to control this behavior:

    * ``[DEFAULT]/graceful_shutdown_timeout`` - The overall time the service
      waits before forcefully exiting. This defaults to 180 seconds for each
      Nova service.
    * ``[DEFAULT]/manager_shutdown_timeout`` - The time the service manager
      waits for in-progress tasks to complete during graceful shutdown. This
      defaults to 160 seconds for each service manager. This must be less
      than ``graceful_shutdown_timeout``.

    You can increase these timeouts based on the traffic and on how long
    long-running tasks (e.g. live migrations) take in your deployment.

    We plan to improve graceful shutdown in future releases by adding task
    tracking and transitioning resources to a recoverable state. Until then,
    this feature is experimental.
upgrade:
  - |
    The default value of ``[DEFAULT]/graceful_shutdown_timeout`` has been
    changed from 60 to 180 seconds for all Nova services. This means that
    when a Nova service receives ``SIGTERM``, it will now wait up to 180
    seconds for a graceful shutdown before being forcefully terminated.
    Operators using an external system (e.g. k8s, systemd) to manage the
    Nova services should ensure that their service stop timeouts are set
    to at least ``graceful_shutdown_timeout`` to avoid forcefully killing
    a service before Nova finishes its graceful shutdown. For example, the
    systemd ``TimeoutStopSec`` should be set to at least 180 seconds
    for Nova services.
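
    As a minimal sketch of the systemd case, a drop-in such as the following
    could raise the stop timeout. The drop-in path, the unit name and the
    200 second value are assumptions for illustration, not Nova defaults:

    .. code-block:: ini

       # /etc/systemd/system/nova-compute.service.d/override.conf
       # (hypothetical path; unit names vary by distribution)
       [Service]
       # Keep this at or above [DEFAULT]/graceful_shutdown_timeout.
       TimeoutStopSec=200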
  - |
    A new configuration option ``[DEFAULT]/manager_shutdown_timeout`` has been
    added with a default value of 160 seconds. This controls how long the
    service manager waits for in-progress tasks to finish during graceful
    shutdown. Operators may want to tune this value based on how long their
    typical long-running operations (e.g. live migrations) take to complete.
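
    For illustration only, a deployment with long live migrations might set
    both timeouts in ``nova.conf`` as sketched below. The values are
    hypothetical. The only constraint taken from this note is that
    ``manager_shutdown_timeout`` stays below ``graceful_shutdown_timeout``:

    .. code-block:: ini

       [DEFAULT]
       # Overall time the service waits before forcefully exiting.
       graceful_shutdown_timeout = 600
       # Time the manager waits for in-progress tasks; keep this lower.
       manager_shutdown_timeout = 580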
  - |
    The compute service now creates a second RPC server on the ``compute-alt``
    topic. This means each compute worker will create an additional RabbitMQ
    queue.