.. Commit a877e0ed15: Partial implement blueprint
   nova-services-graceful-shutdown-part1
   Change-Id: I18bdb4b9ca2663b5fa1f88b715d27411827b1c45
   Signed-off-by: Ghanshyam Maan <gmaan.os14@gmail.com>

=================
Graceful Shutdown
=================

Nova services have experimental graceful shutdown support on ``SIGTERM``. When
a service worker implementing an RPC server receives ``SIGTERM``, that worker
stops accepting new RPC requests and waits for in-progress tasks to reach a
safe termination point before exiting. This reduces the risk of leaving
instances or instance migrations in an unwanted or unrecoverable state.
If a deployment runs multiple workers for the ``nova-conductor`` and
``nova-scheduler`` services, new requests are handled by the other workers.

.. important::

   The current implementation waits up to
   :oslo.config:option:`manager_shutdown_timeout` seconds for in-progress
   tasks to complete. A future release will replace this with a proper task
   tracking system. Until then, operations that do not complete within this
   timeout can be interrupted ungracefully, which can leave instances in an
   unwanted state.

How graceful shutdown works for nova-compute service
----------------------------------------------------

When ``nova-compute`` receives ``SIGTERM``, the following sequence occurs:

#. The primary RPC server (``compute`` topic) stops accepting new requests.
#. The secondary RPC server (``compute-alt`` topic) remains active and handles
   the RPC requests needed to finish in-progress tasks.
#. The service manager waits up to
   :oslo.config:option:`manager_shutdown_timeout` seconds for in-progress
   tasks to complete.
#. The secondary RPC server (``compute-alt`` topic) is stopped.
#. The service is stopped.

For ``nova-conductor`` and ``nova-scheduler``, the sequence is the same
except that there is only one RPC server, and further requests are handled
by their other workers.

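The ordering of the ``nova-compute`` sequence can be illustrated with a
minimal, self-contained Python sketch. This is not Nova's implementation:
``FakeRPCServer``, ``graceful_stop`` and the ``in_progress`` list of threads
are hypothetical names, used only to show the stop order and the bounded wait.

.. code-block:: python

   import threading
   import time


   class FakeRPCServer:
       """Hypothetical stand-in for an RPC server; records the stop order."""

       def __init__(self, topic, log):
           self.topic = topic
           self.log = log

       def stop(self):
           self.log.append(f"stop {self.topic}")


   def graceful_stop(primary, secondary, in_progress, manager_shutdown_timeout):
       """Stop primary, wait (bounded) for tasks, then stop secondary."""
       primary.stop()  # no new requests are accepted from here on
       deadline = time.monotonic() + manager_shutdown_timeout
       for task in in_progress:
           # The timeout is a total budget shared by all tasks; a value of
           # 0 makes every join return immediately.
           task.join(timeout=max(0.0, deadline - time.monotonic()))
       unfinished = [t for t in in_progress if t.is_alive()]
       secondary.stop()  # stopped only after the wait has ended
       return unfinished  # non-empty means some tasks were cut off


   log = []
   task = threading.Thread(target=time.sleep, args=(0.05,))
   task.start()
   left = graceful_stop(
       FakeRPCServer("compute", log),
       FakeRPCServer("compute-alt", log),
       [task],
       manager_shutdown_timeout=2,
   )

In this sketch, a non-empty return value corresponds to the case where tasks
did not complete within the timeout and would be interrupted.
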
The additional RabbitMQ queue for compute service
-------------------------------------------------

The ``nova-compute`` service maintains two RPC servers:

* **Primary server** (``compute`` topic): Handles all new incoming requests
  during normal operation. This server is stopped first when a shutdown begins.
* **Secondary server** (``compute-alt`` topic): Receives requests for
  long-running operations that need to continue and complete during shutdown.

Because of the second RPC server, each compute node has an additional RabbitMQ
queue named ``compute-alt.<hostname>``.

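To estimate how many queues this adds to a broker, the expected names can be
generated from a list of compute hostnames. The helper name and the hostnames
below are hypothetical; only the ``compute-alt.<hostname>`` pattern comes from
this document.

.. code-block:: python

   def additional_queue_names(hostnames):
       """Return the extra per-host queue names, one per compute node."""
       return [f"compute-alt.{host}" for host in hostnames]


   # Hypothetical hostnames, for illustration only.
   queues = additional_queue_names(["cmp-01", "cmp-02"])
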
Operations handled during shutdown
----------------------------------

The following operations use the secondary RPC server so that they can
complete during a graceful shutdown:

* Live migration
* Cold migration
* Revert resize
* Cross-cell resize
* External instance events
* Get console output

When the compute node's RPC version is older than 6.5, Nova automatically falls
back to sending all operations to the primary RPC server. The secondary RPC
server is not used in this case.

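The routing rule, including the fallback for older RPC versions, can be
sketched as follows. The function name and the operation labels are
hypothetical and are not Nova's RPC method names; the sketch assumes the RPC
version is available as a ``(major, minor)`` tuple.

.. code-block:: python

   # Hypothetical labels for the operations listed in this section.
   SECONDARY_TOPIC_OPERATIONS = frozenset({
       "live_migration",
       "cold_migration",
       "revert_resize",
       "cross_cell_resize",
       "external_instance_event",
       "get_console_output",
   })


   def pick_topic(operation, compute_rpc_version):
       """Choose the RPC topic for an operation sent to a compute node."""
       if compute_rpc_version < (6, 5):
           # Older compute nodes have no secondary server: use the primary.
           return "compute"
       if operation in SECONDARY_TOPIC_OPERATIONS:
           return "compute-alt"
       return "compute"

For example, ``pick_topic("live_migration", (6, 4))`` returns ``"compute"``,
which mirrors the fallback behaviour described above.
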
Configuration
-------------

Two configuration options control graceful shutdown behaviour. Both are in the
``[DEFAULT]`` section of the respective service's ``nova.conf``.

.. rubric:: :oslo.config:option:`graceful_shutdown_timeout`

The overall time the service waits before exiting forcefully. This defaults to
180 seconds for each Nova service.

If the service has not exited by this time, it is stopped immediately.
Operators using an external system (e.g. k8s, systemd) to manage the Nova
services should ensure that their service stop timeouts are set to at least
``graceful_shutdown_timeout``, to avoid forcefully killing a service before
Nova finishes its graceful shutdown.

.. rubric:: :oslo.config:option:`manager_shutdown_timeout`

This controls how long the service waits for in-progress tasks to finish during
graceful shutdown.

This defaults to 160 seconds for each service. It must be less than
``graceful_shutdown_timeout``.

Setting this option to ``0`` disables the wait entirely: the manager does not
wait for in-progress tasks before proceeding with shutdown.

Operators may want to set these options based on how long their typical
long-running operations (e.g. live migrations) take to complete.

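The relationships between these values can be checked before a deployment.
The following sketch uses Python's standard ``configparser`` module rather
than ``oslo.config``, so it only approximates how Nova reads ``nova.conf``.
The function name, the sample values (300 and 280) and the
``external_stop_timeout`` parameter (for example, a systemd or k8s stop
timeout in seconds) are assumptions made for illustration.

.. code-block:: python

   import configparser

   # Illustrative values only; the defaults are 180 and 160 seconds.
   SAMPLE_NOVA_CONF = """
   [DEFAULT]
   graceful_shutdown_timeout = 300
   manager_shutdown_timeout = 280
   """


   def check_shutdown_timeouts(conf_text, external_stop_timeout):
       """Return a list of problems found in the shutdown timeouts."""
       parser = configparser.ConfigParser(interpolation=None)
       parser.read_string(conf_text)
       graceful = parser.getint(
           "DEFAULT", "graceful_shutdown_timeout", fallback=180)
       manager = parser.getint(
           "DEFAULT", "manager_shutdown_timeout", fallback=160)

       problems = []
       if manager >= graceful:
           problems.append(
               "manager_shutdown_timeout must be less than "
               "graceful_shutdown_timeout")
       if external_stop_timeout < graceful:
           problems.append(
               "external stop timeout is shorter than "
               "graceful_shutdown_timeout")
       return problems


   problems = check_shutdown_timeouts(SAMPLE_NOVA_CONF, 300)

An empty list means the three values are consistent with the guidance in this
section.
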
Upgrade considerations
----------------------

* The default value of ``graceful_shutdown_timeout`` has been raised from 60
  seconds (the ``oslo.service`` default) to 180 seconds for all Nova services.
  If your service manager previously relied on the 60-second default, update
  its stop timeout to at least 180 seconds before upgrading.

* A new option, ``manager_shutdown_timeout``, has been added with a default of
  160 seconds. No action is required unless you want to change the value.

* The ``nova-compute`` service creates an additional RabbitMQ queue
  (``compute-alt.<hostname>``) on startup. Ensure your message broker has
  capacity for the additional queues.

* During a rolling upgrade where some compute nodes are still running an RPC
  version older than 6.5, Nova falls back to routing all operations through
  the primary ``compute`` queue. The graceful shutdown feature only works when
  all compute nodes have been upgraded.