Graceful Shutdown¶
Nova services have experimental graceful shutdown support on SIGTERM. When
a service worker implementing an RPC server receives SIGTERM, that worker
stops accepting new RPC requests and waits for in-progress tasks to reach a
safe termination point before exiting. This reduces the risk of leaving
instances or migrations of instances in an unwanted or unrecoverable state.
If deployment has the multiple worker for the nova-conductor and
nova-scheduler service, then new requests are handled by the other workers.
Important
The current implementation waits for the
manager_shutdown_timeout time for in-progress tasks
to complete. A future release will improve this by a proper task tracking
system. As a result operations can be interrupted ungracefully if they do
not complete within this timeout and can leave instances in a unwanted
state.
How graceful shutdown works for nova-compute service¶
When nova-compute receives SIGTERM, the following sequence occurs:
The primary RPC server (
computetopic) stops accepting new requests.The secondary RPC server (
compute-alttopic) still active and handles the RPC requests needed to finish in-progress tasks.The service manager waits up to
manager_shutdown_timeoutseconds for in-progress tasks to complete.The secondary RPC server (
compute-alttopic) is stopped.The service is stopped.
For nova-conductor and nova-scheduler, the sequence is the same
except there is only one RPC server and the further requests are handled
by their other workers.
The additional RabbitMQ queue for compute service¶
nova-compute service maintains two RPC servers:
Primary server (
computetopic): Handles all new incoming requests during normal operation. This server is stopped first when a shutdown begins.Secondary server (
compute-alttopic): Receives requests for long-running operations that to be continued and completed during shutdown
Because a second RPC server, each compute node will have an additional RabbitMQ
queue named compute-alt.<hostname>.
Operations handled during shutdown¶
The following operations use the secondary RPC server so that they will be allowed to complete during a graceful shutdown:
Live migration
Cold migration
Revert resize
Cross-cell resize
External instance events
Get console output
When the compute node’s RPC version is older than 6.5, Nova automatically falls back to sending all operations to the primary RPC server. The secondary RPC server is not used in this case.
Configuration¶
Two configuration options control graceful shutdown behaviour. Both are in the
[DEFAULT] section of nova.conf of respective service.
The overall time the service waits before forcefully exit. This is defaults to 180 seconds for each Nova services.
If the service is not exited by this time, the service is stopped
instantaneously. The operators using the external system (e.g. k8s, systemd) to
manage the Nova serviecs should ensure that their service stop timeouts are set
to at least graceful_shutdown_timeout to avoid forcefully killing service
before Nova finish its graceful shutdown.
This controls how long the service waits for in-progress tasks to finish during graceful shutdown.
This is defaults to 160 seconds for each service. This must be less than
graceful_shutdown_timeout
Setting this option to 0 disables the wait entirely: the manager does not
wait for in-progress tasks before proceeding with shutdown.
The operators may want to set the above config options value based on how long their typical long-running operations (e.g. live migrations) take to complete.
Upgrade considerations¶
The default value of
graceful_shutdown_timeouthas been raised from 60 seconds (theoslo.servicedefault) to 180 seconds for all Nova services. If your service manager previously relied on the 60-second default, update its stop timeout to at least 180 seconds before upgrading.A new option
manager_shutdown_timeouthas been added with a default of 160 seconds. No action is required unless you want to change the value.nova-computeservice creates an additional RabbitMQ queue (compute-alt.<hostname>) on startup. Ensure your message broker has capacity for the additional queues.During a rolling upgrade where some compute nodes are still running a version older than 6.5, Nova will fall back to routing all operations through the primary
computequeue. The graceful shutdown feature only works when all compute nodes have been upgraded.