Live-Migration per instance timeout

https://blueprints.launchpad.net/nova/+spec/live-migration-per-instance-timeout

Add a new microversion to the live-migrate API that lets a caller abort or force-complete a libvirt live-migration operation after a given timeout.

Problem description

Nova currently optimizes for limited guest downtime over ensuring that the live-migration operation always succeeds. This can make live-migration in Nova look much less “reliable” than the live-migration offered by other cloud and server virtualization systems.

A key observation is that the trade-off between guest liveness and how long you are willing to wait for a live-migration to complete is not the same for every instance, nor for every live-migration API call made on the same instance. If a failed live-migration means the guest has to stay on a host you are in the process of patching and rebooting, the guest will suffer significantly more downtime than it would from a small increase in the downtime allowed during live-migration.

With the current live-migrate API and config options, operators do not have fine-grained control over individual live-migrate operations. If they want to treat a particular live-migrate operation differently, they have to change the relevant config value and restart the compute services, which makes the live-migration experience very unpleasant.

Following the recent removal of the progress timeout, operators have told us that they would like to customize the timeout per live-migration operation. Based on the VM involved and the cost of not moving it, they can decide how long to wait. Similarly, they want to decide whether to abort after that timeout (avoiding the VM having any more downtime than libvirt.live_migration_downtime) or to force the live-migration to complete (allowing more downtime than libvirt.live_migration_downtime to ensure the VM moves).

Giving operators the ability to set a custom timeout per live-migration operation conflicts with some other configuration options. Nova tells libvirt to allow a live-migration to complete only if there will be no more than libvirt.live_migration_downtime milliseconds of downtime. To further reduce the impact of live-migration on the guest VM, Nova slowly ramps up the amount of allowed downtime to that maximum value. Nova uses the config options libvirt.live_migration_downtime_steps and libvirt.live_migration_downtime_delay to decide how long to take before reaching libvirt.live_migration_downtime milliseconds of allowed VM downtime. Currently these values must be carefully tuned to match libvirt.live_migration_completion_timeout, so that Nova does not spend all of the time ramping up and leave too little time for the VM to move before the completion timeout expires. If we allow operators to specify their own timeout value per live-migration operation, we must reconcile it with the logic that ramps up the allowed downtime before the live-migration is allowed to complete.

Use Cases

  • Operators want to patch a host and need to move all the VMs off that host. In this case they want to force a VM to move when the timeout is reached, because they find the risk of possibly needing to reboot the VM less acceptable than pausing the VM to make it move.

  • Operators want to move a busy VM off a host to balance out their cluster. In this case they want the flexibility to kick off a live-migration operation with an option to cancel it when the timer expires.

Proposed change

Add a new microversion to the Live-Migrate Server API that adds support for the following two optional parameters:

  • timeout_seconds - Optional parameter specifying the time, in seconds, after which Nova will act on the given live-migration operation. This overrides the config option libvirt.live_migration_completion_timeout. Note that, unlike the configuration option, this is an absolute timeout, not one scaled up to match the size of the VM.

  • on_timeout - Optional parameter that can be set to force_complete or abort. This overrides the config option libvirt.live_migration_action_on_timeout, which defaults to abort.
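The override rules for these two parameters can be sketched as follows. This is an illustration only, not Nova code: `resolve_timeout_policy` and its arguments are hypothetical names, and multiplying the configured timeout by the amount of data to move is a simplifying assumption about how the config value is scaled to the size of the VM.

```python
def resolve_timeout_policy(request_timeout, request_action,
                           conf_timeout_per_gib, conf_action, data_gib):
    """Return (timeout_seconds, action) for one live-migration.

    Hypothetical helper: values supplied on the API request take
    precedence over the configured defaults.
    """
    if request_timeout is not None:
        # Absolute timeout from the API call; not scaled by VM size.
        timeout = request_timeout
    else:
        # Assumed scaling: the configured value grows with the data to move.
        timeout = conf_timeout_per_gib * data_gib

    # Fall back to the configured action when the request omits on_timeout.
    action = request_action if request_action is not None else conf_action
    return timeout, action
```

The `is not None` checks matter here: a request value of 0 is still an explicit request value and must not silently fall back to the configuration.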

To help upgrades, we return 400 for any request that contains either of the new timeout parameters until all compute nodes have been upgraded to report at least the service version in which this feature was added.
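A minimal sketch of that upgrade guard is shown below. The names `BadRequest`, `check_timeout_params_supported`, `min_compute_version` and `required_version` are all hypothetical; the real service version number and the way Nova looks up the minimum compute version are not specified here.

```python
NEW_PARAMS = ("timeout_seconds", "on_timeout")


class BadRequest(Exception):
    """Stand-in for the API layer's HTTP 400 error (hypothetical)."""
    status_code = 400


def check_timeout_params_supported(body, min_compute_version,
                                   required_version):
    """Reject the new parameters while any compute node is too old.

    min_compute_version: lowest service version reported by any compute.
    required_version: version in which the feature was added (assumed
    to be supplied by the caller; the actual number is not known here).
    """
    params = body.get("os-migrateLive", {})
    if not any(name in params for name in NEW_PARAMS):
        # Requests that do not use the new parameters are unaffected.
        return
    if min_compute_version < required_version:
        raise BadRequest(
            "timeout_seconds and on_timeout require all compute "
            "nodes to be upgraded")
```

Requests that omit both parameters pass through unchanged, so existing callers keep working during a rolling upgrade.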

To address the issue with ramp-up time, we propose to spend half of the specified completion timeout ramping up to the maximum downtime as normal. After that, we jump straight to libvirt.live_migration_downtime. This ensures the VM spends half of the specified timeout with the best chance of letting the live-migration complete without having to abort or force-complete.
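One way to picture this rule is as a transformation of the normal ramp-up schedule. The sketch below assumes the schedule is a list of (elapsed_seconds, allowed_downtime_ms) pairs in increasing time order; that representation and the function name `cap_ramp_at_half_timeout` are assumptions made for illustration, not the actual libvirt driver code.

```python
def cap_ramp_at_half_timeout(normal_steps, timeout_seconds, max_downtime_ms):
    """Keep the normal ramp for the first half of the timeout, then jump.

    normal_steps: list of (elapsed_seconds, allowed_downtime_ms) pairs
    produced by the usual ramp-up logic (assumed format).
    """
    half = timeout_seconds / 2

    # Steps scheduled before the halfway point are applied unchanged.
    capped = [(at, ms) for at, ms in normal_steps if at < half]

    # If the normal ramp has not reached the maximum by the halfway
    # point, jump straight to the maximum allowed downtime there.
    if not capped or capped[-1][1] < max_downtime_ms:
        capped.append((half, max_downtime_ms))
    return capped
```

With a 300-second timeout, any step the normal ramp would have applied at or after 150 seconds is replaced by a single jump to the maximum at 150 seconds, leaving the second half of the timeout for the migration to converge.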

Alternatives

Operators can call either the delete migration API to abort a running live-migration, or force-complete to trigger post-copy or pause the VM being live-migrated. However, this is far from convenient and can lead to races, with the timeout firing just before force-complete is called.

There are many other ways we could modify the downtime ramp-up logic. Given the ongoing discussions on reworking that logic, we do only the minimum needed to ensure libvirt.live_migration_downtime is reached before we hit the timeout specified by the operator.

Data model impact

The Migration object takes two new parameters for the live-migrate API:

  • timeout_seconds - integer attribute.

  • on_timeout - enum of (["force_complete", "abort"]).

REST API impact

  • URL: POST /v2.1/servers/{server_id}/action

    JSON request body:

    {
        "os-migrateLive": {
            "host": "target-host",
            "block_migration": "auto",
            "timeout_seconds": 60,
            "on_timeout": "force_complete"
        }
    }
    

A new microversion will be introduced for the os-migrateLive API, which will accept two additional optional parameters, timeout_seconds and on_timeout.
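As a rough client-side illustration, the request body above could be assembled as follows. `build_live_migrate_body` is a hypothetical helper, not part of python-novaclient, and the microversion header is left out because the microversion number has not been assigned here.

```python
def build_live_migrate_body(host, block_migration="auto",
                            timeout_seconds=None, on_timeout=None):
    """Build the os-migrateLive action body (hypothetical helper).

    The two new parameters are optional, so they are only included
    when the caller actually supplies them.
    """
    action = {"host": host, "block_migration": block_migration}
    if timeout_seconds is not None:
        action["timeout_seconds"] = timeout_seconds
    if on_timeout is not None:
        action["on_timeout"] = on_timeout
    return {"os-migrateLive": action}
```

Calling `build_live_migrate_body("target-host", timeout_seconds=60, on_timeout="force_complete")` yields the same structure as the JSON request body shown above, while omitting both arguments produces a body that matches the existing API.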

  • JSON schema for timeout_seconds:

    {
        "timeout_seconds": {
            "type": "integer",
            "minimum": 0
        }
    }
    
  • JSON schema for on_timeout:

    {
        "on_timeout": {
            "type": "string",
            "enum": [ "force_complete", "abort" ]
        }
    }
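The two schema fragments above translate into the following checks. This is a hand-written sketch that mirrors the constraints using only the standard library; it is not how the API layer validates requests, and `validate_timeout_params` is a hypothetical name.

```python
ALLOWED_ACTIONS = ("force_complete", "abort")


def validate_timeout_params(params):
    """Return a list of error strings; an empty list means valid.

    Both parameters are optional, so absent keys are not errors.
    """
    errors = []
    if "timeout_seconds" in params:
        value = params["timeout_seconds"]
        # bool is a subclass of int in Python but is not a JSON integer.
        if isinstance(value, bool) or not isinstance(value, int):
            errors.append("timeout_seconds must be an integer")
        elif value < 0:
            errors.append("timeout_seconds must be >= 0")
    if "on_timeout" in params:
        value = params["on_timeout"]
        if not isinstance(value, str) or value not in ALLOWED_ACTIONS:
            errors.append("on_timeout must be one of: "
                          + ", ".join(ALLOWED_ACTIONS))
    return errors
```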
    

Security impact

None

Notifications impact

None

Other end user impact

Add support for the new API parameters in python-novaclient.

Performance Impact

None

Other deployer impact

None

Developer impact

None

Implementation

Assignee(s)

Primary assignee:

Sarafraj Singh (raj_singh)

Other contributors:

OSIC

Work Items

  • Add logic in libvirt to make use of these new parameters.

  • Add the API to expose the per-operation timeout and timeout action.

Dependencies

We first need the configuration added for the default timeout action: https://blueprints.launchpad.net/nova/+spec/live-migration-force-after-timeout

Testing

Need new tempest tests for the new API.

Look into busy workloads inside VMs to test the above API in the gate’s live-migration job.

Documentation Impact

Need to update api-ref with details of the new API.

Should also update the API concept guide to cover how best to use live-migration with all these new APIs we have added.

References

None

History

Revisions

Release Name    Description
Pike            Introduced