NUMA-aware live migration¶
https://blueprints.launchpad.net/nova/+spec/numa-aware-live-migration
When an instance with NUMA characteristics is live-migrated, those characteristics are not recalculated on the destination compute host. In the CPU pinning case, using the source host’s pin mappings on the destination can lead to multiple instances being pinned to the same pCPUs. In the case of hugepage-backed instances, which are NUMA-localized, an instance needs to have its NUMA mapping recalculated on the destination compute host during a live migration.
Problem description¶
Note
In the following paragraphs the term NUMA is incorrectly used to signify any guest characteristic that is expressed in the InstanceNUMATopology object, for example CPU pinning and hugepages. CPU pinning can be achieved without a guest NUMA topology, but the two concepts are unfortunately tightly coupled in Nova and instance pinning is not possible without an instance NUMA topology. For this reason, NUMA is used as a catchall term.
Note
This spec concentrates on the libvirt driver. Any higher level code (compute manager, conductor) will be as driver agnostic as possible.
The problem can best be described with three examples.
The first example is live migration with CPU pinning. An instance with a hw:cpu_policy=dedicated extra spec and pinned CPUs is live-migrated. Its pin mappings are naively copied over to the destination host. This creates two problems. First, its pinned pCPUs aren’t properly claimed on the destination. This means that, should a second instance with pinned CPUs land on the destination, both instances’ vCPUs could be pinned to the same pCPUs. Second, any existing pin mappings on the destination are ignored. If another instance already exists on the destination, both instances’ vCPUs could be pinned to the same pCPUs. In both cases, the dedicated CPU policy is violated, potentially leading to unpredictable performance degradation.
The second example is instances with hugepages. There are two hosts, each with two NUMA nodes and 8 1GB hugepages per node. Two identical instances are booted on the two hosts. Their virtual NUMA topology is one virtual NUMA node and 8 1GB memory pages. They land on their respective host’s NUMA node 0, consuming all 8 of its pages. One instance is live-migrated to the other host. The libvirt driver enforces strict NUMA affinity and does not regenerate the instance XML. Both instances end up on the host’s NUMA node 0, and the live-migrated instance fails to run.
The third example is an instance with a virtual NUMA topology (but without hugepages). If an instance affined to its host’s NUMA node 2 is live-migrated to a host with only two NUMA nodes, and thus without a NUMA node 2, it will fail to run.
The first two of these examples are known bugs [1] [2].
Use Cases¶
As a cloud administrator, I want to live migrate instances with CPU pinning without the pin mappings overlapping on the destination compute host.
As a cloud administrator, I want live migration of hugepage-backed instances to work and for the instances to successfully run on the destination compute host.
As a cloud administrator, I want live migration of instances with an explicit NUMA topology to work and for the instances to successfully run on the destination compute host.
Proposed change¶
There are five aspects to supporting NUMA live migration. First, the instance’s NUMA characteristics need to be recalculated to fit on the new host. Second, the resources that the instance will consume on the new host need to be claimed. Third, information about the instance’s new NUMA characteristics needs to be generated on the destination (an InstanceNUMATopology object is not enough; more on that later). Fourth, this information needs to be sent from the destination to the source, in order for the source to generate the correct XML for the instance to be able to run on the destination. Finally, the instance’s resource claims need to “converge” to reflect the success or failure of the live migration. If the live migration succeeded, the usage on the source needs to be released. If it failed, the claim on the destination needs to be rolled back.
Resource claims¶
Let’s address the resource claims aspect first. An effort has begun to support NUMA resource providers in placement [3] and to standardize CPU resource tracking [4]. However, placement can only track inventories and allocations of quantities of resources. It does not track which specific resources are used. Specificity is needed for NUMA live migration. Consider an instance that uses 4 dedicated CPUs in a future where the standard CPU resource tracking spec [4] has been implemented. During live migration, the scheduler claims those 4 CPUs in placement on the destination. However, we need to prevent other instances from using those specific CPUs. Therefore, in addition to claiming quantities of CPUs in placement, we need to claim specific CPUs on the compute host. The compute resource tracker already exists for exactly this purpose, and it will continue to be used to claim specific resources on the destination, even in a NUMA-enabled placement future.
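To make the distinction concrete, here is a purely illustrative sketch (none of the names or structures below are real Nova or placement APIs): placement can only record that four dedicated CPUs are consumed against a provider, while the compute host’s resource tracker must additionally record which pCPUs back them:

    # Illustrative data only; neither structure is a real Nova or placement API.

    # Placement's view, once the standard CPU resource tracking spec [4] lands:
    # a quantity of a resource class consumed from a resource provider.
    placement_allocation = {
        'resource_provider': 'dest-compute-numa0',  # hypothetical provider name
        'resources': {'PCPU': 4},                   # "four dedicated CPUs"
    }

    # The destination compute host's view: the specific pCPUs backing those
    # four dedicated CPUs, so that no other pinned instance can reuse them.
    resource_tracker_claim = {
        'instance_uuid': '11111111-2222-3333-4444-555555555555',
        'pinned_pcpus': {2, 3, 10, 11},             # specific host CPU ids
    }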
There is a time window between the scheduler picking a destination for the live migration and the actual live migration RPC conversation between the two compute hosts. Another instance could land on the destination during that time window, using up NUMA resources that the scheduler thought were free. This race leads to the resource claim failing on the destination. This spec proposes to handle this claim failure using the existing MigrationPreCheckError exception mechanism, causing the scheduler to pick a new host.
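A minimal sketch of how the destination pre-check could surface a lost race, assuming a hypothetical live_migration_claim helper on the resource tracker (the real entry point is an implementation detail); the exception classes themselves already exist in Nova:

    from nova import exception


    def _claim_numa_resources_on_dest(resource_tracker, context, instance,
                                      nodename, migration, limits):
        """Claim the instance's NUMA resources on the destination.

        If another instance consumed the resources between scheduling and
        this pre-check, translate the failure into MigrationPreCheckError
        so that the scheduler can pick a new host.
        """
        try:
            # Hypothetical resource tracker call; stands in for whatever
            # claim mechanism the implementation ends up using.
            return resource_tracker.live_migration_claim(
                context, instance, nodename, migration, limits)
        except exception.ComputeResourcesUnavailable as e:
            raise exception.MigrationPreCheckError(
                reason='Claiming NUMA resources on the destination '
                       'failed: %s' % e)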
Fitting to the new host¶
An advantage of using the resource tracker is that it forces us to use a
MoveClaim
, thus giving us the instance new NUMA topology for free
(Claim._test_numa_topology
in nova/compute/claims.py
).
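Roughly speaking, the claim’s NUMA test boils down to a call to hardware.numa_fit_instance_to_host; the simplified sketch below (which ignores limits and PCI handling) shows the shape of that recalculation:

    from nova.virt import hardware


    def _fit_instance_numa_to_dest(host_numa_topology, instance_numa_topology):
        """Recompute the instance NUMA topology against the destination host.

        Returns a new InstanceNUMATopology whose NUMA node assignments and
        pin mappings are valid for the destination, or None if the instance
        does not fit, which the claim reports as a failure.
        """
        if instance_numa_topology is None:
            # Nothing to recalculate for instances without a NUMA topology.
            return None
        return hardware.numa_fit_instance_to_host(
            host_numa_topology, instance_numa_topology)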
Generating the new NUMA information on the destination¶
However, having the new instance NUMA topology in the claim isn’t enough for the source to generate the new XML. The simplest way to generate the new XML from the new instance NUMA topology would be to call the libvirt driver’s _get_guest_numa_config method (which handily accepts an instance_numa_topology as an argument). However, this needs to be done on the destination, as it depends on the host NUMA topology. _get_guest_numa_config returns a tuple of LibvirtConfigObject. The information contained therein needs to somehow be sent to the source over the wire.
The naive way would be to send the objects directly, or perhaps to call to_xml and send the resulting XML blob of text. This would be unversioned, and there would be no schema. This could cause problems in the case of, for example, a newer libvirt driver, which has dropped support for a particular element or attribute, talking to an older libvirt driver, which still supports it.
Because of this, and sticking to the existing OpenStack best practice of sending oslo versionedobjects over the wire, this spec proposes to encode the necessary NUMA-related information as Nova versioned objects. These new objects should be as virt driver independent as reasonably possible, but as the use case is still libvirt talking to libvirt, abstraction for the sake of abstraction is not appropriate either.
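As an illustration only (the class name, fields and encoding below are hypothetical and deliberately restricted to existing oslo versionedobjects field types, not a final design), such an object could be sketched as:

    from nova.objects import base
    from nova.objects import fields


    @base.NovaObjectRegistry.register
    class LiveMigrateNUMAInfo(base.NovaObject):
        """Hypothetical object carrying the destination's NUMA decisions.

        The implementation is free to choose different names, field types
        and a different split across objects.
        """
        VERSION = '1.0'

        fields = {
            # Host NUMA node chosen for each guest NUMA cell, indexed by
            # guest cell id.
            'cell_pins': fields.ListOfIntegersField(),
            # Destination pCPU chosen for each guest vCPU, indexed by vCPU
            # id (simplified to 1:1 pinning for the sketch).
            'cpu_pins': fields.ListOfIntegersField(),
            # Destination pCPUs to pin the emulator threads to, if any.
            'emulator_pins': fields.ListOfIntegersField(),
        }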
Sending the new NUMA Nova objects¶
Once the superconductor has chosen and/or validated the destination host, the relevant parts of the current live migration flow can be summarized by the following oversimplified pseudo sequence diagram:
+-----------+ +---------+ +-------------+ +---------+
| Conductor | | Source | | Destination | | Driver |
+-----------+ +---------+ +-------------+ +---------+
| | | |
| check_can_live_migrate_destination() | | |
|-------------------------------------------------------------------------->| |
| | | |
| | check_can_live_migrate_source() | |
| |<-----------------------------------| |
| | | |
| | migrate_data | |
| |----------------------------------->| |
| | | |
| | migrate_data | |
|<--------------------------------------------------------------------------| |
| | | |
| live_migration(migrate_data) | | |
|------------------------------------->| | |
| | | |
| | pre_live_migration(migrate_data) | |
| |----------------------------------->| |
| | | |
| | migrate_data | |
| |<-----------------------------------| |
| | | |
| | live_migration(migrate_data) | |
| |------------------------------------------------->|
| | | |
In the proposed new flow, the destination compute manager asks the libvirt driver to calculate the new LibvirtConfigGuest objects using the new instance NUMA topology obtained from the move claim. The compute manager converts those LibvirtConfigGuest objects to the new NUMA Nova objects, and adds them as fields to the LibvirtLiveMigrateData migrate_data object. The latter eventually reaches the source libvirt driver, which uses it to generate the new XML. The proposed flow is summarized in the following diagram:
+-----------+ +---------+ +-------------+ +---------+
| Conductor | | Source | | Destination | | Driver |
+-----------+ +---------+ +-------------+ +---------+
| | | |
| check_can_live_migrate_destination() | | |
|------------------------------------------------------------------------------------------->| |
| | | |
| | check_can_live_migrate_source() | |
| |<----------------------------------| |
| | | |
| | migrate_data | |
| |---------------------------------->| |
| | | +-----------------------------------+ |
| | |-| Obtain new_instance_numa_topology | |
| | | | from claim | |
| | | +-----------------------------------+ |
| | | |
| | | _get_guest_numa_config(new_instance_numa_topology) |
| | | ---------------------------------------------------->|
| | | |
| | | LibvirtConfigGuest objects |
| | |<-----------------------------------------------------|
| | | |
| | | +----------------------------------+ |
| | |-| Build new NUMA Nova objects from | |
| | | | LibvirtConfigGuest objects | |
| | | | and add to migrate_data | |
| | | +----------------------------------+ |
| | | |
| migrate_data + new NUMA Nova objects | |
|<-------------------------------------------------------------------------------------------| |
| | | |
| live_migration(migrate_data + new NUMA Nova objects) | | |
|------------------------------------------------------->| | |
| | | |
| | pre_live_migration() | |
| |---------------------------------->| |
| |<----------------------------------| |
| | | |
| | live_migration(migrate_data + new NUMA Nova objects) |
| |----------------------------------------------------------------------------------------->|
| | | |
| | | +-----------------------------------+ |
| | | | generate NUMA XML for destination |-|
| | | +-----------------------------------+ |
| | | |
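In code, the destination-side step in the diagram above could look roughly like the following; claimed_numa_topology, get_live_migrate_numa_info and dst_numa_info are all placeholder names, and only the overall shape matches the proposed flow:

    def _add_numa_info_to_migrate_data(driver, migrate_data, claim, instance,
                                       image_meta, flavor):
        """On the destination: attach the new NUMA info to migrate_data.

        Turns the claim's recalculated InstanceNUMATopology into the new
        NUMA Nova objects and adds them to the LibvirtLiveMigrateData that
        will be sent back to the source.
        """
        new_numa_topology = claim.claimed_numa_topology  # placeholder attribute
        if new_numa_topology is None:
            # Nothing to do for instances without a NUMA topology.
            return migrate_data

        # Placeholder driver helper wrapping _get_guest_numa_config() and the
        # conversion of its LibvirtConfigObject results into versioned objects.
        migrate_data.dst_numa_info = driver.get_live_migrate_numa_info(
            new_numa_topology, flavor, image_meta)
        return migrate_data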
Claim convergence¶
The claim object is a context manager, so it can in theory clean itself up if
any code within its context raises an unhandled exception. However, live
migration involves RPC casts between the compute hosts, making it impractical
to use the claim as a context manager. For that reason, if the live migration
fails, drop_move_claim
needs to be called manually during the rollback to
drop the claim from the destination. Whether to do this on the source in
rollback_live_migration
or in rollback_live_migration_at_destination
is
left as an implementation detail.
Similarly, if the live migration succeeds, drop_move_claim needs to be called to drop the claim from the source, similar to how _confirm_resize does it in the compute manager. Whether to do this in post_live_migration on the source or in post_live_migration_at_destination is left as an implementation detail.
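A minimal sketch of that convergence, using the resource tracker’s existing drop_move_claim and leaving the exact call sites (the rollback and post-migration methods named above) open:

    def _converge_numa_claims(resource_tracker, context, instance,
                              source_node, dest_node, migration_succeeded):
        """Release whichever side of the move claim is no longer needed."""
        if migration_succeeded:
            # The instance now runs on the destination; drop the usage that
            # is still accounted to it on the source node.
            resource_tracker.drop_move_claim(
                context, instance, source_node, instance_type=instance.flavor)
        else:
            # The migration failed or was aborted; roll back the claim that
            # was made on the destination during the pre-checks.
            resource_tracker.drop_move_claim(
                context, instance, dest_node, instance_type=instance.flavor)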
Alternatives¶
Using move claims, and the new instance NUMA topology calculated as part of them, essentially dictates the rest of the implementation.
When the superconductor calls the scheduler’s select_destinations method, that call eventually ends up calling numa_fit_instance_to_host (select_destinations -> _schedule -> _consume_selected_host -> consume_from_request -> _locked_consume_from_request -> numa_fit_instance_to_host). It would be conceivable to reuse that result. However, the claim would still calculate its own new instance NUMA topology.
Data model impact¶
New versioned objects are created to transmit cell, CPU, emulator thread, and hugepage nodeset mappings from the destination to the source. These objects are added to LibvirtLiveMigrateData.
REST API impact¶
None.
Security impact¶
None.
Notifications impact¶
None.
Other end user impact¶
None.
Performance Impact¶
None.
Other deployer impact¶
None.
Developer impact¶
None.
Upgrade impact¶
In the case of a mixed N/N+1 cloud, the possibilities for the exchange of information between the destination and the source are summarized in the following table. In it, no indicates that the new code is not present, old path indicates that the new code is present but chooses to execute the old code for backwards compatibility, and yes indicates that the new functionality is used.
|            | Old dest | New dest |
|------------|----------|----------|
| Old source | no       | old path |
| New source | old path | yes      |
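The old path cases in the table reduce to the source checking whether the destination actually populated the new fields; a sketch, with dst_numa_info again a placeholder field name:

    def _can_use_numa_live_migration(migrate_data):
        """Decide on the source whether the new NUMA-aware path applies.

        An old destination never sets the new fields on migrate_data, so
        their absence means falling back to the old behaviour of reusing
        the existing guest XML unchanged.
        """
        # On versioned objects, `in` returns True only if the field exists
        # and has been set.
        return 'dst_numa_info' in migrate_data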
Implementation¶
Assignee(s)¶
- Primary assignee:
notartom
Work Items¶
Fail live migration of instances with NUMA topology [5] until this spec is fully implemented.
Add NUMA Nova objects
Add claim context to live migration
Calculate new NUMA topology on the destination and send it to the source
Source updates instance XML according to new NUMA topology calculated by the destination
Dependencies¶
None.
Testing¶
The libvirt/qemu driver used in the gate does not currently support NUMA features (though work is in progress [6]). Therefore, testing NUMA-aware live migration in the upstream gate would require nested virt. In addition, the only assertable outcome of a NUMA live migration test (if it ever becomes possible) would be that the live migration succeeded. Examining the instance XML to assert things about its NUMA affinity or CPU pin mapping is explicitly out of tempest’s scope. For these reasons, NUMA-aware live migration is best tested in third party CI [7] or other downstream test scenarios [8].
Documentation Impact¶
Current live migration documentation does not mention the NUMA limitations anywhere. Therefore, a release note explaining the new NUMA capabilities of live migration should be enough.
References¶
History¶
| Release Name | Description |
|--------------|-------------|
| Rocky        | Introduced |
| Stein        | Re-proposed with modifications pertaining to claims and the exchange of information between destination and source. |
| Train        | Re-proposed with no modifications. |