NUMA Topology with Resource Providers¶
https://blueprints.launchpad.net/nova/+spec/numa-topology-with-rps
Now that Nested Resource Providers are supported by both the Placement API and Nova compute nodes, we can use the Resource Provider tree to express the relationship between a root Resource Provider (root RP), i.e. a compute node, and one or more Non-Uniform Memory Access (NUMA) nodes (aka cells), each of which has its own resources, like memory or PCI devices.
Note
This spec only targets modeling resource capabilities for NUMA nodes in a general and quite abstract manner. We won't address in this spec how we should model NUMA-affinitized hardware like PCI devices or GPUs; those relationships will be discussed in a later spec.
Problem description¶
The NUMATopologyFilter checks a number of resources, including emulator thread policies, CPU-pinned instances and memory page sizes. Additionally, it performs two different verifications:
whether some host can fit the query because it has enough capacity
which resource(s) should be used for this query (eg. which pCPUs or NUMA node)
With NUMA topologies modeled as Placement resources, those two questions could largely be answered by the Placement service returning potential allocation candidates, leaving the filter responsible only for choosing between those candidates in some very specific cases (e.g. PCI device NUMA affinity, CPU pinning and NUMA anti-affinity).
Accordingly, we could model the host memory and the CPU topologies as a set of resource providers arranged in a tree, and just directly allocate resources for a specific instance from a resource provider subtree representing a NUMA node and its resources.
That said, non-resource-related features (like choosing a specific CPU pin within a NUMA node for a vCPU) would still be handled only by the virt driver, and are not covered by this spec.
Use Cases¶
Consider the following NUMA topology for a “2-NUMA nodes, 4 cores” host with no Hyper-Threading:
+--------------------------------------+
|                 CN1                  |
+-+---------------+--+---------------+-+
  |     NUMA1     |  |     NUMA2     |
  +-+----+-+----+-+  +-+----+-+----+-+
    |CPU1| |CPU2|      |CPU3| |CPU4|
    +----+ +----+      +----+ +----+
Here, CPU1 and CPU2 would share the same memory through a common memory controller, while CPU3 and CPU4 would share their own memory.
Ideally, applications that require low-latency memory access from multiple vCPUs on the same instance (for parallel computing reasons) want to ensure that those CPU resources are provided by the same NUMA node, or some performance penalties will occur (if the application is CPU-bound or I/O-bound, of course). For the moment, if you're an operator, you can use flavor extra specs to request a desired guest NUMA topology for your instance, like:
$ openstack flavor set FLAVOR-NAME \
--property hw:numa_nodes=FLAVOR-NODES \
--property hw:numa_cpus.N=FLAVOR-CORES \
--property hw:numa_mem.N=FLAVOR-MEMORY
See all the NUMA possible extra specs for a flavor.
Note
Of course, the numbered extra specs above (hw:numa_cpus.N and hw:numa_mem.N) are only needed when you do not want to evenly divide your virtual CPUs and memory between NUMA nodes.
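For illustration, here is a hypothetical flavor (the name and values are purely examples) asking for 2 guest NUMA nodes with an uneven split of its 8 vCPUs and 8192MB of RAM:
$ openstack flavor set numa-example \
    --property hw:numa_nodes=2 \
    --property hw:numa_cpus.0=0,1 \
    --property hw:numa_mem.0=2048 \
    --property hw:numa_cpus.1=2,3,4,5,6,7 \
    --property hw:numa_mem.1=6144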
Proposed change¶
Given there are a lot of NUMA concerns, let's take an iterative approach to the model we agree on.
NUMA nodes being nested Resource Providers¶
Since virt drivers can amend the provider tree given to them by the compute node ResourceTracker, the libvirt driver could create a child provider for each of the two sockets, each representing a separate NUMA node.
Since CPU resources are tied to a specific NUMA node, it makes sense to model the corresponding resource classes as part of the child NUMA Resource Providers. In order to facilitate querying NUMA resources, we propose to decorate each NUMA child resource provider with a specific trait named HW_NUMA_ROOT. That would help to know which hosts are NUMA-aware and which are not.
Memory is a bit tougher to represent. Giving each NUMA node an amount of attached memory is a first approach, but it misses the point that the smallest allocatable unit you can assign with Nova is really a page size. Accordingly, we should rather model our NUMA subtree with child Resource Providers that represent the smallest unit of memory you can allocate, i.e. a page size. Since a page size is not a consumable amount but rather qualitative information that helps us to allocate MEMORY_MB resources, we propose three traits:
MEMORY_PAGE_SIZE_SMALL and MEMORY_PAGE_SIZE_LARGE would allow us to know whether the memory page size is the default or optionally configured.
CUSTOM_MEMORY_PAGE_SIZE_<X>, where <X> is an integer, would allow us to know the size of the page in KB. To make it clear, even if the trait is a custom one, it's important to have a naming convention for it so the scheduler can ask about page sizes without knowing all the traits.
              +-------------------------------+
              |           <CN_NAME>           |
              |           DISK_GB: 5          |
              +-------------------------------+
              |     (no specific traits)      |
              +--+---------------------------++
                 |                           |
                 |                           |
  +-------------------------+   +--------------------------+
  |      <NUMA_NODE_0>      |   |      <NUMA_NODE_1>       |
  |      VCPU: 8            |   |      VCPU: 8             |
  |      PCPU: 16           |   |      PCPU: 8             |
  +-------------------------+   +--------------------------+
  |      HW_NUMA_ROOT       |   |      HW_NUMA_ROOT        |
  +-------------------+-----+   +--------------------------+
       /              |     \                /+\
      +               |      \_____________________________   .......
      |               |                                    \
+-------------+-----------+  +-+--------------------------+  +-------------------------------+
| <RP_UUID>               |  | <RP_UUID>                  |  | <RP_UUID>                     |
| MEMORY_MB: 1024         |  | MEMORY_MB: 1024            |  | MEMORY_MB: 10240              |
| step_size=1             |  | step_size=2                |  | step_size=1024                |
+-------------------------+  +----------------------------+  +-------------------------------+
|MEMORY_PAGE_SIZE_SMALL   |  |MEMORY_PAGE_SIZE_LARGE      |  |MEMORY_PAGE_SIZE_LARGE         |
|CUSTOM_MEMORY_PAGE_SIZE_4|  |CUSTOM_MEMORY_PAGE_SIZE_2048|  |CUSTOM_MEMORY_PAGE_SIZE_1048576|
+-------------------------+  +----------------------------+  +-------------------------------+
Note
As we said above, we don't want to model child PCI device providers for Ussuri at the moment. Other existing child RPs of a root compute node, like the ones for VGPU resources or bandwidth resources, would still have the compute node as their parent.
NUMA RP¶
Resource Provider names for NUMA nodes shall follow a convention of <nodename>_NUMA<#>, where nodename is the hypervisor hostname (given by the virt driver) and NUMA<#> is literally the string 'NUMA' suffixed by the NUMA cell ID provided by the virt driver (e.g. compute1_NUMA0).
Each NUMA node would then be a child Resource Provider, having two resource classes:
VCPU: for telling how many virtual (non-pinnable) cores the NUMA node has.
PCPU: for telling how many possibly pinned cores the NUMA node has.
A specific trait should decorate it, as we explained: HW_NUMA_ROOT.
Memory pagesize RP¶
Each NUMA RP should have child RPs for each possible memory page size on the host, each having a single resource class:
MEMORY_MB: for telling how much memory the NUMA node has in that specific page size.
Each of these RPs would be decorated with two traits:
either MEMORY_PAGE_SIZE_SMALL (default if not configured) or MEMORY_PAGE_SIZE_LARGE (if large pages are configured)
the size of the page: CUSTOM_MEMORY_PAGE_SIZE_<#> (where <#> is the size in KB, defaulting to 4 as the kernel defaults to 4KB page sizes)
Compute node RP¶
The root Resource Provider (i.e. the compute node) would only provide resources for classes that are not NUMA-related. Existing child RPs for vGPUs or bandwidth-aware resources should still have this parent (until we discuss NUMA affinity for PCI devices).
Optionally configured NUMA resources¶
Given there are NUMA workloads but also non-NUMA workloads, it's also important for operators to be able to have compute nodes only accepting the latter. That said, having the compute node resources split between multiple NUMA nodes could be a problem for those non-NUMA workloads if operators want to keep the existing behaviour.
For example, say an instance asks for 2 vCPUs and a host has 2 NUMA nodes, each only able to accept one VCPU: the Placement API wouldn't return that host (given each nested RP only accepts one VCPU). For that reason, we need a configuration option for saying whether resources should be nested. To reinforce the above, that means a host would be either NUMA-aware or non-NUMA-aware, since on a NUMA-aware host a non-NUMA workload would end up being served by a specific NUMA node. The proposal we make here is:
[compute]
enable_numa_reporting_to_placement = <bool> (default None for Ussuri)
In what follows, hosts that have this option set to True will be called “NUMA-aware”. Hosts that have this option set to False are explicitly asked to keep the legacy behaviour and will be called “non-NUMA-aware”.
Depending on the value of the option, Placement would or would not return a host for the corresponding request. The resulting matrix is:
+----------------------------------------+----------+-----------+----------+
| ``enable_numa_reporting_to_placement`` | ``None`` | ``False`` | ``True`` |
+========================================+==========+===========+==========+
| NUMA-aware flavors | Yes | No | Yes |
+----------------------------------------+----------+-----------+----------+
| NUMA-agnostic flavors | Yes | Yes | No |
+----------------------------------------+----------+-----------+----------+
where Yes means that there could be allocation candidates from this host, while No means that no allocation candidates will be returned.
In order to distinguish compute nodes that have the False value from those with None, we will decorate the former with a specific trait named HW_NON_NUMA. Accordingly, we will query Placement with this trait as forbidden, so that we don't get nodes that operators explicitly don't want to support NUMA-aware flavors.
Note
By default, the value for that configuration option will be None for upgrade reasons. During the Ussuri timeframe, operators will have to decide which hosts should support NUMA-aware instances and which should be dedicated to ‘non-NUMA-aware’ instances. A nova-status pre-upgrade check command will be provided that will warn them to decide before upgrading to Victoria, since the default value may change later in this cycle. Once we stop supporting None (in Victoria or later), the HW_NON_NUMA trait would no longer be needed, so we could stop querying it.
Note
Since we allow a transition period to help operators decide, we will also make clear that this is a one-way change and that we won't provide backwards support for turning a NUMA-aware host back into a non-NUMA-aware host.
See the Upgrade impact section for further details.
Note
Since the discovery of a NUMA topology is made by the virt drivers, the population of those nested Resource Providers necessarily has to be done by each virt driver. Consequently, while the above configuration option is said to be generic, using it for populating the Resource Provider tree will only be done by the virt drivers. Of course, a shared module could be imagined for the sake of consistency between drivers, but this is an implementation detail.
The very simple case: I don’t care about a NUMA-aware instance¶
For flavors just asking for, say, vCPUs and memory without asking them to be NUMA-aware, we will make a single Placement call asking to not land them on a NUMA-aware host:
resources=VCPU:<X>,MEMORY_MB:<Y>
&required=!HW_NUMA_ROOT
In this case, even if NUMA-aware hosts have enough resources for this query,
the Placement API won’t provide them but only non-NUMA-aware ones (given the
forbidden HW_NUMA_ROOT
trait).
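For instance, a hypothetical flavor with 4 vCPUs, 8192MB of RAM and no NUMA-related extra specs would translate into:
resources=VCPU:4,MEMORY_MB:8192
&required=!HW_NUMA_ROOT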
We’re giving the possibility to the operator to shard their clouds between
NUMA-aware hosts and non-NUMA-aware hosts but that’s not really changing the
current behaviour as of now where operators create aggregates to make sure
non-NUMA-aware instances can’t land on NUMA-aware hosts.
See the Upgrade impact section for rolling upgrade situations where clouds are partially upgraded to Ussuri and where only a very few nodes are reshaped.
Asking for NUMA-aware vCPUs¶
As NUMA-aware hosts have a specific topology with memory being in a grandchild RP, we basically need to ensure we can translate the existing expressiveness of the flavor extra specs into a Placement allocation candidates query that asks for parenting between the NUMA RP containing the VCPU resources and the memory pagesize RP containing the MEMORY_MB resources.
Accordingly, here are some examples:
for a flavor of 8 VCPUs, 8GB of RAM and hw:numa_nodes=2:
resources_MEM1=MEMORY_MB:4096
&required_MEM1=MEMORY_PAGE_SIZE_SMALL
&resources_PROC1=VCPU:4
&required_NUMA1=HW_NUMA_ROOT
&same_subtree=_MEM1,_PROC1,_NUMA1
&resources_MEM2=MEMORY_MB:4096
&required_MEM2=MEMORY_PAGE_SIZE_SMALL
&resources_PROC2=VCPU:4
&required_NUMA2=HW_NUMA_ROOT
&same_subtree=_MEM2,_PROC2,_NUMA2
&group_policy=none
Note
We use none as the value for group_policy, which means that in this example an allocation candidate can resolve both PROC groups to the same NUMA node, defeating the purpose of having the resources separated into different NUMA nodes (which is the purpose of hw:numa_nodes=2). This is OK as we will also modify the NUMATopologyFilter to only accept allocation candidates for a host that are in different NUMA nodes. It will probably be implemented in the nova.virt.hardware module, but that's an implementation detail.
for a flavor of 8 VCPUs, 8GB of RAM and hw:numa_nodes=1:
resources_MEM1=MEMORY_MB:8192
&required_MEM1=MEMORY_PAGE_SIZE_SMALL
&resources_PROC1=VCPU:8
&required_NUMA1=HW_NUMA_ROOT
&same_subtree=_MEM1,_PROC1,_NUMA1
for a flavor of 8 VCPUs, 8GB of RAM and hw:numa_nodes=2&hw:numa_cpus.0=0,1&hw:numa_cpus.1=2,3,4,5,6,7:
resources_MEM1=MEMORY_MB:4096
&required_MEM1=MEMORY_PAGE_SIZE_SMALL
&resources_PROC1=VCPU:2
&required_NUMA1=HW_NUMA_ROOT
&same_subtree=_MEM1,_PROC1,_NUMA1
&resources_MEM2=MEMORY_MB:4096
&required_MEM2=MEMORY_PAGE_SIZE_SMALL
&resources_PROC2=VCPU:6
&required_NUMA2=HW_NUMA_ROOT
&same_subtree=_MEM2,_PROC2,_NUMA2
&group_policy=none
for a flavor of 8 VCPUs, 8GB of RAM and hw:numa_nodes=2&hw:numa_cpus.0=0,1&hw:numa_mem.0=1024&hw:numa_cpus.1=2,3,4,5,6,7&hw:numa_mem.1=7168:
resources_MEM1=MEMORY_MB:1024
&required_MEM1=MEMORY_PAGE_SIZE_SMALL
&resources_PROC1=VCPU:2
&required_NUMA1=HW_NUMA_ROOT
&same_subtree=_MEM1,_PROC1,_NUMA1
&resources_MEM2=MEMORY_MB:7168
&required_MEM2=MEMORY_PAGE_SIZE_SMALL
&resources_PROC2=VCPU:6
&required_NUMA2=HW_NUMA_ROOT
&same_subtree=_MEM2,_PROC2,_NUMA2
&group_policy=none
As you can understand, the VCPU and MEMORY_MB values are the result of dividing, respectively, the flavor's vCPUs and memory by the value of hw:numa_nodes (which is actually already calculated and provided as NUMATopology object information in the RequestSpec object).
Note
The translation mechanism from a flavor-based request into Placement query will be handled by the scheduler service.
Note
Since memory is provided by a grandchild RP, we always need to ask for MEMORY_PAGE_SIZE_SMALL, which is the default.
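To illustrate the translation mechanism, here is a hypothetical helper (names and structure are purely illustrative; the real code would live in the scheduler and would reuse the NUMATopology information already present in the RequestSpec) that splits the flavor resources evenly across hw:numa_nodes and builds the numbered request groups shown above:

# Illustrative sketch only: build the Placement query parameters for an
# evenly-divided NUMA-aware flavor, as in the examples above.
def numa_flavor_to_placement_params(vcpus, memory_mb, numa_nodes):
    params = []
    for i in range(1, numa_nodes + 1):
        params.extend([
            ('resources_MEM%d' % i, 'MEMORY_MB:%d' % (memory_mb // numa_nodes)),
            ('required_MEM%d' % i, 'MEMORY_PAGE_SIZE_SMALL'),
            ('resources_PROC%d' % i, 'VCPU:%d' % (vcpus // numa_nodes)),
            ('required_NUMA%d' % i, 'HW_NUMA_ROOT'),
            ('same_subtree', '_MEM%d,_PROC%d,_NUMA%d' % (i, i, i)),
        ])
    params.append(('group_policy', 'none'))
    return params

# For 8 vCPUs, 8192MB and hw:numa_nodes=2, this produces the same query as
# the first example above.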
Asking for specific memory page sizes¶
Operators defining a flavor of 2 vCPUs, 4GB of RAM and
hw:mem_page_size=2MB,hw:numa_nodes=2
will see that the Placement query will
become:
resources_PROC1=VCPU:1
&resources_MEM1=MEMORY_MB:2048
&required_MEM1=CUSTOM_MEMORY_PAGE_SIZE_2048
&required_NUMA1=HW_NUMA_ROOT
&same_subtree=_PROC1,_MEM1,_NUMA1
&resources_PROC2=VCPU:1
&resources_MEM2=MEMORY_MB:2048
&required_MEM2=CUSTOM_MEMORY_PAGE_SIZE_2048
&required_NUMA2=HW_NUMA_ROOT
&same_subtree=_PROC2,_MEM2,_NUMA2
&group_policy=none
If you only want large page size support without specifying which size (e.g. by specifying hw:mem_page_size=large instead of, say, 2MB), then the same request above would translate into:
resources_PROC1=VCPU:1
&resources_MEM1=MEMORY_MB:2048
&required_MEM1=MEMORY_PAGE_SIZE_LARGE
&required_NUMA1=HW_NUMA_ROOT
&same_subtree=_PROC1,_MEM1,_NUMA1
&resources_PROC2=VCPU:1
&resources_MEM2=MEMORY_MB:2048
&required_MEM2=MEMORY_PAGE_SIZE_LARGE
&required_NUMA2=HW_NUMA_ROOT
&same_subtree=_PROC2,_MEM2,_NUMA2
&group_policy=none
Asking the same with hw:mem_page_size=small
would translate into:
resources_PROC1=VCPU:1
&resources_MEM1=MEMORY_MB:2048
&required_MEM1=MEMORY_PAGE_SIZE_SMALL
&required_NUMA1=HW_NUMA_ROOT
&same_subtree=_PROC1,_MEM1,_NUMA1
&resources_PROC2=VCPU:1
&resources_MEM2=MEMORY_MB:2048
&required_MEM2=MEMORY_PAGE_SIZE_SMALL
&required_NUMA2=HW_NUMA_ROOT
&same_subtree=_PROC2,_MEM2,_NUMA2
&group_policy=none
Finally, asking with hw:mem_page_size=any would mean:
resources_PROC1=VCPU:1
&resources_MEM1=MEMORY_MB:2048
&required_NUMA1=HW_NUMA_ROOT
&same_subtree=_PROC1,_MEM1,_NUMA1
&resources_PROC2=VCPU:1
&resources_MEM2=MEMORY_MB:2048
&required_NUMA2=HW_NUMA_ROOT
&same_subtree=_PROC2,_MEM2,_NUMA2
&group_policy=none
Note
As we said for vCPUs, given we query with group_policy=none, allocation candidates could be within the same NUMA node, but that's fine since we also said that the scheduler filter would then reject them if hw:numa_nodes=X is set.
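As a summary of the four cases above, here is a small hypothetical helper showing which trait(s) a given hw:mem_page_size value would translate into (purely illustrative; the real code would reuse Nova's existing page size parsing for units like 2MB):

# Illustrative sketch only: map the hw:mem_page_size extra spec, already
# normalized to 'small', 'large', 'any' or an integer number of KB, to the
# traits required on the memory page size RP.
def mem_page_size_traits(mem_page_size):
    if mem_page_size in (None, 'small'):
        # An unspecified page size defaults to small pages.
        return ['MEMORY_PAGE_SIZE_SMALL']
    if mem_page_size == 'large':
        return ['MEMORY_PAGE_SIZE_LARGE']
    if mem_page_size == 'any':
        # No page size trait at all: any memory page size RP can match.
        return []
    # Explicit size in KB, e.g. 2048 for 2MB pages.
    return ['CUSTOM_MEMORY_PAGE_SIZE_%d' % int(mem_page_size)]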
The fallback case for NUMA-aware flavors¶
In the Optionally configured NUMA resources section, we said that we would want to accept NUMA-aware flavors landing on hosts that have the enable_numa_reporting_to_placement option set to None. Since we can't yet build an OR query for allocation candidates, we propose to make another call to Placement.
In this specific call (we name it a fallback call), we want to get all non-reshaped nodes that are not explicitly said to not support NUMA. The request is fairly trivial since we decorated the explicitly non-NUMA nodes with the HW_NON_NUMA trait:
resources=VCPU:<X>,MEMORY_MB:<Y>
&required=!HW_NON_NUMA,!HW_NUMA_ROOT
Then we would get all compute nodes that have the None
value (
including nodes that are still running the Train release in a rolling upgrade
fashion).
Of course, we could get nodes that cannot actually accept the NUMA-aware flavor, but we rely on the NUMATopologyFilter to filter them out, exactly like what we do in Train.
There is an open question about whether we should make the fallback call only if the NUMA-specific call returns no candidates, or whether we should issue both calls either way and merge the results. The former is better for performance reasons since we avoid a potentially unnecessary call, but it could create some spread/pack affinity issues. We all agree that we can leave the question unresolved for now and defer the resolution to the implementation phase.
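For illustration only, here is one possible shape of the ‘fallback only if needed’ option; the method names are hypothetical and the actual behaviour is deliberately deferred to the implementation phase:

# Illustrative sketch only: query NUMA-aware hosts first, then fall back to
# non-reshaped hosts that are not explicitly marked as non-NUMA.
def get_candidates_for_numa_flavor(placement, numa_query, fallback_query):
    # numa_query is the numbered request groups query built above.
    candidates = placement.get_allocation_candidates(numa_query)
    if candidates:
        return candidates
    # fallback_query uses required=!HW_NON_NUMA,!HW_NUMA_ROOT; the
    # NUMATopologyFilter then verifies NUMA fitting as it does in Train.
    return placement.get_allocation_candidates(fallback_query)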
Alternatives¶
Modeling of NUMA resources could be done by using specific NUMA resource classes, like NUMA_VCPU or NUMA_MEMORY_MB, that would only be set on child NUMA resource providers, while the VCPU and MEMORY_MB resource classes would only be set on the root Resource Provider (here the compute node).
If the Placement allocation candidates API were also able to provide a way to say ‘you can split the resources between resource providers’, we wouldn't need to carry a specific configuration option for a long time. All hosts would then be reshaped to be NUMA-aware, but non-NUMA-aware instances could still potentially land on those hosts. That wouldn't change the fact that, for optimal capacity, operators need to shard their clouds between NUMA workloads and non-NUMA ones, but from a Placement perspective, all hosts would be equal. This alternative proposal has largely been discussed in a previous spec already, but the consensus was that it was very difficult to implement and potentially not worth the difficulty.
Data model impact¶
None
REST API impact¶
None
Security impact¶
None
Notifications impact¶
None
Other end user impact¶
None, flavors won’t need to be modified since we will provide a translation mechanism. That said, we will explicitly explain in the documentation that we won’t support any placement-like extra specs in flavors.
Performance Impact¶
A reshape is only done when changing the configuration option to True.
Other deployer impact¶
Operators may want to migrate some instances between hosts before explicitly enabling or disabling NUMA awareness on their nodes, since they will have to consider the capacity usage accordingly as they shard their cloud. This being said, this would only be necessary for clouds that weren't already dividing NUMA-aware and non-NUMA-aware workloads between hosts through aggregates.
Developer impact¶
None, except virt driver maintainers.
Upgrade impact¶
As described above, in order to prevent a flavor update during upgrade, we will provide a translation mechanism that will take the existing flavor extra spec properties and transform them into a Placement numbered request groups query.
Since there will be a configuration option for telling that a host becomes NUMA-aware, the corresponding allocations accordingly have to change; hence the virt drivers are responsible for providing a reshape mechanism that will eventually call the Placement API /reshaper endpoint when starting the compute service. This reshape implementation will absolutely need to consider the Fast Forward Upgrade (FFU) strategy, where the whole control plane is down, and should document any extra step required for FFU, with an eventual removal of those steps in a couple of releases once deployers no longer need this support.
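As a reminder of how this could plug into the existing reshape workflow, here is a rough sketch. It assumes the existing update_provider_tree() protocol where nova.exception.ReshapeNeeded is raised when allocations need to be moved; the helper names are hypothetical:

import nova.conf
from nova import exception

CONF = nova.conf.CONF

# Illustrative sketch only: trigger a reshape when the operator turns on
# NUMA reporting on an already deployed compute node.
def update_provider_tree(self, provider_tree, nodename, allocations=None):
    if CONF.compute.enable_numa_reporting_to_placement:
        if self._has_flat_numa_inventory(provider_tree, nodename):
            if allocations is None:
                # Ask the ResourceTracker to call us again with the current
                # allocations so we can move them to the new NUMA child RPs
                # through the Placement /reshaper endpoint.
                raise exception.ReshapeNeeded()
            self._move_inventory_and_allocations_to_numa_rps(
                provider_tree, nodename, allocations)
        self._populate_numa_tree(provider_tree, nodename)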
Last but not least, we will provide a transition period (at least during the Ussuri timeframe) where operators can decide which hosts to dedicate to NUMA-aware workloads. A specific nova-status pre-upgrade check command will warn them to do so before upgrading to Victoria.
Implementation¶
Assignee(s)¶
bauzas
sean-k-mooney
Feature Liaison¶
bauzas
Work Items¶
libvirt driver passing NUMA topology through the update_provider_tree() API
Hyper-V driver passing NUMA topology through the update_provider_tree() API
Possible work on the NUMATopologyFilter to look at the candidates
Scheduler translating flavor extra specs for NUMA properties into Placement queries
nova-status pre-upgrade check command
Dependencies¶
None.
Testing¶
Functional tests and unittests.
Documentation Impact¶
None.
References¶
Nested Resource Providers: https://specs.openstack.org/openstack/nova-specs/specs/queens/approved/nested-resource-providers.html
choosing a specific CPU pin within a NUMA node for a vCPU: https://docs.openstack.org/nova/latest/admin/cpu-topologies.html#customizing-instance-cpu-pinning-policies
NUMA possible extra specs: https://docs.openstack.org/nova/latest/admin/flavors.html#extra-specs-numa-topology
Huge pages: https://docs.openstack.org/nova/latest/admin/huge-pages.html
Placement API /reshaper endpoint: https://developer.openstack.org/api-ref/placement/?expanded=id84-detail#reshaper
Placement can_split: https://review.opendev.org/#/c/658510/
physical CPU resources: https://specs.openstack.org/openstack/nova-specs/specs/train/approved/cpu-resources.html