https://blueprints.launchpad.net/kuryr-kubernetes/+spec/kuryr-kubernetes-sriov-support
This spec proposes an approach to allow kuryr-kubernetes to manage pods that require SR-IOV interfaces.
SR-IOV (Single Root I/O Virtualization) is a technique that allows a single physical PCIe device to be shared across several clients (VMs or otherwise). Such a network card exposes a PF (physical function) and multiple VFs (Virtual Functions), essentially appearing as multiple PCIe devices. These VFs can then be passed through to VMs, bypassing the hypervisor and the virtual switch. This allows performance comparable to non-virtualized environments. SR-IOV support is present in nova and neutron, see the docs [1].
It is possible to implement a similar approach within Kubernetes. Since Kubernetes uses separate network namespaces for Pods, pass-through can be implemented simply by assigning a VF device to the desired Pod's namespace.
This task poses several challenges: Kubernetes has to schedule SR-IOV-enabled Pods only onto nodes that have free VFs, the controller has to allocate suitable neutron ports for them, and the CNI has to attach host VF devices to Pod network namespaces.
The proposed solution consists of two major parts: adding SR-IOV capabilities to the VIF handler of the kuryr-kubernetes controller, and enhancing the CNI to allow it to associate VFs with Pods.
Since Kubernetes is the component that actually schedules Pods onto Nodes, we need a way to tell it that a particular node is capable of handling SR-IOV-enabled Pods. There are several techniques in Kubernetes that allow limiting where a pod may be scheduled (e.g. Labels and NodeSelectors, Taints and Tolerations), but only Opaque Integer Resources [2] (OIR) allow exact bookkeeping of VFs. This spec proposes to use a predefined OIR pattern to track VFs on a node::
pod.alpha.kubernetes.io/opaque-int-resource-sriov-vf-<PHYSNET_NAME>
For example, to request VFs for physnet2 it would be::
pod.alpha.kubernetes.io/opaque-int-resource-sriov-vf-physnet2
It will be the deployer's duty to set these resources during node setup. kubectl does not support setting OIR yet, so it has to be done as a PATCH request to the Kubernetes API. For example, to add 7 VFs from physnet2 to k8s-node-1 one would issue the following request::
curl --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "add", "path":
"/status/capacity/pod.alpha.kubernetes.io~1opaque-int-resource-sriov-vf-physnet2",
"value": "7"}]' \
http://k8s-master:8080/api/v1/nodes/k8s-node-1/status
For more information please refer to the OIR docs [2]. This process may be automated using Node Feature Discovery [3] or a similar service; however, these details are out of the scope of this spec.
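Although such automation is out of scope, a purely illustrative per-node sketch of it might look like the following; the node name, PF name and API endpoint are taken from the examples above, and reading sriov_totalvfs assumes the PF has SR-IOV enabled::

# Purely illustrative: publish the number of VFs of a PF as an OIR on
# the node, equivalent to the manual curl request above.
import json
import urllib.request

NODE = 'k8s-node-1'
PHYSNET = 'physnet2'
API = 'http://k8s-master:8080'

# sriov_totalvfs is the standard sysfs attribute holding the number of
# VFs the PF supports.
with open('/sys/class/net/enp1s0f0/device/sriov_totalvfs') as f:
    vf_count = f.read().strip()

patch = [{
    'op': 'add',
    'path': ('/status/capacity/pod.alpha.kubernetes.io~1'
             'opaque-int-resource-sriov-vf-' + PHYSNET),
    'value': vf_count,
}]
request = urllib.request.Request(
    '%s/api/v1/nodes/%s/status' % (API, NODE),
    data=json.dumps(patch).encode(),
    headers={'Content-Type': 'application/json-patch+json'},
    method='PATCH')
urllib.request.urlopen(request)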
Here is how a Pod Spec requesting these resources might look::
spec:
  containers:
  - name: vf-container
    image: vf-image
    resources:
      requests:
        pod.alpha.kubernetes.io/opaque-int-resource-sriov-vf-physnet2: 1
  - name: vf-other-container
    image: vf-other-image
    resources:
      requests:
        pod.alpha.kubernetes.io/opaque-int-resource-sriov-vf-physnet2: 1
        pod.alpha.kubernetes.io/opaque-int-resource-sriov-vf-physnet3: 1
These requests are per-container, and the total number of VFs should be totalled for the Pod, the same way Kubernetes totals other compute resources. The example above would require 2 VFs from physnet2 and 1 from physnet3.
An important note should be made about Kubernetes Init Containers [4]. If we decide that it is important to support requests from Init Containers, they would have to be treated differently. Init Containers are designed to run sequentially, so we would need to scan them and take the maximum request value across all of them.
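As an illustration, this bookkeeping could be sketched as follows; the helper name and the assumption that the Pod spec is already parsed into a dict are not part of the spec::

# Minimal sketch: compute how many VFs a Pod requests per physnet,
# summing regular containers and taking the maximum over init
# containers, mirroring how Kubernetes computes effective requests.
OIR_PREFIX = 'pod.alpha.kubernetes.io/opaque-int-resource-sriov-vf-'


def vf_requests(pod_spec):
    def container_vfs(container):
        requests = container.get('resources', {}).get('requests', {})
        return {key[len(OIR_PREFIX):]: int(value)
                for key, value in requests.items()
                if key.startswith(OIR_PREFIX)}

    totals = {}
    for container in pod_spec.get('containers', []):
        for physnet, count in container_vfs(container).items():
            totals[physnet] = totals.get(physnet, 0) + count
    # Init Containers run one at a time, so only their maximum matters.
    for container in pod_spec.get('initContainers', []):
        for physnet, count in container_vfs(container).items():
            totals[physnet] = max(totals.get(physnet, 0), count)
    return totals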
To implement SR-IOV capabilities, the current VIF handler will be modified to handle multiple VIFs. As a prerequisite, the following changes have to be implemented:
Instead of storing a single VIF in the annotation, the VIFHandler would store a dict that maps the desired interface name to a VIF object. As an alternative we could store VIFs in a list, but a dict gives finer control over interface naming. Both the handler and the CNI would have to be modified to understand this new annotation format. The CNI may also be kept backward-compatible, i.e. continue to understand the old single-VIF format.
Even though this functionality is not a part of SR-IOV handling, it acts as a prerequisite and would be implemented as part of this spec.
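For illustration only, the multi-VIF annotation could take roughly the following shape; the interface names, VIF fields and vif_type values shown are assumptions, not a format defined by this spec::

# Hypothetical shape of the multi-VIF annotation: a dict keyed by the
# desired interface name, each value being a serialized VIF.
import json

vifs = {
    'eth0': {'vif_type': 'ovs', 'port_id': '<port-uuid>',
             'active': True},
    'sriov0': {'vif_type': 'hw_veb', 'port_id': '<port-uuid>',
               'physnet': 'physnet2', 'vlan_id': 100},
}
# The controller would store the serialized dict in the existing kuryr
# VIF annotation on the Pod, replacing the old single-VIF payload.
annotation_value = json.dumps(vifs)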
The handler would read the OIR requests of a scheduled Pod and see whether the Pod has requested any SR-IOV VFs. (NOTE: at this point the Pod should already be scheduled to a node, meaning there are enough available VFs on that node.) The handler would ask the SR-IOV driver for a sufficient number of direct ports from neutron and pass them on to the CNI via annotations. Network information should also include the network's VLAN info, to set up the VF VLAN.
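A minimal sketch of the controller-side allocation might look like the following; the helper name is hypothetical, and it assumes an already authenticated python-neutronclient Client instance::

# Sketch only: ask neutron for a 'direct' (SR-IOV) port on the subnet
# mapped to the requested physnet; the resulting port and the network
# VLAN would then be serialized into the Pod annotation for the CNI.
def create_direct_port(neutron, network_id, subnet_id, pod_name):
    body = {
        'port': {
            'network_id': network_id,
            'fixed_ips': [{'subnet_id': subnet_id}],
            # 'direct' is the vnic_type neutron uses for SR-IOV ports.
            'binding:vnic_type': 'direct',
            'name': 'sriov-' + pod_name,
        }
    }
    return neutron.create_port(body)['port']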
SR-IOV functionality requires additional knowledge of neutron subnets. The controller needs to know the subnet in which it would allocate direct ports for a certain physnet. This can be solved by adding a config setting that maps physnets to a default neutron subnet. It might look like this::
default_physnet_subnets = "physnet2:e603a1cc-57e5-40fe-9af1-9fbb30905b10,physnet3:0919e15a-b619-440c-a07e-bb5a28c11a75"
Alternatively, we can request this information from neutron. However, since there can be multiple networks within a single physnet and multiple subnets within a single network, there is a lot of room for ambiguity. Finally, we can combine the approaches: request the info from neutron only if it is not set in the config.
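A minimal parsing sketch for this setting, with the helper name being an assumption, might look like::

# Turn "physnet2:<subnet-id>,physnet3:<subnet-id>" into a dict the
# controller can use to pick a default subnet for each physnet.
def parse_physnet_subnets(value):
    mapping = {}
    for pair in value.split(','):
        physnet, _, subnet_id = pair.strip().partition(':')
        mapping[physnet] = subnet_id
    return mapping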
On the CNI side we will implement a CNI binding driver for SR-IOV ports.
Since this work will be based on top of multi-vif support for both CNI and
controller, no additional format changes would be implemented.
The driver would configure the VF and pass it to the Pod's namespace. It would scan the /sys/class/net/<PF>/device directory for available virtual functions and move the acquired device into the Pod's namespace.
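A rough sketch of these binding steps follows; the helper names are hypothetical, and it assumes the VF is still bound to a host netdev driver and that the Pod namespace can be addressed by name or PID via the ip command::

# Sketch: find VFs of a PF via sysfs, then set the VF VLAN on the PF
# and move the VF netdev into the Pod's network namespace.
import os
import subprocess


def list_vfs(pf_name):
    """Return (vf_index, vf_netdev) pairs found under the PF in sysfs."""
    device_dir = '/sys/class/net/%s/device' % pf_name
    vfs = []
    for entry in os.listdir(device_dir):
        if not entry.startswith('virtfn'):
            continue
        index = int(entry[len('virtfn'):])
        net_dir = os.path.join(device_dir, entry, 'net')
        # The 'net' directory lists the VF netdev while it is still in
        # the host namespace and bound to a netdev driver.
        names = os.listdir(net_dir) if os.path.isdir(net_dir) else []
        if names:
            vfs.append((index, names[0]))
    return vfs


def attach_vf(pf_name, vf_index, vf_netdev, netns, vlan_id):
    """Set the VF VLAN and move the VF device into the Pod namespace."""
    subprocess.check_call(['ip', 'link', 'set', 'dev', pf_name,
                           'vf', str(vf_index), 'vlan', str(vlan_id)])
    subprocess.check_call(['ip', 'link', 'set', 'dev', vf_netdev,
                           'netns', str(netns)])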
The driver would need to know which devices map to which physnets. Therefore we would introduce a config setting physical_device_mappings. It will be identical to neutron-sriov-nic-agent's setting. It might look like::
physical_device_mappings = "physnet2:enp1s0f0,physnet3:enp1s0f1"
As an alternative to storing this setting in kuryr.conf, we may store it in the /etc/cni/net.d/kuryr.conf file or in a Kubernetes node annotation.
The initial implementation followed an alternative path, where SR-IOV functionality was implemented as a separate handler/CNI. This sparked several design discussions, in which the community agreed that a multi-VIF handler is preferred over a multi-handler approach. However, if implementing the multi-VIF handler proves to be lengthy and difficult, we may go with a two-phase approach. First phase: polish and merge the initial implementation. Second phase: implement the multi-VIF approach and convert the SR-IOV handler to use it.
Primary assignee: Zaitsev Kirill
[1] https://docs.openstack.org/ocata/networking-guide/config-sriov.html
[2] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#opaque-integer-resources-alpha-feature
[3] https://github.com/kubernetes-incubator/node-feature-discovery
[4] https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
[5] https://github.com/hustcat/sriov-cni
[6] https://github.com/Intel-Corp/multus-cni
[7] https://github.com/Huawei-PaaS/CNI-Genie