Open vSwitch hardware offloading
This page describes how to enable the Open vSwitch hardware offloading functionality available in OpenStack (using OpenStack Networking). The functionality was first introduced in the OpenStack Pike release. The page serves as a guide to configuring OpenStack Networking and OpenStack Compute for Open vSwitch hardware offloading.
The basics
Open vSwitch is a production-quality, multilayer virtual switch licensed under the open source Apache 2.0 license. It is designed to enable massive network automation through programmatic extension, while still supporting standard management interfaces and protocols. Open vSwitch (OVS) allows virtual machines (VMs) to communicate with each other and with the outside world. The software-based OVS solution is CPU intensive, which affects system performance and prevents full use of the available bandwidth.
Term | Definition |
---|---|
PF | Physical Function. The physical Ethernet controller that supports SR-IOV. |
VF | Virtual Function. The virtual PCIe device created from a physical Ethernet controller. |
Representor Port | Virtual network interface, similar to an SR-IOV port, that represents a Nova instance. |
First Compute Node | OpenStack Compute Node that can host Compute instances (Virtual Machines). |
Second Compute Node | OpenStack Compute Node that can host Compute instances (Virtual Machines). |
Supported Ethernet controllers
The following Ethernet controllers are known to work:
Mellanox ConnectX-4 NIC (VLAN Offload)
Mellanox ConnectX-4 Lx/ConnectX-5 NICs (VLAN/VXLAN Offload)
Broadcom NetXtreme-S series NICs
Broadcom NetXtreme-E series NICs
For information on Mellanox Ethernet Cards, see Mellanox: Ethernet Cards - Overview.
Prerequisites
Linux Kernel >= 4.13
Open vSwitch >= 2.8
iproute >= 4.12
Mellanox or Broadcom NIC
Note
Mellanox NIC firmware versions that support Open vSwitch hardware offloading:
ConnectX-5 >= 16.21.0338
ConnectX-4 >= 12.18.2000
ConnectX-4 Lx >= 14.21.0338
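The minimum versions above can be checked from a shell before going further. The following is a minimal sketch, assuming GNU sort with version ordering (-V) is available; version_ge is a hypothetical helper name, not part of any OpenStack or Open vSwitch tool.

```shell
# Return success when version $1 is greater than or equal to version $2.
# Relies on the version ordering of GNU sort (-V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the running kernel against the minimum listed above.
# The distribution suffix (for example "-37-generic") is stripped first.
kernel="$(uname -r)"
kernel="${kernel%%-*}"
if version_ge "$kernel" 4.13; then
    echo "kernel $kernel: OK"
else
    echo "kernel $kernel: older than 4.13"
fi
```

The same helper can compare a firmware version reported by the NIC tooling against the minimums in the note, for example version_ge 16.21.0338 16.21.0338.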
Using Open vSwitch hardware offloading
In order to enable Open vSwitch hardware offloading, the following steps are required:
Enable SR-IOV
Configure the NIC in switchdev mode (on the relevant nodes)
Enable Open vSwitch hardware offloading
Note
Throughout this guide, enp3s0f0 is used as the PF and eth3 is used as the representor port. These ports may vary in different environments.
Note
Throughout this guide, we use systemctl to restart OpenStack services. This is correct for operating systems that use systemd. Other methods to restart services should be used in other environments.
Create Compute virtual functions
Create the VFs for the network interface that will be used for SR-IOV. We use enp3s0f0 as PF, which is also used as the interface for the VLAN provider network and has access to the private networks of all nodes.
Note
The following steps detail how to create VFs using Mellanox ConnectX-4 and SR-IOV Ethernet cards on an Intel system. Steps may be different for the hardware of your choice.
Ensure SR-IOV and VT-d are enabled on the system. Enable IOMMU in Linux by adding intel_iommu=on to kernel parameters, for example, using GRUB.
On each Compute node, create the VFs:
# echo '4' > /sys/class/net/enp3s0f0/device/sriov_numvfs
Note
A network interface can be used both for PCI passthrough, using the PF, and SR-IOV, using the VFs. If the PF is used, the VF number stored in the sriov_numvfs file is lost. If the PF is attached again to the operating system, the number of VFs assigned to this interface will be zero. To keep the number of VFs always assigned to this interface, update the relevant file for your OS. See some examples below:
In Ubuntu, modifying the /etc/network/interfaces file:
auto enp3s0f0
iface enp3s0f0 inet dhcp
pre-up echo '4' > /sys/class/net/enp3s0f0/device/sriov_numvfs
In Red Hat, modifying the /sbin/ifup-local file:
#!/bin/sh
# ifup-local receives the interface name as its first argument.
if [ "$1" = "enp3s0f0" ]
then
    echo '4' > /sys/class/net/enp3s0f0/device/sriov_numvfs
fi
Warning
Alternatively, you can create VFs by passing the max_vfs parameter to the kernel module of your network interface. However, the max_vfs parameter has been deprecated, so the PCI /sys interface is the preferred method.
You can determine the maximum number of VFs a PF can support:
# cat /sys/class/net/enp3s0f0/device/sriov_totalvfs
8
Verify that the VFs have been created and are in up state:
Note
The PCI bus number of the PF (03:00.0) and VFs (03:00.2 .. 03:00.5) will be used later.
# lspci | grep Ethernet
03:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
03:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
03:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
03:00.3 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
03:00.4 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
03:00.5 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
# ip link show enp3s0f0
8: enp3s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT qlen 1000
    link/ether a0:36:9f:8f:3f:b8 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto
    vf 1 MAC 00:00:00:00:00:00, spoof checking on, link-state auto
    vf 2 MAC 00:00:00:00:00:00, spoof checking on, link-state auto
    vf 3 MAC 00:00:00:00:00:00, spoof checking on, link-state auto
If the interfaces are down, set them to up before launching a guest, otherwise the instance will fail to spawn:
# ip link set enp3s0f0 up
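The VF creation steps in this section can be wrapped in a small helper that checks the requested count against sriov_totalvfs before writing sriov_numvfs. This is a minimal sketch, not a supported tool: set_numvfs and the SYSFS_NET override are hypothetical names introduced here, and the override exists only so the function can be tried against a scratch directory.

```shell
# Hypothetical override; on a real host it stays at the default path.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# set_numvfs PF COUNT: create COUNT VFs on PF if the hardware allows it.
set_numvfs() {
    pf="$1"
    wanted="$2"
    dev="$SYSFS_NET/$pf/device"
    total=$(cat "$dev/sriov_totalvfs") || return 1
    if [ "$wanted" -gt "$total" ]; then
        echo "error: $pf supports at most $total VFs, requested $wanted" >&2
        return 1
    fi
    # The kernel refuses to change one non-zero VF count to another,
    # so reset the count to 0 before writing the new value.
    echo 0 > "$dev/sriov_numvfs"
    echo "$wanted" > "$dev/sriov_numvfs"
}

# Example on a real host (requires root):
#   set_numvfs enp3s0f0 4
```

Resetting to zero first destroys any existing VFs, so this should only be run before instances are attached to them.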
Configure Open vSwitch hardware offloading
Change the e-switch mode from legacy to switchdev on the PF device. This also creates the VF representor network devices in the host OS. Before changing the mode, unbind the VFs from the driver:
# echo 0000:03:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind
This tells the driver to unbind VF 03:00.2.
Note
This should be done for all relevant VFs (in this example 0000:03:00.2 .. 0000:03:00.5)
Enable Open vSwitch hardware offloading, set the PF to switchdev mode, and bind the VFs back.
# sudo devlink dev eswitch set pci/0000:03:00.0 mode switchdev
# sudo ethtool -K enp3s0f0 hw-tc-offload on
# echo 0000:03:00.2 > /sys/bus/pci/drivers/mlx5_core/bind
Note
This should be done for all relevant VFs (in this example 0000:03:00.2 .. 0000:03:00.5)
Restart Open vSwitch
# sudo systemctl enable openvswitch.service
# sudo ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
# sudo systemctl restart openvswitch.service
Note
The aging time of OVS flows is given in milliseconds and can be controlled with:
# ovs-vsctl set Open_vSwitch . other_config:max-idle=30000
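The unbind and bind steps above have to be repeated for every VF. One way to avoid typing each PCI address is to read them from the virtfn* symlinks that the kernel exposes under the PF device. The following is a minimal sketch under that assumption; list_vf_addresses and the SYSFS_NET override are hypothetical names introduced for illustration.

```shell
# Hypothetical override; on a real host it stays at the default path.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print the PCI address of every VF that belongs to the given PF, one per line.
list_vf_addresses() {
    pf="$1"
    for link in "$SYSFS_NET/$pf"/device/virtfn*; do
        # Skip the unexpanded pattern when the PF has no VFs.
        [ -L "$link" ] || continue
        # Each virtfnN symlink points at ../<domain:bus:slot.function>.
        basename "$(readlink "$link")"
    done
}

# Example on a real host (requires root; mlx5_core is the driver used in this guide):
#   for vf in $(list_vf_addresses enp3s0f0); do
#       echo "$vf" > /sys/bus/pci/drivers/mlx5_core/unbind
#   done
```

The same loop with bind in place of unbind covers the step that attaches the VFs again after the PF is in switchdev mode.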
Configure Nodes (VLAN Configuration)
Update /etc/neutron/plugins/ml2/ml2_conf.ini on Controller nodes
[ml2]
tenant_network_types = vlan
type_drivers = vlan
mechanism_drivers = openvswitch
Update /etc/neutron/neutron.conf on Controller nodes
[DEFAULT]
core_plugin = ml2
Update /etc/nova/nova.conf on Controller nodes
[filter_scheduler]
enabled_filters = PciPassthroughFilter
Update /etc/nova/nova.conf on Compute nodes
[pci]
#VLAN Configuration passthrough_whitelist example
passthrough_whitelist = {"address":"*:03:00.*","physical_network":"physnet2"}
Configure Nodes (VXLAN Configuration)
Update /etc/neutron/plugins/ml2/ml2_conf.ini on Controller nodes
[ml2]
tenant_network_types = vxlan
type_drivers = vxlan
mechanism_drivers = openvswitch
Update /etc/neutron/neutron.conf on Controller nodes
[DEFAULT]
core_plugin = ml2
Update /etc/nova/nova.conf on Controller nodes
[filter_scheduler]
enabled_filters = PciPassthroughFilter
Update /etc/nova/nova.conf on Compute nodes
Note
VXLAN configuration requires physical_network to be null.
[pci]
#VXLAN Configuration passthrough_whitelist example
passthrough_whitelist = {"address":"*:03:00.*","physical_network":null}
Restart nova and neutron services
# sudo systemctl restart openstack-nova-compute.service
# sudo systemctl restart openstack-nova-scheduler.service
# sudo systemctl restart neutron-server.service
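The passthrough_whitelist entries in the VLAN and VXLAN sections differ only in the physical_network value: a network name for VLAN, null for VXLAN. The following is a minimal sketch of a helper that prints either form, assuming the value is plain JSON. pci_whitelist is a hypothetical name and is not part of Nova.

```shell
# pci_whitelist ADDRESS [PHYSNET]
# Print a passthrough_whitelist line for nova.conf.
# With PHYSNET the VLAN form is printed; without it the VXLAN form (null).
pci_whitelist() {
    address="$1"
    physnet="${2:-}"
    if [ -n "$physnet" ]; then
        printf 'passthrough_whitelist = {"address":"%s","physical_network":"%s"}\n' \
            "$address" "$physnet"
    else
        printf 'passthrough_whitelist = {"address":"%s","physical_network":null}\n' \
            "$address"
    fi
}

# Examples; the address pattern must be quoted so the shell does not expand it:
#   pci_whitelist '*:03:00.*' physnet2
#   pci_whitelist '*:03:00.*'
```

The printed line belongs under the [pci] section of /etc/nova/nova.conf on each Compute node.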
Validate Open vSwitch hardware offloading
Note
In this example we bring up two instances on different Compute nodes and send ICMP echo packets between them. We then capture traffic on a representor port with tcpdump and see that only the first packet appears there. All the rest are offloaded.
Create a direct port on the private network
# openstack port create --network private --vnic-type=direct direct_port1
Create an instance using the direct port on ‘First Compute Node’
# openstack server create --flavor m1.small --image cloud_image --nic port-id=direct_port1 vm1
Repeat the steps above and create a second instance on ‘Second Compute Node’
# openstack port create --network private --vnic-type=direct direct_port2
# openstack server create --flavor m1.small --image mellanox_fedora --nic port-id=direct_port2 vm2
Note
You can use the --availability-zone nova:compute_node_1 option to set the desired Compute Node.
Connect to vm1 and send ICMP Echo Request packets to vm2
# vncviewer localhost:5900
vm_1# ping vm2
Connect to ‘Second Compute Node’ and find representor port of the instance
Note
Find a representor port first, in our case it’s eth3
compute_node2# ip link show enp3s0f0
6: enp3s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
    link/ether ec:0d:9a:46:9e:84 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, spoof checking off, link-state enable, trust off, query_rss off
    vf 1 MAC 00:00:00:00:00:00, spoof checking off, link-state enable, trust off, query_rss off
    vf 2 MAC 00:00:00:00:00:00, spoof checking off, link-state enable, trust off, query_rss off
    vf 3 MAC fa:16:3e:b9:b8:ce, vlan 57, spoof checking on, link-state enable, trust off, query_rss off
compute_node2# ls -l /sys/class/net/
lrwxrwxrwx 1 root root 0 Sep 11 10:54 eth0 -> ../../devices/virtual/net/eth0
lrwxrwxrwx 1 root root 0 Sep 11 10:54 eth1 -> ../../devices/virtual/net/eth1
lrwxrwxrwx 1 root root 0 Sep 11 10:54 eth2 -> ../../devices/virtual/net/eth2
lrwxrwxrwx 1 root root 0 Sep 11 10:54 eth3 -> ../../devices/virtual/net/eth3
compute_node2# sudo ovs-dpctl show
system@ovs-system:
    lookups: hit:1684 missed:1465 lost:0
    flows: 0
    masks: hit:8420 total:1 hit/pkt:2.67
    port 0: ovs-system (internal)
    port 1: br-enp3s0f0 (internal)
    port 2: br-int (internal)
    port 3: br-ex (internal)
    port 4: enp3s0f0
    port 5: tapfdc744bb-61 (internal)
    port 6: qr-a7b1e843-4f (internal)
    port 7: qg-79a77e6d-8f (internal)
    port 8: qr-f55e4c5f-f3 (internal)
    port 9: eth3
Check traffic on the representor port. Verify that only the first ICMP packet appears.
compute_node2# tcpdump -nnn -i eth3
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth3, link-type EN10MB (Ethernet), capture size 262144 bytes
17:12:41.220447 ARP, Request who-has 172.0.0.10 tell 172.0.0.13, length 46
17:12:41.220684 ARP, Reply 172.0.0.10 is-at fa:16:3e:f2:8b:23, length 42
17:12:41.260487 IP 172.0.0.13 > 172.0.0.10: ICMP echo request, id 1263, seq 1, length 64
17:12:41.260778 IP 172.0.0.10 > 172.0.0.13: ICMP echo reply, id 1263, seq 1, length 64
17:12:46.268951 ARP, Request who-has 172.0.0.13 tell 172.0.0.10, length 42
17:12:46.271771 ARP, Reply 172.0.0.13 is-at fa:16:3e:1a:10:05, length 46
17:12:55.354737 IP6 fe80::f816:3eff:fe29:8118 > ff02::1: ICMP6, router advertisement, length 64
17:12:56.106705 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 62:21:f0:89:40:73, length 300
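The check on the capture above can be scripted by counting the ICMP echo requests that reach the representor port while the ping keeps running. This is a minimal sketch that works on tcpdump text output; count_echo_requests is a hypothetical helper name, and the 10-second capture window in the usage comment is an arbitrary choice.

```shell
# Read tcpdump text output on stdin and print the number of ICMP echo requests.
count_echo_requests() {
    # grep -c exits non-zero when nothing matches; the count (0) is still printed.
    grep -c 'ICMP echo request' || true
}

# Example on the Compute node (requires root). -l makes tcpdump line buffered:
#   timeout 10 tcpdump -nn -l -i eth3 icmp 2>/dev/null | count_echo_requests
```

With a ping running for the whole capture window, a count of 1 matches the behaviour shown above, while a count that grows with every ping suggests the flow is still being handled in software.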