Attention
Please review the active-active topology blueprint first (Active-Active, N+1 Amphorae Setup):
https://blueprints.launchpad.net/octavia/+spec/active-active-topology
This blueprint describes how Octavia implements a Distributor to support the active-active loadbalancer (LB) solution, as described in the blueprint linked above. It presents the high-level Distributor design and suggests high-level code changes to the current code base to realize this design.
In a nutshell, in an active-active topology an Amphora Cluster of two or more active Amphorae collectively provides the loadbalancing service. It is designed as a two-step loadbalancing process: first, a lightweight distribution of VIP traffic over an Amphora Cluster; then, full-featured loadbalancing of traffic over the back-end members. Since a single loadbalancing service, addressable by a single VIP address, is served by several Amphorae at the same time, incoming requests must be distributed among these Amphorae – that is the role of the Distributor.
This blueprint uses terminology defined in the Octavia glossary when available, and defines new terms to describe new components and features as necessary.
Note: Items marked with [P2] refer to lower priority features to be designed / implemented only after initial release.
Front-End Back-End
Internet Networks Networks
(world) (tenants) (tenants)
║ A B C A B C
┌──╨───┐floating IP ║ ║ ║ ┌────────┬──────────┬────┐ ║ ║ ║
│ ├─ to VIP ──►╢◄──────║───────║──┤f.e. IPs│ Amphorae │b.e.├►╜ ║ ║
│ │ LB A ║ ║ ║ └──┬─────┤ of │ IPs│ ║ ║
│ │ ║ ║ ║ │VIP A│ Tenant A ├────┘ ║ ║
│ GW │ ║ ║ ║ └─────┴──────────┘ ║ ║
│Router│floating IP ║ ║ ║ ┌────────┬──────────┬────┐ ║ ║
│ ├─ to VIP ───║──────►╟◄──────║──┤f.e. IPs│ Amphorae │b.e.├──►╜ ║
│ │ LB B ║ ║ ║ └──┬─────┤ of │ IPs│ ║
│ │ ║ ║ ║ │VIP B│ Tenant B ├────┘ ║
│ │ ║ ║ ║ └─────┴──────────┘ ║
│ │floating IP ║ ║ ║ ┌────────┬──────────┬────┐ ║
│ ├─ to VIP ───║───────║──────►╢◄─┤f.e. IPs│ Amphorae │b.e.├────►╜
└──────┘ LB C ║ ║ ║ └──┬─────┤ of │ IPs│
║ ║ ║ │VIP C│ Tenant C ├────┘
arp─►╢ arp─►╢ arp─►╢ └─────┴──────────┘
┌─┴─┐ ║┌─┴─┐ ║┌─┴─┐ ║
│VIP│┌►╜│VIP│┌►╜│VIP│┌►╜
├───┴┴┐ ├───┴┴┐ ├───┴┴┐
│IP A │ │IP B │ │IP C │
┌┴─────┴─┴─────┴─┴─────┴┐
│ │
│ Distributor │
│ (multi-tenant) │
└───────────────────────┘
Affinity is required to make sure related packets are forwarded to the same Amphora. At minimum, since TCP connections are terminated at the Amphora, all packets that belong to the same flow must be sent to the same Amphora. Enhanced affinity levels can be used to make sure that flows with similar attributes are always sent to the same Amphora; this may be desired to achieve better performance (see discussion below).
[P2] The Distributor shall support different modes of client-to-Amphora affinity. The operator should be able to select and configure the desired affinity level.
Since the Distributor operates at L3 and the “heavy lifting” is expected to be done by the Amphorae, this specification proposes implementing the two practical affinity alternatives described below. Other affinity alternatives may be implemented at a later time.
Affinity Based on Source IP and Source Port
In this mode, the Distributor must always send packets from the same combination of source IP and source port to the same Amphora. Since the target IP and target port are fixed per Listener, this mode implies that all packets of the same TCP flow are sent to the same Amphora. This is the minimal affinity mode; without it, TCP connections would break.
Note: related flows (e.g., parallel client calls from the same HTML page) will typically be distributed to different Amphorae; however, these should still be routed to the same back-end member. This can be guaranteed by using cookies and/or by synchronizing the stick-tables. Also, the Amphorae in the Cluster could be configured to use the same hashing parameters (avoiding any random seed) to ensure they all make the same decisions.
Affinity Based on Source IP Only
In this mode, the Distributor must always send packets from the same source IP to the same Amphora, regardless of source port. This mode allows TLS session reuse (e.g., through session IDs), where an abbreviated handshake can be used to improve latency and computation time.
The main disadvantage of sending all traffic from the same source IP to the same Amphora is that it might lead to poor load distribution for large workloads that share a source IP (e.g., workloads behind a single NAT or proxy).
In some (typical) TLS sessions, the additional load incurred for each new session is significantly larger than the load incurred for each new request or connection on the same session; namely, the total load on each Amphora will be more affected by the number of different source IPs it serves than by the number of connections. Moreover, since the total load on the Cluster incurred by all the connections depends on the level of session reuse, spreading a single source IP over multiple Amphorae increases the overall load on the Cluster. Thus, a Distributor that uniformly spreads traffic without affinity per source IP (e.g., uses per-flow affinity only) might cause an increase in overall load on the Cluster that is proportional to the number of Amphorae. For example, in a scale-out scenario (where a new Amphora is spawned to share the total load), moving some flows to the new Amphora might increase the overall Cluster load, negating the benefit of scaling-out.
Session reuse helps with the certificate exchange phase. The performance improvement from session reuse depends on the type of keys used, and is greatest with RSA. Session reuse may be less important with other schemes; shared TLS session tickets are another mechanism that may circumvent the problem; also, upcoming versions of HAProxy may obviate the problem by synchronizing TLS state between Amphorae (similar to the stick-table protocol).
Per the agreement at the Mitaka mid-cycle, the default affinity shall be based on source-IP only and a consistent hashing function (see below) shall be used to distribute flows in a predictable manner; however, abstraction will be used to allow other implementations at a later time.
The reference implementation of the Distributor shall use OVS for forwarding and configure the Distributor through OpenFlow rules.
Outline of Rules
A group with the select method is used to distribute IP traffic over multiple Amphorae. There is one bucket per Amphora – adding an Amphora adds a new bucket and deleting an Amphora removes the corresponding bucket.
The select method supports (OpenFlow v1.5) hash-based selection of the bucket. The hash can be set up to use different fields, including source IP only (default) and source IP and source port.
All buckets route traffic back on the in-port (i.e., no forwarding between ports). This ensures that the same front-end network is used (i.e., the Distributor does not route between front-end networks; therefore, does not mix traffic of different tenants).
The bucket actions re-write the outgoing packets: the destination MAC is re-written to that of the specific Amphora and the source MAC to that of the Distributor interface (together, these MAC re-writes provide L3 routing functionality).
Note: alternative re-write rules can be used to support other forwarding mechanisms.
OpenFlow rules are also used to answer ARP requests for the VIP. ARP requests for each VIP are captured, re-written as ARP replies with the MAC address of the corresponding front-end interface, and sent back on the in-port. Again, there is no routing between interfaces.
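To make the outline concrete, the following is a minimal sketch of how such rules could be installed with ovs-ofctl, driven from Python. The bridge name, MAC addresses, and VIP are illustrative placeholders, not part of this specification; the ARP-responder action string follows the common OVS pattern (also used by Neutron), and the group syntax assumes an OVS version that supports the selection_method extension:

    # Sketch only: illustrative ovs-ofctl invocations for the rules
    # outlined above; names and addresses are placeholders.
    import subprocess

    BRIDGE = "br-dist"                  # hypothetical Distributor bridge
    DIST_MAC = "fa:16:3e:00:00:01"      # Distributor front-end interface
    AMPHORA_MACS = ["fa:16:3e:00:00:10", "fa:16:3e:00:00:11"]
    VIP = "203.0.113.10"

    def ofctl(cmd, spec):
        subprocess.check_call(["ovs-ofctl", "-O", "OpenFlow15",
                               cmd, BRIDGE, spec])

    # Select group: one bucket per Amphora; hash on source IP only (the
    # default affinity mode). Each bucket re-writes the destination MAC
    # to an Amphora and the source MAC to the Distributor, then sends
    # the packet back out the in-port (no routing between ports).
    buckets = ",".join(
        "bucket=weight:1,actions=mod_dl_dst:%s,mod_dl_src:%s,in_port"
        % (amp_mac, DIST_MAC) for amp_mac in AMPHORA_MACS)
    ofctl("add-group",
          "group_id=1,type=select,selection_method=hash,"
          "fields(ip_src)," + buckets)

    # Steer VIP traffic into the group.
    ofctl("add-flow", "priority=100,ip,nw_dst=%s,actions=group:1" % VIP)

    # ARP responder: re-write ARP requests for the VIP as replies and
    # return them on the in-port.
    mac_hex = "0x" + DIST_MAC.replace(":", "")
    ip_hex = "0x%08x" % sum(int(o) << (8 * (3 - i))
                            for i, o in enumerate(VIP.split(".")))
    ofctl("add-flow",
          "priority=100,arp,arp_tpa=%s,arp_op=1,actions="
          "move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],"
          "mod_dl_src:%s,load:0x2->NXM_OF_ARP_OP[],"
          "move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],"
          "move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],"
          "load:%s->NXM_NX_ARP_SHA[],load:%s->NXM_OF_ARP_SPA[],"
          "in_port" % (VIP, DIST_MAC, mac_hex, ip_hex))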
Handling Amphora failure
Handling Distributor failover
Note: These are changes on top of those described in the “Active-Active, N+1 Amphorae Setup” blueprint (see https://blueprints.launchpad.net/octavia/+spec/active-active-topology).
Create a flow for the creation of an Amphora Cluster with N active Amphorae and one extra standby Amphora. Set up the Amphora roles accordingly.
Support the creation, connection, and configuration of the various networks and interfaces as described in the high-level topology diagram. The Distributor shall have a separate interface for each loadbalancer and shall not allow any routing between different ports. In particular, when a loadbalancer is created, the corresponding front-end interface must be plugged into the Distributor and configured (see the Distributor API below).
[P2] It is desirable that the Distributor be treated as a router by Neutron (to handle port security, network forwarding without ARP spoofing, etc.). This may require changes to Neutron and may also mean that Octavia will be a privileged user of Neutron.
The Distributor needs to support IPv6 NDP (the IPv6 counterpart of ARP).
[P2] If the Distributor is implemented as a container, then hot-plugging a port for each VIP might not be possible.
If DVR is used, then routing rules must be used to forward external traffic to the Distributor rather than relying on ARP. In particular, DVR messes up noarp settings.
Support Amphora failure recovery
Distributor driver and Distributor image
Define a REST API for Distributor configuration (no SSH API). See below for details.
Create data-model for Distributor.
TBD
Add table distributor with the following columns:
- ID of the Distributor instance.
- ID of the compute node running the Distributor.
- IP of the Distributor on the management network.
- Provisioning status.
- List of Neutron port IDs. New VIFs may be plugged into the Distributor when a new LB is created; we may need to store the Neutron port IDs in order to support failover from one Distributor instance to another.
Add table distributor_health with the following columns:
- ID of the Distributor instance.
- Last time a Distributor heartbeat was received by a health monitor.
- Field indicating that a create, delete, or other action is being conducted on the Distributor instance (i.e., to prevent a race condition when multiple health managers are in use).
Add table amphora_registration with the following columns. This describes which Amphorae are registered with which Distributors and in which order:
- ID of the loadbalancer.
- ID of the Distributor instance.
- ID of the Amphora instance.
- Order in which the Amphorae are registered with the Distributor.
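As an illustration of what these tables might look like in SQLAlchemy form (column names and types here are assumptions of this sketch, not a finalized Octavia schema):

    # Sketch only: one possible shape for the new tables; names and
    # types are illustrative, not a finalized schema.
    import sqlalchemy as sa
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Distributor(Base):
        __tablename__ = 'distributor'
        id = sa.Column(sa.String(36), primary_key=True)
        compute_id = sa.Column(sa.String(36))     # compute node running it
        lb_network_ip = sa.Column(sa.String(64))  # IP on management network
        status = sa.Column(sa.String(36))         # provisioning status
        # Neutron port IDs of plugged VIFs, kept to support failover
        # from one Distributor instance to another.
        vif_port_ids = sa.Column(sa.Text)

    class DistributorHealth(Base):
        __tablename__ = 'distributor_health'
        distributor_id = sa.Column(sa.String(36), primary_key=True)
        last_update = sa.Column(sa.DateTime)      # last heartbeat received
        busy = sa.Column(sa.Boolean)              # action-in-progress flag

    class AmphoraRegistration(Base):
        __tablename__ = 'amphora_registration'
        load_balancer_id = sa.Column(sa.String(36), primary_key=True)
        distributor_id = sa.Column(sa.String(36), primary_key=True)
        amphora_id = sa.Column(sa.String(36), primary_key=True)
        position = sa.Column(sa.Integer)          # registration order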
The Distributor will run its own REST API server. This API will be secured using two-way SSL authentication and will use certificate rotation, in the same way as is done for Amphorae today.
The following API calls will be supported:
Post VIP Plug
Adding a VIP network interface to the Distributor involves tasks which run outside the Distributor itself. Once these are complete, the Distributor must be configured to use the new interface. This is a REST call, similar to what is currently done for Amphorae when connecting to a new member network.
- An identifier for the particular loadbalancer/VIP, used for subsequent register/unregister of Amphorae.
- The IP address of the VIP (i.e., the address for which to answer ARP requests).
- Netmask for the VIP’s subnet.
- Gateway that outbound packets from the VIP address should use.
- MAC address of the new interface corresponding to the VIP.
- In the case of an HA Distributor, the IP address that will be used in setting up the allowed-address-pairs relationship. (See Amphora VIP plugging under the ACTIVE-STANDBY topology for an example of how this is used.)
- List of routes that should be added when the VIP is plugged.
- Extra arguments related to the algorithm used to distribute requests to the Amphorae that are part of this loadbalancer configuration. This consists of an algorithm name and an affinity type. In the initial release of ACTIVE-ACTIVE, the only valid algorithm will be hash, and the affinity type may be Source_IP or [P2] Source_IP_AND_port.
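For illustration only, a plug request body might look like the following; the field names are assumptions of this sketch, not the finalized API schema:

    # Sketch only: a hypothetical JSON body for the Post VIP Plug call.
    import json

    plug_vip_request = {
        "loadbalancer_id": "6d5f8a3e-0b8a-4e9e-9a8d-2f3e4c5b6a70",
        "vip_address": "203.0.113.10",       # IP to answer ARP requests for
        "subnet_cidr": "203.0.113.0/24",     # netmask of the VIP's subnet
        "gateway": "203.0.113.1",            # gateway for outbound packets
        "mac_address": "fa:16:3e:00:00:01",  # MAC of the new VIP interface
        "vrrp_ip": "203.0.113.11",           # HA: allowed-address-pairs IP
        "host_routes": [],                   # routes to add on plug
        "distribution_info": {
            "algorithm": "hash",
            "affinity_type": "Source_IP",
        },
    }
    print(json.dumps(plug_vip_request, indent=2))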
Pre VIP Unplug
Removing a VIP network interface will involve several tasks on the Distributor to gracefully roll-back OVS configuration and other details that were set-up when the VIP was plugged in.
- ID of the VIP’s loadbalancer that will be unplugged.
Register Amphorae
This call adds Amphorae to the configuration of a given loadbalancer. The Distributor should respond with an updated list of all Amphorae registered for that loadbalancer, with positional information.
- ID of the loadbalancer with which the Amphorae will be registered.
- List of Amphora MAC addresses, each with an optional position argument indicating the order in which it should be registered.
Unregister Amphorae
This call removes Amphorae from the configuration of a given loadbalancer. The Distributor should respond with an updated list of the Amphorae that remain registered, with positional information.
- ID of the loadbalancer from which the Amphorae will be unregistered.
- List of Amphora MAC addresses that should be unregistered from the Distributor.
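A hypothetical register request (the unregister body would carry only the MAC list) might look like the following sketch; field names are illustrative:

    # Sketch only: hypothetical request/response shapes for registering
    # Amphorae with the Distributor.
    register_request = {
        "loadbalancer_id": "6d5f8a3e-0b8a-4e9e-9a8d-2f3e4c5b6a70",
        "amphorae": [
            {"mac_address": "fa:16:3e:00:00:10", "position": 0},
            {"mac_address": "fa:16:3e:00:00:11"},   # position optional
        ],
    }
    # Expected response: the full, ordered registration list, e.g.
    # [{"mac_address": "fa:16:3e:00:00:10", "position": 0},
    #  {"mac_address": "fa:16:3e:00:00:11", "position": 1}]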
The Distributor is designed to be multi-tenant by default. (Note that the first reference implementation will not be multi-tenant until tests can be developed to verify the security of a multi-tenant reference Distributor.) Although each tenant has its own front-end network, the Distributor is connected to all of them, which might allow leaks between these networks. The rationale is twofold: first, the Distributor should be considered a trusted infrastructure component; second, all traffic is external traffic before it reaches the Amphorae. Note that the GW router has exactly the same attributes; in other words, logically, we can consider the Distributor to be an extension of the GW (or even use the GW hardware to implement the Distributor).
This approach might not be considered secure enough for some cases, such as when LBaaS is used for internal tier-to-tier communication inside a tenant network. Some tenants may want their loadbalancer’s VIP to remain private and their front-end network to be isolated. In these cases, achieving active-active for the tenant would require separate, dedicated Distributor instance(s).
Note: This section captures some background, ideas, concerns, and remarks that were raised by various people. Some of the items here can be considered for future/alternative designs, and some will hopefully make their way into related blueprints that are yet to be written (e.g., auto-scaled topology).
The Distributor shall support different mechanisms for preserving the affinity of flows to Amphorae following a change in the size of the Amphora Cluster.
The goal is to minimize shuffling of client-to-Amphora mapping during cluster size changes:
Using a simple hash to maintain affinity does not meet this goal.
For example, suppose we maintain affinity (for a fixed cluster size) using a randomizing hash, choosing chosen_amphora_id = hash(source_ip # source_port) mod number_of_amphorae (where # denotes concatenation). When an Amphora is added or removed, number_of_amphorae changes; thus, a different Amphora will be chosen for most existing flows.
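A quick sketch makes the problem concrete: under a plain mod-N hash, growing the cluster from 4 to 5 Amphorae remaps roughly 4/5 of the flows, rather than the 1/5 one would hope for:

    # Sketch: count how many simulated flows change Amphora when the
    # cluster grows from 4 to 5 under a plain mod-N hash.
    import hashlib

    def naive_choice(src_ip, n_amphorae):
        h = int(hashlib.md5(src_ip.encode()).hexdigest(), 16)
        return h % n_amphorae

    flows = ["10.0.%d.%d" % (i // 256, i % 256) for i in range(10000)]
    moved = sum(1 for f in flows if naive_choice(f, 4) != naive_choice(f, 5))
    print("remapped: %.1f%%" % (100.0 * moved / len(flows)))   # ~80%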
Below are a couple of ways to tackle this shuffling problem.
Consistent Hashing
Consistent hashing is a hashing mechanism (regardless of whether the key is based on IP or IP/port) that preserves most hash mappings during changes in the size of the Amphora Cluster. In particular, for a cluster with N Amphorae that grows to N+1 Amphorae, a consistent hashing function ensures that, with high probability, only 1/N of the input flows will be re-hashed (more precisely, of K total keys, K/N are re-hashed). Note that, even with consistent hashing, some flows will be remapped, and there is only a statistical bound on the number of remapped flows.
The “classic” consistent hashing algorithm maps both server IDs and keys to hash values and selects, for each key, the server whose hash value is closest to the key’s. Lookup generally requires O(log N) to search for the “closest” server. Achieving good distribution requires multiple hash points per server (a few tens); although these can be pre-computed, this implies a memory footprint of roughly tens of entries times N. Other algorithms (e.g., Google’s Maglev) have better performance, but provide weaker guarantees.
There are several consistent hashing libraries available. None are supported in OVS.
We should also strongly consider making any consistent hashing algorithm we develop available to all OpenStack components by making it part of an Oslo library.
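A minimal sketch of the “classic” ring construction follows (virtual nodes pre-computed per Amphora, O(log N) lookup via binary search); this is illustrative, not a proposed implementation:

    # Sketch only: a minimal consistent-hash ring with virtual nodes.
    import bisect, hashlib

    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class HashRing(object):
        def __init__(self, amphorae, vnodes=40):  # a few tens per server
            self._ring = sorted(
                (_hash("%s-%d" % (amp, i)), amp)
                for amp in amphorae for i in range(vnodes))
            self._keys = [h for h, _ in self._ring]

        def choose(self, source_ip):
            # Pick the first ring point clockwise from the key's hash.
            idx = bisect.bisect(self._keys, _hash(source_ip)) % len(self._ring)
            return self._ring[idx][1]

    ring = HashRing(["amp-1", "amp-2", "amp-3"])
    print(ring.choose("198.51.100.7"))  # stable until the cluster changes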
Rendezvous Hashing
Rendezvous (highest-random-weight) hashing provides properties similar to consistent hashing (i.e., it remaps only 1/N of the keys when a cluster of N Amphorae grows to N+1).
For each server ID, the algorithm concatenates the key and the server ID and computes a hash; the server with the largest hash value is chosen. This approach requires O(N) work per lookup, but it is much simpler to implement and has virtually no memory footprint. Through a search-tree encoding of the server IDs it is possible to achieve O(log N) lookup, but the implementation is harder and the distribution is not as good. Another feature is that more than one server can be chosen (e.g., the two largest values) to handle larger loads – not directly useful for the Distributor use case.
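A sketch of the lookup (hypothetical names; note the O(N) scan):

    # Sketch only: rendezvous (highest-random-weight) hashing.
    import hashlib

    def rendezvous_choose(source_ip, amphorae):
        def weight(amp):
            # Hash the key concatenated with the server ID; highest wins.
            return int(hashlib.md5(("%s:%s" % (source_ip, amp)).encode())
                       .hexdigest(), 16)
        return max(amphorae, key=weight)

    print(rendezvous_choose("198.51.100.7", ["amp-1", "amp-2", "amp-3"]))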
Permutation-Based Hashing
This is an alternative implementation of consistent hashing that may be simpler to implement. Keys are hashed to a set of buckets; each bucket is pre-mapped to a random permutation of the server IDs. Lookup consists of hashing the key to obtain a bucket and then walking that bucket’s permutation, selecting the first server; if a server is marked as “down”, the next server in the list is chosen. This approach is similar to Rendezvous hashing if each key is directly pre-mapped to a random permutation (and, like it, allows more than one server to be selected). If the number of failed servers is small, lookup is about O(1); memory is O(N * #buckets), where the granularity of distribution improves as the number of buckets increases. The permutation-based approach is useful for clusters of fixed size that need to handle a few nodes going down and then coming back up. If there is an assumed bound on the number of failures, memory can be reduced to O(max_failures * #buckets). This approach seems to suit the Distributor active-active use case for non-elastic workloads. A sketch follows.
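The sketch below assumes a fixed cluster and a shared seed, so that all Distributor instances compute identical tables; names are illustrative:

    # Sketch only: permutation-based consistent hashing. Each bucket
    # holds a random permutation of Amphora IDs; a key hashes to a
    # bucket, and the first live server in that permutation is chosen.
    import hashlib, random

    N_BUCKETS = 256
    amphorae = ["amp-1", "amp-2", "amp-3"]
    rng = random.Random(42)   # fixed seed: all Distributors agree
    buckets = [rng.sample(amphorae, len(amphorae))
               for _ in range(N_BUCKETS)]

    def choose(source_ip, down=frozenset()):
        h = int(hashlib.md5(source_ip.encode()).hexdigest(), 16)
        for amp in buckets[h % N_BUCKETS]:
            if amp not in down:
                return amp        # first live server in the permutation

    print(choose("198.51.100.7"))
    print(choose("198.51.100.7", down={"amp-1"}))  # only affected keys move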
Flow tracking is required, even with the above hash functions, to handle the (relatively few) remapped flows. If an existing flow is remapped, its TCP connection will break. This is acceptable when an Amphora goes down and its flows are remapped to a new one; on the other hand, it may be unacceptable when an Amphora is added to the cluster and 1/N of the existing flows are remapped. The Distributor may support different modes, as follows.
No Flow Tracking
In this mode, the Distributor applies its most recent forwarding rules, regardless of previous state. Some existing flows might be remapped to a different Amphora and would break; the client would have to recover and establish a new connection with the new Amphora (which would still be mapped to the same back-end member, if possible). Combined with consistent (or similar) hashing, this may be good enough for many web applications, which are built for failure anyway and can restore their state upon reconnect.
Full Flow Tracking
In this mode, the Distributor tracks existing flows to provide full affinity, i.e., only new flows can be remapped to a different Amphora. Linux connection tracking could be used (e.g., through iptables or through OpenFlow), but it might not scale well. Alternatively, the Distributor can use an independent mechanism similar to HAProxy stick-tables to track the flows. Note that the Distributor only needs to track the mapping per source IP and source port (unlike Linux connection tracking, which follows the TCP state and related connections).
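In pseudo-Python, the tracking layer reduces to a map keyed by source IP and port, consulted before the hash; a real implementation would live in the fast path (e.g., OpenFlow learn rules or a conntrack-like table):

    # Sketch only: full flow tracking layered over a hash fallback.
    # Existing flows keep their Amphora; only new flows see the latest
    # hash mapping.
    flow_table = {}   # (source_ip, source_port) -> amphora id

    def choose_with_tracking(source_ip, source_port, hash_choose):
        key = (source_ip, source_port)
        if key not in flow_table:
            flow_table[key] = hash_choose(source_ip)  # new flow: hash
        return flow_table[key]                        # existing: sticky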
Ryu is a well-supported and well-tested Python library for issuing OpenFlow commands. Since Neutron recently moved to using Ryu for much of what it does, using it in the Distributor might make sense for Octavia as well.
The current design uses L2 forwarding based only on L3 parameters and uses direct return routing (one-legged). The rationale behind this approach is to keep the Distributor as light as possible and have the Amphorae do the bulk of the work. This allows one (or a few) Distributor instance(s) to serve all traffic, even for very large workloads. Other approaches are possible:
- Use LVS for the Distributor.
- Use DNS for the Distributor.
[1] https://blueprints.launchpad.net/octavia/+spec/base-image
[2] https://blueprints.launchpad.net/octavia/+spec/controller-worker
[3] https://blueprints.launchpad.net/octavia/+spec/amphora-driver-interface
[4] https://blueprints.launchpad.net/octavia/+spec/controller
[5] https://blueprints.launchpad.net/octavia/+spec/operator-api
[6] Octavia HAProxy Amphora API
[7] https://blueprints.launchpad.net/octavia/+spec/active-active-topology