Active-Standby Amphora Setup using VRRP

https://blueprints.launchpad.net/octavia/+spec/activepassiveamphora

This blueprint describes how Octavia implements its Active/Standby solution. It describes the high-level topology and the code changes proposed to move from the currently supported Single topology to a high-availability loadbalancer scenario.

Problem description

A tenant should be able to start high availability loadbalancer(s) for the tenant’s backend services as follows:

  • The operator should be able to configure an Active/Standby topology through the Octavia configuration file, which the loadbalancer shall support. Octavia shall support the Active/Standby topology in addition to the currently supported Single topology.
  • In Active/Standby, two Amphorae shall host a replicated configuration of the load balancing services. Both amphorae will also deploy a Virtual Router Redundancy Protocol (VRRP) implementation [2].
  • Upon failure of the master amphora, the backup one shall seamlessly take over the load balancing functions. After the master amphora changes to a healthy status, the backup amphora shall give up the load balancing functions to the master again (see [2] section 3 for details on master election protocol).
  • Fail-overs shall be seamless to end-users and fail-over time should be minimized.
  • The following diagram illustrates the Active/Standby topology.

asciiflow:

+--------+
| Tenant |
|Service |
|  (1)   |
+--------+        +-----------+
| +--------+ +----+  Master   +----+
| | Tenant | |VIP |  Amphora  |IP1 |
| |Service | +--+-+-----+-----+-+--+
| |   (M)  |    | |MGMT |VRRP | |
| +--------+    | | IP  | IP1 | |
| |  Tenant     | +--+--++----+ |
| | Network     |    |   |      |   +-----------------+ Floating +---------+
v-v-------------^----+---v-^----v-^-+     Router      |    IP    |         |
^---------------+----v-^---+------+-+Floating <-> VIP <----------+ Internet|
|  Management   |      |   |      | |                 |          |         |
|    (MGMT)     |      |   |      | +-----------------+          +---------+
|   Network     |   +--+--++----+ |
|           Paired  |MGMT |VRRP | |
|               |   | IP  | IP2 | |
+-----------+   |   +-----+-----+ |
|  Octavia  |  ++---+  Backup   +-+--+
|Controller |  |VIP |  Amphora  |IP2 |
|    (s)    |  +----+-----------+----+
+-----------+
  • The newly introduced VRRP IPs shall communicate on the same tenant network (see security impact for more details).
  • The existing HAProxy Jinja configuration template shall include “peer” setup for state synchronization over the VRRP IP addresses.
  • The VRRP IP addresses shall work with both IPv4 and IPv6.
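
As a rough illustration of the “peer” setup mentioned above, the sketch below renders an HAProxy `peers` section from the two VRRP IP addresses. It uses plain Python string formatting rather than the actual Jinja template. The function name, the section and peer names, the IPv4 addresses, and port 1025 are assumptions made for this example, not values taken from Octavia.

```python
def render_peers_section(section_name, peers, port):
    """Render an HAProxy ``peers`` section.

    ``peers`` maps a peer name to the VRRP IP address of that amphora.
    All names and the port are illustrative placeholders.
    """
    lines = ["peers {}".format(section_name)]
    # Sort by peer name so both amphorae render an identical section.
    for name, ip in sorted(peers.items()):
        lines.append("    peer {} {}:{}".format(name, ip, port))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_peers_section(
        "lb_peers",
        {"amphora_master": "10.0.0.11", "amphora_backup": "10.0.0.12"},
        1025,
    ))
```

Both amphorae would carry the same section. HAProxy identifies the local peer by its host name unless a name is given with the `-L` command line option, so the peer names have to line up with whichever of the two is used.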

Proposed change

The Active/Standby loadbalancers require the following high level changes:

  • Add support of VRRP in the amphora base image through Keepalived.
  • Extend the controller worker to be able to spawn N amphorae associated with the same loadbalancer on N different compute nodes (this takes into account future work on the Active/Active topology). The amphorae shall be allowed to use the VIP through “allow address pairing”. These amphorae shall replicate the same listener and pool configuration. Note: topology is a property of a load balancer and not of one of its amphorae.
  • Extend the amphora driver interface, the amphora REST driver, and Jinja configuration templates for the newly introduced VRRP service [4].
  • Develop a Keepalived driver.
  • Extend the network driver to become aware of the different loadbalancer topologies and add support of network creation. The network driver shall also pair the different amphorae in a given topology to the same VIP address.
  • Extend the controller worker to build the right flow/sub-flows according to the given topology. The controller worker is also responsible for creating the correct stores needed by other flow/sub-flows.
  • Extend the Octavia configuration and Operator API to support the Active/Standby topology.
  • MINOR: Extend the Health Manager to be aware of the role of the amphora (Master/Backup) [9]. If the health manager decides to spawn a new amphora to replace an unhealthy one (while a backup amphora is already in service), it must replicate the same VRRP priorities, IDs, and authentication credentials to keep the loadbalancer in its appropriate configuration. Listeners associated with this load balancer shall be put in a DEGRADED provisioning state.
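
To make the Keepalived-related items above more concrete, here is a minimal sketch of what a Keepalived driver might render for one amphora in unicast VRRP mode, including a check script for the HAProxy service. It is not the actual Octavia template. The function name, the defaults (interface `eth1`, instance name, script path, intervals, password) and the use of `auth_type PASS` are assumptions chosen for illustration; Keepalived's `auth_type` accepts `PASS` or `AH`.

```python
KEEPALIVED_TEMPLATE = """\
vrrp_script check_haproxy {{
    script "{check_script}"
    interval {check_interval}
}}

vrrp_instance {instance_name} {{
    state {role}
    interface {interface}
    virtual_router_id {vrrp_id}
    priority {priority}
    advert_int {advert_int}
    authentication {{
        auth_type PASS
        auth_pass {auth_pass}
    }}
    unicast_src_ip {local_ip}
    unicast_peer {{
{peer_lines}
    }}
    virtual_ipaddress {{
        {vip}
    }}
    track_script {{
        check_haproxy
    }}
}}
"""


def render_keepalived_conf(role, priority, local_ip, peer_ips, vip,
                           interface="eth1", vrrp_id=1, advert_int=1,
                           auth_pass="example", instance_name="lb_vrrp",
                           check_script="/usr/local/bin/check_haproxy.sh",
                           check_interval=5):
    """Render a keepalived.conf body for one amphora (hypothetical helper)."""
    if role not in ("MASTER", "BACKUP"):
        raise ValueError("role must be MASTER or BACKUP")
    # VRRP reserves priority 0 and 255; 1-254 is available to routers
    # that do not own the virtual address.
    if not 1 <= priority <= 254:
        raise ValueError("priority must be between 1 and 254")
    if not 1 <= vrrp_id <= 255:
        raise ValueError("vrrp_id must be between 1 and 255")
    peer_lines = "\n".join("        {}".format(ip) for ip in peer_ips)
    return KEEPALIVED_TEMPLATE.format(
        role=role, priority=priority, local_ip=local_ip,
        peer_lines=peer_lines, vip=vip, interface=interface,
        vrrp_id=vrrp_id, advert_int=advert_int, auth_pass=auth_pass,
        instance_name=instance_name, check_script=check_script,
        check_interval=check_interval)


if __name__ == "__main__":
    print(render_keepalived_conf("MASTER", 100, "10.0.0.11",
                                 ["10.0.0.12"], "10.0.0.100"))
```

Under this sketch, a replacement amphora spawned by the health manager would be rendered with the same `vrrp_id`, `priority`, and `auth_pass` as the amphora it replaces, which is the replication requirement stated in the last item.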

Alternatives

We could use heartbeats as an alternative to VRRP, which is also a widely adopted solution. Heartbeats are better suited to redundant file servers, filesystems, and databases than to network services such as routers, firewalls, and loadbalancers. Willy Tarreau, the creator of HAProxy, provides a detailed view on the major differences between heartbeats and VRRP in [5].

Data model impact

The data model of the Octavia database shall be impacted as follows:

  • A new column in the load_balancer table shall indicate its topology. The topology field takes one of the values SINGLE or ACTIVE/STANDBY.
  • A new column in the amphora table shall indicate an amphora’s role in the topology. If the topology is SINGLE, the amphora role shall be STANDALONE. If the topology is ACTIVE/STANDBY, the amphora role shall be either MASTER or BACKUP. This role field will also be of use for the Active/Active topology.
  • New value tables for the loadbalancer topology and the amphorae roles.
  • New columns in the amphora table shall indicate the VRRP priority, the VRRP ID, and the VRRP interface of the amphora.
  • A new column in the listener table shall indicate the TCP port used for listener internal data synchronization.
  • VRRP groups define the common VRRP configurations for all listeners on an amphora. A new table shall hold each VRRP group’s main configuration primitives including at least: VRRP authentication information, role and priority advertisement interval. Each Active/Standby loadbalancer defines one and only one VRRP group.
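
The relationship between topology and amphora role described above can be sketched with plain Python types. This is only an illustration of the constraints, not the Octavia data model code: the class, field, and function names are hypothetical, and the enum spelling `ACTIVE_STANDBY` follows the operator API enum rather than the `ACTIVE/STANDBY` spelling used in this section.

```python
import enum
from dataclasses import dataclass
from typing import Optional


class Topology(enum.Enum):
    SINGLE = "SINGLE"
    ACTIVE_STANDBY = "ACTIVE_STANDBY"


class AmphoraRole(enum.Enum):
    STANDALONE = "STANDALONE"
    MASTER = "MASTER"
    BACKUP = "BACKUP"


# Roles an amphora may hold under each load balancer topology.
ALLOWED_ROLES = {
    Topology.SINGLE: {AmphoraRole.STANDALONE},
    Topology.ACTIVE_STANDBY: {AmphoraRole.MASTER, AmphoraRole.BACKUP},
}


@dataclass
class Amphora:
    id: str
    role: AmphoraRole
    # The VRRP columns are only meaningful for MASTER/BACKUP amphorae.
    vrrp_priority: Optional[int] = None
    vrrp_id: Optional[int] = None
    vrrp_interface: Optional[str] = None


def validate_roles(topology, amphorae):
    """Raise ValueError if any amphora role does not fit the topology."""
    for amp in amphorae:
        if amp.role not in ALLOWED_ROLES[topology]:
            raise ValueError("amphora {} has role {} under topology {}".format(
                amp.id, amp.role.value, topology.value))


if __name__ == "__main__":
    pair = [
        Amphora("amp-1", AmphoraRole.MASTER, 100, 1, "eth1"),
        Amphora("amp-2", AmphoraRole.BACKUP, 90, 1, "eth1"),
    ]
    validate_roles(Topology.ACTIVE_STANDBY, pair)
    print("roles are consistent with the topology")
```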

REST API impact

** Changes to amphora API: see [11] **

PUT /listeners/{amphora_id}/{listener_id}/haproxy

PUT /vrrp/upload

PUT /vrrp/{action}

GET /interface/{ip_addr}

** Changes to operator API: see [10] **

POST /loadbalancers
* Successful Status Code - 202
* JSON Request Body Attributes
** vip - another JSON object with one required attribute from the following
*** net_port_id - uuid
*** subnet_id - uuid
*** floating_ip_id - uuid
*** floating_ip_network_id - uuid
** tenant_id - string - optional - default “0” * 36 (for now)
** name - string - optional - default null
** description - string - optional - default null
** enabled - boolean - optional - default true
* JSON Response Body Attributes
** id - uuid
** vip - another JSON object
*** net_port_id - uuid
*** subnet_id - uuid
*** floating_ip_id - uuid
*** floating_ip_network_id - uuid
** tenant_id - string
** name - string
** description - string
** enabled - boolean
** provisioning_status - string enum - (ACTIVE, PENDING_CREATE, PENDING_UPDATE, PENDING_DELETE, DELETED, ERROR)
** operating_status - string enum - (ONLINE, OFFLINE, DEGRADED, ERROR)
** topology - string enum - (SINGLE, ACTIVE_STANDBY)

PUT /loadbalancers/{lb_id}
* Successful Status Code - 202
* JSON Request Body Attributes
** name - string
** description - string
** enabled - boolean
* JSON Response Body Attributes
** id - uuid
** vip - another JSON object
*** net_port_id - uuid
*** subnet_id - uuid
*** floating_ip_id - uuid
*** floating_ip_network_id - uuid
** tenant_id - string
** name - string
** description - string
** enabled - boolean
** provisioning_status - string enum - (ACTIVE, PENDING_CREATE, PENDING_UPDATE, PENDING_DELETE, DELETED, ERROR)
** operating_status - string enum - (ONLINE, OFFLINE, DEGRADED, ERROR)
** topology - string enum - (SINGLE, ACTIVE_STANDBY)

GET /loadbalancers/{lb_id}
* Successful Status Code - 200
* JSON Response Body Attributes
** id - uuid
** vip - another JSON object
*** net_port_id - uuid
*** subnet_id - uuid
*** floating_ip_id - uuid
*** floating_ip_network_id - uuid
** tenant_id - string
** name - string
** description - string
** enabled - boolean
** provisioning_status - string enum - (ACTIVE, PENDING_CREATE, PENDING_UPDATE, PENDING_DELETE, DELETED, ERROR)
** operating_status - string enum - (ONLINE, OFFLINE, DEGRADED, ERROR)
** topology - string enum - (SINGLE, ACTIVE_STANDBY)
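
The attribute lists above can be exercised with a small sketch that builds a POST /loadbalancers request body and reads the topology out of a response. The helper names are hypothetical, and “one required attribute” is interpreted here as “at least one of the four vip keys must be present”, which is an assumption. Note that topology appears only among the response attributes, so the sketch does not send it.

```python
import json

VIP_KEYS = ("net_port_id", "subnet_id", "floating_ip_id",
            "floating_ip_network_id")
TOPOLOGIES = ("SINGLE", "ACTIVE_STANDBY")


def build_create_request(vip, name=None, description=None, enabled=True,
                         tenant_id="0" * 36):
    """Build the JSON body for POST /loadbalancers (illustrative only)."""
    unknown = set(vip) - set(VIP_KEYS)
    if unknown:
        raise ValueError("unknown vip attributes: {}".format(sorted(unknown)))
    if not any(vip.get(key) for key in VIP_KEYS):
        raise ValueError("vip needs at least one of {}".format(VIP_KEYS))
    return {
        "vip": dict(vip),
        "tenant_id": tenant_id,
        "name": name,
        "description": description,
        "enabled": enabled,
    }


def topology_of(response_body):
    """Return the topology reported in a load balancer response body."""
    topology = response_body["topology"]
    if topology not in TOPOLOGIES:
        raise ValueError("unexpected topology: {}".format(topology))
    return topology


if __name__ == "__main__":
    body = build_create_request({"subnet_id": "example-subnet-uuid"},
                                name="lb1")
    print(json.dumps(body, indent=2, sort_keys=True))
```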

Security impact

  • The VRRP driver must automatically add a security group rule to the amphora’s security group to allow VRRP traffic (Protocol number 112) on the same tenant subnet.
  • The VRRP driver shall automatically add a security group rule to allow Authentication Header traffic (Protocol number 51).
  • The VRRP driver shall support authentication-type MD5.
  • The HAProxy driver must be updated to automatically add a security group rule that allows multi-peers to synchronize their states.
  • Currently HAProxy does not support peer authentication, and state sync messages are in plaintext.
  • At this point, VRRP shall communicate on the same tenant network. The rationale is to fail over based on the same network interface conditions that the tenant’s services experience. Also, VRRP traffic and sync messages shall naturally inherit the same protections applied to the tenant network. This may cause spurious fail-overs if the tenant network is under unplanned, heavy traffic. This is still better than failing over while the master is actually serving the tenant’s traffic, or not failing over at all when services on the master have failed. Additionally, Keepalived shall check the health of the HAProxy service.
  • In the next steps, the following shall be taken into account:
      * Tenant quotas and supported topologies.
      * Protection of VRRP traffic, HAProxy state sync, Router IDs, and pass phrases in both packets and DB.
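
The security group rules listed in this section can be sketched as plain dictionaries. The field names below follow the general shape of a Neutron security group rule body, but the helper itself, the peer synchronization port, and the choice to restrict the rules to the tenant subnet CIDR are assumptions for illustration, not the driver's actual behaviour.

```python
import ipaddress

VRRP_PROTOCOL = 112  # VRRP IP protocol number
AH_PROTOCOL = 51     # Authentication Header IP protocol number


def vrrp_security_rules(security_group_id, subnet_cidr, peer_sync_port):
    """Build ingress rules for VRRP, AH, and HAProxy peer sync traffic."""
    network = ipaddress.ip_network(subnet_cidr)
    base = {
        "security_group_id": security_group_id,
        "direction": "ingress",
        # Pick the ethertype from the subnet so IPv4 and IPv6 both work.
        "ethertype": "IPv6" if network.version == 6 else "IPv4",
        # Only accept traffic that originates on the same tenant subnet.
        "remote_ip_prefix": str(network),
    }
    return [
        dict(base, protocol=VRRP_PROTOCOL),
        dict(base, protocol=AH_PROTOCOL),
        dict(base, protocol="tcp",
             port_range_min=peer_sync_port,
             port_range_max=peer_sync_port),
    ]


if __name__ == "__main__":
    for rule in vrrp_security_rules("example-sg-id", "10.0.0.0/24", 1025):
        print(rule)
```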

Notifications impact

None.

Other end user impact

  • The operator shall be able to specify the loadbalancer topology in the Octavia configuration file; this topology is used by default.

Performance Impact

The Active/Standby topology can consume up to twice the resources (storage, network, compute) required by the Single topology. Nevertheless, only one amphora shall be active (i.e. serving end-users) at any point in time. If the master amphora is healthy, the backup one shall remain idle until it stops receiving VRRP advertisements from the master.

VRRP requires executing health checks in the amphorae at a fine-grained interval. The health checks shall be as lightweight as possible so that VRRP is able to execute all check scripts within a predefined interval. If the check scripts fail to complete within this interval, VRRP may become unstable and may incorrectly alternate the amphorae roles between MASTER and BACKUP.
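
To give a feel for the timing involved, the sketch below computes how long a backup waits before declaring the master down, using the VRRP version 3 formulas from RFC 5798 (Skew_Time = (256 - priority) * advertisement interval / 256, and Master_Down_Interval = 3 * advertisement interval + Skew_Time). Whether [2] refers to this exact revision of the protocol is an assumption. The `checks_fit` helper is a hypothetical, deliberately conservative test that treats the check scripts as running one after another.

```python
def skew_time(priority, advert_int):
    """Skew time in seconds for a backup with the given priority."""
    return (256 - priority) * advert_int / 256.0


def master_down_interval(priority, advert_int):
    """Seconds without advertisements before a backup takes over."""
    return 3 * advert_int + skew_time(priority, advert_int)


def checks_fit(check_durations, check_interval):
    """True if all check scripts, run back to back, finish in time."""
    return sum(check_durations) < check_interval


if __name__ == "__main__":
    # A backup with priority 100 and a 1 second advertisement interval.
    print(master_down_interval(100, 1.0))  # 3.609375
    print(checks_fit([0.2, 0.3], 1.0))     # True
```

With a 1 second advertisement interval the fail-over detection time stays below 4 seconds for any backup priority, since the skew term is always less than one advertisement interval.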

Other deployer impact

  • An amphora_topology config option shall be added. The controller worker shall change its taskflow behavior according to the requirements of each topology.
  • By default, the amphora_topology is SINGLE and the ACTIVE/STANDBY topology shall be enabled/requested explicitly by operators.
  • The Keepalived version deployed in the amphora image must be newer than 1.2.8 to support unicast VRRP mode.
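
A minimal sketch of how the amphora_topology option and its SINGLE default might be read is shown below, using Python's standard `configparser`. The `[controller_worker]` section name and the helper are assumptions for this example, and the value spelling `ACTIVE_STANDBY` follows the operator API enum.

```python
import configparser

SUPPORTED_TOPOLOGIES = ("SINGLE", "ACTIVE_STANDBY")


def load_topology(config_text, section="controller_worker"):
    """Return the configured topology, defaulting to SINGLE."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # The fallback also covers a missing section, so an empty file
    # yields the default SINGLE topology.
    value = parser.get(section, "amphora_topology", fallback="SINGLE")
    value = value.strip().upper()
    if value not in SUPPORTED_TOPOLOGIES:
        raise ValueError("unsupported amphora_topology: {}".format(value))
    return value


if __name__ == "__main__":
    sample = "[controller_worker]\namphora_topology = ACTIVE_STANDBY\n"
    print(load_topology(sample))
    print(load_topology(""))
```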

Developer impact

None.

Implementation

Assignee(s)

Sherif Abdelwahab (abdelwas)

Work Items

  • Amphora image update to include Keepalived.
  • Data model updates.
  • Controller worker extensions.
  • Keepalived driver.
  • Update Network driver.
  • Security rules.
  • Update Amphora REST APIs and Jinja Configurations.
  • Update Octavia Operator APIs.

Dependencies

Keepalived version deployed in the amphora image must be newer than 1.2.8 to support unicast VRRP mode.
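
An image build or boot-time check for this dependency could look like the sketch below. It only parses a version string and compares it against 1.2.8; the helper names and the sample version banner are illustrative, and how the version string is obtained from the amphora is left out.

```python
import re

# Unicast VRRP requires a Keepalived release newer than this one.
MIN_EXCLUSIVE_VERSION = (1, 2, 8)


def parse_version(text):
    """Extract the first X.Y.Z version number found in ``text``."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in {!r}".format(text))
    return tuple(int(part) for part in match.groups())


def supports_unicast_vrrp(version_text):
    """True if the reported version is strictly newer than 1.2.8."""
    return parse_version(version_text) > MIN_EXCLUSIVE_VERSION


if __name__ == "__main__":
    print(supports_unicast_vrrp("Keepalived v1.2.13"))
    print(supports_unicast_vrrp("Keepalived v1.2.8"))
```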

Testing

  • Unit tests with tox.
  • Functional tests with tox.

Documentation Impact

  • Description of the different supported topologies: Single, Active/Standby.
  • Octavia configuration file changes to enable the Active/Standby topology.
  • CLI changes to enable the Active/Standby topology.
  • Changes shall be introduced to the amphora APIs: see [11].