Initially it was assumed that there would only be a single kuryr-controller instance in a Kuryr-Kubernetes deployment. While this simplified a lot of the controller code, it is obviously not a perfect situation. Having redundant controllers helps achieve higher availability and scalability of the deployment.
Now that it is possible to run Kuryr in Pods on the Kubernetes cluster, HA is much easier to implement. The purpose of this document is to explain how it will work in practice.
There are two types of HA - Active/Passive and Active/Active. In this document we'll focus on the former. In A/P mode one of the instances is the leader (doing all the exclusive tasks), while the other instances wait in standby mode, ready to take over the leader role should the leader die. As you can see, a leader election mechanism is required to make this work.
The idea here is to use a leader election mechanism based on Kubernetes Endpoints. The approach is neatly explained on the Kubernetes blog. Election is based on Endpoint resources that hold an annotation about the current leader and its leadership lease time. If the leader dies, other instances of the service are free to take over the record. Kubernetes API mechanisms provide update exclusion to prevent race conditions.
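For illustration only, the election record can be inspected directly on the Endpoints object. The sketch below assumes the kubernetes Python client and the annotation key and field names conventionally used by this election approach; treat the exact names as assumptions rather than Kuryr's implementation:

import json

from kubernetes import client, config

# Assumes we run inside the cluster (e.g. in the kuryr-controller pod).
config.load_incluster_config()

endpoints = client.CoreV1Api().read_namespaced_endpoints(
    'kuryr-controller', 'kube-system')
record = json.loads(
    endpoints.metadata.annotations['control-plane.alpha.kubernetes.io/leader'])

# The record typically identifies the leader and its lease renewal time.
print(record['holderIdentity'], record['renewTime'])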
This can be implemented by adding another leader-elector container to each of the kuryr-controller pods:
- image: gcr.io/google_containers/leader-elector:0.5
  name: leader-elector
  args:
  - "--election=kuryr-controller"
  - "--http=0.0.0.0:${KURYR_CONTROLLER_HA_PORT:-16401}"
  - "--election-namespace=kube-system"
  - "--ttl=5s"
  ports:
  - containerPort: ${KURYR_CONTROLLER_HA_PORT:-16401}
    protocol: TCP
This adds a new container to the pod. This container will do the leader election and expose a simple JSON API on port 16401 by default. This API will be available to the kuryr-controller container.
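As a minimal sketch, the kuryr-controller container could query that API as shown below. It assumes the sidecar answers HTTP GET requests on the configured port with a JSON object containing a name field identifying the current leader; the helper names are illustrative, not the actual Kuryr code:

import json
import os
import socket
import urllib.request

# The sidecar listens on localhost inside the same pod.
LEADER_ELECTOR_URL = 'http://127.0.0.1:%s' % os.environ.get(
    'KURYR_CONTROLLER_HA_PORT', '16401')


def get_leader_name():
    """Return the name of the current leader as reported by the sidecar."""
    with urllib.request.urlopen(LEADER_ELECTOR_URL, timeout=5) as response:
        return json.loads(response.read().decode())['name']


def am_i_leader():
    """The pod name (the hostname inside the pod) identifies this instance."""
    return get_leader_name() == socket.gethostname()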
The main issue with having multiple controllers is task division. All of the controllers are watching the same endpoints and getting the same notifications, but those notifications cannot be processed by multiple controllers at once, because we would end up with a huge race condition, where each controller creates Neutron resources but only one succeeds in putting the annotation on the Kubernetes resource it is processing.
This is obviously unacceptable, so as a first step we're implementing A/P HA, where only the leader works on the resources and the rest waits in standby. This will be implemented by periodically calling the leader-elector API to check the current leader. On leader change:

- If this instance became the leader, it starts its Watcher and begins processing resources.
- If this instance lost the leadership, it stops its Watcher and goes back to standby.
Please note that this means that in HA mode the Watcher will not be started on controller startup, but only when the periodic task notices that its instance is the leader.
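A rough sketch of such a periodic check follows. The class name, the Watcher start()/stop() interface, the check interval and the is_leader_func callable (e.g. the am_i_leader() helper sketched earlier) are assumptions made for illustration:

import threading


class LeaderMonitor(object):  # hypothetical helper, not actual Kuryr code
    """Starts/stops the Watcher when this instance gains/loses leadership."""

    def __init__(self, watcher, is_leader_func, interval=3):
        self._watcher = watcher                 # assumed to expose start()/stop()
        self._is_leader_func = is_leader_func   # callable returning True/False
        self._interval = interval
        self._is_leader = False

    def _check(self):
        leader_now = self._is_leader_func()
        if leader_now and not self._is_leader:
            # We just took over the leadership - start processing events.
            self._watcher.start()
        elif not leader_now and self._is_leader:
            # We lost the leadership - stop the Watcher and stand by.
            self._watcher.stop()
        self._is_leader = leader_now
        threading.Timer(self._interval, self._check).start()

    def run(self):
        self._check()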
There are certain issues related to orphaned OpenStack resources that we may hit. Those can happen in two cases:

- The leader dies after creating OpenStack (e.g. Neutron) resources, but before annotating them onto the Kubernetes resource. The new leader will process that resource from scratch, leaving the already created OpenStack resources orphaned.
- During the leader transition (the short period between the old leader dying and another instance noticing it), events such as resource deletions may be missed, so the corresponding OpenStack resources are never cleaned up.
Both of these issues can be tackled by a garbage collector mechanism that will periodically look over Kubernetes resources and delete OpenStack resources that have no representation in annotations.
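Purely as an illustration, one garbage collection pass could look roughly like the sketch below. The annotation key, its JSON layout, the device_owner marker and the client objects (a kubernetes CoreV1Api and a neutronclient instance) are hypothetical and not the actual Kuryr-Kubernetes names:

import json

KURYR_ANNOTATION = 'openstack.org/kuryr-vif'   # hypothetical annotation key
KURYR_DEVICE_OWNER = 'compute:kuryr'           # hypothetical port marker


def collect_garbage(k8s_client, neutron_client):
    # Port IDs that Kubernetes still references through pod annotations.
    referenced = set()
    for pod in k8s_client.list_pod_for_all_namespaces().items:
        annotations = pod.metadata.annotations or {}
        vif = annotations.get(KURYR_ANNOTATION)
        if vif:
            referenced.add(json.loads(vif)['port_id'])

    # Ports Kuryr created in Neutron that nothing references anymore.
    for port in neutron_client.list_ports(
            device_owner=KURYR_DEVICE_OWNER)['ports']:
        if port['id'] not in referenced:
            neutron_client.delete_port(port['id'])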
The latter of the issues can also be tackled by saving the last seen resourceVersion of the watched resources list when stopping the Watcher and restarting the watch from that point.
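A minimal sketch of such a resumable watch, assuming the kubernetes Python client (the handle_event callback and the function itself are hypothetical, and the real Watcher consumes events differently):

from kubernetes import client, config, watch

config.load_incluster_config()  # running inside the kuryr-controller pod


def watch_pods(handle_event, last_resource_version=None):
    """Watch pods, resuming from a saved resourceVersion if we have one."""
    v1 = client.CoreV1Api()
    kwargs = {}
    if last_resource_version is not None:
        # Replay events (e.g. deletions) that happened while no Watcher
        # was running, instead of silently missing them.
        kwargs['resource_version'] = last_resource_version
    for event in watch.Watch().stream(v1.list_pod_for_all_namespaces,
                                      **kwargs):
        handle_event(event)
        last_resource_version = event['object'].metadata.resource_version
    return last_resource_version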
It would be useful to implement the garbage collector and the resourceVersion-based protection mechanism described in the section above.
Besides that, to further improve scalability, we should work on an Active/Active HA model, where work is divided evenly among all of the kuryr-controller instances. This can be achieved e.g. by using a consistent hash ring to decide which instance will process which resource.
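Purely to illustrate the idea, a simple consistent hash ring could look like this (instance names, replica count and the hash choice are arbitrary examples, not a proposed implementation):

import bisect
import hashlib


class HashRing(object):
    def __init__(self, members, replicas=100):
        # Place several virtual nodes per member to even out distribution.
        self._ring = sorted(
            (self._hash('%s-%d' % (member, i)), member)
            for member in members for i in range(replicas))
        self._keys = [key for key, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_instance(self, resource_uid):
        """Return the controller instance responsible for this resource."""
        index = bisect.bisect(self._keys, self._hash(resource_uid))
        return self._ring[index % len(self._ring)][1]


# Each controller would only process resources mapped to its own name.
ring = HashRing(['kuryr-controller-0', 'kuryr-controller-1'])
owner = ring.get_instance('default/my-pod-uid')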
Potentially this can be extended with support for non-containerized deployments by using Tooz and some other tool providing leader election, like Consul or ZooKeeper.
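As a sketch of that direction, leader election through Tooz could look roughly as follows; the ZooKeeper URL, member id, group name and the callback body are examples only:

import time

from tooz import coordination

coordinator = coordination.get_coordinator(
    'zookeeper://127.0.0.1:2181', b'kuryr-controller-1')
coordinator.start(start_heart=True)

group = b'kuryr_controller'
try:
    coordinator.create_group(group).get()
except coordination.GroupAlreadyExist:
    pass
coordinator.join_group(group).get()


def on_elected(event):
    # This instance became the leader; start the Watcher here.
    pass


coordinator.watch_elected_as_leader(group, on_elected)

while True:
    # Let Tooz fire the election callbacks.
    coordinator.run_watchers()
    time.sleep(1)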