Cinder volume replication - Disaster recovery

Overview

This is the disaster recovery scenario of a Cinder volume replication deployment. It should be read in conjunction with the Cinder volume replication page.

Scenario description

Disaster recovery involves an uncontrolled failover to the secondary site. Site-b takes over from a troubled site-a and becomes the de facto primary site, meaning that writes now go to its images. Control is passed back to site-a once it is repaired.

Warning

The charms support the underlying OpenStack services in their native ability to fail over and fail back. However, a significant degree of administrative care is still needed to ensure a successful recovery.

For example,

  • primary volume images that are currently in use may experience difficulty during their demotion to secondary status

  • running VMs will lose connectivity to their volumes

  • subsequent image resyncs may not be straightforward

Any work necessary to rectify data issues resulting from an uncontrolled failover is beyond the scope of the OpenStack charms and this document.

Simulation

To illustrate some of the rudimentary aspects involved in disaster recovery, a simulation is provided below.

Preparation

Create the replicated data volume and confirm it is available:

openstack volume create --size 5 --type site-a-repl vol-site-a-repl-data
openstack volume list
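
Optionally, inspect the new volume in more detail; its status should be ‘available’ (the exact output fields vary with the client version):

openstack volume show vol-site-a-repl-data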

Simulate a failure in site-a by turning off all of its Ceph MON daemons:

juju ssh site-a-ceph-mon/0 sudo systemctl stop ceph-mon.target
juju ssh site-a-ceph-mon/1 sudo systemctl stop ceph-mon.target
juju ssh site-a-ceph-mon/2 sudo systemctl stop ceph-mon.target

Modify timeout and retry settings

When a Ceph cluster fails, communication between Cinder and that cluster is interrupted, and the RBD driver compensates with retries and timeouts.

To accelerate the failover mechanism, timeout and retry settings on the cinder-ceph unit in site-a can be modified:

juju ssh cinder-ceph-a/0
> sudo apt install -y crudini
> sudo crudini --set /etc/cinder/cinder.conf cinder-ceph-a rados_connect_timeout 1
> sudo crudini --set /etc/cinder/cinder.conf cinder-ceph-a rados_connection_retries 1
> sudo crudini --set /etc/cinder/cinder.conf cinder-ceph-a rados_connection_interval 0
> sudo crudini --set /etc/cinder/cinder.conf cinder-ceph-a replication_connect_timeout 1
> sudo systemctl restart cinder-volume
> exit

These configuration changes are only intended to be in effect during the failover transition period. They should be reverted afterwards since the default values are fine for normal operations.

Failover

Perform the failover of site-a, confirm its cinder-volume host is disabled, and that the volume remains available:

cinder failover-host cinder@cinder-ceph-a
cinder service-list
openstack volume list

Confirm that the Cinder log file (/var/log/cinder/cinder-volume.log) on unit cinder/0 contains the successful failover message: ‘Failed over to replication target successfully.’
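
One way to check for this message non-interactively (assuming the cinder unit is named cinder/0 as above):

juju ssh cinder/0 "sudo grep 'Failed over to replication target' /var/log/cinder/cinder-volume.log"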

Revert timeout and retry settings

Revert the configuration changes made to the cinder-ceph backend:

juju ssh cinder-ceph-a/0
> sudo crudini --del /etc/cinder/cinder.conf cinder-ceph-a rados_connect_timeout
> sudo crudini --del /etc/cinder/cinder.conf cinder-ceph-a rados_connection_retries
> sudo crudini --del /etc/cinder/cinder.conf cinder-ceph-a rados_connection_interval
> sudo crudini --del /etc/cinder/cinder.conf cinder-ceph-a replication_connect_timeout
> sudo systemctl restart cinder-volume
> exit

Write to the volume

Create a VM (named ‘vm-with-data-volume’):

openstack server create --image focal-amd64 --flavor m1.tiny \
   --key-name mykey --network int_net vm-with-data-volume

FLOATING_IP=$(openstack floating ip create -f value -c floating_ip_address ext_net)
openstack server add floating ip vm-with-data-volume $FLOATING_IP
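
Before attaching the volume, it can be useful to confirm that the VM has reached ACTIVE status (an optional check):

openstack server show vm-with-data-volume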

Attach the volume to the VM, write some data to it, and detach it:

openstack server add volume vm-with-data-volume vol-site-a-repl-data

ssh -i ~/cloud-keys/mykey ubuntu@$FLOATING_IP
> sudo mkfs.ext4 /dev/vdc
> mkdir data
> sudo mount /dev/vdc data
> sudo chown ubuntu: data
> echo "This is a test." > data/test.txt
> sync
> sudo umount /dev/vdc
> exit

openstack server remove volume vm-with-data-volume vol-site-a-repl-data

Repair site-a

In the current example, site-a is repaired by starting the Ceph MON daemons:

juju ssh site-a-ceph-mon/0 sudo systemctl start ceph-mon.target
juju ssh site-a-ceph-mon/1 sudo systemctl start ceph-mon.target
juju ssh site-a-ceph-mon/2 sudo systemctl start ceph-mon.target

Confirm that the MON cluster is now healthy (it may take a while):

juju status site-a-ceph-mon

Unit                       Workload  Agent  Machine  Public address  Ports  Message
site-a-ceph-mon/0          active    idle   14       10.5.0.15              Unit is ready and clustered
site-a-ceph-mon/1*         active    idle   15       10.5.0.31              Unit is ready and clustered
site-a-ceph-mon/2          active    idle   16       10.5.0.11              Unit is ready and clustered
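
The Ceph cluster state can also be queried directly from one of the MON units (an optional check; this assumes the admin keyring is present on the unit, as is the case for ceph-mon charm deployments):

juju ssh site-a-ceph-mon/0 sudo ceph status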

Image resync

Putting site-a back online at this point would lead to two primary images for each replicated volume. This is a split-brain condition that cannot be resolved by the RBD mirror daemon. Hence, before failback is invoked, each replicated volume needs a resync of its images (the site-b images are more recent than the site-a images).

The image resync is a two-step process that is initiated on the ceph-rbd-mirror unit in site-a:

Demote the site-a images with the demote action:

juju run-action --wait site-a-ceph-rbd-mirror/0 demote pools=cinder-ceph-a

Flag the site-a images for a resync with the resync-pools action. The pools argument should point to the corresponding site’s pool, which by default is the name of the cinder-ceph application for the site (here ‘cinder-ceph-a’):

juju run-action --wait site-a-ceph-rbd-mirror/0 resync-pools i-really-mean-it=true pools=cinder-ceph-a

The Ceph RBD mirror daemon will perform the resync in the background.

Failback

Prior to failback, confirm that the images of all replicated volumes in site-a are fully synchronised. Perform a check with the ceph-rbd-mirror charm’s status action as per RBD image status:

juju run-action --wait site-a-ceph-rbd-mirror/0 status verbose=true | grep -A3 volume-

This will take a while.

The state and description for the site-a images will initially show:

state:       up+syncing
description: bootstrapping, IMAGE_SYNC/CREATE_SYNC_POINT

The intermediate values will look like:

state:       up+replaying
description: replaying, {"bytes_per_second":110318.93,"entries_behind_primary":4712.....

The final values, as expected, will become:

state:       up+replaying
description: replaying, {"bytes_per_second":0.0,"entries_behind_primary":0.....
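
Rather than re-running the status action by hand, it can be polled in a simple loop until the final values above appear (a rough sketch; the 60-second interval is arbitrary):

while true; do
    juju run-action --wait site-a-ceph-rbd-mirror/0 status verbose=true | grep -A3 volume-
    sleep 60
done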

The failback of site-a can now proceed:

cinder failover-host cinder@cinder-ceph-a --backend_id default

Confirm that the Cinder services have returned to their original healthy state (as per Cinder service list):

cinder service-list

Verification

Re-attach the volume to the VM and verify that the secondary device contains the expected data:

openstack server add volume vm-with-data-volume vol-site-a-repl-data
ssh -i ~/cloud-keys/mykey ubuntu@$FLOATING_IP
> sudo mount /dev/vdc data
> cat data/test.txt
This is a test.

We can also check the status of the image as per RBD image status to verify that the primary indeed resides in site-a again:

juju run-action --wait site-a-ceph-rbd-mirror/0 status verbose=true | grep -A3 volume-

volume-c44d4d20-6ede-422a-903d-588d1b0d51b0:
  global_id:   3a4aa755-c9ee-4319-8ba4-fc494d20d783
  state:       up+stopped
  description: local image is primary