Cinder volume replication - Disaster recovery¶
Overview¶
This is the disaster recovery scenario of a Cinder volume replication deployment. It should be read in conjunction with the Cinder volume replication page.
Scenario description¶
Disaster recovery involves an uncontrolled failover to the secondary site. Site-b takes over from a troubled site-a and becomes the de-facto primary site, which includes writes to its images. Control is passed back to site-a once it is repaired.
Warning
The charms support the underlying OpenStack services in their native ability to fail over and fail back. However, a significant degree of administrative care is still needed to ensure a successful recovery.
For example:
primary volume images that are currently in use may experience difficulty during their demotion to secondary status
running VMs will lose connectivity to their volumes
subsequent image resyncs may not be straightforward
Any work necessary to rectify data issues resulting from an uncontrolled failover is beyond the scope of the OpenStack charms and this document.
Simulation¶
To illustrate some of the rudimentary aspects involved in disaster recovery, a simulation is provided below.
Preparation¶
Create the replicated data volume and confirm it is available:
openstack volume create --size 5 --type site-a-repl vol-site-a-repl-data
openstack volume list
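The 'site-a-repl' volume type is assumed to have already been created as per the Cinder volume replication page. If unsure, its properties can be inspected beforehand:
openstack volume type show site-a-repl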
Simulate a failure in site-a by turning off all of its Ceph MON daemons:
juju ssh site-a-ceph-mon/0 sudo systemctl stop ceph-mon.target
juju ssh site-a-ceph-mon/1 sudo systemctl stop ceph-mon.target
juju ssh site-a-ceph-mon/2 sudo systemctl stop ceph-mon.target
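As an optional sanity check (not part of the procedure itself), systemd can be queried on a unit to confirm that the daemons are stopped; the expected output is 'inactive':
juju ssh site-a-ceph-mon/0 sudo systemctl is-active ceph-mon.target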
Modify timeout and retry settings¶
When a Ceph cluster fails, communication between Cinder and the failed cluster is interrupted, and the RBD driver accommodates this with retries and timeouts.
To accelerate the failover mechanism, timeout and retry settings on the cinder-ceph unit in site-a can be modified:
juju ssh cinder-ceph-a/0
> sudo apt install -y crudini
> sudo crudini --set /etc/cinder/cinder.conf cinder-ceph-a rados_connect_timeout 1
> sudo crudini --set /etc/cinder/cinder.conf cinder-ceph-a rados_connection_retries 1
> sudo crudini --set /etc/cinder/cinder.conf cinder-ceph-a rados_connection_interval 0
> sudo crudini --set /etc/cinder/cinder.conf cinder-ceph-a replication_connect_timeout 1
> sudo systemctl restart cinder-volume
> exit
These configuration changes are only intended to be in effect during the failover transition period. They should be reverted afterwards since the default values are fine for normal operations.
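If desired, the overrides can be read back before proceeding; a minimal check, assuming crudini is still installed on the unit:
juju ssh cinder-ceph-a/0 sudo crudini --get /etc/cinder/cinder.conf cinder-ceph-a rados_connect_timeout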
Failover¶
Perform the failover of site-a, confirm its cinder-volume host is disabled, and that the volume remains available:
cinder failover-host cinder@cinder-ceph-a
cinder service-list
openstack volume list
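The service state can also be inspected with the OpenStack client if preferred; the cinder@cinder-ceph-a host should be listed as disabled:
openstack volume service list --service cinder-volume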
Confirm that the Cinder log file (/var/log/cinder/cinder-volume.log) on unit cinder/0 contains the successful failover message: "Failed over to replication target successfully."
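One way to check for the message without opening the file interactively (using the cinder/0 unit named above):
juju ssh cinder/0 "sudo grep 'Failed over to replication target successfully' /var/log/cinder/cinder-volume.log"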
Revert timeout and retry settings¶
Revert the configuration changes made to the cinder-ceph backend:
juju ssh cinder-ceph-a/0
> sudo crudini --del /etc/cinder/cinder.conf cinder-ceph-a rados_connect_timeout
> sudo crudini --del /etc/cinder/cinder.conf cinder-ceph-a rados_connection_retries
> sudo crudini --del /etc/cinder/cinder.conf cinder-ceph-a rados_connection_interval
> sudo crudini --del /etc/cinder/cinder.conf cinder-ceph-a replication_connect_timeout
> sudo systemctl restart cinder-volume
> exit
Write to the volume¶
Create a VM (named ‘vm-with-data-volume’):
openstack server create --image focal-amd64 --flavor m1.tiny \
--key-name mykey --network int_net vm-with-data-volume
FLOATING_IP=$(openstack floating ip create -f value -c floating_ip_address ext_net)
openstack server add floating ip vm-with-data-volume $FLOATING_IP
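Optionally, wait for the VM to reach the ACTIVE state before attaching the volume:
openstack server show vm-with-data-volume -c status -f value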
Attach the volume to the VM, write some data to it, and detach it:
openstack server add volume vm-with-data-volume vol-site-a-repl-data
ssh -i ~/cloud-keys/mykey ubuntu@$FLOATING_IP
> sudo mkfs.ext4 /dev/vdc
> mkdir data
> sudo mount /dev/vdc data
> sudo chown ubuntu: data
> echo "This is a test." > data/test.txt
> sync
> sudo umount /dev/vdc
> exit
openstack server remove volume vm-with-data-volume vol-site-a-repl-data
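After detaching, the volume should return to the 'available' state, which can be confirmed with:
openstack volume show vol-site-a-repl-data -c status -f value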
Repair site-a¶
In the current example, site-a is repaired by starting the Ceph MON daemons:
juju ssh site-a-ceph-mon/0 sudo systemctl start ceph-mon.target
juju ssh site-a-ceph-mon/1 sudo systemctl start ceph-mon.target
juju ssh site-a-ceph-mon/2 sudo systemctl start ceph-mon.target
Confirm that the MON cluster is now healthy (it may take a while):
juju status site-a-ceph-mon
Unit Workload Agent Machine Public address Ports Message
site-a-ceph-mon/0 active idle 14 10.5.0.15 Unit is ready and clustered
site-a-ceph-mon/1* active idle 15 10.5.0.31 Unit is ready and clustered
site-a-ceph-mon/2 active idle 16 10.5.0.11 Unit is ready and clustered
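Ceph's own view of the cluster can also be consulted; a quick check, assuming the admin keyring is available on the MON units (as is typical for a charmed deployment):
juju ssh site-a-ceph-mon/0 sudo ceph status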
Image resync¶
Putting site-a back online at this point will lead to two primary images for each replicated volume. This is a split-brain condition that cannot be resolved by the RBD mirror daemon. Hence, before failback is invoked, each replicated volume will need a resync of its images (the site-b images are more recent than the site-a images).
The image resync is a two-step process that is initiated on the ceph-rbd-mirror unit in site-a:
First, demote the site-a images with the demote action:
juju run-action --wait site-a-ceph-rbd-mirror/0 demote pools=cinder-ceph-a
Second, flag the site-a images for a resync with the resync-pools action. The pools argument should point to the corresponding site's pool, which by default is the name of the cinder-ceph application for the site (here 'cinder-ceph-a'):
juju run-action --wait site-a-ceph-rbd-mirror/0 resync-pools i-really-mean-it=true pools=cinder-ceph-a
The Ceph RBD mirror daemon will perform the resync in the background.
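Progress can be observed from the Ceph side while the resync runs; one possible check, assuming the admin keyring is available on a site-a MON unit and the default pool name of 'cinder-ceph-a':
juju ssh site-a-ceph-mon/0 sudo rbd mirror pool status cinder-ceph-a --verbose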
Failback¶
Prior to failback, confirm that the images of all replicated volumes in site-a are fully synchronised. Perform a check with the ceph-rbd-mirror charm's status action as per RBD image status:
juju run-action --wait site-a-ceph-rbd-mirror/0 status verbose=true | grep -A3 volume-
This will take a while (a polling sketch is given after the expected status values below). The state and description for the site-a images will initially show:
state: up+syncing
description: bootstrapping, IMAGE_SYNC/CREATE_SYNC_POINT
The intermediate values will look like:
state: up+replaying
description: replaying, {"bytes_per_second":110318.93,"entries_behind_primary":4712.....
The final values, as expected, will become:
state: up+replaying
description: replaying, {"bytes_per_second":0.0,"entries_behind_primary":0.....
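Rather than re-running the check by hand, it can be repeated on an interval until entries_behind_primary reaches 0; a simple polling sketch:
watch -n 60 "juju run-action --wait site-a-ceph-rbd-mirror/0 status verbose=true | grep -A3 volume-"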
The failback of site-a can now proceed:
cinder failover-host cinder@cinder-ceph-a --backend_id default
Confirm the original health of Cinder services (as per Cinder service list):
cinder service-list
Verification¶
Re-attach the volume to the VM and verify that the attached data device (/dev/vdc) contains the expected data:
openstack server add volume vm-with-data-volume vol-site-a-repl-data
ssh -i ~/cloud-keys/mykey ubuntu@$FLOATING_IP
> sudo mount /dev/vdc data
> cat data/test.txt
This is a test.
We can also check the status of the image as per RBD image status to verify that the primary indeed resides in site-a again:
juju run-action --wait site-a-ceph-rbd-mirror/0 status verbose=true | grep -A3 volume-
volume-c44d4d20-6ede-422a-903d-588d1b0d51b0:
global_id: 3a4aa755-c9ee-4319-8ba4-fc494d20d783
state: up+stopped
description: local image is primary