Appendix M: Ceph Erasure Coding and Device Classing
Overview
This appendix is intended as a post-deployment guide to reconfiguring RADOS gateway pools to use erasure coding rather than replication. It also covers the use of a specific device class (NVMe, SSD or HDD) when creating the erasure coding profile, as well as other configuration options that need to be considered during deployment.
Note
Any existing data is preserved by following this process; however, reconfiguration should take place immediately post deployment to avoid prolonged ‘copy-pool’ operations.
RADOS Gateway bucket weighting
The weighting of the various pools in a deployment drives the number of placement groups (PGs) created to support each pool. In the ceph-radosgw charm this is configured for the data bucket using:
juju config ceph-radosgw rgw-buckets-pool-weight=20
Note the default of 20%: if the deployment is a pure ceph-radosgw deployment, this value should be increased to the expected percentage use of storage. The device class also needs to be taken into account (but for erasure coding this must be specified post deployment via action execution).
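For example, if the RADOS gateway data pool is expected to consume the bulk of the storage in the deployment, the weight could be raised accordingly (the value of 90 below is illustrative only; choose a figure that reflects the expected usage):

juju config ceph-radosgw rgw-buckets-pool-weight=90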
Ceph automatic device classing
Newer versions of Ceph perform automatic device classing of OSDs. Each OSD is placed into the ‘nvme’, ‘ssd’ or ‘hdd’ device class. These classes can be used when creating erasure profiles or new CRUSH rules (see the following sections).
The classes can be inspected using:
sudo ceph osd crush tree
ID CLASS WEIGHT  TYPE NAME
-1       8.18729 root default
-5       2.72910     host node-laveran
 2  nvme 0.90970         osd.2
 5   ssd 0.90970         osd.5
 7   ssd 0.90970         osd.7
-7       2.72910     host node-mees
 1  nvme 0.90970         osd.1
 6   ssd 0.90970         osd.6
 8   ssd 0.90970         osd.8
-3       2.72910     host node-pytheas
 0  nvme 0.90970         osd.0
 3   ssd 0.90970         osd.3
 4   ssd 0.90970         osd.4
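The defined device classes, and the OSDs assigned to a given class, can also be listed directly (the nvme argument below is simply an example class):

sudo ceph osd crush class ls
sudo ceph osd crush class ls-osd nvme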
Configuring erasure coding
The RADOS gateway makes use of a number of pools, but the only pool that should be converted to use erasure coding (EC) is the data pool:
default.rgw.buckets.data
All other pools should remain replicated, as they are by default.
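The full set of RADOS gateway pools in the deployment can be listed directly on a monitor, for example:

sudo ceph osd pool ls | grep rgw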
To create a new EC profile and pool:
juju run-action --wait ceph-mon/0 create-erasure-profile \
    name=nvme-ec device-class=nvme

juju run-action --wait ceph-mon/0 create-pool \
    name=default.rgw.buckets.data.new \
    pool-type=erasure \
    erasure-profile-name=nvme-ec \
    percent-data=90
The percent-data option should be set based on the type of deployment, but if the RADOS gateway is the only target for the NVMe storage class then 90% is appropriate (the other RADOS gateway pools are tiny, using between 0.10% and 3% of storage).
Note
The create-erasure-profile action has a number of other options, including adjustment of the K/M values, which affect the computational overhead and the underlying storage consumed per MB stored. Sane defaults are provided, but they require a minimum of five hosts with block devices of the right class.
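As a sketch only, a smaller profile suited to a three-host cluster such as the one shown earlier could be created with explicit K/M values. The profile name nvme-ec-k2m1 is purely illustrative, and the data-chunks and coding-chunks parameter names should be confirmed against the ceph-mon charm's action schema (e.g. with 'juju actions ceph-mon --schema') before use:

# Illustrative K=2/M=1 profile for a three-host nvme device class
# (lower durability than the charm defaults).
juju run-action --wait ceph-mon/0 create-erasure-profile \
    name=nvme-ec-k2m1 device-class=nvme \
    data-chunks=2 coding-chunks=1

# Inspect the resulting profile directly on a monitor:
sudo ceph osd erasure-code-profile get nvme-ec-k2m1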
To avoid any creation or mutation of stored data during the migration, shut down all RADOS gateway instances:
juju run --application ceph-radosgw \
"sudo systemctl stop ceph-radosgw.target"
The existing buckets.data pool can then be copied and switched:
juju run-action --wait ceph-mon/0 rename-pool \
    name=default.rgw.buckets.data \
    new-name=default.rgw.buckets.data.old

juju run-action --wait ceph-mon/0 rename-pool \
    name=default.rgw.buckets.data.new \
    new-name=default.rgw.buckets.data
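Before restarting the gateways it can be worth confirming that the switch has taken effect; the active data pool should now be listed against the erasure profile created above:

sudo ceph osd pool ls detail | grep buckets.data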
At this point the RADOS gateway instances can be restarted:
juju run --application ceph-radosgw \
"sudo systemctl start ceph-radosgw.target"
Once successful operation of the deployment has been confirmed, the old pool can be deleted:
juju run-action --wait ceph-mon/0 delete-pool \
name=default.rgw.buckets.data.old
Moving other RADOS gateway pools to NVMe storage
The buckets.data pool is the largest pool and the only one that should make use of EC; the other pools can also be migrated to the same device class for consistent performance. First, create a replicated CRUSH rule that uses the nvme device class:
juju run-action --wait ceph-mon/0 create-crush-rule \
name=replicated_nvme device-class=nvme
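The new rule can be verified directly on a monitor:

sudo ceph osd crush rule ls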
The CRUSH rule for the other RADOS gateway pools can then be updated:
pools=".rgw.root
default.rgw.control
default.rgw.data.root
default.rgw.gc
default.rgw.log
default.rgw.intent-log
default.rgw.meta
default.rgw.usage
default.rgw.users.keys
default.rgw.users.uid
default.rgw.buckets.extra
default.rgw.buckets.index
default.rgw.users.email
default.rgw.users.swift"
for pool in $pools; do
    juju run-action --wait ceph-mon/0 pool-set \
        name=$pool key=crush_rule value=replicated_nvme
done
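To check that a pool has picked up the new rule, its crush_rule property can be queried on a monitor (one pool shown here as an example):

sudo ceph osd pool get default.rgw.control crush_rule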