Partitioned Consistent Hash Ring¶
Ring¶
- class swift.common.ring.ring.Ring(serialized_path, reload_time=None, ring_name=None, validation_hook=<function Ring.<lambda>>)¶
Bases:
object
Partitioned consistent hashing ring.
- Parameters
serialized_path – path to serialized RingData instance
reload_time – time interval in seconds to check for a ring change
ring_name – ring name string (basically specified from policy)
validation_hook – hook point to validate the ring configuration when it is loaded
- Raises
RingLoadError – if the loaded ring data violates its constraint
- property assigned_device_count¶
Number of devices with assignments in the ring.
- property device_count¶
Number of devices in the ring.
- property devs¶
Devices in the ring.
- get_more_nodes(part)¶
Generator to get extra nodes for a partition for hinted handoff.
The handoff nodes will try to be in zones other than the primary zones, will take into account the device weights, and will usually keep the same sequences of handoffs even with ring changes.
- Parameters
part – partition to get handoff nodes for
- Returns
generator of node dicts
See get_nodes() for a description of the node dicts.
- get_nodes(account, container=None, obj=None)¶
Get the partition and nodes for an account/container/object. If a node is responsible for more than one replica, it will only appear in the output once.
- Parameters
account – account name
container – container name
obj – object name
- Returns
a tuple of (partition, list of node dicts)
Each node dict will have at least the following keys:
id
unique integer identifier amongst devices
index
offset into the primary node list for the partition
weight
a float of the relative weight of this device as compared to others; this indicates how many partitions the builder will try to assign to this device
zone
integer indicating which zone the device is in; a given partition will not be assigned to multiple devices within the same zone
ip
the ip address of the device
port
the tcp port of the device
device
the device’s name on disk (sdb1, for example)
meta
general use ‘extra’ field; for example: the online date, the hardware description
- get_part(account, container=None, obj=None)¶
Get the partition for an account/container/object.
- Parameters
account – account name
container – container name
obj – object name
- Returns
the partition number
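The mapping from a name to a partition number is not spelled out above. The following is a minimal sketch, assuming the partition is taken from the top part_power bits of an MD5 digest of the path. The function name sketch_get_part is hypothetical, and the cluster-specific hash prefix/suffix that a real deployment mixes in is deliberately left out.

```python
import hashlib
import struct


def sketch_get_part(part_power, account, container=None, obj=None):
    """Map an account/container/object path to a partition number.

    Assumption: the partition is the top ``part_power`` bits of the first
    four bytes of an MD5 digest of the path. Any secret hash prefix or
    suffix used by a real cluster is omitted from this sketch.
    """
    names = [name for name in (account, container, obj) if name is not None]
    path = "/" + "/".join(names)
    digest = hashlib.md5(path.encode("utf-8")).digest()
    part_shift = 32 - part_power
    # Interpret the first 4 digest bytes as a big-endian unsigned integer
    # and keep only its most significant part_power bits.
    return struct.unpack_from(">I", digest)[0] >> part_shift


part = sketch_get_part(10, "AUTH_test", "photos", "cat.jpg")
```

With part_power set to 10 the result always falls in the range 0 to 1023, and the same path always yields the same partition.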
- get_part_nodes(part)¶
Get the nodes that are responsible for the partition. If one node is responsible for more than one replica of the same partition, it will only appear in the output once.
- Parameters
part – partition to get nodes for
- Returns
list of node dicts
See get_nodes() for a description of the node dicts.
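The de-duplication described for get_part_nodes() can be sketched as follows. This is an illustration only: sketch_get_part_nodes is a hypothetical stand-alone function, and it assumes a replica-to-partition-to-device table stored as a list of rows (one per replica) plus a devs list indexed by device id.

```python
def sketch_get_part_nodes(replica2part2dev_id, devs, part):
    """Return the primary node dicts for ``part`` with duplicates removed."""
    seen_ids = set()
    nodes = []
    for part2dev_id in replica2part2dev_id:
        # Assumption: the row of a partial replica may be shorter than
        # the partition count, so skip rows that do not cover this part.
        if part >= len(part2dev_id):
            continue
        dev_id = part2dev_id[part]
        if dev_id not in seen_ids:
            seen_ids.add(dev_id)
            # 'index' is the offset into the de-duplicated primary list.
            nodes.append(dict(devs[dev_id], index=len(nodes)))
    return nodes


devs = [{"id": 0, "device": "sdb1"}, {"id": 1, "device": "sdc1"}]
table = [[0, 1], [1, 1], [0, 0]]  # three replicas, two partitions
nodes = sketch_get_part_nodes(table, devs, 1)
```

For partition 1 the rows name devices 1, 1 and 0, so only two node dicts are returned even though the ring has three replicas.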
- has_changed()¶
Check to see if the ring on disk is different than the current one in memory.
- Returns
True if the ring on disk has changed, False otherwise
- property md5¶
- property next_part_power¶
- property part_power¶
- property partition_count¶
Number of partitions in the ring.
- property raw_size¶
- property replica_count¶
Number of replicas (full or partial) used in the ring.
- property size¶
- property version¶
- property weighted_device_count¶
Number of devices with weight in the ring.
- class swift.common.ring.ring.RingData(replica2part2dev_id, devs, part_shift, next_part_power=None, version=None)¶
Bases:
object
Partitioned consistent hashing ring data (used for serialization).
- classmethod deserialize_v1(gz_file, metadata_only=False)¶
Deserialize a v1 ring file into a dictionary with devs, part_shift, and replica2part2dev_id keys.
If the optional kwarg metadata_only is True, then the replica2part2dev_id is not loaded and that key in the returned dictionary just has the value [].
- Parameters
gz_file (file) – An opened file-like object which has already consumed the 6 bytes of magic and version.
metadata_only (bool) – If True, only load devs and part_shift
- Returns
A dict containing devs, part_shift, and replica2part2dev_id
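To make the metadata_only behaviour concrete, here is a hedged round-trip sketch of the part of a ring file that follows the 6 bytes of magic and version. The layout used here (a 4-byte length, JSON metadata, then one row of unsigned 16-bit device ids per replica) is an assumption for illustration, and both function names are hypothetical.

```python
import array
import io
import json
import struct


def sketch_write_body(ring_dict):
    """Serialize everything that follows the magic/version header.

    Assumed layout: 4-byte big-endian JSON length, JSON metadata, then
    one row of unsigned 16-bit device ids per replica.
    """
    meta = {
        "devs": ring_dict["devs"],
        "part_shift": ring_dict["part_shift"],
        "replica_count": len(ring_dict["replica2part2dev_id"]),
    }
    json_bytes = json.dumps(meta, sort_keys=True).encode("ascii")
    out = io.BytesIO()
    out.write(struct.pack("!I", len(json_bytes)))
    out.write(json_bytes)
    for row in ring_dict["replica2part2dev_id"]:
        out.write(array.array("H", row).tobytes())
    return out.getvalue()


def sketch_read_body(file_obj, metadata_only=False):
    """Inverse of sketch_write_body, honouring metadata_only."""
    (json_len,) = struct.unpack("!I", file_obj.read(4))
    ring_dict = json.loads(file_obj.read(json_len))
    ring_dict["replica2part2dev_id"] = []
    if metadata_only:
        # Stop before the (potentially large) assignment table.
        return ring_dict
    partition_count = 1 << (32 - ring_dict["part_shift"])
    for _ in range(ring_dict["replica_count"]):
        row = array.array("H")
        row.frombytes(file_obj.read(row.itemsize * partition_count))
        ring_dict["replica2part2dev_id"].append(row)
    return ring_dict


original = {
    "devs": [{"id": 0, "device": "sdb1"}, {"id": 1, "device": "sdc1"}],
    "part_shift": 30,  # 2 ** (32 - 30) = 4 partitions
    "replica2part2dev_id": [[0, 1, 0, 1], [1, 0, 1, 0]],
}
body = sketch_write_body(original)
full = sketch_read_body(io.BytesIO(body))
meta_only = sketch_read_body(io.BytesIO(body), metadata_only=True)
```

Skipping the table is what makes a metadata-only load cheap: the device list and part_shift sit in the small JSON header, while the assignment rows grow with the partition count.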
- classmethod load(filename, metadata_only=False)¶
Load ring data from a file.
- Parameters
filename – Path to a file serialized by the save() method.
metadata_only (bool) – If True, only load devs and part_shift.
- Returns
A RingData instance containing the loaded data.
- property replica_count¶
Number of replicas (full or partial) used in the ring.
- save(filename, mtime=1300507380.0)¶
Serialize this RingData instance to disk.
- Parameters
filename – File into which this instance should be serialized.
mtime – time used to override the mtime recorded by gzip; use the default, or pass None if the caller wants the actual time included
- serialize_v1(file_obj)¶
- to_dict()¶
- class swift.common.ring.ring.RingReader(filename)¶
Bases:
object
- chunk_size = 65536¶
- property close¶
- property md5¶
- read(amount=-1)¶
- readinto(buffer)¶
- readline()¶
- seek(pos, ref=0)¶
- swift.common.ring.ring.calc_replica_count(replica2part2dev_id)¶
Ring Builder¶
- class swift.common.ring.builder.RingBuilder(part_power, replicas, min_part_hours)¶
Bases:
object
Used to build swift.common.ring.RingData instances to be written to disk and used with swift.common.ring.Ring instances. See bin/swift-ring-builder for example usage.
The instance variable devs_changed indicates if the device information has changed since the last balancing. This can be used by tools to know whether a rebalance request is an isolated request or due to added, changed, or removed devices.
- Parameters
part_power – number of partitions = 2**part_power.
replicas – number of replicas for each partition
min_part_hours – minimum number of hours between partition changes
- add_dev(dev)¶
Add a device to the ring. This device dict should have a minimum of the following keys:
id
unique integer identifier amongst devices. Defaults to the next id if the ‘id’ key is not provided in the dict
weight
a float of the relative weight of this device as compared to others; this indicates how many partitions the builder will try to assign to this device
region
integer indicating which region the device is in
zone
integer indicating which zone the device is in; a given partition will not be assigned to multiple devices within the same (region, zone) pair if there is any alternative
ip
the ip address of the device
port
the tcp port of the device
device
the device’s name on disk (sdb1, for example)
meta
general use ‘extra’ field; for example: the online date, the hardware description
Note
This will not rebalance the ring immediately as you may want to make multiple changes for a single rebalance.
- Parameters
dev – device dict
- Returns
id of device (not used in the tree anymore, but unknown users may depend on it)
- cancel_increase_partition_power()¶
Cancels a ring partition power increase.
This sets the next_part_power to the current part_power. Object replicators will still skip replication, and a cleanup is still required. Finally, a finish_increase_partition_power needs to be run.
- Returns
False if next_part_power was not set or is equal to current part_power, otherwise True.
- change_min_part_hours(min_part_hours)¶
Changes the value used to decide if a given partition can be moved again. This restriction is to give the overall system enough time to settle a partition to its new location before moving it to yet another location. While no data would be lost if a partition is moved several times quickly, it could make that data unreachable for a short period of time.
This should be set to at least the average full partition replication time. Starting it at 24 hours and then lowering it to what the replicator reports as the longest partition cycle is best.
- Parameters
min_part_hours – new value for min_part_hours
- copy_from(builder)¶
Reinitializes this RingBuilder instance from data obtained from the builder dict given. Code example:
b = RingBuilder(1, 1, 1)  # Dummy values
b.copy_from(builder)
This is to restore a RingBuilder that has had its b.to_dict() previously saved.
- debug()¶
Temporarily enables debug logging, useful in tests, e.g.:
with rb.debug():
    rb.rebalance()
- property ever_rebalanced¶
- finish_increase_partition_power()¶
Finish the partition power increase.
The hard links from the old object locations should be removed by now.
- classmethod from_dict(builder_data)¶
- get_balance()¶
Get the balance of the ring. The balance value is the largest percentage by which any device’s partition count differs from the number of partitions it wants. For instance, if the “worst” device wants (based on its weight relative to the sum of all the devices’ weights) 123 partitions and it has 124 partitions, the balance value would be 0.81 (1 extra / 123 wanted * 100 for percentage).
- Returns
balance of the ring
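The balance calculation can be sketched in a few lines. This is a simplified illustration, not the builder's code: sketch_get_balance is a hypothetical function, each device dict is assumed to carry a parts count alongside its weight, and zero-weight devices are simply skipped.

```python
def sketch_get_balance(devs, partition_count, replicas):
    """Return the worst percentage deviation from the wanted part count."""
    total_weight = sum(dev["weight"] for dev in devs)
    # Number of partition replicas "earned" by one unit of weight.
    parts_per_weight = partition_count * replicas / total_weight
    worst = 0.0
    for dev in devs:
        if not dev["weight"]:
            continue  # simplification: ignore devices that want nothing
        wanted = dev["weight"] * parts_per_weight
        deviation = abs(100.0 * dev["parts"] / wanted - 100.0)
        worst = max(worst, deviation)
    return worst


devs = [
    {"id": 0, "weight": 100.0, "parts": 66},
    {"id": 1, "weight": 100.0, "parts": 62},
]
balance = sketch_get_balance(devs, partition_count=128, replicas=1)
```

Each device here wants 64 partitions, so holding 66 or 62 is a deviation of 2 out of 64, and the balance is 3.125.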
- get_part_devices(part)¶
Get the devices that are responsible for the partition, filtering out duplicates.
- Parameters
part – partition to get devices for
- Returns
list of device dicts
- get_required_overload(weighted=None, wanted=None)¶
Returns the minimum overload value required to make the ring maximally dispersed.
The required overload is the largest percentage change of any single device from its weighted replicanth to its wanted replicanth (note: underweighted devices have a negative percentage change) to achieve dispersion; that is to say, a single device that must be overloaded by 5% is worse than 5 devices in a single tier overloaded by 1%.
- get_ring()¶
Get the ring, or more specifically, the swift.common.ring.RingData. This ring data is the minimum required for use of the ring. The ring builder itself keeps additional data such as when partitions were last moved.
- property id¶
- increase_partition_power()¶
Increases ring partition power by one.
Devices will be assigned to partitions like this:
OLD: 0, 3, 7, 5, 2, 1, …
NEW: 0, 0, 3, 3, 7, 7, 5, 5, 2, 2, 1, 1, …
- Returns
False if next_part_power was not set or is equal to current part_power, None if something went wrong, otherwise True.
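The OLD/NEW assignment pattern above amounts to repeating every entry of a partition-to-device row twice. A minimal sketch, using a hypothetical helper name and plain lists in place of the builder's internal tables:

```python
def sketch_double_partitions(part2dev_id):
    """Return the row for part_power + 1: each old entry appears twice."""
    return [dev_id for dev_id in part2dev_id for _ in range(2)]


old = [0, 3, 7, 5, 2, 1]
new = sketch_double_partitions(old)
# Old partition p is served by the same device as new partitions
# 2 * p and 2 * p + 1.
same_device = all(
    new[2 * p] == old[p] == new[2 * p + 1] for p in range(len(old))
)
```

Because both new partitions derived from an old one stay on the same device, the pattern itself does not move any partition to a different device.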
- classmethod load(builder_file, open=<built-in function open>, **kwargs)¶
Obtain RingBuilder instance of the provided builder file
- Parameters
builder_file – path to builder file to load
- Returns
RingBuilder instance
- property min_part_seconds_left¶
Get the total seconds until a rebalance can be performed
- property part_shift¶
- prepare_increase_partition_power()¶
Prepares a ring for partition power increase.
This makes it possible to compute the future location of any object based on the next partition power.
In this phase object servers should create hard links when finalizing a write to the new location as well. A relinker will be run after restarting object-servers, creating hard links to all existing objects in their future location.
- Returns
False if next_part_power was not set, otherwise True.
- pretend_min_part_hours_passed()¶
Override min_part_hours by marking all partitions as having been moved 255 hours ago and last move epoch to ‘the beginning of time’. This can be used to force a full rebalance on the next call to rebalance.
- rebalance(seed=None)¶
Rebalance the ring.
This is the main work function of the builder, as it will assign and reassign partitions to devices in the ring based on weights, distinct zones, recent reassignments, etc.
The process doesn’t always perfectly assign partitions (that’d take a lot more analysis and therefore a lot more time – I had code that did that before). Because of this, it keeps rebalancing until the device skew (number of partitions a device wants compared to what it has) gets below 1% or doesn’t change by more than 1% (only happens with a ring that can’t be balanced no matter what).
- Parameters
seed – a value for the random seed (optional)
- Returns
(number_of_partitions_altered, resulting_balance, number_of_removed_devices)
- remove_dev(dev_id)¶
Remove a device from the ring.
Note
This will not rebalance the ring immediately as you may want to make multiple changes for a single rebalance.
- Parameters
dev_id – device id
- save(builder_file)¶
Serialize this RingBuilder instance to disk.
- Parameters
builder_file – path to builder file to save
- search_devs(search_values)¶
Search devices by parameters.
- Parameters
search_values – a dictionary with search values to filter devices, supported parameters are id, region, zone, ip, port, replication_ip, replication_port, device, weight, meta
- Returns
list of device dicts
- set_dev_region(dev_id, region)¶
Set the region of a device. This should be called rather than just altering the region key in the device dict directly, as the builder will need to rebuild some internal state to reflect the change.
Note
This will not rebalance the ring immediately as you may want to make multiple changes for a single rebalance.
- Parameters
dev_id – device id
region – new region for device
- set_dev_weight(dev_id, weight)¶
Set the weight of a device. This should be called rather than just altering the weight key in the device dict directly, as the builder will need to rebuild some internal state to reflect the change.
Note
This will not rebalance the ring immediately as you may want to make multiple changes for a single rebalance.
- Parameters
dev_id – device id
weight – new weight for device
- set_dev_zone(dev_id, zone)¶
Set the zone of a device. This should be called rather than just altering the zone key in the device dict directly, as the builder will need to rebuild some internal state to reflect the change.
Note
This will not rebalance the ring immediately as you may want to make multiple changes for a single rebalance.
- Parameters
dev_id – device id
zone – new zone for device
- set_overload(overload)¶
- set_replicas(new_replica_count)¶
Changes the number of replicas in this ring.
If the new replica count is sufficiently different that self._replica2part2dev will change size, sets self.devs_changed. This is so tools like bin/swift-ring-builder can know to write out the new ring rather than bailing out due to lack of balance change.
- to_dict()¶
Returns a dict that can be used later with copy_from to restore a RingBuilder. swift-ring-builder uses this to pickle.dump the dict to a file and later load that dict into copy_from.
- validate(stats=False)¶
Validate the ring.
This is a safety function to try to catch any bugs in the building process. It ensures partitions have been assigned to real devices, aren’t doubly assigned, etc. It can also optionally check the even distribution of partitions across devices.
- Parameters
stats – if True, check distribution of partitions across devices
- Returns
if stats is True, a tuple of (device_usage, worst_stat), else (None, None). device_usage[dev_id] will equal the number of partitions assigned to that device. worst_stat will equal the number of partitions the worst device is skewed from the number it should have.
- Raises
RingValidationError – problem was found with the ring.
- weight_of_one_part()¶
Returns the weight of each partition as calculated from the total weight of all the devices.
- exception swift.common.ring.builder.RingValidationWarning¶
Bases:
Warning
Composite Ring Builder¶
A standard ring built using the ring-builder will attempt to randomly disperse replicas or erasure-coded fragments across failure domains, but does not provide any guarantees such as placing at least one replica of every partition into each region. Composite rings are intended to provide operators with greater control over the dispersion of object replicas or fragments across a cluster, in particular when there is a desire to have strict guarantees that some replicas or fragments are placed in certain failure domains. This is particularly important for policies with duplicated erasure-coded fragments.
A composite ring comprises two or more component rings that are combined to
form a single ring with a replica count equal to the sum of replica counts
from the component rings. The component rings are built independently, using
distinct devices in distinct regions, which means that the dispersion of
replicas between the components can be guaranteed. The composite_builder
utilities may then be used to combine components into a composite ring.
For example, consider a normal ring ring0 with replica count of 4 and devices in two regions r1 and r2. Despite the best efforts of the ring-builder, it is possible for there to be three replicas of a particular partition placed in one region and only one replica placed in the other region.
For example:
part_n -> r1z1h110/sdb r1z2h12/sdb r1z3h13/sdb r2z1h21/sdb
Now consider two normal rings each with replica count of 2: ring1 has devices in only r1; ring2 has devices in only r2. When these rings are combined into a composite ring then every partition is guaranteed to be mapped to two devices in each of r1 and r2, for example:
part_n -> r1z1h10/sdb r1z2h20/sdb r2z1h21/sdb r2z2h22/sdb
          |_____________________| |_____________________|
                     |                       |
                   ring1                   ring2
The dispersion of partition replicas across failure domains within each of the two component rings may change as they are modified and rebalanced, but the dispersion of replicas between the two regions is guaranteed by the use of a composite ring.
For rings to be formed into a composite they must satisfy the following requirements:
All component rings must have the same part power (and therefore number of partitions)
All component rings must have an integer replica count
Each region may only be used in one component ring
Each device may only be used in one component ring
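The four requirements above can be sketched as a single validation pass. This is an illustration under stated assumptions: each component is represented by a plain dict with part_power, replicas and devs keys, a device is identified by its (ip, port, device) tuple, and sketch_check_components is a hypothetical name.

```python
def sketch_check_components(components):
    """Raise ValueError if the components cannot form a composite ring."""
    if len({comp["part_power"] for comp in components}) != 1:
        raise ValueError("all component rings must have the same part power")
    seen_regions = set()
    seen_devices = set()
    for comp in components:
        if int(comp["replicas"]) != comp["replicas"]:
            raise ValueError("replica count must be an integer")
        regions = {dev["region"] for dev in comp["devs"]}
        # Assumption: (ip, port, device) identifies one physical device.
        devices = {(d["ip"], d["port"], d["device"]) for d in comp["devs"]}
        if regions & seen_regions:
            raise ValueError("a region is used in more than one component")
        if devices & seen_devices:
            raise ValueError("a device is used in more than one component")
        seen_regions |= regions
        seen_devices |= devices


def make_dev(region, ip):
    return {"region": region, "ip": ip, "port": 6200, "device": "sdb"}


ring1 = {"part_power": 8, "replicas": 2, "devs": [make_dev(1, "10.0.1.1")]}
ring2 = {"part_power": 8, "replicas": 2, "devs": [make_dev(2, "10.0.2.1")]}
sketch_check_components([ring1, ring2])  # passes silently
```

A component with a fractional replica count, a different part power, or a device in a region already claimed by another component would raise ValueError instead.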
Under the hood, the composite ring has a _replica2part2dev_id table that is the union of the tables from the component rings. Whenever the component rings are rebalanced, the composite ring must be rebuilt. There is no dynamic rebuilding of the composite ring.
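The union of tables can be pictured as stacking the replica rows of each component in order. The sketch below is a simplification with hypothetical names; it assumes each component's device ids are dense (a device's id equals its position in the devs list) and renumbers them with an offset so that ids stay unique in the composite.

```python
def sketch_compose_tables(components):
    """Combine (devs, replica2part2dev_id) pairs into one composite.

    Assumption: device ids within each component run 0..n-1, so adding
    the number of devices already collected keeps ids unique.
    """
    composite_devs = []
    composite_table = []
    for devs, table in components:
        offset = len(composite_devs)
        composite_devs.extend(dict(dev, id=dev["id"] + offset) for dev in devs)
        composite_table.extend(
            [dev_id + offset for dev_id in row] for row in table
        )
    return composite_devs, composite_table


ring1 = ([{"id": 0, "region": 1}, {"id": 1, "region": 1}], [[0, 1], [1, 0]])
ring2 = ([{"id": 0, "region": 2}, {"id": 1, "region": 2}], [[1, 1], [0, 0]])
devs, table = sketch_compose_tables([ring1, ring2])
```

The composite has four replica rows (two from each component), and rows 0 to 1 only ever reference region 1 devices while rows 2 to 3 only reference region 2 devices. Swapping the order of the components would swap those rows, which is why component order matters.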
Note
The order in which component rings are combined into a composite ring is very significant because it determines the order in which the Ring.get_part_nodes() method will provide primary nodes for the composite ring and consequently the node indexes assigned to the primary nodes. For an erasure-coded policy, inadvertent changes to the primary node indexes could result in large amounts of data movement due to fragments being moved to their new correct primary.
The id
of each component RingBuilder is therefore stored in metadata of
the composite and used to check for the component ordering when the same
composite ring is re-composed. RingBuilder id
s are normally assigned
when a RingBuilder instance is first saved. Older RingBuilder instances
loaded from file may not have an id
assigned and will need to be saved
before they can be used as components of a composite ring. This can be
achieved by, for example:
swift-ring-builder <builder-file> rebalance --force
- class swift.common.ring.composite_builder.CompositeRingBuilder(builder_files=None)¶
Bases:
object
Provides facility to create, persist, load, rebalance and update composite rings, for example:
# create a CompositeRingBuilder instance with a list of
# component builder files
crb = CompositeRingBuilder(["region1.builder", "region2.builder"])

# perform a cooperative rebalance of the component builders
crb.rebalance()

# call compose which will make a new RingData instance
ring_data = crb.compose()

# save the composite ring file
ring_data.save("composite_ring.gz")

# save the composite metadata file
crb.save("composite_builder.composite")

# load the persisted composite metadata file
crb = CompositeRingBuilder.load("composite_builder.composite")

# compose (optionally update the paths to the component builder files)
crb.compose(["/path/to/region1.builder", "/path/to/region2.builder"])
Composite ring metadata is persisted to file in JSON format. The metadata has the structure shown below (using example values):
{
  "version": 4,
  "components": [
    {
      "version": 3,
      "id": "8e56f3b692d43d9a666440a3d945a03a",
      "replicas": 1
    },
    {
      "version": 5,
      "id": "96085923c2b644999dbfd74664f4301b",
      "replicas": 1
    }
  ],
  "component_builder_files": {
    "8e56f3b692d43d9a666440a3d945a03a": "/etc/swift/region1.builder",
    "96085923c2b644999dbfd74664f4301b": "/etc/swift/region2.builder"
  },
  "serialization_version": 1,
  "saved_path": "/etc/swift/multi-ring-1.composite"
}
version is an integer representing the current version of the composite ring, which increments each time the ring is successfully (re)composed.
components is a list of dicts, each of which describes relevant properties of a component ring
component_builder_files is a dict that maps component ring builder ids to the file from which that component ring builder was loaded.
serialization_version is an integer constant.
saved_path is the path to which the metadata was written.
- Parameters
builder_files – a list of paths to builder files that will be used as components of the composite ring.
- can_part_move(part)¶
Check with all component builders that it is ok to move a partition.
- Parameters
part – The partition to check.
- Returns
True if all component builders agree that the partition can be moved, False otherwise.
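The "all component builders agree" rule can be sketched as below. The SketchBuilder class is a hypothetical stand-in that only tracks how many hours ago each partition last moved; it is not the real builder class.

```python
class SketchBuilder:
    """Hypothetical stand-in for a component builder."""

    def __init__(self, min_part_hours, last_part_moves):
        self.min_part_hours = min_part_hours
        # Hours since each partition was last moved by this builder.
        self.last_part_moves = last_part_moves

    def can_part_move(self, part):
        """True if this builder alone would allow the move."""
        return self.last_part_moves[part] >= self.min_part_hours


def sketch_can_part_move(builders, part):
    """True only if every component builder allows moving ``part``."""
    return all(builder.can_part_move(part) for builder in builders)


region1 = SketchBuilder(min_part_hours=1, last_part_moves=[5, 0])
region2 = SketchBuilder(min_part_hours=1, last_part_moves=[3, 4])
movable = [sketch_can_part_move([region1, region2], p) for p in range(2)]
```

Partition 1 was moved by region1 less than min_part_hours ago, so neither component may move it yet, even though region2 on its own would allow it.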
- compose(builder_files=None, force=False, require_modified=False)¶
Builds a composite ring using component ring builders loaded from a list of builder files and updates composite ring metadata.
If a list of component ring builder files is given then that will be used to load component ring builders. Otherwise, component ring builders will be loaded using the list of builder files that was set when the instance was constructed.
In either case, if metadata for an existing composite ring has been loaded then the component ring builders are verified for consistency with the existing composition of builders, unless the optional force flag is set True.
- Parameters
builder_files – Optional list of paths to ring builder files that will be used to load the component ring builders. Typically the list of component builder files will have been set when the instance was constructed, for example when using the load() class method. However, this parameter may be used if the component builder file paths have moved, or, in conjunction with the force parameter, if a new list of component builders is to be used.
force – if True then do not verify given builders are consistent with any existing composite ring (default is False).
require_modified – if True and force is False, then verify that at least one of the given builders has been modified since the composite ring was last built (default is False).
- Returns
An instance of swift.common.ring.ring.RingData
- Raises
ValueError – if the component ring builders are not suitable for composing with each other, or are inconsistent with any existing composite ring, or if require_modified is True and there has been no change with respect to the existing ring.
- classmethod load(path_to_file)¶
Load composite ring metadata.
- Parameters
path_to_file – Absolute path to a composite ring JSON file.
- Returns
an instance of CompositeRingBuilder
- Raises
IOError – if there is a problem opening the file
ValueError – if the file does not contain valid composite ring metadata
- load_components(builder_files=None, force=False, require_modified=False)¶
Loads component ring builders from builder files. Previously loaded component ring builders will be discarded and reloaded.
If a list of component ring builder files is given then that will be used to load component ring builders. Otherwise, component ring builders will be loaded using the list of builder files that was set when the instance was constructed.
In either case, if metadata for an existing composite ring has been loaded then the component ring builders are verified for consistency with the existing composition of builders, unless the optional force flag is set True.
- Parameters
builder_files – Optional list of paths to ring builder files that will be used to load the component ring builders. Typically the list of component builder files will have been set when the instance was constructed, for example when using the load() class method. However, this parameter may be used if the component builder file paths have moved, or, in conjunction with the force parameter, if a new list of component builders is to be used.
force – if True then do not verify given builders are consistent with any existing composite ring (default is False).
require_modified – if True and force is False, then verify that at least one of the given builders has been modified since the composite ring was last built (default is False).
- Returns
A tuple of (builder files, loaded builders)
- Raises
ValueError – if the component ring builders are not suitable for composing with each other, or are inconsistent with any existing composite ring, or if require_modified is True and there has been no change with respect to the existing ring.
- rebalance()¶
Cooperatively rebalances all component ring builders.
This method does not change the state of the composite ring; a subsequent call to compose() is required to generate updated composite RingData.
- Returns
A list of dicts, one per component builder, each having the following keys:
‘builder_file’ maps to the component builder file;
‘builder’ maps to the corresponding instance of swift.common.ring.builder.RingBuilder;
‘result’ maps to the results of the rebalance of that component i.e. a tuple of: (number_of_partitions_altered, resulting_balance, number_of_removed_devices)
The list has the same order as components in the composite ring.
- Raises
RingBuilderError – if there is an error while rebalancing any component builder.
- save(path_to_file)¶
Save composite ring metadata to given file. See CompositeRingBuilder for details of the persisted metadata format.
- Parameters
path_to_file – Absolute path to a composite ring file
- Raises
ValueError – if no composite ring has been built yet with this instance
- to_dict()¶
Transform the composite ring attributes to a dict. See CompositeRingBuilder for details of the persisted metadata format.
- Returns
a composite ring metadata dict
- update_last_part_moves()¶
Updates the record of how many hours ago each partition was moved in all component builders.
- class swift.common.ring.composite_builder.CooperativeRingBuilder(part_power, replicas, min_part_hours, parent_builder)¶
Bases:
swift.common.ring.builder.RingBuilder
A subclass of RingBuilder that participates in cooperative rebalance.
During rebalance this subclass will consult with its parent_builder before moving a partition. The parent_builder may in turn check with co-builders (including this instance) to verify that none have moved that partition in the last min_part_hours.
- Parameters
part_power – number of partitions = 2**part_power.
replicas – number of replicas for each partition.
min_part_hours – minimum number of hours between partition changes.
parent_builder – an instance of CompositeRingBuilder.
- can_part_move(part)¶
Check that in the context of this builder alone it is ok to move a partition.
- Parameters
part – The partition to check.
- Returns
True if the partition can be moved, False otherwise.
- update_last_part_moves()¶
Updates the record of how many hours ago each partition was moved in this builder.
- swift.common.ring.composite_builder.check_against_existing(old_composite_meta, new_composite_meta)¶
Check that the given builders and their order are the same as that used to build an existing composite ring. Return True if any of the given builders has been modified with respect to its state when the given old_composite_meta was created.
- Parameters
old_composite_meta – a dict of the form returned by _make_composite_meta()
new_composite_meta – a dict of the form returned by _make_composite_meta()
- Returns
True if any of the components has been modified, False otherwise.
- Raises
ValueError – if proposed new components do not match any existing components.
- swift.common.ring.composite_builder.check_builder_ids(builders)¶
Check that all builders in the given list have ids assigned and that no id appears more than once in the list.
- Parameters
builders – a list of swift.common.ring.builder.RingBuilder instances
- Raises
ValueError – if any builder id is missing or repeated
- swift.common.ring.composite_builder.check_for_dev_uniqueness(builders)¶
Check that no device appears in more than one of the given list of builders.
- Parameters
builders – a list of swift.common.ring.builder.RingBuilder instances
- Raises
ValueError – if the same device is found in more than one builder
- swift.common.ring.composite_builder.check_same_builder(old_component, new_component)¶
Check that the given new_component metadata describes the same builder as the given old_component metadata. The new_component builder does not necessarily need to be in the same state as when the old_component metadata was created to satisfy this check e.g. it may have changed devs and been rebalanced.
- Parameters
old_component – a dict of metadata describing a component builder
new_component – a dict of metadata describing a component builder
- Raises
ValueError – if the new_component is not the same as that described by the old_component
- swift.common.ring.composite_builder.compose_rings(builders)¶
Given a list of component ring builders, perform validation on the list of builders and return a composite RingData instance.
- Parameters
builders – a list of swift.common.ring.builder.RingBuilder instances
- Returns
a new RingData instance built from the component builders
- Raises
ValueError – if the builders are invalid with respect to each other
- swift.common.ring.composite_builder.is_builder_newer(old_component, new_component)¶
Return True if the given builder has been modified with respect to its state when the given old_component metadata was created.
- Parameters
old_component – a dict of metadata describing a component ring
new_component – a dict of metadata describing a component ring
- Returns
True if the builder has been modified, False otherwise.
- Raises
ValueError – if the version of the new_component is older than the version of the existing component.
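The documented contract of is_builder_newer() can be sketched as a comparison of version numbers. This is an assumption-based illustration: it supposes that the component metadata dicts carry the version key shown in the composite metadata example, and sketch_is_builder_newer is a hypothetical name.

```python
def sketch_is_builder_newer(old_component, new_component):
    """Compare two component metadata dicts by their 'version' key."""
    old_version = old_component["version"]
    new_version = new_component["version"]
    if new_version < old_version:
        raise ValueError(
            "component %s has version %d, older than existing version %d"
            % (new_component["id"], new_version, old_version)
        )
    return new_version > old_version


old = {"id": "8e56f3b692d43d9a666440a3d945a03a", "version": 3, "replicas": 1}
unchanged = sketch_is_builder_newer(old, dict(old))
modified = sketch_is_builder_newer(old, dict(old, version=4))
```

An unchanged builder reports False and a builder with a higher version reports True, while passing a component with a lower version than the recorded one raises ValueError.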
- swift.common.ring.composite_builder.pre_validate_all_builders(builders)¶
Pre-validation for all component ring builders that are to be included in the composite ring. Checks that all component rings are valid with respect to each other.
- Parameters
builders – a list of swift.common.ring.builder.RingBuilder instances
- Raises
ValueError – if the builders are invalid with respect to each other