This work is licensed under a Creative Commons Attribution 3.0
Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode
Data hosted on long-term storage systems experience gradual changes in access patterns as part of their information lifecycles. For example, empirical studies by companies such as Facebook show that as image data age beyond their creation times, they become increasingly unlikely to be accessed by users, with access rates at times dropping exponentially [1]. Long retention periods, as is the case with data stored on cold storage systems like Swift, increase the likelihood of such changes.
Tiering is an important feature provided by many traditional file and block storage systems to deal with changes in data “temperature”. It enables seamless movement of inactive data from high-performance storage media to low-cost, high-capacity storage media to meet customers’ TCO (total cost of ownership) requirements. As scale-out object storage systems like Swift start to natively support multiple media types (SSD, HDD, tape) and different storage policies (replication, erasure coding), it becomes imperative to complement the wide range of available storage tiers, both virtual and physical, with automated data tiering.
Swift users and operators can adapt to changes in the access characteristics of objects by transparently converting their storage policies, with the goal of matching overall business needs ($/GB, performance, availability) with where and how the objects are stored.
Here are some examples of how objects can be moved between Swift containers of different storage policies as they age:
[SSD-based container] -> [HDD-based container]
[HDD-based container] -> [Tape-based container]
[Replication policy container] -> [Erasure coded policy container]
In some customer environments, a Swift container may not be the last storage tier. Examples of archival-class stores lower in cost than Swift include specialized tape-based systems [2] and public cloud archival solutions such as Amazon Glacier and Google Nearline storage. Analogous to the tiering feature proposed here, Amazon S3 already has built-in support for moving objects between S3 and Glacier based on user-defined rules. Red Hat Ceph has recently added tiering capabilities as well.
The main goal of this document is to propose a tiering feature in Swift that enables seamless movement of objects between containers belonging to different storage policies. It is “seamless” because users will not experience any disruption in namespace, access API, or availability of the objects subject to tiering.
Through new Swift API enhancements, Swift users and operators alike will have the ability to specify a tiering relationship between two containers and the associated data movement rules.
The focus of this proposal is to identify, create, and bring together the necessary building blocks for a baseline tiering implementation natively within Swift. While this narrow scope is intentional, the expectation is that the baseline implementation will lay the foundation for, and not preclude, more advanced tiering features in the future.
The following in-progress Swift features (aka specs) have been identified as core dependencies for this tiering proposal.
A few other specs are classified as nice-to-have dependencies, meaning that if they evolve into full implementations, we will be able to demonstrate the tiering feature with advanced use cases and capabilities. However, they are not considered mandatory requirements for the first version of tiering.
The proposed tiering implementation depends on several building blocks. Some, like the requisite API changes, are unique to tiering and will be described here in their entirety. Others, like symlinks, are independent features with uses beyond tiering; instead of re-inventing the wheel, the tiering implementation aims to leverage specific constructs that will become available through these in-progress features.
For a quick overview of the tiering implementation, please refer to the Figure (images/tiering_overview.png). It highlights the flow of actions taking place within the proposed tiering engine.
1. A Swift client creates a tiering relationship between two Swift containers by marking the source container with appropriate metadata.
2. A background process named tiering-coordinator examines the source container and iterates through its objects.
3. The tiering-coordinator identifies candidate objects for movement and de-stages each object to the target container by issuing a copy request to an object server.
4. After an object is copied, the tiering-coordinator replaces it with a symlink in the source container pointing to the corresponding object in the target container.
Swift clients will be able to create a tiering relationship between two containers, i.e., source and target containers, by adding the following metadata to the source container.
X-Container-Tiering-Target: <target_container_name>
X-Container-Tiering-Age: <threshold_object_age>
The metadata values can be set when the source container is created (PUT) or later as part of a container metadata update (POST). Object age refers to the time elapsed since the object’s creation time (creation time is stored with the object in the ‘X-Timestamp’ header).
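As an illustration, setting the rule with python-swiftclient could look like the sketch below. The X-Container-Tiering-* headers are the ones proposed in this document and do not exist in vanilla Swift today; the endpoint, the credentials, and the assumption that the container-level age threshold is expressed in minutes (as in the per-object variant) are placeholders.

    from swiftclient.client import Connection

    conn = Connection(authurl='http://localhost:8080/auth/v1.0',
                      user='test:tester', key='testing')

    # Set the tiering rule at container creation time (PUT) ...
    conn.put_container('source_container', headers={
        'X-Container-Tiering-Target': 'target_container',
        'X-Container-Tiering-Age': '43200',   # e.g. 30 days, assuming minutes
    })

    # ... or attach/update it later with a metadata POST.
    conn.post_container('source_container', headers={
        'X-Container-Tiering-Age': '86400',   # raise the threshold to 60 days
    })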
The user semantics of setting the above container metadata are as follows. When objects in the source container become older than the specified threshold, they become candidates for being de-staged to the target container. There are no guarantees on exactly when they will be moved, or on the precise location of the objects at any given time. Swift will operate on them asynchronously and relocate objects based on the user-specified tiering rules. Once the tiering metadata is set on the source container, the user can expect levels of performance, reliability, etc. for its objects commensurate with the storage policy of either the source or the target container.
One can override the tiering metadata for individual objects in the source container by setting the following per-object metadata:
X-Object-Tiering-Target: <target_container_name>
X-Object-Tiering-Age: <object_age_in_minutes>
Tiering metadata set on an object takes precedence over the tiering metadata set on the hosting container. However, if a container is not tagged with any tiering metadata, the objects inside it will not be considered for tiering, regardless of whether they are individually tagged with tiering-related metadata. Also, if the tiering age threshold in the object metadata is lower than the value set on the container, it will not take effect until the container’s age criterion is met.
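To make the precedence rules concrete, here is a small Python sketch; the helper name and the resolution logic are illustrative assumptions derived from the semantics above, not proposed code.

    def effective_tiering_rule(container_meta, object_meta):
        """Return (target, age_threshold), or None if the object is not tiered."""
        # A container without tiering metadata disables tiering for all of
        # its objects, even individually tagged ones.
        if 'X-Container-Tiering-Target' not in container_meta:
            return None
        target = object_meta.get('X-Object-Tiering-Target',
                                 container_meta['X-Container-Tiering-Target'])
        container_age = int(container_meta['X-Container-Tiering-Age'])
        object_age = int(object_meta.get('X-Object-Tiering-Age', container_age))
        # A per-object threshold lower than the container's only takes effect
        # once the container's criterion is also met, so the larger of the
        # two thresholds effectively wins.
        return target, max(object_age, container_age)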
An important invariant preserved by the tiering feature is the namespace of objects. As will be explained in later sections, after objects are moved they will be replaced immediately by symlinks that will allow users to continue foreground operations on objects as if no migrations have taken place. Please refer to section 7 on open questions for further commentary on the API topic.
To summarize, here are the steps that a Swift user must perform in order to initiate tiering of objects from a source container (S) to a target container (T) over time.
1. Create containers S and T with the desired storage policies, say replication and erasure coding respectively.
2. Set the tiering-related metadata (X-Container-Tiering-*) on container S as described earlier in this section.
3. Deposit objects into container S.
4. If needed, override the default container settings for individual objects inside container S by setting object metadata (X-Object-Tiering-*).
It will also be possible to create cascading tiering relationships between more than two containers. For example, a sequence of tiering relationships between containers C1 -> C2 -> C3 can be established by setting appropriate tiering metadata on C1 and C2. When an object is old enough to be moved from C1, it will be deposited in C2. The timer will then start on the moved object in C2 and depending on the age settings on C2, the object will eventually be migrated to C3.
The tiering-coordinator is a background process similar to container-sync, container-reconciler and other container-* processes running on each container server. We can potentially re-use one of the existing container processes, specifically either container-sync or container-reconciler to perform the job of tiering-coordinator, but for the purposes of this discussion it will be assumed that it is a separate process.
The key actions performed by the tiering-coordinator are: (a) scanning container DBs for tiering metadata, (b) identifying candidate objects that meet the tiering criteria, (c) issuing copy requests for the candidate objects, and (d) replacing moved objects with symlinks.
We will discuss (a) and (b) in this section and cover (c) and (d) in subsequent sections. Note that in the first version of tiering, only one metric (object age) will be used to determine the eligibility of an object for migration.
The tiering-coordinator performs its operations in a series of rounds. In each round, it iterates through the containers whose SQLite DBs it has direct access to on the container server it is running on, checking whether each container has the X-Container-Tiering-* metadata. If present, it starts the scanning process to identify candidate objects. The scanning process leverages a convenient (but not essential) property of the container DB: objects are listed in chronological order of their creation times. That is, the first index in the container DB points to the object with the oldest creation time, followed by the next oldest, and so on. As such, the scanning process described below is optimized for the object age criterion chosen for the tiering v1 implementation. For extending it to other tiering metrics, we refer the reader to the discussion in section 6.1.
Each container DB will have two persistent markers to track the progress of tiering: tiering_sync_start and tiering_sync_end. The marker tiering_sync_start refers to the index in the container DB up to which objects have already been processed. The marker tiering_sync_end refers to the index beyond which objects have not yet been considered for tiering. All the objects that fall between the two markers are the ones for which tiering is currently in progress. Note that persisting the markers in the container DB helps with quickly resuming previous work in the event of a container server crash or reboot.
When a container is selected for tiering for the first time, both markers are initialized to -1. If the first object is old enough to meet the X-Container-Tiering-Age criterion, tiering_sync_start is set to 0. Then the second marker, tiering_sync_end, is advanced to the lesser of two values: (i) tiering_sync_start + tier_max_objects_per_round (the latter will be a configurable value in /etc/swift/container.conf), or (ii) the largest index in the container DB whose corresponding object meets the tiering age criterion.
The above marker settings ensure two invariants. First, all objects between (and including) tiering_sync_start and tiering_sync_end are candidates for moving to the target container. Second, the number of objects processed on the container in a single round is bounded by the configuration parameter (tier_max_objects_per_round, say 200). This ensures that the coordinator process round-robins effectively among all containers on the server in each round, without spending an undue amount of time on only a few.
After the markers are fixed, the tiering-coordinator issues a copy request for each object within the range. When the copy requests are completed, it updates tiering_sync_start = tiering_sync_end and moves on to the next container. When the tiering-coordinator revisits the same container after completing the current round, it restarts the scanning routine described above from tiering_sync_start = tiering_sync_end (except that this time the markers are not both -1).
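A minimal sketch of one such round is shown below, assuming a hypothetical broker interface for reading rows in index order and persisting the markers; the real container DB API will differ.

    import time

    TIER_MAX_OBJECTS_PER_ROUND = 200   # tier_max_objects_per_round in container.conf

    def run_tiering_round(broker, age_threshold_minutes):
        # Resume just past the work recorded by the persistent markers
        # (both markers start at -1 for a container never scanned before).
        cutoff = time.time() - age_threshold_minutes * 60
        start = broker.tiering_sync_end + 1
        end = start - 1
        for index, row in broker.iter_objects(start):   # hypothetical iterator
            # Rows come back in creation-time order, so the first object
            # younger than the cutoff ends the scan for this round.
            if float(row['created_at']) > cutoff:
                break
            end = index
            if end - start + 1 >= TIER_MAX_OBJECTS_PER_ROUND:
                break
        if end < start:
            return                                      # nothing old enough yet
        broker.tiering_sync_start, broker.tiering_sync_end = start, end
        for _, row in broker.iter_objects(start, end):  # copy each candidate
            issue_copy_request(row)                     # hypothetical, see later sections
        broker.tiering_sync_start = broker.tiering_sync_end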
In a typical Swift cluster, each container DB is replicated three times and resides on multiple container servers. Therefore, without proper synchronization, tiering-coordinator processes can end up conflicting with one another by processing the same container and the same objects within it, potentially leading to race conditions with non-deterministic behavior. We can overcome this issue by adopting the divide-and-conquer approach employed by the container-sync process. The range of object indices between (tiering_sync_start, tiering_sync_end) can be initially split into as many disjoint regions as there are tiering-coordinator processes operating on the same container. As they work through the object indices, each process may additionally complete others’ portions depending on the collective progress. For a detailed description of how container-sync processes implicitly communicate and make group progress, please refer to [7].
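For illustration, the initial split could look like the following sketch; node_index and replica_count are assumptions, and the take-over logic from [7] by which processes complete each other’s regions is omitted.

    def my_region(start, end, node_index, replica_count):
        # Carve the candidate index range into disjoint, contiguous regions,
        # one per tiering-coordinator replica working on this container.
        total = end - start + 1
        per_node = -(-total // replica_count)   # ceiling division
        lo = start + node_index * per_node
        hi = min(lo + per_node - 1, end)
        return lo, hi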
For each candidate object that the tiering-coordinator deems eligible to move to the target container, it issues an ‘object copy’ request using an API call supported by the object servers. The API call will map to a method used by the object-transferrer daemons running on the object servers. The tiering-coordinator can select any of the object servers (by looking up the ring data structure corresponding to the object in the source container policy) as a destination for the request.
The object-transferrer daemon is intended to be optimized for converting an object from one storage policy to another. As per the ‘Changing policies’ spec, the object-transferrer daemon will be equipped with the right techniques to move objects between Replication -> EC, EC -> EC, etc. Alternatively, in the absence of the object-transferrer, the tiering-coordinator can simply use the server-side ‘COPY’ API that vanilla Swift exposes to regular clients: it can send a COPY request to a Swift proxy server to clone the source object into the target container, and the proxy server performs the copy by first reading the object (GET request) from one of the source object servers and then creating a copy (PUT request) on the target object servers. While this would work correctly for the purposes of the tiering-coordinator, using the object-transferrer interface is likely the better option: leveraging its specialized code through a well-defined interface for copying an object between containers of different storage policies will make the overall tiering process more efficient.
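For completeness, the fallback path through the proxy could look roughly like the sketch below. The COPY verb and the Destination header are part of Swift’s standard client API; the host, port, token, and account/container names are hypothetical placeholders.

    import http.client

    # Ask the proxy to clone /v1/AUTH_test/S/O into container T using
    # vanilla Swift's server-side COPY.
    conn = http.client.HTTPConnection('proxy.example.com', 8080)
    conn.request('COPY', '/v1/AUTH_test/S/O', headers={
        'X-Auth-Token': 'AUTH_tk_placeholder',
        'Destination': 'T/O',       # same object name in the target container
    })
    resp = conn.getresponse()
    assert resp.status == 201       # the proxy performed GET + PUT internally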
Here is an example interface represented by a function call in the object-transferrer code:
def copy_object(source_obj_path, target_obj_path):
    ...
The above method can be a wrapper over similar functionality used by the object-transferrer daemon. The tiering-coordinator will invoke this function through an HTTP call, for example:
copy_object('/A/S/O', '/A/T/O')
where S is the source container and T is the target container. Note that the object name in the target container will be the same as in the source container.
Upon receiving the copy request, the object server will first check whether the source path is a symlink object. If it is, the server responds with an error to indicate to the tiering-coordinator that a symlink already exists. This behavior ensures idempotence and guards against situations where the tiering-coordinator crashes and retries a previously completed object copy request. It also avoids tiering for sparse objects such as symlinks created by users. Second, the object server checks whether the source object has tiering metadata in the form of X-Object-Tiering-* that overrides the default tiering settings on the source container, and it performs or skips the object copy depending on the result.
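A sketch of these server-side checks is given below, reusing the effective_tiering_rule helper sketched earlier; the handler, its helper methods, and the use of the symlink spec’s X-Symlink-Target header are assumptions of this sketch.

    from swift.common.swob import HTTPConflict, HTTPCreated, HTTPPreconditionFailed

    def handle_tiering_copy(source_obj, container_meta):
        if source_obj.metadata.get('X-Symlink-Target'):
            # Already a symlink: either a previous round completed the copy
            # before crashing, or the user created the symlink themselves.
            # Refusing the copy keeps the operation idempotent.
            return HTTPConflict()
        rule = effective_tiering_rule(container_meta, source_obj.metadata)
        if rule is None or source_obj.age_minutes() < rule[1]:
            return HTTPPreconditionFailed()   # per-object override says no
        copy_to_policy(source_obj, rule[0])   # hypothetical transferrer call
        return HTTPCreated()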
After an object is successfully copied to the destination container, the tiering-coordinator will issue a ‘symlink create’ request to the proxy server to replace the source object with a reference to the destination object. Waiting until the object copy is completed before replacing it with a symlink ensures safety in case of failures: the system could end up with an extra target object without a symlink pointing to it, but not the converse, which would constitute data loss. Note that the symlink feature is currently work-in-progress and will also be available as an external API to Swift clients.
When the symlink is created by the tiering-coordinator, it needs to ensure that the original object’s ‘X-Timestamp’ value is preserved on the symlink object. It is therefore proposed that the symlink creation request carry the original time field (the tiering-coordinator can quickly read the original value from the container DB entry) as object user metadata, which the symlink middleware translates internally into a special sysmeta field. On subsequent user requests, the sysmeta field storing the correct creation timestamp is returned to the user.
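A sketch of such a request using Swift’s InternalClient follows; the X-Symlink-Target header is taken from the in-progress symlink spec, and the user-metadata key carrying the original timestamp is an assumption of this sketch.

    from io import BytesIO
    from swift.common.internal_client import InternalClient

    def create_tiering_symlink(swift, account, src_cont, tgt_cont, obj, orig_ts):
        # 'swift' is an InternalClient instance. A symlink object carries
        # no data of its own, hence the empty body.
        swift.upload_object(
            BytesIO(b''), account, src_cont, obj,
            headers={
                'X-Symlink-Target': '%s/%s' % (tgt_cont, obj),
                # Translated by the symlink middleware into a sysmeta field
                # so that users keep seeing the original creation time.
                'X-Object-Meta-Original-Timestamp': orig_ts,
            })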
With the symlink successfully created, Swift users can continue to issue object requests like GET, PUT to the original namespace /Account/Container/Object. The Symlink middleware will ensure that the swift users do not notice the presence of a symlink object unless a query parameter ‘?symlink=true’ [3] is explicitly provided with the object request.
Users can also continue to read and update object metadata as before. It is not entirely clear at the time of this writing whether the symlink object will store a copy of the user metadata in its own extended attributes or fetch the metadata from the referenced object on every HEAD/GET. We will defer to whichever implementation the symlink feature chooses to provide.
An interesting race condition is possible due to the time window between the object copy request and the symlink creation. If a Swift user issues a PUT request in that window, the newly written object will be overwritten by the internal symlink created by the tiering-coordinator. This is incorrect behavior that we need to protect against. We can use the same technique [8] (with the help of a second vector timestamp) that the container-reconciler uses to resolve a similar race condition: the tiering-coordinator, at symlink creation time, detects the race, undoes the COPY request, and deletes the object that was created in the destination container. Though this is wasted work, we expect such races to be rare; if the user conceives tiering rules properly, there ought to be little to no foreground traffic for an object that is being tiered.
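A sketch of the detection step, with a hypothetical broker lookup helper:

    def symlink_is_safe(broker, obj_name, copied_timestamp):
        # A newer row in the container DB means a client PUT landed after
        # the copy was taken; the coordinator must then delete the copy in
        # the target container instead of clobbering the fresh object.
        current = broker.get_object_record(obj_name)   # hypothetical lookup
        return current['created_at'] == copied_timestamp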
The first version of the tiering implementation will be heavily tailored (especially the scanning mechanism of the tiering-coordinator) to the object age criterion. The convenient property of container DBs, which store objects in the same order as they are created or overwritten, lends itself to very efficient linear scanning for candidate objects.
In the future, we should be able to support advanced criteria such as read frequency counts, object size, metadata-based selection, etc. For example, consider the following hypothetical criterion:
“Tier objects from container S to container T if older than 1 month AND size > 1GB AND tagged with metadata ‘surveillance-video’”
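Expressed as a predicate over a container DB row, such a criterion could look like the sketch below; the row field names and the metadata key are illustrative assumptions.

    ONE_MONTH_SECS = 30 * 24 * 3600

    def is_candidate(row, now):
        return (now - float(row['created_at']) > ONE_MONTH_SECS
                and row['size'] > 1024 ** 3                        # > 1 GB
                and row.get('X-Object-Meta-Tag') == 'surveillance-video')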
When the metadata search feature [5] is available in Swift, the tiering-coordinator should be able to run queries to quickly retrieve the set of object names that match ad-hoc criteria on both user and system metadata. As the metadata search feature evolves, we should be able to leverage it to add custom metadata, such as read counts, for our purposes.
The first implementation of tiering will only support object movement between Swift containers. In order to establish a tiering relationship between a swift container and an external storage backend, the backend must be mounted in Swift as a native container through the DiskFile API or other integration mechanisms. For instance, a target container fully hosted on GlusterFS or Seagate Kinetic drives can be created through Swift-on-file or Kinetic DiskFile implementations respectively.
The Swift community believes that a similar integration approach is necessary to support external storage systems as tiering targets. There is already work underway to integrate tape-based systems in Swift. In the same vein, future work is needed to integrate external systems like Amazon Glacier or vendor archival products via DiskFile drivers or other means.
This section is structured as a series of questions and possible answers. With more feedback from the Swift community, the open issues will be resolved and merged into the main document.
Q1: Can the target container exist on a different account than the source container?
Ans: The proposed API assumes that the target container is always on the same account as the source container. If this restriction is lifted, the proposed API needs to be modified appropriately.
Q2: When the client sets the tiering metadata on the source container, should the target container exist at that time? What if the user has no permissions on the target container? When is all the error checking done?
Ans: The error checking can be deferred to the tiering-coordinator process. The background process, upon detecting that the target container is unavailable, can skip performing any tiering activity on the source container and move on to the next container. However, it might be better to detect errors in the client path and report them early. If the latter approach is chosen, middleware functionality is needed to sanity-check the tiering metadata set on containers.
Q3: How is the target container presented to the client? Would it be just like any other container with read/write permissions?
Ans: The target container will be just like any other container. The client is responsible for manipulating the contents in the target container correctly. In particular, it should be aware that there might be symlinks in source container pointing to target objects. Deletions or overwrites of objects directly using the target container namespace could render some symlinks useless or obsolete.
Q4: What is the behavior when conflicting tiering metadata are set over a period of time? For example, if the tiering age threshold is increased on a container with a POST metadata operation, will previously de-staged objects be brought back to the source container to match the new tiering rule?
Ans: Perhaps not. The new tiering metadata should probably only be applied to objects that have not yet been processed by tiering-coordinator. Previous actions performed by tiering-coordinator based on older metadata need not be reversed.
Q5: When a user issues a PUT operation to an object that has been de-staged to the target container earlier, what is the behavior?
Ans: The default symlink behavior should apply but it’s not clear what it will be. Will an overwrite PUT cause the symlink middleware to delete both the symlink and the object being pointed to?
Q6: When a user issues a GET operation to an object that has been de-staged to the target container earlier, will it be promoted back to the source container?
Ans: The proposed implementation does not promote objects back to an upper tier transparently to the user. If needed, such behavior can easily be added with the help of a tiering middleware in the proxy server.
Q7: There is a mention of the ability to set cascading tiering relationships between multiple containers, C1 -> C2 -> C3. What if there is a cycle in this relationship graph?
Ans: A cycle should be prevented; otherwise we can run into at least one complicated situation where a symlink might point to an object on the same container with the same name, thereby overwriting the symlink! Cycles can be detected at tiering metadata creation time in the client path, with a tiering-specific middleware entrusted with cycle detection by iterating through the existing tiering relationships.
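A sketch of that middleware check, walking the chain of X-Container-Tiering-Target values before admitting a new rule; get_container_meta is a hypothetical lookup helper.

    def creates_cycle(new_source, new_target):
        seen = {new_source}
        current = new_target
        while current is not None:
            if current in seen:
                return True                      # admitting this rule would close a loop
            seen.add(current)
            meta = get_container_meta(current)   # hypothetical lookup
            current = meta.get('X-Container-Tiering-Target')
        return False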
Q8: Are there any unexpected interactions of tiering with existing or new features like SLO/DLO, encryption, container sharding, etc.?
Ans: SLO and DLO segments should continue to work as expected. If an object server receives an object copy request for an SLO manifest object from a tiering-coordinator, it will iteratively perform the copy for each constituent object, and each constituent object will be replaced by a symlink. Encryption should also work correctly, as it is almost entirely orthogonal to the tiering feature: the tiering engine treats each object as an opaque set of bytes and pays no heed to whether the object is ciphertext. Dealing with container sharding might be tricky. The tiering-coordinator expects to walk linearly through the indices of a container DB; if the container DB is fragmented and stored on many different container servers, the scanning process can become complicated. Any ideas there?