Cinder Volume Active/Active support - Cleanup

https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support

Right now cinder-volume service can run only in Active/Passive HA fashion.

One of the reasons for this is that we have no concept of a cluster of nodes that handle the same storage back-end, and we assume only one volume service can access a specific storage back-end.

Given this premise, current code handles the cleanup for failed volume services as if no other service is working with resources from his back-end, and that is problematic when there are other volume services working with those resources, as is the case on an Active/Active configuration.

This spec introduces a new cleanup mechanism and modifies current cleanup mechanism so proper cleanup is done regardless of cinder configuration, Active/Passive or Active/Active.

Problem description

Current Cinder code only supports Active/Passive configurations, so the cleanup takes that into account and cleans up resources from ongoing operations accordingly, but that is incompatible with an Active/Active deployment.

The incompatibility comes from the fact that volume services on startup look on the DB for resources that are in the middle of an operation and are from their own storage back-end - detected by the host field - and proceed to clean them up depending on the state they are in. For example a downloading volume will be changed to error since the download was interrupted and we cannot recover from it.

With the new job distribution mechanism the host field will contain the host configuration of the volume service that created the resource, but that resource may now be in use by another volume service from the same cluster, so we cannot just rely on this host field for cleanup, as it may lead to cleaning wrong resources or skipping the ones we should be cleaning.

When we are working with an Active/Active system we cannot just clean all resources from our storage-backend that are in an ongoing state, since they may be legitimate undergoing jobs being handled by other volume services.

We are going to forget for a moment how we are doing the cleanup right now and focus on the different cleanup scenarios we have to cover. One is when a volume service “dies” -by that we mean that it really stops working, or it is fenced- and failover boots another volume service to replace it as if it were the same service -having the same host and cluster configurations-, and the other scenario is when the service dies and no other service takes its place, or the service that takes its place shares the cluster configuration but has a different host.

Those are the cases we have to solve to be able to support Active/Active and Active/Pasive configurations with proper cleanups.

Use Cases

Operators that have hard requirements, SLA or other reasons, to have their cloud operational at all times or have higher throughput requirements will want to have the possibility to configure their deployments with an Active/Active configuration and have proper cleanup of resources when services die.

Proposed change

Since checking for the status and the host field of the resource is no longer enough to know if it needs cleanup -because the host field will be referring to the host configuration of the volume service that created the resource and not the owner of the resource as explained in the Job Distribution specs- we will create a new table to track which service from the cluster is working on each resource.

We’ll call this new table workers and it will include all resources that are being processed with cleanable operations, and therefore would require cleanup if the service that is doing the operation crashed.

When a cleanable job is requested by the API or any of the services -for example a volume deletion can be requested by the API or by the c-vol service during a migration- we will create a new row in the workers table with the resource we are working on and who is working on it. And once the operation has been completed -successfully or unsuccessfully- this row will be deleted to indicate processing has concluded and a cleanup will no longer be needed if the service dies.

We will not be adding a row for non cleanable operations and resources that are used in cleanable operations but won’t require cleanup, as this would create a significant increase in DB operations that would end up affecting performance of all operations.

These workers rows serve as flags for the cleanup mechanism to know it must check that resource in case of a crash and see if it needs cleanup. There can only exist 1 cleanable operation at a time for a given resource.

To ensure that both scenarios mentioned above are taken care of, we will have cleanup code on cinder-volume and Scheduler services.

Cinder-volume service cleanups will be similar to the ones we currently have on startup -init_host method- but with small modifications to use the workers_table so services can tell which resources require cleanup because they were left in the middle of an operation. With this we take care of one of the scenarios, but we still have to consider the case where no replacement volume service comes up with the same host configuration, and for that we will add a mechanism on the scheduler that will take care of requesting other volume service from the cluster, that manage the same backend, to do the cleaning for the fallen service.

The cleanup mechanism implemented on the scheduler will have manual and automatic options, manual option will require the caller to specify which services should be cleaned up using filters, and automatic operation will let the scheduler decide which services should be cleaned up based on their status and how long they have been down.

Automatic cleanup mechanism will consist of a periodic task that will sample services that are down, with a frequency of service_down_time seconds, and will proceed to clean up resources that were left by those services that are down after auto_cleanup_checks x service_down_time seconds have passed since the service went down.

Since we can have multiple Scheduler services and the cinder-volume service all trying to do the cleanup simultaneously, code needs to be able handle these situations.

On one hand, to prevent multiple Schedulers from cleaning the same services’s resources they will be reporting all automatic cleanup operations requested to the cinder-volumes to the other Scheduler services and will ask other scheduler services which services have already been cleaned on service start.

On the other hand, to prevent cleanup concurrency issues if a cleanup is requested on a service that is already being cleaned up, we will issue all cleanup operations with a timestamp indicating that only workers entries before that should be cleaned up, so when a service starts doing the cleanup for a resource it updates the entry an prevents additional cleanup operations on the resource.

Row deletion operations in workers table will be a real deletions in the DB, not soft deletes like we do for other tables, because the number of operations, and therefore of rows, will be quite high and because we will be setting constraints on the rows that would not hold true if we had the same resource multiple times (there are workarounds, but it doesn’t seem to be worth it).

Since these will be big, complex changes, we will not be enabling any kind of automatic cleanup by default, and it will need to be either enabled in the configuration using auto_cleanup_enabled option or triggered using the manual cleanup API -using filters- or the automatic cleanup API.

It will be possible to trigger the automatic cleanup mechanism via the API even when it is disabled, as the disabling only prevents it from being automatically triggered.

It is important to mention that using “reset-state” operation on any resource will remove any existing workers table entry in the DB.

When proceeding with a cleanup we will ensure that no other service is working on that resource (claiming the worker’s entry) and that the data on the workers entry is still valid for the given resource (status matches) since a user may have forcefully issued another action on the resource in the meantime..

Alternatives

There are multiple alternatives to proposed change, the most appealing ones are:

  • Use Tooz with a DLM that allows Leader Election to prevent more than one scheduler from doing cleanup of down services. Downsides to this solution are considerable:

    • Increased dependency on a DLM.

    • Limiting DLM choices since now it needs to have Leader Election functionality.

    • We will still need to let other schedulers know when the leader does cleanups because when electing a new leader will need this information to determine if down services have already been cleaned.

  • Create workers DB entries for every operation on a resource. Disadvantages of this alternative are:

    • Considerable performance impact.

    • Greatly increase cleanup mechanism complexity, as we would need to mark all entries as being processed by the service we are going to clean (this has its own complexity because multiple schedulers could be requesting it or a scheduler and the service itself), then see which of those resources would require cleanup according to the workers table and check if no other service is already working on that resource because a user decided to do a cleanup on his own (for example a force delete on a deleting resource) and if there’s no other service working on the resource and the resource has a status that is cleanable, then do the cleanup. Doing all this without races is quite complicated.

Data model impact

Create new workers table with following fields:

  • id: To uniquely identify each entry and speed up some operations

  • created_at: To mark when the job was started at the API

  • updated_at: To mark when the job was last touched (API, SCH, VOL)

  • deleted_at: Will not be used

  • resource_type: Resource type (Volume, Backup, Snapshot…)

  • resource_id: UUID of the resource

  • status: The status that should be cleaned on service failure

  • service_id: service working on the resource

REST API impact

Two new admin only API endpoint will be created, /workers/cleanup and /workers/auto_cleanup.

For /workers/cleanup endpoint we will be able to supply filtering parameters, but if no arguments are provided cleanup will issue a clean message for all services that are down. But we can restrict which services we want to be cleaned using parameters service_id, cluster_name, host, binary, disabled.

Cleaning specific resources is also possible using resource_type and resource_id parameters.

Cleanup cannot be triggered during a cloud upgrade, but a restarted service will still cleanup it’s own resources during an upgrade.

Both API endpoints will return a dictionary with 2 lists, one with services that have been issued a cleanup request (cleaning) and another list with services that cannot be cleaned right now because there is no alternative service to do the cleanup in that cluster (unavailable), that way the caller can know which services will be cleaned up.

Data returned for each service in the lists are id, name, and state fields.

Security impact

None

Notifications impact

None

Other end user impact

None

Performance Impact

Small impact on cleanable operations since we have to use the workers table to flag that we are working on the resource.

Other deployer impact

None

Developer impact

Any developer that wants to add new resources requiring cleanup or wants add cleanup for the status -new or existing- of an existing resource will have to use the new mechanism to mark the resource as cleanable, add which states are cleanable, and add the cleanup code.

Implementation

Assignee(s)

Primary assignee:

Gorka Eguileor (geguileo)

Other contributors:

Michal Dulko (dulek) Anyone is welcome to help

Work Items

  • Make DB changes to add the new workers table.

  • Implement adding rows to workers table.

  • Change host_init to use an RPC call for the cleanup.

  • Modify Scheduler code to do cleanups.

  • Create devref explaining requirements to add cleanup resources/statuses.

Dependencies

Job Distribution:
  • This depends on the job distribution mechanism so the cleanup can be done by any available service from the same cluster.

Testing

Unittests for new cleanup behavior.

Documentation Impact

Document new configuration option auto_cleanup_enabled and auto_cleanup_checks as well as the cleanup mechanism.

Document behavior of reset-state on Active-Active deployment.

References

General Description for HA A/A: https://review.openstack.org/232599

Job Distribution for HA A/A: https://review.openstack.org/327283