User facing “Failure” Message and Event Viewer

For quite some time, OpenStack services have wanted to be able to send messages, especially error messages, to API end users (by user I do not mean the operator, but the user that is interacting with the API).

If a user performs an operation and it fails or hangs, there must be some interface through which the user can see the reason for this behavior.

This spec is therefore about letting users see error messages for asynchronous operations, either directly through the APIs or through a new “Event Viewer” tab in Horizon or a Horizon plugin.

It focuses on failed operations and the causes of their failure.

https://blueprints.launchpad.net/cinder/+spec/summarymessage

Problem description

If operations such as create volume or create snapshot fail, the user gets no detailed information about the failure. Sometimes only the operation status is updated to error, failed, etc., with no further information given to the user.

In some cases it is worse than this. For example:

  1. If RabbitMQ is inactive and the user tries to create a volume, Horizon hangs forever waiting for a response from the API.

  2. If RabbitMQ is active but the recipient service is not running and the user tries to create a volume, the API hangs forever waiting for a response from RabbitMQ.

Few resources across the OpenStack projects report errors to end users for asynchronous operations, and those that do are inconsistent with each other. In addition to providing a mechanism that enables error reporting in a consistent way across OpenStack, the solution must also accommodate a deployment of Cinder that contains no other OpenStack service.

Use Cases

Motivation for this blueprint:

  1. To help the admin debug failed operations with less log filtering.

  2. To notify the admin to start services that failed due to abnormal conditions.

  3. To inform the user when an operation fails.

  4. To give the user enough information about the failure.

General Use Cases:

  • Cinder volume/snapshot creation goes to ERROR status due to lack of capacity. (Scheduling error)

  • Adding a Cinder volume to a consistency group (CG) fails. How do we tell the user that the volume and the group are not on the same backend?

  • Cinder volume goes from attaching to available. Why?

  • Volume retype fails.

  • Volume extend fails, I’d like to know why and be able to still use my volume instead of it being in error_extending.

  • Etc.

Proposed change

Users can fetch operation failure details through direct API calls or through a new Horizon tab, “Event Viewer”. From the command line, the CLI client can display the same information.

The suggested implementation follows a two-way approach:

  1. Push Information: During user operations, messages will be pushed to the database using component-specific notifications.

    The component will generate these messages and push them to the database at operation start, operation completion, or operation failure.

    Message generation will be based on an eventID to eventMessage mapping, kept in message constant files that hold the different notification messages. This way a deployer can easily modify notification messages as required.

  2. Pull Information: When a user needs to check operation status, the user can pull the details using the CLI client or the Horizon tab.
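
The push side of this approach might look like the sketch below. It is a minimal illustration under stated assumptions, not the Cinder implementation: the `EVENT_MESSAGES` mapping, the `SCHED_NO_CAPACITY` event ID, the `req-123` request ID, and the `build_message` helper are hypothetical names invented for the example. Only `UKN_ERR` and the “Volume create operation failed - {Reason of failure}” wording come from this spec.

```python
# Hypothetical message-constants module: maps event IDs to message templates.
# A deployer could edit the templates without touching the calling code.
EVENT_MESSAGES = {
    "UKN_ERR": "{operation} operation failed - {reason}",
    "SCHED_NO_CAPACITY": (
        "{operation} operation failed - no backend has enough free capacity"
    ),
}


def build_message(event_id, request_id, operation, reason="unknown error"):
    """Return a message record for the event, tied to the request ID."""
    # Fall back to the generic template when the event ID is not mapped.
    template = EVENT_MESSAGES.get(event_id, EVENT_MESSAGES["UKN_ERR"])
    return {
        "event_id": event_id,
        "request_id": request_id,
        # str.format ignores keyword arguments the template does not use.
        "user_message": template.format(operation=operation, reason=reason),
    }


record = build_message("UKN_ERR", "req-123", "Volume create",
                       reason="backend unreachable")
print(record["user_message"])
```

Keeping the templates in one mapping means the calling code only ever passes an event ID, so changing the wording never requires touching the code path that reports the failure.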

Results may be shown in tabular form, as below:

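+--------+---------+----------+-------+-------+----------+------+---------+
| Tenant | EventID | NodeName | ReqID | Level | Resource | Time | Summary |
+--------+---------+----------+-------+-------+----------+------+---------+
| Sheel  | UKN_ERR | BS-cind1 | {…}   | Error | Volume   | {..} | {….}    |
+--------+---------+----------+-------+-------+----------+------+---------+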

Every ‘operation’ initiated by the user has a request ID returned as an HTTP header in the context. These notification messages will be tied to operation request ID. (This request ID will be used for mapping a request to what happened in cinder for that operation.)

The summary message will contain an operation-specific failure message, for example “Volume create operation failed - {Reason of failure}”.

Filters:

Results can be filtered by TenantID, HostID, UserID, operation outcome (Fail/Pass), etc.

Types of Messages:

  1. API Events: messages for failed operations.

  2. Service Logs: information about failed services, whether stopped by the user or stopped due to abnormal conditions.

  3. System Logs: any logs other than API and service logs.

Suggested Architecture:

The proposed change is to add a new /v3/<tenant>/messages API resource backed by a messages table in the Cinder DB. This endpoint will return a list of error messages that are intended for the end-user regarding failed asynchronous operations.

In short:

  • /v3/<tenant>/messages API resource, which exposes notification messages depending upon filters

  • message_ttl config option that dictates the minimum lifetime of a message in seconds

  • messages DB table
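
As a rough illustration of how message_ttl could translate into the expires_at value stored with each message, consider the sketch below. The 30-day default and the `compute_expires_at` helper are assumptions made for the example. The spec only states that message_ttl is expressed in seconds.

```python
from datetime import datetime, timedelta, timezone

# Assumed value for illustration only: 30 days, expressed in seconds.
MESSAGE_TTL = 30 * 24 * 60 * 60


def compute_expires_at(created_at, ttl_seconds=MESSAGE_TTL):
    """Return the earliest time at which the message may be removed."""
    return created_at + timedelta(seconds=ttl_seconds)


created = datetime(2016, 1, 1, tzinfo=timezone.utc)
print(compute_expires_at(created).isoformat())
```

Because the expiry time is computed once at creation, a later change to message_ttl would affect only new messages in this sketch.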

Questions

None

Alternatives

  • User facing notifications. Use the existing notification framework in combination with an AMQP consumer to pull messages off and provide an endpoint for the user. The faults with this approach are that we do not want to display the information currently carried in notifications to the user, and that it would require many more services as dependencies.

  • Per resource faults. This alternative suggests adding a sub-resource to each resource, such as volumes/<volume-id>/faults, similar to Nova’s instance faults. This makes it difficult to poll for messages for more than a single resource or resource type. It also adds significant complexity to the API, as every resource must add /faults in order to support messages.

  • Exposing user messages via a separate service (such as Zaqar). This approach suggests storing user messages in another service that the user could query for messages, or that could use webhooks to notify the user. One major drawback to this approach is the complexity of writing bindings for the separate service(s) and the need for a separate service as a dependency.

What this specification does not solve

  • State change notifications. This solution does not intend to solve the use-case of alerting users when a volume or any other resource changes state. For example, when a volume changes from creating to available.

REST API impact

New APIs:

  • GET /v3/<tenant>/messages, with filters by attribute. Ex: GET /v3/<tenant>/messages?resource_type=volume

  • GET /v3/<tenant>/messages/<message-id>

  • DELETE /v3/<tenant>/messages/<message-id>
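
A client might assemble these request URLs as in the sketch below. The endpoint host, the tenant name `demo`, the message ID `abc`, and the `messages_url` helper are placeholders invented for illustration. Authentication and the HTTP call itself are left out.

```python
from urllib.parse import quote, urlencode


def messages_url(endpoint, tenant, message_id=None, **filters):
    """Build the URL for the message list or for one message."""
    url = f"{endpoint.rstrip('/')}/v3/{quote(tenant)}/messages"
    if message_id is not None:
        # Used for GET and DELETE of a single message.
        return f"{url}/{quote(message_id)}"
    if filters:
        # Sort the filters so the resulting URL is deterministic.
        url += "?" + urlencode(sorted(filters.items()))
    return url


# Hypothetical endpoint and tenant, used only for illustration.
list_url = messages_url("http://cinder.example.com:8776", "demo",
                        resource_type="volume")
print(list_url)
```

The same helper yields the single-message URL when `message_id` is given, which covers both the GET and DELETE calls listed above.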

Message schema

Message:
  type: object
  required:
  - user_message
  - id
  - project_id
  - request_id
  - event_id
  - created_at
  - message_level
  - expires_at
  properties:
    id:
      type: string
      description: UUID will be stored in 'id' field.
    message_level:
      type: string
      enum:
      - ERROR
      description: >-
        The level of the message. In the future we may expand to
        sending information to the user that is not an error.
    user_message:
      type: string
    event_id:
      type: string
      description: |
        Event ID can be used to
        a. update message text at deployer end for some specific situation.
        b. to report errors by user.
        c. to debug fast as it is easy to search where specific eventID is
           used for reporting error.
    resource_uuid:
      type: string
      description: The uuid of the offending resource.
    resource_type:
      type: string
      description: >-
        The type of resource this message pertains to.
        For ex, volume, snapshot, backup etc
    request_id:
      type: string
    created_at:
      type: string
    expires_at:
      type: string
      description: After this time the message may no longer exist
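
To make the schema concrete, the sketch below checks a message against the required fields and the message_level enum listed above. The `validate_message` helper and every value in the example message are made up for illustration. The timestamp format in particular is an assumption, since the schema only says the fields are strings.

```python
REQUIRED_FIELDS = (
    "user_message", "id", "project_id", "request_id",
    "event_id", "created_at", "message_level", "expires_at",
)
ALLOWED_LEVELS = ("ERROR",)


def validate_message(message):
    """Return a list of problems; an empty list means the message is valid."""
    problems = [f"missing required field: {name}"
                for name in REQUIRED_FIELDS if name not in message]
    level = message.get("message_level")
    if level is not None and level not in ALLOWED_LEVELS:
        problems.append(f"unsupported message_level: {level}")
    return problems


# Every value below is invented for the example.
example = {
    "id": "00000000-0000-0000-0000-000000000001",
    "project_id": "demo-project",
    "request_id": "req-123",
    "event_id": "UKN_ERR",
    "message_level": "ERROR",
    "user_message": "Volume create operation failed - backend unreachable",
    "resource_type": "volume",
    "created_at": "2016-01-01T00:00:00Z",
    "expires_at": "2016-01-31T00:00:00Z",
}
print(validate_message(example))
```

The optional fields resource_uuid and resource_type are deliberately not checked, since the schema does not list them as required.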

Data model impact

A new messages table in the DB will store all messages. This table may grow large in a cloud with many errors. The admin will be able to use the expires_at column to reap messages.

Security impact

Messages must be highly scrutinized before becoming visible to the user, to prevent any sensitive data from being shown. This will be mitigated by having all user-visible messages defined in a single module. The messaging mechanism will assert that any message it creates comes from the sanctioned location.
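
One way such an assertion could work is sketched below, using an enumeration as the single sanctioned location. This is an assumption about a possible design, not the mechanism this spec mandates. The `UserMessage` members and the `create_message` helper are hypothetical.

```python
import enum


class UserMessage(enum.Enum):
    """Hypothetical single module holding every user-visible message."""
    UNKNOWN_ERROR = "An unknown error occurred."
    NO_CAPACITY = "No storage backend has enough free capacity."


def create_message(message):
    """Create a message record, rejecting text from any other source."""
    if not isinstance(message, UserMessage):
        raise TypeError("user messages must be members of UserMessage")
    return {"event_id": message.name, "user_message": message.value}


print(create_message(UserMessage.NO_CAPACITY))

try:
    # An ad hoc string could leak sensitive details, so it is refused.
    create_message("disk /dev/sdb on host 10.0.0.5 is full")
except TypeError as exc:
    print(f"rejected: {exc}")
```

With this shape, reviewing what users can see only requires reading one class, because free-form strings never reach the message store.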

Notifications impact

None

Other end user impact

None

Performance Impact

None

Other deployer impact

  • New configuration option message_ttl that dictates the number of seconds after a message's creation time at which to set the ‘expires_at’ attribute on generated messages.

  • New configuration option message_reap_interval that dictates the number of seconds between calls to delete old messages. With a value of -1, the reaper never runs. DocImpact: This option should not be set on a large number of nodes, since too many nodes attempting this delete at the same time will cause transaction bouncing and degraded DB performance.

  • New configuration option message_reap_batch_size that dictates the number of expired messages to delete each interval. This allows a deployer to limit DB performance impact by setting a ceiling for the number of messages deleted at a time.

  • The messages table will potentially be large and may be reaped based on the ‘expires_at’ column: all messages with an expires_at date earlier than the current time can be safely deleted.
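
The batch-limited reaping described above can be sketched with an in-memory SQLite database. The two-column table layout, the integer timestamps, and the `reap_expired` helper are simplifications assumed for the example, and the real Cinder schema and DB layer would differ.

```python
import sqlite3


def reap_expired(conn, now, batch_size):
    """Delete up to batch_size messages whose expires_at is before now."""
    cursor = conn.execute(
        "DELETE FROM messages WHERE id IN ("
        "  SELECT id FROM messages WHERE expires_at < ?"
        "  ORDER BY expires_at LIMIT ?)",
        (now, batch_size),
    )
    conn.commit()
    return cursor.rowcount


# Simplified stand-in for the messages table; timestamps are plain integers.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, expires_at INTEGER)")
conn.executemany("INSERT INTO messages (expires_at) VALUES (?)",
                 [(t,) for t in (10, 20, 30, 40, 500)])

deleted = reap_expired(conn, now=100, batch_size=3)
remaining = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
print(deleted, remaining)
```

In this sketch a periodic task would call `reap_expired` every message_reap_interval seconds with message_reap_batch_size as the limit, so a large backlog is cleared over several intervals instead of in one large transaction.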

Developer impact

Developers should be aware of use-cases where the user needs information about an error. In these situations, an appropriate user message should be written and creation of the message added in the specific code path(s).

Implementation

Assignee(s)

Primary assignee:

Sheel Rana, Alex Meade

Work Items

This whole implementation depends upon message generation, transport, collection, storage and analytics of different failure messages.

  • cinder: Implementation to generate notification messages at the time of failure for all existing operations.

  • cinder: A notification listener, which will serve as the basis for handling event messages from different components.

  • cinder: A collector to collect, validate and store event messages in the database.

  • cinder: New API to fetch details from the database depending upon filters.

  • cinder: Add pagination to messages

  • cinder-manage: Add mechanism to automatically, and via a cinder-manage command, reap expired messages in the db depending upon ttl value.

  • cinder: Documentation for new API details.

  • cinder: Update “Getting started Guide”.

  • cinder: Database schema preparation to store notification messages.

  • cinder: Delete messages from the database once their expiry time has passed. For example, if the user has set message_ttl to 7 days, all messages older than 7 days will be purged from the database.

  • horizon: Separate tab for cinder to display event messages.

  • cinder-client: cinder cli to communicate with API and fetch event messages.

  • cinder-client: Update to CLI reference Guide.

  • Tempest tests

Implementation Phases:

This whole feature will be implemented in multiple phases:

Phase 1. Basic implementation of notification generation and storage in the database, with “/messages” exposed to view notification messages. This spec targets Phase 1 first; the other phases will be implemented after acceptance of Phase 1.

Phase 2. Implementation allowing the admin to configure notification storage as DB, Zaqar, or both. If the admin configures both, notification messages will be stored in Zaqar in addition to the database.

Phase 3. Implementation for consuming information from zaqar directly.

Phase 4. Horizon and CLI implementations to view notifications in a better-formatted manner.

Phase 5. Handling of special cases where generating notifications requires separate treatment, such as showing notifications when RabbitMQ is in a failed state or the RabbitMQ recipient is inactive.

Dependencies

None

Testing

Tempest tests should be written and run in the gate. It may prove difficult to implement complete functional testing of the feature, as messages will not be created unless there is an error, and errors may be difficult to trigger. However, some failures are easy to trigger when quotas are unlimited. One example is creating a thick provisioned volume too big to be stored on the backend.

Example Test Cases

  1. List messages with no messages

  2. Attempt creation of a TOO LARGE volume and verify appropriate scheduling error message is created

  3. List messages with filters, especially resource_type

Documentation Impact

  • REST API documentation

  • New config option, message_ttl (time to live)

  • New config option, message_reap_interval (number of seconds between calls to delete old messages)

  • New config option, message_reap_batch_size (number of messages which could be deleted in one batch)

  • New API policies for messages

References

Mitaka Midcycle discussion

https://etherpad.openstack.org/p/mitaka-cinder-midcycle-user-notifications

https://etherpad.openstack.org/p/mitaka-cinder-midcycle-day-1

Kilo Summit Discussion

https://etherpad.openstack.org/p/kilo-cinder-async-reporting

Liberty Summit Discussion (in conjunction with HEAT)

https://etherpad.openstack.org/p/liberty-cross-project-user-notifications