OSH Logging, Monitoring, and Alerting

Blueprints:

  1. osh-monitoring

  2. osh-logging-framework

Problem Description

OpenStack-Helm currently lacks a centralized mechanism for providing insight into the performance of the OpenStack services and infrastructure components. The log formats of the different components in OpenStack-Helm also vary, which makes identifying the causes of issues across services difficult. To support operational readiness by default, OpenStack-Helm should include components for logging events in a common format, monitoring metrics at all levels, alerting and alarming on those metrics, and visualization tools for querying the logs and metrics in a single pane view.

Platform Requirements

Logging Requirements

The requirements for a logging platform include:

  1. All services in OpenStack-Helm log to stdout and stderr by default

  2. Log collection daemon runs on each node to forward logs to storage

  3. Proper directories mounted to retrieve logs from the node

  4. Ability to apply custom metadata and uniform format to logs

  5. Time-series database for collected logs

  6. Backed by highly available storage

  7. Configurable log rotation mechanism

  8. Ability to perform custom queries against stored logs

  9. Single pane visualization capabilities

Monitoring Requirements

The requirements for a monitoring platform include:

  1. Time-series database for collected metrics

  2. Backed by highly available storage

  3. Common method to configure all monitoring targets

  4. Single pane visualization capabilities

  5. Ability to perform custom queries against metrics collected

  6. Alerting capabilities to notify operators when thresholds are exceeded

Use Cases

Logging Use Cases

Example uses for centralized logging include:

  1. Record compute instance behavior across nodes and services

  2. Record OpenStack service behavior and status

  3. Find all backtraces associated with a tenant's UUID

  4. Identify issues with infrastructure components, such as RabbitMQ, MariaDB, etc

  5. Identify issues with Kubernetes components, such as etcd, CNI, the scheduler, etc

  6. Organizational auditing needs

  7. Visualize logged events to determine if an event is recurring or an outlier

  8. Find all logged events that match a pattern (service, pod, behavior, etc)

Monitoring Use Cases

Example OpenStack-Helm metrics requiring monitoring include:

  1. Host utilization: memory usage, CPU usage, disk I/O, network I/O, etc

  2. Kubernetes metrics: pod status, replica availability, job status, etc

  3. Ceph metrics: total pool usage, latency, health, etc

  4. OpenStack metrics: tenants, networks, flavors, floating IPs, quotas, etc

  5. Proactive monitoring of stack traces across all deployed infrastructure

Examples of how these metrics can be used include:

  1. Add or remove nodes depending on utilization

  2. Trigger alerts when desired replicas fall below the required number (see the rule sketch after this list)

  3. Trigger alerts when services become unavailable or unresponsive

  4. Identify etcd performance that could lead to cluster instability

  5. Visualize performance to identify trends in traffic or utilization over time
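
As an illustration of the second example above, a Prometheus alerting rule can compare the number of available replicas reported by the kube-state-metrics exporter against the desired count. The sketch below uses the Prometheus 2.x YAML rule format; the alert name, duration, and labels are placeholders an operator would tune, not values prescribed by this spec.

    groups:
      - name: kubernetes-replicas
        rules:
          # Fires when a Deployment has had fewer available replicas than
          # desired for five minutes. Metric names come from kube-state-metrics.
          - alert: DeploymentReplicasUnavailable
            expr: kube_deployment_status_replicas_available < kube_deployment_spec_replicas
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Deployment {{ $labels.deployment }} is missing replicas"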

Proposed Change

Logging

Fluentd, Elasticsearch, and Kibana meet OpenStack-Helm's logging requirements for the capture, storage, and visualization of logged events. Fluentd runs as a daemonset on each node and mounts the /var/lib/docker/containers directory, where the Docker container runtime engine writes everything the containers post to stdout and stderr. Fluentd declares the contents of that directory as an input stream, uses the fluent-plugin-elasticsearch plugin to apply the Logstash format to the logs, and uses the fluent-plugin-kubernetes_metadata_filter plugin to enrich each log record with Kubernetes metadata. Fluentd then forwards the results to Elasticsearch, which indexes the logs in a logstash-* index by default.

The resulting logs can be queried directly through Elasticsearch, or they can be viewed via Kibana. Kibana offers a dashboard that can create custom views of logged events, and it integrates with Elasticsearch by default.
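
The following is a minimal sketch of the host mounts the Fluentd daemonset needs in order to read container logs; the image, names, and label are illustrative placeholders rather than values taken from the chart.

    # Sketch of a Fluentd DaemonSet mounting the directories that hold
    # container logs on each host. Image and names are placeholders.
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: fluentd
    spec:
      selector:
        matchLabels:
          application: fluentd
      template:
        metadata:
          labels:
            application: fluentd
        spec:
          containers:
            - name: fluentd
              image: fluent/fluentd-kubernetes-daemonset:elasticsearch
              volumeMounts:
                - name: varlog
                  mountPath: /var/log
                - name: docker-containers
                  mountPath: /var/lib/docker/containers
                  readOnly: true
          volumes:
            - name: varlog
              hostPath:
                path: /var/log
            - name: docker-containers
              hostPath:
                path: /var/lib/docker/containers

The /var/log mount is included because the symlinks Kubernetes creates under /var/log/containers point into /var/lib/docker/containers; the chart would template these paths rather than hard-code them.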

The proposal includes the following:

  1. Helm chart for Fluentd

  2. Helm chart for Elasticsearch

  3. Helm chart for Kibana

All three charts must include sensible configuration values to make the logging platform usable by default. These include: input configurations for Fluentd, the metadata and formats Fluentd applies to the logs, sensible index settings for Elasticsearch, and Kibana configuration values for querying the Elasticsearch indexes created above.
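
For example, a minimal kibana.yml fragment wiring Kibana to the Elasticsearch service might look like the sketch below; the service name and port assume an in-cluster Elasticsearch endpoint and are placeholders for whatever the charts actually expose.

    # kibana.yml sketch: point Kibana at the in-cluster Elasticsearch
    # service. The hostname and port are illustrative placeholders.
    server.host: "0.0.0.0"
    elasticsearch.url: "http://elasticsearch:9200"
    # Dashboards and saved searches then query the logstash-* indexes
    # created by the fluent-plugin-elasticsearch output.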

Monitoring

Prometheus and Grafana meet OpenStack-Helm's monitoring requirements. Prometheus scrapes monitoring targets for metrics over HTTP and stores the results in its own time-series database. The monitoring targets can be defined through static configuration in Prometheus or discovered through service discovery. Prometheus includes a query language, PromQL, for running meaningful queries against the gathered metrics, and it supports the creation of rules that evaluate those metrics for alerting purposes. A wide range of Prometheus exporters exists for existing services, including Ceph and OpenStack.

Grafana supports Prometheus as a data source and provides the ability to view the metrics gathered by Prometheus in a single pane dashboard. Grafana can be bootstrapped with dashboards for each scraped target, or dashboards can be added directly through Grafana's web interface. To meet OpenStack-Helm's alerting needs, Alertmanager can be used to interface with Prometheus and send alerts based on Prometheus rule evaluations.
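
A minimal sketch of a Prometheus scrape configuration is shown below. It illustrates one statically configured target and one set of targets discovered through Kubernetes service discovery; the job names, the exporter endpoint, and the annotation used for filtering are illustrative rather than values prescribed by this spec.

    # prometheus.yml sketch: one static target plus targets discovered via
    # the Kubernetes API. Job names and the exporter endpoint are placeholders.
    scrape_configs:
      - job_name: ceph
        static_configs:
          - targets: ['ceph-exporter:9283']
      - job_name: kubernetes-endpoints
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          # Only keep endpoints whose service carries a prometheus.io/scrape
          # annotation set to "true".
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true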

The proposal includes the following:

  1. Helm chart for Prometheus

  2. Helm chart for Alertmanager

  3. Helm chart for Grafana

  4. Helm charts for any appropriate Prometheus exporters

All charts must include sensible configuration values to make the monitoring platform usable by default. These include: static Prometheus scrape configurations for the included exporters, static Grafana dashboards mounted via ConfigMaps, and a working default Alertmanager configuration.
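
As an example of what a usable default Alertmanager configuration could look like, the sketch below defines a single catch-all route and receiver; the receiver name and webhook URL are placeholders, since the actual notification targets are deployment specific.

    # alertmanager.yml sketch: one catch-all route. The receiver name and
    # webhook URL are deployment-specific placeholders.
    route:
      receiver: default-receiver
      group_by: ['alertname']
      group_wait: 30s
      repeat_interval: 4h
    receivers:
      - name: default-receiver
        webhook_configs:
          - url: 'http://alert-webhook.example.com/notify'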

Security Impact

All services running within the platform should be subject to the security practices applied to the other OpenStack-Helm charts.

Performance Impact

To minimize the performance impacts, the following should be considered:

  1. Sane defaults for log retention and rotation policies

  2. Identify opportunities for improving Prometheus’s operation over time

  3. Elasticsearch configured to prevent its memory from being swapped to disk (see the sketch after this list)

  4. Elasticsearch configured in a highly available manner with sane defaults
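
For the third and fourth items, Elasticsearch can be instructed to lock its heap in memory so that JVM pages are never swapped to disk, and quorum settings can guard a multi-node cluster against split-brain. The elasticsearch.yml sketch below assumes the pods are granted the memlock ulimit they need and that three master-eligible nodes are deployed; heap sizing itself would be handled through the JVM options the chart exposes.

    # elasticsearch.yml sketch: lock the JVM heap in memory so it is never
    # swapped to disk (requires an adequate memlock ulimit on the container).
    bootstrap.memory_lock: true
    # With three master-eligible nodes, requiring two for quorum protects
    # against split-brain (pre-7.x discovery setting).
    discovery.zen.minimum_master_nodes: 2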

Implementation

Assignee(s)

Primary assignees:

srwilker (Steve Wilkerson)
portdirect (Pete Birley)
lr699s (Larry Rensing)

Work Items

  1. Fluentd chart

  2. Elasticsearch chart

  3. Kibana chart

  4. Prometheus chart

  5. Alertmanager chart

  6. Grafana chart

  7. Charts for exporters: kube-state-metrics, ceph-exporter, openstack-exporter?

All charts should follow the design approaches applied to the other OpenStack-Helm charts, including the use of helm-toolkit.

All charts require valid and sensible default values to provide operational value out of the box.

Testing

Testing should include Helm tests for each of the included charts as well as an integration test in the gate.
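
As an illustration of the Helm test approach, each chart can ship a test pod under its templates directory that Helm runs on demand via the test hook. The sketch below probes a hypothetical in-cluster Elasticsearch endpoint; the pod name, image, and URL are placeholders rather than values taken from the charts.

    # Sketch of a Helm test pod that checks Elasticsearch cluster health.
    # The service endpoint and image are illustrative placeholders.
    apiVersion: v1
    kind: Pod
    metadata:
      name: elasticsearch-test
      annotations:
        "helm.sh/hook": test-success
    spec:
      restartPolicy: Never
      containers:
        - name: check-cluster-health
          image: curlimages/curl
          command:
            - curl
            - --fail
            - http://elasticsearch:9200/_cluster/health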

Documentation Impact

Documentation should be provided for each of the included charts, as well as documentation detailing the requirements for a usable monitoring platform, preferably with sane default values out of the box.