OSH Logging, Monitoring, and Alerting
Blueprints:
1. osh-monitoring
2. osh-logging-framework
Problem Description
OpenStack-Helm currently lacks a centralized mechanism for providing insight into the performance of the OpenStack services and infrastructure components. The log formats of the different components in OpenStack-Helm vary, which makes identifying the causes of issues across services difficult. To support operational readiness by default, OpenStack-Helm should include components for logging events in a common format, monitoring metrics at all levels, alerting when those metrics cross defined thresholds, and visualizing the logs and metrics in a single pane view.
Platform Requirements
Logging Requirements
The requirements for a logging platform include:
All services in OpenStack-Helm log to stdout and stderr by default
Log collection daemon runs on each node to forward logs to storage
Proper directories mounted to retrieve logs from the node
Ability to apply custom metadata and uniform format to logs
Time-series database for logs collected
Backed by highly available storage
Configurable log rotation mechanism
Ability to perform custom queries against stored logs
Single pane visualization capabilities
Monitoring Requirements
The requirements for a monitoring platform include:
Time-series database for collected metrics
Backed by highly available storage
Common method to configure all monitoring targets
Single pane visualization capabilities
Ability to perform custom queries against collected metrics
Alerting capabilities to notify operators when thresholds are exceeded
Use Cases
Logging Use Cases
Example uses for centralized logging include:
Record compute instance behavior across nodes and services
Record OpenStack service behavior and status
Find all backtraces associated with a given tenant's UUID
Identify issues with infrastructure components, such as RabbitMQ, MariaDB, etc
Identify issues with Kubernetes components, such as etcd, CNI, the scheduler, etc
Satisfy organizational auditing needs
Visualize logged events to determine if an event is recurring or an outlier
Find all logged events that match a pattern (service, pod, behavior, etc)
Monitoring Use Cases
Example OpenStack-Helm metrics requiring monitoring include:
Host utilization: memory usage, CPU usage, disk I/O, network I/O, etc
Kubernetes metrics: pod status, replica availability, job status, etc
Ceph metrics: total pool usage, latency, health, etc
OpenStack metrics: tenants, networks, flavors, floating IPs, quotas, etc
Proactive monitoring of stack traces across all deployed infrastructure
Examples of how these metrics can be used include:
Add or remove nodes depending on utilization
Trigger alerts when available replicas fall below the desired number
Trigger alerts when services become unavailable or unresponsive
Identify etcd performance degradation that could lead to cluster instability
Visualize performance to identify trends in traffic or utilization over time
Proposed Change
Logging
Fluentd, Elasticsearch, and Kibana meet OpenStack-Helm's logging requirements for the capture, storage, and visualization of logged events. Fluentd runs as a daemonset on each node and mounts the /var/lib/docker/containers directory, where the Docker container runtime engine writes everything the containers post to stdout and stderr. Fluentd should then declare the contents of that directory as an input stream, and use the fluent-plugin-elasticsearch plugin to apply the Logstash format to the logs. Fluentd will also use the fluent-plugin-kubernetes_metadata_filter plugin to write Kubernetes metadata to the log record. Fluentd will then forward the results to Elasticsearch, which indexes the logs in a logstash-* index by default. The resulting logs can then be queried directly through Elasticsearch, or they can be viewed via Kibana. Kibana offers a dashboard that can create custom views of logged events, and it integrates with Elasticsearch by default.
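A minimal sketch of a Fluentd configuration implementing this pipeline is shown below; the Elasticsearch host name and pos_file path are placeholders rather than values defined by this spec:

  # Tail the container logs the Docker engine writes to the host
  <source>
    @type tail
    path /var/lib/docker/containers/*/*.log
    pos_file /var/log/fluentd-containers.log.pos
    tag kubernetes.*
    format json
    read_from_head true
  </source>

  # Enrich each record with Kubernetes metadata (pod, namespace, labels)
  <filter kubernetes.**>
    @type kubernetes_metadata
  </filter>

  # Ship the enriched records to Elasticsearch in the Logstash format
  <match kubernetes.**>
    @type elasticsearch
    host elasticsearch-logging
    port 9200
    logstash_format true
  </match>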
The proposal includes the following:
Helm chart for Fluentd
Helm chart for Elasticsearch
Helm chart for Kibana
All three charts must include sensible configuration values to make the logging platform usable by default. These include proper input configurations for Fluentd, uniform metadata and formats applied to the logs via Fluentd, sensible index defaults for Elasticsearch, and Kibana configured to query the Elasticsearch indexes created previously.
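As an example of the custom-query requirement, stored logs can be searched directly through the Elasticsearch search API; the service name elasticsearch below is a placeholder for whatever name the chart exposes:

  # Full-text search across the default logstash-* indexes for records containing "ERROR"
  curl -XGET 'http://elasticsearch:9200/logstash-*/_search?q=ERROR&size=10'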
Monitoring
Prometheus and Grafana meet OpenStack-Helm's monitoring requirements. Prometheus scrapes targets for metrics over HTTP and stores them in its own time-series database. Monitoring targets can be defined through static configuration in Prometheus or discovered through service discovery. Prometheus includes a query language for meaningful queries against the gathered metrics and supports rules that evaluate those metrics for alerting purposes. A wide range of Prometheus exporters exists for established services, including Ceph and OpenStack. Grafana supports Prometheus as a data source and provides the ability to view the metrics gathered by Prometheus in a single pane dashboard. Grafana can be bootstrapped with dashboards for each scraped target, or dashboards can be added directly via Grafana's web interface. To meet OpenStack-Helm's alerting needs, Alertmanager can interface with Prometheus and send alerts based on Prometheus rule evaluations.
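As an illustration, a static scrape configuration for Prometheus might look like the following; the job names and target address are placeholders, and Kubernetes service discovery can supplement or replace the static entries:

  scrape_configs:
    # Statically configured target, e.g. a node-exporter service (placeholder address)
    - job_name: node
      static_configs:
        - targets: ['node-exporter:9100']
    # Discover pods to scrape through the Kubernetes API
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
        - role: pod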
The proposal includes the following:
Helm chart for Prometheus
Helm chart for Alertmanager
Helm chart for Grafana
Helm charts for any appropriate Prometheus exporters
All charts must include sensible configuration values to make the monitoring platform usable by default. These include static Prometheus scrape configurations for the included exporters, static dashboards for Grafana mounted via ConfigMaps, and out-of-the-box configuration for Alertmanager.
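To sketch the alerting path, a Prometheus rule such as the following (shown in the Prometheus 2.x YAML rule format, against metrics exposed by kube-state-metrics) would let Alertmanager notify operators when available replicas fall below the desired count; the rule name, window, and labels are illustrative:

  groups:
  - name: kubernetes.rules
    rules:
    - alert: DeploymentReplicasUnavailable
      # Fires when a deployment has had fewer available replicas than desired for 5 minutes
      expr: kube_deployment_status_replicas_available < kube_deployment_spec_replicas
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: 'Deployment {{ $labels.deployment }} has unavailable replicas'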
Security Impact
All services running within the platform should be subject to the security practices applied to the other OpenStack-Helm charts.
Performance Impact
To minimize the performance impacts, the following should be considered:
Sane defaults for log retention and rotation policies
Identify opportunities for improving Prometheus’s operation over time
Elasticsearch configured to prevent memory swapping to disk (see the sketch below)
Elasticsearch configured in a highly available manner with sane defaults
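A sketch of the elasticsearch.yml settings that address the last two items, assuming a deployment with three master-eligible Elasticsearch nodes:

  # Lock the JVM heap in memory so the operating system never swaps it to disk
  bootstrap.memory_lock: true
  # Require a quorum of master-eligible nodes to avoid split-brain (assumes three masters)
  discovery.zen.minimum_master_nodes: 2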
Implementation
Assignee(s)
- Primary assignees:
  srwilker (Steve Wilkerson)
  portdirect (Pete Birley)
  lr699s (Larry Rensing)
Work Items
Fluentd chart
Elasticsearch chart
Kibana chart
Prometheus chart
Alertmanager chart
Grafana chart
Charts for exporters: kube-state-metrics, ceph-exporter, openstack-exporter?
All charts should follow the design approaches applied to the other OpenStack-Helm charts, including the use of helm-toolkit.
All charts require valid and sensible default values to provide operational value out of the box.
Testing
Testing should include Helm tests for each of the included charts as well as an integration test in the gate.
Documentation Impact
Documentation should be provided for each of the included charts, as well as documentation detailing the requirements for a usable monitoring platform, preferably with sane default values out of the box.