Stage 2: Monitor¶
The second stage in the Scaling Journey is Monitor.
Once you have properly configured your cluster to handle scale, you will need to monitor it for signs of load stress. Monitoring in OpenStack can be a bit overwhelming, and it is sometimes hard to determine how to monitor your deployment meaningfully so that you get advance warning when load is too high. This page aims to help answer those questions.
Once meaningful monitoring is in place, you are ready to proceed to the third stage of the Scaling Journey: Scale Up.
FAQ¶
Q: How can monitoring help us?
A: Monitoring can help us in:
Understanding the current status of the system, its under/over-utilization, its capacity, etc.
Predicting trends in system usage and spotting emerging bottlenecks
Troubleshooting system issues quickly by analyzing system graphs
Being proactive about upcoming issues
Verifying scalability issues
Q: Which metrics can be used to monitor OpenStack clusters?
A: A wide range of metrics exist in OpenStack, including, but not limited to:
SLO: time taken for VM create/delete/rebuild, with daily/monthly averages
RabbitMQ: partitions, queue and message status
Hypervisor: memory, CPU and other resource usage; number of VMs running/paused/shut down
Control plane (C-Plane): control plane resource utilization, receive queue size
API: API response status, API latency, API availability
User-side resource utilization: number of projects per user, number of VMs per user, number of ports per project, etc.
Agent status: number of running agents for Nova, Neutron, etc. (see the sketch after this list)
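As an illustration of the agent status metric, here is a minimal sketch, assuming openstacksdk is installed and a clouds.yaml entry named "mycloud" exists (both are assumptions, not part of this document); it counts Nova services and Neutron agents that are up versus down::

    # Hypothetical sketch: count up/down Nova services and Neutron agents
    # using openstacksdk. Assumes a clouds.yaml entry named "mycloud".
    import openstack

    conn = openstack.connect(cloud="mycloud")

    # Nova services (nova-compute, nova-conductor, ...): state is "up" or "down".
    nova_up = nova_down = 0
    for service in conn.compute.services():
        if service.state == "up":
            nova_up += 1
        else:
            nova_down += 1

    # Neutron agents (L3, DHCP, OVS, ...): is_alive is True or False.
    neutron_up = neutron_down = 0
    for agent in conn.network.agents():
        if agent.is_alive:
            neutron_up += 1
        else:
            neutron_down += 1

    print(f"Nova services  up/down: {nova_up}/{nova_down}")
    print(f"Neutron agents up/down: {neutron_up}/{neutron_down}")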
Q: How can I detect that RabbitMQ is a bottleneck?
A: oslo.metrics, currently under development, will introduce monitoring for RPC calls. RabbitMQ node CPU and RAM usage is also an indicator that your RabbitMQ cluster is overloaded; if you find CPU or RAM usage high, you should scale your RabbitMQ nodes up or out.
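In addition, one way to watch for growing queues is to poll the RabbitMQ management API. The sketch below is only an illustration; it assumes the rabbitmq_management plugin is enabled on a placeholder host and port, with placeholder credentials and an example threshold::

    # Hypothetical sketch: spot growing RabbitMQ queues via the management API.
    # The host, port, credentials and threshold below are placeholders.
    import requests

    MGMT_URL = "http://rabbit.example.com:15672/api/queues"   # assumption
    AUTH = ("monitoring", "secret")                           # assumption

    resp = requests.get(MGMT_URL, auth=AUTH, timeout=10)
    resp.raise_for_status()

    for queue in resp.json():
        messages = queue.get("messages", 0)
        unacked = queue.get("messages_unacknowledged", 0)
        memory = queue.get("memory", 0)
        # Alert thresholds are deployment-specific; 1000 is just an example.
        if messages > 1000:
            print(f"{queue['vhost']}/{queue['name']}: "
                  f"{messages} messages ({unacked} unacked, {memory} bytes)")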
Q: How can I detect that database is a bottleneck?
A: oslo.metrics will also integrate with oslo.db as the next step after oslo.messaging.
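Until that is available, the database server's own status counters give a rough signal. The sketch below is only an illustration, assuming a MySQL/MariaDB backend, the PyMySQL driver and placeholder credentials; it compares the current connection count against max_connections::

    # Hypothetical sketch: check connection pressure on a MySQL/MariaDB
    # OpenStack database. Host and credentials are placeholders.
    import pymysql

    conn = pymysql.connect(host="db.example.com", user="monitoring",
                           password="secret")                 # assumptions
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
            threads_connected = int(cur.fetchone()[1])
            cur.execute("SHOW GLOBAL VARIABLES LIKE 'max_connections'")
            max_connections = int(cur.fetchone()[1])
            cur.execute("SHOW GLOBAL STATUS LIKE 'Slow_queries'")
            slow_queries = int(cur.fetchone()[1])

        usage = threads_connected / max_connections
        print(f"connections: {threads_connected}/{max_connections} "
              f"({usage:.0%}), slow queries so far: {slow_queries}")
    finally:
        conn.close()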
Q: How can I track latency issues?
A: If you have a load balancer or proxy in front of your OpenStack API servers (e.g. haproxy, nginx), you can monitor API latencies based on the metrics provided by those services.
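For example, haproxy exposes per-backend timing counters on its stats endpoint. The sketch below is only an illustration, assuming the stats page is enabled at a placeholder URL; it prints the average total request time (the ttime column, in milliseconds) for each backend::

    # Hypothetical sketch: read haproxy's CSV stats endpoint and report the
    # average total time (ttime, ms) per backend. The URL is a placeholder.
    import csv
    import io
    import requests

    STATS_URL = "http://lb.example.com:8404/stats;csv"   # assumption

    resp = requests.get(STATS_URL, timeout=10)
    resp.raise_for_status()

    # The CSV header starts with "# pxname,svname,..."; strip the "# ".
    reader = csv.DictReader(io.StringIO(resp.text.lstrip("# ")))
    for row in reader:
        if row["svname"] == "BACKEND" and row.get("ttime"):
            print(f"{row['pxname']}: avg total time {row['ttime']} ms")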
Q: How can we track error rates?
A: Monitoring these metrics can be very useful for understanding situations where errors hit us. Error rates can be of two types:
When the Operation Fails
  API failure due to the API service not working
    Tracked by keeping count of the HTTP return codes (3xx, 4xx, 5xx); see the probe sketch after this list
  Messaging queue is full
    We can keep track of this by using RabbitMQ exporters, which report the current MQ status
    Any other ideas for other MQs?
  Resource utilization exceeds a threshold
    Using Node Exporter, cAdvisor or custom exporters, we can keep track of the resources
  Control-plane (C-Plane) resources or service agents stop functioning, etc.
    Can be tracked by gathering the current status of the agents on the different nodes (control, compute, network, storage nodes, etc.)
When the Operation is Slow
  API is slow, but operational
    We can keep track of the time lag between request and response at the main junction (for example at haproxy) to understand how slow an API is
  Resource CRUD (for example, VM CRUD) is slow
    This involves calculating the time taken for a resource to be completely created/deleted, etc. Since resource CRUD can be asynchronous, the API response time alone may not reflect the actual completion time
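To illustrate the first failure case, here is a minimal probe sketch; the endpoint URLs are placeholders, not taken from this document. It calls a few API endpoints and classifies the HTTP return codes, which can feed an error-rate time series::

    # Hypothetical sketch: probe a few OpenStack API endpoints and classify
    # the HTTP return codes. The endpoint URLs are placeholders.
    from collections import Counter
    import requests

    ENDPOINTS = {                                   # assumptions
        "keystone": "http://api.example.com:5000/v3",
        "nova":     "http://api.example.com:8774/v2.1",
        "neutron":  "http://api.example.com:9696/v2.0",
    }

    counts = Counter()
    for name, url in ENDPOINTS.items():
        try:
            status = requests.get(url, timeout=5).status_code
            counts[f"{status // 100}xx"] += 1
        except requests.RequestException:
            counts["unreachable"] += 1
            print(f"{name}: unreachable")

    print(dict(counts))  # e.g. {'2xx': 2, '5xx': 1}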
Q: How do we track performance issues?
A:
Time for VM creation/deletion/rebuild increases
  Can be calculated by checking the notifications sent by Nova (see the listener sketch after this list)
  Nova sends a notification when it starts to create/delete a VM and another when the operation ends; the time gap between these two events gives the duration
Number of VMs scheduled on some hypervisors is much higher than on others (improper scheduling)
  Can be calculated by using libvirt exporters, which report the status of the VMs deployed on each compute node and expose them to Prometheus
  Alternatives based on other hypervisor solutions can also be created
API response time is high
  We can keep track of the time lag between request and response at the main junction (for example at haproxy) to understand how slowly an API is performing
and so on…
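As an illustration of measuring VM creation time from Nova notifications, here is a sketch of an oslo.messaging notification listener. It assumes legacy (unversioned) notifications on the "notifications" topic and a placeholder transport URL, and it approximates the duration using the time of receipt::

    # Hypothetical sketch: measure VM creation time from Nova's legacy
    # (unversioned) notifications. The transport URL is a placeholder.
    import time

    from oslo_config import cfg
    import oslo_messaging

    TRANSPORT_URL = "rabbit://guest:guest@rabbit.example.com:5672/"  # assumption

    class CreateTimer(object):
        def __init__(self):
            self.starts = {}

        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            # Legacy payloads carry the instance UUID as "instance_id".
            instance_id = payload.get("instance_id")
            if event_type == "compute.instance.create.start":
                self.starts[instance_id] = time.time()
            elif event_type == "compute.instance.create.end":
                started = self.starts.pop(instance_id, None)
                if started is not None:
                    # Receipt-time approximation of the create duration.
                    print(f"{instance_id} created in {time.time() - started:.1f}s")

    transport = oslo_messaging.get_notification_transport(cfg.CONF, url=TRANSPORT_URL)
    targets = [oslo_messaging.Target(topic="notifications")]
    listener = oslo_messaging.get_notification_listener(
        transport, targets, [CreateTimer()], executor="threading")
    listener.start()
    listener.wait()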
Q: How can I track traffic issues?
A:
Q: How do I track saturation issues?
A:
Resources¶
oslo.metrics code and documentation.
Learn about golden signals (latency, traffic, errors, saturation) in the Google SRE book.
Other SIG work on that stage¶
Measurement of MQ behavior through oslo.metrics
Approved spec for oslo.metrics: https://review.opendev.org/#/c/704733/
Code up at https://opendev.org/openstack/oslo.metrics/
0.1.0 initial release done
Get to a 1.0 release
oslo-messaging metrics code https://review.opendev.org/#/c/761848/ (genekuo)
Enable bandit (issue to fix with predictable path for metrics socket?)
Improve tests to get closer to 100% coverage.