2024-06-10 - Alex Song (Inspur IncloudOS)

This document discusses the latency and timeout issues encountered when managing 3000 compute nodes on the Inspur IncloudOS cloud platform. After in-depth investigation and optimization, request latency and service response timeouts on the platform were reduced, meeting the requirements for concurrent creation and management of virtual machines at large scale.

Problem Description

Services failed to start:

  1. Nova-api failed to connect to the database.

  2. Nova-api was unable to create threads.

  3. The RabbitMQ service restarted automatically.

Concurrent VM creation failed:

  1. RabbitMQ message backlog.

  2. Creating virtual machines on specified nodes failed.

  3. Nova timed out waiting for port creation.

  4. Repeated commits of OVSDB transactions caused port creation to fail.

Optimization Methods

  1. Increase database connection count and memory limit

To ensure that the database service starts normally and that the OpenStack services can obtain enough database connections, we increased the maximum number of database connections (max_connections) and the thread cache size (thread_cache_size).

[DEFAULT]
max_connections = 100000
thread_cache_size = 10000

  2. Optimize the RabbitMQ message middleware configuration

The RabbitMQ service restarted automatically, and its logs continuously showed noproc and handshake_timeout errors. After we raised the maximum number of connections and increased the handshake timeout, the issue no longer occurred.

[DEFAULT]
maximum=20000

The maximum number of RabbitMQ connections was estimated by calculating the connections opened by the Nova and Cinder components. We have 3000 compute nodes, 3 control nodes, and 15 cinder-volume nodes. RabbitMQ is deployed on the control nodes in master-slave mode, and the total number of RabbitMQ connections is almost 20,000, so we set the limit to 20,000.
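The estimate above is a sum of node counts multiplied by connections per node. The sketch below shows only the arithmetic: the node counts come from the text, but the per-node connection counts are assumed placeholders, not measurements from this deployment, and the function name is hypothetical.

```python
# Illustrative estimate only: the per-node connection counts below are
# assumed placeholders, not measurements from the deployment.
def estimate_rabbitmq_connections(node_counts, conns_per_node):
    """Sum node_count * connections_per_node over every node role."""
    return sum(node_counts[role] * conns_per_node[role] for role in node_counts)


node_counts = {"compute": 3000, "control": 3, "cinder_volume": 15}  # from the text
conns_per_node = {"compute": 6, "control": 500, "cinder_volume": 20}  # assumed

estimated = estimate_rabbitmq_connections(node_counts, conns_per_node)
configured_maximum = 20000  # the limit set in the configuration above
headroom = configured_maximum - estimated
print(f"estimated={estimated}, headroom={headroom}")
```

Replacing the assumed values with connection counts observed per service would show whether the configured limit leaves enough headroom.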

  3. Reduce RabbitMQ message backlog

While the Nova-conductor service waited for message confirmation, it did not release its connection to RabbitMQ, so other coroutines could not obtain a connection and messages backed up. We reduced the risk of message backlog by modifying the code: adjusting the message timeout mechanism and increasing the timeout duration.
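The failure mode described here, a connection held while waiting for a confirmation, can be illustrated with a small pool. This is a minimal sketch and not the actual Nova or oslo.messaging change: ConnectionPool, publish_and_wait, and the string "connections" are all hypothetical stand-ins. It shows only the general idea of bounding the wait and always returning the connection to the pool.

```python
import queue
import threading


class ConnectionPool:
    """Tiny fixed-size pool; the 'connections' are placeholder strings."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout):
        # Raises queue.Empty if no connection frees up within `timeout` seconds.
        return self._free.get(timeout=timeout)

    def release(self, conn):
        self._free.put(conn)

    def free_count(self):
        return self._free.qsize()


def publish_and_wait(pool, confirmed, confirm_timeout):
    """Hold a connection while waiting a bounded time for a confirmation.

    `confirmed` is a threading.Event that a (hypothetical) broker callback
    would set. The connection is returned to the pool whether or not the
    confirmation arrives, so other waiters are never starved.
    """
    conn = pool.acquire(timeout=1.0)
    try:
        return confirmed.wait(timeout=confirm_timeout)
    finally:
        pool.release(conn)


pool = ConnectionPool(size=1)
never_confirmed = threading.Event()
ok = publish_and_wait(pool, never_confirmed, confirm_timeout=0.05)
print(ok, pool.free_count())
```

Even when the confirmation never arrives, the single connection is back in the pool after the timeout, so the next caller can proceed.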

  4. Increase the number of allocation candidates returned from Placement

Scheduling failed because Nova requests the list of resource providers (RPs) from Placement with a default limit of 1000. The environment has more than 1000 compute nodes, so some hosts are not returned, which causes the scheduling failure. We resolved this by raising the max_placement_results option in the [scheduler] section of the Nova configuration.

[scheduler]
max_placement_results = 3000

We create 3000 instances in a single request, and Placement returns 1000 allocation candidates by default, so the limit needs to be raised to 3000.
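The effect of the limit can be shown with a toy model. This sketch does not call Placement: allocation_candidates is a hypothetical stand-in that simply truncates a host list, and the host names are invented. Real Placement does not order candidates by name, so the only point illustrated is that a host outside the returned set cannot be scheduled.

```python
def allocation_candidates(hosts, limit):
    """Stand-in for a Placement query that returns at most `limit` hosts."""
    return hosts[:limit]


def schedulable(target_host, hosts, limit):
    """A host can only be selected if it is among the returned candidates."""
    return target_host in allocation_candidates(hosts, limit)


hosts = [f"compute-{i:04d}" for i in range(3000)]  # hypothetical host names
target = "compute-2500"

with_default = schedulable(target, hosts, limit=1000)
with_raised = schedulable(target, hosts, limit=3000)
print(with_default, with_raised)
```

With the limit at 1000 the target host is never offered to the scheduler; with the limit at 3000 every host is a candidate.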

  5. Resolve the port creation timeout

We changed the OVN SB deployment by running 10 SB relay services on each control node, so that each compute node connects to a single relay service, which reduces port creation time. Before the relays were deployed, each OVN SB process managed 600+ connections on average, its CPU usage was often at 100%, and requests were processed slowly. With 10 relay processes, each relay handles about 60 connections. In testing, each relay process showed low CPU usage and processed requests quickly.
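The drop from roughly 600 connections per process to about 60 follows from spreading the same clients over 10 relays. The sketch below reproduces that arithmetic with a round-robin assignment; the relay address and ports are hypothetical, and round-robin is only one assumed way to pin each compute node to a single relay.

```python
from collections import Counter


def assign_relays(nodes, relays):
    """Round-robin each node onto exactly one relay endpoint."""
    return {node: relays[i % len(relays)] for i, node in enumerate(nodes)}


# 600 clients matches the "600+ connections per SB process" figure in the text.
nodes = [f"compute-{i}" for i in range(600)]
# Hypothetical relay endpoints: one address, ten ports.
relays = [f"tcp:192.0.2.10:{16642 + i}" for i in range(10)]

assignment = assign_relays(nodes, relays)
per_relay = Counter(assignment.values())
print(sorted(per_relay.values()))
```

Each relay ends up with 60 clients, which is in line with the per-relay figure reported above.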

  6. Deployment optimization

We modified the Ansible module on top of OpenStack-Helm to support user-defined configuration, which makes it more convenient to change OpenStack configuration parameters. Additionally, we addressed the load-balancing problem of the Kubernetes API server in large-scale scenarios by adjusting the long-lived connection strategy of the Kubelet client so that it reconnects at random, keeping the load balanced across all management nodes.
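Why random reconnection helps can be seen in a small simulation. This is not Kubelet code: the server names, the client count, and the reconnect_randomly helper are all hypothetical. It assumes the worst case, where every long-lived connection starts on the same management node, and shows how a random reconnect spreads them out.

```python
import random
from collections import Counter


def reconnect_randomly(assignment, servers, rng):
    """Each client drops its long-lived connection and picks a server at random."""
    return {client: rng.choice(servers) for client in assignment}


servers = ["api-0", "api-1", "api-2"]  # hypothetical management nodes
# Skewed start: every client is pinned to the same server.
before = {f"kubelet-{i}": "api-0" for i in range(3000)}

rng = random.Random(0)  # fixed seed so the run is repeatable
after = reconnect_randomly(before, servers, rng)

load_before = Counter(before.values())
load_after = Counter(after.values())
print(dict(load_before), dict(load_after))
```

After the reconnect, each server carries roughly a third of the clients instead of one server carrying all of them.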

Optimized Performance

  1. The success rate of concurrent creation of 3000 virtual machines is 100%.

  2. Querying 50000 virtual machines took 562.44 ms.

  3. 2000 concurrent ports can be created with a 100% success rate, with an average creation time of less than 0.2 seconds per port.