Maintenance tasks¶

This chapter covers maintenance tasks specific to OpenStack-Ansible.

Galera cluster maintenance¶

Routine maintenance includes gracefully adding or removing nodes from the cluster without impacting operation and also starting a cluster after gracefully shutting down all nodes.

MySQL instances are restarted when creating a cluster, when adding a node, when the service is not running, or when changes are made to the /etc/mysql/my.cnf configuration file.

Verify cluster status¶

Run the following command and compare its output with the example below. The output reports the state of the cluster as seen by each node.

# ansible galera_container -m shell -a "mysql \
-e 'show status like \"%wsrep_cluster_%\";'"
node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (2)

node2_galera_container-49a47d25 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (2)

node4_galera_container-76275635 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     7
wsrep_cluster_size        1
wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status      Primary

In this example, only one node responded.

Gracefully shutting down the MariaDB service on all but one node allows the remaining operational node to continue processing SQL requests. When gracefully shutting down multiple nodes, perform the actions sequentially to retain operation.
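
The per-node output shown above can be condensed into one line per node. The following is a minimal sketch and not part of OpenStack-Ansible: summarize_wsrep is a hypothetical helper name, and it assumes the ad-hoc output format shown in this section (a "node | result | rc=N >>" header followed by the status rows).

```shell
# Hypothetical helper: condense the ad-hoc wsrep status output read from stdin.
# Prints one line per node: <node> <result> <cluster_size> <cluster_status>
# Nodes that returned no status rows show "-" for size and status.
#
# Usage (from the playbooks directory):
#   ansible galera_container -m shell \
#     -a "mysql -e 'show status like \"%wsrep_cluster_%\";'" | summarize_wsrep
summarize_wsrep() {
  awk '
    index($0, " | ") > 0 {            # header: "<node> | <result> | rc=N >>"
      if (node != "") print node, result, size, status
      node = $1; result = $3; size = "-"; status = "-"
      next
    }
    $1 == "wsrep_cluster_size"   { size = $2 }
    $1 == "wsrep_cluster_status" { status = $2 }
    END { if (node != "") print node, result, size, status }
  '
}
```

A healthy three-node cluster would show the same size and a Primary status on every line.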

Start a cluster¶

Gracefully shutting down all nodes destroys the cluster. Starting or restarting a cluster from zero nodes requires creating a new cluster on one of the nodes.

Start a new cluster on the most advanced node. Change to the playbooks directory and check the seqno value in the grastate.dat file on all of the nodes:

# ansible galera_container -m shell -a "cat /var/lib/mysql/grastate.dat"
node2_galera_container-49a47d25 | success | rc=0 >>
# GALERA saved state
version: 2.1
uuid:    338b06b0-2948-11e4-9d06-bef42f6c52f1
seqno:   31
cert_index:

node3_galera_container-3ea2cbd3 | success | rc=0 >>
# GALERA saved state
version: 2.1
uuid:    338b06b0-2948-11e4-9d06-bef42f6c52f1
seqno:   31
cert_index:

node4_galera_container-76275635 | success | rc=0 >>
# GALERA saved state
version: 2.1
uuid:    338b06b0-2948-11e4-9d06-bef42f6c52f1
seqno:   31
cert_index:

In this example, all nodes in the cluster contain the same positive seqno values as they were synchronized just prior to graceful shutdown. If all seqno values are equal, any node can start the new cluster.
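
When the seqno values differ, the comparison described above can be scripted. This is a minimal sketch under stated assumptions: most_advanced_node is a hypothetical helper name, and it expects the ad-hoc output format shown above on stdin. If several nodes share the highest value, it reports the first one listed.

```shell
# Hypothetical helper: read the output of
#   ansible galera_container -m shell -a "cat /var/lib/mysql/grastate.dat"
# from stdin and print the node with the highest seqno, followed by that seqno.
most_advanced_node() {
  awk '
    index($0, " | ") > 0 { node = $1; next }   # "<node> | success | rc=0 >>"
    $1 == "seqno:" {
      if (best == "" || $2 + 0 > max) { max = $2 + 0; best = node }
    }
    END { if (best != "") print best, max }
  '
}
```

A reported seqno of -1 means that no node recorded a clean shutdown, so the result should not be used to pick a bootstrap node without further checks.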

On the selected node, start the new cluster:

## for init
# /etc/init.d/mysql start --wsrep-new-cluster
## for systemd
# systemctl set-environment _WSREP_NEW_CLUSTER='--wsrep-new-cluster'
# systemctl start mysql
# systemctl set-environment _WSREP_NEW_CLUSTER=''

See also the upstream Galera documentation on starting a cluster.

This can also be done with the help of ansible using the shell module:

# ansible galera_container -m shell -a "/etc/init.d/mysql start --wsrep-new-cluster" --limit 'galera_container[0]'

This command results in a cluster containing a single node. The wsrep_cluster_size value shows the number of nodes in the cluster.

node2_galera_container-49a47d25 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (111)

node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (2)

node4_galera_container-76275635 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     1
wsrep_cluster_size        1
wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status      Primary

Restart MariaDB on the other nodes (replace [0] from previous ansible command with [1:]) and verify that they rejoin the cluster.

node2_galera_container-49a47d25 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     3
wsrep_cluster_size        3
wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status      Primary

node3_galera_container-3ea2cbd3 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     3
wsrep_cluster_size        3
wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status      Primary

node4_galera_container-76275635 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     3
wsrep_cluster_size        3
wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status      Primary

Galera cluster recovery¶

To automatically recover a node or an entire environment, run the openstack.osa.galera_server playbook with the galera_force_bootstrap variable set:

# openstack-ansible openstack.osa.galera_server -e galera_force_bootstrap=True --tags galera_server-config

You can also select a different bootstrap node with the galera_server_bootstrap_node variable if the current bootstrap node is desynced or broken. To check which node is currently selected for bootstrap, run this ad-hoc command:

root@aio1:/opt/openstack-ansible# ansible -m debug -a var="groups['galera_all'][0]" localhost

The cluster comes back online after the galera_server playbook run completes. If the run fails, review the sections on restarting the cluster and recovering the primary component in the Galera documentation; they are invaluable for a full cluster recovery.

Recover a single-node failure¶

If a single node fails, the other nodes maintain quorum and continue to process SQL requests.

Change to the playbooks directory and run the following Ansible command to determine the failed node:

# ansible galera_container -m shell -a "mysql \
-e 'show status like \"%wsrep_cluster_%\";'"
node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server through
socket '/var/run/mysqld/mysqld.sock' (111)

node2_galera_container-49a47d25 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     17
wsrep_cluster_size        3
wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status      Primary

node4_galera_container-76275635 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     17
wsrep_cluster_size        3
wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status      Primary

In this example, node 3 has failed.

Restart MariaDB on the failed node and verify that it rejoins the cluster.
If MariaDB fails to start, run the mysqld command and perform further analysis on the output. As a last resort, rebuild the container for the node.

Recover a multi-node failure¶

When all but one node fail, the remaining node cannot achieve quorum and stops processing SQL requests. In this situation, failed nodes that recover cannot join the cluster because it no longer exists.

Run the following Ansible command to show the failed nodes:

# ansible galera_container -m shell -a "mysql \
-e 'show status like \"%wsrep_cluster_%\";'"
node2_galera_container-49a47d25 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (111)

node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (111)

node4_galera_container-76275635 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     18446744073709551615
wsrep_cluster_size        1
wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status      non-Primary

In this example, nodes 2 and 3 have failed. The remaining operational server indicates non-Primary because it cannot achieve quorum.
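
The quorum rule behind this state can be sketched as a strict-majority check. This is a simplified illustration that assumes equally weighted nodes. The has_quorum name is hypothetical, and Galera performs the real calculation internally against the membership of the last Primary Component.

```shell
# Hypothetical helper: has_quorum ALIVE TOTAL
# Succeeds (exit 0) when ALIVE nodes form a strict majority of TOTAL nodes.
has_quorum() {
  [ $(( $1 * 2 )) -gt "$2" ]
}

# Example: one survivor out of three nodes is not a majority.
if has_quorum 1 3; then
  echo "quorum retained"
else
  echo "no quorum: the node reports non-Primary"
fi
```

Nodes that leave gracefully reduce the cluster size first, which is why a single remaining node stays Primary after sequential graceful shutdowns but not after two crashes.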

Run the following command to rebootstrap the operational node into the cluster:

# mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=yes';"
node4_galera_container-76275635 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     15
wsrep_cluster_size        1
wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status      Primary

node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (111)

node2_galera_container-49a47d25 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (111)

The remaining operational node becomes the primary node and begins processing SQL requests.

Restart MariaDB on the failed nodes and verify that they rejoin the cluster:

# ansible galera_container -m shell -a "mysql \
-e 'show status like \"%wsrep_cluster_%\";'"
node3_galera_container-3ea2cbd3 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     17
wsrep_cluster_size        3
wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status      Primary

node2_galera_container-49a47d25 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     17
wsrep_cluster_size        3
wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status      Primary

node4_galera_container-76275635 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     17
wsrep_cluster_size        3
wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status      Primary

If MariaDB fails to start on any of the failed nodes, run the mysqld command and perform further analysis on the output. As a last resort, rebuild the container for the node.

Recover a complete environment failure¶

If all of the nodes in a Galera cluster fail (that is, none of them shut down gracefully), restore from backup. Change to the playbooks directory and run the following command to determine whether all nodes in the cluster have failed:

# ansible galera_container -m shell -a "cat /var/lib/mysql/grastate.dat"
node3_galera_container-3ea2cbd3 | success | rc=0 >>
# GALERA saved state
version: 2.1
uuid:    338b06b0-2948-11e4-9d06-bef42f6c52f1
seqno:   -1
cert_index:

node2_galera_container-49a47d25 | success | rc=0 >>
# GALERA saved state
version: 2.1
uuid:    338b06b0-2948-11e4-9d06-bef42f6c52f1
seqno:   -1
cert_index:

node4_galera_container-76275635 | success | rc=0 >>
# GALERA saved state
version: 2.1
uuid:    338b06b0-2948-11e4-9d06-bef42f6c52f1
seqno:   -1
cert_index:

All the nodes have failed if mysqld is not running on any of the nodes and all of the nodes contain a seqno value of -1.

If any single node has a positive seqno value, then that node can be used to restart the cluster. However, because there is no guarantee that each node has an identical copy of the data, we do not recommend restarting the cluster using the --wsrep-new-cluster option on one node.
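
The seqno part of this check can be automated. The following is a minimal sketch: all_seqno_unclean is a hypothetical helper name, and it reads the grastate.dat dump shown above from stdin. It only covers the seqno condition, so you still need to confirm separately that mysqld is not running on any node.

```shell
# Hypothetical helper: succeed (exit 0) only if at least one seqno line was read
# and every seqno value is -1.
#
# Usage (from the playbooks directory):
#   ansible galera_container -m shell -a "cat /var/lib/mysql/grastate.dat" \
#     | all_seqno_unclean && echo "no node shut down cleanly"
all_seqno_unclean() {
  awk '
    $1 == "seqno:" { seen++; if ($2 != "-1") clean++ }
    END { exit !(seen > 0 && clean == 0) }
  '
}
```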

Rebuild a container¶

Recovering from certain failures requires rebuilding one or more containers.

Disable the failed node on the load balancer.

Note

Do not rely on the load balancer health checks to disable the node. If the node is not disabled, the load balancer sends SQL requests to it before it rejoins the cluster, which can cause data inconsistencies.

Destroy the container and remove MariaDB data stored outside of the container:
```
# openstack-ansible openstack.osa.containers_lxc_destroy \
-l node3_galera_container-3ea2cbd3
```
In this example, node 3 failed.
Run the host setup playbook to rebuild the container on node 3:
```
# openstack-ansible openstack.osa.containers_lxc_create -l node3 \
-l node3_galera_container-3ea2cbd3
```
The playbook restarts all other containers on the node.

Run the infrastructure playbook to configure the container specifically on node 3:

# openstack-ansible openstack.osa.setup_infrastructure \
--limit node3_galera_container-3ea2cbd3

Warning

The new container runs a single-node Galera cluster, which is a dangerous state because the environment contains more than one active database with potentially different data.

# ansible galera_container -m shell -a "mysql \
-e 'show status like \"%wsrep_cluster_%\";'"
node3_galera_container-3ea2cbd3 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     1
wsrep_cluster_size        1
wsrep_cluster_state_uuid  da078d01-29e5-11e4-a051-03d896dbdb2d
wsrep_cluster_status      Primary

node2_galera_container-49a47d25 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     4
wsrep_cluster_size        2
wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status      Primary

node4_galera_container-76275635 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     4
wsrep_cluster_size        2
wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status      Primary

Restart MariaDB in the new container and verify that it rejoins the cluster.

Note

In larger deployments, it may take some time for the MariaDB daemon to start in the new container. It will be synchronizing data from the other MariaDB servers during this time. You can monitor the status during this process by tailing the /var/log/mysql_logs/galera_server_error.log log file.

Lines starting with WSREP_SST will appear during the sync process and you should see a line with WSREP: SST complete, seqno: <NUMBER> if the sync was successful.
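
Checking the log for the completion line can be scripted. This is a minimal sketch: sst_complete_seqno is a hypothetical helper name, and it relies only on the "WSREP: SST complete, seqno: <NUMBER>" line format described above.

```shell
# Hypothetical helper: read log lines from stdin and print the seqno from the
# most recent "WSREP: SST complete" line. Prints nothing if no such line exists.
#
# Usage (inside the new container):
#   sst_complete_seqno < /var/log/mysql_logs/galera_server_error.log
sst_complete_seqno() {
  sed -n 's/.*WSREP: SST complete, seqno: \([0-9][0-9]*\).*/\1/p' | tail -n 1
}
```

An empty result means that the transfer has not finished yet or that it failed, so inspect the WSREP_SST lines in that case.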

# ansible galera_container -m shell -a "mysql \
-e 'show status like \"%wsrep_cluster_%\";'"
node2_galera_container-49a47d25 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     5
wsrep_cluster_size        3
wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status      Primary

node3_galera_container-3ea2cbd3 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     5
wsrep_cluster_size        3
wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status      Primary

node4_galera_container-76275635 | success | rc=0 >>
Variable_name             Value
wsrep_cluster_conf_id     5
wsrep_cluster_size        3
wsrep_cluster_state_uuid  338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status      Primary

Enable the previously failed node on the load balancer.

RabbitMQ cluster maintenance¶

A RabbitMQ broker is a logical grouping of one or several Erlang nodes with each node running the RabbitMQ application and sharing users, virtual hosts, queues, exchanges, bindings, and runtime parameters. A collection of nodes is often referred to as a cluster. For more information on RabbitMQ clustering, see RabbitMQ cluster.

Within OpenStack-Ansible, all data and states required for operation of the RabbitMQ cluster are replicated across all nodes, including the message queues, providing high availability. RabbitMQ nodes address each other using domain names. The hostnames of all cluster members must be resolvable from all cluster nodes, as well as any machines where CLI tools related to RabbitMQ might be used. There are alternatives that may work in more restrictive environments. For more details on that setup, see Inet Configuration.

Note

There is currently an Ansible bug regarding HOSTNAME. If the host .bashrc holds a variable named HOSTNAME, the container that the lxc_container module attaches to inherits this variable and may set the wrong $HOSTNAME. See the Ansible fix, which will be released in Ansible version 2.3.

Create a RabbitMQ cluster¶

RabbitMQ clusters can be formed in two ways:

Manually with rabbitmqctl
Declaratively (list of cluster nodes in a config, with rabbitmq-autocluster, or rabbitmq-clusterer plugins)

Note

RabbitMQ brokers can tolerate the failure of individual nodes within the cluster. These nodes can start and stop at will as long as they have the ability to reach previously known members at the time of shutdown.

There are two types of nodes you can configure: disk and RAM nodes. Most commonly, you will use your nodes as disk nodes (preferred). RAM nodes are more of a special configuration used in performance clusters.

RabbitMQ nodes and the CLI tools use an Erlang cookie to determine whether or not they have permission to communicate. The cookie is a string of alphanumeric characters and can be as short or as long as you would like.

Note

The cookie value is a shared secret and should be protected and kept private.

The default location of the cookie on *nix environments is /var/lib/rabbitmq/.erlang.cookie or in $HOME/.erlang.cookie.

Tip

While troubleshooting, if one node refuses to join the cluster, check that its Erlang cookie matches the other nodes. When the cookie is misconfigured (for example, not identical), RabbitMQ logs errors such as “Connection attempt from disallowed node” and “Could not auto-cluster”. See clustering for more information.
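
The cookie comparison can be done without printing the secret by comparing checksums instead. The following is a minimal sketch under stated assumptions: cookie_mismatches is a hypothetical helper name, the rabbitmq_all group name in the usage comment is an assumption about your inventory, and the input is expected in the ad-hoc output format used elsewhere in this chapter.

```shell
# Hypothetical helper: read ad-hoc checksum output from stdin and print every
# node whose cookie digest differs from the digest of the first node listed.
#
# Usage (group name is an assumption, adjust to your inventory):
#   ansible rabbitmq_all -m shell \
#     -a "sha256sum /var/lib/rabbitmq/.erlang.cookie" | cookie_mismatches
cookie_mismatches() {
  awk '
    index($0, " | ") > 0 { node = $1; next }   # "<node> | success | rc=0 >>"
    NF >= 2 && node != "" {
      # Append "" to force a string comparison of the digests.
      if (ref == "") {
        ref = $1 ""
      } else if (($1 "") != ref) {
        print node
      }
      node = ""
    }
  '
}
```

No output means that every node returned the same digest as the first node. An error line from the checksum command is also reported as a mismatch, which is usually what you want to investigate.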

To form a RabbitMQ Cluster, you start by taking independent RabbitMQ brokers and re-configuring these nodes into a cluster configuration.

Using a 3 node example, you would be telling nodes 2 and 3 to join the cluster of the first node.

Stop the application, join the cluster, then restart the application:

rabbit2$ rabbitmqctl stop_app
Stopping node rabbit@rabbit2 ...done.
rabbit2$ rabbitmqctl join_cluster rabbit@rabbit1
Clustering node rabbit@rabbit2 with [rabbit@rabbit1] ...done.
rabbit2$ rabbitmqctl start_app
Starting node rabbit@rabbit2 ...done.
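
The three steps above can be wrapped in a small function for repeated use on each joining node. This is a minimal sketch: rabbit_join_steps and the APPLY switch are hypothetical names invented here. By default the function only prints the commands, so you can review them before running anything.

```shell
# Hypothetical helper: rabbit_join_steps SEED_HOST
# Prints the join sequence for the local node. Set APPLY=yes (a switch invented
# for this sketch) to execute the commands instead of printing them.
rabbit_join_steps() {
  local seed="$1" run="echo"
  if [ "${APPLY:-no}" = "yes" ]; then run=""; fi
  $run rabbitmqctl stop_app &&
    $run rabbitmqctl join_cluster "rabbit@${seed}" &&
    $run rabbitmqctl start_app
}

# Dry run on the joining node: show what would be executed against rabbit1.
rabbit_join_steps rabbit1
```

Running it on rabbit2 and then rabbit3 with APPLY=yes would reproduce the manual procedure, one node at a time.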

Check the RabbitMQ cluster status¶

Run rabbitmqctl cluster_status from either node.

You will see rabbit1 and rabbit2 are both running as before.

The difference is that, in the cluster status section of the output, both nodes are now grouped together:

rabbit1$ rabbitmqctl cluster_status
Cluster status of node rabbit@rabbit1 ...
[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2]}]},
{running_nodes,[rabbit@rabbit2,rabbit@rabbit1]}]
...done.

To add the third RabbitMQ node to the cluster, repeat the above process by stopping the RabbitMQ application on the third node.

Join the cluster, and restart the application on the third node.

Execute rabbitmqctl cluster_status to see all 3 nodes:

rabbit1$ rabbitmqctl cluster_status
Cluster status of node rabbit@rabbit1 ...
[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2,rabbit@rabbit3]}]},
 {running_nodes,[rabbit@rabbit3,rabbit@rabbit2,rabbit@rabbit1]}]
...done.

Stop and restart a RabbitMQ cluster¶

To stop and start the cluster, keep in mind the order in which you shut the nodes down. The last node you stop must be the first node you start. This node is the master.

If you start the nodes out of order, the cluster may decide that the current master should not be the master and drop messages to ensure that no new messages are queued while the real master is down.
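
One simple way to respect this rule is to record the stop order and start the nodes in reverse. This is a minimal sketch: start_order is a hypothetical helper name. Reversing the whole list is a convenience chosen here, since the text above only requires that the last node stopped is the first node started.

```shell
# Hypothetical helper: start_order NODE...
# Takes node names in the order they were stopped and prints them in the
# order they should be started (last stopped first).
start_order() {
  local out="" node
  for node in "$@"; do
    out="${node}${out:+ ${out}}"
  done
  printf '%s\n' "$out"
}

start_order rabbit1 rabbit2 rabbit3
```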

RabbitMQ and mnesia¶

Mnesia is a distributed database that RabbitMQ uses to store information about users, exchanges, queues, and bindings. Messages, however, are not stored in the database.

For more information about Mnesia, see the Mnesia overview.

To view the locations of important Rabbit files, see File Locations.

Repair a partitioned RabbitMQ cluster for a single-node¶

Sooner or later, something in your environment is likely to cause the loss of a node in your cluster. In this scenario, multiple LXC containers on the same host are running Rabbit and are in a single Rabbit cluster.

If the host still shows as part of the cluster, but it is not running, execute:

# rabbitmqctl start_app

However, you may notice some issues with your application, as clients may be trying to push messages to the unresponsive node. To remedy this, forget the node from the cluster by executing the following:

Ensure RabbitMQ is not running on the node:
```
# rabbitmqctl stop_app
```

On the Rabbit2 node, execute:

# rabbitmqctl forget_cluster_node rabbit@rabbit1

By doing this, the cluster can continue to run effectively and you can repair the failing node.

Important

Be careful when you restart the node: it still thinks it is part of the cluster, and you must reset it. After the reset, you can rejoin it to other nodes as needed.

rabbit1$ rabbitmqctl start_app
Starting node rabbit@rabbit1 ...

Error: inconsistent_cluster: Node rabbit@rabbit1 thinks it's clustered
       with node rabbit@rabbit2, but rabbit@rabbit2 disagrees

rabbit1$ rabbitmqctl reset
Resetting node rabbit@rabbit1 ...done.
rabbit1$ rabbitmqctl start_app
Starting node rabbit@rabbit1 ...
...done.

Repair a partitioned RabbitMQ cluster for a multi-node cluster¶

The concepts that apply to a single-node cluster also apply to a multi-node cluster. The only difference is that the nodes run on different hosts. The key things to keep in mind when dealing with a multi-node cluster are:

When the entire cluster is brought down, the last node to go down must be the first node to be brought online. If this does not happen, the nodes will wait 30 seconds for the last disc node to come back online, and fail afterwards.

If the last node to go offline cannot be brought back up, it can be removed from the cluster using the forget_cluster_node command.
If all cluster nodes stop in a simultaneous and uncontrolled manner (for example, with a power cut), you can be left with a situation in which all nodes think that some other node stopped after them. In this case, you can use the force_boot command on one node to make it bootable again.

Consult the rabbitmqctl manpage for more information.

Migrate between HA and Quorum queues¶

In the 2024.1 (Caracal) release, OpenStack-Ansible switches to use RabbitMQ Quorum Queues by default, rather than the legacy High Availability classic queues. Migration to Quorum Queues can be performed at upgrade time, but may result in extended control plane downtime as this requires all OpenStack services to be restarted with their new configuration.

In order to speed up the migration, the following playbooks can be run to migrate either to or from Quorum Queues, whilst skipping package install and other configuration tasks. These tasks are available from the 2024.1 release onwards.

$ openstack-ansible openstack.osa.rabbitmq_server --tags rabbitmq-config
$ openstack-ansible openstack.osa.setup_openstack --tags common-mq,post-install

In order to take advantage of these steps, we suggest setting oslomsg_rabbit_quorum_queues to False before upgrading to 2024.1. Then, once you have upgraded, set oslomsg_rabbit_quorum_queues back to the default of True and run the playbooks above.

Running ad-hoc Ansible plays¶

Being familiar with running ad-hoc Ansible commands is helpful when operating your OpenStack-Ansible deployment. For a review, we can look at the structure of the following ansible command:

$ ansible example_group -m shell -a 'hostname'

This command tells Ansible to target the hosts in example_group, use the shell module (selected with -m), and pass it the hostname command as the module argument (supplied with -a). You can substitute example_group with any group you have defined. For example, if you had compute_hosts in one group and infra_hosts in another, supply either group name and run the command. You can also use the * wildcard if you only know the first part of the group name; for instance, if you know the group name starts with compute, you would use compute_h*.

Modules can be used to control system resources or handle the execution of system commands. For more information about modules, see Module Index and About Modules.

If you need to run a particular command against a subset of a group, you could use the limit flag -l. For example, if a compute_hosts group contained compute1, compute2, compute3, and compute4, and you only needed to execute a command on compute1 and compute4 you could limit the command as follows:

$ ansible compute_hosts -m shell -a 'hostname' -l compute1,compute4

Note

Each host is comma-separated with no spaces.

Note

Run the ad-hoc Ansible commands from the openstack-ansible/playbooks directory.

For more information, see Inventory and Patterns.
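
When the host list comes from a script, the comma-separated value for the limit flag can be built as follows. This is a minimal sketch, and limit_pattern is a hypothetical helper name.

```shell
# Hypothetical helper: limit_pattern HOST...
# Joins host names with commas and no spaces, as expected by the -l flag.
limit_pattern() {
  local IFS=,
  printf '%s\n' "$*"
}

# Usage (from the playbooks directory):
#   ansible compute_hosts -m shell -a 'hostname' -l "$(limit_pattern compute1 compute4)"
limit_pattern compute1 compute4
```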

Running the shell module¶

The two most common modules used are the shell and copy modules. The shell module takes the command name followed by a list of space delimited arguments. It is almost like the command module, but runs the command through a shell (/bin/sh) on the remote node.

For example, you could use the shell module to check the amount of disk space on a set of Compute hosts:

$ ansible compute_hosts -m shell -a 'df -h'

To check on the status of your Galera cluster:

$ ansible galera_container -m shell -a "mysql \
-e 'show status like \"%wsrep_cluster_%\";'"

When a module is used in an ad-hoc command, some parameters are unnecessary. For example, there is no need for the chdir parameter (chdir=/home/user ls) when running Ansible from the CLI; pass the path to the command instead:

$ ansible compute_hosts -m shell -a 'ls -la /home/user'

For more information, see shell - Execute commands in nodes.

Running the copy module¶

The copy module copies a file on a local machine to remote locations. To copy files from remote locations to the local machine you would use the fetch module. If you need variable interpolation in copied files, use the template module. For more information, see copy - Copies files to remote locations.

The following example shows how to move a file from your deployment host to the /tmp directory on a set of remote machines:

$ ansible remote_machines -m copy -a 'src=/root/FILE '\
'dest=/tmp/FILE'

The fetch module gathers files from remote machines and stores them locally in a file tree, organized by hostname.

Note

This module transfers log files that might not be present, so a missing remote file will not be an error unless fail_on_missing is set to yes.

The following example shows the nova-compute.log file being pulled from a single Compute host:

root@libertylab:/opt/rpc-openstack/openstack-ansible/playbooks# ansible compute_hosts -m fetch -a 'src=/var/log/nova/nova-compute.log dest=/tmp'
aio1 | success >> {
    "changed": true,
    "checksum": "865211db6285dca06829eb2215ee6a897416fe02",
    "dest": "/tmp/aio1/var/log/nova/nova-compute.log",
    "md5sum": "dbd52b5fd65ea23cb255d2617e36729c",
    "remote_checksum": "865211db6285dca06829eb2215ee6a897416fe02",
    "remote_md5sum": null
}

root@libertylab:/opt/rpc-openstack/openstack-ansible/playbooks# ls -la /tmp/aio1/var/log/nova/nova-compute.log
-rw-r--r-- 1 root root 2428624 Dec 15 01:23 /tmp/aio1/var/log/nova/nova-compute.log
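
The local path in the output above follows the pattern dest/hostname/source path. The following minimal sketch reproduces that layout so that scripts can locate fetched files. The fetch_dest name is hypothetical, and the sketch assumes the default, non-flat behaviour of the fetch module.

```shell
# Hypothetical helper: fetch_dest DEST HOST SRC
# Prints the local path where a fetched file is stored: DEST/HOST/SRC.
fetch_dest() {
  local dest="${1%/}" host="$2" src="$3"
  printf '%s/%s/%s\n' "$dest" "$host" "${src#/}"
}

fetch_dest /tmp aio1 /var/log/nova/nova-compute.log
```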

Using tags¶

Tags are similar to the limit flag for groups, except tags are used to only run specific tasks within a playbook. For more information on tags, see Tags and Understanding ansible tags.

Ansible forks¶

The default MaxSessions setting for the OpenSSH Daemon is 10. Each Ansible fork makes use of a session. By default, Ansible sets the number of forks to 5. However, you can increase the number of forks used in order to improve deployment performance in large environments.

Note that more than 10 forks will cause issues for any playbooks that use delegate_to or local_action in the tasks. It is recommended that the number of forks is not raised when executing against the control plane, as this is where delegation is most often used.

The number of forks used may be changed on a permanent basis by setting the ANSIBLE_FORKS environment variable in your .bashrc file. Alternatively, it can be changed for a particular playbook execution by using the --forks CLI parameter. For example, the following executes the nova playbook against the control plane with 10 forks, then against the compute nodes with 50 forks.

# openstack-ansible --forks 10 os-nova-install.yml --limit compute_containers
# openstack-ansible --forks 50 os-nova-install.yml --limit compute_hosts
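
Before raising the fork count, it can help to confirm the MaxSessions value on the target hosts. This is a minimal sketch: max_sessions is a hypothetical helper name, it falls back to the default of 10 mentioned above when the keyword is not set, and it does not evaluate Match blocks.

```shell
# Hypothetical helper: read sshd configuration text from stdin and print the
# first MaxSessions value found, or the default of 10 if none is set.
# Commented-out lines such as "#MaxSessions 10" are ignored.
#
# Usage on a target host:
#   max_sessions < /etc/ssh/sshd_config
max_sessions() {
  awk '
    tolower($1) == "maxsessions" && !found { value = $2; found = 1 }
    END { print (found ? value : 10) }
  '
}
```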

For more information about forks, please see the following references:

OpenStack-Ansible Bug 1479812
Ansible forks entry for ansible.cfg
Ansible Performance Tuning

Container management¶

With Ansible, the OpenStack installation process is entirely automated using playbooks written in YAML. After installation, the settings configured by the playbooks can be changed and modified. Services and containers can shift to accommodate certain environment requirements. Scaling services is achieved by adjusting services within containers, or adding new deployment groups. It is also possible to destroy containers, if needed, after changes and modifications are complete.

Scale individual services¶

Individual OpenStack services, and other open source project services, run within containers. It is possible to scale out these services by modifying the /etc/openstack_deploy/openstack_user_config.yml file.

Open the /etc/openstack_deploy/openstack_user_config.yml file.
Access the deployment groups section of the configuration file. Underneath the deployment group name, add an affinity value line to scale the containers that run OpenStack services:
```
infra_hosts:
  infra1:
    ip: 10.10.236.100
    # Rabbitmq
    affinity:
      galera_container: 1
      rabbit_mq_container: 2
```
In this example, galera_container has a container value of one. In practice, any containers that do not need adjustment can remain at the default value of one, and should not be adjusted above or below the value of one.

The affinity value for each container is set at one by default. Adjust the affinity value to zero for situations where the OpenStack services housed within a specific container will not be needed when scaling out other required services.
Update the container number listed under the affinity configuration to the desired number. The above example has galera_container set at one and rabbit_mq_container at two, which scales RabbitMQ services, but leaves Galera services fixed.
Run the appropriate playbook commands after changing the configuration to create the new containers, and install the appropriate services.

For example, run the openstack-ansible lxc-containers-create.yml rabbitmq-install.yml commands from the openstack-ansible/playbooks directory to complete the scaling process described in the example above:
```
$ cd openstack-ansible/playbooks
$ openstack-ansible lxc-containers-create.yml rabbitmq-install.yml
```

Destroy and recreate containers¶

Resolving some issues may require destroying a container and rebuilding it from the beginning. It is possible to destroy and re-create a container with the lxc-containers-destroy.yml and lxc-containers-create.yml playbooks. These Ansible playbooks reside in the openstack-ansible/playbooks directory.

Navigate to the openstack-ansible directory.

Run the lxc-containers-destroy.yml and lxc-containers-create.yml playbooks, limiting each run to the container to be destroyed and re-created:

$ openstack-ansible lxc-containers-destroy.yml --limit "CONTAINER_NAME"
$ openstack-ansible lxc-containers-create.yml --limit "CONTAINER_NAME"

Replace CONTAINER_NAME with the target container.

Firewalls¶

OpenStack-Ansible does not configure firewalls for its infrastructure. It is up to the deployer to define the perimeter and its firewall configuration.

By default, OpenStack-Ansible relies on Ansible SSH connections, and needs the TCP port 22 to be opened on all hosts internally.

For more information on generic OpenStack firewall configuration, see the Firewalls and default ports page.

In each role’s documentation, you can find the default variables for the ports used within the scope of the role. Reviewing the documentation allows you to find the variable names if you want to use a different port.

Note

OpenStack-Ansible’s group vars conveniently expose the vars outside of the role scope in case you are relying on the OpenStack-Ansible groups to configure your firewall.

Finding ports for your external load balancer¶

As explained in the previous section, you can find (in each role’s documentation) the default variables used for the public interface endpoint ports.

For example, the os_glance documentation lists the variable glance_service_publicuri. This contains the port used for reaching the service externally. In this example, it is equal to glance_service_port, whose value is 9292.

As a hint, you could find the list of all public URI defaults by executing the following:

cd /etc/ansible/roles
grep -R -e publicuri -e port *

Note

HAProxy can be configured with OpenStack-Ansible. The automatically generated /etc/haproxy/haproxy.cfg file has enough information on the ports to open for your environment.

Prune Inventory Backup Archive¶

The inventory backup archive accumulates entries over time and eventually requires maintenance.

Bulk pruning¶

It is possible to do mass pruning of the inventory backup. The following example will prune all but the last 15 inventories from the running archive.

ARCHIVE="/etc/openstack_deploy/backup_openstack_inventory.tar"
tar -tvf ${ARCHIVE} | \
  head -n -15 | awk '{print $6}' | \
  xargs -n 1 tar -vf ${ARCHIVE} --delete
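
Because the deletion is not reversible, it can be useful to preview which entries would be removed first. The following is a minimal sketch: prune_candidates is a hypothetical helper name. It lists member names with tar -tf and uses awk instead of head -n -15, which makes the preview independent of GNU head. It does not modify the archive.

```shell
# Hypothetical helper: prune_candidates ARCHIVE [KEEP]
# Lists the archive members that would be deleted when keeping only the
# last KEEP entries (default 15). The archive itself is left untouched.
#
# Usage:
#   prune_candidates /etc/openstack_deploy/backup_openstack_inventory.tar 15
prune_candidates() {
  local archive="$1" keep="${2:-15}"
  tar -tf "$archive" | awk -v keep="$keep" '
    { line[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print line[i] }
  '
}
```

If the listed names look right, run the pruning pipeline above to delete them.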

Selective Pruning¶

To prune the inventory archive selectively, first identify the files you wish to remove by listing them out.

tar -tvf /etc/openstack_deploy/backup_openstack_inventory.tar

-rw-r--r-- root/root    110096 2018-05-03 10:11 openstack_inventory.json-20180503_151147.json
-rw-r--r-- root/root    110090 2018-05-03 10:11 openstack_inventory.json-20180503_151205.json
-rw-r--r-- root/root    110098 2018-05-03 10:12 openstack_inventory.json-20180503_151217.json

Now delete the targeted inventory archive.

tar -vf /etc/openstack_deploy/backup_openstack_inventory.tar --delete openstack_inventory.json-20180503_151205.json

openstack-ansible 30.0.2.dev4