Maintenance tasks¶
This chapter covers maintenance tasks that are specific to OpenStack-Ansible.
Galera cluster maintenance¶
Routine maintenance includes gracefully adding or removing nodes from the cluster without impacting operation and also starting a cluster after gracefully shutting down all nodes.
MySQL instances are restarted when creating a cluster, when adding a node, when the service is not running, or when changes are made to the /etc/mysql/my.cnf configuration file.
Verify cluster status¶
Run the following command and compare its output with the example below. The output shows the status of your cluster.
# ansible galera_container -m shell -a "mysql \
-e 'show status like \"%wsrep_cluster_%\";'"
node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (2)
node2_galera_container-49a47d25 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (2)
node4_galera_container-76275635 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 7
wsrep_cluster_size 1
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status Primary
In this example, only one node responded.
Gracefully shutting down the MariaDB service on all but one node allows the remaining operational node to continue processing SQL requests. When gracefully shutting down multiple nodes, perform the actions sequentially to retain operation.
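The status check can also be scripted. The following is a minimal sketch, not part of OpenStack-Ansible: galera_status_ok is a hypothetical helper that reads the ad-hoc command output on standard input and succeeds only when the expected number of nodes report that cluster size together with a Primary status. It assumes the output layout shown in the example above.

```shell
# Hypothetical helper (not shipped with OpenStack-Ansible).
# Reads the output of the "show status like wsrep_cluster_%" ad-hoc command
# on stdin. Succeeds only if exactly EXPECTED nodes report
# wsrep_cluster_size = EXPECTED and exactly EXPECTED nodes report
# wsrep_cluster_status = Primary.
galera_status_ok() {
    awk -v want="$1" '
        $1 == "wsrep_cluster_size"   && $2 == want      { size_ok++ }
        $1 == "wsrep_cluster_status" && $2 == "Primary" { primary_ok++ }
        END { exit !(size_ok == want && primary_ok == want) }
    '
}
```

For a three-node cluster, pipe the status command into galera_status_ok 3 and act on the exit status. Nodes that fail to respond print no status lines, so they count against the expected total.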
Start a cluster¶
Gracefully shutting down all nodes destroys the cluster. Starting or restarting a cluster from zero nodes requires creating a new cluster on one of the nodes.
Start a new cluster on the most advanced node. Change to the playbooks directory and check the seqno value in the grastate.dat file on all of the nodes:
# ansible galera_container -m shell -a "cat /var/lib/mysql/grastate.dat"
node2_galera_container-49a47d25 | success | rc=0 >>
# GALERA saved state
version: 2.1
uuid: 338b06b0-2948-11e4-9d06-bef42f6c52f1
seqno: 31
cert_index:
node3_galera_container-3ea2cbd3 | success | rc=0 >>
# GALERA saved state
version: 2.1
uuid: 338b06b0-2948-11e4-9d06-bef42f6c52f1
seqno: 31
cert_index:
node4_galera_container-76275635 | success | rc=0 >>
# GALERA saved state
version: 2.1
uuid: 338b06b0-2948-11e4-9d06-bef42f6c52f1
seqno: 31
cert_index:
In this example, all nodes in the cluster contain the same positive seqno values as they were synchronized just prior to graceful shutdown. If all seqno values are equal, any node can start the new cluster.
## for init
# /etc/init.d/mysql start --wsrep-new-cluster
## for systemd
# systemctl set-environment _WSREP_NEW_CLUSTER='--wsrep-new-cluster'
# systemctl start mysql
# systemctl set-environment _WSREP_NEW_CLUSTER=''
See also the upstream Starting a cluster page.
Starting the cluster can also be done with Ansible, using the shell module:
# ansible galera_container -m shell -a "/etc/init.d/mysql start --wsrep-new-cluster" --limit galera_container[0]
This command results in a cluster containing a single node. The wsrep_cluster_size value shows the number of nodes in the cluster.
node2_galera_container-49a47d25 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (111)
node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (2)
node4_galera_container-76275635 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 1
wsrep_cluster_size 1
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status Primary
Restart MariaDB on the other nodes (replace [0] from the previous Ansible command with [1:]) and verify that they rejoin the cluster.
node2_galera_container-49a47d25 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 3
wsrep_cluster_size 3
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status Primary
node3_galera_container-3ea2cbd3 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 3
wsrep_cluster_size 3
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status Primary
node4_galera_container-76275635 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 3
wsrep_cluster_size 3
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status Primary
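When the seqno values differ between nodes, the most advanced node is the one with the highest value. The following is a minimal sketch for picking it out of the grastate.dat output shown earlier in this section. The helper name most_advanced_node is hypothetical, and the parsing assumes the "host | status | rc" header lines printed by the ad-hoc command.

```shell
# Hypothetical helper (not shipped with OpenStack-Ansible).
# Reads the output of
#   ansible galera_container -m shell -a "cat /var/lib/mysql/grastate.dat"
# on stdin and prints the host with the highest seqno, followed by that seqno.
# On a tie, the first host seen wins.
most_advanced_node() {
    awk '
        / [|] / { host = $1 }          # header line: "<host> | success | rc=0 >>"
        $1 == "seqno:" {
            if (best == "" || $2 + 0 > max) { max = $2 + 0; best = host }
        }
        END { if (best != "") print best, max }
    '
}
```

A result with a seqno of -1 means that no node recorded a clean shutdown, so the helper only ranks the values and does not decide whether starting a new cluster is safe.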
Galera cluster recovery¶
Run the galera-install playbook using the galera-bootstrap tag to automatically recover a node or an entire environment.
Run the following command to bootstrap the cluster:
# openstack-ansible galera-install.yml --tags galera-bootstrap
The cluster comes back online after completion of this command. If this fails, review the sections on restarting the cluster and recovering the primary component in the Galera documentation, as they are invaluable for a full cluster recovery.
Recover a single-node failure¶
If a single node fails, the other nodes maintain quorum and continue to process SQL requests.
Change to the playbooks directory and run the following Ansible command to determine the failed node:
# ansible galera_container -m shell -a "mysql \
-e 'show status like \"%wsrep_cluster_%\";'"
node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (111)
node2_galera_container-49a47d25 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 17
wsrep_cluster_size 3
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status Primary
node4_galera_container-76275635 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 17
wsrep_cluster_size 3
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status Primary
In this example, node 3 has failed.
Restart MariaDB on the failed node and verify that it rejoins the cluster.
If MariaDB fails to start, run the mysqld command and perform further analysis on the output. As a last resort, rebuild the container for the node.
Recover a multi-node failure¶
When all but one node fail, the remaining node cannot achieve quorum and stops processing SQL requests. In this situation, failed nodes that recover cannot join the cluster because it no longer exists.
Run the following Ansible command to show the failed nodes:
# ansible galera_container -m shell -a "mysql \
-e 'show status like \"%wsrep_cluster_%\";'"
node2_galera_container-49a47d25 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (111)
node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (111)
node4_galera_container-76275635 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 18446744073709551615
wsrep_cluster_size 1
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status non-Primary
In this example, nodes 2 and 3 have failed. The remaining operational server indicates non-Primary because it cannot achieve quorum.
Run the following command to rebootstrap the operational node into the cluster:
# mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=yes';"
node4_galera_container-76275635 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 15
wsrep_cluster_size 1
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status Primary
node3_galera_container-3ea2cbd3 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (111)
node2_galera_container-49a47d25 | FAILED | rc=1 >>
ERROR 2002 (HY000): Can't connect to local MySQL server
through socket '/var/run/mysqld/mysqld.sock' (111)
The remaining operational node becomes the primary node and begins processing SQL requests.
Restart MariaDB on the failed nodes and verify that they rejoin the cluster:
# ansible galera_container -m shell -a "mysql \
-e 'show status like \"%wsrep_cluster_%\";'"
node3_galera_container-3ea2cbd3 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 17
wsrep_cluster_size 3
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status Primary
node2_galera_container-49a47d25 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 17
wsrep_cluster_size 3
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status Primary
node4_galera_container-76275635 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 17
wsrep_cluster_size 3
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status Primary
If MariaDB fails to start on any of the failed nodes, run the mysqld command and perform further analysis on the output. As a last resort, rebuild the container for the node.
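With many nodes, the failed or non-Primary ones can be hard to spot by eye in the status output. The following is a minimal sketch that lists them. The helper name galera_problem_nodes is hypothetical, and the parsing assumes the "host | FAILED | rc" and "wsrep_cluster_status" lines shown in the examples of this section.

```shell
# Hypothetical helper (not shipped with OpenStack-Ansible).
# Reads the wsrep_cluster_% status output on stdin and prints one line per
# problem node: "<host> unreachable" when the mysql client failed, or
# "<host> <status>" when the node reports a status other than Primary.
galera_problem_nodes() {
    awk '
        / [|] / {                      # header line: "<host> | <result> | rc=N >>"
            host = $1
            if ($3 == "FAILED") print host, "unreachable"
            next
        }
        $1 == "wsrep_cluster_status" && $2 != "Primary" { print host, $2 }
    '
}
```

An empty result means that every node that answered reports Primary. It does not confirm that the cluster has the expected size.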
Recover a complete environment failure¶
Restore from backup if all of the nodes in a Galera cluster fail (that is, they do not shut down gracefully). Change to the playbooks directory and run the following command to determine if all nodes in the cluster have failed:
# ansible galera_container -m shell -a "cat /var/lib/mysql/grastate.dat"
node3_galera_container-3ea2cbd3 | success | rc=0 >>
# GALERA saved state
version: 2.1
uuid: 338b06b0-2948-11e4-9d06-bef42f6c52f1
seqno: -1
cert_index:
node2_galera_container-49a47d25 | success | rc=0 >>
# GALERA saved state
version: 2.1
uuid: 338b06b0-2948-11e4-9d06-bef42f6c52f1
seqno: -1
cert_index:
node4_galera_container-76275635 | success | rc=0 >>
# GALERA saved state
version: 2.1
uuid: 338b06b0-2948-11e4-9d06-bef42f6c52f1
seqno: -1
cert_index:
All the nodes have failed if mysqld is not running on any of the nodes and all of the nodes contain a seqno value of -1.
If any single node has a positive seqno value, then that node can be used to restart the cluster. However, because there is no guarantee that each node has an identical copy of the data, we do not recommend restarting the cluster using the --wsrep-new-cluster option on one node.
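The seqno half of this check can be scripted. The following is a minimal sketch with a hypothetical helper name, all_seqno_unset. It only inspects the grastate.dat output, so whether mysqld is running on each node still has to be verified separately.

```shell
# Hypothetical helper (not shipped with OpenStack-Ansible).
# Reads the grastate.dat ad-hoc output on stdin and succeeds only if at
# least one seqno line was seen and every seqno equals -1.
all_seqno_unset() {
    awk '
        $1 == "seqno:" { seen++; if ($2 + 0 != -1) positive++ }
        END { exit !(seen > 0 && positive == 0) }
    '
}
```

A non-zero exit status means that at least one node recorded a usable seqno, or that no seqno lines were found at all.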
Rebuild a container¶
Recovering from certain failures requires rebuilding one or more containers.
Disable the failed node on the load balancer.
Note
Do not rely on the load balancer health checks to disable the node. If the node is not disabled, the load balancer sends SQL requests to it before it rejoins the cluster, which causes data inconsistencies.
Destroy the container and remove MariaDB data stored outside of the container:
# lxc-stop -n node3_galera_container-3ea2cbd3
# lxc-destroy -n node3_galera_container-3ea2cbd3
# rm -rf /openstack/node3_galera_container-3ea2cbd3/*
In this example, node 3 failed.
Run the host setup playbook to rebuild the container on node 3:
# openstack-ansible setup-hosts.yml -l node3 \
-l node3_galera_container-3ea2cbd3
The playbook restarts all other containers on the node.
Run the infrastructure playbook to configure the container specifically on node 3:
# openstack-ansible setup-infrastructure.yml \
--limit node3_galera_container-3ea2cbd3
Warning
The new container runs a single-node Galera cluster, which is a dangerous state because the environment contains more than one active database with potentially different data.
# ansible galera_container -m shell -a "mysql \
-e 'show status like \"%wsrep_cluster_%\";'"
node3_galera_container-3ea2cbd3 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 1
wsrep_cluster_size 1
wsrep_cluster_state_uuid da078d01-29e5-11e4-a051-03d896dbdb2d
wsrep_cluster_status Primary
node2_galera_container-49a47d25 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 4
wsrep_cluster_size 2
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status Primary
node4_galera_container-76275635 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 4
wsrep_cluster_size 2
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status Primary
Restart MariaDB in the new container and verify that it rejoins the cluster.
Note
In larger deployments, it may take some time for the MariaDB daemon to start in the new container. It will be synchronizing data from the other MariaDB servers during this time. You can monitor the status during this process by tailing the /var/log/mysql_logs/galera_server_error.log log file.
Lines starting with WSREP_SST will appear during the sync process and you should see a line with WSREP: SST complete, seqno: <NUMBER> if the sync was successful.
# ansible galera_container -m shell -a "mysql \
-e 'show status like \"%wsrep_cluster_%\";'"
node2_galera_container-49a47d25 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 5
wsrep_cluster_size 3
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status Primary
node3_galera_container-3ea2cbd3 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 5
wsrep_cluster_size 3
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status Primary
node4_galera_container-76275635 | success | rc=0 >>
Variable_name Value
wsrep_cluster_conf_id 5
wsrep_cluster_size 3
wsrep_cluster_state_uuid 338b06b0-2948-11e4-9d06-bef42f6c52f1
wsrep_cluster_status Primary
Enable the previously failed node on the load balancer.
RabbitMQ cluster maintenance¶
A RabbitMQ broker is a logical grouping of one or several Erlang nodes with each node running the RabbitMQ application and sharing users, virtual hosts, queues, exchanges, bindings, and runtime parameters. A collection of nodes is often referred to as a cluster. For more information on RabbitMQ clustering, see RabbitMQ cluster.
Within OpenStack-Ansible, all data and state required for operation of the RabbitMQ cluster are replicated across all nodes, including the message queues, which provides high availability. RabbitMQ nodes address each other using domain names. The hostnames of all cluster members must be resolvable from all cluster nodes, as well as from any machines where CLI tools related to RabbitMQ might be used. There are alternatives that may work in more restrictive environments. For more details on that setup, see Inet Configuration.
Note
There is currently an Ansible bug regarding HOSTNAME. If the host .bashrc holds a variable named HOSTNAME, the container to which the lxc_container module attaches inherits this variable and may set the wrong $HOSTNAME. See the Ansible fix, which will be released in Ansible version 2.3.
Create a RabbitMQ cluster¶
RabbitMQ clusters can be formed in two ways:
Manually with rabbitmqctl
Declaratively (list of cluster nodes in a config, with rabbitmq-autocluster, or rabbitmq-clusterer plugins)
Note
RabbitMQ brokers can tolerate the failure of individual nodes within the cluster. These nodes can start and stop at will as long as they have the ability to reach previously known members at the time of shutdown.
There are two types of nodes you can configure: disk and RAM nodes. Most commonly, you will use your nodes as disk nodes (preferred). RAM nodes are more of a special configuration used in performance clusters.
RabbitMQ nodes and the CLI tools use an Erlang cookie to determine whether or not they have permission to communicate. The cookie is a string of alphanumeric characters and can be as short or as long as you would like.
Note
The cookie value is a shared secret and should be protected and kept private.
The default location of the cookie on *nix environments is /var/lib/rabbitmq/.erlang.cookie or $HOME/.erlang.cookie.
Tip
While troubleshooting, if you notice one node is refusing to join the cluster, it is definitely worth checking if the Erlang cookie matches the other nodes. When the cookie is misconfigured (for example, not identical), RabbitMQ will log errors such as "Connection attempt from disallowed node" and "Could not auto-cluster". See clustering for more information.
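One way to compare cookies without printing the secret is to compare checksums of the cookie file across nodes. The following is a minimal sketch, assuming the RabbitMQ containers are reachable through an inventory group named rabbitmq_all and that sha256sum is available in the containers. Both are assumptions, and cookies_match is a hypothetical helper name.

```shell
# Hypothetical helper (not shipped with OpenStack-Ansible).
# Reads the output of an ad-hoc checksum command on stdin, for example:
#   ansible rabbitmq_all -m shell \
#     -a "sha256sum /var/lib/rabbitmq/.erlang.cookie" | cookies_match
# (the group name rabbitmq_all is an assumption about your inventory).
# Succeeds only if at least one checksum line was seen and all checksums
# are identical.
cookies_match() {
    awk '
        $2 ~ /erlang\.cookie$/ { seen[$1] = 1; lines++ }
        END {
            distinct = 0
            for (sum in seen) distinct++
            exit !(lines > 0 && distinct == 1)
        }
    '
}
```

Comparing checksums rather than the cookie contents keeps the shared secret out of terminal scrollback and logs.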
To form a RabbitMQ Cluster, you start by taking independent RabbitMQ brokers and re-configuring these nodes into a cluster configuration.
Using a 3 node example, you would be telling nodes 2 and 3 to join the cluster of the first node.
Log in to the second and third nodes and stop the RabbitMQ application.
Join the cluster, then restart the application:
rabbit2$ rabbitmqctl stop_app
Stopping node rabbit@rabbit2 ...done.
rabbit2$ rabbitmqctl join_cluster rabbit@rabbit1
Clustering node rabbit@rabbit2 with [rabbit@rabbit1] ...done.
rabbit2$ rabbitmqctl start_app
Starting node rabbit@rabbit2 ...done.
Check the RabbitMQ cluster status¶
Run rabbitmqctl cluster_status from either node.
You will see rabbit1 and rabbit2 are both running as before.
The difference is that, in the cluster status section of the output, both nodes are now grouped together:
rabbit1$ rabbitmqctl cluster_status
Cluster status of node rabbit@rabbit1 ...
[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2]}]},
{running_nodes,[rabbit@rabbit2,rabbit@rabbit1]}]
...done.
To add the third RabbitMQ node to the cluster, repeat the above process: stop the RabbitMQ application on the third node, join the cluster, and restart the application on the third node.
Execute rabbitmqctl cluster_status to see all 3 nodes:
rabbit1$ rabbitmqctl cluster_status
Cluster status of node rabbit@rabbit1 ...
[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2,rabbit@rabbit3]}]},
{running_nodes,[rabbit@rabbit3,rabbit@rabbit2,rabbit@rabbit1]}]
...done.
Stop and restart a RabbitMQ cluster¶
When you stop and start the cluster, keep in mind the order in which you shut the nodes down. The last node you stop needs to be the first node you start. This node is the master.
If you start the nodes out of order, a node may decide that the current master should not be the master and drop the messages to ensure that no new messages are queued while the real master is down.
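The ordering rule amounts to starting the nodes in the reverse of the order in which they were stopped. The following is a minimal sketch of that bookkeeping. The helper name start_order is hypothetical and the node names are placeholders.

```shell
# Hypothetical helper (not shipped with OpenStack-Ansible).
# Takes node names in the order they were stopped and prints them, one per
# line, in the order they should be started (last stopped first).
start_order() {
    printf '%s\n' "$@" |
        awk '{ name[NR] = $0 } END { for (i = NR; i >= 1; i--) print name[i] }'
}
```

For example, if the nodes were stopped in the order rabbit1, rabbit2, rabbit3, then start_order rabbit1 rabbit2 rabbit3 lists rabbit3 first.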
RabbitMQ and mnesia¶
Mnesia is a distributed database that RabbitMQ uses to store information about users, exchanges, queues, and bindings. Messages, however, are not stored in the database.
For more information about Mnesia, see the Mnesia overview.
To view the locations of important Rabbit files, see File Locations.
Repair a partitioned RabbitMQ cluster for a single-node¶
Invariably due to something in your environment, you are likely to lose a node in your cluster. In this scenario, multiple LXC containers on the same host are running Rabbit and are in a single Rabbit cluster.
If the host still shows as part of the cluster, but it is not running, execute:
# rabbitmqctl start_app
However, you may notice some issues with your application as clients may be trying to push messages to the un-responsive node. To remedy this, forget the node from the cluster by executing the following:
Ensure RabbitMQ is not running on the node:
# rabbitmqctl stop_app
On the Rabbit2 node, execute:
# rabbitmqctl forget_cluster_node rabbit@rabbit1
By doing this, the cluster can continue to run effectively and you can repair the failing node.
Important
Take care when you restart the node: it will still think it is part of the cluster and will require you to reset it. After resetting, you should be able to rejoin it to other nodes as needed.
rabbit1$ rabbitmqctl start_app
Starting node rabbit@rabbit1 ...
Error: inconsistent_cluster: Node rabbit@rabbit1 thinks it's clustered
with node rabbit@rabbit2, but rabbit@rabbit2 disagrees
rabbit1$ rabbitmqctl reset
Resetting node rabbit@rabbit1 ...done.
rabbit1$ rabbitmqctl start_app
Starting node rabbit@rabbit1 ...
...done.
Repair a partitioned RabbitMQ cluster for a multi-node cluster¶
The concepts that apply to a single-node cluster also apply to a multi-node cluster. The only difference is that the various nodes will actually be running on different hosts. The key things to keep in mind when dealing with a multi-node cluster are:
When the entire cluster is brought down, the last node to go down must be the first node to be brought online. If this does not happen, the nodes will wait 30 seconds for the last disc node to come back online, and fail afterwards.
If the last node to go offline cannot be brought back up, it can be removed from the cluster using the forget_cluster_node command.
If all cluster nodes stop in a simultaneous and uncontrolled manner, (for example, with a power cut) you can be left with a situation in which all nodes think that some other node stopped after them. In this case you can use the force_boot command on one node to make it bootable again.
Consult the rabbitmqctl manpage for more information.
Running ad-hoc Ansible plays¶
Being familiar with running ad-hoc Ansible commands is helpful when operating your OpenStack-Ansible deployment. For a review, we can look at the structure of the following ansible command:
$ ansible example_group -m shell -a 'hostname'
This command calls on Ansible to run the shell module (-m shell) against the hosts in example_group, passing the hostname command as the module argument (-a).
You can substitute example_group for any groups you may have defined. For example, if you had compute_hosts in one group and infra_hosts in another, supply either group name and run the command. You can also use the * wildcard if you only know the first part of the group name. For instance, if you know the group name starts with compute, you would use compute_h*. The -m argument selects the module.
Modules can be used to control system resources or handle the execution of system commands. For more information about modules, see Module Index and About Modules.
If you need to run a particular command against a subset of a group, you could use the limit flag -l. For example, if a compute_hosts group contained compute1, compute2, compute3, and compute4, and you only needed to execute a command on compute1 and compute4, you could limit the command as follows:
$ ansible compute_hosts -m shell -a 'hostname' -l compute1,compute4
Note
Each host is comma-separated with no spaces.
Note
Run the ad-hoc Ansible commands from the openstack-ansible/playbooks directory.
For more information, see Inventory and Patterns.
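When the host list for the limit flag is built in a script, it is easy to introduce stray spaces. The following is a minimal sketch that joins host names with commas only. The helper name limit_list is hypothetical.

```shell
# Hypothetical helper (not shipped with OpenStack-Ansible).
# Joins its arguments with commas and no spaces, the form expected by -l.
limit_list() {
    # "$*" joins the arguments with the first character of IFS; the
    # subshell keeps the IFS change from leaking into the caller.
    ( IFS=,; printf '%s\n' "$*" )
}
```

The result can be passed directly to the limit flag, for example -l "$(limit_list compute1 compute4)".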
Running the shell module¶
The two most common modules used are the shell and copy modules. The shell module takes the command name followed by a list of space-delimited arguments. It is almost like the command module, but runs the command through a shell (/bin/sh) on the remote node.
For example, you could use the shell module to check the amount of disk space on a set of Compute hosts:
$ ansible compute_hosts -m shell -a 'df -h'
To check on the status of your Galera cluster:
$ ansible galera_container -m shell -a "mysql \
-e 'show status like \"%wsrep_cluster_%\";'"
When a module is used in an ad-hoc command, some of its parameters are not needed. For example, there is no need to use the chdir parameter (chdir=/home/user ls) when running Ansible from the CLI, because the path can be passed to the command directly:
$ ansible compute_hosts -m shell -a 'ls -la /home/user'
For more information, see shell - Execute commands in nodes.
Running the copy module¶
The copy module copies a file on a local machine to remote locations. To copy files from remote locations to the local machine you would use the fetch module. If you need variable interpolation in copied files, use the template module. For more information, see copy - Copies files to remote locations.
The following example shows how to copy a file from your deployment host to the /tmp directory on a set of remote machines:
$ ansible remote_machines -m copy -a 'src=/root/FILE '\
'dest=/tmp/FILE'
The fetch module gathers files from remote machines and stores them locally in a file tree, organized by hostname.
Note
This module transfers log files that might not be present, so a missing remote file will not be an error unless fail_on_missing is set to yes.
The following example shows the nova-compute.log file being pulled from a single Compute host:
root@libertylab:/opt/rpc-openstack/openstack-ansible/playbooks# ansible compute_hosts -m fetch -a 'src=/var/log/nova/nova-compute.log dest=/tmp'
aio1 | success >> {
"changed": true,
"checksum": "865211db6285dca06829eb2215ee6a897416fe02",
"dest": "/tmp/aio1/var/log/nova/nova-compute.log",
"md5sum": "dbd52b5fd65ea23cb255d2617e36729c",
"remote_checksum": "865211db6285dca06829eb2215ee6a897416fe02",
"remote_md5sum": null
}
root@libertylab:/opt/rpc-openstack/openstack-ansible/playbooks# ls -la /tmp/aio1/var/log/nova/nova-compute.log
-rw-r--r-- 1 root root 2428624 Dec 15 01:23 /tmp/aio1/var/log/nova/nova-compute.log
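As the output above shows, the fetched file lands under the destination directory, then the hostname, then the full remote path. The following is a minimal sketch that computes that local path for use in a follow-up script. The helper name fetched_path is hypothetical, and it assumes the default layout of the fetch module, with no option that flattens the tree.

```shell
# Hypothetical helper (not shipped with OpenStack-Ansible).
# Usage: fetched_path DEST HOST REMOTE_PATH
# Prints the local path where the fetch module stores REMOTE_PATH (an
# absolute path) for HOST, assuming the default dest/host/path layout.
fetched_path() {
    # ${1%/} drops a single trailing slash from DEST, if present.
    printf '%s/%s%s\n' "${1%/}" "$2" "$3"
}
```

Calling fetched_path /tmp aio1 /var/log/nova/nova-compute.log prints the same path as the dest value in the example output.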
Ansible forks¶
The default MaxSessions setting for the OpenSSH Daemon is 10. Each Ansible fork makes use of a session. By default, Ansible sets the number of forks to 5. However, you can increase the number of forks used in order to improve deployment performance in large environments.
Note that more than 10 forks will cause issues for any playbooks which use delegate_to or local_action in the tasks. It is recommended that the number of forks is not raised when executing against the control plane, as this is where delegation is most often used.
The number of forks used may be changed on a permanent basis by setting the ANSIBLE_FORKS environment variable in your .bashrc file. Alternatively, it can be changed for a particular playbook execution by using the --forks CLI parameter. For example, the following executes the nova playbook against the control plane with 10 forks, then against the compute nodes with 50 forks.
# openstack-ansible --forks 10 os-nova-install.yml --limit compute_containers
# openstack-ansible --forks 50 os-nova-install.yml --limit compute_hosts
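To avoid exceeding the session limit by accident in wrapper scripts, the requested fork count can be capped. The following is a minimal sketch. The helper name capped_forks is hypothetical, and the default cap of 10 is taken from the MaxSessions default mentioned above, so adjust it if your sshd configuration differs.

```shell
# Hypothetical helper (not shipped with OpenStack-Ansible).
# Usage: capped_forks REQUESTED [LIMIT]
# Prints REQUESTED, or LIMIT if REQUESTED is larger. LIMIT defaults to 10,
# the OpenSSH MaxSessions default.
capped_forks() {
    requested="$1"
    limit="${2:-10}"
    if [ "$requested" -gt "$limit" ]; then
        printf '%s\n' "$limit"
    else
        printf '%s\n' "$requested"
    fi
}
```

For control plane runs, the value can be passed as --forks "$(capped_forks 50)", which yields 10 with the default limit.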
For more information about forks, please see the following references:
OpenStack-Ansible Bug 1479812
Ansible forks entry for ansible.cfg
Container management¶
With Ansible, the OpenStack installation process is entirely automated using playbooks written in YAML. After installation, the settings configured by the playbooks can be changed and modified. Services and containers can shift to accommodate certain environment requirements. Scaling services is achieved by adjusting services within containers, or by adding new deployment groups. It is also possible to destroy containers, if needed, after changes and modifications are complete.
Scale individual services¶
Individual OpenStack services, and other open source project services, run within containers. It is possible to scale out these services by modifying the /etc/openstack_deploy/openstack_user_config.yml file.
Open the /etc/openstack_deploy/openstack_user_config.yml file.
Access the deployment groups section of the configuration file. Underneath the deployment group name, add an affinity value line to scale the containers that run OpenStack services:
infra_hosts:
  infra1:
    ip: 10.10.236.100
    # Rabbitmq
    affinity:
      galera_container: 1
      rabbit_mq_container: 2
In this example, galera_container has a container value of one. In practice, any containers that do not need adjustment can remain at the default value of one, and should not be adjusted above or below the value of one.
The affinity value for each container is set at one by default. Adjust the affinity value to zero for situations where the OpenStack services housed within a specific container will not be needed when scaling out other required services.
Update the container number listed under the affinity configuration to the desired number. The above example has galera_container set at one and rabbit_mq_container at two, which scales RabbitMQ services, but leaves Galera services fixed.
Run the appropriate playbook commands after changing the configuration to create the new containers, and install the appropriate services.
For example, run the openstack-ansible lxc-containers-create.yml rabbitmq-install.yml commands from the openstack-ansible/playbooks directory to complete the scaling process described in the example above:
$ cd openstack-ansible/playbooks
$ openstack-ansible lxc-containers-create.yml rabbitmq-install.yml
Destroy and recreate containers¶
Resolving some issues may require destroying a container, and rebuilding that container from the beginning. It is possible to destroy and re-create a container with the lxc-containers-destroy.yml and lxc-containers-create.yml playbooks. These playbooks reside in the openstack-ansible/playbooks directory.
Navigate to the openstack-ansible directory.
Run the openstack-ansible lxc-containers-destroy.yml and lxc-containers-create.yml commands, specifying the container to be destroyed and re-created.
$ openstack-ansible lxc-containers-destroy.yml --limit "CONTAINER_NAME"
$ openstack-ansible lxc-containers-create.yml --limit "CONTAINER_NAME"
Replace CONTAINER_NAME with the target container.
Firewalls¶
OpenStack-Ansible does not configure firewalls for its infrastructure. It is up to the deployer to define the perimeter and its firewall configuration.
By default, OpenStack-Ansible relies on Ansible SSH connections, and needs the TCP port 22 to be opened on all hosts internally.
For more information on generic OpenStack firewall configuration, see Firewalls and default ports.
In each role's documentation, you can find the default variables for the ports used within the scope of the role. Reviewing the documentation allows you to find the variable names if you want to use a different port.
Note
OpenStack-Ansible’s group vars conveniently expose the vars outside of the role scope in case you are relying on the OpenStack-Ansible groups to configure your firewall.
Finding ports for your external load balancer¶
As explained in the previous section, you can find the default variables used for the public interface endpoint ports in each role's documentation.
For example, the os_glance documentation lists the variable glance_service_publicuri. This contains the port used for reaching the service externally. In this example, it is equal to glance_service_port, whose value is 9292.
As a hint, you could find the list of all public URI defaults by executing the following:
cd /etc/ansible/roles
grep -R -e publicuri -e port *
Note
HAProxy can be configured with OpenStack-Ansible. The automatically generated /etc/haproxy/haproxy.cfg file has enough information on the ports to open for your environment.
Prune Inventory Backup Archive¶
The inventory backup archive accumulates entries over time and will eventually require maintenance.
Bulk pruning¶
It is possible to do mass pruning of the inventory backup. The following example will prune all but the last 15 inventories from the running archive.
ARCHIVE="/etc/openstack_deploy/backup_openstack_inventory.tar"
tar -tvf ${ARCHIVE} | \
head -n -15 | awk '{print $6}' | \
xargs -n 1 tar -vf ${ARCHIVE} --delete
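Because the deletion is not reversible, it can be useful to preview which archive members would be removed first. The following is a minimal sketch of such a dry run. The helper name prune_candidates is hypothetical. Like the command above, it assumes that the member name is the sixth column of the tar -tvf listing and that the listing is in oldest-first order.

```shell
# Hypothetical helper (not shipped with OpenStack-Ansible).
# Usage: tar -tvf "$ARCHIVE" | prune_candidates [KEEP]
# Reads a "tar -tvf" listing on stdin and prints the names of all members
# except the last KEEP (default 15). Nothing is deleted.
prune_candidates() {
    keep="${1:-15}"
    awk -v keep="$keep" '
        { name[NR] = $6 }              # sixth column holds the member name
        END { for (i = 1; i <= NR - keep; i++) print name[i] }
    '
}
```

If the archive holds no more than KEEP inventories, the helper prints nothing.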
Selective Pruning¶
To prune the inventory archive selectively, first identify the files you wish to remove by listing them out.
tar -tvf /etc/openstack_deploy/backup_openstack_inventory.tar
-rw-r--r-- root/root 110096 2018-05-03 10:11 openstack_inventory.json-20180503_151147.json
-rw-r--r-- root/root 110090 2018-05-03 10:11 openstack_inventory.json-20180503_151205.json
-rw-r--r-- root/root 110098 2018-05-03 10:12 openstack_inventory.json-20180503_151217.json
Now delete the targeted inventory archive.
tar -vf /etc/openstack_deploy/backup_openstack_inventory.tar --delete openstack_inventory.json-20180503_151205.json