Test Environment¶
This document should give you a good idea of what you can count on in the test environments managed by the OpenDev team. This information may be useful when creating new jobs or debugging existing jobs.
Unprivileged Single Use VMs¶
All jobs currently run on one or more of these nodes. These are single use VMs booted in OpenStack clouds.
Each single use VM has these attributes which you can count on:
Every instance has a public IP address. This may be an IPv4 address, an IPv6 address, or both.
You may not get both; it is entirely valid for one instance to have only a public IPv6 address and for another to have only a public IPv4 address.
In some cases the public IPv4 address is provided via NAT and the instance will only see a private IPv4 address. In some cases instances may have both a public and a private IPv4 address.
It is also possible that these addresses are on multiple network interfaces.
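For example, a job's Ansible tasks can use standard facts to detect which address families a node actually provides before configuring anything address-specific. This is only a minimal sketch; the have_ipv4 and have_ipv6 variable names are illustrative:

    - name: Record which address families this node provides
      ansible.builtin.set_fact:
        have_ipv4: "{{ ansible_default_ipv4.address is defined }}"
        have_ipv6: "{{ ansible_default_ipv6.address is defined }}"

    - name: Note when no global IPv6 address is present
      ansible.builtin.debug:
        msg: "This node has no global IPv6 address; skipping IPv6-specific setup"
      when: not (have_ipv6 | bool)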
CPUs are all running x86-64 unless you explicitly choose an AArch64 (64-bit ARM) label or nodeset.
There is at least 8GB of system memory available on default node types, though we have limited availability of nodes with flavors providing up to 32GB memory.
There is at least 80GB of disk available. This disk may not all be exposed in a single filesystem partition, and so not all of it may be mounted at /. Any additional disk can be partitioned, formatted, and mounted by the root user. To give you an idea of what this can look like, most clouds just give us an 80GB or larger /. One cloud gives us a 40GB / and an 80GB /opt. Generally you will want to write large things to /opt to take advantage of the available disk.
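As a sketch of how a job might prepare an extra device, the following Ansible tasks partition, format, and mount a second disk. The device name /dev/vdb and the mount point /opt/work are assumptions that vary by cloud, and the tasks rely on the community.general and ansible.posix collections:

    - name: Create a single partition on the extra device (assumed to be /dev/vdb)
      become: true
      community.general.parted:
        device: /dev/vdb
        number: 1
        state: present

    - name: Format the new partition as ext4
      become: true
      community.general.filesystem:
        dev: /dev/vdb1
        fstype: ext4

    - name: Mount the partition under /opt/work (path illustrative)
      become: true
      ansible.posix.mount:
        path: /opt/work
        src: /dev/vdb1
        fstype: ext4
        state: mounted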
Swap is not guaranteed to be present. Some clouds give us swap and others do not. Some jobs will create swap, either using a second device if available or using a file otherwise. Be aware you may need to add tasks to create swap within your job if you require it.
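If your job does need swap, a minimal sketch of such a task might look like the following; the 2G size and the /swapfile path are illustrative:

    - name: Create and enable a swap file when the node has no active swap
      become: true
      ansible.builtin.shell: |
        fallocate -l 2G /swapfile
        chmod 600 /swapfile
        mkswap /swapfile
        swapon /swapfile
      args:
        creates: /swapfile
      when: ansible_swaptotal_mb | int == 0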
Filesystems are ext4. If you need other filesystems you can create them on files mounted via loop devices.
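For example, a job that needs XFS could do something along these lines. The paths and size are illustrative, and this assumes xfsprogs is installed (which root access lets you arrange):

    - name: Create and loop-mount a file-backed XFS filesystem
      become: true
      ansible.builtin.shell: |
        truncate -s 4G /opt/xfs.img
        mkfs.xfs /opt/xfs.img
        mkdir -p /mnt/xfs
        mount -o loop /opt/xfs.img /mnt/xfs
      args:
        creates: /opt/xfs.img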
Package mirrors and/or caches for PyPI, NPM, Ubuntu, Debian, Fedora and CentOS (including EPEL) are provided and preconfigured on these instances before any jobs start. We also have mirrors for Ceph and Ubuntu Cloud Archive that jobs must opt into using (details for these are written to disk on the test instances but are disabled by default).
Because these instances are single use we are able to give jobs full root access to them. This means you can install system packages, modify partition tables, and so on. Note that if you reboot the test instances you will need to restart the zuul-console process.
If jobs need to perform privileged actions they can do so using Zuul's secrets. Things like AFS access tokens or Docker Hub credentials can be stored in Zuul secrets and then used by jobs to perform privileged actions requiring this data. Please refer to the Zuul documentation for more information.
Known Differences to Watch Out For¶
Underlying hypervisors are not all the same. You may run into KVM or Xen and possibly others depending on the cloud in use.
CPU count, speed, and supported processor flags differ, sometimes even within the same cloud region.
Nested virt is not available in all clouds, and in clouds where it is enabled we have observed a higher rate of crashed test VMs when using it. As a result we discourage general use of nested virt.
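If a job does want to take advantage of hardware virtualization where it happens to be present, one cautious approach is to probe for /dev/kvm and fall back to plain emulation otherwise. This is only a sketch, and the libvirt_type variable name is illustrative:

    - name: Check whether /dev/kvm exists on this node
      ansible.builtin.stat:
        path: /dev/kvm
      register: kvm_device

    - name: Choose a virtualization type based on the probe (variable name illustrative)
      ansible.builtin.set_fact:
        libvirt_type: "{{ 'kvm' if kvm_device.stat.exists else 'qemu' }}"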
Some clouds give us multiple network interfaces, some only give us one. In the case of multiple network interfaces some clouds give all of them Internet routable addresses and some others do not.
Geographic location is widely variable. We have instances all across North America and in Europe. This may affect network performance between instances and geographically distant network resources.
Some Internet protocols may be blocked in some clouds. Specifically, we have had problems with GRE and multicast IP. You can rely on TCP, UDP, and ICMP being functional in all of our clouds.
Network interface MTU of 1500 is not guaranteed. Some clouds give us smaller MTUs due to use of overlay networking. Test jobs should check interface MTUs and use an appropriate value for the current instance if creating new interfaces or bridges.
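As an example, a job could discover the MTU of the interface holding the default route and reuse it when creating a bridge. This sketch assumes the node has a default IPv4 route (adjust for IPv6-only nodes), and the br-test bridge name is illustrative:

    - name: Discover the MTU of the interface with the default route
      ansible.builtin.command: cat /sys/class/net/{{ ansible_default_ipv4.interface }}/mtu
      register: primary_mtu

    - name: Create a bridge that matches the host MTU (bridge name illustrative)
      become: true
      ansible.builtin.shell: |
        ip link add name br-test type bridge
        ip link set dev br-test mtu {{ primary_mtu.stdout | trim }}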
Why are Jobs for Changes Queued for a Long Time?¶
We have a finite number of resources to run jobs on. We process jobs for changes in order based on a priority queuing system. This system assigns test resources to Zuul queues based on the total number of changes in each queue. Changes at the heads of these queues are assigned resources before those at the ends of the queues.
We have done this to ensure that large projects with many changes and long running jobs do not starve small projects with few changes and short jobs.
In order to make the queues run quicker there are several variables we can change:
Lower demand. Fewer changes and/or jobs result in less demand for resources, increasing availability for the changes that remain.
Reduce job resource costs. Reducing job runtime means those resources can be reused sooner by other jobs. Keep in mind that multinode jobs use an integer multiple of the resources of a single-node job. You should only use multinode jobs where necessary to test specific interactions or to fit a complex test case into the resources we have.
Improve job reliability. If jobs fail because the tests or the software under test are unreliable, then we have to run more jobs to successfully merge our software. This effect is compounded in our gate queues because any time a change fails we must evict it from the queue, rebuild the queue without that change, and then restart all jobs for the changes behind it.
Keep in mind that we are also dogfooding OpenStack to run OpenStack's CI system. This means that a more reliable OpenStack is better able to provide resources to our CI system. Fixing OpenStack in this case is a win-win situation.
Add resources to our pools. If we have more total resources then we will have more to spread around.
In general, we would like to see our software perform the testing that the developers feel is necessary, but we should do so responsibly. What this means is that, instead of deleting jobs or ignoring changes, we should improve our test reliability to ensure changes exit queues as quickly as possible at minimal resource cost. This in turn ensures the changes behind them are able to get resources quickly.
We are also always happy to add resources if they are available, but the priority from the project should be to ensure we are using what we do have responsibly.
Handling Zuul Secrets¶
Zuul secrets are the expected means of safely incorporating secret data (e.g., passwords or cryptographic keys) into job definitions. See the Using Secrets section of the Project Driver’s Guide chapter for some basic user guidance on this feature.
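As a rough sketch of the shape this takes, a project might define a secret and attach it to a job roughly as follows. The names here are illustrative, the ciphertext placeholder stands in for output from Zuul's encryption tooling, and the secret is then exposed to the job's playbooks as the Ansible variable named in the job definition (dockerhub_credentials in this sketch):

    - secret:
        name: example-dockerhub
        data:
          username: exampleuser
          password: !encrypted/pkcs1-oaep
            - "<ciphertext generated for this project's public key>"

    - job:
        name: example-upload-image
        secrets:
          - name: dockerhub_credentials
            secret: example-dockerhub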
Credentials and similar secrets encrypted for the per-project keys Zuul uses cannot be decrypted except by Zuul and (by extension) the root sysadmins operating the Zuul service and maintaining the job nodes where those secrets are used. By policy, these sysadmins will not deliberately decrypt secrets or access decrypted secrets, aside from non-production test vectors used to ensure the feature is working correctly. They will not under any circumstances provide decrypted copies of your project's secrets on request. As a result, you cannot treat the encrypted copy as a backup; instead, find ways to safely maintain (and if necessary share) your own backup copies if you are unable to easily revoke and replace the secrets when lost.