Note
Much of this document discusses upgrade considerations for the Neutron reference implementation using Neutron’s agents. It’s expected that each Neutron plugin provides its own documentation that discusses upgrade considerations specific to that choice of backend. For example, OVN does not use Neutron agents, but does have a local controller that runs on each compute node. OVN supports rolling upgrades, but information about how that works should be covered in the documentation for networking-ovn, the OVN Neutron plugin.
There are two general upgrade scenarios supported by Neutron:

- all services are shut down, the code is upgraded, then all services are started again;
- services are upgraded gradually, one component at a time.

The latter is the preferred way to upgrade an OpenStack cloud, since it allows for more granularity and less service downtime. This scenario is usually called a ‘rolling upgrade’.
Rolling upgrades imply that, during some interval of time, services of different code versions run and interact in the same cloud. This puts multiple constraints on the software.
More information on rolling upgrades in OpenStack is available in the community documentation.
Those requirements are achieved in Neutron by:

- having neutron-server carry backwards compatibility code to deal with messaging payloads sent by older agents;
- isolating database access to a single service (neutron-server).
To simplify the matter, it’s always assumed that the order of service upgrades is as follows:

1. first, all neutron-server instances are upgraded;
2. then, if applicable, Neutron agents are upgraded.
This approach allows us to avoid backwards compatibility code on the agent side and is in line with other OpenStack projects that support rolling upgrades (specifically, nova).
Neutron-server is the very first component that should be upgraded to the new code. It’s also the only component that relies on the new database schema being present; other components communicate with the cloud through AMQP and hence do not depend on a particular database state.
Database upgrades are implemented with alembic migration chains.
Database upgrade is split into two parts:

1. neutron-db-manage upgrade --expand
2. neutron-db-manage upgrade --contract
Each part represents a separate alembic branch.
The former step can be executed while old neutron-server code is running. The latter step requires all neutron-server instances to be shut down. Once it’s complete, neutron-servers can be started again.
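To illustrate the difference between the two branches, here is a minimal sketch of what such migration scripts typically look like; the table and column names are hypothetical, and each script would live in its own file under the respective alembic branch:

    # expand script: purely additive, safe to apply while 'old'
    # neutron-server code is still running
    from alembic import op
    import sqlalchemy as sa

    def upgrade():
        # old code simply ignores a new nullable column
        op.add_column('example_table',
                      sa.Column('new_attr', sa.String(64), nullable=True))

    # contract script (a separate file on the contract branch):
    #
    # def upgrade():
    #     # destructive: 'old' code may still read this column, which is
    #     # why contract scripts require neutron-server to be shut down
    #     op.drop_column('example_table', 'old_attr')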
Note
Full shutdown of neutron-server instances can be skipped depending on whether there are pending contract scripts not applied to the database:
$ neutron-db-manage has_offline_migrations
The command will return a message if there are pending contract scripts.
Note
This section does not apply when the cloud does not use AMQP agents to provide networking services to instances. In that case, other backend-specific upgrade instructions may also apply.
Once neutron-server services are restarted with the new database schema and the new code, it’s time to upgrade Neutron agents.
Note that, in the meantime, neutron-server should be able to serve AMQP messages sent by the older versions of agents that are still part of the cloud.
The recommended order of agent upgrade (per node) is:

1. first, the L2 agent (Open vSwitch, Linux bridge, or SR-IOV);
2. then, all other agents (L3, DHCP, metadata, ...).

The rationale for this order is that the L2 agent is usually responsible for wiring ports for other agents to use, so it’s better to let it do its job first and then proceed with the other agents, which will use the already configured ports for their needs.
Each network/compute node can have its own upgrade schedule that is independent of other nodes.
Since it’s always assumed that the neutron-server component is upgraded before agents, only the former should handle both old and new RPC versions. The implication is that no code handling oslo.messaging UnsupportedVersion exceptions belongs in agent code.
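As a rough sketch of the server-side half of that pattern (the endpoint and method names here are illustrative, not Neutron’s actual RPC API): when a method gains a new argument, the server-side endpoint bumps its target version but keeps a default value, so calls from agents that still speak the old version continue to work:

    import oslo_messaging

    class ExampleServerRpcCallback(object):
        # 1.0 - initial version
        # 1.1 - added the 'host' argument to get_device_details()
        target = oslo_messaging.Target(version='1.1')

        def get_device_details(self, context, device=None, host=None):
            # 'host' defaults to None, so payloads sent by 1.0 agents,
            # which do not include it, are still served
            return {'device': device, 'host': host}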
For notifications that are issued by neutron-server to listening agents, special consideration is needed to support rolling upgrades. In this case, a newer controller sends newer payload to older agents.
Until we have a proper RPC version pinning feature to enforce the older payload format during upgrade (as implemented in other projects like nova), we keep our agents resilient against unknown arguments sent as part of server notifications. This is achieved by consistently capturing those unknown arguments with keyword arguments and ignoring them on the agent side, and by not enforcing newer RPC entry point versions on the server side.

This approach is not ideal, because it makes the RPC API less strict. That’s why other approaches should be considered for notifications in the future.
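A minimal sketch of the agent-side half of this pattern; the handler name mirrors typical agent notification methods, but the exact signature and helper are illustrative:

    class ExampleAgentNotificationHandler(object):

        def port_update(self, context, port=None, **kwargs):
            # a newer neutron-server may include arguments this agent
            # version does not know about; they are captured by **kwargs
            # and deliberately ignored instead of raising a TypeError
            self._process_port(port)

        def _process_port(self, port):
            # hypothetical helper handling the known part of the payload
            pass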
An RPC interface is defined by its name, version, and (named) arguments that it accepts. There are no strict guarantees that arguments will have expected types or meaning, as long as they are serializable.
To provide better compatibility guarantees for rolling upgrades, RPC interfaces could also define a specific format for the arguments they accept. In the OpenStack world, this is usually implemented using the oslo.versionedobjects library, relying on it to define the serialized form of arguments passed over the AMQP wire.

Note that Neutron has not yet adopted the oslo.versionedobjects library for its RPC interfaces (except for the QoS feature).
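For reference, a hedged sketch of how such an argument format could be defined with oslo.versionedobjects; the object and its fields are hypothetical:

    from oslo_utils import versionutils
    from oslo_versionedobjects import base as obj_base
    from oslo_versionedobjects import fields as obj_fields

    @obj_base.VersionedObjectRegistry.register
    class ExampleObject(obj_base.VersionedObject):
        # Version 1.0: initial version
        # Version 1.1: added 'description'
        VERSION = '1.1'

        fields = {
            'id': obj_fields.UUIDField(),
            'description': obj_fields.StringField(nullable=True),
        }

        def obj_make_compatible(self, primitive, target_version):
            # downgrade the serialized form for peers that only
            # understand version 1.0
            super(ExampleObject, self).obj_make_compatible(
                primitive, target_version)
            if versionutils.convert_version_to_tuple(target_version) < (1, 1):
                primitive.pop('description', None)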
A backend software upgrade should not result in any data plane disruption. For example, the Open vSwitch L2 agent should not reset flows or rewire ports; the Neutron L3 agent should not delete namespaces left by an older version of the agent; the Neutron DHCP agent should not require immediate DHCP lease renewal; and so on.

The same considerations apply to setups that do not rely on agents. For example, an OpenDaylight or OVN controller should not break data plane connectivity during its upgrade process.
Grenade is the OpenStack project that is designed to validate upgrade scenarios.
Currently, only the offline (non-rolling) upgrade scenario is validated in the Neutron gate. The upgrade scenario consists of the following steps:

1. the ‘old’ cloud is set up using a stable release;
2. all services are stopped;
3. the code is updated to the new version, while the ‘old’ configuration files are retained;
4. database migration scripts are applied;
5. all services are started again and the ‘new’ cloud is validated.
The scenario validates that no configuration option names are changed in one cycle. More generally, it validates that the ‘new’ cloud is capable of running using the ‘old’ configuration files. It also validates that database migration scripts can be executed.
The scenario does not validate AMQP versioning compatibility.
Other projects (for example, Nova) have so-called ‘partial’ grenade jobs, where some services are left running using the old version of the code. Such a job would be needed in the Neutron gate to validate rolling upgrades for the project. Until then, it’s up to reviewers to catch compatibility issues in patches under review.
Another gap in testing concerns the split migration script branches. It’s assumed that an ‘old’ cloud can successfully run after ‘expand’ migration scripts from the ‘new’ cloud are applied to its database, but this is not validated in the gate.
There are several upgrade-related gotchas that should be tracked by reviewers.

First things first, a piece of general advice to reviewers: make sure new code does not violate requirements set by the global OpenStack deprecation policy.
Now to specifics: