OpenStack operators require information about the status and health of the Neutron system. While it is possible for an operator to pull all of the interface counters from compute and network nodes, today there is no capability to aggregate that information to provide comprehensive counters for each project within Neutron. Neutron instrumentation sets out to meet this need.
Neutron instrumentation can be broken down into three major pieces:
While instrumentation might also be considered to include asynchronous event notifications, like fault detection, this is considered out of scope for the following two reasons:
The existing metering label and rule extension provides the ability to collect traffic information on a per-CIDR basis. Therefore, a possible implementation of instrumentation would be to use per-instance metering rules for all IP addresses in both directions. However, the information collected by metering rules is focused more on billing and so does not have the desired granularity (i.e. it counts transmitted packets without keeping track of what caused packets to fail).
The first step is to consider what data to collect. In the absence of a standard, it is proposed to use the information set defined in [RFC2863] and [RFC4293]. This proposal should not be read as implying that Neutron instrumentation data will be browsable via a MIB browser as that would be a potential Data Consumption model.
[RFC2863] | https://tools.ietf.org/html/rfc2863 |
[RFC4293] | https://tools.ietf.org/html/rfc4293 |
For the reference implementation (Nova/VIF, OVS, and Linux Bridge), this section identifies what data is already available and how it can be mapped into the structures defined by the RFCs. Other plugins are welcome to define their own data sets, their own mappings to the data sets defined in the referenced RFCs, or both.
The focus here is on what is available from “stock” Linux and OpenStack. Additional statistics may become available if other items like NetFlow or sFlow are added to the mix, but those should be covered as an addition to the basic information discussed here.
Within Nova, the libvirt driver makes the following host traffic statistics available under the get_diagnostics() and get_instance_diagnostics() calls on a per-virtual NIC basis:
There continues to be a long-running effort to get these counters into Ceilometer (the wiki page at [1] attempted to do this via a direct call, while [2] is trying to accomplish this via notifications from Nova). Rather than propose another way of collecting these statistics from Nova, this devref declares them out of scope until there is an agreed-upon method for getting the counters from Nova to Ceilometer, at which point it can be decided whether Neutron can or should piggy-back on that method.
[1] | https://wiki.openstack.org/wiki/EfficientMetering/FutureNovaInteractionModel |
[2] | http://lists.openstack.org/pipermail/openstack-dev/2015-June/067589.html |
For the Linux bridge, a check of [3] shows that IEEE 802.1d mandated statistics are only a “wishlist” item. The alternative is to use NETLINK/shell to list the interfaces attached to a particular bridge and then to collect statistics for each interface attached to the bridge. These statistics could then be mapped to appropriate places, as discussed below.
Note: the examples below talk in terms of mapping counters available from the Linux operating system:
Available counters for interfaces on other operating systems can be mapped in a similar fashion.
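As a minimal sketch of the per-bridge collection described above, the following reads the kernel's sysfs tree instead of using NETLINK or shell commands (a substitution made here only to keep the example self-contained). The function name, the choice of counters, and the `sysfs_root` parameter are assumptions for illustration and are not part of the proposal.

```python
from pathlib import Path

# Counter files exposed by the Linux kernel under
# /sys/class/net/<interface>/statistics/.
COUNTERS = ("rx_bytes", "rx_packets", "rx_errors", "rx_dropped",
            "tx_bytes", "tx_packets", "tx_errors", "tx_dropped")


def bridge_interface_counters(bridge, sysfs_root="/sys/class/net"):
    """Return {interface: {counter: value}} for every port on a bridge.

    Hypothetical helper: sysfs_root is a parameter so the function can be
    pointed at a test directory instead of the live system.
    """
    root = Path(sysfs_root)
    # Each port attached to a Linux bridge appears as an entry in brif/.
    ports = sorted(p.name for p in (root / bridge / "brif").iterdir())
    stats = {}
    for port in ports:
        stats_dir = root / port / "statistics"
        stats[port] = {
            name: int((stats_dir / name).read_text())
            for name in COUNTERS
        }
    return stats
```

On a host with a Linux bridge, calling `bridge_interface_counters("<bridge name>")` would return one dictionary of counters per attached interface, which could then be mapped into the tables discussed below.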
[3] | http://git.kernel.org/cgit/linux/kernel/git/shemminger/bridge-utils.git/tree/doc/WISHLIST |
Of interest are counters from each of the following (as of this writing, Linux Bridge only supports legacy routers, so the DVR case need not be considered):
Like Linux bridge, the openvswitch implementation has interface counters that will be collected. Of interest are the receive and transmit counters from the following:
The following table summarizes how the interface counters are mapped into each MIB Data Set. Specific details are covered in the sections below:
Node | Interface | Included in RFC2863 Data Set | Included in RFC4293 Data Set |
---|---|---|---|
Compute | Instance tap | Yes | No |
Compute | Router qr | Yes | Yes |
Compute | FIP fg | No | Yes |
Network | DHCP tap | Yes | No |
Network | Router qr | Yes | Yes |
Network | Router qg | No | Yes |
Network | SNAT sg | No | Yes |
Note: because of replication of the router qg interface when running distributed routing, aggregation of the individual counter information will be necessary to fill in the appropriate data set entries. This will be covered in the Data Aggregation section below.
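Independently of whatever design the Data Aggregation section eventually specifies, the basic idea of combining counters from replicated interfaces can be illustrated with a simple sum. This is only a sketch: the function name, the sample format, and the use of a logical interface identifier as the grouping key are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_counters(samples):
    """Sum per-node samples into one counter set per logical interface.

    samples: iterable of (logical_interface_id, {counter_name: value}),
    where the same logical interface may be reported by several nodes.
    """
    totals = {}
    for interface_id, counters in samples:
        # Counter.update() adds values for matching keys.
        totals.setdefault(interface_id, Counter()).update(counters)
    return {iid: dict(c) for iid, c in totals.items()}
```

A plain sum is the simplest possible policy; it ignores issues such as samples taken at different times on different nodes, which a real aggregation design would need to address.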
For each compute host, each network will be represented with a “switch”, modeled by instances of ifTable and ifXTable. This mapping has the advantage that for a particular network, the view to the project or the operator is identical - the only difference is that the operator can see all networks, while a project will only see the networks under its project ID.
The current reference implementation identifies tap interface names with the Neutron port they are associated with. In turn, the Neutron port identifies the Neutron network. Therefore, it is possible to take counters from each tap interface and map them into entries in the appropriate tables, using the following proposed assignments:
Section 3.1.6 of [RFC2863] provides the details of why 64-bit counters need to be supported. The summary is that, with increasing transmission bandwidth, the use of 32-bit counters would require a problematic increase in counter polling frequency (a 1 Gb/s stream of full-sized packets will cause a 32-bit octet counter to wrap in 34 seconds).
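The wrap time quoted above follows from simple arithmetic: an N-bit octet counter holds 2^N octets, and a link running at R bits per second delivers at most R/8 octets per second. The short sketch below reproduces the calculation; the function name is illustrative, and the model assumes every bit on the link is counted, so the result is a lower bound on the wrap time.

```python
def wrap_time_seconds(counter_bits, link_bits_per_second):
    """Minimum time for an octet counter to wrap at full line rate."""
    octets_per_second = link_bits_per_second / 8
    return (2 ** counter_bits) / octets_per_second


for label, rate in (("100 Mb/s", 100e6), ("1 Gb/s", 1e9), ("10 Gb/s", 10e9)):
    seconds = wrap_time_seconds(32, rate)
    print(f"{label}: a 32-bit octet counter wraps in about {seconds:.1f} s")
```

At 1 Gb/s this gives roughly 34.4 seconds for a 32-bit counter, matching the RFC figure, while a 64-bit counter at the same rate would take thousands of years to wrap.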
Counters tracked by RFC 4293 come in two flavors: ones that are inherited from the interface, and those that track L3 events, such as fragmentation, re-assembly, truncations, etc. As the current instrumentation available from the reference implementation does not provide appropriate source information, the following counters are declared out of scope for this devref:
In ipIfStatsTable, the following counters will hold the same value as the referenced counter from RFC 2863:
For ipSystemStatsTable, the following counters will hold values based on the following assignments. These summations are covered in more detail in the Data Aggregation section below.
There are two options for how data can be collected:
The number of counters that need to be collected is significant: for example, a cloud running legacy routing would need to collect, for each project, three counters from a network node and a tap counter for each running instance. For this reason, while it would be desirable to reuse the existing L3 and ML2 agents, the initial proof of concept will run a separate agent that uses separate threads to isolate the effects of counter collection from reporting. Once the performance of the collection agent is understood, merging the functionality into the L3 or ML2 agents can be considered. The collection thread will initially use shell commands via rootwrap, with the plan of moving to native Python libraries when support for them is available.
In addition, there are two options for how to report counters back to the Neutron server: push or pull (that is, asynchronous notification or polling). On the one hand, pull/polling eases the Neutron server’s task in that it only needs to store and aggregate the results from the current polling cycle. However, this comes at the cost of dealing with the stale data issues that scaling a polling cycle will entail. On the other hand, asynchronous notification requires that the Neutron server be able to hold the current results from each collector. As the L3 and ML2 agents already use asynchronous notification to report status back to the Neutron server, the proof of concept will follow the same model to ease a future merging of functionality.
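The separation of collection from reporting can be sketched with two threads joined by a queue. This is a rough illustration under stated assumptions, not the proof-of-concept agent itself: `run_agent`, `collect`, and `report` are hypothetical names, `collect` stands in for the shell-based counter gathering, and `report` stands in for the asynchronous notification to the Neutron server.

```python
import queue
import threading

_STOP = object()  # sentinel telling the reporter thread to exit


def run_agent(collect, report, cycles):
    """Run `cycles` collection passes, reporting each sample from a
    separate thread so slow reporting does not delay collection."""
    samples = queue.Queue()

    def collector():
        for _ in range(cycles):
            samples.put(collect())
        samples.put(_STOP)

    def reporter():
        while True:
            sample = samples.get()
            if sample is _STOP:
                break
            report(sample)  # push model: send as soon as data is ready

    threads = [threading.Thread(target=collector),
               threading.Thread(target=reporter)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

A long-running agent would collect on a timer rather than for a fixed number of cycles; the bounded loop here only keeps the sketch easy to run and test.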
Will be covered in a follow-on patch set.
Will be covered in a follow-on patch set.