Cyborg-Nova-Glance Interaction in Compute Node¶
Cyborg is a service for managing accelerators, such as FPGAs, GPUs, etc. For scheduling an instance that needs accelerators, Cyborg needs to work with Nova at three levels:
Representation and Discovery: Cyborg shall represent accelerators as resources in Placement. When a device is discovered, Cyborg updates resource inventories in Placement.
Instance placement/scheduling: Cyborg may provide a weigher that prioritizes hosts based on available accelerator resources.
Attaching accelerators to instances: In the compute node, Cyborg shall define a workflow based on interacting with Nova through a new os-acc library (like os-vif and os-brick).
The first two aspects are addressed in [1]. This spec addresses the attachment of accelerators to instances, via os-acc. For FPGAs, Cyborg also needs to interact with Glance for fetching bitstreams. Some aspects of that are covered in [2]. This spec will address the interaction of Cyborg and Glance in the compute node.
This spec is common to all accelerators, including GPUs, High Precision Time Synchronization (HPTS) cards, etc. Since FPGAs have more aspects to be considered than other devices, some sections may focus on FPGA-specific factors. The spec calls out the FPGA-specific aspects.
Smart NICs based on FPGAs fall into two categories: those which expose the FPGA explicitly to the host, and those that do not. Cyborg’s current scope includes the former. This spec includes such devices, though the Cyborg-Neutron interaction is out of scope.
The scope of this spec is the Rocky release.
Terminology¶
Accelerator: The unit that can be assigned to an instance for offloading specific functionality. For non-FPGA devices, it is either the device itself or a virtualized version of it (e.g. vGPUs). For FPGAs, an accelerator is either the entire device, a region within the device or a function.
Bitstream: An FPGA image, usually a binary file, possibly with vendor-specific metadata. A bitstream may implement one or more functions.
Function: A specific functionality, such as matrix multiplication or video transcoding, usually represented as a string or UUID. This term may be used with multi-function devices, including FPGAs and other fixed function hardware like Intel QuickAssist.
Region: A part of the FPGA which can be programmed without disrupting other parts of that FPGA. If an FPGA does not support Partial Reconfiguration, the entire device constitutes one region. A region may implement one or more functions.
Here is an example diagram for an FPGA with multiple regions, and multiple functions in a region:
PCI A PCI B
| |
+-------|--------|-------------------+
| | | |
| +----|--------|---+ +--------+ |
| | +--|--+ +---|-+ | | | |
| | | Fn A| | Fn B| | | | |
| | +-----+ +-----+ | | | |
| +-----------------+ +--------+ |
| Region 1 Region 2 |
| |
+------------------------------------+
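The terminology above can be sketched as a small data model. This is only an illustration: the class names `Function`, `Region` and `Fpga`, and the device name `fpga0`, are assumptions made for this sketch and are not Cyborg objects.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Function:
    """A specific functionality, identified by a name or UUID string."""
    name: str


@dataclass
class Region:
    """A part of an FPGA that can be programmed independently."""
    name: str
    functions: List[Function] = field(default_factory=list)


@dataclass
class Fpga:
    """A device; without Partial Reconfiguration it has exactly one region."""
    name: str
    regions: List[Region] = field(default_factory=list)

    def function_names(self) -> List[str]:
        # Flatten the functions of all regions, in region order.
        return [fn.name for region in self.regions for fn in region.functions]


# The device from the diagram: Region 1 holds Fn A and Fn B,
# Region 2 is shown without any function.
fpga = Fpga(
    name="fpga0",  # hypothetical device name
    regions=[
        Region("Region 1", [Function("Fn A"), Function("Fn B")]),
        Region("Region 2"),
    ],
)
```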
Problem description¶
Once Nova has picked a compute node for placement of an instance that needs accelerators, the following steps need to happen:
Nova compute on that node has to invoke Cyborg Agent for handling the needed accelerators. This needs to happen through a library, named os-acc, patterned after os-vif (Neutron) and os-brick (Cinder).
Cyborg Agent may call Glance to fetch a bitstream, either by id or based on tags.
Cyborg Agent may need to call into a Cyborg driver to program said bitstream.
Cyborg Agent needs to call into a Cyborg driver to prepare a device and/or obtain an attach handle (e.g. PCI BDF) that can be attached to the instance.
Cyborg Agent returns enough information to Nova compute via os-acc for the instance to be launched.
The behavior of each of these steps needs to be specified.
In addition, the OpenStack Compute API [3] specifies the operations that can be done on an instance. The behavior with respect to accelerators must be defined for each of these operations. That in turn is related to when Nova compute calls os-acc.
Use Cases¶
Please see [1]. We intend to support FPGAaaS with request time programming, and AFaaS (both pre-programmed and orchestrator-programmed scenarios).
Cyborg will discover accelerator resources whenever the Cyborg agent starts up. PCI hot plug can be supported after the Rocky release.
Cyborg must support all instance operations mentioned in OpenStack Compute API [3] in Rocky, except booting off a snapshot and live migration.
Proposed change¶
OpenStack Server API Behavior¶
The OpenStack Compute API [3] lists the operations that can be performed on an instance. Of these, some will not be supported by Cyborg in Rocky. The list of supported operations (with the intended behaviors) is as follows:
When an instance is started, the accelerators requested by that instance’s flavor must be attached to the instance. On termination, those resources are released.
When an instance is paused, suspended or locked, the accelerator resources are left intact, and not detached from the instance. So, when the instance is unpaused, resumed or unlocked, there is nothing to do.
When an instance is shelved, the accelerator resources are detached. On an unshelve, it is expected that the build operation will go through the scheduler again, so it is equivalent to an instance start.
When an instance is deleted, the accelerator resources are detached. On a restore, it is expected that the build operation will go through the scheduler again, so it is equivalent to an instance start.
Reboot: The accelerator resources are left intact. It is up to the instance software to rediscover attached resources.
Rebuild: Prior to the instance image replacement, all device access must be quiesced, i.e., accesses to devices from that instance must be completed and further accesses must be prohibited. The mechanics of such quiescing are outside the scope of this document. With that precondition, accelerator resources are left attached to the instance during the rebuild.
Resize (with change of flavor): It is equivalent to a termination followed by re-scheduling and restart. The accelerator resources are detached on termination, and re-attached when the instance is scheduled again.
Cold migration: It is equivalent to a termination followed by re-scheduling and restart. The accelerator resources are detached on termination, and re-attached when the instance is scheduled again.
Evacuate: This is a forcible rebuild by the administrator. As the semantics of evacuation are left open even without accelerators, Cyborg’s behavior is also left undefined.
Set administrator password, trigger crash dump: These are supported and are no-ops for accelerators.
The following instance operations are not supported in this release:
Booting off a snapshot: The snapshot may have been taken when the attached accelerators were in a particular state. When booting off a previous snapshot, the current configuration and state of accelerators may not match the snapshot. So, this is unsupported.
Live migration: Until a mechanism is defined to migrate accelerator state along with the instance, this is unsupported.
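The attach/detach behaviors listed above can be summarized as a lookup table. The sketch below is one hedged way to encode them; the operation keys, the `AccelAction` enum and the `accel_actions()` helper are names assumed for illustration and are not part of Nova or Cyborg.

```python
from enum import Enum
from typing import Dict, Tuple


class AccelAction(Enum):
    ATTACH = "attach"            # accelerators are attached to the instance
    DETACH = "detach"            # accelerators are detached / released
    KEEP = "keep"                # accelerators are left intact
    UNDEFINED = "undefined"      # behavior deliberately left undefined
    UNSUPPORTED = "unsupported"  # operation not supported in Rocky


A = AccelAction

# Hypothetical operation names mapped to the accelerator actions
# described in the lists above, in the order they occur.
ACCEL_BEHAVIOR: Dict[str, Tuple[AccelAction, ...]] = {
    "start": (A.ATTACH,),
    "terminate": (A.DETACH,),
    "pause": (A.KEEP,), "unpause": (A.KEEP,),
    "suspend": (A.KEEP,), "resume": (A.KEEP,),
    "lock": (A.KEEP,), "unlock": (A.KEEP,),
    "shelve": (A.DETACH,), "unshelve": (A.ATTACH,),
    "delete": (A.DETACH,), "restore": (A.ATTACH,),
    "reboot": (A.KEEP,),
    "rebuild": (A.KEEP,),  # precondition: device access is quiesced
    "resize": (A.DETACH, A.ATTACH),
    "cold_migrate": (A.DETACH, A.ATTACH),
    "evacuate": (A.UNDEFINED,),
    "boot_from_snapshot": (A.UNSUPPORTED,),
    "live_migrate": (A.UNSUPPORTED,),
}


def accel_actions(operation: str) -> Tuple[AccelAction, ...]:
    """Return the accelerator actions for an operation, rejecting unsupported ones."""
    actions = ACCEL_BEHAVIOR[operation]
    if A.UNSUPPORTED in actions:
        raise ValueError(f"{operation} is not supported with accelerators in Rocky")
    return actions
```

For example, `accel_actions("resize")` yields a detach followed by an attach, matching the "termination followed by re-scheduling and restart" description.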
os_acc Structure¶
Cyborg will develop a new library named os-acc. That library will offer the APIs listed later in this section. Nova Compute calls these APIs if it sees that the requested flavor refers to the CUSTOM_ACCELERATOR resource class, except for initialize(), which it calls unconditionally. Nova Compute calls these APIs asynchronously, as suggested below:
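from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=1) as executor:
    # <api> stands for any of the os-acc APIs listed below
    future = executor.submit(os_acc.<api>, *args)
    # do other stuff
    try:
        data = future.result()
    except Exception:
        # handle exceptions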
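To make the calling pattern concrete, here is a minimal runnable sketch. Since os-acc does not exist yet, `fake_plug` is a hypothetical stand-in for an os-acc API, `call_async` is a helper name invented for this example, and the PCI address is a placeholder value. Returning `None` on failure is just one possible way to handle the exception.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class CyborgAgentUnavailable(Exception):
    """Stand-in for the exception named in this spec."""


def fake_plug(instance_info, selected_rp, flavor_extra_specs):
    # Hypothetical stand-in for os_acc.plug(); pretends to do slow work.
    time.sleep(0.01)
    return [{"pci_id": "0000:5e:00.0"}]  # placeholder PCI BDF


def call_async(api, *args):
    """Run one os-acc style call in a worker thread and collect its result."""
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(api, *args)
        # ... the caller can do other work here ...
        try:
            # result() blocks until the call finishes and re-raises
            # any exception raised inside the worker thread.
            return future.result()
        except CyborgAgentUnavailable:
            # Assumed policy for this sketch: report failure as "no data".
            return None


data = call_async(fake_plug, {"uuid": "instance-1"}, {}, {})
```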
The APIs of os-acc are as below:
initialize()
Called once at start of day. Waits for Cyborg Agent to be ready to accept requests, i.e., all devices enumerated and traits published.
Returns None on success.
Throws a CyborgAgentUnavailable exception if Cyborg Agent cannot be contacted.
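The wait-until-ready behavior of initialize() could look roughly like the sketch below. It is not the os-acc implementation: the `is_agent_ready` probe, the `timeout` and `interval` parameters, and the choice to treat a timeout as "agent unavailable" are all assumptions made for illustration.

```python
import time


class CyborgAgentUnavailable(Exception):
    """Raised when Cyborg Agent cannot be contacted."""


def initialize(is_agent_ready, timeout=5.0, interval=0.1,
               sleep=time.sleep, clock=time.monotonic):
    """Block until the agent reports ready; return None on success.

    is_agent_ready is a hypothetical callable that returns True once all
    devices are enumerated and traits published, and raises
    ConnectionError if the agent cannot be reached.
    """
    deadline = clock() + timeout
    while True:
        try:
            if is_agent_ready():
                return None
        except ConnectionError as exc:
            raise CyborgAgentUnavailable("Cyborg Agent cannot be contacted") from exc
        if clock() >= deadline:
            # Assumption: giving up after a timeout is reported the same way.
            raise CyborgAgentUnavailable("timed out waiting for Cyborg Agent")
        sleep(interval)
```

Passing `sleep` and `clock` as parameters keeps the polling loop easy to exercise in tests without real delays.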
plug(instance_info, selected_rp, flavor_extra_specs)
Parameters are all read-only. Here are their descriptions:
instance_info: dictionary containing instance UUID, instance name, project/tenant ID and VM image UUID. The instance name is needed for better logging, the project/tenant ID may be passed to some accelerator policy engine in the future and the VM image UUID may be used to query Glance for metadata about accelerator requirements that may be stored with the VM image.
selected_rp: Information about the selected resource provider is passed as a dictionary.
flavor_extra_specs: the extra_specs field in the flavor, including resource classes, traits and other fields interpreted by Cyborg.
Called by Nova compute when an instance is started, unshelved, or restored and after a resize or cold migration.
Called before an instance is built, i.e., before the specification of the instance is created. For libvirt-based hypervisors, this means the call happens before the instance’s domain XML is created.
As part of this call, Cyborg Agent may fetch bitstreams from Glance and initiate programming. It may fetch the bitstream specified in the request’s flavor extra specs, if any. If the request refers to a function ID/name, Cyborg Agent would query Glance to find bitstreams that provide the function and match the chosen device, and would then fetch the needed bitstream.
As part of this call, Cyborg Agent will locate the Deployable corresponding to the chosen RP, locate the attach handles (e.g. PCI BDF) needed, update its internal data structures in a persistent way, and return the needed information back to Nova.
Returns an array, with one entry per requested accelerator, each entry being a dictionary. The dictionary is structured as below for Rocky:
{ "pci_id": <pci bdf> }
unplug(instance_info)
Parameters are all read-only. Here are their descriptions:
instance_info: dictionary containing instance UUID and instance name. The instance name is needed for better logging.
Called when an instance is stopped, shelved, or deleted and before a resize or cold migration.
As part of this call, Cyborg Agent will clean up internal resources, call the appropriate Cyborg driver to clean up the device resources and update its data structures persistently.
Returns the number of accelerators that were released. Errors may cause exceptions to be thrown.
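The parameter and return shapes of plug() and unplug() can be illustrated with an in-memory stand-in. This is a sketch only: `FakeAgent` is a hypothetical class, the `uuid` dictionary key and the `resources:CUSTOM_ACCELERATOR` extra-spec key are assumed spellings, and the PCI addresses are placeholder values.

```python
from typing import Dict, List


class FakeAgent:
    """In-memory stand-in for Cyborg Agent bookkeeping (hypothetical)."""

    def __init__(self, free_bdfs: List[str]):
        self._free = list(free_bdfs)
        self._in_use: Dict[str, List[str]] = {}

    def plug(self, instance_info, selected_rp, flavor_extra_specs):
        # Assumed convention: the requested count is stored as a string
        # under "resources:CUSTOM_ACCELERATOR" in the extra specs.
        count = int(flavor_extra_specs.get("resources:CUSTOM_ACCELERATOR", 0))
        if count > len(self._free):
            raise RuntimeError("not enough free attach handles")
        bdfs = [self._free.pop(0) for _ in range(count)]
        self._in_use.setdefault(instance_info["uuid"], []).extend(bdfs)
        # One dictionary per requested accelerator, as described for Rocky.
        return [{"pci_id": bdf} for bdf in bdfs]

    def unplug(self, instance_info):
        bdfs = self._in_use.pop(instance_info["uuid"], [])
        self._free.extend(bdfs)
        # The number of accelerators that were released.
        return len(bdfs)


agent = FakeAgent(["0000:5e:00.0", "0000:5e:00.1"])  # placeholder BDFs
instance = {"uuid": "instance-1", "name": "vm1"}
handles = agent.plug(instance, {}, {"resources:CUSTOM_ACCELERATOR": "2"})
released = agent.unplug(instance)
```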
Workflows¶
The pseudocode for each os-acc API can be expressed as below:
def initialize():
    # checks that all devices are discovered and their traits published
    # waits if any discovery operation is ongoing
    return None

def plug(instance_info, rp, extra_specs):
    validate_params(...)
    glance = glanceclient.Client(...)
    driver = # select Cyborg driver for chosen rp
    rp_deployable = # get deployable for RP
    if extra_specs refers to ``CUSTOM_FPGA_<vendor>_REGION_<uuid>`` and
       extra_specs refers to ``bitstream:<uuid>``:
        bitstream = glance.images.data(image_uuid)
        driver.program(bitstream, rp_deployable, ...)
    if extra_specs refers to ``CUSTOM_FPGA_<vendor>_FUNCTION_<uuid>`` and
       extra_specs refers to function UUID/name:
        region_type_uuid = # fetch from selected RP
        bitstreams = glance.images.list(...)
        # queries Glance by function UUID/name property and region type
        # UUID to get matching bitstreams
        if len(bitstreams) > 1:
            error(...) # bitstream choice policy is outside Cyborg
        bitstream = # fetch the data of the single matching bitstream
        driver.program(bitstream, rp_deployable, ...)
    pci_bdf = driver.allocate_handle(...)
    # update Cyborg DB with instance_info and BDF usage
    return { "pci_id": pci_bdf }

def unplug(instance_info):
    bdf_list = # fetch BDF usage from Cyborg DB for instance
    # update Cyborg DB to mark those BDFs as free
    return len(bdf_list)
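The bitstream lookup by function and region type in plug() can be sketched in isolation. The example below filters plain dictionaries instead of calling Glance. The property names `function_id` and `region_type_uuid`, the `select_bitstream` helper and the sample values are assumptions, and raising an error when nothing matches is an addition of this sketch, not something the pseudocode specifies.

```python
def select_bitstream(images, function_id, region_type_uuid):
    """Pick the single image that provides a function for a region type.

    images is a list of dicts standing in for Glance image metadata.
    """
    matches = [
        img for img in images
        if img.get("function_id") == function_id
        and img.get("region_type_uuid") == region_type_uuid
    ]
    if not matches:
        # Assumption: no matching bitstream is treated as an error.
        raise LookupError("no bitstream matches the function and region type")
    if len(matches) > 1:
        # Bitstream choice policy is outside Cyborg, so refuse to guess.
        raise LookupError("multiple bitstreams match; cannot choose")
    return matches[0]


images = [
    {"id": "img-1", "function_id": "fn-transcode", "region_type_uuid": "rt-a"},
    {"id": "img-2", "function_id": "fn-matmul", "region_type_uuid": "rt-a"},
    {"id": "img-3", "function_id": "fn-matmul", "region_type_uuid": "rt-b"},
]
chosen = select_bitstream(images, "fn-matmul", "rt-a")
```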
Alternatives¶
N/A
Data model impact¶
None
REST API impact¶
None
Security impact¶
None
Notifications impact¶
None
Other end user impact¶
None
Performance Impact¶
None
Other deployer impact¶
None
Developer impact¶
None
Implementation¶
Assignee(s)¶
None
Work Items¶
Decide how to associate multiple functions/bitstreams in extra specs with multiple devices in the flavor.
Decide specific changes needed in Cyborg conductor, db, agent and drivers.
Others: TBD
Dependencies¶
Nested Resource Provider support in Nova
Testing¶
For each vendor driver supported in this release, we need to integrate the corresponding FPGA type(s) in the CI infrastructure.
Documentation Impact¶
The behavior with respect to accelerators during various instance operations (reboot, pause, etc.) must be documented. The procedure to upload a bitstream, including applying Glance properties, must also be documented.