2024.1 Series (23.1.0 - 24.1.x) Release Notes

24.1.3

Security Issues

  • An issue in Ironic has been resolved where image checksums would not be checked prior to the conversion of an image to a raw format image from another image format.

    With default settings, this would not normally take place. However, the image_download_source option can be set to local at the node level for a single deployment, as the default for that baremetal node in all cases, or via the [agent]image_download_source configuration option. By default, this setting is http.

    This occurred in concert with the [DEFAULT]force_raw_images option when set to True, which causes Ironic to download and convert the file.

    In a fully integrated context of Ironic’s use in a larger OpenStack deployment, where images are coming from the Glance image service, the previous pattern was not problematic. The overall issue was introduced as a result of the capability to supply, cache, and convert a disk image provided as a URL by an authenticated user.

    Ironic will now validate the user supplied checksum prior to image conversion on the conductor. This can be disabled using the [conductor]disable_file_checksum configuration option.
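
The ordering described above, where the user supplied checksum is verified before any conversion step runs, can be sketched as follows. This is a minimal illustration and not Ironic's implementation: the names file_digest and verify_before_convert, the convert callback, and the sha256 default are assumptions made for the example.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as image:
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_before_convert(path, expected_checksum, convert, algorithm="sha256"):
    """Call the hypothetical convert(path) callback only if the checksum matches."""
    actual = file_digest(path, algorithm)
    if actual != expected_checksum.strip().lower():
        # Refuse to hand an unverified file to the conversion step.
        raise ValueError(
            "checksum mismatch: expected %s, got %s" % (expected_checksum, actual)
        )
    return convert(path)
```

The point of the sketch is only the order of operations: the conversion callback is never reached when the digest does not match.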

Bug Fixes

  • Fixes inspection failure when bmc_address or bmc_v6address is null in the inventory received from the ramdisk.

  • Fixes a security issue where Ironic would fail to checksum disk image files it downloads when Ironic had been requested to download and convert the image to a raw image format. This required the image_download_source to be explicitly set to local, which is not the default.

    This fix can be disabled by setting [conductor]disable_file_checksum to True; however, this option will be removed in a future major Ironic release.

    As a result of this, parity has been introduced to align Ironic to Ironic-Python-Agent’s support for checksums used by standalone users of Ironic. This includes support for remote checksum files to be supplied by URL, in order to prevent breaking existing users which may have inadvertently been leveraging the prior code path. This support can be disabled by setting [conductor]disable_support_for_checksum_files to True.

  • Fixes aborting in-band inspection. Previously, it would fail with Can not transition from state 'inspect failed' on event 'abort'.
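
For the checksum file support mentioned above, the lookup of one image's checksum inside a downloaded checksum file might look like the sketch below. It assumes the file uses the common sha256sum-style layout ("<hex digest>  <file name>", or a single bare digest); the function names are hypothetical and Ironic's actual parsing may differ. File names containing spaces are not handled.

```python
def looks_like_url(value):
    """Treat http(s) values as a checksum file URL rather than a literal checksum."""
    return value.startswith(("http://", "https://"))


def checksum_from_file(text, image_name):
    """Find the digest for image_name in sha256sum-style checksum file content."""
    for line in text.splitlines():
        parts = line.split()
        if not parts or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        if len(parts) == 1:
            return parts[0].lower()  # file holds a bare digest only
        # The binary-mode marker "*" may prefix the file name.
        name = parts[-1].lstrip("*")
        if name == image_name or name.endswith("/" + image_name):
            return parts[0].lower()
    raise LookupError("no checksum found for %s" % image_name)
```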

24.1.2

Upgrade Notes

  • When upgrading Ironic to address the qemu-img image conversion security issues, the ironic-python-agent ramdisks will also need to be upgraded.

  • When upgrading Ironic to address the qemu-img image conversion security issues, the [conductor]conductor_always_validates_images setting may be set to True as a short term remedy while ironic-python-agent ramdisks are being updated. Alternatively it may be advisable to also set the [agent]image_download_source setting to local to minimize redundant network data transfers.

  • As a result of security fixes to address qemu-img image conversion security issues, a new configuration parameter has been added to Ironic, [conductor]permitted_image_formats, with a default value of “raw,qcow2,iso”. Raw and qcow2 format disk images are the image formats the Ironic community has consistently stated as what is supported and expected for use with Ironic. These formats also match the formats which the Ironic community tests. Operators who leverage other disk image formats may need to modify this setting further.
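
To make the settings named in these upgrade notes concrete, the sketch below reads them from ironic.conf-style text with Python's standard configparser. Ironic itself uses oslo.config, so this is only an illustration; the SAMPLE content and the load_image_settings helper are assumptions, while the fallback values mirror the defaults stated in these notes.

```python
import configparser

# Hypothetical configuration applying the short term remedy described above.
SAMPLE = """
[conductor]
permitted_image_formats = raw,qcow2,iso
conductor_always_validates_images = True

[agent]
image_download_source = local
"""


def load_image_settings(text):
    """Parse the image related options, falling back to the documented defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    formats = parser.get(
        "conductor", "permitted_image_formats", fallback="raw,qcow2,iso"
    )
    return {
        "permitted_image_formats": [
            item.strip() for item in formats.split(",") if item.strip()
        ],
        "conductor_always_validates_images": parser.getboolean(
            "conductor", "conductor_always_validates_images", fallback=False
        ),
        "image_download_source": parser.get(
            "agent", "image_download_source", fallback="http"
        ),
    }
```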

Security Issues

  • Ironic now checks the supplied image format value against the detected format of the image file, and will prevent deployments should the values mismatch. If being used with Glance and a mismatch in metadata is identified, it will require images to be re-uploaded with a new image ID to represent corrected metadata. This is the result of CVE-2024-44082 tracked as bug 2071740.

  • Ironic always inspects the supplied user image content for safety prior to deployment of a node should the image pass through the conductor, even if the image is supplied in raw format. This is utilized to identify the format of the image and the overall safety of the image, such that source images with unknown or unsafe feature usage are explicitly rejected. This can be disabled by setting [conductor]disable_deep_image_inspection to True. This is the result of CVE-2024-44082 tracked as bug 2071740.

  • Ironic can also inspect images which would normally be provided as a URL for direct download by the ironic-python-agent ramdisk. This is not enabled by default as it will increase the overall network traffic and disk space utilization of the conductor. This level of inspection can be enabled by setting [conductor]conductor_always_validates_images to True. Once the ironic-python-agent ramdisk has been updated, it will perform similar image security checks independently, should an image conversion be required. This is the result of CVE-2024-44082 tracked as bug 2071740.

  • Ironic now explicitly enforces a list of permitted image types for deployment via the [conductor]permitted_image_formats setting, which defaults to “raw”, “qcow2”, and “iso”. While the project has classically always declared permissible images as “qcow2” and “raw”, it was previously possible to supply other image formats known to qemu-img, and the utility would attempt to convert the images. The “iso” support is required for “boot from ISO” ramdisk support.

  • Ironic now explicitly passes the source input format to executions of qemu-img to limit the permitted qemu disk image drivers which may evaluate an image to prevent any mismatched format attacks against qemu-img.

  • The ansible deploy interface example playbooks now supply an input format to executions of qemu-img. If you are using customized playbooks, please add “-f {{ ironic.image.disk_format }}” to your invocations of qemu-img. If you do not do so, qemu-img will automatically try to guess the format, which can lead to known security issues with the incorrect source format driver.

  • Operators who have implemented any custom deployment drivers or additional functionality like machine snapshot, should review their downstream code to ensure they are properly invoking qemu-img. If there are any questions or concerns, please reach out to the Ironic project developers.

  • Operators are reminded that they should utilize cleaning in their environments. Disabling any security features such as cleaning or image inspection is at your own risk. Should you have any issues with security related features, please don’t hesitate to open a bug with the project.

  • The [conductor]disable_deep_image_inspection setting is conveyed to the ironic-python-agent ramdisks automatically, and will prevent those operating ramdisks from performing deep inspection of images before they are written.

  • The [conductor]permitted_image_formats setting is conveyed to the ironic-python-agent ramdisks automatically. Should a need arise to explicitly permit an additional format, that should take place in the Ironic service configuration.
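
The checks described in this section (detecting the format from content, comparing it with the supplied format and the permitted list, and always passing an explicit source format to qemu-img) can be sketched roughly as below. This is a heavily simplified illustration, not Ironic's inspection code: it only recognises the qcow2 magic bytes and the ISO 9660 volume descriptor, and it treats everything else as raw, which the real deep inspection does not do. All function names are hypothetical; the qemu-img flags -f and -O are the utility's standard input and output format options.

```python
QCOW2_MAGIC = b"QFI\xfb"   # first four bytes of a qcow2 image
ISO_MAGIC = b"CD001"       # ISO 9660 volume descriptor identifier
ISO_MAGIC_OFFSET = 0x8001  # sector 16 (32768) plus one type byte


def detect_format(data):
    """Guess the image format from its content, never from a file extension."""
    if data[:4] == QCOW2_MAGIC:
        return "qcow2"
    if data[ISO_MAGIC_OFFSET:ISO_MAGIC_OFFSET + len(ISO_MAGIC)] == ISO_MAGIC:
        return "iso"
    return "raw"  # simplification: no magic bytes found


def check_image(declared_format, data, permitted=("raw", "qcow2", "iso")):
    """Reject images whose detected format is not permitted or not as declared."""
    detected = detect_format(data)
    if detected not in permitted:
        raise ValueError("format %s is not permitted" % detected)
    if declared_format != detected:
        raise ValueError(
            "declared format %s does not match detected format %s"
            % (declared_format, detected)
        )
    return detected


def qemu_img_convert_args(source, dest, source_format):
    """Build a qemu-img command line that names the source format explicitly."""
    return ["qemu-img", "convert", "-f", source_format, "-O", "raw", source, dest]
```

Naming the source format with -f is what keeps qemu-img from probing the file with other format drivers.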

Bug Fixes

  • Fixes an issue with unit tests that showed this DeprecationWarning: The metaschema specified by $schema was not found. Using the latest draft to validate, but this will raise an error in the future. cls = validator_for(schema). The warning for the deprecated schema was removed by using a new template.

  • Fixes the issue of service steps not starting due to the servicing states (states.SERVICING and states.SERVICEWAIT) missing from the _FASTTRACK_HEARTBEAT_ALLOWED constant.

  • Fixes issue with configuring virtual media boot for executing service steps by adding missing entries for states.SERVICING and states.SERVICEWAIT in the whitelist of the states allowed by this method.

  • Fixes multiple issues in the handling of images as it relates to the execution of the qemu-img utility, which is used for image format conversion, where a malicious user could craft a disk image to potentially extract information from an ironic-conductor process’s operating environment.

    Ironic now explicitly enforces a list of approved image formats as a [conductor]permitted_image_formats list, which mirrors the image formats the Ironic project has historically tested and expressed as known working. Testing is not based upon file extension, but upon content fingerprinting of the disk image files. This is tracked as CVE-2024-44082 via bug 2071740.

  • Fixes usage of the redfish detach virtual media feature to conform to the general implementation. Previously, the detach virtual media API call using the redfish driver did not work as intended and caused the operation to fail.

  • Fixes an issue in redfish attach/detach generic virtual media where the attached devices are not correctly recognized causing the attach operation to fail.

  • Service step validation no longer requires a priority field, which is not supported for servicing.

  • Fixes service steps that rely on a reboot. Previously, the reboot was not properly recognized in the conductor logic.

  • Adds an ISO publisher value to ISO images which are mastered as part of cleaning/deployment/service operations in support of a fix for bug 2032377.

  • Fixes generated URL when using the virtual media attachment API. Previously, it missed the node UUID, causing conflicts between different nodes.

24.1.0

Prelude

Ironic contributors are thrilled to present the release of 24.1.0, tested as part of OpenStack 2024.1 (Caracal) throughout the last six months. This release can be upgraded to directly from Ironic 21.4 as part of a SLURP upgrade from OpenStack 2023.1 (Antelope). Ironic’s first release came during the 2014.1 (Icehouse) cycle – a decade ago. In those ten years, redfish has been created, the default deploy driver has been replaced, and Ironic has expanded into the CNCF community with Metal3. Thanks for making us a part of your cloud!

New Features

  • Adds a http boot interface, based upon the pxe boot interface which informs the DHCP server of an HTTP URL to boot the machine from, and then requests the BMC boot the machine in UEFI HTTP mode.

  • Adds a http-ipxe boot interface, based upon the ipxe boot interface which informs the DHCP server of an HTTP URL to boot the machine from, and then requests the BMC boot the machine in UEFI HTTP mode.

  • Adds node auto-discovery support to the agent inspection implementation.

  • Adds support for ovn vtep switches. Operators will be able to use logical and physical switches. Minimally tested in production.

  • Adds a new service ironic-pxe-filter that is designed to work with the agent inspect interface to conduct “unmanaged” inspection. It is adapted from the ironic-inspector’s dnsmasq PXE filter and can be used as its replacement. See documentation for more details.

  • Adds implementation of attach/detach generic virtual media device to the Redfish driver.

Known Issues

  • Testing of the http boot interface with the Grub2 provided by Ubuntu 22.04 yielded some intermittent failures which appear to be environmental in nature: the signed Shim loader would start and load the GRUB loader, which would then attempt to access some of the expected files and fail due to an apparent transfer timeout. Consultation with some grub developers concurs that this is likely environmental, meaning related to the specific grub build or to CI performance. If you encounter any issues, please do not hesitate to reach out to the Ironic developer community.

Upgrade Notes

  • Adds an online migration to the new inspection interface. If the agent inspection is enabled and the inspector inspection is disabled, the inspect_interface field will be updated for all nodes that use inspector and are currently not on inspection (i.e. not in the inspect wait or inspecting states).

    If some nodes may be inspecting during the upgrade, you may want to run the online migrations several times with a delay to finish migrating all nodes.
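
The selection rule of this online migration can be illustrated with a small sketch. It is not the migration code itself: nodes are modelled as plain dictionaries, the function name is hypothetical, and only the interface names and provision states quoted above are taken from these notes.

```python
# Nodes in these provision states are skipped and picked up by a later run.
SKIPPED_STATES = {"inspect wait", "inspecting"}


def migrate_inspect_interface(nodes):
    """Switch eligible nodes from the inspector to the agent inspect interface."""
    migrated = 0
    for node in nodes:
        if node["inspect_interface"] != "inspector":
            continue  # nothing to migrate
        if node["provision_state"] in SKIPPED_STATES:
            continue  # currently on inspection, retry on the next run
        node["inspect_interface"] = "agent"
        migrated += 1
    return migrated
```

Running the function again after the skipped nodes leave inspection migrates the remainder, which is why repeating the online migrations with a delay is suggested.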

Deprecation Notes

  • The redfish vendor eject vmedia action is now deprecated and it will be removed during the next cycle in favor of the generic API.

Bug Fixes

  • Fixes Redfish virtual media boot on BMCs that only expose the VirtualMedia resource on Systems instead of Managers. For more information, see bug 2039458.

  • Fixes a vague error when attempting to use the ilo hardware type with iLO6 hardware, by returning a more specific error suggesting action to take in order to remedy the issue. Specifically, one of the APIs used by the ilo hardware type is disabled in iLO6 BMCs in favor of users utilizing Redfish. Operators are advised to utilize the redfish hardware type for these machines.

  • Some of Ironic’s API endpoints, when the new RBAC policy is being enforced, were previously emitting 500 error codes when insufficient access rights were being used, specifically because the policy required system scope. This has been corrected, and the endpoints should now properly signal a 403 error code if insufficient access rights are present for an authenticated requestor.

  • Increases the 32-character limit of the user column in the NodeHistory model to support up to 64-character-long values. For more information, see bug.

  • Fixes issues with Lenovo hardware where the system firmware may display a blue “Boot Option Restoration” screen after the agent writes an image to the host in UEFI boot mode, requiring manual intervention before the deployed node boots. This issue is rooted in multiple changes being made to the underlying NVRAM configuration of the node. Lenovo engineers have suggested changing only the UEFI NVRAM and not performing any further changes via the BMC to configure the next boot. Ironic now does so on Lenovo hardware. More information and background on this issue can be found in bug 2053064.

  • Fixes an issue where the conductor service would fail to launch when the neutron network_interface setting was enabled, and no global cleaning_network or provisioning_network is set in ironic.conf. These settings have long been able to be applied on a per-node basis via the API. As such, the service can now be started and will error on node validation calls, as designed for drivers missing networking parameters.

  • Each conductor now reserves a small proportion of its worker threads (5% by default) for API requests and other critical tasks. This ensures that the API stays responsive even under extreme internal load.

  • Provides a fix for service role support to enable the use case where a dedicated service project is used for cloud service operation to facilitate actions as part of the operation of the cloud infrastructure.

    OpenStack clouds can take a variety of configuration models for service accounts. It is now possible to utilize the [DEFAULT] rbac_service_role_elevated_access setting to enable users with a service role in a dedicated service project to act upon the API similar to a “System” scoped “Member” where resources regardless of owner or lessee settings are available. This is needed to enable synchronization processes, such as nova-compute or the networking-baremetal ML2 plugin, to perform actions across the whole of an Ironic deployment where use of a “System” scoped user is undesirable.

    This functionality can be tuned to utilize a customized project name aside from the default convention service, for example baremetal or admin, utilizing the [DEFAULT] rbac_service_project_name setting.

    Operators can alternatively entirely override the service_role RBAC policy rule, if so desired, however Ironic feels the default is both reasonable and delineates sufficiently for the variety of Role Based Access Control usage cases which can exist with a running Ironic deployment.

  • Query parameters in the API that expect lists now accept repeated arguments (param=value1&param=value2) in addition to comma-separated strings (param=value1,value2). The former seems to be more common and is actually (incorrectly) used in GopherCloud.

  • Fixes error handling in the virtual media attachment API when the image downloading fails. Now the last_error field is populated correctly and the error is logged.
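
The two list styles now accepted for query parameters can be normalised with the standard library as sketched below. This is only an illustration of the accepted input forms, not the API's actual parsing code; the list_param helper and the fields parameter name used in the usage note are assumptions for the example.

```python
from urllib.parse import parse_qs


def list_param(query_string, name):
    """Collect a list parameter given as repeated arguments, commas, or both."""
    items = []
    # parse_qs returns one entry per repeated argument; commas stay in the value.
    for value in parse_qs(query_string).get(name, []):
        items.extend(part for part in value.split(",") if part)
    return items
```

With this helper, the query strings fields=uuid&fields=name and fields=uuid,name both produce the same two-item list.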

24.0.0

New Features

  • Adds the capability to define a default_conductor_group setting which allows operators to assign a default conductor group to new nodes created in Ironic if they do not otherwise have a conductor_group set upon creation. By default, this setting has no value.

  • Adds support for Redfish based HTTPBoot, which leverages the DMTF Redfish HttpBootUri ComputerSystem resource in a BMC, to assert the URL for the next boot operation. This requires Sushy 4.7.0 as the minimum version.

  • Adds a new capability allowing to attach or detach generic iso images as virtual media devices after a node has been provisioned.

  • Previously the key for building temporary URLs from Swift was taken from the x-account-meta-temp-url-key header in the object store account. Now the header x-account-meta-temp-url-key-2 is also checked, which allows password rotation to occur without breaking old URLs.

    This applies to the following temporary URL scenarios:

    • Temp URL image transfer from Glance (when [glance]swift_temp_url_key is not set)

    • Publishing an image with the Swift publisher ([redfish]use_swift=True or [ilo]use_web_server_for_images=False)

    • Storing the config drive in Swift ([deploy]configdrive_use_object_store=True)

    • Fetching Swift stored firmware update payloads.

  • Introduces basic authentication and configurable authentication strategy support for the image and image checksum download processes. This feature introduces 3 new configuration variables that can be used to select the authentication strategy and provide credentials for it. The first, [deploy]image_server_auth_strategy (string), selects the authentication strategy by name.

    Currently the only supported authentication strategy is http-basic, which makes IPA use HTTP(S) basic authentication, also known as the RFC 7617 standard. The other 2 variables, [deploy]image_server_password and [deploy]image_server_user, provide the username and password credentials for image download processes. They are not strategy specific and could be reused for any username + password based authentication strategy, but for the moment they are only used for the http-basic strategy.

    [deploy]image_server_auth_strategy doesn’t just enable the feature but enforces checks on the values of the 2 related credentials. When the http-basic strategy is enabled for image server download workflow the download logic will make sure to raise an exception in case any of the credentials are None or an empty string.

    An example of activating the http-basic strategy can be found in the “HTTP(s) Authentication strategy for user image servers” section of the admin guide.
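
As a rough illustration of what the http-basic strategy implies on the wire, the sketch below builds an RFC 7617 Authorization header and rejects missing credentials, in the spirit of the checks described above. The function name is hypothetical and this is not the code used by Ironic or IPA.

```python
import base64


def basic_auth_header(user, password):
    """Build an HTTP basic authentication header, refusing empty credentials."""
    if not user or not password:
        # Mirrors the described behaviour: None or empty strings are an error.
        raise ValueError("image server user and password must both be set")
    credentials = ("%s:%s" % (user, password)).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": "Basic " + token}
```

Basic authentication only encodes the credentials, so it should be combined with HTTPS when the image server is reachable over an untrusted network.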

Upgrade Notes

  • The Ironic service API Role Based Access Control policy has been updated to disable the legacy RBAC policy by default. The effect of this is that deprecated legacy roles of baremetal_admin and baremetal_observer are no longer functional by default, and policy checks may prevent actions such as viewing nodes when access rights do not exist by default.

    This change is a result of the new policy which was introduced as part of Secure Role Based Access Control effort along with the Consistent and Secure RBAC community goal and the underlying [oslo_policy] enforce_scope and [oslo_policy] enforce_new_defaults settings being changed to True.

    The Ironic project believes most operators will observe no direct impact from this change, unless they are specifically running legacy access configurations utilizing the legacy roles for access.

    Operators which are suddenly unable to list or deploy nodes may have a misconfiguration in credentials, or need to allow the user’s project the ability to view and act upon the node through the node owner or lessee fields. By default, the Ironic API policy permits authenticated requests with a system scoped token to access all resources, and applies a finer grained access model across the API for project scoped users.

    Ironic users who have not already changed their nova-compute service settings for connecting to Ironic may also have issues scheduling Bare Metal nodes. Use of a system scoped user is available, by setting [ironic] system_scope to a value of all in your nova-compute service configuration, which can be done independently of other services, as long as the credentials supplied are also valid with Keystone for system scoped authentication.

    Heat users who encounter any issues after this upgrade should check their user’s roles. Heat’s execution and model is entirely project scoped, which means users will need to have access granted through the owner or lessee field to work with a node.

    Operators wishing to revert to the old policy configuration may do so by setting the following values in ironic.conf:

    [oslo_policy]
    enforce_new_defaults=False
    enforce_scope=False
    

    Operators who revert the configuration are encouraged to make the necessary changes to their configuration, as the legacy RBAC policy will be removed at some point in the future in alignment with 2024.1-Release Timeline. Failure to do so may force operators to craft custom policy override configuration.

  • Removes the sphinxcontrib-seqdiag dependency as the Pillow upgrade to version 10.x (from OpenStack upper constraints) breaks its usage. seqdiag has not been maintained for the last 3 years, hence the upgrade causes it to break. In the ironic docs (source) rst files, adds references to svg files, and keeps the svg files in the doc/source/images/ directory, alongside their associated .diag files as backup.

  • The default value of the configuration option [inspector]require_managed_boot is now True for the newer agent inspect interface. The older inspector implementation is not affected. Operators with deployments that support unmanaged inspection must set this value to False explicitly.

  • python-swiftclient is no longer a dependency; all OpenStack Swift operations are now done using openstacksdk.

    Configuration option [swift]swift_max_retries has been removed and any custom value will no longer have any effect on failed object-store operations.

Deprecation Notes

  • The deploy_kernel, deploy_ramdisk, rescue_kernel and rescue_ramdisk configuration options, incorrectly deprecated in the 2023.2 release series, are no longer deprecated.

  • The idrac hardware type management interface steps import_configuration and export_configuration steps are deprecated, and will be removed once a formalized generic step templating mechanism has been created within Ironic. The Ironic community is open to reconsidering this decision should the overall bulk configuration reset/templating model become adopted by DMTF Redfish as a standardized cross-vendor feature.

  • The ibmc hardware type is deprecated due to a lack of upstream communication, driver maintenance, and a recognition that the Redfish hardware type likely works for the users at this point. This driver is expected to be removed during the 2024.2 development cycle.

  • The xclarity hardware type is deprecated due to a lack of upstream communication, driver maintenance, and a recognition that the Redfish hardware type is suitable for Lenovo hardware users moving forward. This driver is expected to be removed during the 2024.2 development cycle.

  • The idrac-wsman interfaces on the idrac hardware type are deprecated due to a lack of upstream communication, and the decision of the driver’s maintainer in the past to move in the direction of using Redfish for driver interactions. These driver interfaces are expected to be removed during the 2024.2 development cycle.

  • Rootwrap support is deprecated since Ironic no longer runs any commands as root. Files /etc/ironic/rootwrap.conf, /etc/ironic/rootwrap.d and the ironic-rootwrap command will be removed in a future release.

Bug Fixes

  • Firmware components are now also cached on the transition to the manageable state in addition to cleaning. This is consistent with how BIOS settings, vendor and boot mode are cached.

  • Fixes the behavior of file:/// image URLs pointing at a symlink. Ironic no longer creates a hard link to the symlink, which could cause a confusing FileNotFoundError if the symlink is relative.

  • Nodes no longer get stuck in cleaning when the firmware components caching code raises an unexpected exception.

  • Prevents a database constraints error on caching firmware components when a supported component does not have the current version.

  • Fixes an issue when listing allocations as a project scoped user when the legacy RBAC policies have been disabled, which caused an HTTP 406 error to be erroneously raised. Users attempting to list allocations with a specific owner, different from their own, will now receive an HTTP 403 error.

  • Fixes an inspection failure when the LLDP raw data collected by the inspection process includes non UTF-8 information, which caused the parser to fail and break the inspection process. The malformed data is now excluded, and a log entry records the TLV that failed to parse.

  • Fixes an issue where a System Scoped user could not trigger a node into a manageable state with cleaning enabled, as the Neutron client would attempt to utilize their user’s token to create the Neutron port for the cleaning operation, as designed. This is because with requests made in the system scope, there is no associated project and the request fails.

    Ironic now checks if the request has been made with a system scope, and if so it utilizes the internal credential configuration to communicate with Neutron.

  • When configured to listen on a unix socket, Ironic will now properly cleanup the unix socket on a clean service stop.

  • The idrac hardware type is now compatible with the redfish firmware interface. The link between them was missing initially.

  • Fixes the inspection lookup to consider all nodes with the same BMC hostname, as can happen with Redfish. In this case, the nodes are distinguished by MAC addresses.

  • Fixes getting details of a conductor if it uses a non-standard JSON RPC port or an IPv6 address as the name, e.g. GET /v1/conductors/[2001:db8::1]:8090. Previously, it would result in a HTTP error 400.

  • Fixes enable_netboot_fallback to write out pxe config on adopt.

  • When configuring secure boot via Redfish, internal server errors are now retried for a longer period than by default, accounting for the SecureBoot resource unavailability during configuration on some hardware.

  • Fixes a RAID creation issue in iLO6 and other BMCs with the latest schema by removing ‘VolumeType’, ‘Encrypted’ and changing placement of ‘Drives’ to inside ‘Links’.

  • Fixes the payload format required to query physical storage drives using redfish, when configuring RAID using redfish.

  • Uses the volume_name provided in the target_raid_config field of a node to set the storage volume name when configuring RAID with the redfish driver (instead of discarding the volume_name given in target_raid_config).

  • Uses the ‘volume_name’ field from the logical_disk in the target_raid_config field of a node, instead of just ‘name’ (which is incorrect as per the Ironic API expectation), to create the RAID volume using the Redfish driver.
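
The symlink fix for file:/// image URLs listed above comes down to resolving the link before hard-linking. A minimal sketch of that idea follows; the link_image name is hypothetical, and the sketch assumes both paths are on the same filesystem, since hard links cannot cross devices.

```python
import os


def link_image(source, dest):
    """Hard-link dest to the real file behind source, resolving symlinks first."""
    # realpath follows the symlink (including relative targets), so the hard
    # link points at the image file itself and not at the symlink.
    real_source = os.path.realpath(source)
    os.link(real_source, dest)
    return real_source
```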

Other Notes

  • The classic ilo hardware types may be deprecated in the future for removal or major changes, however our last communication with the maintainers as of the 2024.1 Project Teams Gathering sessions indicated they were still working to determine their own forward path with a strong emphasis on the use of Redfish.

23.1.0

New Features

  • Sending signal SIGUSR2 to a conductor process will now trigger a drain shutdown. This is similar to a SIGTERM graceful shutdown but the timeout is determined by [DEFAULT]drain_shutdown_timeout which defaults to 1800 seconds. This is enough time for running tasks on existing reserved nodes to either complete or reach their own failure timeout.

    During the drain period the conductor will be removed from the hash ring to prevent new tasks from starting. Other conductors will no longer fail reserved nodes on the draining conductor, which previously appeared to be orphaned. This is achieved by running the conductor keepalive heartbeat for this period, but setting the online state to False.

  • While Ironic has not explicitly added support for OVN, because that is in theory a Neutron implementation detail, we have added some basic testing and are pleased to announce that you can use OVN’s DHCP service for IPv4 based provisioning with OVN v23.06.00 and beyond. This is not without issues, and we’ve added ovn documentation as a result to help provide as much Ironic operator clarity as possible.
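
Requesting the drain shutdown described above only requires delivering SIGUSR2 to the conductor process. The sketch below shows one way to do that from Python and to compute when the drain period would end; the helper names are hypothetical, finding the conductor PID is left to the operator, and 1800 seconds is the [DEFAULT]drain_shutdown_timeout default quoted above.

```python
import os
import signal

DEFAULT_DRAIN_SHUTDOWN_TIMEOUT = 1800  # seconds, default from these notes


def request_drain(pid):
    """Send SIGUSR2 to the given conductor PID to trigger a drain shutdown."""
    os.kill(pid, signal.SIGUSR2)


def drain_deadline(started_at, timeout=DEFAULT_DRAIN_SHUTDOWN_TIMEOUT):
    """Return the time (same clock as started_at) at which the drain times out."""
    return started_at + timeout
```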

Known Issues

  • Use of OVN may require disabling SNAT for provisioning with IPv4 when using TFTP. This is due to the Linux Kernel, and how IP packet handling occurs with OVN. No solution is known to this issue, and use of provisioning technologies which do not use TFTP is also advisable.

  • Use of OVN may require careful attention to the MTUs of networks. Oversized packets may be dropped. That being said, this is more likely an issue for testing than with actual physical baremetal in a production deployment.

  • Use of OVN for IPv6 based PXE/iPXE is not supported by Neutron. The Ironic project expects this to be addressed during the Caracal (2024.1) development cycle.

  • When configuring a single-conductor environment, make sure the size of the worker pool ([conductor]worker_pool_size) is larger than the maximum number of parallel deployments ([conductor]max_concurrent_deploy). This was not the case by default previously (the options used to be set to 100 and 250 respectively).
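
The sizing rule in the last item can be expressed as a small sanity check. This is a hypothetical helper for illustration, not something shipped with Ironic; the numbers in the usage note are the old and new defaults mentioned in these notes.

```python
def check_single_conductor_pools(worker_pool_size, max_concurrent_deploy):
    """Return a warning string, or None when the worker pool is large enough."""
    if worker_pool_size > max_concurrent_deploy:
        return None
    return (
        "[conductor]worker_pool_size (%d) should be larger than "
        "[conductor]max_concurrent_deploy (%d)"
        % (worker_pool_size, max_concurrent_deploy)
    )
```

Called with the previous defaults (100 and 250) the helper returns a warning, while the new worker pool default of 300 passes against a limit of 250.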

Upgrade Notes

  • Because of a fix in the internal worker pool handling, you may now start seeing requests rejected with HTTP 503 under a very high load earlier than before. In this case, try increasing the [conductor]worker_pool_size option or consider adding more conductors.

  • The default worker pool size (the [conductor]worker_pool_size option) has been increased from 100 to 300. You may want to consider increasing it even further if your environment allows that.

Bug Fixes

  • The parent_node field, a newly added API field, has been constrained to store UUIDs over the names of nodes. When names are used, the value is changed to the UUID of the node.

  • Properly eject the virtual media from a DVD device in case this is the only MediaType available from the Hardware, and Ironic requested CD as the device to be used. See bug 2039042 for details.

  • When Ironic hits the limit on the number of the concurrent deploys (specified in the [conductor]max_concurrent_deploy option), the resulting HTTP code is now 503 instead of the more generic 500.

  • The per-node external_http_url setting in the driver info is now used for a boot ISO. Previously this setting was only used for a config floppy.

  • Fixes changing or getting the state of the indicator LED of an attached disk. The previous code mistakenly assumed that the SimpleStorage resource provides this functionality, when it is actually provided by the Storage resource.

  • Fixes handling new requests when the maximum number of internal workers is reached. Previously, after reaching the maximum number of workers (100 by default), we would queue the same number of requests (100 again). This was not intentional, and now Ironic no longer queues requests if there are no free threads to run them.
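
The worker handling described in the last item, rejecting new work instead of queueing it once every worker is busy, can be sketched with the standard library as follows. This is an illustration of the pattern only and not the conductor's implementation; the class and exception names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class NoFreeWorker(Exception):
    """Raised when every worker is busy; an API layer could map this to HTTP 503."""


class RejectingPool:
    """Thread pool that refuses new tasks rather than queueing them."""

    def __init__(self, size):
        self._executor = ThreadPoolExecutor(max_workers=size)
        self._free = threading.Semaphore(size)  # counts idle workers

    def submit(self, func, *args):
        # Non-blocking acquire: fail immediately when no worker is idle.
        if not self._free.acquire(blocking=False):
            raise NoFreeWorker("no free worker threads")

        def run():
            try:
                return func(*args)
            finally:
                self._free.release()  # worker becomes idle again

        try:
            return self._executor.submit(run)
        except Exception:
            self._free.release()
            raise

    def shutdown(self):
        self._executor.shutdown(wait=True)
```

Because the semaphore never allows more accepted tasks than there are threads, nothing waits in the executor's internal queue.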