2024.1 Series (9.8.0 - 9.11.x) Release Notes

9.11.1

Upgrade Notes

  • Deployers implementing their own HardwareManagers must audit their code for unsafe uses of qemu-img and related methods.

Security Issues

  • Ironic-Python-Agent now checks any supplied image format value against the detected format of the image file and will prevent deployments should the values mismatch.

  • In some non-default configurations, images misconfigured as raw despite being in another format may previously have been converted by mistake. Ironic-Python-Agent will no longer perform conversion in any case for images whose metadata indicates the raw format.

  • Ironic-Python-Agent always inspects any non-raw user image content for safety before running any qemu-based utilities on the image. This inspection is used to identify the format of the image and to verify the overall safety of the image. Any images with unknown or unsafe feature uses are explicitly rejected. This can be disabled in both IPA and Ironic by setting [conductor]disable_deep_image_inspection to True for the Ironic deployment. Image inspection is the primary mitigation for CVE-2024-44082, which is tracked in bug 2071740. Operators may wish to set [conductor]conductor_always_validates_images on Ironic conductors to mitigate the issue before they have upgraded their Ironic-Python-Agent.

  • Ironic-Python-Agent now explicitly enforces a list of permitted image types for deployment, defaulting to “raw” and “qcow2”. Other image types may work, but are not explicitly supported and must be enabled. This can be modified by setting [conductor]permitted_image_formats for all Ironic services.
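The format checks described above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: the function name, the exception class, and the idea of passing in an already detected format are all assumptions made for the example.

```python
class ImageFormatError(Exception):
    """Raised when an image must not be deployed (hypothetical name)."""


def check_image_format(detected, supplied=None, permitted=("raw", "qcow2")):
    """Return the format to deploy, or raise ImageFormatError.

    `detected` stands in for the result of inspecting the image content;
    `supplied` is the format recorded in the image metadata, if any.
    """
    # Only explicitly permitted formats may be deployed.
    if detected not in permitted:
        raise ImageFormatError(f"format {detected!r} is not permitted")
    # A supplied format must agree with what was detected in the file.
    if supplied is not None and supplied != detected:
        raise ImageFormatError(
            f"supplied format {supplied!r} does not match detected {detected!r}"
        )
    return detected
```

In this sketch a qcow2 file labelled as raw is rejected rather than silently converted, which mirrors the mismatch behaviour described in the notes.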

Bug Fixes

  • Fixes an issue where configuration drive volumes which are mounted by the operating system could remain mounted and cause a lock to be held, which may conflict with actions such as rebuild. The agent now always makes sure the folder used by Glean and Cloud-init is not mounted.

  • Fixes multiple issues in the handling of images as it relates to execution of the qemu-img utility. When using this utility to convert an unsafe image, a malicious user can extract information from a node while Ironic-Python-Agent is deploying or converting an image. Ironic-Python-Agent now inspects all non-raw images for safety, and never runs qemu-based utilities on raw images. This fix is tracked as CVE-2024-44082 and bug 2071740.

  • Images with metadata indicating a “raw” disk format may have been transparently converted from another format. Now, these images will have their exact contents imaged to disk without modification.

  • Fixes bug 2066308, an issue where Ironic Python Agent would call evaluate_hardware_support multiple times on hardware manager plugins. Scanning for hardware and disks is time consuming, and caused timeouts on badly-performing nodes.

9.10.0

New Features

  • When the new Ironic built-in inspection is used, ipa-inspection-callback-url can now be automatically derived from ipa-api-url. In this case, inspection will be enabled if the ipa-inspection-collectors option is set.

Upgrade Notes

  • If you currently set the ipa-inspection-collectors option without setting ipa-inspection-callback-url, it will now cause inspection to run. Update your boot configuration to only supply the collectors when inspection is desired.

9.9.0

New Features

  • Adds support for collecting the CPU socket number.

  • Supports several comma-separated URLs for ipa-api-url and ipa-inspection-callback-url. The URLs are probed in the provided order until one does not return a connection error. The primary use case is to support deploying nodes with only one IP stack from an Ironic installation that has both stacks.
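The probing order for comma-separated URLs could look roughly like the sketch below. It is only an illustration under stated assumptions: first_reachable and the injected probe callable are hypothetical names, and how the agent actually probes a URL is not described in these notes.

```python
def first_reachable(urls, probe):
    """Return the first URL whose probe does not raise ConnectionError.

    `urls` is a comma-separated string; `probe` is any callable that
    contacts a URL and raises ConnectionError when it cannot connect.
    """
    candidates = [url.strip() for url in urls.split(",") if url.strip()]
    last_error = None
    for url in candidates:
        try:
            probe(url)
        except ConnectionError as exc:
            # Remember the failure and move on to the next candidate.
            last_error = exc
            continue
        return url
    raise ConnectionError(f"no reachable URL among {candidates}") from last_error
```

Passing the probe in as a parameter keeps the ordering logic separate from the network call, so the sketch can be exercised without a real Ironic endpoint.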

Bug Fixes

  • Fixes missing Content-Type header when sending inspection data back to ironic-inspector or ironic. While ironic-inspector tolerates the missing header, it may cause issues with the new inspection implementation.

  • Fixes a reference to the raid_device variable before assignment; it is replaced by the blk variable.

  • Adds random jitter to retried heartbeats after Ironic returns an error. Previously, heartbeats would be retried after 5 seconds, potentially causing a thundering herd problem if many nodes fail to heartbeat at the same time.

  • Inspection is now retried on HTTP 409 (conflict), which can be returned by the new implementation in Ironic.

  • Fixes posting data to inspector so that it is retried on 50X errors.
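To illustrate why jitter helps with retried heartbeats, here is a minimal sketch of a randomized retry delay. The 5 second base comes from the note above; the jitter range and the function name are assumptions chosen for the example, not the values the agent uses.

```python
import random


def retry_delay(base=5.0, jitter=0.5, rng=random):
    """Return the base delay scaled by a random factor.

    The factor is drawn uniformly from [1 - jitter, 1 + jitter], so with
    the (assumed) defaults the delay falls between 2.5 and 7.5 seconds.
    """
    return base * rng.uniform(1.0 - jitter, 1.0 + jitter)
```

With a fixed delay, nodes that fail at the same moment all retry at the same moment; spreading the delays makes the retries arrive at different times.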

9.8.0

New Features

  • Introduces basic authentication and configurable authentication strategy support for the image and image checksum download processes. Three new variables can be set (either via oslo.config or image_info) to select the authentication strategy and provide credentials for HTTP(S) basic authentication. The first, ‘image_server_auth_strategy’ (string), selects the authentication strategy by name. Currently the only supported strategy is ‘http-basic’, which makes IPA use HTTP(S) basic authentication, also known as the ‘RFC 7617’ standard. The other two variables, ‘image_server_user’ and ‘image_server_password’, provide the username and password credentials for image downloads. They are not strategy specific and could be reused for any username and password based authentication strategy, but for the moment they are only used by the ‘http-basic’ strategy. ‘image_server_basic_auth’ not only enables the feature but also enforces checks on the values of the two related credentials: when the ‘http-basic’ strategy is enabled for the image download workflow, the download logic raises an exception if either credential is None or an empty string. Values coming from ‘image_info’ take priority over values coming from ‘oslo.config’, and the two credential sources cannot be mixed: passing one or two of the three values from one source and the remaining values from another source results in an exception.

  • Adds numa_node id when collecting pci device info during inspection.

  • The Nvidia Mellanox “clean steps” that facilitate firmware updates of their network devices are now available as deploy steps and service steps as well.
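The precedence and no-mixing rules for the image server credentials might be sketched as follows. This is a simplified illustration, not the agent's implementation: resolve_image_auth is a hypothetical helper, both sources are modelled as plain dictionaries, and treating a partially filled image_info as the mixing error is an assumption of this sketch.

```python
KEYS = ("image_server_auth_strategy", "image_server_user", "image_server_password")


def resolve_image_auth(image_info, config):
    """Return (user, password) for 'http-basic', or None if no strategy is set."""
    from_info = {key: image_info.get(key) for key in KEYS}
    provided = [key for key in KEYS if from_info[key] is not None]
    # Sources cannot be mixed: image_info supplies all three values or none.
    if provided and len(provided) != len(KEYS):
        raise ValueError("image_info must supply all three values or none")
    # image_info takes priority over the configuration when it is complete.
    source = from_info if provided else {key: config.get(key) for key in KEYS}

    strategy = source["image_server_auth_strategy"]
    if strategy is None:
        return None
    if strategy != "http-basic":
        raise ValueError(f"unsupported strategy: {strategy!r}")
    user = source["image_server_user"]
    password = source["image_server_password"]
    # Both credentials must be present and non-empty for http-basic.
    if not user or not password:
        raise ValueError("http-basic requires a non-empty user and password")
    return user, password
```

The returned pair could then be handed to an HTTP client as basic authentication credentials.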

Deprecation Notes

  • Removed the partial implementation of image caching. This method was never used by Ironic, and was a vestige of a feature which was never completed. Image caching, if it were to be implemented today, would not use this method anyway.

Bug Fixes

  • Fixes a failure case where downloads would not be retried when the checksum fails verification. The agent now includes the checksum activity as part of the file download operation, and will automatically retry downloads when the checksum fails in accordance with the existing download retry logic. This is largely in response to what appears to be intermittent transport failures at lower levels which we cannot otherwise detect.

  • The default timeout value for the agent to look itself up in an Ironic deployment has been extended to 600 seconds from 300 seconds. This is to provide better stability for Ironic deployments under heavy load which may be unable to service new requests. This is particularly true when the backing database for Ironic is SQLite, due to the limited write concurrency of the database.

  • Fix the logic to detect the right parent device for a given multipath device.

  • Fixes the error handling of the multipathd service startup and discovery process. IPA handles both the scenario where the multipathd service is already running and the scenario where it has not been started, in which case IPA tries to start it. However, IPA does not check whether multipathd is already running: it always attempts to start the service and expects a return code of 0 even if the service is already running. With certain combinations of distributions and multipathd versions, the return code is not 0 when IPA tries to start multipathd while an instance is already running. The non-zero return code raised an exception that terminated multipath device discovery prematurely, and if the selected root device was a multipath device, IPA was unable to provision. This fix discards the exception caused by the non-zero return code from the multipathd startup process. A genuine issue with the multipath service is still caught when the actual multipath device listing command (multipath -ll) is executed.

  • Fixes an issue with rebuilding instances on Software RAID with RAIDed ESP partitions.
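The checksum retry fix in the first item of this list amounts to treating checksum verification as part of the download attempt. A minimal sketch of that idea follows; the function name, the injected fetch callable, the use of SHA-256, and the attempt count are all assumptions made for the example.

```python
import hashlib


def download_with_checksum(fetch, expected_sha256, attempts=3):
    """Fetch data, retrying when the transfer or the checksum fails.

    `fetch` is any callable returning the image bytes; it stands in for
    the real download and may raise OSError on transport errors.
    """
    last_error = None
    for _ in range(attempts):
        try:
            data = fetch()
        except OSError as exc:
            last_error = exc
            continue
        # Verification is part of the attempt, so a mismatch triggers a retry.
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
        last_error = ValueError("checksum mismatch")
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error
```

Because a corrupted transfer and a failed connection take the same retry path, an intermittent transport problem that only shows up as a bad checksum no longer fails the deployment on the first attempt.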