Intel IDXD Driver To Better Handle Accelerators In Event Of Hardware Errors
([Intel] 2 Hours Ago
IDXD Reset On Hardware Errors)
- Reference: 0001476059
- News link: https://www.phoronix.com/news/Intel-IDXD-FLR-Reset-HW-Errors
- Source link:
Intel's [1]IDXD driver enables the Data Streaming Accelerator (DSA) under Linux. DSA has shipped since Sapphire Rapids as part of the accelerator offerings on Intel's Xeon processors. With patches posted today, the IDXD driver will help the hardware recover from errors for a more robust experience.
Patches posted today on the Linux kernel mailing list enable the Intel IDXD driver to perform a PCIe Function Level Reset (FLR) when the Data Streaming Accelerator(s) hit a hardware error. An FLR allows for more robust recovery than the status quo of just printing an error message when such a problem occurs.
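For those wanting to see whether a given PCI function advertises FLR at all, a minimal sketch follows. It assumes a kernel new enough to expose the per-device reset_method sysfs attribute, and it is only an illustration of reading that attribute, not something taken from the patch series. The function name supports_flr and the device address in the usage note are hypothetical.

```python
from pathlib import Path


def supports_flr(bdf: str, sysfs_root: str = "/sys") -> bool:
    """Return True if the PCI function lists 'flr' among its enabled reset methods.

    bdf is the PCI address (domain:bus:device.function). sysfs_root is a
    parameter only so the function can be pointed at a fake tree for testing.
    """
    path = Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "reset_method"
    try:
        # The attribute holds a space-separated list such as "flr bus".
        methods = path.read_text().split()
    except OSError:
        # Attribute absent: older kernel, no such device, or no reset method.
        return False
    # Exact token match, so "af_flr" alone does not count as "flr".
    return "flr" in methods
```

Calling supports_flr("0000:6a:01.0") with a placeholder address like this one would report whether the kernel is prepared to use FLR for that function. It says nothing about whether the IDXD driver will use it for error recovery.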
[2]
The "[3]enable FLR for IDXD halt" patch series explains:
"When IDXD device hits hardware errors, it enters halt state and triggers an interrupt to IDXD driver. Currently IDXD driver just prints an error message in the interrupt handler.
A better way to handle the interrupt is to do Function Level Reset (FLR) and recover the device's hardware and software configurations to its previous working state. The device and software can continue to run after the interrupt.
This series enables this FLR handling for IDXD device whose WQs are all user type. FLR handling for IDXD device whose WQs are kernel type will be implemented in a future series."
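The recovery flow described in the patch cover letter can be pictured with a small toy model. This is a user-space sketch under stated assumptions, not the kernel code: ToyIdxdDevice, State, and handle_halt_interrupt are hypothetical names, and the configuration is reduced to a plain dictionary. It shows only the ordering of steps: halt, check that all WQs are user type, save the configuration, reset, restore, re-enable.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    ENABLED = "enabled"
    HALTED = "halted"
    DISABLED = "disabled"


@dataclass
class ToyIdxdDevice:
    """Hypothetical stand-in for a DSA device; not the kernel's structures."""
    wq_types: list  # one entry per work queue: "user" or "kernel"
    config: dict = field(default_factory=dict)
    state: State = State.ENABLED

    def hardware_error(self) -> None:
        # On a hardware error the device enters the halt state.
        self.state = State.HALTED

    def function_level_reset(self) -> None:
        # Modeled here as: the reset wipes the configuration and leaves
        # the device disabled until software sets it up again.
        self.config = {}
        self.state = State.DISABLED


def handle_halt_interrupt(dev: ToyIdxdDevice) -> bool:
    """Try to recover a halted device; return True if it is running again."""
    if dev.state is not State.HALTED:
        return False
    if any(wq_type != "user" for wq_type in dev.wq_types):
        # Assumption: for devices with kernel-type WQs the behavior stays
        # as before, i.e. only an error message is printed.
        print("toy-idxd: device halted, no FLR recovery for kernel-type WQs")
        return False
    saved = dict(dev.config)    # save the software-visible configuration
    dev.function_level_reset()  # reset the function
    dev.config = saved          # restore the previous working configuration
    dev.state = State.ENABLED   # device and software can continue to run
    return True
```

A device created with wq_types=["user", "user"] comes back enabled with its configuration intact after hardware_error() followed by handle_halt_interrupt(), while one with any "kernel" entry stays halted, mirroring the scope the series describes.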
These IDXD patches are now under review and will hopefully be picked up for a forthcoming kernel release. With the Linux v6.11 merge window just a week or two away, it remains to be seen whether they will be deemed ready by then or pushed back to a later kernel version.
[1] https://www.phoronix.com/search/IDXD
[2] https://www.phoronix.com/image-viewer.php?id=intel-accelerators-linux&image=spr_accelerator_4_lrg
[3] https://lore.kernel.org/lkml/20240705181519.4067507-1-fenghua.yu@intel.com/