

Intel IDXD Driver To Better Handle Accelerators In Event Of Hardware Errors



Intel's [1]IDXD driver enables the Data Streaming Accelerator (DSA) under Linux. DSA is one of the accelerators Intel has built into its Xeon processors since Sapphire Rapids. With patches posted today, the IDXD driver can help the hardware recover from errors, making the accelerator more robust.

The patches, posted to the Linux kernel mailing list, make the IDXD driver perform a PCIe Function Level Reset (FLR) when a Data Streaming Accelerator hits a hardware error. An FLR allows more robust recovery than the current behavior, which is simply printing an error message when such a problem occurs.


The "[3]enable FLR for IDXD halt" patch series explains:

"When IDXD device hits hardware errors, it enters halt state and triggers an interrupt to IDXD driver. Currently IDXD driver just prints an error message in the interrupt handler.

A better way to handle the interrupt is to do Function Level Reset (FLR) and recover the device's hardware and software configurations to its previous working state. The device and software can continue to run after the interrupt.

This series enables this FLR handling for IDXD device whose WQs are all user type. FLR handling for IDXD device whose WQs are kernel type will be implemented in a future series."
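To make the recovery flow in the cover letter concrete, here is a minimal Python model of it. It is a sketch under stated assumptions, not the driver's code: the real driver is C inside the kernel, and the classes `IdxdDevice` and `WorkQueue`, the function `handle_halt_interrupt`, and the simulated `function_level_reset` are all hypothetical. The saved "configuration" is also reduced to a list of WQs. What the model shows is the order of operations: on a halt interrupt, save the configuration, reset the function, restore the configuration and re-enable the device, and fall back to merely logging the error if any WQ is kernel type, matching the series' stated scope.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace
from enum import Enum


class WqType(Enum):
    USER = "user"
    KERNEL = "kernel"


class DevState(Enum):
    ENABLED = "enabled"
    HALTED = "halted"
    RESET = "reset"  # configuration wiped by the (simulated) FLR


@dataclass
class WorkQueue:
    # Hypothetical, heavily simplified WQ configuration.
    name: str
    wq_type: WqType
    size: int


@dataclass
class IdxdDevice:
    # Hypothetical model of a device; not the kernel's struct idxd_device.
    state: DevState = DevState.ENABLED
    wqs: list[WorkQueue] = field(default_factory=list)
    log: list[str] = field(default_factory=list)

    def function_level_reset(self) -> None:
        # Simulated FLR: the function comes back unconfigured.
        self.wqs = []
        self.state = DevState.RESET


def handle_halt_interrupt(dev: IdxdDevice) -> bool:
    """Hypothetical halt handler; returns True if the device was recovered."""
    if dev.state is not DevState.HALTED:
        return False
    # Scope of the series: only devices whose WQs are all user type.
    if any(wq.wq_type is not WqType.USER for wq in dev.wqs):
        dev.log.append("device halted")  # previous behavior: just report it
        return False
    saved = [replace(wq) for wq in dev.wqs]  # copy the working configuration
    dev.function_level_reset()
    dev.wqs = saved                          # restore the previous configuration
    dev.state = DevState.ENABLED
    dev.log.append("recovered via FLR")
    return True


dev = IdxdDevice(state=DevState.HALTED, wqs=[WorkQueue("wq0.0", WqType.USER, 32)])
print(handle_halt_interrupt(dev), dev.state.name, [wq.name for wq in dev.wqs])
# prints: True ENABLED ['wq0.0']
```

Saving the configuration before the reset matters because an FLR returns the function to a clean state, so anything not captured beforehand cannot be reapplied afterwards.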

These IDXD patches are now under review and will hopefully be picked up for an upcoming kernel release. With the Linux v6.11 merge window just a week or two away, it remains to be seen whether they will be deemed ready in time or pushed to a later kernel version.



[1] https://www.phoronix.com/search/IDXD


[3] https://lore.kernel.org/lkml/20240705181519.4067507-1-fenghua.yu@intel.com/



phoronix
