Linux 7.2 Can Significantly Lower Container Exit/Unmount Latency

Reference: 0001641169
News link: https://www.phoronix.com/news/Linux-72-Container-Exit-Latency
A patch series merged for the Linux 7.2 kernel addresses a race condition that can occur when a container is exiting, yielding "VFS: Busy inodes after unmount" messages and a possible use-after-free condition. But the patch series also goes further and delivers a very nice optimization that lowers container unmount latency in environments with heavy I/O load.

Alibaba engineer Baokun Li tracked down the possible race condition when a container exits and addressed it with the now-merged patch. That portion of the work should also be back-ported to the current Linux stable kernel series in the near future. What's most exciting, though, is the additional work that eliminates a global serialization penalty and can lead to much lower container exit/unmount latency.

Christian Brauner summed up the situation in [1]this pull request that is now merged for Linux 7.2:

"Fix a race between cgroup_writeback_umount() and inode_switch_wbs()

When a container exits, a race between cgroup_writeback_umount() and inode_switch_wbs()/cleanup_offline_cgwb() can trigger "VFS: Busy inodes after unmount" followed by a use-after-free on percpu counters. There is a window between inode_prepare_wbs_switch() returning true (having passed the SB_ACTIVE check and grabbed the inode) and the subsequent wb_queue_isw() call: if cgroup_writeback_umount() observes the global isw_nr_in_flight counter as non-zero but flush_workqueue() finds nothing queued yet, it returns early - leaving a held inode reference that blocks evict_inodes() and a later iput() that hits freed percpu counters.

The race is closed by covering the window from inode_prepare_wbs_switch() through wb_queue_isw() with an RCU read-side critical section and synchronizing in the umount path. On top of that the now-dead rcu_barrier() left over from the queue_rcu_work() era is removed, and the global synchronize_rcu()/flush_workqueue() pair is replaced with a per-sb in-flight counter plus pin/unpin/drain helpers so umount no longer serializes against switch activity on unrelated superblocks.

Under cgroup writeback churn on a 16 vCPU guest this takes umount latency from ~92-138ms p50 down to ~5-8ms p50 and the cumulative cost of cgroup_writeback_umount() from ~62ms to ~4us per call. The initial race fix is kept separate and minimal so it backports cleanly to stable trees that still queue switches via queue_rcu_work()."

Quite a nice improvement for the unmount latency: going by the quoted figures, the median drops by more than a factor of ten.
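To illustrate why a per-superblock in-flight counter avoids the global serialization described in the pull request, here is a minimal userspace sketch in Python. It is only an analogy and not the kernel code: the class `SuperblockModel`, its `pin()`/`unpin()`/`drain()` methods, and the `demo()` helper are hypothetical names chosen for illustration, and a condition variable stands in for the kernel's RCU and workqueue machinery.

```python
import threading


class SuperblockModel:
    """Toy stand-in for a superblock; hypothetical, not a kernel structure."""

    def __init__(self, name):
        self.name = name
        self._cond = threading.Condition()
        self._active = True    # loosely mirrors the SB_ACTIVE check
        self._in_flight = 0    # per-superblock count of pending switches

    def pin(self):
        """Register a pending switch; refuse once unmount has started."""
        with self._cond:
            if not self._active:
                return False
            self._in_flight += 1
            return True

    def unpin(self):
        """Mark one pending switch as finished and wake any drainer."""
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def drain(self, timeout=None):
        """Unmount side: block new pins, then wait for this superblock only.

        Returns True if the counter reached zero, False on timeout.
        """
        with self._cond:
            self._active = False
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)


def demo():
    busy = SuperblockModel("busy")
    idle = SuperblockModel("idle")

    pinned = busy.pin()                      # a switch is pending on "busy"
    idle_drained = idle.drain(timeout=1.0)   # "idle" does not wait for it

    # Finish the pending switch shortly afterwards from another thread.
    worker = threading.Timer(0.05, busy.unpin)
    worker.start()
    busy_drained = busy.drain(timeout=1.0)   # returns once unpin() has run
    worker.join()

    late_pin = busy.pin()                    # refused: unmount already began
    return pinned, idle_drained, busy_drained, late_pin
```

In this model, draining the "idle" superblock returns immediately even though another superblock still has a switch in flight, which is the property the pull request attributes to the per-sb counter. A single global counter would instead make every unmount wait on unrelated activity.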

There are also additional benchmark numbers from [2]this patch.

Separately, that same VFS pull request for Linux 7.2 also improves write performance when using the RWF_DONTCACHE flag. Benchmark numbers and more details are in [3]this patch.

[1] https://lore.kernel.org/lkml/20260612-vfs-writeback-v72-d7ca37da4512@brauner/

[2] https://lore.kernel.org/all/20260517142147.3354909-1-libaokun@linux.alibaba.com/

[3] https://lore.kernel.org/lkml/20260511-dontcache-v7-3-2848ddce8090@kernel.org/
