A tale about fixing eBPF spinlock issues in the Linux kernel

148

20h

by y1n0

Veserv

How is this a kernel issue? The code that deadlocked was entirely written by Superluminal, who grabbed a shared lock from an interrupt handler. Not doing that is literally the very first lesson of writing interrupt handlers, and if you do not know that you have no business doing so.

The only way this could be considered an issue is that the Linux kernel appears to have added the rqspinlock, which is supposed to automatically detect incorrect code at runtime and kind of “un-incorrect” it. That piece of code did not correctly detect callers who were blindly using it incorrectly, in ways that its writers probably expected it to detect.

However, this entire escapade is absurd. Not only does this indicate that eBPF has gotten extensions that grossly violate any concept of the sandboxing that proponents claim, I do not see how you can effectively program in the rqspinlock environment. Any lock acquire can now fail with a timeout because some poorly written eBPF program decided that deadlocks were an enjoyable activity. Every single code path that acquires more than one lock must be able to guarantee global consistency before every lock acquire.

For instance, you cannot lock a sub-component for modification and then acquire a whole-component lock to rectify the state, since that second lock acquire may arbitrarily fail.

Furthermore, even if you do that, all it does is turn deadlocks due to incorrect code into incredibly long multi-millisecond denials of service due to incorrect code. I mean, yes, bad is better than horrible, but it is still bad.
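
To make the point about fallible lock acquires concrete, here is a minimal user-space sketch in C. Everything in it is hypothetical: `fallible_lock`, `fallible_acquire`, and the `force_timeout` switch are stand-ins invented for illustration and are not the kernel's rqspinlock API. It only shows the shape of the problem: once the second acquire can fail, the code has to undo whatever it already changed under the first lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock whose acquire can fail, e.g. on a timeout. */
struct fallible_lock {
    bool held;
    bool force_timeout; /* test switch: pretend the acquire timed out */
};

static int fallible_acquire(struct fallible_lock *l)
{
    if (l->held || l->force_timeout)
        return -1;      /* caller must cope with failure */
    l->held = true;
    return 0;
}

static void fallible_release(struct fallible_lock *l)
{
    l->held = false;
}

struct component     { struct fallible_lock lock; int total; };
struct sub_component { struct fallible_lock lock; int value; };

/* Update a sub-component, then fix up the parent's total.
   If the second acquire fails, roll the first change back so the
   two structures never disagree. */
static int update(struct component *c, struct sub_component *s, int new_value)
{
    if (fallible_acquire(&s->lock) != 0)
        return -1;

    int old_value = s->value;
    s->value = new_value;

    if (fallible_acquire(&c->lock) != 0) {
        s->value = old_value;           /* undo before bailing out */
        fallible_release(&s->lock);
        return -1;
    }

    c->total += new_value - old_value;
    fallible_release(&c->lock);
    fallible_release(&s->lock);
    return 0;
}
```

The rollback is only possible here because the change is trivially reversible; the comment's complaint is that real code paths often are not.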

legedemon

17h

Thanks for the great write-up with links to many more interesting articles and code! I stopped working on the Linux kernel long ago, but deep dives like these are very exciting reading.

benibr

Awesome story, thank you for sharing this in such great detail!

squirrellous

Great post!

The minimized repro seems like something many other eBPF programs will do. This makes me wonder why such kernel issues weren’t found earlier. Is this code utilizing some new eBPF capabilities in recent kernels?

rovarma

Thanks!

The new spinlock that the problem is in was introduced in kernel 6.15, which is relatively new. To hit the problem, you need to be hooking context switches, sampling at a high enough frequency, and using the ring buffer to emit those events. Outside of CPU profilers like us, I don't think there are many other eBPF applications with this type of setup.

alecco

Good writeup.

It is very confusing how Linux source code has macros with names that make them look like functions. At first glance it looks like "flags" is passed uninitialized, but it's a temporary save variable that the macro itself writes to. Sigh.
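
A small user-space illustration of that pattern, for readers who have not run into it. The `toy_*` names are made up for this sketch and are not the kernel's definitions; the idea is the same as in macros such as `spin_lock_irqsave(lock, flags)`, where the macro assigns to the caller's `flags` variable rather than reading it.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration only. */
struct toy_lock { int held; };

static unsigned long toy_irq_state = 1; /* 1 = interrupts "enabled" */

/* Because this is a macro, `flags` is named rather than passed by value:
   the expansion assigns to the caller's variable. */
#define toy_lock_irqsave(lock, flags)      \
    do {                                   \
        (flags) = toy_irq_state;           \
        toy_irq_state = 0;                 \
        (lock)->held = 1;                  \
    } while (0)

#define toy_unlock_irqrestore(lock, flags) \
    do {                                   \
        (lock)->held = 0;                  \
        toy_irq_state = (flags);           \
    } while (0)

static unsigned long demo_saved_flags(void)
{
    struct toy_lock lock = { 0 };
    unsigned long flags;              /* looks uninitialized ... */

    toy_lock_irqsave(&lock, flags);   /* ... but the macro writes it here */
    unsigned long saved = flags;
    toy_unlock_irqrestore(&lock, flags);

    return saved;
}
```

A real function could not do this without taking `&flags`, which is why the call site reads as if an uninitialized value were being passed.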

sidkshatriya

14h

Excellently explained writeup. Kudos on explaining the shockingly many kernel bugs in a way that is both (a) simple and (b) interesting.

TL;DR: the main issue arises because the context switch and sampling events both need to be written to the same `ringBuffer` eBPF map. The lock for the sampling event has to be taken in an NMI, which is by definition non-maskable. As explained, this leads to lock contention, recursive locking, etc. when the context switch handler tries to do the same thing.

Why not have context switches write to ringBuffer1 and sampling events write to ringBuffer2 (i.e. use different ringBuffers). This way buggy kernels should work properly too !?

rovarma

12h

> Why not have context switches write to ringBuffer1 and sampling events write to ringBuffer2 (i.e. use different ringBuffers)

That would work, but at the cost of doubling memory usage, since you then have two fixed-size ring buffers instead of one. Also, in our particular case, the correct ordering of events is important, which is ~automatic with a single ring buffer, but gets much trickier with two.

> This way buggy kernels should work properly too !?

We have a workaround for older/buggy kernels in place. We simply guard against same-CPU recursion by maintaining per-CPU state that indicates whether a given CPU is currently in the process of adding data to the ring buffer. If that state is set, we discard events, which prevents the recursion too.
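
For readers who want to picture the guard, here is a rough user-space simulation in C. It is not Superluminal's code: the fixed `NUM_CPUS`, the plain arrays, and the `nested_hook` callback that fakes an NMI are all assumptions made for the sketch. In an actual eBPF program the flag would presumably live in per-CPU storage such as a per-CPU array map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_CPUS 4          /* assumption: fixed CPU count for the demo */
#define RING_CAPACITY 16

static bool in_ring_write[NUM_CPUS]; /* per-CPU "currently writing" flag */
static int ring[RING_CAPACITY];      /* stand-in for the shared ring buffer */
static size_t ring_len;
static size_t dropped;

/* Optional hook that simulates an NMI arriving in the middle of a write. */
typedef void (*nested_hook)(int cpu);

static bool emit_event(int cpu, int event, nested_hook hook)
{
    if (in_ring_write[cpu]) {        /* same-CPU recursion: discard event */
        dropped++;
        return false;
    }

    in_ring_write[cpu] = true;
    if (hook)
        hook(cpu);                   /* "NMI" fires while the write is in flight */

    bool ok = ring_len < RING_CAPACITY;
    if (ok)
        ring[ring_len++] = event;

    in_ring_write[cpu] = false;
    return ok;
}

/* Simulated sampling handler: tries to emit on the CPU it interrupted. */
static void sampling_nmi(int cpu)
{
    emit_event(cpu, 2, NULL);
}
```

Calling `emit_event(0, 1, sampling_nmi)` stores event 1 and drops the nested event 2, which is the trade-off described above: a lost sample instead of a recursive lock acquire.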

stupefy

10h

It is a fantastic write up

Reed10119039

[flagged]

bubblerme

12h

[flagged]

peyton

10h

Please don’t generate gobbledegook. Or at least try harder at it. How about a fun persona. Maybe a bit of a backstory.

jacquesm

Or better yet, just fuck off.

Boulos00191

observability is underrated. you can't fix what you can't see

jamesvzb

11h

kubernetes makes this 10x more complicated than it needs to be

hanikesn

How is this related to kubernetes?

nickmonad

Spam bots.
