Traffic sniffing with eBPF (Part II)

In the previous part, Noveo Senior Developer Kirill talked about how he built a prototype for traffic sniffing in the Linux kernel. We left off with a working prototype capable of intercepting packets and sending them to a remote server. Now it's time to tell you how he improved the solution, what problems came up, and how he eventually solved them. And yes, spoiler alert — it all worked out!

In this part, we'll cover the following questions:

  • Why manually fragment packets?
  • How to store large chunks of data with a small stack?
  • What's going on inside the eBPF code verifier's head, and how do you get it to accept your code?

Packet Fragmentation

When transmitting traffic over a network, there is a limitation on the physical interface regarding the packet length, known as the MTU. This is done to improve data transmission efficiency. A packet that exceeds the MTU (usually 1514 bytes including the L2 header) will either be fragmented by the operating system or dropped. If, at the eBPF program level, we attempt to place a larger packet into the channel, it will simply be dropped and never reach its destination. In our eBPF program, we are adding additional headers for tunneling — that is, we are increasing the packet size. And if a packet was already barely fitting within the MTU, after our expansion it will no longer pass through and will be dropped. Which means we can no longer claim that our sniffer is intercepting every single packet. We need to fix this!

How to Preserve a Packet Before Fragmentation

The problem of having to manually fragment packets is compounded by the fact that the list of target environments for deploying our sniffer includes Kubernetes clusters, where the network, due to its virtual nature, can forward packets of any size — up to 65535 bytes. Here we encounter the first serious limitation of an eBPF program: the stack size is limited to 512 bytes. This means we cannot create local storage even for an ordinary packet, let alone the largest possible one. Fortunately, there are options: create a static byte array or use yet another map. I chose the second option, because the contents of a map can be inspected from userspace, which significantly simplifies debugging.


Dev Humor Corner 1


So, we create a separate entry in the map for each interface from which we capture traffic, and voilà — our cache is ready. I also want to note that all operations in our eBPF program are executed on a single processor, meaning we don't have to worry about parallelization. eBPF itself does provide some capabilities for working with multiple processors, but I personally haven't touched on that, so I'll just mention that such an option exists.
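The verifier log later in the article shows a map named `ifaces_cache` with a 4-byte key and a 65535-byte value, so the cache can be sketched roughly like this (libbpf-style BTF map definition; `max_entries` and the struct name are my assumptions, not taken from the article's code):

```c
/* Sketch of the per-interface packet cache, assuming libbpf BTF-defined
 * maps. Key size (4) and value size (65535) match the verifier log shown
 * later; max_entries is an illustrative guess. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_PACKET_SIZE 65535

struct packet_cache {
    __u8 data[MAX_PACKET_SIZE];
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 64);            /* one slot per monitored interface */
    __type(key, __u32);                 /* interface index (ifindex) */
    __type(value, struct packet_cache);
} ifaces_cache SEC(".maps");
```

Because the map lives in kernel memory rather than on the 512-byte program stack, its contents can also be dumped from userspace (e.g. via `bpftool map dump`), which is exactly the debugging convenience mentioned above.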

Working with the eBPF Code Verifier

And now we've arrived at what is arguably one of the most challenging aspects of eBPF program development — tailoring your code to a form that the verifier will actually allow into the kernel. If you think you'll move straight to testing your program's logic after a successful compilation, you may be in for a rude awakening. Standing at the kernel's doorstep is the strictest of guardians, one that will dissect your program into every conceivable and inconceivable scenario just to make sure your code won't cause any trouble. Even minor changes — like switching a variable type from uint32_t to int — can lead to hours of wrestling with this stern overseer of your code. Let's go over a few typical problems and how to solve them.

The for Loop

One of the most fundamental constructs in any programming language is the loop. Yet even something as simple as this was considered a luxury for developers of early‑version eBPF programs — loops wouldn't even compile unless you added a special directive:

#pragma unroll

which unrolls the loop into a sequence of instructions. Nowadays things have become somewhat easier, and we can now write:

int i, max = 10;
for (i = 0; i < max; i++) { do_job(i); }

and actually get our loop. However, this seemingly trivial snippet harbours two whole potential problems (in truth I've greatly simplified it, so deliberately triggering them won't be that straightforward):

  1. There is a chance the verifier will reject your code and output something like:
The sequence of 8193 jumps is too complex.
processed 191810 insns (limit 1000000) max_states_per_insn 4 total_states 1730 peak_states 1730 mark_read 11
-- END PROG LOAD LOG --

Let's start figuring out what it's trying to tell us. Too many instructions—the verifier can't process them all because they've exceeded its limits. It turns out there's a bound on the complexity of eBPF programs. The verifier must traverse every possible execution path of the code and make sure that none of them will cause trouble. All in all, that's an understandable requirement for loading code into the OS kernel. It's no surprise that there are certain limits so the verifier can handle the program in a reasonable amount of time. But why did our perfectly ordinary loop turn out to be too complex for the verifier? We only want to do 10 iterations, and while our do_job() isn't a one‑liner, it's clearly not some enormous piece of code!

Let's dig in. Of course, I already know the correct answer, but I'll give you a couple of hints first. The verifier will happily accept the following code:

do_job(0);
do_job(1);
...
do_job(9);

Moreover, it won't see any problems with this version either:

char i;
const int max = 10;
for (i = 0; i < max; i++) { do_job(i); }

For those who haven't figured it out — I'll explain now. As I said, the verifier examines every possible execution path of the code in search of problems.

Dev Humor Corner 2


And from its perspective, the declaration int max = 10; is sufficient reason to, every time max is used in the code, check the variable against all possible values (i.e., from -2,147,483,648 to 2,147,483,647). Of course, this doesn't mean the verifier is completely stupid — you won't break it with a couple of ints — but it will still check the extreme values, even if there's not even a hint in the code that max ever changes. In other words, the verifier takes the word “variable” literally. If it could theoretically change (even though nobody touches it in the code), it must be checked.

Hence, my recommendations for tackling this problem:

  • Make everything you can constant. Write const int max = 10, and the verifier will be certain that nothing other than 10 will ever be in max. Also, don't hesitate to use #define MAX 10 — that works well too.
  • If we really need a variable, reduce its type size to the minimum. In the example above, we only need 10 iterations, so for i a one-byte char (range -128 to 127) is perfectly sufficient.
  • It might happen that we don't know the value of max at compile time either. In that case, I recommend defining the maximum possible number of iterations based on your code's logic and writing something like:
char i;
uint16_t max = N; // some variable number

const int REAL_MAX = 15;
for (i = 0; i < max; i++)
{
    if (i >= REAL_MAX)
        break;
    do_job(i);
}

This way we can convey to the verifier that, despite the variable nature of max, we will limit ourselves to a constant number of iterations.

2. The second problem is related to code extensibility.

It's a perfectly normal situation when we want to change the number of iterations and do something like this (taking into account the recommendations from the previous section):

char i;
const int max = 100;
for (i = 0; i < max; i++) { do_job(i); }

We only increased the number of iterations, and the verifier throws this at us again:

The sequence of 8193 jumps is too complex.
processed 262333 insns (limit 1000000) max_states_per_insn 4 total_states 1931 peak_states 1931 mark_read 14
-- END PROG LOAD LOG --

Of course, we won't run into this every time, but I did. Once again, complexity issues — and this time we can't reduce it using the methods we've already covered. Well, we just need to do these 100 iterations, period! If your Linux kernel version is lower than 5.17, then you'll have to come up with something on your own. If the version is fine, however, you can use the bpf_loop function, which takes a callback, a context, and the number of passes. Congratulations — now you have a loop with 8 million iterations at your disposal! We don't need that many, but it's nice to know such an option exists. Let's rewrite our loop like this:

long do_job(u64 index, void *ctx)
{
    // returning 1 from the callback breaks out of the loop early
    ...
    return 0;
}
...
const int max = 100;
void *callback_ctx = NULL;
bpf_loop(max, do_job, callback_ctx, 0);

Now we know how to write loops that the verifier will accept. Moving on!

Handling a Large Map

So, I decided to store the original packet before fragmentation in a large map. The maximum network packet size is 65535 bytes, so without much thought, I created a map with the interface index as the key and a 65535‑byte array as the value. The fragmentation algorithm itself isn’t particularly interesting — we use the already familiar bpf_skb_adjust_room for IP packets and the slow bpf_skb_change_tail for the rest, reducing the packet until it fits into the required MTU. Then we simply walk through the packet stored in the map, take chunk after chunk, and place them into our little packet, not forgetting to attach the necessary headers. Here, by the way, it's worth mentioning that we cannot work directly with the memory of either the map or the packet — we might break something. Therefore, we must use bpf_skb_load_bytes for reading and bpf_skb_store_bytes for writing.

Although the described algorithm requires careful calculations of exactly where and what you're copying (don’t forget about the need to align data in the IP packet during fragmentation), it is conceptually clear and doesn’t raise the question “how do I make this work?”. However, I'll now show a simplified example of fairly trivial code that ate up quite a lot of my time.

// Check that the header and data size will fit into the packet

if (frag_payload_size + frag_header_size <= reduced_packet_len)
{
    // calculate where the end of the fragment in the map will be
    // on the current iteration
    __u16 end = frag_payload_size * (i + 1);    

    // perform checks that we are not trying to write 0 bytes (forbidden)
    // and that we won't go out of bounds of the map array
    if(frag_payload_size > 0 && (end - frag_payload_size) >= 0)
    {
        // write to the packet at an offset equal to the header size
        // (because we're writing data), the data is taken from packet_map
        // at a previously verified offset. The size of the data being read
        // has also been verified beforehand, i.e., we are guaranteed not to
        // go out of the [0; end] range
        if (bpf_skb_store_bytes(ctx, frag_header_size, &packet_map[end - frag_payload_size], frag_payload_size, BPF_F_INVALIDATE_HASH) < 0)
            goto drop;
    }
}

I've more or less explained in the comments what the code does and why I consider it working. Yet, when trying to load it into the kernel, the verifier complains:

if (bpf_skb_store_bytes(ctx, frag_header_size, &packet_map[end - frag_payload_size], frag_payload_size, BPF_F_INVALIDATE_HASH) < 0)
731: (bf) r1 = r6                     ; R1_w=ctx() R6=ctx()
732: (b7) r2 = 34                     ; R2_w=34
733: (b7) r5 = 2                      ; R5_w=2
734: (85) call bpf_skb_store_bytes#9
R4 invalid zero-sized read: u64=[0,1120]
processed 108955 insns (limit 1000000) max_states_per_insn 10 total_states 1714 peak_states 388 mark_read 74
-- END PROG LOAD LOG --

In reality, the error log is much larger and consists of pseudo‑assembly, which is quite fascinating to decipher. Nevertheless, the main piece of information can be extracted from it — somewhere the verifier found a scenario where the value of frag_payload_size became zero. I managed to overcome this by adding the following code:

__u16 frag_payload_size_local = FRAG_PAYLOAD_SIZE_MIN;
if (frag_payload_size > frag_payload_size_local)
    frag_payload_size_local = frag_payload_size;

and replacing frag_payload_size with frag_payload_size_local in the previous snippet. Now the verifier knows that frag_payload_size_local will definitely never be zero.

Dev Humor Corner 3


However, this isn't the last problem. We convinced the verifier that we would read a non-zero value, yet when we try to load the code, we see a new error:

; if(frag_payload_size_local > 0 && (end - frag_payload_size_local) >= 0)
720: (6d) if r8 s> r1 goto pc+13      ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=0xf5df,var_off=(0x0; 0xfff8)) R8_w=0
721: (67) r1 <<= 32                   ; R1_w=scalar(smin=smin32=0,smax=umax=0xf5df00000000,smax32=umax32=0,var_off=(0x0; 0xfff800000000))
722: (77) r1 >>= 32                   ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=0xf5df,var_off=(0x0; 0xfff8))
; if (bpf_skb_store_bytes(ctx, frag_header_size, &interface_cache->data[end - frag_payload_size_local], frag_payload_size_local, BPF_F_INVALIDATE_HASH) < 0)
723: (79) r3 = *(u64 *)(r10 -96)      ; R3_w=map_value(map=ifaces_cache,ks=4,vs=65535) R10=fp0 fp-96=map_value(map=ifaces_cache,ks=4,vs=65535)
724: (0f) r3 += r1                    ; R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=0xf5df,var_off=(0x0; 0xfff8)) R3_w=map_value(map=ifaces_cache,ks=4,vs=65535,smin=smin32=0,smax=umax=smax32=umax32=0xf5df,var_off=(0x0; 0xfff8))
; if (bpf_skb_store_bytes(ctx, frag_header_size, &interface_cache->data[end - frag_payload_size_local], frag_payload_size_local, BPF_F_INVALIDATE_HASH) < 0)
725: (bf) r1 = r6                     ; R1_w=ctx() R6=ctx()
726: (b7) r2 = 34                     ; R2_w=34
727: (bf) r4 = r7                     ; R4_w=scalar(id=2559,smin=umin=smin32=umin32=1121,smax=umax=smax32=umax32=9152,var_off=(0x0; 0x3ff8)) R7=scalar(id=2559,smin=umin=smin32=umin32=1121,smax=umax=smax32=umax32=9152,var_off=(0x0; 0x3ff8))
728: (b7) r5 = 2                      ; R5_w=2
729: (85) call bpf_skb_store_bytes#9
invalid access to map value, value_size=65535 off=62943 size=9152
R3 max value is outside of the allowed memory range
processed 72877 insns (limit 1000000) max_states_per_insn 10 total_states 768 peak_states 356 mark_read 74
-- END PROG LOAD LOG --

I've provided a slightly larger chunk of the log to show how to extract information from it. Here's an example of a step-by-step breakdown:

1. invalid access to map value, R3 max value is outside of the allowed memory range — from these lines we understand what the problem is, namely — reading beyond the bounds of the memory region.

2. Look at where this occurred. We see bpf_skb_store_bytes, its arguments, and pinpoint the exact location in the code.

3. Figure out exactly how far we're going out of bounds. value_size=65535 off=62943 size=9152, i.e., the data size in the map is 65535, the offset (the value of end - frag_payload_size_local) is 62943, and the size of data being read (frag_payload_size_local) is 9152.

4. Try to understand where the verifier got these numbers. 65535 we set ourselves, that's clear. Now 62943 is 0xF5DF, and this number appears right here:
R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=0xf5df,var_off=(0x0; 0xfff8))
and it is derived from the condition if(frag_payload_size_local > 0 && (end - frag_payload_size_local) >= 0). In fact, the log contains many more lines through which you can trace all the chains of assumptions from which the verifier deduces such numbers. It deduces the number 9152 in exactly the same way.

5. Analyze your code, calculate the possible values, and realize that overall the verifier is correct — these expressions truly do have exactly these extreme values, however they simply cannot take on those values simultaneously. Here's a simplified example:

int a, b, c;
c = 100;
a = random(0, 100);
b = c - a;

We understand that a and b can take values from 0 to 100, but they simply cannot both be 100 at the same time. Yet for the verifier, this is far from obvious. It stores the values for variables and computed expressions in different registers, assigning them bounds like smin=smin32=0, smax=umax=smax32=umax32=0xf5df. And subsequently, it operates on these exact values, calculating new bounds for expressions. Incidentally, this is precisely the moment when I encountered what felt like if statements not working. Let's take the previous example and add to it:

if (c - a > 50)
{ use_somehow(c - b); }

Based on the preceding code, we know that if the condition is true, then c - b will be less than 50. Yet the verifier does not see it that way. It stores exactly c - a, and if you write c - a - 1, that's already something entirely new as far as it's concerned.

So, the problem is clear, but how do we deal with it? There is no universal answer — I can only recommend creating as many conditions as possible that constrain the range of possible values. In the code I was using to work with the map, I failed to find a set of conditions that would satisfy the verifier, so I simply increased the size of the array in the map. I added 62943 and 9152 and set the resulting number as the new size. This completely satisfied the verifier, and I ended up with a few extra kilobytes allocated for the map. Not an ideal solution, but a perfectly acceptable one.

In Conclusion

I had to solve the problems described not just once or twice during my work on this project. Besides creating the first version, there were also new features that turned previously constant parameters into variables, which would kick off another round of wrestling with the verifier. In the end, I can say that developing complex programs in eBPF is no simple task and is not always worth the effort. Any change can lead to hours or even days of debugging, and these time costs are extremely difficult to predict. So, wherever possible, complex operations should be moved to userspace. Still, I'm glad I got this experience.

In the next part, I'll talk about a couple more tricks for working with eBPF, but this time from an architectural perspective, and I'll also discuss the final solution and its specifics.