From Junior to Genius: an optimization story

Refactoring a checksum until it runs 100 times faster

Israel Lot
ITNEXT

Prepare to be challenged

Recently, I came across a hash method sitting in a hot code path that had been refactored several times over the years by different developers — an interesting Git history to read. Each developer brought their own set of tools and knowledge, applied different techniques and chose different compromises.

Observing a method evolve over time is a good way of realizing how wide the performance spectrum can be, even for simple routines. That’s especially the case when dealing with high-level languages — with lots of moving parts between source and compiled code, opportunities to cut execution time are everywhere; sometimes they are obvious, sometimes not so much.

The challenge

We are going to take a simple checksum method and hand it to six different developers, named Junior, Pro, Senior, Hacker, Expert and Genius. They’ll exhaust their abilities trying to optimize it and we, hopefully, will learn something from each one of them.

Here are the constraints:

  • C# / dotnet 6. Having a JIT compiler makes it more interesting.
  • Single threaded. We are not going to explore parallelization.
  • x86. ARM, WASM and other targets will be left out.

Without further ado, here’s what I call the Junior checksum:
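
The code was originally embedded as a Gist, so here is a minimal sketch of the idea as described below (the full version is in the repository linked at the end; names and the exact reduction may differ):

// Sketch only; the real Gist lives in the linked repository.
public static uint ChecksumJunior(ReadOnlySpan<byte> arr)
{
    uint sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;

    // accumulate one sum per byte column (0, 1, 2, 3)
    for (int i = 0; i < arr.Length; i++)
    {
        switch (i % 4)
        {
            case 0: sum0 += arr[i]; break;
            case 1: sum1 += arr[i]; break;
            case 2: sum2 += arr[i]; break;
            case 3: sum3 += arr[i]; break;
        }
    }

    // combine the column sums into a single big-endian uint
    return (sum0 << 24) + (sum1 << 16) + (sum2 << 8) + sum3;
}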

The method is pretty straightforward, accumulating 4 column sums and reducing them to a single uint using the LSB from each accumulator. Given the requirements, this is what I would expect an educated junior developer to deliver.

I know what you’re thinking: Junior just copied this from StackOverflow. If you’re a junior, I’m telling you: that’s okay. Everybody does that to some extent — just be sure you understand what you’re copying.

Here comes the Pro

So now our little checksum gets handed to Pro. Pro has been around a bit and learned a few tricks, but still can’t be called a senior.

The first thing Pro does is to benchmark Junior’s code to establish a baseline.

The benchmark uses a 1MB buffer filled with random data.

Here are the results:

|   Method |  Length |     Mean |
|--------- |-------- |---------:|
| Junior | 1000000 | 1.799 ms |

Are these numbers good or bad? 1.8 ms for 1 MB of data seems pretty fast, but the task is clear: make it even faster. After scratching their head for a while, Pro decides to try loop unrolling.

Loop unrolling is a technique that trades code size for performance. The idea is to do more work per loop iteration. So instead of:

for (int x = 0; x < 100; x++)
{
    work(x);
}

one would write:

for (int x = 0; x < 100; x += 5)
{
    work(x);
    work(x + 1);
    work(x + 2);
    work(x + 3);
    work(x + 4);
}

The main reasoning behind it is that every iteration involves a comparison ( x < 100 ), an increment ( x++ ), and a jump. By repeating the statements, the compiled code performs fewer of these bookkeeping operations, since the jump and the comparison now happen only once for every 5 work() calls.

Most compilers have loop unrolling among their optimization tools. Usually, if a loop has a fixed, low iteration count, you can see the unrolling take place in the generated assembly.

Pro’s version of the checksum:
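
Since the Gist isn’t reproduced here, the following is a sketch of what the unrolled version plausibly looks like (names and tail handling are approximations, not the exact code):

public static uint ChecksumPro(ReadOnlySpan<byte> arr)
{
    uint sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    int z = 0;

    // unrolled loop: 16 bytes (4 bytes per column) per iteration
    while (z + 16 <= arr.Length)
    {
        sum0 += (uint)(arr[z + 0] + arr[z + 4] + arr[z + 8] + arr[z + 12]);
        sum1 += (uint)(arr[z + 1] + arr[z + 5] + arr[z + 9] + arr[z + 13]);
        sum2 += (uint)(arr[z + 2] + arr[z + 6] + arr[z + 10] + arr[z + 14]);
        sum3 += (uint)(arr[z + 3] + arr[z + 7] + arr[z + 11] + arr[z + 15]);
        z += 16;
    }

    // a couple of 4-byte iterations for what's left
    while (z + 4 <= arr.Length)
    {
        sum0 += arr[z + 0];
        sum1 += arr[z + 1];
        sum2 += arr[z + 2];
        sum3 += arr[z + 3];
        z += 4;
    }

    // and finally the last 3, 2 or 1 bytes
    if (z < arr.Length) sum0 += arr[z];
    if (z + 1 < arr.Length) sum1 += arr[z + 1];
    if (z + 2 < arr.Length) sum2 += arr[z + 2];

    return (sum0 << 24) + (sum1 << 16) + (sum2 << 8) + sum3;
}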

While there are at least 16 bytes left, this version processes 16 bytes per iteration. After that it does a couple of 4-byte iterations and finally adds the last 3, 2 or 1 remaining bytes. That’s way more lines of code, but was it worth it?

|   Method |  Length |       Mean |     Median | Ratio |
|--------- |-------- |-----------:|-----------:|------:|
| Junior | 1000000 | 1,855.5 us | 1,825.4 us | 1.00 |
| Pro | 1000000 | 373.8 us | 373.8 us | 0.20 |

The comparison benchmark screams yes. It’s 5 times faster than Junior’s.

Do years on the road make a difference?

We hand the task now to Senior, with the same simple mandate: make it faster.

Senior has been around for quite a while, actually for so long that when they first started programming, C# was a joke. Senior wrote a lot of code in C, knows pointers and understands assembly code.

Notice that some of the Gists only display the most relevant loop of the solution. At the end of this article there’s a link to a GitHub repository you can fork and play with.

The updated benchmark, a 1.6x improvement over Pro’s and 7.8x over Junior’s:

|   Method |  Length |       Mean | Ratio |
|--------- |-------- |-----------:|------:|
| Junior | 1000000 | 1,803.0 us | 1.00 |
| Pro | 1000000 | 365.2 us | 0.20 |
| Senior | 1000000 | 231.6 us | 0.13 |
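
Senior’s version, again as a sketch along the lines described below (the exact Gist is in the repository):

public unsafe static uint ChecksumSenior(ReadOnlySpan<byte> arr)
{
    uint sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;

    fixed (byte* ptr = arr)
    {
        uint z = 0;
        uint len = (uint)arr.Length;

        // main loop: 16 bytes per iteration, indexing the raw pointer
        while (z + 16 <= len)
        {
            sum0 += (uint)(ptr[z + 0] + ptr[z + 4] + ptr[z + 8] + ptr[z + 12]);
            sum1 += (uint)(ptr[z + 1] + ptr[z + 5] + ptr[z + 9] + ptr[z + 13]);
            sum2 += (uint)(ptr[z + 2] + ptr[z + 6] + ptr[z + 10] + ptr[z + 14]);
            sum3 += (uint)(ptr[z + 3] + ptr[z + 7] + ptr[z + 11] + ptr[z + 15]);
            z += 16;
        }

        // tail: remaining bytes, one at a time
        for (; z < len; z++)
        {
            switch (z & 3)
            {
                case 0: sum0 += ptr[z]; break;
                case 1: sum1 += ptr[z]; break;
                case 2: sum2 += ptr[z]; break;
                case 3: sum3 += ptr[z]; break;
            }
        }
    }

    return (sum0 << 24) + (sum1 << 16) + (sum2 << 8) + sum3;
}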

The code looks a lot like Pro’s, so let’s break down the changes and understand what’s going on.

1. Marking the method as unsafe

public static uint ChecksumPro(ReadOnlySpan<byte> arr)
public unsafe static uint ChecksumSenior(ReadOnlySpan<byte> arr)

The first edit is adding the unsafe keyword to the signature. This allows us, for instance, to use pointers inside the method. In the C# world, it’s a statement the developer makes that translates to “I know what I’m doing, compiler, be cool.”

2. Actually acquiring a pointer

fixed (byte* ptr = arr)

The most common way to use a pointer in C# is inside a fixed scope. In the C# documentation’s own words:

The fixed statement prevents the garbage collector from relocating a moveable variable and declares a pointer to that variable.

Basically, it ensures the pointer we get remains valid and the CLR won’t move the underlying data around. This kind of concern only exists because we are operating in a managed memory environment. The CLR adds an abstraction layer between what we think of as variables and the actual memory addresses at which they are located. (To be fair, the OS adds another abstraction layer by offering processes virtual memory, but that’s out of scope here.) Every abstraction invariably introduces overhead; that’s the price developers pay for the convenience of a garbage collector. To use pointers in C# is to temporarily bypass that abstraction.

From here on the code looks almost exactly the same, but instead of indexing an array, it indexes the pointer:

arr[z + 0] //indexed array 
ptr[z + 0] //indexed pointer

3. Keeping an eye on ASM

At the end of the day, C# code will be translated to machine code for execution. When you reach this level of detail, every instruction removed from the execution path counts. Take these two methods for instance:
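
The Gist isn’t reproduced here, but the pair is presumably along these lines: identical loops that differ only in the type used to index the pointer (hypothetical helpers, not the exact code):

static unsafe uint SumIntIndex(byte* ptr, int length)
{
    uint sum = 0;
    for (int z = 0; z < length; z++)
        sum += ptr[z];   // the int index gets sign-extended (movsxd) before the load
    return sum;
}

static unsafe uint SumUIntIndex(byte* ptr, uint length)
{
    uint sum = 0;
    for (uint z = 0; z < length; z++)
        sum += ptr[z];   // no sign extension needed: one instruction less per access
    return sum;
}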

The version using unsigned integers has one instruction fewer. Because the first method uses an int index, the compiler adds a movsxd before every memory access (movzx).

movsxd is Move with Sign-Extension

Basically, it sign-extends the 32-bit index to 64 bits so it can be used to address memory through the pointer. You may think one extra instruction does no harm, but if it sits in a tight loop, 6 instructions versus 7 can translate to a 15% difference in performance.

I know what some will jump to say next: you need to consider instruction reciprocal throughput, latency and pipelining to really back the statement above. True, I’m oversimplifying to make a point.

It’s also worth mentioning that, although we are displaying x86 ASM here, what actually happens is that the C# compiler outputs IL code, which is then compiled to machine code by the JIT compiler depending on the platform. Statements made here might not be valid for the ARM instruction set, for example.

Besides that, there’s no guarantee the JIT will always output the same ASM instructions. The JIT is an engineering feat, full of heuristics and in constant evolution. Code we write today might compile to a different set of instructions tomorrow because the JIT got updated with new performance tricks. That’s the bright side of working on top of a framework with a just-in-time compiler — our code might run faster without us actually shipping updates. On the downside, it makes performance less predictable.

OK, but what really changed between Pro’s and Senior’s versions? Let’s look at a section of the decompiled ASM from both:

It’s easy to notice a pattern, even without knowing exactly what each ASM instruction is doing:

lea      ebx, [r11+0CH]
cmp      ebx, ecx
jae      G_M000_IG17
mov      ebx, ebx
movzx    rbx, byte ptr [rdx+rbx]
add      eax, ebx

versus:

lea      ebx, [r11+08H]
movzx    rbx, byte ptr [rdx+rbx]
add      eax, ebx

Let’s break these lines down:

lea      ebx, [r11+0CH]            //computes the index r11 + 0x0C into ebx
cmp      ebx, ecx                  //compares the index against ecx (the length)
jae      G_M000_IG17               //jumps to G_M000_IG17 if the index is
                                   //above or equal to the length (unsigned)
mov      ebx, ebx                  //zero-extends the 32-bit index into rbx
movzx    rbx, byte ptr [rdx+rbx]   //dereferences the pointer: loads the byte
                                   //at rdx+rbx, zero-extended
add      eax, ebx                  //adds it to the accumulator

What is this jump doing here?

If we check what’s at label G_M000_IG17, we find:

G_M000_IG17:                ;; offset=0445H
E8E675C15F call CORINFO_HELP_RNGCHKFAIL
CC int3

Alright — it seems it’s a range check failure. Now it becomes clear: the extra instructions are the ones responsible for throwing an exception if the code tries to access an out-of-bounds position of an array. This is what makes the snippet below throw an exception instead of reading undefined data at arr[2]:

var arr = new byte[1] { 0 };
var invalid = arr[2]; // IndexOutOfRangeException thrown

What happens if bounds are not checked and invalid addresses are accessed? All sorts of nasty things, SEGFAULTs and the like included. Not pretty. These checks are C# putting training wheels on your bicycle.

Senior’s code, by using the unsafe keyword and pointers, avoids these bounds checks in the tight loop. As a result, most of the read pattern ends up using 3 instructions instead of 6.

The JIT compiler is able to remove bounds checks if it can prove the index will never go out of the bounds of the array. Pro made a mistake that, if fixed, would let the JIT compiler remove the bounds checks and bring Pro’s performance closer to Senior’s without using pointers. I encourage you to spot it.

Some would argue that the most important aspect is actually removing all those jumps, since a jump means branching, and branching means the CPU has to predict which path the branch will take and speculatively execute it. I get it, but for this exercise, let’s just consider fewer instructions to mean less work, for the sake of simplicity.

Who needs a hacker?

Senior’s solution is great. It takes outside-the-box thinking to move beyond that. We hand it over to Hacker. The task is still the same: make it faster.

We get this:

|   Method |  Length |       Mean | Ratio |
|--------- |-------- |-----------:|------:|
| Junior | 1000000 | 1,761.0 us | 1.00 |
| Pro | 1000000 | 363.6 us | 0.21 |
| Senior | 1000000 | 224.6 us | 0.13 |
| Hacker | 1000000 | 120.0 us | 0.07 |

A 2x improvement over Senior’s with what may look like garbage to the untrained eye. Some of you readers, however, might have figured out the trick already. Did you? Leave a comment and tell me :)

In the context of our challenge here, the definition of a hacker is:

A person who will find and use tools out of their original purpose in order to achieve a particular goal.

Often when you read the source code of really performance-sensitive projects, like the Linux kernel, you come across sections that look like sorcery spells written in ASCII characters. A great exercise is to dig into those bits and break down the techniques used, so they become building blocks you can later compose into clever solutions to problems of your own.

The trick here is repurposing a ulong as a 4-wide short vector. Let’s break the relevant section down:

ulong l1 = *(ulong*)(ptr + z);
//1. a = (ptr + z) adds offset z to pointer ptr (ptr is byte*)
//2. b = (ulong*)a casts that pointer to a ulong*
//3. *b dereferences the pointer and reads the ulong value

So this first little trick reads 8 bytes into a ulong. Assuming we are on an x64 machine, this is a single memory fetch. The same is then done three more times:

ulong l1 = *(ulong*)(ptr + z);
ulong l2 = *(ulong*)(ptr + z + 8);
ulong l3 = *(ulong*)(ptr + z + 16);
ulong l4 = *(ulong*)(ptr + z + 24);

At this point we have effectively read 4 ulongs, or 32 bytes of data, at the cost of 4 instructions versus the 32 needed to read byte by byte.

l1 & 0x00FF00FF00FF00FF;

The next trick is applying a mask to the first ulong. If we annotate the bytes of l1 on a little-endian system, it looks like:

B7_B6_B5_B4_B3_B2_B1_B0

And applying that mask we get:

  B7_B6_B5_B4_B3_B2_B1_B0 & 0x00_FF_00_FF_00_FF_00_FF
= 00_B6_00_B4_00_B2_00_B0

Now what happens if we were to look at this masked long two bytes at a time?

0x00_B6 0x00_B4 0x00_B2 0x00_B0

It’s B0, B2, B4 and B6, each widened from a 1-byte word to a 2-byte word. We have transformed each byte into a short.

(l1 & 0xFF_00_FF_00_FF_00_FF_00) >> 8

This does the same for the odd bytes, shifting right by one byte to get them into position. Doing that for all 4 ulongs we read, we now have:

M1= 00_B6__00_B4__00_B2__00_B0
M2= 00_B7__00_B5__00_B3__00_B1
M3= 00_B14_00_B12_00_B10_00_B8
M4= 00_B15_00_B13_00_B11_00_B9
M5= 00_B22_00_B20_00_B18_00_B16
M6= 00_B23_00_B21_00_B19_00_B17
M7= 00_B30_00_B28_00_B26_00_B24
M8= 00_B31_00_B29_00_B27_00_B25

What happens if we sum M1, M3, M5 and M7? Assuming none of the individual 16-bit lanes overflow:

  00_B6__00_B4__00_B2__00_B0
+ 00_B14_00_B12_00_B10_00_B8
+ 00_B22_00_B20_00_B18_00_B16
+ 00_B30_00_B28_00_B26_00_B24
= [B6+B14+B22+B30]_[B4+B12+B20+B28]_[B2+B10+B18+B26]_[B0+B8+B16+B24]

Let’s call the short view of those S1, S2, S3 and S4:

S1 = [B6+B14+B22+B30] = ((M1+M3+M5+M7) & 0xFF_FF_00_00_00_00_00_00) >> 48
S2 = [B4+B12+B20+B28] = ((M1+M3+M5+M7) & 0x00_00_FF_FF_00_00_00_00) >> 32
S3 = [B2+B10+B18+B26] = ((M1+M3+M5+M7) & 0x00_00_00_00_FF_FF_00_00) >> 16
S4 = [B0+B8+B16+B24]  =  (M1+M3+M5+M7) & 0x00_00_00_00_00_00_FF_FF

Recalling the previous version, the accumulators were:

sum0 += (B0 + B4 + B8 + B12 + B16 + B20 + B24 + B28)
sum1 += (B1 + B5 + B9 + B13 + B17 + B21 + B25 + B29)
sum2 += (B2 + B6 + B10 + B14 + B18 + B22 + B26 + B30)
sum3 += (B3 + B7 + B11 + B15 + B19 + B23 + B27 + B31)

It’s clear that we can achieve the same accumulation using these bit manipulations.

sum0 = (S2 + S4)
sum2 = (S1 + S3)
...

But what if, instead, we keep summing these longs before merging the shorts into sum0, sum1 etc.?

while (...)
{
    tmp1 += M1 + M3 + M5 + M7;
    ...
}

We are now hacking a ulong, repurposing it as a 4-wide short vector. But there’s a catch: we can’t let the sum in any individual lane overflow the maximum value of an unsigned short, 0xFFFF.

We are adding 4 bytes to each short lane at each iteration, so we can do that safely 64 times before any lane reaches 0xFF00:

0xFF * 4 * 64 = 0xFF00

Past that point, any of the shorts could overflow and our hacked vector would be corrupted. So, once we complete 64 iterations, we extract each of the vector’s lanes, add them to the original sums, and reset the vector. This is exactly what this section is doing:

if (limit2 != 64)
{
    sum0 = ...
    sum1 = ...
    sum2 = ...
    sum3 = ...
}

Here’s what the ASM code looks like for the tight loop:

And here is an expanded and annotated version with intermediate variables that might be easier to follow:
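
Since the Gist isn’t reproduced here, the sketch below captures the same structure; variable names such as tmpEven and tmpOdd, and the exact loop limits, are assumptions of mine:

public static unsafe uint ChecksumHacker(ReadOnlySpan<byte> arr)
{
    uint sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;

    fixed (byte* ptr = arr)
    {
        int z = 0;
        int blocks = arr.Length / 32;          // number of 32-byte chunks

        while (blocks > 0)
        {
            // drain at most 64 chunks before a 16-bit lane could overflow
            int limit2 = Math.Min(blocks, 64);
            blocks -= limit2;

            ulong tmpEven = 0, tmpOdd = 0;     // each holds four 16-bit lanes

            for (int i = 0; i < limit2; i++, z += 32)
            {
                ulong l1 = *(ulong*)(ptr + z);
                ulong l2 = *(ulong*)(ptr + z + 8);
                ulong l3 = *(ulong*)(ptr + z + 16);
                ulong l4 = *(ulong*)(ptr + z + 24);

                // even-position bytes, widened to 16-bit lanes
                tmpEven += (l1 & 0x00FF00FF00FF00FFUL) + (l2 & 0x00FF00FF00FF00FFUL)
                         + (l3 & 0x00FF00FF00FF00FFUL) + (l4 & 0x00FF00FF00FF00FFUL);

                // odd-position bytes, shifted down into the same lanes
                tmpOdd += ((l1 & 0xFF00FF00FF00FF00UL) >> 8) + ((l2 & 0xFF00FF00FF00FF00UL) >> 8)
                        + ((l3 & 0xFF00FF00FF00FF00UL) >> 8) + ((l4 & 0xFF00FF00FF00FF00UL) >> 8);
            }

            // unpack the lanes and fold them into the four column sums
            sum0 += (uint)(tmpEven & 0xFFFF) + (uint)((tmpEven >> 32) & 0xFFFF);
            sum2 += (uint)((tmpEven >> 16) & 0xFFFF) + (uint)((tmpEven >> 48) & 0xFFFF);
            sum1 += (uint)(tmpOdd & 0xFFFF) + (uint)((tmpOdd >> 32) & 0xFFFF);
            sum3 += (uint)((tmpOdd >> 16) & 0xFFFF) + (uint)((tmpOdd >> 48) & 0xFFFF);
        }

        // scalar tail for the remaining 0-31 bytes
        for (; z < arr.Length; z++)
        {
            switch (z & 3)
            {
                case 0: sum0 += ptr[z]; break;
                case 1: sum1 += ptr[z]; break;
                case 2: sum2 += ptr[z]; break;
                case 3: sum3 += ptr[z]; break;
            }
        }
    }

    return (sum0 << 24) + (sum1 << 16) + (sum2 << 8) + sum3;
}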

Why is an expert called an expert?

One of the main characteristics I use to describe an expert is the ability to analyze problems from the fundamentals. You see, all our fellow developers bulldozed through the task, dumping their knowledge on top of their predecessors’ work. None of them questioned the reasoning — they just improved upon it using the tools they had.

Our Expert looks back to the first version of the method:

//Compute a 32-bit big-endian checksum on arr

The clue here is the fact that this is a big-endian checksum. This could be figured out by reading the code itself, but this comment line has been there all along, stating the fundamental point of the method.

As a recap, endianness refers to the order in which bytes appear in a word. For numbers, it determines whether the first byte is the most or the least significant (MSB or LSB).

(Byte-order diagram from Wikipedia.)

The majority of systems today are little-endian, but what would happen if we were actually on a big-endian system? The method’s core loop could be as simple as:

while (z < limit)
{
    sum += *(uint*)(ptr + z);
    z += 4;
}

This is exactly a 32-bit sum of the input array. But because we are on a little-endian machine, when we read 4 bytes their order is reversed compared with what the simple sum expects. All that is needed is to read a uint, reverse its bytes, and add.

Finally, here’s Expert’s version:
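
In sketch form (assuming BinaryPrimitives.ReverseEndianness from System.Buffers.Binary and a hand-rolled tail; the exact Gist is in the repository):

// requires: using System.Buffers.Binary;
public static unsafe uint ChecksumExpert(ReadOnlySpan<byte> arr)
{
    uint sum = 0;

    fixed (byte* ptr = arr)
    {
        int z = 0;

        // read 4 bytes at a time, swap to big-endian order, accumulate
        while (z + 4 <= arr.Length)
        {
            sum += BinaryPrimitives.ReverseEndianness(*(uint*)(ptr + z));
            z += 4;
        }

        // tail: place the last 0-3 bytes at their big-endian positions
        for (; z < arr.Length; z++)
            sum += (uint)ptr[z] << (24 - (z & 3) * 8);
    }

    return sum;
}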

|         Method |  Length |        Mean | Ratio |
|--------------- |-------- |------------:|------:|
| Junior | 1000000 | 1,785.46 us | 1.00 |
| Pro | 1000000 | 362.37 us | 0.20 |
| Senior | 1000000 | 224.11 us | 0.13 |
| Hacker | 1000000 | 120.26 us | 0.07 |
| Expert | 1000000 | 79.83 us | 0.04 |

Almost twice as fast as Hacker’s, with the added benefit of readable code.
This is the line that addresses the fundamental issue with all the other solutions:

sum += BinaryPrimitives.ReverseEndianness(*(uint*)(ptr + z));

It reverses the byte ordering of the 4 bytes read. This translates to a single ASM instruction:

movbe    r11d, dword ptr [rdx+r10+08H]

Here’s the definition of MOVBE: Move Data After Swapping Bytes. It reads and swaps bytes in one shot. Another instruction for this purpose is BSWAP, which operates in place on a register. Expert has shown us that we can operate across endianness without adding extra instructions.

Expert 2.0 with SIMD

SIMD stands for Single Instruction Multiple Data, and I know a lot of you readers were expecting to see a SIMD solution. The key point is that SIMD instructions allow us to operate on more bytes at a time. Our checksum seems like a perfect candidate for such vectorization.

Intel CPUs have gained SIMD extensions in several waves over the years. We are going to focus on AVX and AVX2, working with 128- and 256-bit-wide vectors respectively. AVX has been available on CPUs made from 2011 onwards, AVX2 from 2013 onwards.

.NET introduced hardware intrinsics back in .NET Core 3.0. You can read more about it here: Hardware Intrinsics in .NET Core — .NET Blog (microsoft.com). Currently it supports up to AVX2; AVX512 should eventually be available as well.

Let’s look at the AVX solution first:
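
The Gist itself is in the repository; the sketch below shows the shape of the solution. The 128-bit byte shuffle is exposed as Ssse3.Shuffle in System.Runtime.Intrinsics.X86 (the JIT emits the VEX form, vpshufb, when AVX is enabled); the mask layout and tail handling are assumptions of mine:

// requires: using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;
public static unsafe uint ChecksumExpertAvx(ReadOnlySpan<byte> arr)
{
    // mask that reverses the bytes inside each 4-byte block
    Vector128<byte> swapMask = Vector128.Create(
        (byte)3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12);

    Vector128<uint> acc = Vector128<uint>.Zero;
    uint sum = 0;

    fixed (byte* ptr = arr)
    {
        int z = 0;

        // 16 bytes per iteration: load, byte-swap each 4-byte lane, add as 4 uints
        while (z + 16 <= arr.Length)
        {
            Vector128<byte> block = Sse2.LoadVector128(ptr + z);
            Vector128<byte> swapped = Ssse3.Shuffle(block, swapMask);
            acc = Sse2.Add(acc, swapped.AsUInt32());
            z += 16;
        }

        // horizontal add of the accumulator lanes
        sum += acc.GetElement(0) + acc.GetElement(1) + acc.GetElement(2) + acc.GetElement(3);

        // scalar tail for the remaining 0-15 bytes
        for (; z < arr.Length; z++)
            sum += (uint)ptr[z] << (24 - (z & 3) * 8);
    }

    return sum;
}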

The star function here is the byte-shuffle intrinsic, which compiles to vpshufb. This instruction lets us reorder and mask bytes in a single shot by providing a mask argument. We craft a mask that swaps endianness within each 4-byte block.

The code then reinterprets the vector as a vector of unsigned ints and adds it to an accumulator vector. In ASM, this looks like:

vmovdqu  xmm2, xmmword ptr [rdx+r9]
vpshufb  xmm2, xmm2, xmmword ptr [reloc @RWD00]
vpaddd   xmm1, xmm1, xmm2

RWD00 dq 0405060700010203h, 0C0D0E0F08090A0Bh

Now using only 3 instructions, we were able to process 16 bytes. How fast was it?

|    Method |  Length |        Mean | Ratio |
|---------- |-------- |------------:|------:|
| Junior | 1000000 | 1,774.14 us | 1.00 |
| Pro | 1000000 | 359.41 us | 0.20 |
| Senior | 1000000 | 223.97 us | 0.13 |
| Hacker | 1000000 | 117.98 us | 0.07 |
| Expert | 1000000 | 78.27 us | 0.04 |
| ExpertAvx | 1000000 | 30.62 us | 0.02 |

Not bad! Another 2x improvement. We are now, thanks to Expert, approximately 57 times faster than Junior.

The next step is to move from 128-bit to 256-bit vectors. The implementation is pretty similar, but there are some gotchas to pay attention to. For instance, for most operations a Vector256 is actually treated as two 128-bit lanes. That’s out of scope here; let’s see if we can finally get to the 100x speedup we were hoping for:

|    Method  |  Length |        Mean | Ratio |
|----------- |-------- |------------:|------:|
| Junior | 1000000 | 1,774.14 us | 1.00 |
| Pro | 1000000 | 359.41 us | 0.20 |
| Senior | 1000000 | 223.97 us | 0.13 |
| Hacker | 1000000 | 117.98 us | 0.07 |
| Expert | 1000000 | 78.27 us | 0.04 |
| ExpertAvx | 1000000 | 30.62 us | 0.02 |
| ExpertAvx2 | 1000000 | 20.83 us | 0.02 |

Around 85 times faster than Junior. Almost there!

Sometimes there’s a wall in the way

It’s important to notice that we went from processing 563MB/s to 48GB/s. At these rates, the bottleneck becomes memory access and not the CPU.

Let’s pause and think for a bit. The system I’m using has DDR4 memory, which according to Wikipedia shouldn’t be able to do better than 25GB/s. So how is it possible for our benchmark to run at 48GB/s?

The only reason we are observing these rates is that our test data fits entirely into the CPU’s L3 cache, which on the Intel i7 10875H I’m using here is 16MB. The L3 cache sits much closer to the CPU cores and can peak at around 200GB/s, although that varies a lot depending on the CPU. If our data fit into L2 or even L1, we would see even higher numbers.

Running the same benchmark with a 100MB array should expose us to RAM bottlenecking, and indeed:

|     Method |    Length |       Mean | Ratio |     Bandwidth |
|----------- |---------- |-----------:|------:|--------------:|
| Junior | 100000000 | 180.620 ms | 1.00 | 555.55 MB/s |
| ExpertAvx2 | 100000000 | 5.7772 ms | 0.03 | 17320.00 MB/s |

At 17GB/s, it closely matches the specs of the particular memory modules I have. If we were reading the data from storage, we would be bound by disk speed, not RAM, not L3 cache and definitely not the CPU.

That being said, as a software engineer it’s important to understand that performance numbers derive from various factors in a system; code is important, but it’s only one of the components. Understanding the low-level parts of your target system may help you plan how to arrange and move memory more efficiently.

What about parallelization over multiple cores?

I initially stated this is something we would not explore, and the reason should be obvious by now: besides the RAM bottleneck, the L3 cache is shared among all cores in a typical CPU, so spreading this work across cores is kind of pointless.

Can a genius take us any further?

Constraints considered, I know there’s still room to squeeze more out of the solutions presented in this article.

Here’s a repo you can clone and submit a pull request with your own version:

github.com/israellot/checksum-challenge

I’ll update this article with the best ones. I’m curious to see what you’ll come up with.
