Geting 4x Speedup With .NET Core 3.0 SIMD Intrinsics

Published in

ITNEXT

4 min readJul 31, 2019

A few weeks ago Den Raskovalov and I had a fancy conversation on C# performance, which turned into a tiny but fun coding exercise. The statement to prove or disprove was:

“The C++ code listed below, being translated to C#, will never be close to C++ in terms of speed.”

Assuming that:

We keep it single threaded
The file it reads is presumably short enough (~ 1 GB or less) to be cached in RAM — i.e. IO bandwidth won’t be the bottleneck here.

If you don’t want to dig into the code, here is what it does:

The file stores a list of integer numbers in (0 … 1,000,000) range encoded using Variable-Length Quantity coding scheme. Compute the sum of all these integers.
The coding scheme uses the most significant bit (MSB) in each byte to indicate whether there is a continuation of the current number’s bit sequence (MSB=0) or not (MSB=1). Here is an example of similar ways to encode integers.

So we started to improve our versions of code computing that sum to make it as fast as possible while honoring the rules stated above. Den was playing on C++ side.

To keep the post short, I’ll focus mostly on C# side here, though you can find both C# and C++ code here (my repository, I’ll keep it updated with the latest versions of both C# and C++ code in case of any future changes) and here (Den’s repository, most up-to-date C++ code is there).

The original C# code computing the sum:

High-level wrapper reading the file is common here; it gets a delegate pointing to the specific implementation of the computation logic (i.e. what has to be optimal).

Performance on Core i7–8700K, Ubuntu 19.04, gcc or .NET Core 3.0 preview 5 and ~1GB file with test data:

~ 430ms for C++
~ 500ms for C#

First round of optimization was relatively straightforward:

C++: enable a set of optimizations via compiler options (-Ofast -fomit-frame-pointer -march=native -mtune=native -funroll-loops -Wno-shift-count-overflow), enable PGO (profile guided optimization), use memory mapped files
C#: use unsafe pointers, unroll the main loop, add a version relying on async pipelines. I did the last part mostly to test whether there are any benefits — the implementation I used violates the original “single thread” condition since the producer there reads another chunk while the consumer computes the sum; on the other hand, C++ version relying on memory-mapped files is doing something similar implicitly — so in both cases it doesn’t look like a plain violation of our “single-threaded” rule (i.e. we agreed here that file read operations might be concurrent).

Updated C# code:

The results were again pretty similar:

394ms for C++ version
416ms for C# version with async pipeline
462ms for “regular” C# version

I tried to implement a few other ideas — in particular, implemented a version relying almost exclusively on non-branching instructions, but none of them worked any better.

The next round brought almost 5x speedup: Alexey Poyarkov wrote an extremely efficient AVX2-based version of sum computation code. I translated his code to C# line-by-line relying on .NET Core 3.0 SIMD intrinsics and made few cosmetic changes later. That’s how the final version of C# code looks:

The results:

95ms for C++ version
130ms for C# version
113ms for C# version with async pipeline.

Finally, I found out it actually makes a lot of sense to use memory-mapped files in C# version on Ubuntu. I knew from my past experience they don’t provide any benefits on Windows — moreover, quite the opposite is true, so I didn’t even try this as one of the initial optimizations. But seemingly it’s the fastest way to read the file on .NET Core on Ubuntu:

The final standings:

95ms for C++ version
101ms for C# version

Note that these results are nearly perfect: the baseline test computing the sum assuming it’s a sequence of Int64 values takes nearly the same time, so CPU is not a bottleneck for this code anymore — it’s RAM bandwidth. And that’s where we clearly had to stop.

My conclusions:

C# is definitely a fit for similar compute-intensive tasks — especially with .NET Core 3.0
If you run a similar workload at scale, the difference is expected to be even smaller: the CPU is a bottleneck on this test only while it is single-threaded. If we’d run 4 or more similar tasks concurrently (think it’s ~ a web server decoding UTF8 input, etc.), they’d hit RAM bandwidth limit even on non-SIMD version.

And here are the conclusions from Den Raskovalov (I totally agree with him as well):

“I agree with most of the conclusions Alex made, but I would like to stress your attention to a few things:

Low-level optimization is still important even if you use modern compilers and libraries. Actually, the first version of the algorithm worked 10x slower than the final one. Compilers still suck in the optimization of IO patterns and cannot use AVX/SSE pipelining as good as an experienced engineer could.
The final versions of C# and C++ code are almost identical. Microsoft obviously understands the importance of leveraging low-level optimization. C# is a real tool, not a CS toy. My kudos go to MS folks here. Before we started the “competition”, I expected that naive C++ implementation could be improved 10x, but didn’t expect that .NET could use the same low-level optimizations.
Even in 2019 after platforms convergence, Alex had troubles with running my C++ code developed on Linux. I had troubles with running .NET code as well. Alex’s version requires the most recent version of .NET core, even though C++ version works well with 6 years old GCC or clang.”

P.S. Check out my new project: Stl.Fusion, an open-source library for .NET Core and Blazor striving to be your #1 choice for real-time apps. Its unified state update pipeline is truly unique and mind-blowing.

Geting 4x Speedup With .NET Core 3.0 SIMD Intrinsics

Written by Alex Yakunin