I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.

The Benchmark

```cpp
#include <iostream>
#include <chrono>
#include <x86intrin.h>

int main(int argc, char* argv[]) {
    using namespace std;
    if (argc != 2) {
        cerr << "usage: array_size in MB" << endl;
        return -1;
    }

    uint64_t size = atol(argv[1]) << 20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i = 0; i < size; ++i)
        charbuffer[i] = rand() % 256;

    uint64_t count, duration;
    chrono::time_point<chrono::system_clock> startP, endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            // Tight unrolled loop with unsigned
            for (unsigned i = 0; i < size/8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP - startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            // Tight unrolled loop with uint64_t
            for (uint64_t i = 0; i < size/8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP - startP).count();
        cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    free(charbuffer);   // note: buffer came from new[]; strictly this should be delete[] buffer
}
```

As you can see, we create a buffer of random data, with the size being x megabytes, where x is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount intrinsic to perform the popcount. To get a more precise result, we run the popcount 10,000 times and measure the time it takes. In the upper case, the inner loop variable is unsigned; in the lower case, the inner loop variable is uint64_t. I thought this should make no difference, but the opposite is the case.

The (absolutely crazy) results

I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

```
g++ -O3 -march=native -std=c++11 test.cpp -o test
```

Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running test 1 (so 1 MB of random data):

```
unsigned    41959360000    0.401554 sec    26.113 GB/s
uint64_t    41959360000    0.759822 sec    13.8003 GB/s
```

As you can see, the throughput of the uint64_t version is only half that of the unsigned version! The problem seems to be that different assembly gets generated, but why? First, I thought of a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):

```
clang++ -O3 -march=native -std=c++11 test.cpp -o test
```

Result of test 1:

```
unsigned    41959360000    0.398293 sec    26.3267 GB/s
uint64_t    41959360000    0.680954 sec    15.3986 GB/s
```

So it is almost the same result and is still strange. But now it gets super strange. I replace the buffer size that was read from input with a constant 1, so I change:

```cpp
uint64_t size = atol(argv[1]) << 20;
```

to

```cpp
uint64_t size = 1 << 20;
```

Thus, the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++:

```
unsigned    41959360000    0.509156 sec    20.5944 GB/s
uint64_t    41959360000    0.508673 sec    20.6139 GB/s
```

Now, both versions are equally fast. However, the unsigned version got even slower!
It dropped from 26 GB/s to 20 GB/s; replacing a non-constant with a constant value led to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version:

```
unsigned    41959360000    0.677009 sec    15.4884 GB/s
uint64_t    41959360000    0.676909 sec    15.4906 GB/s
```

Wait, what? Now both versions dropped to the slow number of 15 GB/s. So replacing a non-constant with a constant value even led to slow code in both cases for clang!

I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so it does not seem to be specific to Haswell. Because two compilers produce strange results here, it also does not seem to be a compiler bug. We do not have an AMD CPU here, so we could only test with Intel.

More madness, please!

Take the first example (the one with atol(argv[1])) and put a static before the variable, i.e.:

```cpp
static uint64_t size = atol(argv[1]) << 20;
```

Here are my results with g++:

```
unsigned    41959360000    0.396728 sec    26.4306 GB/s
uint64_t    41959360000    0.509484 sec    20.5811 GB/s
```

Yay, yet another alternative. We still have the fast 26 GB/s with u32, but we managed to get u64 at least from the 13 GB/s to the 20 GB/s version! On my colleague's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all. Sadly, this only works for g++; clang++ does not seem to care about static.

My question

Can you explain these results? Especially:

- How can there be such a difference between u32 and u64?
- How can replacing a non-constant with a constant buffer size trigger less optimal code?
- How can the insertion of the static keyword make the u64 loop faster? Even faster than the original code on my colleague's computer!

I know that optimization is tricky territory, but I never thought that such small changes could lead to a 100% difference in execution time, or that a small factor like a constant buffer size could mix up the results completely. Of course, I always want the version that can popcount at 26 GB/s. The only reliable way I can think of is to copy-paste the assembly for this case and use inline assembly. That is the only way I can get rid of compilers that seem to go mad at small changes. What do you think?
Is there another way to reliably get the code with the most performance?

The Disassembly

Here is the disassembly for the various results:

26 GB/s version from g++ / u32 / non-const bufsize:

```
0x400af8:
    lea     0x1(%rdx),%eax
    popcnt  (%rbx,%rax,8),%r9
    lea     0x2(%rdx),%edi
    popcnt  (%rbx,%rcx,8),%rax
    lea     0x3(%rdx),%esi
    add     %r9,%rax
    popcnt  (%rbx,%rdi,8),%rcx
    add     $0x4,%edx
    add     %rcx,%rax
    popcnt  (%rbx,%rsi,8),%rcx
    add     %rcx,%rax
    mov     %edx,%ecx
    add     %rax,%r14
    cmp     %rbp,%rcx
    jb      0x400af8
```

13 GB/s version from g++ / u64 / non-const bufsize:

```
0x400c00:
    popcnt  0x8(%rbx,%rdx,8),%rcx
    popcnt  (%rbx,%rdx,8),%rax
    add     %rcx,%rax
    popcnt  0x10(%rbx,%rdx,8),%rcx
    add     %rcx,%rax
    popcnt  0x18(%rbx,%rdx,8),%rcx
    add     $0x4,%rdx
    add     %rcx,%rax
    add     %rax,%r12
    cmp     %rbp,%rdx
    jb      0x400c00
```

15 GB/s version from clang++ / u64 / non-const bufsize:

```
0x400e50:
    popcnt  (%r15,%rcx,8),%rdx
    add     %rbx,%rdx
    popcnt  0x8(%r15,%rcx,8),%rsi
    add     %rdx,%rsi
    popcnt  0x10(%r15,%rcx,8),%rdx
    add     %rsi,%rdx
    popcnt  0x18(%r15,%rcx,8),%rbx
    add     %rdx,%rbx
    add     $0x4,%rcx
    cmp     %rbp,%rcx
    jb      0x400e50
```

20 GB/s version from g++ / u32&u64 / const bufsize:

```
0x400a68:
    popcnt  (%rbx,%rdx,1),%rax
    popcnt  0x8(%rbx,%rdx,1),%rcx
    add     %rax,%rcx
    popcnt  0x10(%rbx,%rdx,1),%rax
    add     %rax,%rcx
    popcnt  0x18(%rbx,%rdx,1),%rsi
    add     $0x20,%rdx
    add     %rsi,%rcx
    add     %rcx,%rbp
    cmp     $0x100000,%rdx
    jne     0x400a68
```

15 GB/s version from clang++ / u32&u64 / const bufsize:

```
0x400dd0:
    popcnt  (%r14,%rcx,8),%rdx
    add     %rbx,%rdx
    popcnt  0x8(%r14,%rcx,8),%rsi
    add     %rdx,%rsi
    popcnt  0x10(%r14,%rcx,8),%rdx
    add     %rsi,%rdx
    popcnt  0x18(%r14,%rcx,8),%rbx
    add     %rdx,%rbx
    add     $0x4,%rcx
    cmp     $0x20000,%rcx
    jb      0x400dd0
```

Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only one that uses lea. Some versions use jb to jump, others use jne. But apart from that, all versions seem to be comparable. I don't see where a 100% performance gap could originate from, and I am not too adept at deciphering assembly. The slowest (13 GB/s) version even looks short and clean. Can anyone explain this?

Lessons learned

No matter what the answer to this question turns out to be, I have learned that in really hot loops every detail can matter, even details that do not seem to have any connection to the hot code. I had never thought about what type to use for a loop variable, but as you can see, such a minor change can make a 100% difference! Even the storage class of the buffer-size variable can make a huge difference, as we saw with the insertion of the static keyword! In the future, I will always test various alternatives on various compilers when writing really tight, hot loops that are crucial for system performance.

It is also interesting that the performance difference remains so high even though I have already unrolled the loop four times. So even if you unroll, you can still get hit by major performance deviations. Quite interesting.
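
For what it's worth, the inline-assembly idea mentioned above does not require copy-pasting a whole loop. Below is a minimal, hedged sketch (GNU extended asm, GCC/Clang on x86-64 only; editor's example, not code from the original post or any answer): xor-zeroing the destination register immediately before popcnt is handled at register rename and, as the comments further down explain, keeps popcnt's false output dependency from chaining iterations together.

```cpp
#include <cstdint>

// Sketch only: a popcount helper whose destination register is zeroed
// right before the popcnt, so the instruction's false dependency on its
// output register cannot couple successive iterations. "=&r" (early
// clobber) keeps the output out of the input's register.
static inline uint64_t popcnt64_nodep(uint64_t x) {
    uint64_t r;
    __asm__("xor %0, %0\n\t"
            "popcnt %1, %0"
            : "=&r"(r)
            : "r"(x)
            : "cc");
    return r;
}
```

Dropping such a helper in for _mm_popcnt_u64 keeps the benchmark logic identical while taking the register-allocation luck out of the picture; compilers newer than the ones used here already apply the same dependency-breaking trick to the plain intrinsic.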




Also see GCC Issue 62011, False Data Dependency in popcnt instruction. Someone else provided it, but it seems to have been lost during cleanups.


I can't tell: is one of the disassemblies for the version with static? If not, can you edit the post and add it?




This still reproduces in clang at head. I've filed a bug: bugs.llvm.org/show_bug.cgi?id=34936.


JustinL.: It looks like this particular issue is fixed in Clang as of 7.0.


“What's more, gcc believes the 64-bit integer […] to be better, as using uint_fast32_t causes gcc to use a 64-bit uint.” Unfortunately, and to my regret, there is no magic and no deep code introspection behind these types. I have yet to see them provided in any other way than as single typedefs for every possible place and every program on the whole platform. Quite some thought has likely gone into the exact choice of types, but one definition for each of them cannot possibly fit every application there will ever be. Some further reading: stackoverflow.com/q/4116297.
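
To see the comment's point concretely, here is a quick check of what the "fast" typedefs resolve to. This assumes a typical x86-64 Linux/glibc target (an assumption; other platforms and ABIs choose differently).

```cpp
#include <cstdint>
#include <cstdio>

// Prints the sizes of the "fast" unsigned typedefs. On x86-64 glibc all three
// are expected to come out as 8-byte types: plain typedefs, no introspection.
int main() {
    std::printf("uint_fast16_t: %zu bytes\n", sizeof(std::uint_fast16_t));
    std::printf("uint_fast32_t: %zu bytes\n", sizeof(std::uint_fast32_t));
    std::printf("uint_fast64_t: %zu bytes\n", sizeof(std::uint_fast64_t));
}
```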


Keno: That's because sizeof(uint_fast32_t) has to be defined. If you allow it not to be, you can do that trickery, but that can only be accomplished with a compiler extension.


Interesting, can you add compiler version and compiler flags? The best thing is that on your machine, the results are turned around, i.e., using u64 is faster. Until now, I have never thought about which type my loop variable has, but it seems I have to think twice next time :).


gexicide: I wouldn't call a jump from 16.8201 to 16.8126 making it "faster".


Mehrdad: The jump I mean is the one between 12.9 and 16.8, so unsigned is faster here. In my benchmark, the opposite was the case, i.e., 26 GB/s for unsigned and 15 GB/s for uint64_t.


gexicide: Have you noticed the difference in how buffer[i] is addressed?


Calvin: No, what do you mean?


Unfortunately, ever since Core 2 or so, there have been virtually no performance differences between 32-bit and 64-bit integer operations except for multiply/divide, which aren't present in this code.


Gene: Note that all versions store the size in a register and never read it from the stack inside the loop. Thus, address calculation cannot be in the mix, at least not inside the loop.


Gene: Interesting explanation indeed! But it does not explain the main WTF points. That 64-bit is slower than 32-bit due to pipeline stalls is one thing. But if that were the case, shouldn't the 64-bit version be reliably slower than the 32-bit one? Instead, three different compilers emit slow code even for the 32-bit version when using a compile-time-constant buffer size; changing the buffer size to static again changes things completely. There was even a case on my colleague's machine (and in Calvin's answer) where the 64-bit version is considerably faster! It seems to be absolutely unpredictable.


Mysticial: That's my point. There is no peak performance difference when there's zero contention for the IU, bus time, etc. The reference clearly shows that. Contention makes everything different. Here's an example from the Intel Core literature: "One new technology included in the design is Macro-Ops Fusion, which combines two x86 instructions into a single micro-operation. For example, a common code sequence like a compare followed by a conditional jump would become a single micro-op. Unfortunately, this technology does not work in 64-bit mode." So we have a 2:1 ratio in execution speed.


gexicide: I see what you're saying, but you're inferring more than I meant. I'm saying the code that's running the fastest is keeping the pipeline and dispatch queues full. This condition is fragile. Minor changes like adding 32 bits to the total data flow and instruction reordering are enough to break it. In short, the OP's assertion that fiddling and testing is the only way forward is correct.


But still, your results are totally strange (first unsigned faster, then uint64_t faster) as unrolling does not fix the main problem of the false dependency.


That was the first thing I did after I read the question: break the dependency chain. As it turned out, the performance difference does not change (on my computer at least: Intel Haswell with GCC 4.7.3).
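
For readers wondering what "break the dependency chain" can look like in source, here is a minimal sketch (an editor's illustration of the idea, not the commenter's actual code): accumulate into several independent variables so the compiler is encouraged, though not forced, to give each popcnt its own destination register and its own add chain.

```cpp
#include <cstdint>
#include <x86intrin.h>

// Sketch: four independent accumulators instead of one. Each partial sum forms
// its own dependency chain, and the totals are combined once at the end.
// Tail elements (n not a multiple of 4) are left out for brevity, as in the
// original benchmark loop.
uint64_t popcount_4acc(const uint64_t* buf, uint64_t n) {   // n = number of uint64_t elements
    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (uint64_t i = 0; i + 4 <= n; i += 4) {
        c0 += _mm_popcnt_u64(buf[i]);
        c1 += _mm_popcnt_u64(buf[i + 1]);
        c2 += _mm_popcnt_u64(buf[i + 2]);
        c3 += _mm_popcnt_u64(buf[i + 3]);
    }
    return c0 + c1 + c2 + c3;
}
```

As the comment above reports, this alone did not change the measured difference on that Haswell / GCC 4.7.3 setup, which fits the false-output-dependency explanation discussed further down.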


BenVoigt: It conforms to strict aliasing. void* and char* are the two types which may be aliased, as they are essentially considered "pointers into some chunk of memory"! Your idea concerning the data-dependency removal is nice for optimization, but it does not answer the question. And, as NilsPipenbrinck says, it does not seem to change anything.


gexicide: The strict aliasing rule is not symmetric. You can use char* to access a T[]. You cannot safely use a T* to access a char[], and your code appears to do the latter.


BenVoigt: Then you could never safely malloc an array of anything, as malloc returns void* and you interpret it as T[]. And I am pretty sure that void* and char* have the same semantics concerning strict aliasing. However, I guess this is quite off-topic here :)


Personally, I think the right way is:

```cpp
uint64_t* buffer = new uint64_t[size/8];              /* type is clearly uint64_t[] */
char* charbuffer = reinterpret_cast<char*>(buffer);   /* aliasing a uint64_t[] with char* is safe */
```
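
To make the asymmetry from the exchange above concrete, here is a small illustrative sketch (editor's example, not from any comment): byte-wise access to real uint64_t objects is fine, while treating raw char storage as a uint64_t is not, unless an object of that type is actually created there.

```cpp
#include <cstdint>
#include <new>

void aliasing_directions() {
    // Allowed: the objects are uint64_t; a char* may inspect or modify their bytes.
    uint64_t* buffer = new uint64_t[4]{};
    char* bytes = reinterpret_cast<char*>(buffer);
    bytes[0] = 1;

    // Not allowed: no uint64_t object lives in this char array, so reading or
    // writing through a uint64_t* here would be undefined behavior.
    alignas(uint64_t) char raw[sizeof(uint64_t)] = {};
    // uint64_t* bad = reinterpret_cast<uint64_t*>(raw); *bad = 1;   // UB

    // Allowed: placement new creates a genuine uint64_t object in that storage.
    uint64_t* ok = new (raw) uint64_t(1);
    (void)ok;

    delete[] buffer;
}
```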


It's just good luck that -funroll-loops happens to make code that doesn't bottleneck on the loop-carried dependency chain created by popcnt's false dependency. Using an old compiler version that doesn't know about the false dependency is a risk. Without -funroll-loops, gcc 4.8.5's loop will bottleneck on popcnt latency instead of throughput, because it counts into rdx. The same code, compiled by gcc 4.9.3, adds an xor edx,edx to break the dependency chain.


With old compilers, your code would still be vulnerable to exactly the same performance variation the OP experienced: seemingly trivial changes could make gcc emit something slow, because it had no idea doing so would cause a problem. Finding something that happens to work in one case on an old compiler is not the question.


For the record, x86intrin.h's _mm_popcnt_* functions on GCC are forcibly inlined wrappers around the __builtin_popcount*; the inlining should make one exactly equivalent to the other. I highly doubt you'd see any difference that could be caused by switching between them.
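
As a quick illustration of the equivalence claimed above (a sketch; exact codegen depends on compiler version and flags such as -mpopcnt or -march=native), the two spellings below should compile to the same popcnt instruction on GCC and Clang:

```cpp
#include <cstdint>
#include <x86intrin.h>

// Intrinsic wrapper vs. builtin; with hardware popcnt enabled, both are
// expected to lower to a single popcnt instruction.
uint64_t via_intrinsic(uint64_t x) { return _mm_popcnt_u64(x); }
uint64_t via_builtin(uint64_t x)   { return __builtin_popcountll(x); }
```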


If you're timing in core clock cycles (instead of seconds), 1 second is plenty of time for a tiny CPU-bound loop. Even 100 ms is fine for finding major differences or checking perf counters for uop counts. Especially on a Skylake, where hardware P-state management lets it ramp up to max clock speed in microseconds after load starts.


clang can auto-vectorize __builtin_popcountl with AVX2 vpshufb, and doesn't need multiple accumulators in the C source to do so. I'm not sure about _mm_popcnt_u64; that might only auto-vectorize with AVX512-VPOPCNT. (See "Counting 1 bits (population count) on large data using AVX-512 or AVX-2".)
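
The kind of loop the comment is talking about looks like the sketch below; whether it actually vectorizes depends on the clang version and flags (e.g. -O3 -march=haswell is an assumption here), so check the generated assembly rather than taking it on faith.

```cpp
#include <cstdint>
#include <cstddef>

// A plain popcount-reduction loop. Recent clang can turn this into an AVX2
// vpshufb-based in-register popcount with vector accumulators; older compilers
// will emit scalar popcnt instructions instead.
uint64_t popcount_all(const uint64_t* buf, std::size_t n) {
    uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += __builtin_popcountll(buf[i]);
    return total;
}
```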


But anyway, looking at Intel's optimization manual won't help: as the accepted answer shows, the problem is an unexpected output dependency of popcnt. This is documented in Intel's errata for some of their recent microarchitectures, but I think it wasn't at the time. Your dep-chain analysis will fail if there are unexpected false dependencies, so this answer is good generic advice but not applicable here.


Modern GCC knows about the false dependency and avoids it, often using things like xor eax,eax / popcnt rax, [rdi] or mov rax, [rdi] / popcnt rax,rax. Compiling a C version with a modern compiler sidesteps the whole point of the question.


As far as I can see, Peter used gcc (Ubuntu 4.8.4-2ubuntu1~14.04.1) 4.8.4, which is hardly a modern version.


It seems much more likely to me that using static happened to change register allocation for the function in a way that affected the false output dependency of popcnt on the Intel CPUs the OP was testing on, with a compiler that didn't know to avoid it. (This performance pothole in Intel CPUs hadn't been discovered yet.) A compiler can keep a static local variable in a register, just like an automatic-storage variable, but if it doesn't optimize on the assumption that main runs only once, that will affect code-gen (because the value is set by the first call only).


Anyway, the performance difference between [RIP + rel32] and [rsp + 42] addressing modes is pretty negligible for most cases. cmp dword [RIP+rel32], immediate can't micro-fuse into a single load+cmp uop, but I don't think that's going to be a factor. Like I said, inside loops it probably stays in a register anyway, but tweaking the C++ can mean different compiler choices.
