Here is a piece of C++ code that shows some very peculiar behavior. For some strange reason, sorting the data miraculously makes the code almost six times faster.

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}
  • Without std::sort(data, data + arraySize);, the code runs in 11.54 seconds.
  • With the sorted data, the code runs in 1.93 seconds.

Initially, I thought this might be just a language or compiler anomaly. So I tried it in Java.

import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // Generate data
        int arraySize = 32768;
        int[] data = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt(256); // nextInt() % 256 can be negative; use [0, 255] to match the C++ version

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i)
        {
            // Primary loop
            for (int c = 0; c < arraySize; ++c)
            {
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}

The result was similar, though less extreme.


My first thought was that sorting brings the data into the cache, but then I realized how silly that is: the array was just generated.

  • What is going on?
  • Why is it faster to process a sorted array than an unsorted array?
  • The code is summing up some independent terms, and the order should not matter.


Just for the record: on Windows / VS2017 / i7-6700K 4 GHz, there is NO difference between the two versions. Both cases take 0.6 s. If the number of iterations in the outer loop is increased tenfold, the execution time increases tenfold too, to 6 s in both cases.

user194715: any compiler that uses a cmov or other branchless implementation (like auto-vectorization with pcmpgtd) will have performance that's not data dependent on any CPU. But if it's branchy, it will be sort-dependent on any CPU with out-of-order speculative execution. (Even high-performance in-order CPUs use branch-prediction to avoid fetch/decode bubbles on taken branches; the miss penalty is smaller).
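
For reference, a minimal sketch of the branchless shape a compiler might choose for the hot loop (illustrative only; whether a given compiler actually emits cmov here depends on version and flags):

long long branchlessSum(const int* data, unsigned arraySize)
{
    long long sum = 0;
    for (unsigned c = 0; c < arraySize; ++c)
    {
        // A ternary on already-loaded values often compiles to a
        // conditional move: no branch, no misprediction, and therefore
        // no dependence on whether the data is sorted.
        sum += (data[c] >= 128) ? data[c] : 0;
    }
    return sum;
}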

Whoops... re: Meltdown and Spectre

KyleMit does it have something to do with both? I haven't read much on either

mohitmun, both of those security flaws exploit speculative execution; Spectre variant 2 in particular is classified as a “branch target injection” attack

Mysticial To avoid the shifting hack you could write something like int t = -(data[c] >= 128); to generate the mask. This should be faster too. It would be interesting to know whether the compiler is clever enough to insert a conditional move.
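
A minimal sketch of that comparison-derived mask, assuming as in the question that data[c] is in [0, 255]:

long long maskedSum(const int* data, unsigned arraySize)
{
    long long sum = 0;
    for (unsigned c = 0; c < arraySize; ++c)
    {
        // (data[c] >= 128) is 0 or 1; negating yields 0 or all ones,
        // which acts as an AND mask: keep data[c] or zero it out.
        int t = -(data[c] >= 128);
        sum += t & data[c];
    }
    return sum;
}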

phonetagger Take a look at this followup question: stackoverflow.com/questions/11276291/… The Intel Compiler came pretty close to completely getting rid of the outer loop.

Novelocrat Only half of that is correct. Shifting a 1 into the sign bit when it is zero is indeed UB, because it's signed integer overflow. But shifting a 1 out of the sign bit is IB (implementation-defined behavior), and right-shifting a negative signed integer is IB as well. You can argue that C/C++ doesn't require the top bit to be the sign indicator, but such implementation details are IB.

Mysticial Thanks so much for the link. It looks promising. I will go through it. One last request, if you don't mind: could you tell me how this works: int t = (data[c] - 128) >> 31; sum += ~t & data[c]; to replace the original if-condition above?
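
For readers with the same question, here is that trick annotated (it relies on an arithmetic right shift of negative values, which is implementation-defined as noted above, and on data[c] being in [0, 255]):

long long shiftMaskSum(const int* data, unsigned arraySize)
{
    long long sum = 0;
    for (unsigned c = 0; c < arraySize; ++c)
    {
        // data[c] <  128  ->  data[c] - 128 is negative  ->  t = all ones
        // data[c] >= 128  ->  data[c] - 128 is >= 0      ->  t = 0
        int t = (data[c] - 128) >> 31;

        // ~t flips the mask, so this adds data[c] when data[c] >= 128
        // and adds 0 otherwise - the same effect as the if, branch-free.
        sum += ~t & data[c];
    }
    return sum;
}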

The grammarian in me thinks this should read "... victim of branch prediction failure" rather than just "... victim of branch prediction fail."

Does branch prediction work better on sorted arrays than on arrays with other regular patterns? For example, in the array { 10, 5, 20, 10, 40, 20, ... } the pattern suggests the next element is 80. Would this kind of array be sped up by branch prediction, where the next element is 80 if the pattern holds? Or does it usually only help with sorted arrays?

So basically everything I conventionally learned about big-O is out of the window? Better to incur a sorting cost than a branching cost?

AgrimPathak That depends. For inputs that are not too large, an algorithm with higher complexity is faster than one with lower complexity when its constants are smaller. Where the break-even point lies can be hard to predict. Locality also matters. Big-O is important, but it is not the sole criterion for performance.
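
As a rough back-of-envelope check with the numbers from this question: sorting 32768 elements costs on the order of 32768 × log2(32768) ≈ 5 × 10^5 comparisons, paid once. The timed loops then execute 32768 × 100000 ≈ 3.3 × 10^9 data-dependent branches; at a ~50% miss rate and a typical penalty of roughly 10-20 cycles per miss, the mispredictions dominate the one-time O(n log n) sort by orders of magnitude.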

When does branch prediction take place? When does the language know that the array is sorted? I'm thinking of an array that looks like [1, 2, 3, 4, 5, ..., 998, 999, 1000, 3, 10001, 10002]. Will this obscure 3 increase the running time? Will it be as long as with an unsorted array?

FilipBartuzi Branch prediction takes place in the processor, below the language level (but the language may offer ways to tell the compiler what's likely, so the compiler can emit code suited to that). In your example, the out-of-order 3 will lead to a branch misprediction (for appropriate conditions, where 3 gives a different result than 1000), and thus processing that array will likely take a couple dozen or hundred nanoseconds longer than a sorted array would, hardly ever noticeable. What costs time is a high rate of mispredictions; one misprediction per 1000 isn't much.
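
As a sketch of the "tell the compiler what's likely" part: GCC and Clang provide __builtin_expect (C++20 adds [[likely]]/[[unlikely]]). These hints affect code layout and the compiler's static guesses, not the hardware predictor itself:

long long hintedSum(const int* data, unsigned arraySize)
{
    long long sum = 0;
    for (unsigned c = 0; c < arraySize; ++c)
    {
        // Tells the compiler the condition is usually true; this only
        // helps if the data really is biased that way (in this benchmark
        // it is a 50/50 coin flip, so the hint would be of little use).
        if (__builtin_expect(data[c] >= 128, 1))
            sum += data[c];
    }
    return sum;
}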

There's no optimization by default unless you add -O to your GCC command line. (And you can't have worse English than mine ;)

I find it hard to believe that the compiler can optimize the ternary operator better than it can the equivalent if-statement. You've shown that GCC optimizes the ternary operator to a conditional move; you haven't shown that it doesn't do exactly the same thing for the if-statement. In fact, according to Mysticial above, GCC does optimize the if-statement to a conditional move, which would make this answer completely incorrect.

WiSaGaN The code demonstrates nothing, because your two pieces of code compile to the same machine code. It's critically important that people don't get the idea that somehow the if-statement in your example is different from the ternary in your example. It's true that you own up to the similarity in your last paragraph, but that doesn't erase the fact that the rest of the example is harmful.

WiSaGaN My downvote would definitely turn into an upvote if you modified your answer to remove the misleading -O0 example and to show the difference in optimized asm on your two testcases.

UpAndAdam At the time of the test, VS2010 couldn't optimize the original branch into a conditional move even when specifying a high optimization level, while GCC could.

If you want to cheat, you might as well take the multiplication outside the loop and do sum*=100000 after the loop.

Michael - I believe that this example is actually an example of loop-invariant hoisting (LIH) optimization, and NOT loop swap. In this case, the entire inner loop is independent of the outer loop and can therefore be hoisted out of it, whereupon the result is simply multiplied by the outer trip count of 100000. It makes no difference to the end result, but I just wanted to set the record straight since this is such a frequented page.
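
A sketch of the hoisted form described above, reusing data and arraySize from the question's code:

// The inner loop does not depend on i, so compute it once ...
long long inner = 0;
for (unsigned c = 0; c < arraySize; ++c)
{
    if (data[c] >= 128)
        inner += data[c];
}

// ... and scale by the outer trip count. Same result as 100000 iterations.
long long sum = inner * 100000;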

Although not in the simple spirit of swapping loops, the inner if at this point could be converted to: sum += (data[j] >= 128) ? data[j] * 100000 : 0; which the compiler may be able to reduce to cmovge or equivalent.

The outer loop is there to make the time taken by the inner loop large enough to profile. So why would you swap the loops? In the end, that loop will be removed anyway.

saurabheights: Wrong question: why would the compiler NOT swap the loops? Microbenchmarking is hard ;)

This is scary: in the unsorted case there should be a 50% chance of hitting the add. Somehow branch prediction only has a 25% miss rate; how can it do better than a 50% miss?

tall.b.lo: The 25% is of all branches - there are two branches in the loop, one for data[c] >= 128 (which has a 50% miss rate as you suggest) and one for the loop condition c < arraySize which has ~0% miss rate.

Why would you want to bypass the branch predictor? It's an optimization.

Because no branch is better than a branch :-) In a lot of situations this is simply a lot faster... if you're optimizing, it's definitely worth a try. They also use it quite a bit in, for example, graphics.stanford.edu/~seander/bithacks.html

In general lookup tables can be fast, but have you run the tests for this particular condition? You'll still have a branch condition in your code, only now it's moved to the lookup-table generation part. You still wouldn't get your perf boost.

Zain if you really want to know... Yes: 15 seconds with the branch and 10 with my version. Regardless, it's a useful technique to know either way.

Why not sum += lookup[data[j]] where lookup is an array with 256 entries, the first ones being zero and the last ones being equal to the index?
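
A sketch of that 256-entry table (note the earlier point: the branch does not disappear, it moves into the one-time table construction):

// lookup[v] is 0 for v < 128 and v itself for v >= 128.
int lookup[256];
for (int v = 0; v < 256; ++v)
    lookup[v] = (v >= 128) ? v : 0;   // the only branch, run 256 times

// The hot loop now contains no data-dependent branch at all:
long long sum = 0;
for (unsigned c = 0; c < arraySize; ++c)
    sum += lookup[data[c]];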

You don't show the timings of the "random" TF pattern.

MooingDuck 'Cause it won't make a difference - that value can be anything, but it still will be in the bounds of these thresholds. So why show a random value when you already know the limits? Although I agree that you could show one for the sake of completeness, and 'just for the heck of it'.

cst1992: Right now his slowest timing is TTFFTTFFTTFF, which seems, to my human eye, quite predictable. Random is inherently unpredictable, so it's entirely possible it would be slower still, and thus outside the limits shown here. OTOH, it could be that TTFFTTFF perfectly hits the pathological case. Can't tell, since he didn't show the timings for random.

MooingDuck To a human eye, "TTFFTTFFTTFF" is a predictable sequence, but what we are talking about here is the behavior of the branch predictor built into a CPU. The branch predictor is not AI-level pattern recognition; it's very simple. When you just alternate branches it doesn't predict well. In most code, branches go the same way almost all the time; consider a loop that executes a thousand times. The branch at the end of the loop goes back to the start of the loop 999 times, and then the thousandth time does something different. A very simple branch predictor works well, usually.

steveha: I think you're making assumptions about how the CPU branch predictor works, and I disagree with that methodology. I don't know how advanced that branch predictor is, but I seem to think it's far more advanced than you do. You're probably right, but measurements would definitely be good.

Right, you can also just use the bit directly and multiply (data[c]>>7 - which is discussed somewhere here as well); I intentionally left this solution out, but of course you are correct. Just a small note: The rule of thumb for lookup tables is that if it fits in 4KB (because of caching), it'll work - preferably make the table as small as possible. For managed languages I'd push that to 64KB, for low-level languages like C++ and C, I'd probably reconsider (that's just my experience). Since typeof(int) = 4, I'd try to stick to max 10 bits.

I think indexing with the 0/1 value will probably be faster than an integer multiply, but I guess if performance is really critical you should profile it. I agree that small lookup tables are essential to avoid cache pressure, but clearly if you have a bigger cache you can get away with a bigger lookup table, so 4KB is more a rule of thumb than a hard rule. I think you meant sizeof(int) == 4? That would be true for 32-bit. My two-year-old cell phone has a 32KB L1 cache, so even a 4K lookup table might work, especially if the lookup values were a byte instead of an int.
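
One way to read "indexing with the 0/1 value" is the two-accumulator trick below (illustrative; it avoids both the branch and the multiply):

// bool converts to 0 or 1, selecting one of two accumulators.
long long sums[2] = { 0, 0 };
for (unsigned c = 0; c < arraySize; ++c)
    sums[data[c] >= 128] += data[c];

long long sum = sums[1];   // keep only the ">= 128" bucket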

Possibly I'm missing something, but in your "j equals 0 or 1" method, why don't you just multiply your value by j before adding it, rather than using the array indexing? (It possibly should be multiplied by 1 - j rather than j.)

steveha Multiplication should be faster; I tried looking it up in the Intel manuals but couldn't find it... either way, benchmarking also gives me that result here.

steveha P.S.: another possible answer would be int c = data[j]; sum += c & -(c >> 7); which requires no multiplications at all.

sum = 3137536 - clever. That's kinda obviously not the point of the question. The question is clearly about explaining surprising performance characteristics. I'm inclined to say that the suggestion of using std::partition instead of std::sort is valuable, though the actual question extends to more than just the synthetic benchmark given.
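
For reference, the std::partition variant: the loop only needs the >= 128 values grouped together, not a total order, so an O(n) partition works as a drop-in replacement for the std::sort line in the question:

// Requires <algorithm>. Instead of: std::sort(data, data + arraySize);
std::partition(data, data + arraySize, [](int x) { return x >= 128; });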

DeadMG: this is indeed not the standard dichotomic search for a given key, but a search for the partitioning index; it requires a single compare per iteration. But don't rely on this code, I have not checked it. If you are interested in a guaranteed correct implementation, let me know.
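
For the record, the standard library already provides that partition-index search: given data partitioned on x < 128 (e.g. after the sort), std::partition_point finds the split in O(log n) comparisons, and the sum can then skip the per-element test entirely. A sketch:

#include <algorithm>

long long sumAboveThreshold(const int* data, unsigned arraySize)
{
    // First element that is >= 128; all elements after it are too.
    const int* split = std::partition_point(data, data + arraySize,
                                            [](int x) { return x < 128; });

    long long sum = 0;
    for (const int* p = split; p != data + arraySize; ++p)
        sum += *p;   // no comparison needed inside the loop
    return sum;
}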

How are two instructions executed together? Is this done with separate CPU cores, or is the instruction pipeline integrated into a single CPU core?

M.kazemAkhgary It's all inside one logical core. If you're interested, this is nicely described, for example, in the Intel Software Developer's Manual.

That is a very interesting article (in fact, I have just read all of it), but how does it answer the question?

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - From Review

PeterMortensen I am a bit flummoxed by your question. For example, here is one relevant line from that piece: "When the input is unsorted, all the rest of the loop takes substantial time. But with sorted input, the processor is somehow able to spend not just less time in the body of the loop, meaning the buckets at offsets 0x18 and 0x1C, but vanishingly little time on the mechanism of looping." The author is trying to discuss profiling in the context of the code posted here and, in the process, trying to explain why the sorted case is so much faster.

I didn't understand any of it. Plus one because I liked the pictures.

What is the source/context for the last link?

Are you saying that every instruction can be conditional? So, multiple instructions with the GE suffix could be performed sequentially, without changing the value of R3 in between?

Yes, correct, every instruction can be conditional on ARM, at least in the 32-bit instruction set; there's a dedicated 4-bit condition field. You can have several instructions in a row with the same condition, but at some point, if the chance of the condition being false is non-negligible, it is more efficient to add a branch.

The other innovation in ARM is the S instruction suffix, also optional on (almost) all instructions: when it is absent, the instruction does not change the status bits (with the exception of the CMP instruction, whose job is to set status bits, so it doesn't need the S suffix). This allows you to avoid CMP instructions in many cases, as long as the comparison is with zero or similar (e.g. SUBS R0, R0, #1 will set the Z (zero) bit when R0 reaches zero). Conditionals and the S suffix incur zero overhead. It's quite a beautiful ISA.

Not adding the S suffix allows you to have several conditional instructions in a row without worrying that one of them might change the status bits, which might otherwise have the side effect of skipping the rest of the conditional instructions.

"In simple words" - I find your explanation less simple than the other with trains and far less accurate than any of other answer, though I am not a beginner. I am very curious why there are so many upvotes, perhaps one of future upvoters can tell me?

Sinatr It's probably opinion-based; I myself found it good enough to upvote. It's of course not as accurate as other examples, but that's the whole point: giving away the answer (as we can all agree branch prediction is involved here) without making readers go through the technical explanations that others gave (very well). And I think he did it well enough.

Your tone doesn't seem to be in keeping with stackoverflow.com/help/be-nice. The OP knows that theoretically they should be the same, but has measured them not being the same, hence their question =]

Tone policing is an ad hominem attack on sound logic. I'm here to be as kind as possible in positing why my solution may be kinder than everyone else's in not assuming the poster knew anything. But thanks for knocking off some reputation and not just removing my answer. I don't get why I bother contributing if I'm going to be penalized for framing MY answer in the way I choose? Please delete my answer if you don't like my tone, but don't assail my logic OR my kindness to the poster. I was nothing BUT kind. Also, I'd like my 20 points of reputation back that you glibly stole. That sucks.

I didn't downvote; I just thought you should be aware of why your answer may not be getting the attention you believe it merits. Your capitalisation and insistence that only you are correct come across as arrogance, not sincerity. You also sound condescending and plain angry. I'm not saying you are/were intending any of those things, but clearly people felt that your answer deserved downvotes for that reason.

I know you didn’t. I’m not really responding to you. I also am not trying to be condescending, and understand it’s coming off the way it is. My intention with framing my answer the way it is was to answer the question for that earnest college student who has the same question that the poster did, but may not be far enough ahead in their career to understand the advanced concepts necessary to derive value from the significance of the question OR the answer. Why would I care about sincerity, arrogance, attention or merit? You’re commenting, so are only you wondering if I’m wrong OR right?

Argument on a forum is meant to find truth. My tone is necessary to capture the thought process of someone who has the same issue, isn’t the OP (who may already have the answer they need), and to give an argument in the lowest abstraction level that branch prediction itself rests upon. If this is a forum, and everyone understands argument’s actual meaning, then people would have just left my answer alone with no comment. BTW: I hope this answer gets more downvotes and I can never post again, so I can solve P=NP and am forced to go get my millennium prize. (THAT’S hubris, yet still germane)
