Why is it faster to process a sorted array than an unsorted array?

Here is a piece of C++ code that seems very peculiar. For some strange reason, sorting the data miraculously makes the code almost six times faster.

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}
  • Without std::sort(data, data + arraySize);, the code runs in 11.54 seconds.
  • With the sorted data, the code runs in 1.93 seconds.

Initially, I thought this might be just a language or compiler anomaly. So I tried it in Java.

import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // Generate data
        int arraySize = 32768;
        int[] data = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt(256);  // 0..255, matching the C++ version

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i)
        {
            // Primary loop
            for (int c = 0; c < arraySize; ++c)
            {
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}

The result was somewhat similar, but less extreme.


My first thought was that sorting brings the data into the cache, but then I thought how silly that is because the array was just generated.

  • What is going on?
  • Why is it faster to process a sorted array than an unsorted array?
  • The code is summing up some independent terms, and the order should not matter.


Just for the record: on Windows / VS2017 / i7-6700K 4 GHz there is NO difference between the two versions. Both cases take 0.6 s. If the number of iterations in the outer loop is increased tenfold, the execution time also increases tenfold, to 6 s in both cases.

user194715: any compiler that uses a cmov or other branchless implementation (like auto-vectorization with pcmpgtd) will have performance that's not data dependent on any CPU. But if it's branchy, it will be sort-dependent on any CPU with out-of-order speculative execution. (Even high-performance in-order CPUs use branch-prediction to avoid fetch/decode bubbles on taken branches; the miss penalty is smaller).

Woops... re: Meltdown and Spectre

KyleMit does it have something to do with both? I haven't read much on either.

mohitmun, both of those security flaws fit into a broad category of vulnerabilities classified as “branch target injection” attacks

Mysticial To avoid the shifting hack you could write something like int t=-((data[c]>=128)) to generate the mask. This should be faster too. Would be interesting to know if the compiler is clever enough to insert a conditional move or not.
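
A minimal sketch of that mask idea, reusing data, arraySize, and sum from the question's code; whether it actually beats whatever the compiler emits would have to be measured:

// -(cond) turns the 0/1 comparison result into an all-zeros or
// all-ones mask, so the AND keeps data[c] only when data[c] >= 128.
for (unsigned c = 0; c < arraySize; ++c)
{
    int t = -(data[c] >= 128);   // 0 -> 0, 1 -> -1 (all bits set)
    sum += t & data[c];
}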

phonetagger Take a look at this followup question: stackoverflow.com/questions/11276291/… The Intel Compiler came pretty close to completely getting rid of the outer loop.

Novelocrat Only half of that is correct. Shifting a 1 into the sign bit when it is zero is indeed UB, because it's signed integer overflow. But shifting a 1 out of the sign bit is IB: right-shifting a negative signed integer is implementation-defined. You can argue that C/C++ doesn't require the top bit to be the sign indicator, but such implementation details are IB.

Mysticial Thanks so much for the link. It looks promising; I will go through it. One last request, sorry, but please don't mind: could you tell me how int t = (data[c] - 128) >> 31; sum += ~t & data[c]; replaces the original if-condition above?
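
For the record, this is how the shift trick in that comment replaces the if, reusing the question's variables (right-shifting a negative signed value is implementation-defined, as noted a few comments above, but mainstream compilers do an arithmetic shift):

// data[c] <  128:  data[c] - 128 < 0,  so t = -1 (all bits set), ~t = 0  -> adds 0
// data[c] >= 128:  data[c] - 128 >= 0, so t = 0, ~t = -1 (all bits set)  -> adds data[c]
for (unsigned c = 0; c < arraySize; ++c)
{
    int t = (data[c] - 128) >> 31;
    sum += ~t & data[c];
}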

The grammarian in me thinks this should read "... victim of branch prediction failure" rather than just "... victim of branch prediction fail."

Does branch prediction work better on sorted arrays than on arrays with other patterns? For example, for the array { 10, 5, 20, 10, 40, 20, ... } the pattern suggests the next element is 80. Would this kind of array be sped up by branch prediction, where the next element is 80 if the pattern holds? Or does it usually only help with sorted arrays?

So basically everything I conventionally learned about big-O is out the window? Better to incur a sorting cost than a branching cost?

AgrimPathak That depends. For inputs that are not too large, an algorithm with higher complexity is faster than one with lower complexity when its constants are smaller. Where the break-even point lies can be hard to predict. Also, locality is important. Big-O matters, but it is not the sole criterion for performance.

When does branch prediction take place? When does the language know that the array is sorted? I'm thinking of an array that looks like [1, 2, 3, 4, 5, ..., 998, 999, 1000, 3, 10001, 10002]. Will this obscure 3 increase the running time? Will it be as long as for an unsorted array?

FilipBartuzi Branch prediction takes place in the processor, below the language level (but the language may offer ways to tell the compiler what's likely, so the compiler can emit code suited to that). In your example, the out-of-order 3 will lead to a branch misprediction (for appropriate conditions, where 3 gives a different result than 1000), and thus processing that array will likely take a couple dozen or hundred nanoseconds longer than a sorted array would, hardly ever noticeable. What costs time is a high rate of mispredictions; one misprediction per 1000 isn't much.
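
As an aside on "the language may offer ways to tell the compiler what's likely", a hedged sketch using GCC/Clang's __builtin_expect (C++20 offers [[likely]]/[[unlikely]] for the same purpose). This only shapes the generated code layout; the hardware predictor still adapts on its own at run time:

// Hint that the condition is usually true; this helps only if that is
// actually the case for the real data.
for (unsigned c = 0; c < arraySize; ++c)
{
    if (__builtin_expect(data[c] >= 128, 1))
        sum += data[c];
}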

There's no default optimization level unless you add -O to your GCC command line. (And you can't have worse English than mine ;)

I find it hard to believe that the compiler can optimize the ternary operator better than it can the equivalent if-statement. You've shown that GCC optimizes the ternary operator to a conditional move; you haven't shown that it doesn't do exactly the same thing for the if-statement. In fact, according to Mysticial above, GCC does optimize the if-statement to a conditional move, which would make this answer completely incorrect.

WiSaGaN The code demonstrates nothing, because your two pieces of code compile to the same machine code. It's critically important that people don't get the idea that the if-statement in your example is somehow different from the ternary in your example. It's true that you own up to the similarity in your last paragraph, but that doesn't erase the fact that the rest of the example is harmful.

WiSaGaN My downvote would definitely turn into an upvote if you modified your answer to remove the misleading -O0 example and to show the difference in optimized asm on your two testcases.

UpAndAdam At the time of the test, VS2010 couldn't optimize the original branch into a conditional move even when specifying a high optimization level, while gcc could.

If you want to cheat, you might as well take the multiplication outside the loop and do sum*=100000 after the loop.

Michael - I believe that this example is actually an example of loop-invariant hoisting (LIH) optimization, NOT loop swap. In this case, the entire inner loop is independent of the outer loop and can therefore be hoisted out of it, whereupon the result is simply multiplied by the number of outer iterations, 1e5. It makes no difference to the end result, but I just wanted to set the record straight since this is such a frequented page.
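
A sketch of the hoisted form just described (a hypothetical rewrite, reusing the question's variables):

// The inner loop does not depend on i, so compute it once...
long long innerSum = 0;
for (unsigned c = 0; c < arraySize; ++c)
{
    if (data[c] >= 128)
        innerSum += data[c];
}
// ...and scale by the 100000 outer iterations.
long long sum = innerSum * 100000LL;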

Although not in the simple spirit of swapping loops, the inner if at this point could be converted to: sum += (data[j] >= 128) ? data[j] * 100000 : 0; which the compiler may be able to reduce to cmovge or equivalent.

The outer loop is just there to make the time taken by the inner loop large enough to profile. So why would you swap the loops? In the end, that loop will be removed anyway.

saurabheights: Wrong question: why would the compiler NOT swap the loops? Microbenchmarking is hard ;)

This is scary: in the unsorted list there should be a 50% chance of hitting the add. Somehow the branch prediction only has a 25% miss rate; how can it do better than a 50% miss?

tall.b.lo: The 25% is of all branches - there are two branches in the loop, one for data[c] >= 128 (which has a 50% miss rate as you suggest) and one for the loop condition c < arraySize which has ~0% miss rate.

You want to bypass the branch-predictor, why? It's an optimization.

Because no branch is better than a branch :-) In a lot of situations this is simply a lot faster... if you're optimizing, it's definitely worth a try. They also use it quite a bit in f.ex. graphics.stanford.edu/~seander/bithacks.html

In general lookup tables can be fast, but have you run the tests for this particular condition? You'll still have a branch condition in your code, only now it's moved to the lookup-table generation part. You still wouldn't get your perf boost.

Zain if you really want to know... Yes: 15 seconds with the branch and 10 with my version. Regardless, it's a useful technique to know either way.

Why not sum += lookup[data[j]] where lookup is an array with 256 entries, the first ones being zero and the last ones being equal to the index?
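
A minimal sketch of that lookup-table suggestion, assuming the question's 0..255 value range; the branch runs once per entry during setup and disappears from the hot loop:

int lookup[256];
for (int v = 0; v < 256; ++v)
    lookup[v] = (v >= 128) ? v : 0;   // the branch lives here, once per entry

for (unsigned c = 0; c < arraySize; ++c)
    sum += lookup[data[c]];           // branch-free hot loop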

You don't show the timings of the "random" TF pattern.

MooingDuck 'Cause it won't make a difference - that value can be anything, but it still will be in the bounds of these thresholds. So why show a random value when you already know the limits? Although I agree that you could show one for the sake of completeness, and 'just for the heck of it'.

cst1992: Right now his slowest timing is TTFFTTFFTTFF, which seems, to my human eye, quite predictable. Random is inherently unpredictable, so it's entirely possible it would be slower still, and thus outside the limits shown here. OTOH, it could be that TTFFTTFF perfectly hits the pathological case. Can't tell, since he didn't show the timings for random.

MooingDuck To a human eye, "TTFFTTFFTTFF" is a predictable sequence, but what we are talking about here is the behavior of the branch predictor built into a CPU. The branch predictor is not AI-level pattern recognition; it's very simple. When you just alternate branches it doesn't predict well. In most code, branches go the same way almost all the time; consider a loop that executes a thousand times. The branch at the end of the loop goes back to the start of the loop 999 times, and then the thousandth time does something different. A very simple branch predictor works well, usually.
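
To make that concrete, here is a toy model of a 2-bit saturating-counter predictor (a deliberate simplification, not any real CPU's design). On an always-taken pattern it essentially never misses; on the alternating TTFF pattern it misses roughly three times out of four:

#include <iostream>
#include <string>

// Predict "taken" when state >= 2; nudge the state toward the actual
// outcome after each branch. The state saturates at 0 and 3.
static int missPercent(const std::string& pattern)
{
    int state = 2, misses = 0;
    const int n = 1000;
    for (int i = 0; i < n; ++i)
    {
        bool taken = pattern[i % pattern.size()] == 'T';
        bool predicted = state >= 2;
        if (predicted != taken) ++misses;
        if (taken)  { if (state < 3) ++state; }
        else        { if (state > 0) --state; }
    }
    return misses * 100 / n;
}

int main()
{
    std::cout << "always taken: " << missPercent("T")    << "% misses\n";
    std::cout << "TTFF pattern: " << missPercent("TTFF") << "% misses\n";
}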

steveha: I think you're making assumptions about how the CPU branch predictor works, and I disagree with that methodology. I don't know how advanced that branch predictor is, but I suspect it's far more advanced than you think. You're probably right, but measurements would definitely be good.

Right, you can also just use the bit directly and multiply (data[c]>>7 - which is discussed somewhere here as well); I intentionally left this solution out, but of course you are correct. Just a small note: The rule of thumb for lookup tables is that if it fits in 4KB (because of caching), it'll work - preferably make the table as small as possible. For managed languages I'd push that to 64KB, for low-level languages like C++ and C, I'd probably reconsider (that's just my experience). Since typeof(int) = 4, I'd try to stick to max 10 bits.

I think indexing with the 0/1 value will probably be faster than an integer multiply, but I guess if performance is really critical you should profile it. I agree that small lookup tables are essential to avoid cache pressure, but clearly if you have a bigger cache you can get away with a bigger lookup table, so 4KB is more a rule of thumb than a hard rule. I think you meant sizeof(int) == 4? That would be true for 32-bit. My two-year-old cell phone has a 32KB L1 cache, so even a 4K lookup table might work, especially if the lookup values were a byte instead of an int.

Possibly I'm missing something, but in your j-equals-0-or-1 method, why don't you just multiply your value by j before adding it rather than using the array indexing (possibly it should be multiplied by 1-j rather than j)?

steveha Multiplication should be faster, I tried looking it up in the Intel books, but couldn't find it... either way, benchmarking also gives me that result here.

steveha P.S.: another possible answer would be int c = data[j]; sum += c & -(c >> 7); which requires no multiplications at all.
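
That multiplication-free variant in context, reusing the question's variables; it relies on the values being 0..255, so c >> 7 is exactly the ">= 128" bit:

for (unsigned j = 0; j < arraySize; ++j)
{
    int c = data[j];
    sum += c & -(c >> 7);   // mask is 0 below 128, all ones at or above
}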

sum= 3137536 - clever. That's kinda obviously not the point of the question. The question is clearly about explaining surprising performance characteristics. I'm inclined to say that the addition of doing std::partition instead of std::sort is valuable. Though the actual question extends to more than just the synthetic benchmark given.
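
A sketch of the std::partition idea: the loop only needs the ">= 128" values grouped together, not a full ordering, and partitioning is O(n) versus O(n log n) for sorting:

#include <algorithm>

// Move all values below 128 to the front; the branch in the summing
// loop then changes direction only once, which predicts perfectly.
std::partition(data, data + arraySize,
               [](int x) { return x < 128; });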

DeadMG: this is indeed not the standard dichotomic search for a given key, but a search for the partitioning index; it requires a single compare per iteration. But don't rely on this code, I have not checked it. If you are interested in a guaranteed correct implementation, let me know.

How are two instructions executed together? Is this done with separate CPU cores, or is pipelining integrated into a single CPU core?

M.kazemAkhgary It's all inside one logical core. If you're interested, this is nicely described in, for example, the Intel Software Developer's Manual.

That is a very interesting article (in fact, I have just read all of it), but how does it answer the question?

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - From Review

PeterMortensen I am a bit flummoxed by your question. For example, here is one relevant line from that piece: "When the input is unsorted, all the rest of the loop takes substantial time. But with sorted input, the processor is somehow able to spend not just less time in the body of the loop, meaning the buckets at offsets 0x18 and 0x1C, but vanishingly little time on the mechanism of looping." The author is trying to discuss profiling in the context of the code posted here, and in the process trying to explain why the sorted case is so much faster.

I did not understand a thing. +1 because I liked the pictures.

What is the source/context for the last link?

It adds some fancy ascii art :>

Thank you, tim.

Well, it added 284 rep to the author.

Yeah, but I didn't do this for the rep.

It's just 190, btw.
