I was implementing an algorithm in Swift Beta and noticed that the performance was very poor. After digging deeper I realized that one of the bottlenecks was something as simple as sorting arrays. The relevant part is here:

```swift
let n = 1000000
var x = [Int](repeating: 0, count: n)
for i in 0..<n {
    x[i] = random()
}
// start clock here
let y = sort(x)
// stop clock here
```

In C++, a similar operation takes 0.06 s on my computer. In Python, it takes 0.6 s (no tricks, just y = sorted(x) for a list of integers).

In Swift it takes 6 s if I compile it with the following command:

```
xcrun swift -O3 -sdk `xcrun --show-sdk-path --sdk macosx`
```

And it takes as much as 88 s if I compile it with the following command:

```
xcrun swift -O0 -sdk `xcrun --show-sdk-path --sdk macosx`
```

Timings in Xcode with "Release" vs. "Debug" builds are similar.

What is wrong here? I could understand some performance loss in comparison with C++, but not a 10-fold slowdown in comparison with pure Python.

Edit: weather noticed that changing -O3 to -Ofast makes this code run almost as fast as the C++ version! However, -Ofast changes the semantics of the language a lot; in my testing, it disabled the checks for integer overflows and array indexing overflows. For example, with -Ofast the following Swift code runs silently without crashing (and prints out some garbage):

```swift
let n = 10000000
print(n*n*n*n*n)
let x = [Int](repeating: 10, count: n)
print(x[n])
```

So -Ofast is not what we want; the whole point of Swift is that we have the safety nets in place. Of course, the safety nets have some impact on performance, but they should not make the programs 100 times slower. Remember that Java already checks for array bounds, and in typical cases the slowdown is by a factor much less than 2.
And in Clang and GCC we have got -ftrapv for checking (signed) integer overflows, and it is not that slow, either.

Hence the question: how can we get reasonable performance in Swift without losing the safety nets?

Edit 2: I did some more benchmarking, with very simple loops along the lines of

```swift
for i in 0..<n {
    x[i] = x[i] ^ 12345678
}
```

(Here the xor operation is there just so that I can more easily find the relevant loop in the assembly code. I tried to pick an operation that is easy to spot but also "harmless" in the sense that it should not require any checks related to integer overflows.)

Again, there was a huge difference in the performance between -O3 and -Ofast. So I had a look at the assembly code:

With -Ofast I get pretty much what I would expect. The relevant part is a loop with 5 machine language instructions.

With -O3 I get something that was beyond my wildest imagination. The inner loop spans 88 lines of assembly code. I did not try to understand all of it, but the most suspicious parts are 13 invocations of "callq _swift_retain" and another 13 invocations of "callq _swift_release". That is, 26 subroutine calls in the inner loop!

Edit 3: In comments, Ferruccio asked for benchmarks that are fair in the sense that they do not rely on built-in functions (e.g. sort). I think the following program is a fairly good example:

```swift
let n = 10000
var x = [Int](repeating: 1, count: n)
for i in 0..<n {
    for j in 0..<n {
        x[i] = x[j]
    }
}
```

There is no arithmetic, so we do not need to worry about integer overflows. The only thing that we do is just lots of array references. And the results are here; Swift -O3 loses by a factor of almost 500 in comparison with -Ofast:

```
C++ -O3:       0.05 s
C++ -O0:       0.4 s
Java:          0.2 s
Python (PyPy): 0.5 s
Python:        12 s
Swift -Ofast:  0.05 s
Swift -O3:     23 s
Swift -O0:     443 s
```

(If you are concerned that the compiler might optimize out the pointless loops entirely, you can change it to e.g. x[i] ^= x[j], and add a print statement that outputs x[0].
This does not change anything; the timings will be very similar.)

And yes, here the Python implementation was a stupid pure Python implementation with a list of ints and nested for loops. It should be much slower than unoptimized Swift. Something seems to be seriously broken with Swift and array indexing.

Edit 4: These issues (as well as some other performance issues) seem to have been fixed in Xcode 6 beta 5. For sorting, I now have the following timings:

```
clang++ -O3:   0.06 s
swiftc -Ofast: 0.1 s
swiftc -O:     0.1 s
swiftc:        4 s
```

For nested loops:

```
clang++ -O3:   0.06 s
swiftc -Ofast: 0.3 s
swiftc -O:     0.4 s
swiftc:        540 s
```

It seems that there is no longer any reason to use the unsafe -Ofast (a.k.a. -Ounchecked); plain -O produces equally good code.


Here is another "Swift 100 times slower than C" question: stackoverflow.com/questions/24102609/…

And here is discussion on Apple's marketing material related to Swift's good performance in sorting: programmers.stackexchange.com/q/242816/913

You can compile with: xcrun --sdk macosx swift -O3. It's shorter.

This link shows some other basic operations in comparison to Objective-C.

With Beta 5 there has been substantial improvement in Swift's speed -- see this post by Jesse Squires for more detail.

Using -emit-sil to output the intermediate SIL code shows what's being retained (argh, stack overflow is making this impossible to format). It's an internal buffer object in the Array. This definitely sounds like an optimizer bug, the ARC optimizer should be able to remove the retains without -Ofast.

I'll just disagree that we have to use another language if we want -Ofast optimizations. We would have to deal with the question of bounds checks and other minor problems anyway if we picked another language like C. Swift is cool precisely because it is secure by default and optionally fast and unsafe when needed. This also lets the programmer debug the code first, make sure everything is OK, and then compile with -Ofast. The possibility of using modern standards and yet having the power of an "unsafe" language like C is very cool.

Beta 5 should improve this a bunch as well.

I made a final update: Swift is now as fast as C in this benchmark using standard optimisations.

Tip: Both your Swift and C implementations of quicksort can be improved if you recurse on the smaller partition first (instead of always recursing on the left partition first). Quicksort implemented with a simple pivot selection takes O(n^2) time in the worst case, but even in this worst case you only need O(log n) stack space by recursing on the smaller partition first.

The compiler might not be able to unbox the array or the array elements, though, since they're getting passed to sort(), which is an external function and has to get the arguments it's expecting. That shouldn't matter to a reasonably good compiler: it could pass metadata about the actual data in the pointer (64 bits offer a lot of leeway) and branch on it in the called function.

What exactly makes -Ofast "totally unsafe"? Assuming you know how to test your code and rule out overflows.

sjeohp: That's actually assuming a lot :-) Checking the code and ruling out overflows is hard to do. From my experience (I do compiler work and have checked some big codebases), and from what I've heard from people who do compiler work at huge companies, getting overflows and other undefined behavior right is hard. Even Apple's advice on fixing UB is sometimes wrong (just one example: randomascii.wordpress.com/2014/04/17/… ). -Ofast also changes language semantics, but I can't find any docs for it. How can you be confident you know what it's doing?

bestsss: It's possible, but it might not be useful. It adds checks on every access to an Int[]. It depends if arrays of Int and a few other primitive types (you have, at most, 3 bits) are used a lot (especially when you can lower to C if you need to). It also uses up some bits that they might want to use if, eventually, they want to add non-ARC GC. It doesn't scale to generics with more than one argument, either. Since they have all the types, it would be much easier to specialize all code that touched Int[] (but not Int?[]) to use inlined Int. But then you have Obj-C interop to worry about.

filcab, non-ARC (i.e. real) GC would actually be useful, but they need something that's not C compatible if they want a truly concurrent, non-STW GC. I wouldn't worry about 'every access to Int[]', since that depends on how much the compiler can inline, and it should be able to inline the tight loops with/after some guidance.

In reality you should always call vector::reserve() before inserting ten million elements.

Perhaps! Only the sort is being timed at the moment.

Perhaps I am completely wrong, but according to en.wikipedia.org/wiki/Quicksort, the average number of comparisons in Quicksort is 2*n*log(n). That is 13815 comparisons for sorting n = 1000 elements, so if the comparison function is called about 11000 times that does not seem so bad.

Also, Apple claimed that a "complex object sort" (whatever that is) is 3.9 times faster in Swift than in Python. Therefore it should not be necessary to find a "better sorting function". But Swift is still in development ...

It does refer to the natural logarithm.

log(n) for algorithmic complexity conventionally refers to log base-2. The reason for not stating the base is that the change-of-base law for logarithms only introduces a constant multiplier, which is discarded for the purposes of O-notation.

Regarding the discussion about natural logarithm vs base 2 logarithm: The precise statement from the Wikipedia page is that the average number of comparisons needed for n elements is C(n) = 2n ln n ≈ 1.39n log₂ n. For n = 1000 this gives C(n) = 13815, and it is not a "big-O notation".

This answer is now out of date. As of Swift 4.1 the whole module optimisation option is a separate boolean that can be combined with other settings and there is now an -Os to optimise for size. I may update when I have time to check the exact option flags.
