> However, dynamic compilation has to be even more mindful of the costs of optimisation, since time spent dynamically compiling is bad for the user (draining their battery or simply taking potential CPU time away from the user). Thus while dynamic compilation is the most effective way of determining which parts of a program will benefit most from optimisation, it has much less leeway to use expensive optimisations than static compilation.
WebKit used LLVM as its optimizing JIT compiler for a couple years, and it was fine. We only switched away from LLVM when we wrote a compiler that generated code that was as good as LLVM (for our narrow use case) but in less compile time; we would probably switch back to LLVM (or any other heavy AOT-style compiler) if it gave us more throughput than our current compiler. In general, we're happy to pay compile time to get peak throughput.
The reason why we have this attitude is that we have more than one JIT. WebKit has three JIT tiers for JS and two for WebAssembly. The top tier does crazy expensive optimizations (the most expensive of which is a must-points-to analysis for removing object allocations; in second place is the Yorktown-style regalloc) and has long compile times, but it's OK because by the time we run it, we've already done a pretty good job of optimizing the code with the JIT tier just below the top one. The second-to-top JIT tier in WebKit, called the DFG JIT, runs very quickly.
Overall this is a great post! I just think that there is even more potential for perf with JITs than what the post implies.
> Shouldn't a program which runs a UI (which WebKit basically is) optimize for lowest latency instead?
Do you want to achieve low latency with low latency, or is low latency so important that you don't mind high latency to get it? :)
What I mean is: yes, people want the lowest latency. But you can either compile quickly, reaching a reasonably low latency when using the app and reaching that point itself with low latency, or you can compile with optimisations, getting even lower latency once the compilation is done but waiting with high latency to get to that point.
Both approaches are 'optimising for latency' so you'd need to be more specific.
The practical compromise is tiers of compilation, each taking higher latency to apply but producing lower latency when they're done.
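A minimal sketch of that policy (the names and thresholds are invented, not from any particular engine): execute cheaply right away, and only pay the latency of an expensive compile once a function has proven itself hot.

    #include <stdio.h>

    enum tier { INTERPRETER, BASELINE_JIT, OPTIMIZING_JIT };

    struct func { enum tier tier; long calls; };

    static void record_call(struct func *f) {
        f->calls++;
        if (f->tier == INTERPRETER && f->calls > 100) {
            f->tier = BASELINE_JIT;    /* quick compile: reached with little delay */
            printf("baseline JIT after %ld calls\n", f->calls);
        } else if (f->tier == BASELINE_JIT && f->calls > 10000) {
            f->tier = OPTIMIZING_JIT;  /* slow compile: lowest latency once it's done */
            printf("optimizing JIT after %ld calls\n", f->calls);
        }
    }

    int main(void) {
        struct func hot = { INTERPRETER, 0 };
        for (long i = 0; i < 20000; i++)
            record_call(&hot);
        return 0;
    }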
> Shouldn't a program which runs a UI (which WebKit basically is) optimize for lowest latency instead?
Note that this is only the third (top) tier JIT (or fourth if you include the interpreter). By that point, you're really optimising for throughput and not compile time.
All the browsers run their top-tier JIT on a separate thread (if not multiple threads), as far as I know, precisely so it doesn't interfere with execution.
One neat thing about Common Lisp is that you can specify compilation directives in an extremely granular way: per file, per function, and even in a scope-delimited fashion within a part of a function. You can do this with Lisp's DECLAIM, DECLARE, and LOCALLY DECLARE syntax respectively.
Compiler directives are given as numerical levels of speed, safety, compilation speed, code size, and debuggability. There are other more specialized directives beyond those.
If you're building a web app and the backend has 2 or 3 really hot code paths, you can do something like:
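(A sketch: the function body and the array type declaration are mine, only the OPTIMIZE declaration is the point.)

    (defun hot-path (buffer)
      (declare (optimize (speed 3) (safety 0) (debug 0))
               (type (simple-array double-float (*)) buffer))
      ;; safety 0 drops bounds/overflow checks, debug 0 drops debug
      ;; bookkeeping, speed 3 tells the compiler to go all out
      (loop for x across buffer sum x))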
This roughly reads: "For ONLY the function HOT-PATH, compile it only with execution speed in mind. Eliminate extra runtime checks like array bounds checking or overflow checking, because I have meticulously checked and proved that not to be an issue. While you're at it, don't mind how debuggable it is at runtime; don't record anything about the call or its internal workings, perform tail call elimination, etc."
I'll usually abstract away the DECLARE with more semantically meaningful terms: OPTIMIZE-DANGEROUSLY-FAST, OPTIMIZE-FOR-SCRUTINY, etc.
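One way to spell those (a sketch: the names come from the line above, the definitions are mine) is as macros that wrap their body in a LOCALLY declaration:

    (defmacro optimize-dangerously-fast (&body body)
      ;; aggressive policy: no checks, no debug info
      `(locally (declare (optimize (speed 3) (safety 0) (debug 0)))
         ,@body))

    (defmacro optimize-for-scrutiny (&body body)
      ;; conservative policy: keep all checks and debug info
      `(locally (declare (optimize (speed 0) (safety 3) (debug 3)))
         ,@body))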
In general, in Lisp, I would say it's even considered bad hygiene to flip the global optimization switch; as djb and the article say, most of your large programs don't need it. You also often sacrifice debuggability in the entirety of your program if you ask the compiler to do what it can to make it fast.
The problem with directives is not on day 1, they're awesome on day 1. The problem is usually that the directives are in the wrong place 10 years later.
Loop unrolling directives went through that process in just a few years, from giving a nice speedup, to compilers being able to mostly match the directive, to the directive causing slowdowns, to compilers ignoring that directive.
... at which point you can edit a single point in the source code to fix it, either defining hot-path to be equivalent to eval, removing it outright, or altering it to be better. Isn't that the point of abstraction?
Alternatively, Alive allows users to specify rewrite rules for peephole optimizations. The rules are verified with Z3 and then compiled into an LLVM pass. If the rewrite can't be verified, then Alive won't allow it.
And then there's STOKE, which uses stochastic search to investigate possible program transformations of x86 to find non-obvious code sequences. STOKE also verifies the resulting transformation with Z3.
> [O]ptimising compilers are operating in a region between a problem that’s not worth solving (most parts of most programs) and a problem they can’t solve (the really performance critical parts).
I know this isn't the position being put forward or defended by the author but I feel it's worth addressing nonetheless.
The flaw in this logic is that whether a problem is worth solving or not depends not only on the benefits of doing so, but also the costs. In this case, a compiler which optimises for size can, for almost zero cost, provide a modest benefit. This is absolutely worth doing.
What we need is not to drop the optimising compiler entirely, but to make the parts compilers are good at available for programmers to use in a more controlled way. Something along the lines of what https://www.confluent.io/blog/turning-the-database-inside-ou... does to the database.
One might do something similar to what's done in high-assurance development. Namely, they do source-to-object-code checks to ensure compiler optimizations haven't broken anything. Some of the IDEs will even pull up the code for you side-by-side. There's also CompSci and industrial work on verification tools that verify properties of C or SPARK code, verify assembly properties, or check equivalence of two pieces of code. A short-cut might be to do a certified compilation to get code that definitely works, then do a full optimization of the same function, and run the results of each through equivalence checks (formal or thorough testing).
I was already looking into doing something similar after seeing the VeLLVM project and KLEE. The idea was to make an optimizing compiler generate each intermediate step in C code so KLEE can equivalence-check it. Once low-level enough, do that in KLEE and/or VeLLVM. Naturally, this is so time-consuming that one would do it only for code that doesn't change much or is really critical. The component handling the commits in your link would be a candidate.
My hope is that giving the programmer more control over optimization, and keeping better track of program properties, would together allow more optimizations that are correct by construction. Like, give the developer visibility into whether an operation is going to be hoisted out of a loop by requiring them to express the fact that it doesn't change (or at least showing them why this was/wasn't inferred). And then a change that makes that stop working could (in the cases where the programmer expressed that they cared about the performance) become a compilation failure rather than a silent performance regression. As a thoroughly trivial example, scala's @tailrec works like this. At the moment it's only really possible because TRO is applied in Scala as a near-syntactic-level transformation rather than a low-level optimization, but if the transformation from syntax to machine code was more gradual and visible, maybe the same kind of thing could be applied more generally, and we could e.g. incorporate performance characteristics into the type system under understandable composition rules.
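The contract looks like this (a sketch, the names are mine): the annotated method compiles only as long as the recursion stays in tail position, so a change that breaks the optimization becomes a compile error rather than a silent stack-depth and performance regression.

    import scala.annotation.tailrec

    object TailrecDemo {
      @tailrec
      def sumTo(n: Int, acc: Int = 0): Int =
        if (n == 0) acc else sumTo(n - 1, acc + n)   // tail call: compiles

      // @tailrec
      // def badSum(n: Int): Int =
      //   if (n == 0) 0 else n + badSum(n - 1)      // not a tail call: @tailrec
      //                                             // would make this a compile error
    }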
Yes I think this would be useful for solving the "order of optimization passes" problem. Right now I just treat my compiler as a black box. But if the guts were exposed a little bit, the compiler and I could work together to optimize a piece of code.
I could give it a few hints about what order might work, and it could show me what the intermediate representations are so that I can keep them in mind while writing code. There are patterns in C/C++ that trip up compilers, like unintentional aliasing or out params, and I could try to avoid them to help the optimizer. On the other hand, if the optimizer doesn't care, I will use them for my own convenience.
I suppose Clang has a pretty well-defined interface between passes, in the LLVM IR, so you can do this to some degree now. But it is a little obscure and probably version-dependent.
The problem is that there's no guarantee of stability in the IR, so you're either stuck pinned to an old version of LLVM or scrambling to keep up with each breaking change.
I used to think that, but I no longer do. Most code is plenty fast enough already, sure, but for the cases where performance is a requirement, it needs to be exposed and accessible in a way that the developer can work with it, the same way as for any other feature. Compare what's happened with 3D graphics APIs - OpenGL tried to abstract over the hardware capabilities and give the illusion that your card was always capable of everything, but actually that just resulted in developers having to almost reverse engineer what the OpenGL implementation was doing so that they could find out what the actual hardware fast path was and use it. So newer APIs concentrate on exposing more of the hardware details and giving the developer more precise control over what's going on.
What distinction are you drawing? I think concerns like "this operation is O(n^2) in the length of the passed list" are absolutely the responsibility of the language, and probably even lower-level details. I'm not sure how to expose "this operation takes n machine-code instructions" without sacrificing cross-platformness, but I do think that's something we're going to need for important use cases (e.g. cryptographic libraries need to be able to do various things in constant time, which no existing language really has first-class support for).
This is true in many cases, but there's subtleties.
Rust was designed to be optimized; this means unoptimized Rust code can easily be 2x-100x slower than optimized code. Given that speed is one of Rust's goals, yeah, you could implement a Rust compiler with no optimization, and that's fine. But it won't be usable to a significant chunk of users who rely on the speed.
Another way this can affect language design: Rust supports enums, which are a sum type. The memory layout of the enum is undefined at the language level. This is specifically so that compilers can do optimizations to make layout more compact than a naive implementation. In some sense, this is leaking optimization concerns out into language design.
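A small illustration of that layout freedom (a sketch, not from the original comment): because nothing pins down the representation, the compiler can pack the discriminant into otherwise-invalid bit patterns, so Option<&u8> is pointer-sized with no separate tag.

    use std::mem::size_of;

    fn main() {
        // None is encoded as the (otherwise invalid) null pointer value,
        // so no extra tag byte is needed.
        assert_eq!(size_of::<Option<&u8>>(), size_of::<&u8>());
        println!("&u8: {} bytes, Option<&u8>: {} bytes",
                 size_of::<&u8>(), size_of::<Option<&u8>>());
    }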
I was imagining direct language ties to (say) LLVM passes, and things like that.
Designing the language around optimizability is a good thing, e.g. C++'s "as if" rule ([intro.execution] in C++11 and later standards). Some of these features, whether or not they're used for optimization, can also be very useful for code quality and correctness: for example, language-enforced purity of functions in D helps the optimizer (most compilers can infer purity, but the guarantee at least saves them the trip) and lets you trust that a function, or even a D function pointer, really is pure.
Does profile-guided optimisation really 'hone in on the parts of a program which will most benefit from optimisation' ?
I thought that all it really did, at least on gcc, was record which branches were taken and optimise the code to reflect the common path. Do compilers really make use of the profile timings to decide to spend more time optimising 'hot' parts of the code with more involved/expensive transformations?
Last I checked, gcc in PGO mode can basically do things like "-O2 for hot code, -Os for everything else".
MSVC in PGO/LTO mode will do all sorts of exciting things. For example, in hot code it will detect that a virtual function call is always or nearly always happening with the same vtable pointer and emit a guard on the vtable pointer followed by an inlining of that one particular implementation (with a fallback to a normal virtual call if the guard fails). I'm pretty sure it doesn't do this for all virtual function calls in the program. Even if it only does it for ones where it has "enough data", that would bias towards the hot ones.
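In source terms the transformation is roughly the following (a sketch: the types are invented, and the real generated code guards on the vtable pointer rather than using dynamic_cast):

    #include <iostream>

    struct Shape  { virtual double area() const = 0; virtual ~Shape() = default; };
    struct Circle : Shape {
        double r;
        explicit Circle(double radius) : r(radius) {}
        double area() const override { return 3.141592653589793 * r * r; }
    };

    double total(const Shape& s) {
        // Before PGO: a plain virtual call, return s.area();
        // After PGO, when the profile says s is almost always a Circle:
        if (const Circle* c = dynamic_cast<const Circle*>(&s))
            return 3.141592653589793 * c->r * c->r;  // guard hit: inlined Circle::area()
        return s.area();                             // guard miss: normal virtual dispatch
    }

    int main() {
        Circle c{2.0};
        std::cout << total(c) << "\n";
    }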
PGO can also record loop sizes, which helps with decisions about vectorization, prefetching, non-temporal stores, etc etc. Call counts are useful for inlining and function specialization decisions. I don't know what gcc does, but Open64 does all of the above.
> Do compilers really make use of the profile timings to decide to spend more time optimising 'hot' parts of the code with more involved/expensive transformations?
Not all code has a few small locations where the CPU spends all its time. Take the Linux kernel: what part is the 10% that needs optimization first? Depending on which benchmark you run you will get a different answer, so having really good code generation everywhere helps considerably. So yes, there is a place for hand-optimized compression routines, but there is also a place for compilers to optimize the remaining 99% of code.
These days, with CPU vendors putting in massive effort for <10% performance improvements, getting a 2x speedup in applications that compile to hundreds of MBs is more "free" performance than you are likely to gain from the hardware vendors over the next decade.
Lastly, the kinds of code paths most suited to these gains are also the hardest to benchmark. Which is why I fall into the camp that believes that, for anything more complex than throwaway code, picking a fairly performant language is more valuable than the small bump in developer productivity you get from a little syntactic sugar.
> Take the Linux kernel: what part is the 10% that needs optimization first?
Well syscall entry/exit would be a contender, but those are already in hand-written assembly anyway, so that doesn't count.
Then I'd go for the read()/write() syscalls; I bet (not verified) that they're called really often in comparison to the other ones, with all the file/socket/IPC I/O that's going on.
I'm in the same camp about language choice, though; there's lots of room for slower and easier languages, like Python in the role of gluing together high-performance C code, but there are still also lots of scenarios where an across-the-board performant language seems a good fit. Compilers spring to mind, though they are perhaps an extreme example.
The article focuses on throughput performance, which is obviously very important, but maybe it's also worth mentioning that code size matters a lot in some cases, like delivering code on the web.
If an optimizing compiler can reduce code size by 50%, that's extremely useful, and as opposed to throughput optimization for which perhaps only a few small methods matter, for code size the whole program matters, so humans can't compete with compilers there.
Not just that, for many, many applications these days, latency and predictability are more important than throughput.
And throughput is actually easier (see: Latency Lags Bandwidth). So we're often/usually getting help with the less important problem that's easier to solve, while at the same time making the more important and harder problems more difficult.
Paul Graham once made a point about designing languages for yourself, or for people who are not as smart as you are, or for people who are smarter than you are. https://youtu.be/agw-wlHGi0E?t=485
Somehow this blog and the debate with djb remind me of this point. What kind of person are the compiler optimizations intended to benefit?
The author mentions how Chrome & Firefox gained a 10% speedup with Profile-Guided Optimization with GCC et al.
I.e. you compile your program with profiling, then run e.g. a test suite, and compile your release binary with input from that profiling so the compiler knows what to optimize.
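With GCC the loop looks roughly like this (a sketch; the file name and workload are stand-ins, the flags are real):

    gcc -O2 -fprofile-generate -o app main.c   # instrumented build
    ./app < representative-input               # run a representative workload; writes .gcda profile files
    gcc -O2 -fprofile-use -o app main.c        # rebuild using the recorded profile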
Has any C compiler closed the loop on that by shipping the full compiler with the program to do runtime optimization? With on-the-fly symbol replacement you'd effectively have a C JIT compiler.
To start, you don't have to do it that way.
You can use sampling based profiling so that there is no training set necessary:
http://dl.acm.org/citation.cfm?id=2854044
Past that, LLVM is already like this, it's just a matter of shipping bitcode, and outputting the currently optimized bitcode at shutdown.
It is actually done in practice, but admittedly, only by those who care tremendously about performance.
(I.e. you don't find it done in common Linux distros.)
Particularly in a datacenter world, where you have hypervisors and things under the covers anyway, it's not a huge deal to have a JIT running.
The harder part is actually, if you are a cloud provider, getting anyone to bring bitcode with them :)
Even if you promise them X% performance gain to start, where X is usually 10-30%, they don't usually care enough to modify their workflow.
> Has any C compiler closed the loop on that by shipping the full compiler with the program to do runtime optimization?
- The truffleruby "cext" project demonstrated something like this. They interpret and then JIT C on the JVM, and do cross-language inlining and other optimizations.
- There are also interactive C(++) "JITs" from CERN and the Julia project. However, as far as I know, they are both method JITs and don't do any trace-based re-JIT/OSR.
Google has this in production. According to https://research.google.com/pubs/pub45290.html AutoFDO is deployed to hundreds of binaries at Google. Over half of CPU cycles used are now spent in some flavor of FDO-optimized binaries. And that was in 2016.
The non-trivial part is that most optimizations are contextual, not "one function at a time". Re-optimizing only a single function may not even be possible/useful with things like partial inlining/outlining, etc.
If you start with optimized code from the AOT compiler, and do sampling-based profiling every so often, the overhead is minuscule.
The only time it would probably be a net loss is when you have a very large number of short-lived executions.
Err, people already do it, but you are probably not going to see widespread adoption anytime soon no matter what.
Until nobody has to think about it, nothing changes
Regarding undefined behavior, one thing I very often assume is 2's complement math with wraparound on overflow. That assumption has never failed for me, so it really bothers me that this is undefined in the standard.
I have used processors that have saturating arithmetic and I even used that feature, but even there the compiler defaulted to the more common rollover behavior.
For C, the standard specifies that unsigned arithmetic wraps around, while signed integer overflow is undefined.
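A concrete consequence (a sketch, not from the comment above): an optimizer may assume signed overflow never happens and fold the first comparison below to a constant, while the unsigned version has to honour wraparound.

    #include <limits.h>
    #include <stdio.h>

    int overflows_signed(int x)    { return x + 1 > x; } /* UB at INT_MAX: may be folded to 1 */
    int wraps_unsigned(unsigned x) { return x + 1 > x; } /* defined: 0 when x == UINT_MAX */

    int main(void) {
        /* with optimization on, GCC/Clang typically print "1 0" here */
        printf("%d %d\n", overflows_signed(INT_MAX), wraps_unsigned(UINT_MAX));
        return 0;
    }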
I once suggested, back when there were machines that weren't byte-oriented or twos complement, that wraparound arithmetic should be requested with
    unsigned int x;
    x = (x + 1) % 0x10000;  /* i.e. truncate to 16 bits */
or something similar. That way, you get the same result on all platforms, regardless of word length. If the hardware can truncate cheaply, the compiler can optimize such "%" expressions easily. Other than that, overflow should be an error.
This is true for any ISO-standardized language: C, C++, ECMAScript/JavaScript, COBOL, Ada, Forth, Fortran, Pascal, Ruby, SQL and more. C# dropped its ISO standard after 2.0.
I am not sure if they use any special legal construct, but last time I played around with Ada, the full Ada standard was available for free. The actual standard, not some revised draft. On top of that the rationale document which explains many of the design decisions was also available for free.
(It's been a while, but I found these rationale documents fascinating reading.)
I don't know about the other languages, but at least with regards to ECMAScript the specification can be accessed freely at https://www.ecma-international.org/ecma-262/7.0/index.html - though the License chapter says that "after approval all rights on the standard are reserved by Ecma International."
I think ISO would say the standard is open, but not free. You're not required to be a member of a working group to access the final standard, you just have to buy the book.
Which has created a history of implementations that usually follow the standard quite strictly.
(Though, I do personally despise the non-free aspect, I can sort of understand it, considering history and the effort to standardise).
Just as a data point, I wanted to see what the benefit of JVM HotSpot is over the JVM just interpreting. Code that normally runs in 22 minutes with HotSpot is still running 99 minutes later in interpreter-only mode!
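(For reference, interpreter-only mode is presumably selected with the standard -Xint flag; MyApp here is a stand-in class name:

    java -Xint MyApp   # interpreter only: JIT compilers disabled
    java MyApp         # default: tiered HotSpot compilation
)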
I am starting to think that the complaints about optimizing compilers are due to C family undefined behavior leading to C programmers having a love hate relationship with their compiler writers. In contrast Java programmers trust their JIT compilers to not screw up. But then UB is something that can cause issues even if there was no optimizing compiler at all.
So to conclude I am extremely grateful that there are optimizing compiler writers out there.
Another option is to compare a minimally optimizing compiler with a more optimizing compiler. One can do this in Java land with the client/server flags. Client takes 15% more time to run than server for our workload.
Now what is that 15% worth? Take the new AMD Epyc: going from 2GHz to 2.2GHz is worth 800USD (3200 -> 4000). That means running the JVM with -server gives a value, at that level, of about 1/3 of the server CPU cost over running the same JVM with -client. Of course that is at the top level and would be less in the middle.