> However, dynamic compilation has to be even more mindful of the costs of optimisation, since time spent dynamically compiling is bad for the user (draining their battery or simply taking potential CPU time away from the user). Thus while dynamic compilation is the most effective way of determining which parts of a program will benefit most from optimisation, it has much less leeway to use expensive optimisations than static compilation.
WebKit used LLVM as its optimizing JIT compiler for a couple years, and it was fine. We only switched away from LLVM when we wrote a compiler that generated code that was as good as LLVM (for our narrow use case) but in less compile time; we would probably switch back to LLVM (or any other heavy AOT-style compiler) if it gave us more throughput than our current compiler. In general, we're happy to pay compile time to get peak throughput.
The reason why we have this attitude is that we have more than one JIT. WebKit has three JIT tiers for JS and two for WebAssembly. The top tier does crazy expensive optimizations (the most expensive of which is a must-points-to analysis for removing object allocations; in second place is the Yorktown-style regalloc) and has long compile times, but it's OK because by the time we run it, we've already done a pretty good job of optimizing the code with the JIT tier just below the top one. The second-to-top JIT tier in WebKit, called the DFG JIT, runs very quickly.
Overall this is a great post! I just think that there is even more potential for perf with JITs than what the post implies.
> Shouldn't a program which runs a UI (which WebKit basically is) optimize for lowest latency instead?
Do you want to achieve low latency with low latency, or is low latency so important that you don't mind high latency to get it? :)
What I mean is: yes, people want the lowest latency. But you can either compile quickly, reaching a reasonably low latency when using the app and reaching that point itself with low latency, or you can compile with optimisations, getting even lower latency once the compilation is done but waiting with high latency to get to that point.
Both approaches are 'optimising for latency' so you'd need to be more specific.
The practical compromise is tiers of compilation, each taking higher latency to apply but producing lower latency when they're done.
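A minimal sketch of that policy (the names and thresholds are invented, not from any particular engine): execute cheaply right away, and only pay the latency of an expensive compile once a function has proven itself hot.

    #include <stdio.h>

    enum tier { INTERPRETER, BASELINE_JIT, OPTIMIZING_JIT };

    struct func { enum tier tier; long calls; };

    static void record_call(struct func *f) {
        f->calls++;
        if (f->tier == INTERPRETER && f->calls > 100) {
            f->tier = BASELINE_JIT;    /* quick compile: reached with little delay */
            printf("baseline JIT after %ld calls\n", f->calls);
        } else if (f->tier == BASELINE_JIT && f->calls > 10000) {
            f->tier = OPTIMIZING_JIT;  /* slow compile: lowest latency once it's done */
            printf("optimizing JIT after %ld calls\n", f->calls);
        }
    }

    int main(void) {
        struct func hot = { INTERPRETER, 0 };
        for (long i = 0; i < 20000; i++)
            record_call(&hot);
        return 0;
    }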
> Shouldn't a program which runs a UI (which WebKit basically is) optimize for lowest latency instead?
Note that this is only the third (top) tier JIT (or fourth if you include the interpreter). By that point, you're really optimising for throughput and not compile time.
All the browsers run their top-tier JIT on a separate thread (if not multiple threads), as far as I know, precisely so it doesn't interfere with execution.
One neat thing about Common Lisp is that you can specify compilation directives in an extremely granular way: per file, per function, and even in a scope-delimited fashion within a part of a function. You can do this with Lisp's DECLAIM, DECLARE, and LOCALLY DECLARE syntax respectively.
Compiler directives are given as numerical levels of speed, safety, compilation speed, code size, and debuggability. There are other more specialized directives beyond those.
If you're building a web app and the backend has 2 or 3 really hot code paths, you can do something like:
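(A sketch: the function body and the array type declaration are mine, only the OPTIMIZE declaration is the point.)

    (defun hot-path (buffer)
      (declare (optimize (speed 3) (safety 0) (debug 0))
               (type (simple-array double-float (*)) buffer))
      ;; safety 0 drops bounds/overflow checks, debug 0 drops debug
      ;; bookkeeping, speed 3 tells the compiler to go all out
      (loop for x across buffer sum x))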
This roughly reads: "For ONLY the function HOT-PATH, compile it only with execution speed in mind. Eliminate extra runtime checks like array bounds checking or overflow checking, because I have meticulously checked and proved that not to be an issue. While you're at it, don't mind how debuggable it is at runtime; don't record anything about the call or its internal workings, perform tail call elimination, etc."
I'll usually abstract away the DECLARE with more semantically meaningful terms: OPTIMIZE-DANGEROUSLY-FAST, OPTIMIZE-FOR-SCRUTINY, etc.
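One way to spell those (a sketch: the names come from the line above, the definitions are mine) is as macros that wrap their body in a LOCALLY declaration:

    (defmacro optimize-dangerously-fast (&body body)
      ;; aggressive policy: no checks, no debug info
      `(locally (declare (optimize (speed 3) (safety 0) (debug 0)))
         ,@body))

    (defmacro optimize-for-scrutiny (&body body)
      ;; conservative policy: keep all checks and debug info
      `(locally (declare (optimize (speed 0) (safety 3) (debug 3)))
         ,@body))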
In general, in Lisp, I would say it's even considered bad hygiene to flip the global optimization switch; as djb and the article say, most of your large programs don't need it. You also often sacrifice debuggability in the entirety of your program if you ask the compiler to do what it can to make it fast.
The problem with directives is not on day 1, they're awesome on day 1. The problem is usually that the directives are in the wrong place 10 years later.
Loop unrolling directives went through that process in just a few years, from giving a nice speedup, to compilers being able to mostly match the directive, to the directive causing slowdowns, to compilers ignoring that directive.
... at which point you can edit a single point in the source code to fix it, either defining hot-path to be equivalent to eval, removing it outright, or altering it to be better. Isn't that the point of abstraction?
Alternatively, Alive allows users to specify rewrite rules for peephole optimizations. The rules are verified with Z3 and then compiled into an LLVM pass. If the rewrite can't be verified, then Alive won't allow it.
And then there's STOKE, which uses stochastic search to investigate possible program transformations of x86 to find non-obvious code sequences. STOKE also verifies the resulting transformation with Z3.
> [O]ptimising compilers are operating in a region between a problem that’s not worth solving (most parts of most programs) and a problem they can’t solve (the really performance critical parts).
I know this isn't the position being put forward or defended by the author but I feel it's worth addressing nonetheless.
The flaw in this logic is that whether a problem is worth solving or not depends not only on the benefits of doing so, but also the costs. In this case, a compiler which optimises for size can, for almost zero cost, provide a modest benefit. This is absolutely worth doing.
What we need is not to drop the optimising compiler entirely, but to make the parts compilers are good at available for programmers to use in a more controlled way. Something along the lines of what https://www.confluent.io/blog/turning-the-database-inside-ou... does to the database.
One might do something similar to what's done in high-assurance development. Namely, they do source-to-object-code checks to ensure compiler optimizations haven't broken anything. Some of the IDEs will even pull up the code for you side-by-side. There's also CompSci and industrial work on verification tools that verify properties of C or SPARK code, verify assembly properties, or check equivalence of two pieces of code. A short-cut might be to do a certified compilation to get code that definitely works, then do a full optimization of the same function, and run the results of each through equivalence checks (formal or thorough testing).
I was already looking into doing something similar after seeing the VeLLVM project and KLEE. The idea was to make an optimizing compiler generate each intermediate step in C code so KLEE can equivalence-check it. Once low-level enough, do that in KLEE and/or VeLLVM. Naturally, this is so time-consuming that one would do it only for code that doesn't change much or is really critical. The component handling the commits in your link would be a candidate.
My hope is that giving the programmer more control over optimization, and keeping better track of program properties, would together allow more optimizations that are correct by construction. Like, give the developer visibility into whether an operation is going to be hoisted out of a loop by requiring them to express the fact that it doesn't change (or at least showing them why this was/wasn't inferred). And then a change that makes that stop working could (in the cases where the programmer expressed that they cared about the performance) become a compilation failure rather than a silent performance regression. As a thoroughly trivial example, scala's @tailrec works like this. At the moment it's only really possible because TRO is applied in Scala as a near-syntactic-level transformation rather than a low-level optimization, but if the transformation from syntax to machine code was more gradual and visible, maybe the same kind of thing could be applied more generally, and we could e.g. incorporate performance characteristics into the type system under understandable composition rules.
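The contract looks like this (a sketch, the names are mine): the annotated method compiles only as long as the recursion stays in tail position, so a change that breaks the optimization becomes a compile error rather than a silent stack-depth and performance regression.

    import scala.annotation.tailrec

    object TailrecDemo {
      @tailrec
      def sumTo(n: Int, acc: Int = 0): Int =
        if (n == 0) acc else sumTo(n - 1, acc + n)   // tail call: compiles

      // @tailrec
      // def badSum(n: Int): Int =
      //   if (n == 0) 0 else n + badSum(n - 1)      // not a tail call: @tailrec
      //                                             // would make this a compile error
    }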
Yes I think this would be useful for solving the "order of optimization passes" problem. Right now I just treat my compiler as a black box. But if the guts were exposed a little bit, the compiler and I could work together to optimize a piece of code.
I could give it a few hints about what order might work, and it could show me what the intermediate representations are so that I can keep them in mind while writing code. There are patterns in C/C++ that trip up compilers, like unintentional aliasing or out params, and I could try to avoid them to help the optimizer. On the other hand, if the optimizer doesn't care, I will use them for my own convenience.
I suppose Clang has a pretty well-defined interface between passes, in the LLVM IR, so you can do this to some degree now. But it is a little obscure and probably version-dependent.
The problem is that there's no guarantee of stability in the IR, so you're either stuck pinned to an old version of LLVM or scrambling to keep up with each breaking change.
I used to think that, but I no longer do. Most code is plenty fast enough already, sure, but for the cases where performance is a requirement, it needs to be exposed and accessible in a way that the developer can work with it, the same way as for any other feature. Compare what's happened with 3D graphics APIs - OpenGL tried to abstract over the hardware capabilities and give the illusion that your card was always capable of everything, but actually that just resulted in developers having to almost reverse engineer what the OpenGL implementation was doing so that they could find out what the actual hardware fast path was and use it. So newer APIs concentrate on exposing more of the hardware details and giving the developer more precise control over what's going on.
What distinction are you drawing? I think concerns like "this operation is O(n^2) in the length of the passed list" are absolutely the responsibility of the language, and probably even lower-level details. I'm not sure how to expose "this operation takes n machine-code instructions" without sacrificing cross-platformness, but I do think that's something we're going to need for important use cases (e.g. cryptographic libraries need to be able to do various things in constant time, which no existing language really has first-class support for).
This is true in many cases, but there's subtleties.
Rust was designed to be optimized; this means unoptimized Rust code can easily be 2x-100x slower than optimized code. Given that speed is one of Rust's goals, yeah, you could implement a Rust compiler with no optimization, and that's fine. But it won't be usable to a significant chunk of users who rely on the speed.
Another way this can affect language design: Rust supports enums, which are a sum type. The memory layout of the enum is undefined at the language level. This is specifically so that compilers can do optimizations to make layout more compact than a naive implementation. In some sense, this is leaking optimization concerns out into language design.
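A small illustration of that layout freedom (a sketch, not from the original comment): because nothing pins down the representation, the compiler can pack the discriminant into otherwise-invalid bit patterns, so Option<&u8> is pointer-sized with no separate tag.

    use std::mem::size_of;

    fn main() {
        // None is encoded as the (otherwise invalid) null pointer value,
        // so no extra tag byte is needed.
        assert_eq!(size_of::<Option<&u8>>(), size_of::<&u8>());
        println!("&u8: {} bytes, Option<&u8>: {} bytes",
                 size_of::<&u8>(), size_of::<Option<&u8>>());
    }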
I was imagining direct language ties to (say) LLVM passes, and things like that.
Designing the language around optimizability is a good thing, e.g. C++'s "as if" rule ([intro.execution] in C++11 and later standards). Some of these features, whether or not they're used for optimization, can also be very useful for code quality and correctness: for example, language-enforced purity of functions in D helps the optimizer (most compilers can infer purity, but the guarantee at least saves them the trip) and lets you trust that a function, or even a D function pointer, really is pure.
Does profile-guided optimisation really 'hone in on the parts of a program which will most benefit from optimisation' ?
I thought that all it really did, at least on gcc, was record which branches were taken and optimise the code to reflect the common path. Do compilers really make use of the profile timings to decide to spend more time optimising 'hot' parts of the code with more involved/expensive transformations?
Last I checked, gcc in PGO mode can basically do things like "-O2 for hot code, -Os for everything else".
MSVC in PGO/LTO mode will do all sorts of exciting things. For example, in hot code it will detect that a virtual function call is always or nearly always happening with the same vtable pointer and emit a guard on the vtable pointer followed by an inlining of that one particular implementation (with a fallback to a normal virtual call if the guard fails). I'm pretty sure it doesn't do this for all virtual function calls in the program. Even if it only does it for ones where it has "enough data", that would bias towards the hot ones.
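In source terms the transformation is roughly the following (a sketch: the types are invented, and the real generated code guards on the vtable pointer rather than using dynamic_cast):

    #include <iostream>

    struct Shape  { virtual double area() const = 0; virtual ~Shape() = default; };
    struct Circle : Shape {
        double r;
        explicit Circle(double radius) : r(radius) {}
        double area() const override { return 3.141592653589793 * r * r; }
    };

    double total(const Shape& s) {
        // Before PGO: a plain virtual call, return s.area();
        // After PGO, when the profile says s is almost always a Circle:
        if (const Circle* c = dynamic_cast<const Circle*>(&s))
            return 3.141592653589793 * c->r * c->r;  // guard hit: inlined Circle::area()
        return s.area();                             // guard miss: normal virtual dispatch
    }

    int main() {
        Circle c{2.0};
        std::cout << total(c) << "\n";
    }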
PGO can also record loop sizes, which helps with decisions about vectorization, prefetching, non-temporal stores, etc etc. Call counts are useful for inlining and function specialization decisions. I don't know what gcc does, but Open64 does all of the above.
> Do compilers really make use of the profile timings to decide to spend more time optimising 'hot' parts of the code with more involved/expensive transformations?
Not all code has a few small locations where the CPU spends all its time. Take the Linux kernel: what part is the 10% that needs optimization first? Depending on which benchmark you run you will get a different answer, so having really good code generation everywhere helps considerably. So yes, there is a place for hand-optimized compression routines, but there is also a place for compilers to optimize the remaining 99% of code.
These days, with CPU vendors putting in massive effort for <10% performance improvements, getting a 2x speedup in applications that compile to hundreds of MBs is more "free" performance than you are likely to gain from the hardware vendors over the next decade.
Lastly, the kinds of code paths most suited to these gains are also the hardest to benchmark. Which is why I fall into the camp that believes that, for anything more complex than throwaway code, picking a fairly performant language is more valuable than the small bump in developer productivity you get from a little syntactic sugar.
> Take the Linux kernel: what part is the 10% that needs optimization first?
Well syscall entry/exit would be a contender, but those are already in hand-written assembly anyway, so that doesn't count.
Then I'd go for the read()/write() syscalls; I bet (not verified) that they're called really often in comparison to the other ones, with all the file/socket/IPC I/O that's going on.
I'm in the same camp about language choice, though; there's lots of room for slower and easier languages, like Python in the role of gluing together high-performance C code, but there are still also lots of scenarios where an across-the-board performant language seems a good fit. Compilers spring to mind, though they are perhaps an extreme example.
The article focuses on throughput performance, which is obviously very important, but maybe it's also worth mentioning that code size matters a lot in some cases, like delivering code on the web.
If an optimizing compiler can reduce code size by 50%, that's extremely useful, and as opposed to throughput optimization for which perhaps only a few small methods matter, for code size the whole program matters, so humans can't compete with compilers there.
Not just that, for many, many applications these days, latency and predictability are more important than throughput.
And throughput is actually easier (see: Latency Lags Bandwidth). So we're often/usually getting help with the less important problem that's easier to solve, while at the same time making the more important and harder problems more difficult.
Paul Graham once made a point about designing languages for yourself, or for people who are not as smart as you are, or for people who are smarter than you are. https://youtu.be/agw-wlHGi0E?t=485
Somehow this blog and the debate with djb remind me of this point. What kind of person are the compiler optimizations intended to benefit?
The author mentions how Chrome & Firefox gained a 10% speedup with Profile-Guided Optimization with GCC et al.
I.e. you compile your program with profiling, then run e.g. a test suite, and compile your release binary with input from that profiling so the compiler knows what to optimize.
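With GCC the loop looks roughly like this (a sketch; the file name and workload are stand-ins, the flags are real):

    gcc -O2 -fprofile-generate -o app main.c   # instrumented build
    ./app < representative-input               # run a representative workload; writes .gcda profile files
    gcc -O2 -fprofile-use -o app main.c        # rebuild using the recorded profile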
Has any C compiler closed the loop on that by shipping the full compiler with the program to do runtime optimization? With on-the-fly symbol replacement you'd effectively have a C JIT compiler.
To start, you don't have to do it that way.
You can use sampling based profiling so that there is no training set necessary:
http://dl.acm.org/citation.cfm?id=2854044
Past that, LLVM is already like this, it's just a matter of shipping bitcode, and outputting the currently optimized bitcode at shutdown.
It is actually done in practice, but admittedly, only by those who care tremendously about performance.
(I.e. you don't find it done in common Linux distros.)
Particularly in a datacenter world, where you have hypervisors and things under the covers anyway, it's not a huge deal to have a JIT running.
The harder part is actually, if you are a cloud provider, getting anyone to bring bitcode with them :)
Even if you promise them X% performance gain to start, where X is usually 10-30%, they don't usually care enough to modify their workflow.
> Has any C compiler closed the loop on that by shipping the full compiler with the program to do runtime optimization?
- The truffleruby "cext" project demonstrated something like this. They interpret and then JIT C on the JVM, and do cross-language inlining and other optimizations.
- There are also interactive C(++) "JITs" from CERN and the Julia project. However, as far as I know, they are both method JITs and don't do any trace-based re-JIT/OSR.
Google has this in production. According to https://research.google.com/pubs/pub45290.html AutoFDO is deployed to hundreds of binaries at Google. Over half of CPU cycles used are now spent in some flavor of FDO-optimized binaries. And that was in 2016.
The non-trivial part is that most optimizations are contextual, not "one function at a time". Re-optimizing only a single function may not even be possible/useful with things like partial inlining/outlining, etc.
If you start with optimized code from the AOT compiler, and do sampling-based profiling every so often, the overhead is minuscule.
The only time it would probably be a net loss is when you have a very large number of short-lived executions.
Err, people already do it, but you are probably not going to see widespread adoption anytime soon no matter what.
Until nobody has to think about it, nothing changes
Regarding undefined behavior, one thing I very often assume is 2's complement math with wraparound on overflow. That assumption has never failed for me, so it really bothers me that this is undefined in the standard.
I have used processors that have saturating arithmetic and I even used that feature, but even there the compiler defaulted to the more common rollover behavior.
For C, the standard specifies that unsigned arithmetic wraps around, while signed integer overflow is undefined.
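A concrete consequence (a sketch, not from the comment above): an optimizer may assume signed overflow never happens and fold the first comparison below to a constant, while the unsigned version has to honour wraparound.

    #include <limits.h>
    #include <stdio.h>

    int overflows_signed(int x)    { return x + 1 > x; } /* UB at INT_MAX: may be folded to 1 */
    int wraps_unsigned(unsigned x) { return x + 1 > x; } /* defined: 0 when x == UINT_MAX */

    int main(void) {
        /* with optimization on, GCC/Clang typically print "1 0" here */
        printf("%d %d\n", overflows_signed(INT_MAX), wraps_unsigned(UINT_MAX));
        return 0;
    }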
I once suggested, back when there were machines that weren't byte-oriented or twos complement, that wraparound arithmetic should be requested with
    unsigned int x;
    x = (x + 1) % 0x10000;  /* i.e. truncate to 16 bits */
or something similar. That way, you get the same result on all platforms, regardless of word length. If the hardware can truncate cheaply, the compiler can optimize such "%" expressions easily. Other than that, overflow should be an error.
This is true for any ISO-standardized language: C, C++, ECMAScript/JavaScript, COBOL, Ada, Forth, Fortran, Pascal, Ruby, SQL and more. C# dropped its ISO standard after 2.0.
I am not sure if they use any special legal construct, but last time I played around with Ada, the full Ada standard was available for free. The actual standard, not some revised draft. On top of that the rationale document which explains many of the design decisions was also available for free.
(It's been a while, but I found these rationale documents fascinating reading.)
I don't know about the other languages, but at least with regards to ECMAScript the specification can be accessed freely at https://www.ecma-international.org/ecma-262/7.0/index.html - though the License chapter says that "after approval all rights on the standard are reserved by Ecma International."
I think ISO would say the standard is open, but not free. You're not required to be a member of a working group to access the final standard, you just have to buy the book.
Which has created a history of implementations that usually follow the standard quite strictly.
(Though, I do personally despise the non-free aspect, I can sort of understand it, considering history and the effort to standardise).
Just as a data point, I wanted to see what the benefit of JVM HotSpot is over the JVM just interpreting. Code that normally runs in 22 minutes with HotSpot is still running 99 minutes later in interpreter-only mode!
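(For reference, interpreter-only mode is presumably selected with the standard -Xint flag; MyApp here is a stand-in class name:

    java -Xint MyApp   # interpreter only: JIT compilers disabled
    java MyApp         # default: tiered HotSpot compilation
)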
I am starting to think that the complaints about optimizing compilers are due to C family undefined behavior leading to C programmers having a love hate relationship with their compiler writers. In contrast Java programmers trust their JIT compilers to not screw up. But then UB is something that can cause issues even if there was no optimizing compiler at all.
So to conclude I am extremely grateful that there are optimizing compiler writers out there.
Another option is to compare a minimally optimizing compiler with a more optimizing compiler. One can do this in Java land with the client/server flags. Client takes 15% more time to run than server for our workload.
Now what is that 15% worth? Take the new AMD Epyc: going from 2GHz to 2.2GHz is worth 800USD (3200 -> 4000). That means running the JVM with -server gives a value, at that level, of about 1/3 of the server CPU cost over running the same JVM with -client. Of course that is at the top level and would be less in the middle.