I am curious why the TPDE paper does not mention the Copy-And-Patch paper.
That is a technique that uses LLVM to generate a library of patchable machine code snippets, and during actual compilation those snippets are simply pasted together. In fairness, it is just a proof of concept: they could compile WASM to x64 but not C or C++.
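To make the "paste snippets together" idea concrete, here is a minimal sketch. It assumes a tiny hand-written stencil table (the names `Stencil`, `STENCILS` and `compile_program` are made up for illustration); the real system derives its stencils from LLVM output rather than writing bytes by hand. The three x86-64 encodings used are `mov eax, imm32` (`B8`), `add eax, imm32` (`05`) and `ret` (`C3`).

```python
import struct
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stencil:
    code: bytes                        # pre-generated machine code with a zeroed placeholder
    hole_offset: Optional[int] = None  # byte offset of the 32-bit immediate to patch, if any


# Hypothetical stencil library, written by hand for this sketch.
STENCILS = {
    "load_const": Stencil(b"\xB8\x00\x00\x00\x00", 1),  # mov eax, imm32
    "add_const": Stencil(b"\x05\x00\x00\x00\x00", 1),   # add eax, imm32
    "ret": Stencil(b"\xC3"),                            # ret
}


def compile_program(ops):
    """Copy each stencil into the output buffer, then patch its hole."""
    out = bytearray()
    for name, operand in ops:
        stencil = STENCILS[name]
        start = len(out)
        out += stencil.code  # "copy"
        if stencil.hole_offset is not None:
            # "patch": overwrite the placeholder with the little-endian immediate
            struct.pack_into("<i", out, start + stencil.hole_offset, operand)
    return bytes(out)


code = compile_program([("load_const", 40), ("add_const", 2), ("ret", None)])
print(code.hex(" "))  # b8 28 00 00 00 05 02 00 00 00 c3
```

The compile step is only a memcpy plus a few stores per operation, which is where the compile-time speed comes from. The sketch stops at producing bytes and does not map them into executable memory.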
There's a longer paragraph on that topic in Section 8. We also previously built an LLVM back-end using that approach [1]. While that approach leads to even faster compilation, run-time performance is much worse (2.5x slower than LLVM -O0) because register allocation across the snippets is more or less impossible.
> run-time performance is much worse (2.5x slower than LLVM -O0)
How come? The Copy-and-Patch Compilation paper reports:
> The generated code runs [...] 14% faster than LLVM -O0.
I don't have time right now to compare your approach and benchmark to theirs, but I would have expected comparable performance from what I had read back then.
The paper is rather selective about the benchmarks and baselines it uses. They do two comparisons against LLVM (three microbenchmarks and a re-implementation of a few rather simple database queries), and they wrote all of those benchmarks themselves in their own framework. The benchmarks start from their custom AST data structures, and they have their own way of generating LLVM-IR. For the non-optimizing LLVM back-end, performance obviously depends strongly on how the IR is generated; they might not have put a lot of effort into generating "good IR" (= IR similar to what Clang generates).
The fact that they don't do a comparison against LLVM on larger benchmarks/functions or any other code they haven't written themselves makes that single number rather questionable for a general claim of being faster than LLVM -O0.
That number is from their TPC-H benchmark, and there can be a variety of reasons for it. My guess would be that they can generate stencils for whole operators, which can be turned into more efficient code at stencil generation time, while LLVM -O0 gets the operator in LLVM-IR form and can do no such transformation. I can't verify this, though, because their benchmark setup seems a bit more involved.
When used in a C/C++ compiler, the stencils correspond to individual (or a few) LLVM-IR instructions, which leads to bad run-time performance. Also, as mentioned, register allocation becomes a problem for the Copy-and-Patch approach on larger functions.
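As a rough illustration of why stencil granularity matters, here is a toy cost model. It is not a measurement and not how either compiler works internally. It assumes that a per-instruction stencil reads both operands from stack slots and writes its result back, while a single stencil covering the whole expression loads each input once and keeps intermediates in registers. All names (`Var`, `BinOp`, `mem_ops_per_instruction`, `mem_ops_fused`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Var:
    name: str


@dataclass
class BinOp:
    op: str
    lhs: "Expr"
    rhs: "Expr"


Expr = Union[Var, BinOp]


def count_binops(e: Expr) -> int:
    if isinstance(e, Var):
        return 0
    return 1 + count_binops(e.lhs) + count_binops(e.rhs)


def count_vars(e: Expr) -> int:
    if isinstance(e, Var):
        return 1
    return count_vars(e.lhs) + count_vars(e.rhs)


def mem_ops_per_instruction(e: Expr) -> int:
    # Assumption: one stencil per operation, 2 loads + 1 store each,
    # because no register assignment is shared between stencils.
    return 3 * count_binops(e)


def mem_ops_fused(e: Expr) -> int:
    # Assumption: one stencil for the whole expression, compiled ahead of
    # time with a real register allocator: load each input, store the result.
    return count_vars(e) + 1


# (a + b) * (c + d)
expr = BinOp("*", BinOp("+", Var("a"), Var("b")), BinOp("+", Var("c"), Var("d")))
print(mem_ops_per_instruction(expr), mem_ops_fused(expr))  # 9 5
```

Under these assumptions the gap grows with the number of operations, which matches the intuition that whole-operator stencils can look good while per-instruction stencils in a C/C++ setting do not.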
I have no relation to the authors.
https://fredrikbk.com/publications/copy-and-patch.pdf