Bootstrapping GNU userland is a major pain point, and it's great to see GNU Guix taking it so seriously. There's even a section of their manual dedicated to it [1].
This is important, even if you don't intend to run GNU Guix itself. I care about building GNU userland for embedded targets, and even with a build system like Yocto, which builds specific versions of the entire host userland, you can get errors due to e.g. an older GNU m4 not being buildable with a modern glibc. Thus, you end up essentially requiring developers to build inside a container.
This is paving the way for having everything required to build a piece of software just checked into the git repository. If you want to change it, you just commit it like everything else.
As well as the diagram on that page, there is a nice document in one of the repos listing the bootstrapping steps in order, from "stage0" to "gcc 4.7.4" and beyond:
Bootstrappability is the reason why GNU Guix is one of the most important software projects ever. Moreover, I see Guix as the unifying force for the entire GNU project. It is the perfect playground to test and develop emerging tech (Shepherd, Hurd, etc.).
Bootstrapping gcc and libc was one of the time-honoured traditions I went through as a teenager with a Linux box. Learning how to set up gcc as a cross compiler to produce ELF binaries on an a.out system (or generating m68k binaries on my 486) was quite helpful later in my career when working on embedded systems. Granted, a lot of this was necessary as SLS and Slackware didn't exactly have very capable packaging systems at the time.
I find this fascinating and am a big supporter of the FSF and GNU.
All that said, I am not an expert so would like to learn more. Can somebody let me know why one cannot just take the assembly version of an existing compiler and carefully review its code to be happy with it and then build everything from that verified compiler? Why does it need so many steps?
I think you might find an answer for that question in the GCC 4.7 step. They target that version because all GCC versions afterwards include a C++ compiler in addition to the C one. Each successive step is a greater level of complexity. By starting small in the beginning, they have a codebase that is easier to audit than a full blown "modern" GCC or LLVM. That's the idea, at least.
> all GCC versions afterwards include a C++ compiler in addition to the C one
GCC 4.7 is the last version that can be built from source using only a C compiler. GCC has long included the C++ compiler inside, but didn’t require one to build until 4.8.
How do you know the Linux kernel you are running that verified compiler on isn't subverting the compiler? The only way to do bootstrapping sanely is to start from some manually written machine code (not assembler) and eventually reach Linux/GCC/etc. This is the approach being taken by Bootstrappable Builds.
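To make "manually written machine code" a bit more concrete, here is a toy Python sketch in the spirit of the hex-to-binary seed tools used at the very bottom of such a chain. It is not the actual stage0 code; the function name and the choice of `#` and `;` as comment markers are my assumptions for illustration. The point is that the seed is just commented hex that a human can read byte by byte, and the tool that turns it into a binary is small enough to audit.

```python
def hex_to_bytes(text: str) -> bytes:
    """Translate commented hex text into raw bytes (hypothetical helper)."""
    digits = []
    for line in text.splitlines():
        # Drop everything after a comment marker (assumed here: '#' or ';').
        for marker in ("#", ";"):
            line = line.split(marker, 1)[0]
        # Keep only the non-whitespace characters, which should be hex digits.
        digits.extend(ch for ch in line if not ch.isspace())
    hex_string = "".join(digits)
    if len(hex_string) % 2:
        raise ValueError("odd number of hex digits")
    # bytes.fromhex raises ValueError on anything that is not a hex digit.
    return bytes.fromhex(hex_string)


# The first eight bytes of an ELF header, written the way a reviewer would read them.
seed = """
7F 45 4C 46   # ELF magic: 0x7F 'E' 'L' 'F'
02 01 01 00   ; 64-bit, little-endian, ELF version 1, System V ABI
"""

print(hex_to_bytes(seed))
```

A real seed has to be a native binary rather than a Python script, of course, which is exactly why it needs to be tiny.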
> Can somebody let me know why one cannot just take the assembly version of an existing compiler and carefully review its code to be happy with it and then build everything from that verified compiler? Why does it need so many steps?
Because the OS you use to load the assembly code may have been poisoned to present you with a sanitized version of the compiler.
It's worth noting that the above comment, while it might sound paranoid to some, is IMO entirely justified.
I'm 50/50 on whether someone at some point hasn't executed a successful Trusting Trust attack (see Ken Thompson). With modern machines that have megabytes of binary blobs, different co-processors that have access to the RAM while they can't be reprogrammed to be on the user's side, and techniques that can actually tell when sensitive operations are happening, such attacks are becoming more feasible.
That's only half of the Trusting Trust attack, though; the other half is being able to make the compiler compromise propagate itself, i.e. not just inserting any backdoor in compiled code, but inserting itself in any compiler built using the compromised tool.
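A toy model of those two halves, as a Python sketch. Everything here is hypothetical: "source" is just a string, a "binary" is either a tagged string or a Python function, and the names are made up for illustration. It only shows the shape of the attack, not how a real compiler would be subverted.

```python
# Clean, fully auditable "source code" (assumed placeholders).
COMPILER_SOURCE = "compiler: translate source faithfully"
LOGIN_SOURCE = "login: accept if password matches"


def honest_compiler(source: str) -> object:
    """Faithful compiler: the output depends only on the source."""
    if source == COMPILER_SOURCE:
        # Building the clean compiler source yields another honest compiler.
        return honest_compiler
    return f"binary[{source}]"


def trojaned_compiler(source: str) -> object:
    """Compromised compiler binary; its own source no longer exists anywhere."""
    if source == LOGIN_SOURCE:
        # Half one: insert a backdoor into a specific target program.
        return f"binary[{source} + backdoor]"
    if source == COMPILER_SOURCE:
        # Half two: when asked to build the *clean* compiler source,
        # emit a compiler that carries the same trojan.
        return trojaned_compiler
    return f"binary[{source}]"


# Rebuild the compiler from clean source, using the compromised binary.
rebuilt = trojaned_compiler(COMPILER_SOURCE)
print(rebuilt(LOGIN_SOURCE))                           # still backdoored
print(honest_compiler(COMPILER_SOURCE)(LOGIN_SOURCE))  # clean
```

Reviewing `COMPILER_SOURCE` and `LOGIN_SOURCE` finds nothing, because the compromise lives only in the binary doing the building. That is why the chain has to start from something small enough to verify by hand.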
Mobile and Electron apps often weigh hundreds of megs. That's enough data to hide an entire classic-style OS in the spaces between the data. While I don't know whether someone did insert such a recursive compiler, they certainly can do it unobtrusively enough that it doesn't raise any suspicion.
Super interesting, but I wonder if there is a historically accurate, well-documented account of bootstrapping. This jumps into ELF pretty quickly, but there were older formats like a.out and OMAGIC? before ELF. Is there good documentation on the bootstrapping of x86 BSD or Minix, since they are even older than Linux?
[1] https://guix.gnu.org/manual/en/guix.html#Bootstrapping