The most important concept in today's processors (and even in common processors from the early 1990s, like the 486 or MIPS) is instruction-level parallelism.
"Modern" processors (loose terminology, because the 486 and MIPS did this too) can run multiple instructions __in parallel__ with other instructions. The difference is that "yesterday's" processors were very limited in the forms of parallelism they could automatically detect, while today's processors are so good at it that you really only need to think about dependency cutting IMO.
--------
Old 90s or 00s processors would only do ILP under fixed "fetch-decode-execute" pipelines, or maybe the Pentium's "dual pipelines". (The Raspberry Pi 3 also has fixed, in-order superscalar pipelines.) If something "can't be parallelized" under the pipeline model, these simpler processors would inject bubbles into the pipeline to resolve read-after-write hazards.
Modern processors have out-of-order execution and far more flexibility in discovering parallelism. Instead of slowing down the pipeline, modern processors rename registers and execute code out of order to keep searching for parallelism.
Still, "discovering latent parallelism" in your assembly code has been the explicit goal of every processor from the 90s onwards.
---------
Once you know this general trick (i.e. "obtaining more parallelism" from the same assembly code), you recognize that almost the entire manual of assembly language tricks comes down to just this one concept: instruction-level parallelism.
This is right up my alley, as it were. I'm not going to read the entire document, but I can remark on how the assembler language model is inferior to something I've created; details are on my user page. In brief, an assembler language is not only a batch tool but also hides details from the hacker. Optimizing machine code requires the ability to use instructions as data when the opportunity is noticed, but that requires noticing it at all, which requires seeing the numerical values. An assembler language also makes it difficult to do things such as put labels in the middle of an instruction. Assembler languages often introduce arbitrary name arithmetic and special names to handle this and other cases, but I found that inelegant and unnecessary; merely naming individual octets is generally sufficient.