It offers a modification to Knuth's algorithm D which avoids all hardware division by utilizing an approximate reciprocal. The 3-by-2 division is particularly interesting as it allows you to skip the quotient loop (D3). It's a nice little optimization that I've spent some time tinkering with in my own code. This technique is notably used in GMP and other arbitrary precision integer libraries.
IIRC, I tried this method but did not see any performance improvement. The method replaces a single `divq` instruction with a table lookup and several multiplications. On modern processors this is no longer a worthwhile trade-off.
The 2-by-1 and 3-by-2 division functions described in the paper result in a very measurable speedup in my code. I think you're confusing those with the reciprocal calculation itself (which can be computed with a lookup table). I agree that part doesn't really lend itself to any significant performance benefit and is probably better calculated with a single hardware division instead.
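For anyone following along, here is a minimal sketch of the 2-by-1 step as I understand it from the paper, in Python for readability rather than speed. The names `reciprocal_2by1` and `div2by1` are my own, the 64-bit word size is an assumption, and the reciprocal is computed with one plain big-integer division instead of a lookup table.

```python
B = 1 << 64          # assumed word base (64-bit limbs)
MASK = B - 1

def reciprocal_2by1(d):
    """Approximate reciprocal v = floor((B^2 - 1) / d) - B for a normalized d."""
    assert B >> 1 <= d < B, "divisor must have its top bit set"
    return (B * B - 1) // d - B

def div2by1(u1, u0, d, v):
    """Divide the two-word value <u1, u0> by d using the precomputed v.

    Requires u1 < d so the quotient fits in one word.
    Returns (quotient, remainder).
    """
    assert u1 < d
    # Candidate quotient: high word of v*u1 + <u1, u0>, plus one.
    q = v * u1 + ((u1 << 64) | u0)
    q0 = q & MASK
    q1 = ((q >> 64) + 1) & MASK
    # Candidate remainder, computed modulo B only.
    r = (u0 - q1 * d) & MASK
    # First correction: the candidate quotient was one too large.
    if r > q0:
        q1 = (q1 - 1) & MASK
        r = (r + d) & MASK
    # Second correction: rarely taken.
    if r >= d:
        q1 = (q1 + 1) & MASK
        r -= d
    return q1, r

# Usage: check against Python's own big-integer division.
d = 0x9E3779B97F4A7C15
v = reciprocal_2by1(d)
u1, u0 = d - 1, MASK
assert div2by1(u1, u0, d, v) == divmod((u1 << 64) | u0, d)
```

In a real limb-based implementation the masks disappear and each line maps to a widening multiply or an add-with-carry; the divisor is normalized once and `v` is reused for every limb of the dividend.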
I feel it necessary to point out that the 3-by-2 division actually has multiple benefits which are easy to miss:
1. The quotient loop can be skipped as I mentioned.
2. The "Add back" step is less likely to be triggered.
3. Since a 2-word remainder is computed with the division, you can skip 2 iterations on the multiply+subtract step.
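To make the points above concrete, here is a rough Python sketch of the 3-by-2 step, modeled on my reading of the paper. The function names and the 64-bit word size are assumptions for illustration, and the reciprocal is computed directly from its definition with a big-integer division rather than the word-level routine a library would use.

```python
B = 1 << 64            # assumed word base (64-bit limbs)
MASK = B - 1
MASK2 = B * B - 1

def reciprocal_3by2(d1, d0):
    """v = floor((B^3 - 1) / <d1, d0>) - B for a normalized two-word divisor."""
    assert B >> 1 <= d1 < B and 0 <= d0 < B
    return (B**3 - 1) // ((d1 << 64) | d0) - B

def div3by2(u2, u1, u0, d1, d0, v):
    """Divide <u2, u1, u0> by <d1, d0>; requires <u2, u1> < <d1, d0>.

    Returns (one-word quotient, two-word remainder as a single integer).
    """
    d = (d1 << 64) | d0
    assert ((u2 << 64) | u1) < d
    # Candidate quotient from the top two dividend words.
    q = v * u2 + ((u2 << 64) | u1)
    q1, q0 = (q >> 64) & MASK, q & MASK
    # Two-word candidate remainder, modulo B^2.
    r1 = (u1 - q1 * d1) & MASK
    r = (((r1 << 64) | u0) - d0 * q1 - d) & MASK2
    q1 = (q1 + 1) & MASK
    # First correction, decided by comparing the high remainder word with q0.
    if (r >> 64) >= q0:
        q1 = (q1 - 1) & MASK
        r = (r + d) & MASK2
    # Second correction: rarely taken.
    if r >= d:
        q1 = (q1 + 1) & MASK
        r -= d
    return q1, r

# Usage: check against Python's own big-integer division.
d1, d0 = 0x9E3779B97F4A7C15, 0xF39CC0605CEDC834
v = reciprocal_3by2(d1, d0)
u = ((d1 - 1) << 128) | (MASK << 64) | MASK
assert div3by2(d1 - 1, MASK, MASK, d1, d0, v) == divmod(u, (d1 << 64) | d0)
```

Because the quotient word returned here is already exact with respect to the top two divisor words, there is no trial-quotient adjustment loop, and the two remainder words are the ones the multiply+subtract pass would otherwise have to produce.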
My reimplementation of GMP documents both the 2-by-1 and 3-by-2 divisions pretty thoroughly[1][2].
I had not seen that GMP reimplementation before, but it looks very readable. Thanks!
I used the Möller & Granlund paper for a portable implementation of 2-by-1 division [1], and on my machine (which is admittedly not exactly new) it runs much faster than the DIV instruction.