Eliminating branches in Rust for fun but not much profit

viraptor · on Sept 16, 2015

There are some really good alternatives in reddit comments: https://www.reddit.com/r/rust/comments/3l3bnn/eliminating_br...

bonzini · on Sept 16, 2015

It's a bit weird that he doesn't provide his solution. I would be surprised if this isn't faster:

    pub fn is_token1(c: u8) -> bool {
        let mask = 
            if c < 32 { !0 }
            // c != 32 && c != 34 && c != 40 && c != 41 && c != 44 && c != 47 && !(c > 57 && c < 64)
            else if c < 64 { 1 + 4 + 256 + 512 + 4096 + 32768 - (1 << 25) }
            // c != 64 && !(c > 90 && c < 94)
            else if c < 96 { 1 + (1 << 30) - (1 << 26) }
            // c != 123 && c != 125
            else if c < 128 { (1 << 27) + (1 << 29) }
            else { !0 };

        ((mask >> (c & 31)) & 1) == 0
    }

or this for 64-bit:

    pub fn is_token2(c: u8) -> bool {
        // roughly follows the order of ascii chars: "\"(),/:;<=>?@[\\]{} \t"
        let mask = 
            if c < 64 { (1u64 << 33) - 1 + ((4u64 + 256u64 + 512u64 + 4096u64 + 32768u64) << 32) + !((1u64 << 57) - 1) }
            else if c < 128 { 1 + (1 << 30) - (1 << 26) + (1u64 << 59) + (1u64 << 61) }
            else { !0u64 };
        
        ((mask >> (c & 63)) & 1) == 0
    }

(Forgive me, it's the first time ever I write Rust...) and if it is slower, it could also be because LLVM generates a slow bt instruction.

Gankro · on Sept 16, 2015

Note that rust has bit literals, so you can explicitly write bit masks out:

    let mask = if blah { 0b11111110000000001001001100000101u64 } else { 0 }

and you can include optional underscores wherever for clarity

    let mask = if blah { 0b_11111110_00000000_10010011_00000101_u64 } else { !0 }

masklinn · on Sept 16, 2015

/u/dbaupp made the same remarks: https://www.reddit.com/r/rust/comments/3l3bnn/eliminating_br...

chrisdew · on Sept 16, 2015

    fn is_token(c: u8) -> bool {
        // roughly follows the order of ascii chars: "\"(),/:;<=>?@[\\]{} \t"
        c < 128 && c > 32 && c != b'\t' && c != b'"' && c != b'(' && c != b')' &&
            c != b',' && c != b'/' && !(c > 57 && c < 65) && !(c > 90 && c < 94) &&
            c != b'{' && c != b'}'
    }

I've not done much with Rust, but why is it stated: “LOOK && AT && ALL && THOSE && BRANCHES!”

I don't see branches, just the evaluation of a complex bool expression.

If there were lots of "ifs" and "elses" I would see branches.

Can someone explain?

masklinn · on Sept 16, 2015

> Can someone explain?

&& short-circuits, so before each &&, if the LHS is false it will jump to the end of the whole thing with a `false` result

If you plug the function into goldbot (http://rust.godbolt.org) you see a bunch of conditional jumps:

    	cmpl	$7, %ecx
	jb	.LBB0_10
	movzbl	%dil, %ecx
	cmpl	$47, %ecx
	je	.LBB0_10
	cmpl	$44, %ecx
	je	.LBB0_10
	movb	%dil, %dl
	andb	$-2, %dl
	movzbl	%dl, %edx
	cmpl	$40, %edx
	je	.LBB0_10
	cmpl	$34, %ecx
	je	.LBB0_10
	cmpl	$33, %ecx
	jb	.LBB0_10
	testb	%dil, %dil
	js	.LBB0_10

chrisdew · on Sept 16, 2015

Thanks, I'd not seen godbolt before - this is very useful.

userbinator · on Sept 16, 2015

Short-circuit evaluation.

The structure of the flow looks like a ladder. The (x86) assembly produced would look something like this:

    cmp al, 128
    jae false
    cmp al, 32
    jbe false
    cmp al, 9
    jz false
    cmp al, '"'
    jz false
    cmp al, '('
    jz false
    cmp al, ')'
    jz false
    ...

Edit: yes, I know 9 is less than 32 and would've been caught by the condition before it; I was just "mentally compiling" the expression directly without thinking much about optimising it.

chrisdew · on Sept 16, 2015

Thanks for that. Would adding extra brackets (turning it into a tree structure) allow the CPU to use multiple ALUs to evaluate branches in parallel?

    (
       (
          ( c < 128 && c > 32 )
       &&
          ( c != b'\t' && c != b'"' )
       )
    &&
       (
          ( c != b',' && c != b'/' )
       &&
          (
             !(c > 57 && c < 65)
          &&
             !(c > 90 && c < 94)
          )
       &&
          (
             c != b'{'
          &&
             c != b'}'
          )
       )
   )

wolf550e · on Sept 16, 2015

Modern cpus use speculation, they continue executing code after a conditional branch, so you get that automatically.

chrisdew · on Sept 16, 2015

Thanks - learning a lot today.

alricb · on Sept 16, 2015

A better name for the short-circuiting && operator is "andthen". Since "false && X" is false for all X, && does not evaluate its right operand if the left operand is false; this translates into branches in the assembly code, as if you had written nested "if" statements.

vbezhenar · on Sept 16, 2015

Not everyone running Haswell. Would be interesting to compare this performance with other processors.

acconsta · on Sept 16, 2015

Also mobile. ARM processors have less-sophisticated branch prediction, and on the lower end, no branch prediction at all!

masklinn · on Sept 16, 2015

Isn't that because lower end ARM is in-order?

zamalek · on Sept 16, 2015

Anecdotally I've had similar experiences with older processors in C#, dating back to ~2007.

evmar · on Sept 16, 2015

I went down a similar route in a performance-sensitive lexer (started with naive code, spent some time fiddling with different ways to express it) and eventually used a lexer generator, which uses a combination of lookup tables and tests.

https://github.com/martine/ninja/blob/master/src/lexer.cc#L1...

userbinator · on Sept 16, 2015

Testing if a value is in one of two sets reminds me of this article:

http://www.hugi.scene.org/online/coding/hugi%2017%20-%20coaa...

Depending on whether the sets line up, this can be done in 2(!) instructions totaling to 4 bytes, and around 25 clock cycles. Nowhere near as fast as the bit-table lookup approach, but possibly the smallest.

phkahler · on Sept 16, 2015

For some cases I've found bitwise stuff to be faster. I'd try replacing: (c < 128 && c > 31)

With ((c < 128) & (c > 31)) != 0 if this were in C.

Gankro · on Sept 16, 2015

This strikes me as the kind of peephole optimization that LLVM would do anyway (similar to people manually converting `x / 2` to `x >> 1`, my nemesis). But maybe it's esoteric enough for them to not bother?

bonzini · on Sept 16, 2015

At least GCC definitely does it (and also the other way round).

phkahler · on Sept 17, 2015

When I first started using it, I was using GCC. But that was several years ago. Maybe I can be less explicit about it now.

rcthompson · on Sept 16, 2015

I know the (non-bitwise) boolean operators have shortcutting behavior, but if evaluating the second clause has no side effects, couldn't the compiler just cheat by evaluating both clauses unconditionally and then combining the two boolean results using the equivalent bitwise operation, thus avoiding a branch?

drewm1980 · on Sept 17, 2015

This sounds like a textbook candidate for superoptimization.