Hacker News new | past | comments | ask | show | jobs | submit login
Eliminating branches in Rust for fun but not much profit (kamalmarhubi.com)
83 points by dbaupp on Sept 16, 2015 | hide | past | favorite | 24 comments



There are some really good alternatives in reddit comments: https://www.reddit.com/r/rust/comments/3l3bnn/eliminating_br...


It's a bit weird that he doesn't provide his solution. I would be surprised if this isn't faster:

    pub fn is_token1(c: u8) -> bool {
        let mask = 
            if c < 32 { !0 }
            // c != 32 && c != 34 && c != 40 && c != 41 && c != 44 && c != 47 && !(c > 57 && c < 64)
            else if c < 64 { 1 + 4 + 256 + 512 + 4096 + 32768 - (1 << 25) }
            // c != 64 && !(c > 90 && c < 94)
            else if c < 96 { 1 + (1 << 30) - (1 << 26) }
            // c != 123 && c != 125
            else if c < 128 { (1 << 27) + (1 << 29) }
            else { !0 };

        ((mask >> (c & 31)) & 1) == 0
    }
or this for 64-bit:

    pub fn is_token2(c: u8) -> bool {
        // roughly follows the order of ascii chars: "\"(),/:;<=>?@[\\]{} \t"
        let mask = 
            if c < 64 { (1u64 << 33) - 1 + ((4u64 + 256u64 + 512u64 + 4096u64 + 32768u64) << 32) + !((1u64 << 57) - 1) }
            else if c < 128 { 1 + (1 << 30) - (1 << 26) + (1u64 << 59) + (1u64 << 61) }
            else { !0u64 };
        
        ((mask >> (c & 63)) & 1) == 0
    }
(Forgive me, it's the first time ever I write Rust...) and if it is slower, it could also be because LLVM generates a slow bt instruction.


Note that rust has bit literals, so you can explicitly write bit masks out:

    let mask = if blah { 0b11111110000000001001001100000101u64 } else { 0 }
and you can include optional underscores wherever for clarity

    let mask = if blah { 0b_11111110_00000000_10010011_00000101_u64 } else { !0 }



    fn is_token(c: u8) -> bool {
        // roughly follows the order of ascii chars: "\"(),/:;<=>?@[\\]{} \t"
        c < 128 && c > 32 && c != b'\t' && c != b'"' && c != b'(' && c != b')' &&
            c != b',' && c != b'/' && !(c > 57 && c < 65) && !(c > 90 && c < 94) &&
            c != b'{' && c != b'}'
    }
I've not done much with Rust, but why is it stated: “LOOK && AT && ALL && THOSE && BRANCHES!”

I don't see branches, just the evaluation of a complex bool expression.

If there were lots of "ifs" and "elses" I would see branches.

Can someone explain?


> Can someone explain?

&& short-circuits, so before each &&, if the LHS is false it will jump to the end of the whole thing with a `false` result

If you plug the function into goldbot (http://rust.godbolt.org) you see a bunch of conditional jumps:

    	cmpl	$7, %ecx
	jb	.LBB0_10
	movzbl	%dil, %ecx
	cmpl	$47, %ecx
	je	.LBB0_10
	cmpl	$44, %ecx
	je	.LBB0_10
	movb	%dil, %dl
	andb	$-2, %dl
	movzbl	%dl, %edx
	cmpl	$40, %edx
	je	.LBB0_10
	cmpl	$34, %ecx
	je	.LBB0_10
	cmpl	$33, %ecx
	jb	.LBB0_10
	testb	%dil, %dil
	js	.LBB0_10


Thanks, I'd not seen godbolt before - this is very useful.


Short-circuit evaluation.

The structure of the flow looks like a ladder. The (x86) assembly produced would look something like this:

    cmp al, 128
    jae false
    cmp al, 32
    jbe false
    cmp al, 9
    jz false
    cmp al, '"'
    jz false
    cmp al, '('
    jz false
    cmp al, ')'
    jz false
    ...
Edit: yes, I know 9 is less than 32 and would've been caught by the condition before it; I was just "mentally compiling" the expression directly without thinking much about optimising it.


Thanks for that. Would adding extra brackets (turning it into a tree structure) allow the CPU to use multiple ALUs to evaluate branches in parallel?

    (
       (
          ( c < 128 && c > 32 )
       &&
          ( c != b'\t' && c != b'"' )
       )
    &&
       (
          ( c != b',' && c != b'/' )
       &&
          (
             !(c > 57 && c < 65)
          &&
             !(c > 90 && c < 94)
          )
       &&
          (
             c != b'{'
          &&
             c != b'}'
          )
       )
   )


Modern cpus use speculation, they continue executing code after a conditional branch, so you get that automatically.


Thanks - learning a lot today.


A better name for the short-circuiting && operator is "andthen". Since "false && X" is false for all X, && does not evaluate its right operand if the left operand is false; this translates into branches in the assembly code, as if you had written nested "if" statements.


Not everyone running Haswell. Would be interesting to compare this performance with other processors.


Also mobile. ARM processors have less-sophisticated branch prediction, and on the lower end, no branch prediction at all!


Isn't that because lower end ARM is in-order?


Anecdotally I've had similar experiences with older processors in C#, dating back to ~2007.


I went down a similar route in a performance-sensitive lexer (started with naive code, spent some time fiddling with different ways to express it) and eventually used a lexer generator, which uses a combination of lookup tables and tests.

https://github.com/martine/ninja/blob/master/src/lexer.cc#L1...


Testing if a value is in one of two sets reminds me of this article:

http://www.hugi.scene.org/online/coding/hugi%2017%20-%20coaa...

Depending on whether the sets line up, this can be done in 2(!) instructions totaling to 4 bytes, and around 25 clock cycles. Nowhere near as fast as the bit-table lookup approach, but possibly the smallest.


For some cases I've found bitwise stuff to be faster. I'd try replacing: (c < 128 && c > 31)

With ((c < 128) & (c > 31)) != 0 if this were in C.


This strikes me as the kind of peephole optimization that LLVM would do anyway (similar to people manually converting `x / 2` to `x >> 1`, my nemesis). But maybe it's esoteric enough for them to not bother?


At least GCC definitely does it (and also the other way round).


When I first started using it, I was using GCC. But that was several years ago. Maybe I can be less explicit about it now.


I know the (non-bitwise) boolean operators have shortcutting behavior, but if evaluating the second clause has no side effects, couldn't the compiler just cheat by evaluating both clauses unconditionally and then combining the two boolean results using the equivalent bitwise operation, thus avoiding a branch?


This sounds like a textbook candidate for superoptimization.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: