Because the exponential algorithm (1) easily supports backtracking and (2) can often be implemented with less constant-factor overhead than the Thompson NFA. Building the NFA and doing NFA-to-DFA conversion takes time. The simplicity and resulting speed in non-pathological cases are why JS regex engines all use the exponential algorithm.
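To make "non-pathological" concrete, here's a tiny sketch using Python's re (a backtracking engine, like the JS ones); the pattern and input sizes are just illustrative:

    import re
    import time

    def timed_search(pattern, text):
        start = time.perf_counter()
        re.search(pattern, text)
        return time.perf_counter() - start

    # Typical case: the nested quantifiers never need to be re-tried, so the
    # search is effectively instant.
    print('match   :', timed_search(r'(a+)+$', 'a' * 22))

    # Pathological case: no match is possible, so the engine retries every way
    # of splitting the a's between the two '+' loops -- the work roughly
    # doubles with each extra 'a'.
    print('no match:', timed_search(r'(a+)+$', 'a' * 22 + 'b'))

An automata-based engine handles the failing input in linear time; the point is just that a backtracker rarely gets handed input like that.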
Think about quicksort. Quicksort is O(n^2) worst case, so it must be awful, right? Well, no, not really, because with a sensible choice of pivot you rarely hit that case, and its constant overhead is significantly lower than that of guaranteed O(n log n) mergesort on common inputs. Regex algorithms have a similar dynamic.
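If you want to see the analogy concretely, here's a throwaway sketch (naive, out-of-place quicksort; the only thing that changes is the pivot rule):

    import random
    import sys
    import time

    def quicksort(xs, pick_pivot):
        # Naive out-of-place quicksort; pick_pivot chooses the pivot value.
        if len(xs) <= 1:
            return xs
        pivot = pick_pivot(xs)
        less = [x for x in xs if x < pivot]
        equal = [x for x in xs if x == pivot]
        greater = [x for x in xs if x > pivot]
        return quicksort(less, pick_pivot) + equal + quicksort(greater, pick_pivot)

    def timed(label, pick_pivot):
        data = list(range(2000))  # already sorted: the classic bad case
        start = time.perf_counter()
        quicksort(data, pick_pivot)
        print(label, time.perf_counter() - start)

    sys.setrecursionlimit(5000)
    # Fixed pivot on sorted input: every partition is maximally lopsided -> O(n^2).
    timed('first-element pivot:', lambda xs: xs[0])
    # Random pivot: lopsided partitions are rare -> expected O(n log n).
    timed('random pivot:       ', lambda xs: random.choice(xs))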
My understanding is that re2 and rust-regex manage parity with the backtracking algorithm for non-pathological regexes, much as the complicated mergesort-based Timsort can achieve parity with quicksort, but they have to do a lot of work to get there.
This sounds reasonable, but is simply not correct. There's nothing inherently slow about either algorithm in the average case and there's absolutely no reason that Thompson (or the One True NFA, Glushkov) should be slower than backtracking algorithms.
The exponential algorithms are commonly used because (a) they rarely hit their exponential behavior in practice, (b) they offer huge numbers of features that automata based algorithms cannot easily supply (capturing subexpressions and backreferences are the most obvious - yes, you can do capturing in automata but it's pretty hard) and (c) they are simpler than building something like Hyperscan.
Our codebase is enormous compared to a relatively feature-rich back-tracker. Partly this is because we've been at this a while and under constant pressure to do well on lots of cases and metrics (a cleaner, 'toy' system would look nicer but maybe we wouldn't have survived as a startup if we hadn't optimized 'that one case'). But if you want to build out a Glushkov NFA graph and do a ton of optimizations, you're already more complex than a backtracking engine and you're just getting started.
> The exponential algorithms are commonly used because (a) they rarely hit their exponential behavior in practice, (b) they offer huge numbers of features that automata based algorithms cannot easily supply (capturing subexpressions and backreferences are the most obvious - yes, you can do capturing in automata but it's pretty hard) and (c) they are simpler than building something like Hyperscan.
I'm not sure how that is saying anything substantively different from what I said... Point (a) is what I was alluding to with the quicksort analogy: you rarely hit the O(n^2) behavior with quicksort. Point (b) is what I was talking about with backtracking. Point (c) is what my simplicity remark was describing.
I didn't mean to imply that you can't go as fast as a good backtracking JIT, rather that it's an awful lot of work to get there.
That said, I do think that building the NFA or converting it to a DFA takes up-front time, right? Some regex benchmarks create regexes dynamically at runtime over and over, and this can add up.
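A rough sketch of what I mean (Python's re is itself a backtracker, so this only shows the shape of the measurement, not the automata construction cost; the pattern, text, and counts are made up):

    import re
    import timeit

    TEXT = 'log 1234: user_42 fetched /index.html in 3ms ' * 50

    def fresh_pattern(i):
        # A distinct pattern string per call defeats re's internal cache,
        # mimicking benchmarks that build regexes dynamically at run time.
        return re.compile(rf'user_{i} fetched (/\S*) in (\d+)ms').search(TEXT)

    # Deliberately non-matching too, so both loops do comparable scans and
    # the difference is mostly the construction cost.
    PRECOMPILED = re.compile(r'user_none fetched (/\S*) in (\d+)ms')

    n = 2000
    dynamic = timeit.timeit(lambda: [fresh_pattern(i) for i in range(n)], number=1)
    reused = timeit.timeit(lambda: [PRECOMPILED.search(TEXT) for _ in range(n)], number=1)
    print(f'compile per use: {dynamic:.3f}s')
    print(f'compile once   : {reused:.3f}s')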
Yes, you're right. I was fixating on the wrong interpretation of your post. We do have exactly the problem you describe (startup time), and the cross-over analysis with RE2 ("how much data do I have to scan to catch up with a simpler, faster compiler?") would be interesting against libpcre or another backtracker.
Generally - if you can amortize the startup - we find that it's not that hard to outperform back-trackers. Some of this is chucking effort at the problem (writing SIMD implementations of some parts of the search problem) and could be replicated on the backtracking side of the fence. I do think it's simpler to make an NFA or DFA go fast, however, as the problem is very regular and predictable. The analysis of how you would do more than trivial SIMD optimizations on a true back-tracker breaks my brain. :-)
> yes, you can do capturing in automata but it's pretty hard
Huh, I thought it was fairly easy -- you just need a non-consuming state that pushes a mark onto a stack or pops one off of it. At least, I don't remember much of a problem implementing it when I implemented regex for Myrddin.
The trickiest part of getting matches to work right was the priority ordering: when the VM forked, the threads had to be tried in the right order so that greedy and non-greedy matches ended at the right place.
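Roughly the shape of it, as a throwaway Python sketch rather than Myrddin's actual code: a hand-compiled program for a(b*)c on a Pike-style VM, where the non-consuming "save" instruction records positions for group 1 and the order in which "split" queues threads is what makes the b* greedy:

    # Instructions: (CHAR, c) consumes c; (SPLIT, x, y) forks, preferring x;
    # (JMP, x) jumps; (SAVE, n) records the current position in slot n without
    # consuming input; (MATCH,) accepts.
    CHAR, SPLIT, JMP, SAVE, MATCH = range(5)

    # Hand-compiled a(b*)c. Slots 2/3 hold group 1 (0/1 would be the whole
    # match, omitted here for brevity).
    PROG = [
        (CHAR, 'a'),      # 0
        (SAVE, 2),        # 1: group 1 start
        (SPLIT, 3, 5),    # 2: greedy -- prefer another 'b'
        (CHAR, 'b'),      # 3
        (JMP, 2),         # 4
        (SAVE, 3),        # 5: group 1 end
        (CHAR, 'c'),      # 6
        (MATCH,),         # 7
    ]

    def add_thread(threads, seen, pc, saved, pos):
        """Follow non-consuming ops eagerly; list order encodes priority."""
        if pc in seen:
            return
        seen.add(pc)
        op = PROG[pc]
        if op[0] == JMP:
            add_thread(threads, seen, op[1], saved, pos)
        elif op[0] == SPLIT:
            add_thread(threads, seen, op[1], saved, pos)  # preferred branch first
            add_thread(threads, seen, op[2], saved, pos)
        elif op[0] == SAVE:
            add_thread(threads, seen, pc + 1, {**saved, op[1]: pos}, pos)
        else:
            threads.append((pc, saved))

    def run(text):
        matched = None
        threads, seen = [], set()
        add_thread(threads, seen, 0, {}, 0)
        for pos in range(len(text) + 1):
            ch = text[pos] if pos < len(text) else None
            next_threads, seen = [], set()
            for pc, saved in threads:
                op = PROG[pc]
                if op[0] == CHAR:
                    if ch is not None and op[1] == ch:
                        add_thread(next_threads, seen, pc + 1, saved, pos + 1)
                elif op[0] == MATCH:
                    matched = saved  # leftmost-first: drop lower-priority threads
                    break
            threads = next_threads
        return matched

    print(run('abbbc'))  # {2: 1, 3: 4} -> group 1 is text[1:4] == 'bbb'

Thread order out of "split" is exactly the greedy/non-greedy knob: swapping its two targets would make the b* lazy.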
There are many types of automata, but if your description of "automata" includes phrases like "when the VM forked" I'm guessing that you're doing something pretty different to what we're doing. :-)
In our normal execution model, we also can't have a stack. Hyperscan is not allowed to go get more memory at run-time nor is it allowed to grab more storage proportional to the size of the input.
When we had a capturing prototype, we broke a lot of these rules (you needed either O(N) or O(lg(N)) side storage). We never had something that entirely fit our normal strictures and still did capturing; we couldn't do it in streaming mode either. We had something that could exactly simulate what you'd get with libpcre in terms of capturing and quantifiers as long as you could live with block mode and some limits on scan size dictated by our side storage use. The main problem? It was complex, and we didn't have a customer for it (this was back when we were a startup).
Fair -- I think the tagged DFA approach (https://laurikari.net/tre/) is the usual way to do captures without allocating memory at run time, but I haven't had a chance to look deeper into it.