Because the exponential algorithm (1) easily supports backtracking and (2) can often be implemented with less constant-factor overhead than the Thompson NFA. Building the NFA and doing NFA-to-DFA conversion takes time. The simplicity and resulting speed in non-pathological cases are why JS regex engines all use the exponential algorithm.
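To make "non-pathological" concrete, here's a tiny sketch using Python's re (a backtracking engine, like the JS ones); the pattern and input sizes are just illustrative:

    import re
    import time

    def timed_search(pattern, text):
        start = time.perf_counter()
        re.search(pattern, text)
        return time.perf_counter() - start

    # Typical case: the nested quantifiers never need to be re-tried, so the
    # search is effectively instant.
    print('match   :', timed_search(r'(a+)+$', 'a' * 22))

    # Pathological case: no match is possible, so the engine retries every way
    # of splitting the a's between the two '+' loops -- the work roughly
    # doubles with each extra 'a'.
    print('no match:', timed_search(r'(a+)+$', 'a' * 22 + 'b'))

An automata-based engine handles the failing input in linear time; the point is just that a backtracker rarely gets handed input like that.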
Think about quicksort. Quicksort is O(n^2) worst case, so it must be awful, right? Well, no, not really, because with a sensible choice of pivot you rarely hit that case, and its constant overhead is significantly lower than that of guaranteed O(n log n) mergesort on common inputs. Regex algorithms have a similar dynamic.
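If you want to see the analogy concretely, here's a throwaway sketch (naive, out-of-place quicksort; the only thing that changes is the pivot rule):

    import random
    import sys
    import time

    def quicksort(xs, pick_pivot):
        # Naive out-of-place quicksort; pick_pivot chooses the pivot value.
        if len(xs) <= 1:
            return xs
        pivot = pick_pivot(xs)
        less = [x for x in xs if x < pivot]
        equal = [x for x in xs if x == pivot]
        greater = [x for x in xs if x > pivot]
        return quicksort(less, pick_pivot) + equal + quicksort(greater, pick_pivot)

    def timed(label, pick_pivot):
        data = list(range(2000))  # already sorted: the classic bad case
        start = time.perf_counter()
        quicksort(data, pick_pivot)
        print(label, time.perf_counter() - start)

    sys.setrecursionlimit(5000)
    # Fixed pivot on sorted input: every partition is maximally lopsided -> O(n^2).
    timed('first-element pivot:', lambda xs: xs[0])
    # Random pivot: lopsided partitions are rare -> expected O(n log n).
    timed('random pivot:       ', lambda xs: random.choice(xs))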
My understanding is that re2 and rust-regex manage parity with the backtracking algorithm for non-pathological regexes, much as the complicated mergesort-based Timsort can achieve parity with quicksort, but they have to do a lot of work to get there.
This sounds reasonable, but is simply not correct. There's nothing inherently slow about either algorithm in the average case and there's absolutely no reason that Thompson (or the One True NFA, Glushkov) should be slower than backtracking algorithms.
The exponential algorithms are commonly used because (a) they rarely hit their exponential behavior in practice, (b) they offer huge numbers of features that automata based algorithms cannot easily supply (capturing subexpressions and backreferences are the most obvious - yes, you can do capturing in automata but it's pretty hard) and (c) they are simpler than building something like Hyperscan.
Our codebase is enormous compared to a relatively feature-rich back-tracker. Partly this is because we've been at this a while and under constant pressure to do well on lots of cases and metrics (a cleaner, 'toy' system would look nicer but maybe we wouldn't have survived as a startup if we hadn't optimized 'that one case'). But if you want to build out a Glushkov NFA graph and do a ton of optimizations, you're already more complex than a backtracking engine and you're just getting started.
> The exponential algorithms are commonly used because (a) they rarely hit their exponential behavior in practice, (b) they offer huge numbers of features that automata based algorithms cannot easily supply (capturing subexpressions and backreferences are the most obvious - yes, you can do capturing in automata but it's pretty hard) and (c) they are simpler than building something like Hyperscan.
I'm not sure how that is saying anything substantively different from what I said... Point (a) is what I was alluding to with the quicksort analogy: you rarely hit the O(n^2) behavior with quicksort. Point (b) is what I was talking about with backtracking. Point (c) is what my simplicity remark was describing.
I didn't mean to imply that you can't go as fast as a good backtracking JIT, rather that it's an awful lot of work to get there.
That said, I do think that building the NFA or converting it to a DFA takes up-front time, right? Some regex benchmarks create regexes dynamically at runtime over and over, and this can add up.
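A rough sketch of what I mean (Python's re is itself a backtracker, so this only shows the shape of the measurement, not the automata construction cost; the pattern, text, and counts are made up):

    import re
    import timeit

    TEXT = 'log 1234: user_42 fetched /index.html in 3ms ' * 50

    def fresh_pattern(i):
        # A distinct pattern string per call defeats re's internal cache,
        # mimicking benchmarks that build regexes dynamically at run time.
        return re.compile(rf'user_{i} fetched (/\S*) in (\d+)ms').search(TEXT)

    # Deliberately non-matching too, so both loops do comparable scans and
    # the difference is mostly the construction cost.
    PRECOMPILED = re.compile(r'user_none fetched (/\S*) in (\d+)ms')

    n = 2000
    dynamic = timeit.timeit(lambda: [fresh_pattern(i) for i in range(n)], number=1)
    reused = timeit.timeit(lambda: [PRECOMPILED.search(TEXT) for _ in range(n)], number=1)
    print(f'compile per use: {dynamic:.3f}s')
    print(f'compile once   : {reused:.3f}s')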
Yes, you're right. I was fixating on the wrong interpretation of your post. We do have exactly the problem you describe (startup time), and the cross-over analysis with RE2 ("how much data do I have to scan to catch up with a simpler, faster compiler?") would be interesting against libpcre or another backtracker.
Generally - if you can amortize the startup - we find that it's not that hard to outperform back-trackers. Some of this is chucking effort at the problem (writing SIMD implementations of some parts of the search problem) and could be replicated on the backtracking side of the fence. I do think it's simpler to make an NFA or DFA go fast, however, as the problem is very regular and predictable. The analysis of how you would do more than trivial SIMD optimizations on a true back-tracker breaks my brain. :-)
> yes, you can do capturing in automata but it's pretty hard
Huh, I thought it was fairly easy -- you just need a non-consuming state that pushes a mark onto a stack or pops one off of it. At least, I don't remember much of a problem implementing it when I implemented regex for Myrddin.
The trickiest part of getting matches to work right was the priority ordering: when the VM forked, the threads had to be tried in the right order so that greedy and non-greedy matches ended at the right place.
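Roughly the shape of it, as a throwaway Python sketch rather than Myrddin's actual code: a hand-compiled program for a(b*)c on a Pike-style VM, where the non-consuming "save" instruction records positions for group 1 and the order in which "split" queues threads is what makes the b* greedy:

    # Instructions: (CHAR, c) consumes c; (SPLIT, x, y) forks, preferring x;
    # (JMP, x) jumps; (SAVE, n) records the current position in slot n without
    # consuming input; (MATCH,) accepts.
    CHAR, SPLIT, JMP, SAVE, MATCH = range(5)

    # Hand-compiled a(b*)c. Slots 2/3 hold group 1 (0/1 would be the whole
    # match, omitted here for brevity).
    PROG = [
        (CHAR, 'a'),      # 0
        (SAVE, 2),        # 1: group 1 start
        (SPLIT, 3, 5),    # 2: greedy -- prefer another 'b'
        (CHAR, 'b'),      # 3
        (JMP, 2),         # 4
        (SAVE, 3),        # 5: group 1 end
        (CHAR, 'c'),      # 6
        (MATCH,),         # 7
    ]

    def add_thread(threads, seen, pc, saved, pos):
        """Follow non-consuming ops eagerly; list order encodes priority."""
        if pc in seen:
            return
        seen.add(pc)
        op = PROG[pc]
        if op[0] == JMP:
            add_thread(threads, seen, op[1], saved, pos)
        elif op[0] == SPLIT:
            add_thread(threads, seen, op[1], saved, pos)  # preferred branch first
            add_thread(threads, seen, op[2], saved, pos)
        elif op[0] == SAVE:
            add_thread(threads, seen, pc + 1, {**saved, op[1]: pos}, pos)
        else:
            threads.append((pc, saved))

    def run(text):
        matched = None
        threads, seen = [], set()
        add_thread(threads, seen, 0, {}, 0)
        for pos in range(len(text) + 1):
            ch = text[pos] if pos < len(text) else None
            next_threads, seen = [], set()
            for pc, saved in threads:
                op = PROG[pc]
                if op[0] == CHAR:
                    if ch is not None and op[1] == ch:
                        add_thread(next_threads, seen, pc + 1, saved, pos + 1)
                elif op[0] == MATCH:
                    matched = saved  # leftmost-first: drop lower-priority threads
                    break
            threads = next_threads
        return matched

    print(run('abbbc'))  # {2: 1, 3: 4} -> group 1 is text[1:4] == 'bbb'

Thread order out of "split" is exactly the greedy/non-greedy knob: swapping its two targets would make the b* lazy.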
There are many types of automata, but if your description of "automata" includes phrases like "when the VM forked" I'm guessing that you're doing something pretty different to what we're doing. :-)
In our normal execution model, we also can't have a stack. Hyperscan is not allowed to go get more memory at run-time nor is it allowed to grab more storage proportional to the size of the input.
When we had a capturing prototype, we broke a lot of these rules (you needed either O(N) or O(lg(N)) side storage). We never had something that entirely fit our normal strictures and still did capturing; we couldn't do it in streaming mode either. We had something that could exactly simulate what you'd get with libpcre in terms of capturing and quantifiers as long as you could live with block mode and some limits on scan size dictated by our side storage use. The main problem? It was complex, and we didn't have a customer for it (this was back when we were a startup).
Fair -- I think the tagged DFA approach (https://laurikari.net/tre/) is the usual way to do captures without allocating memory at run time, but I haven't had a chance to look deeper into it.