
It is a pity that regex engines based on regex theory have only just come into widespread use, when the theory has been in place for decades. Why so many language standard libraries use an exponential algorithm is beyond my comprehension.



Because the exponential algorithm (1) easily supports backtracking and (2) can often be implemented with lower constant-factor overhead than the Thompson NFA. Building the NFA and doing NFA-to-DFA conversion takes time. The simplicity and resulting speed in non-pathological cases are why JS regex engines all use the exponential algorithm.

Think about quicksort. Quicksort is O(n^2) in the worst case, so it must be awful, right? Well, no, not really, because with a sensible choice of pivot you rarely hit that case, and the constant overhead is significantly better than guaranteed-O(n log n) mergesort on common inputs. Regex algorithms have a similar dynamic.
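To make "pathological" concrete, here's a minimal sketch using Python's re module (itself a backtracking engine); the pattern is the textbook worst case, not anything specific to the engines discussed here:

    import re
    import time

    # Nested quantifiers plus a forced failure: the backtracker tries an
    # exponential number of ways to split the run of 'a's before giving up.
    pattern = re.compile(r"(a+)+$")
    for n in (18, 20, 22, 24):
        start = time.perf_counter()
        pattern.match("a" * n + "b")      # can never match
        print(n, f"{time.perf_counter() - start:.3f}s")

The running time roughly doubles with each extra 'a', while an automata-based engine (RE2, rust-regex) handles the same input in linear time.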

My understanding is that re2 and rust-regex manage parity with the backtracking algorithm for non-pathological regexes, much as the complicated mergesort-based Timsort can achieve parity with quicksort, but they have to do a lot of work to get there.


This sounds reasonable, but is simply not correct. There's nothing inherently slow about either algorithm in the average case and there's absolutely no reason that Thompson (or the One True NFA, Glushkov) should be slower than backtracking algorithms.

The exponential algorithms are commonly used because (a) they rarely hit their exponential behavior in practice, (b) they offer huge numbers of features that automata based algorithms cannot easily supply (capturing subexpressions and backreferences are the most obvious - yes, you can do capturing in automata but it's pretty hard) and (c) they are simpler than building something like Hyperscan.
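For concreteness, a couple of the features in question, shown with Python's backtracking re module (illustrative examples only):

    import re

    # A backreference: \1 must repeat whatever the group captured,
    # which a plain finite automaton cannot express.
    m = re.search(r"\b(\w+)\s+\1\b", "this this sentence repeats a word")
    print(m.group(1))      # 'this'

    # Capturing subexpressions: trivial for a backtracker, harder
    # (tagged transitions, extra bookkeeping) for automata.
    m = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2024-05-17")
    print(m.groups())      # ('2024', '05', '17')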

Our codebase is enormous compared to a relatively feature-rich back-tracker. Partly this is because we've been at this a while and under constant pressure to do well on lots of cases and metrics (a cleaner, 'toy' system would look nicer but maybe we wouldn't have survived as a startup if we hadn't optimized 'that one case'). But if you want to build out a Glushkov NFA graph and do a ton of optimizations, you're already more complex than a backtracking engine and you're just getting started.


> The exponential algorithms are commonly used because (a) they rarely hit their exponential behavior in practice, (b) they offer huge numbers of features that automata based algorithms cannot easily supply (capturing subexpressions and backreferences are the most obvious - yes, you can do capturing in automata but it's pretty hard) and (c) they are simpler than building something like Hyperscan.

I'm not sure how that is saying anything substantively different from what I said... Point (a) is what I was alluding to with the quicksort analogy: you rarely hit the O(n^2) behavior with quicksort. Point (b) is what I was talking about with backtracking. Point (c) is what my simplicity remark was describing.

I didn't mean to imply that you can't go as fast as a good backtracking JIT, rather that it's an awful lot of work to get there.

That said, I do think that building the NFA or converting to DFA takes up-front time, right? Some regex benchmarks create regexes dynamically at runtime over and over and this can add up.
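As a rough illustration (a hypothetical micro-benchmark; Python's re stands in for whichever engine you care about), compile cost can dominate when distinct patterns are built per use:

    import re
    import time

    # 2000 distinct patterns built at runtime, so compilation cannot be
    # amortized or served from a cache.
    words = [f"token{i}" for i in range(2000)]

    start = time.perf_counter()
    patterns = [re.compile(rf"\b{w}\b") for w in words]
    compile_time = time.perf_counter() - start

    haystack = "token1999 appears at the end of this line"
    start = time.perf_counter()
    hits = sum(1 for p in patterns if p.search(haystack))
    match_time = time.perf_counter() - start

    print(f"compile: {compile_time:.4f}s  match: {match_time:.4f}s  hits: {hits}")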


Yes, you're right. I was fixating on the wrong interpretation of your post. We do have exactly the problem you describe (startup time), and the cross-over analysis with RE2 ("How much data do I have to scan before I catch up with a simpler compiler?") would be interesting against libpcre or another backtracker.

Generally - if you can amortize the startup - we find that it's not that hard to outperform back-trackers. Some of this is chucking effort at the problem (writing SIMD implementations of some parts of the search problem) and could be replicated on the backtracking side of the fence. I do think it's simpler to make an NFA or DFA go fast, however, as the problem is very regular and predictable. The analysis of how you would do more than trivial SIMD optimizations on a true back-tracker breaks my brain. :-)
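(For a flavor of what that effort looks like, minus the SIMD: a literal prefilter in plain Python, a toy stand-in for the kind of acceleration Hyperscan gets from vector instructions, not its actual API.)

    import re

    # If every match must contain the literal "ERROR", scan for that literal
    # with a fast substring search and run the full regex only at candidate
    # positions.
    pattern = re.compile(r"ERROR \d+: [a-z ]+")

    def prefiltered_findall(text):
        out, pos = [], 0
        while (hit := text.find("ERROR", pos)) != -1:
            m = pattern.match(text, hit)      # verify the candidate position
            if m:
                out.append(m.group())
                pos = m.end()
            else:
                pos = hit + 1
        return out

    log = "ok\nERROR 42: disk full\nERRORS are not matches\nERROR 7: timeout"
    print(prefiltered_findall(log))   # ['ERROR 42: disk full', 'ERROR 7: timeout']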


> yes, you can do capturing in automata but it's pretty hard

Huh, I thought it was fairly easy -- you just need a non-consuming state that pushes a mark onto a stack or pops one off of it. At least, I don't remember much of a problem implementing it when I implemented regex for Myrddin.

The trickiest part of getting matches to work right was the ordering: when the VM forked, the threads had to be prioritized so that greedy or non-greedy matches ended at the right place.
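If it helps anyone, here's a toy Pike-style VM in Python showing both points: SAVE is a non-consuming instruction that records the current offset into a capture slot, and the order in which SPLIT alternatives are queued is what makes greedy vs. non-greedy come out right. (A hypothetical sketch for illustration, not Myrddin's implementation.)

    # Opcodes: (CHAR, c) (SPLIT, x, y) (JMP, x) (SAVE, slot) (MATCH,)
    CHAR, SPLIT, JMP, SAVE, MATCH = range(5)

    def run(prog, s):
        nslots = 1 + max((i[1] for i in prog if i[0] == SAVE), default=-1)

        def add(threads, seen, pc, saves, pos):
            # Follow non-consuming ops eagerly; list order = thread priority,
            # which is what decides greedy vs. non-greedy submatches.
            if pc in seen:
                return
            seen.add(pc)
            op = prog[pc]
            if op[0] == JMP:
                add(threads, seen, op[1], saves, pos)
            elif op[0] == SPLIT:
                add(threads, seen, op[1], saves, pos)   # preferred branch first
                add(threads, seen, op[2], saves, pos)
            elif op[0] == SAVE:
                new = list(saves)
                new[op[1]] = pos                        # record offset, consume nothing
                add(threads, seen, pc + 1, tuple(new), pos)
            else:
                threads.append((pc, saves))

        clist, seen = [], set()
        add(clist, seen, 0, (None,) * nslots, 0)
        matched = None
        for pos, ch in enumerate(s + "\0"):             # sentinel flushes final threads
            nlist, seen = [], set()
            for pc, saves in clist:
                op = prog[pc]
                if op[0] == MATCH:
                    matched = saves                     # highest-priority match wins
                    break                               # kill lower-priority threads
                if op[0] == CHAR and op[1] == ch:
                    add(nlist, seen, pc + 1, saves, pos + 1)
            clist = nlist
        return matched

    # Program for the (fully anchored) pattern  a(b*)c , slots 0/1 bracketing (b*).
    prog = [
        (CHAR, "a"),
        (SAVE, 0),
        (SPLIT, 3, 5),     # greedy: prefer consuming another 'b'
        (CHAR, "b"),
        (JMP, 2),
        (SAVE, 1),
        (CHAR, "c"),
        (MATCH,),
    ]
    print(run(prog, "abbbc"))   # -> (1, 4): the group matched "bbb"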

Backtracking is a far bigger problem, of course.


There are many types of automata, but if your description of "automata" includes phrases like "when the VM forked" I'm guessing that you're doing something pretty different to what we're doing. :-)

In our normal execution model, we also can't have a stack. Hyperscan is not allowed to go get more memory at run-time nor is it allowed to grab more storage proportional to the size of the input.

When we had a capturing prototype, we broke a lot of these rules (you needed either O(N) or O(lg(N)) side storage). We never had something that entirely fit our normal strictures and still did capturing; we couldn't do it in streaming mode either. We had something that could exactly simulate what you'd get with libpcre in terms of capturing and quantifiers as long as you could live with block mode and some limits on scan size dictated by our side storage use. The main problem? It was complex, and we didn't have a customer for it (this was back when we were a startup).


Fair -- I think the tagged DFA approach (https://laurikari.net/tre/) is the usual way to do captures without allocating memory at runtime, but I haven't had a chance to look deeper into it.


You can do things with these backtracking algorithms that are difficult (capturing subexpressions, start of match) or impossible (backreferences, arbitrary lookarounds) to do efficiently in an automata based algorithm.

Sometimes users want these features more than they want theoretical purity. Who are we to tell them they are wrong?

That said, we do have some projects in mind to try to bridge these two worlds (the feature-rich backtracking world vs the speedy/streamable automata-based one). Anyone interested should contact the Hyperscan team. It would make a nice project at undergraduate or even postgraduate level, depending on scale.


> impossible (backreferences, arbitrary lookarounds) to do efficiently in an automata based algorithm.

But... backreferences are impossible [0] to do efficiently [1] in any algorithm!

[0] Well, assuming P != NP. [1] Polynomial-time.

See: http://perl.plover.com/NPC/
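For anyone curious how that works: a sketch in the spirit of that page's reduction, where a 3-CNF formula becomes one backreference regex plus one fixed string, so a polynomial-time backreference matcher would decide 3-SAT. (The encoding below is my own toy version, not the page's exact construction, and Python's re is just the demonstration vehicle.)

    import re

    def sat_to_regex(num_vars, clauses):
        # clauses: 3-tuples of non-zero ints, e.g. (1, -2, 3) means v1 or not-v2 or v3.
        # Group i captures "x" (variable i true) or "" (false); fine below 10 variables,
        # where \10 would become ambiguous.
        groups = "(x?)" * num_vars
        def lit(l):
            ref = "\\" + str(abs(l))              # backreference to the variable's group
            return ref if l > 0 else "x" + ref    # negated literal matches "x" iff group is ""
        clause_parts = ["," + "(?:" + "|".join(lit(l) for l in c) + ")" for c in clauses]
        pattern = "^" + groups + "x*" + "".join(clause_parts) + "$"
        text = "x" * num_vars + ",x" * len(clauses)
        return pattern, text

    # (v1 or v2) and (not-v1 or v2) and (v1 or not-v2): satisfiable with v1 = v2 = true.
    pattern, text = sat_to_regex(2, [(1, 2, 2), (-1, 2, 2), (1, -2, -2)])
    print(re.search(pattern, text) is not None)   # True -> satisfiable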


I don't really like this result. It's a bit like saying that finding the maximum integer in a sorted list is actually order N*N because... what if the N ints are all bignums with N bits, huh?

The fact that you have to grow the regex to get this result has a similar vibe to it. All the regex matching we see - and most conventional regex matching - assumes the regular expressions are fixed. Every now and then you'll see an algorithm where the size of the regex contributes to the big-O notation, but then it's usually broken out as a parameter.

I don't have a good method to actually handle a fixed number of back-references in less than exponential time, but it seems fairly clear that if you are willing to burn polynomial time gratuitously you should be able to handle a back-reference (i.e. for input of length N and M back-references, there are only O(N^(2M)) possible final places where those back-references could match for a single match... huge, but for fixed M still polynomial). So if you had an otherwise poly-time regex algorithm you could run it over and over again, trying each possible back-ref location, and still be poly-time.
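Sketching that for a single back-reference (hypothetical helper, with Python's re standing in for the "otherwise poly-time" matcher):

    import re

    # Decide whether  prefix (group) suffix \1  matches text without the matcher
    # ever seeing a backreference: enumerate the O(N^2) candidate spans for the
    # capture, pin it to that literal, and run an ordinary match per candidate.
    def match_one_backref(prefix, group, suffix, text):
        n = len(text)
        for i in range(n + 1):
            for j in range(i, n + 1):
                candidate = text[i:j]
                if not re.fullmatch(group, candidate):
                    continue                      # the capture must satisfy its own sub-pattern
                pinned = prefix + re.escape(candidate) + suffix + re.escape(candidate)
                if re.fullmatch(pinned, text):    # O(N^2) candidates, one poly-time match each
                    return candidate
        return None

    # "(a+)b\1" against "aaabaaa": the only consistent capture is "aaa".
    print(match_one_backref("", "a+", "b", "aaabaaa"))   # aaa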


Oops, that should have been "unsorted list". Or order N lg N in a sorted list.

The point is that the number of back-references is not typically something that varies as part of a regex workload, so this proof derives something that is correct in itself, but not applicable to the way most people view regex. This has led to a meme that "all regex with backreferences is NP-hard".

My rather labored analogy is that people view comparison of two numbers as O(1) and would be nonplussed if someone demanded to know if they'd considered O(N) sized bignums... it isn't what most people are talking about when they talk about the cost of doing arithmetic operations. Similarly, arbitrarily expanding regexes is not what most people are talking about when they talk about the cost of doing regex.


> It's a bit like saying finding the maximum integer in a sorted list is actually order NN because... what if the N ints are all BIGNUMS WITH N bits, huh?

I'm not sure I follow. Why would finding the max integer in a sorted list be anything other than the time to copy the last (or first, depending on sort order) element of the list?

Certainly a valid point, though, that this isn't a common use case, so that's just not how experts in this field think about this problem.

> there are only O(N^(2M)) possible final places where those back-references could match for a single match... huge but if M fixed, not polynomial).

Sure, that buys you a pseudo-polynomial algorithm. I just don't like the idea of calling something like that polynomial -- if we did that across the board, then we'd have a poly solution for knapsack (and consequently demonstrate P = NP).


Because comparing N bits takes time proportional to N.


Just wanted to draw your attention to a proof-of-concept regex engine that I wrote a few years ago that does handle backreferences and arbitrary lookaround using a Thompson NFA and with no backtracking.

The point of this engine was to show that, while it seems to be commonly believed that an automata-based regex engine cannot handle backreferences and arbitrary lookaround, that is not entirely true. It does use a stack when it encounters backreferences, though.

The source code for the regex engine is at https://github.com/karthikj1/JRegexPlus and a brief explanation of how it works is at http://karthikj1.github.io/JRegexPlus/

I would be interested to hear any thoughts on this.


I think the idea is interesting. I don't think most reasonable people believe that you can't handle backreferences and arbitrary lookaround in something that's largely based on automata; rather that if you're using automata, you probably made the decision to use them to avoid having to allocate arbitrary amounts of memory, to "stream", or for multiple-pattern support and/or performance reasons.

If those criteria frame why you are using automata to begin with, then (a) you can't use a stack either, or (b) you at least need to make a performance argument that using NFA+stack is better than a back-tracker.

Arbitrary lookaround seems straightforward except in streaming mode, although efficient support for forward asserts for arbitrary regexes without "grows with the input size" side storage is IMO not trivial (it also breaks our ordering model). Backward lookaround seems easy if you are OK with adding logical AND to your NFA.


>Why many language standard libraries use exponential algorithm is beyond my comprehension.

"Because it enables very useful features like backtracking" doesn't really sound beyond comprehension.



