I don't really like this result. It's a bit like saying finding the maximum integer in a sorted list is actually order N^2 because... what if the N ints are all BIGNUMS with N bits, huh?
The fact that you have to grow the regex* to get this result has a similar vibe to it. All the regex matching we see - and most conventional regex matching - assumes the regular expressions are fixed. Every now and then you'll see an algorithm where the size of the regex contributes to the big-O notation, but then it's usually broken out as a parameter.
I don't have a good method to actually handle a fixed number of back-references in less than exponential time, but it seems fairly clear that if you are willing to burn polynomial time gratuitously you should be able to handle a back-reference (i.e. for input of length N and M back-references, there are only O(N^(2M)) possible final places where those back-references could match for a single match... huge, but if M is fixed, polynomial). So if you had an otherwise poly-time regex algorithm you could run it over and over again, trying each possible back-ref location, and still be poly-time.
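A minimal sketch of that enumeration idea, in Python. Everything here is hypothetical: `poly_matcher` stands in for an assumed polynomial-time matcher that can check the regex once every backreferenced group is pinned to a concrete substring; the point is only that the outer loop is O(N^(2M)), hence polynomial when M is fixed.

```python
from itertools import product

def match_with_backrefs(regex, text, num_backrefs, poly_matcher):
    """Sketch of the brute-force placement idea described above.

    poly_matcher(regex, text, placement) is an assumed (hypothetical)
    polynomial-time matcher that checks `regex` against `text` once each
    backreferenced group has been pinned to the substring text[i:j]
    given by placement[k] == (i, j).
    """
    n = len(text)
    # Each backreferenced group can land on one of O(n^2) substrings,
    # so M groups give O(n^(2M)) placements: huge, but polynomial once M is fixed.
    spans = [(i, j) for i in range(n + 1) for j in range(i, n + 1)]
    for placement in product(spans, repeat=num_backrefs):
        if poly_matcher(regex, text, placement):
            return True
    return False
```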
Oops, that should have been "unsorted list". Or O(N lg N) in a sorted list.
The point is that the number of back-references is not typically something that varies as part of a regex workload, so this proof derives something that is correct in itself, but not applicable to the way most people view regex. This has led to a meme that "all regex with backtracking is NP-hard".
My rather labored analogy is that people view comparison of two numbers as O(1) and would be nonplussed if someone demanded to know whether they'd considered O(N)-sized bignums... it isn't what most people are talking about when they talk about the cost of doing arithmetic operations. Similarly, arbitrarily expanding regexes is not what most people are talking about when they talk about the cost of doing regex.
> It's a bit like saying finding the maximum integer in a sorted list is actually order N^2 because... what if the N ints are all BIGNUMS with N bits, huh?
I'm not sure I follow. Why would finding the max integer in a sorted list be anything other than the time to copy the last (or first, depending on sort order) element of the list?
Certainly a valid point, though, that this isn't a common use case, so that's just not how experts in this field think about this problem.
> there are only O(N^(2M)) possible final places where those back-references could match for a single match... huge, but if M is fixed, polynomial.
Sure, that buys you a pseudo-polynomial algorithm. I just don't like the idea of calling something like that polynomial -- if we did that across the board, then we'd have a poly solution for knapsack (and consequently demonstrate P = NP).
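For comparison, here's the textbook pseudo-polynomial knapsack DP in Python (the standard algorithm, not anything from this thread): it runs in O(n * capacity) time, which looks polynomial but is exponential in the number of bits needed to write the capacity down, which is why it doesn't settle P vs NP.

```python
def knapsack(values, weights, capacity):
    """Classic 0/1 knapsack DP, O(n * capacity) time.

    Pseudo-polynomial: polynomial in the numeric value of `capacity`,
    exponential in its bit-length as written in the input.
    """
    best = [0] * (capacity + 1)
    for value, weight in zip(values, weights):
        # Go downward so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]

print(knapsack([60, 100, 120], [10, 20, 30], 50))  # -> 220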
But... backreferences are impossible [0] to do efficiently [1] in any algorithm!
[0] Well, assuming P != NP. [1] Polynomial-time.
See: http://perl.plover.com/NPC/
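Not the 3-SAT reduction from that page, but a classic small illustration (in Python) of the extra power even one backreference brings: the pattern below matches a unary string of N ones exactly when N is composite, something no true regular expression can do.

```python
import re

# Classic example: ^(11+)\1+$ matches "1" * n exactly when n is composite,
# because the capture group has to find a nontrivial divisor of n.
composite = re.compile(r"^(11+)\1+$")

for n in range(2, 16):
    print(n, "composite" if composite.match("1" * n) else "prime")
```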