If you're not familiar with re2, you may be familiar with a couple of other little projects that the author, Russ Cox, is involved with: Go (http://golang.org/) and Plan 9 (http://swtch.com/plan9port/).
It would be interesting to see if this could make it into Python 3.2. (It's supposed to include the Unladen Swallow work, so it might as well take more awesomeness from Google.)
Probably, but python-core has no interest in maintaining multiple regular expression implementations: http://mail.python.org/pipermail/python-dev/2010-July/101606... (the thread starts out discussing regex, a backwards-compatible enhancement to re, but it also covers re2)
Yes, that's precisely what pyre2 does: it tries re2 first, and if compilation fails because re2 doesn't support the pattern's features, it falls back to Python's re. Since compilation happens rarely, this is fast.
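For anyone curious about the mechanics, here's a minimal sketch of that compile-time fallback. This is my own illustration, not pyre2's actual code, and it assumes the underlying re2 binding exposes an re-style compile() that raises an error on unsupported syntax:

    import re
    import re2  # assumption: the raw re2 binding, with re-style compile()/error

    def compile_with_fallback(pattern, flags=0):
        # Try re2 first; the failed attempt is paid once at compile
        # time, not on every match.
        try:
            return re2.compile(pattern, flags)
        except re2.error:  # e.g. backreferences or lookaround
            return re.compile(pattern, flags)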
As with lxml, which also uses Cython, the generated C++ module is distributed along with the package, so you don't actually need Cython to install or compile it. You would only need Cython to do development on it.
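For anyone packaging something similar, the usual shape of that trick in setup.py looks like this. A sketch only: the file names and the linked "re2" library are my assumptions, not pyre2's actual build script:

    from setuptools import setup, Extension

    try:
        from Cython.Build import cythonize
        # Development: regenerate the C++ from the .pyx source.
        extensions = cythonize(
            [Extension("re2", ["src/re2.pyx"], language="c++", libraries=["re2"])]
        )
    except ImportError:
        # Plain install: compile the pre-generated C++ shipped in the sdist.
        extensions = [
            Extension("re2", ["src/re2.cpp"], language="c++", libraries=["re2"])
        ]

    setup(name="re2", ext_modules=extensions)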
I tried using pyre2 and experienced runaway memory usage. I was using re.split. The behaviour went away when I switched back to the standard re module, without changing my code.
Did anyone else experience this?
The code in question was:
a, b = re.split(r"\n\s*\n", text, maxsplit=1)  # raw string; splits on the first blank line
Thank you for any insights you may have regarding this.
I'm curious: how much of a speedup does a regex engine get from using a DFA instead of an NFA? I believe most regex implementations use NFAs, but there are well-known algorithms for converting an NFA to a DFA. It seems worth a try.
The problem is that in the worst case, converting an NFA to a DFA blows the state count up by a factor of 2^m, while simulating the NFA directly only incurs an overhead of m (where m is the regex length).
So, in the worst case (i.e., under algorithmic complexity attacks), the NFA simulation runs in O(m * n), while building the full DFA up front can take O(2^m) time and space - matching is then O(n), but you've already paid for the construction. (As I understand it, re2 sidesteps the blow-up by constructing DFA states lazily and capping the memory it will spend on them.)
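To make the "complexity attack" point concrete, here's the classic pathological case against Python's backtracking engine. A quick sketch; absolute timings will vary by machine, but the exponential growth is the point:

    import re
    import time

    # (a+)+$ forces a backtracking engine to try exponentially many
    # ways of carving the run of a's into groups before giving up.
    evil = re.compile(r"(a+)+$")

    for n in (18, 20, 22, 24):
        s = "a" * n + "b"  # the trailing 'b' guarantees a failed match
        t0 = time.perf_counter()
        evil.match(s)
        print(n, round(time.perf_counter() - t0, 3))

    # The time grows exponentially in n under CPython's re; an
    # automaton-based engine like re2 handles the same pattern in
    # time linear in len(s).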
And fortunately, if re2 cannot handle one of the regexen that you throw at it, it will pass it down to Python's original re module (NFA-based). Perfectly seamless.
Both re2 and Python are NFA based. The difference, I believe, is that re2 doesn't support backtracking, which changes the worst-case time from exponential to linear in the input length. (For the snobs, that means that re2 regexes are regular expressions. Python's aren't.)
The difference is that re2 avoids backtracking in most cases where Python's normal regular expression engine would use it, and by avoiding backtracking it avoids the possibility of exponential slowdowns.
No. "Doesn't support backtracking" suggests that it doesn't support features that re backtracks for. In fact it does.
It is also technically possible for re2 to add backtracking on top of its current strategy (although it doesn't currently do this). If it did, it could offer all of re's features while still frequently being much faster, so I would not permanently rule that possibility out.
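Backreferences are the canonical missing feature: matching with them is NP-hard in general, so no pure automaton can do it, and they are exactly what a hybrid automaton-plus-backtracking engine would win back. A quick illustration with the stdlib re (that re2 rejects these is per its documentation):

    import re

    # \1 must match whatever group 1 captured earlier -- state a finite
    # automaton can't carry, so re2 rejects patterns like this and
    # pyre2 falls back to the stdlib engine.
    doubled = re.compile(r"\b(\w+)\s+\1\b")

    print(doubled.search("it happened happened again").group(1))  # happened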
Also, here's a great bit of technical history behind Russ' re2 implementation, and why pyre2 will never be completely compatible with Python's re (without fallback):
http://swtch.com/~rsc/regexp/regexp1.html
http://swtch.com/~rsc/regexp/regexp2.html
http://swtch.com/~rsc/regexp/regexp3.html