If you're not familiar with re2, you may be familiar with a couple of other little projects that the author, Russ Cox, is involved with: Go (http://golang.org/) and Plan 9 (http://swtch.com/plan9port/).
It would be interesting to see if this could make it into Python 3.2. (It's supposed to include the Unladen Swallow work, so it might as well take more awesomeness from Google.)
Probably, but python-core has no interest in maintaining multiple regular expression implementations: http://mail.python.org/pipermail/python-dev/2010-July/101606... (the thread starts out discussing regex, a backwards-compatible enhancement to re, but it also covers re2)
Yes, that's precisely what pyre2 does: it tries re2 first, and if compilation fails because re2 doesn't support the pattern's features, it falls back to Python's re. Since compilation happens rarely, this is fast.
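For anyone curious about the mechanics, here's a minimal sketch of that compile-time fallback. This is my own illustration, not pyre2's actual code, and it assumes the underlying re2 binding exposes an re-style compile() that raises an error on unsupported syntax:

    import re
    import re2  # assumption: the raw re2 binding, with re-style compile()/error

    def compile_with_fallback(pattern, flags=0):
        # Try re2 first; the failed attempt is paid once at compile
        # time, not on every match.
        try:
            return re2.compile(pattern, flags)
        except re2.error:  # e.g. backreferences or lookaround
            return re.compile(pattern, flags)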
As with lxml, which also uses Cython, the generated C++ module is distributed along with the package, so you don't actually need Cython to install or compile it. You would only need Cython to do development on it.
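For anyone packaging something similar, the usual shape of that trick in setup.py looks like this. A sketch only: the file names and the linked "re2" library are my assumptions, not pyre2's actual build script:

    from setuptools import setup, Extension

    try:
        from Cython.Build import cythonize
        # Development: regenerate the C++ from the .pyx source.
        extensions = cythonize(
            [Extension("re2", ["src/re2.pyx"], language="c++", libraries=["re2"])]
        )
    except ImportError:
        # Plain install: compile the pre-generated C++ shipped in the sdist.
        extensions = [
            Extension("re2", ["src/re2.cpp"], language="c++", libraries=["re2"])
        ]

    setup(name="re2", ext_modules=extensions)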
I tried using pyre2 and experienced runaway memory usage. I was using re.split. The behaviour went away when I switched back to the standard re module, without changing my code.
Did anyone else experience this?
The code in question was:
a, b = re.split(r"\n\s*\n", text, maxsplit=1)  # raw string; splits on the first blank line
Thank you for any insights you may have regarding this.
I'm curious: how much of a speedup does a regex engine get from using a DFA instead of an NFA? I believe most regex implementations use NFAs, but there are well-known algorithms for converting an NFA to a DFA. It seems worth a try.
The problem is that in the worst case, converting an NFA to a DFA blows the state count up by a factor of 2^m, while simulating the NFA directly only incurs an overhead of m (where m is the regex length).
So, in the worst case (i.e., under algorithmic complexity attacks), the NFA simulation runs in O(m * n), while building the full DFA up front can take O(2^m) time and space - matching is then O(n), but you've already paid for the construction. (As I understand it, re2 sidesteps the blow-up by constructing DFA states lazily and capping the memory it will spend on them.)
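To make the "complexity attack" point concrete, here's the classic pathological case against Python's backtracking engine. A quick sketch; absolute timings will vary by machine, but the exponential growth is the point:

    import re
    import time

    # (a+)+$ forces a backtracking engine to try exponentially many
    # ways of carving the run of a's into groups before giving up.
    evil = re.compile(r"(a+)+$")

    for n in (18, 20, 22, 24):
        s = "a" * n + "b"  # the trailing 'b' guarantees a failed match
        t0 = time.perf_counter()
        evil.match(s)
        print(n, round(time.perf_counter() - t0, 3))

    # The time grows exponentially in n under CPython's re; an
    # automaton-based engine like re2 handles the same pattern in
    # time linear in len(s).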
And fortunately, if re2 cannot handle one of the regexen that you throw at it, it will pass it down to Python's original re module (NFA-based). Perfectly seamless.
Both re2 and Python are NFA based. The difference, I believe, is that re2 doesn't support backtracking, which changes the worst-case time from exponential to linear in the input length. (For the snobs, that means that re2 regexes are regular expressions. Python's aren't.)
The difference is that re2 avoids backtracking in most cases where Python's normal regular expression engine would use it, and by avoiding backtracking it avoids the possibility of exponential slowdowns.
No. "Doesn't support backtracking" suggests that it doesn't support features that re backtracks for. In fact it does.
It is also technically possible for re2 to add backtracking on top of its current strategy (although it doesn't currently do this). If it did, it could offer all of re's features while still frequently being much faster, so I would not permanently rule that possibility out.
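Backreferences are the canonical missing feature: matching with them is NP-hard in general, so no pure automaton can do it, and they are exactly what a hybrid automaton-plus-backtracking engine would win back. A quick illustration with the stdlib re (that re2 rejects these is per its documentation):

    import re

    # \1 must match whatever group 1 captured earlier -- state a finite
    # automaton can't carry, so re2 rejects patterns like this and
    # pyre2 falls back to the stdlib engine.
    doubled = re.compile(r"\b(\w+)\s+\1\b")

    print(doubled.search("it happened happened again").group(1))  # happened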
Also, here's a great bit of technical history behind Russ' re2 implementation, and why pyre2 will never be completely compatible with Python's re (without fallback):
http://swtch.com/~rsc/regexp/regexp1.html
http://swtch.com/~rsc/regexp/regexp2.html
http://swtch.com/~rsc/regexp/regexp3.html