John Fremlin's blog: RE2 vs cl-irregsexp performance

Posted 2010-03-17 00:00:00 GMT

Russ Cox wrote a series of articles on regexps and then released RE2. The idea is to make a general regexp engine with a time and space complexity that does not grow horribly when confronted with a pathological string/regexp combination.

This interested me as when I made cl-irregsexp I found that the performance of competing regexp engines varied wildly. The best was generally the Perl engine which has some very good special case paths.

Years ago, when I released cl-irregsexp I wanted a good headline. I looked a small example that would keep in the fast case for cl-irregsexp, but which would trip up other implementations. I ended up with searching for either indecipherable or undecipherable &mdash indecipherable|undecipherable. Funnily, in many implementations this is slower than searching for one string then the other.

If the freshly released RE2 lives up to its interesting and useful claim that complex regexps combined with large strings cannot cause it to blow up, then it should easily handle this straightforward case.

ImplementationTime (s), smaller is better
ruby.rb.41.18
python.py.36.49
pcre.32.12
perl.pl.24.55
cl-irregsexp.clozure.12.02
re2.10.09
cl-irregsexp.sbcl.6.24

And it does, and handles it well, coming second after cl-irregsexp — taking only 60% more time than cl-irregsexp with SBCL 1.0.36.13 and actually significantly faster than cl-irregsexp on ClozureCL 1.4. This is the one thing that cl-irregsexp specialises in (it doesn't do anything else very well) and it's nice to see a decent competitor emerge.

Combining the silly implementation tricks in cl-irregsexp with Russ Cox's general idea for dealing with regexps as DFA and then compiling them to native code via SBCL (if only computed gotos were added) could make something better than either without being massively complex. The avoidance of exponential blow-up would be nice too. But generally for most practical use, the Perl engine's tuned but unnecessarily self-limiting toolchest will be hard to beat.

You should check out irregexp (http://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html) used in chrome. A quick test on the js console of chrome shows it at 1.43ms vs 1.99 for cl-irregsexp. It was linked from the re2 page. I bumped the iterations of both up by a factor of 10 and reproduced this as well since the times were getting so small.

Posted 2011-01-13 22:26:55 GMT by Anonymous from 209.234.187.110

Post a comment