Thursday, December 11, 2008

The big Java regex shootout

I recently discovered that the built-in Java regex library has problems with some expressions, so I set out to find alternatives.

Searching for regex benchmarks, I found the following page: Java Regular expression library benchmarks (it also has an older version). The original IBM article also contains a benchmark. However, both of these resources are a little dated, so I decided to redo the benchmark. The results are below. I give only relative times, because the absolute times are irrelevant:

Package                                  Version  Failures  Relative time
java.util.regex.*                        1.6      0         6
dk.brics.automaton.*                     1.7.2    3         1
gnu.regexp.RE                            1.1.4    0         175
jregex.*                                 1.2.01   0         5
com.karneim.util.collection.regex.*      1.1.1    3         2
org.apache.regexp.*                      1.5      0         100
com.stevenrbrandt.ubiq2.v10.pattwo.*     -        0         176
kmy.regex.util.*                         0.1.2    5         2

How to read the table? The failures column counts the tests in which the library either (a) threw an exception or (b) failed to match the strings correctly. Libraries with failures will show shorter times, because they effectively skipped some tests.
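To make the two kinds of failure concrete, here is a minimal sketch of how such a count could be taken. It is not the benchmark's actual code: the class name `FailureCount`, the `Case` helper and the sample patterns are made up for illustration, and it only exercises java.util.regex.

```java
import java.util.regex.Pattern;

public class FailureCount {
    /** Hypothetical test case: a pattern, an input, and whether the whole input should match. */
    static final class Case {
        final String regex;
        final String input;
        final boolean expected;

        Case(String regex, String input, boolean expected) {
            this.regex = regex;
            this.input = input;
            this.expected = expected;
        }
    }

    /** Counts the cases where the engine throws or gives the wrong answer. */
    static int countFailures(Case[] cases) {
        int failures = 0;
        for (Case c : cases) {
            try {
                boolean actual = Pattern.matches(c.regex, c.input);
                if (actual != c.expected) {
                    failures++; // (b) wrong match result
                }
            } catch (RuntimeException e) {
                failures++; // (a) the library threw an exception
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        Case[] cases = {
            new Case("a*b", "aaab", true),
            new Case("(\\w+) \\1", "hello hello", true), // uses a backreference
            new Case("[0-9]+", "12a", false),
            new Case("(unclosed", "x", false) // invalid syntax, so compiling throws
        };
        System.out.println(countFailures(cases) + " failure(s)");
    }
}
```

A test that throws during compilation never reaches the matching step, which is why a library with failures can look faster than it really is.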

My conclusion: the built-in library is very good (and widely available), so try to stick with it. Also, porting regular expressions between engines can be very tricky, even if they use only a few of the more "exotic" features (like backreferences). The more such features you use, the smaller your chance of swapping out the regex library without running into problems. The best safeguard is a set of unit tests confirming that you match / reject what you intend.
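As a rough illustration of such tests, the sketch below checks a backreference pattern against inputs it should accept and reject. The pattern, the class name `DoubledWordCheck` and the sample strings are my own examples, not taken from the benchmark, and it uses plain checks instead of a test framework to stay self-contained.

```java
import java.util.regex.Pattern;

public class DoubledWordCheck {
    // Example pattern (an assumption for this sketch): a word, whitespace,
    // then the same word again, matched through the backreference \1.
    static final Pattern DOUBLED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    static boolean hasDoubledWord(String text) {
        return DOUBLED.matcher(text).find();
    }

    // Throws unconditionally, unlike the assert keyword, which is off by default.
    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        // Inputs that should match
        check(hasDoubledWord("the the cat"), "should match a doubled word");
        check(hasDoubledWord("it is  is fine"), "should match across several spaces");

        // Inputs that should be rejected
        check(!hasDoubledWord("the then"), "should not match a mere prefix");
        check(!hasDoubledWord("hello world"), "should not match different words");

        System.out.println("all checks passed");
    }
}
```

Running the same set of checks against a second engine (after adapting the API calls) is a quick way to find out whether it treats backreferences and word boundaries the same way.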

Update: download the source code for the benchmark here (available under the GPL v3 license).

