Monday, March 15, 2010

In praise of Regexp::Assemble

...and of Perl modules in general. I had the following problem:

Given a list of 16-character alphanumeric IDs, find all the lines in a large-ish (~6GB) logfile which contain at least one of the IDs.

The naive approach was to construct a big regular expression like \W(\QID1\E|\QID2\E|\QID3\E...)\W and match it against every line (I needed to capture the actual ID to know which bucket to place the line in). Needless to say, as is the case with most naive approaches, it was slooooow: the process was hogging the CPU, not the disk. Searching around a little, I found Regexp::Optimizer and Regexp::Assemble. Of the two, the latter seemed the more mature, so after quickly installing it from CPAN I put it into my code, and it now runs at the “speed of the disk”. W00t! Perl + CPAN + clever modules rock!
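For illustration, here is a minimal sketch of how the assembled regex can be dropped into the same \W(...)\W shape. It is not the original script: the IDs, the sample log lines and the helper names build_matcher and bucket_lines are all made up for the example, and it assumes Regexp::Assemble is installed from CPAN. Only the new, add and re methods of the module are used.

```perl
use strict;
use warnings;
use Regexp::Assemble;

# Build one regex matching any of the given IDs, capturing the ID in $1.
sub build_matcher {
    my @ids = @_;
    my $ra  = Regexp::Assemble->new;
    $ra->add($_) for @ids;          # alphanumeric IDs need no escaping
    my $alternation = $ra->re;      # shared prefixes are merged here
    return qr/\W($alternation)\W/;  # same shape as the naive pattern
}

# Group the lines from a filehandle by the first ID found on each line.
sub bucket_lines {
    my ($fh, $matcher) = @_;
    my %bucket;
    while (my $line = <$fh>) {
        push @{ $bucket{$1} }, $line if $line =~ $matcher;
    }
    return \%bucket;
}

# Hypothetical IDs and log lines, for illustration only.
my @ids = qw(A1B2C3D4E5F6G7H8 A1B2C3D4E5F6G7H9 ZZ00YY11XX22WW33);
my $log = "GET /item A1B2C3D4E5F6G7H8 ok\n"
        . "GET /item 0000000000000000 ok\n"
        . "PUT /item ZZ00YY11XX22WW33 ok\n";

# Read the sample through an in-memory filehandle; a real run would open the logfile.
open my $fh, '<', \$log or die "cannot open in-memory log: $!";
my $bucket = bucket_lines($fh, build_matcher(@ids));
printf "%s: %d line(s)\n", $_, scalar @{ $bucket->{$_} } for sort keys %$bucket;
```

As with the naive pattern, only the first ID on a line is captured, and an ID sitting at the very start of a line is missed because \W needs a character to match.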

PS. A little benchmark data (take it with a grain of salt, since most of the time you should be profiling, not benchmarking):

  • Unoptimized regex size: 873 427 characters
  • Optimized regex size: 69 536 characters
  • Unoptimized regex match time over 380 MB of data: ~1.9 hours (a throughput of ~58 KB/sec, well below disk speed)
  • Optimized regex match time over the same 380 MB of data: 2 sec (throughput: 190 MB/sec !!!)

How cool is this?

