Space and time problems when using regular expressions on large data sets

Question

I have a large array of Strings (more than 200K entries) which I use to search for patterns in documents. I convert each entry in the array into a regular expression before I apply it to the document. When I do this, the time it takes to go through the array and perform the searches sequentially increases dramatically. I believe this is due to the Pattern.compile call that I apply to each regular expression in turn before performing the search. Precompiling the regular expressions may be a way around this, but I've noticed a dramatic increase in memory usage when I do so. Before precompiling, the Java application runs in a VM around 1.5 gigabytes in size. After precompiling, it runs in a VM around 14 gigabytes in size.

Is there some elegant way to get around this problem or to make the program run more efficiently?

Thank you,

Elliott

Are we talking about simple matching or about replacing/splitting/any other kind of manipulation? — Martin Ender, Oct 24 '12 at 23:19
If it's simple matching, something like a [Trie](http://en.wikipedia.org/wiki/Trie) could work; you wouldn't need the ordering property, either. It would essentially be a custom NFA. Keep a list of 'current' traces through the tree, update each based on the next character. Do note that I'm unsure of whether this is a truly viable solution or not, this is just an idea off the top of my head. — Kenogu Labz, Oct 24 '12 at 23:25
I mean, this really depends on what sorts of patterns you're looking for, and what you're parsing - for how many are you actually making use of RegEx? Even if, say, 50% of them aren't actually regular expressions, but just substrings, you save a lot of time and space from the compile operation. Perhaps line-by-line is hurting? Perhaps they can be organized into groups with some sort of simpler pattern to search for? There are so, so many unanswered questions. — FrankieTheKneeMan, Oct 24 '12 at 23:57
Simple matching. I am using regular expressions, but you make a good point. The regular expressions are only useful when I am looking for a string consisting of multiple words, for example, Benign Hypertension. The regular expression is there to capture cases where extra spaces and/or punctuation marks show up between the words. Some of the strings consist of only one word, so that is worth checking out. — Elliott, Oct 25 '12 at 03:25
I don't know how much control you have over the environment, but maybe having the shell execute something like grep (Linux, Cygwin) might be an option? — Friso, Oct 25 '12 at 09:16
Perhaps you could normalize the document text to exclude the extra whitespace and punctuation; then you just need to do simple string matching rather than regex searches. — JRideout, Nov 16 '12 at 00:18
Have you checked what the memory is being used for? Perhaps if you remove any references to the original strings, the memory usage after GC would be much lower? The idea of normalizing the searched text and the strings to match is awesome. — akostadinov, Nov 16 '12 at 15:44
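
The normalization idea from the comments can be sketched roughly as follows. This is a minimal sketch, assuming that case-insensitive matching is wanted and that any run of whitespace or ASCII punctuation may be treated as a single space. The class and method names are made up for illustration and are not from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class NormalizedMatcher {

    // Lower-cases the text and collapses every run of whitespace or ASCII
    // punctuation into one space (assumed normalization rule).
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT)
                   .replaceAll("[\\s\\p{Punct}]+", " ")
                   .trim();
    }

    // Returns the terms whose normalized form occurs in the normalized
    // document as a sequence of whole words. No regex is compiled per term.
    static List<String> termsFoundIn(List<String> terms, String document) {
        // Pad with spaces so that whole-word checks also work at both ends.
        String haystack = " " + normalize(document) + " ";
        List<String> found = new ArrayList<>();
        for (String term : terms) {
            if (haystack.contains(" " + normalize(term) + " ")) {
                found.add(term);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("Benign Hypertension", "Diabetes", "Tension");
        System.out.println(termsFoundIn(terms, "History of benign,  hypertension; no diabetes."));
    }
}
```

With 200K terms this still performs one contains() scan per term. It only removes the cost of compiling and holding a Pattern for each one. In practice the normalized terms would be computed once rather than on every call.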

score 0 · Answer 1 · answered Dec 30 '12 at 21:59

I would avoid keeping all the compiled regexes in memory. Compile them one at a time just before use, and make sure the garbage collector can reclaim the ones you have finished with. That could lower peak memory usage.
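
A minimal sketch of that approach, assuming the regexes are available as a list of strings and that the document fits in a single String. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class CompileOnDemand {

    // Returns the regex strings that occur somewhere in the document.
    static List<String> matchingRegexes(List<String> regexes, String document) {
        List<String> hits = new ArrayList<>();
        for (String regex : regexes) {
            // The compiled Pattern is referenced only by this local variable,
            // so it becomes unreachable as soon as the iteration ends.
            Pattern pattern = Pattern.compile(regex);
            if (pattern.matcher(document).find()) {
                hits.add(regex);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> regexes = Arrays.asList("Benign\\s+Hypertension", "Diabetes");
        System.out.println(matchingRegexes(regexes, "Benign   Hypertension noted."));
    }
}
```

If several documents are searched, putting the loop over documents inside the loop over regexes means each regex is compiled only once, without all of the compiled patterns being held at the same time.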

You could also, in theory, merge many regexes into a single regex using capture groups and the alternation operator (|), scan the document in a single pass, and then check which one matched by calling group().

This also has the potential benefit of letting the compile phase unify similar portions of different regexes.

This is a simplistic example assuming you are matching the whole document and not finding or replacing, just to illustrate the idea:

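import java.util.regex.Matcher;
import java.util.regex.Pattern;

String patternA = "patternA";
String patternB = "patternB";
// One capture group per original pattern, joined with the | operator
Pattern compiled = Pattern.compile(String.format("(%s)|(%s)", patternA, patternB));
Matcher matcher = compiled.matcher(input); // input: the text being tested
if (matcher.matches()) {
  // group(n) returns null when that group took no part in the match
  if (matcher.group(1) != null) {
    // patternA matched
  }
  if (matcher.group(2) != null) {
    // patternB matched
  }
}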
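
The example above uses matches(), which tests the whole input. For the case in the question, where terms may appear anywhere in a document, the same idea can be sketched with find(). The class name, the toRegex helper, and the choice of `[\s\p{Punct}]+` as the separator allowed between words are assumptions for illustration, not the asker's actual patterns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CombinedTermMatcher {
    private final List<String> terms;
    private final Pattern combined;

    CombinedTermMatcher(List<String> terms) {
        this.terms = new ArrayList<>(terms);
        StringBuilder sb = new StringBuilder();
        for (String term : this.terms) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            // One capture group per term, so group n belongs to term n - 1.
            sb.append('(').append(toRegex(term)).append(')');
        }
        this.combined = Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE);
    }

    // Hypothetical helper: quotes each word literally and allows any run of
    // whitespace or ASCII punctuation between consecutive words.
    static String toRegex(String term) {
        String[] words = term.trim().split("\\s+");
        StringBuilder sb = new StringBuilder("\\b");
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                sb.append("[\\s\\p{Punct}]+");
            }
            sb.append(Pattern.quote(words[i]));
        }
        return sb.append("\\b").toString();
    }

    // Scans the document once and reports which terms were matched.
    Set<String> termsFoundIn(String document) {
        Set<String> found = new LinkedHashSet<>();
        Matcher matcher = combined.matcher(document);
        while (matcher.find()) {
            for (int g = 1; g <= matcher.groupCount(); g++) {
                if (matcher.group(g) != null) {
                    found.add(terms.get(g - 1));
                    break; // only one alternative takes part in each match
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        CombinedTermMatcher matcher =
                new CombinedTermMatcher(Arrays.asList("Benign Hypertension", "Diabetes"));
        System.out.println(matcher.termsFoundIn("History of benign,  hypertension; no diabetes."));
    }
}
```

Two caveats apply. Each call to find() reports only one alternative per position, so a term contained inside a longer matched term is not reported separately. Whether a single alternation with 200K branches performs acceptably has not been tested here, and splitting the terms into smaller batches may be necessary.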
