12

Given a String containing a comma-delimited list of proper noun and category/description pairs, what are the pros and cons of using String.split() versus a Pattern & Matcher approach to find a particular proper noun and extract its associated category/description?

The haystack String format will not change. It will always contain comma-delimited data in the form PROPER_NOUN|CATEGORY/DESCRIPTION.

Common variables for both approaches:

String haystack="EARTH|PLANET/COMFORTABLE,MARS|PLANET/HARDTOBREATHE,PLUTO|DWARF_PLANET/FARAWAY";
String needle="PLUTO";
String result=null;

Using String.split():

for (String current : haystack.split(","))
    if (current.contains(needle))
    {
        // split() takes a regex, so the pipe has to be escaped
        result=current.split("\\|")[1];
        break; // *edit* Not part of original code - added in response to comment from Pshemo
    }

Using Pattern & Matcher:

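// requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
// group(1) is "needle|", group(2) is the category/description
Pattern pattern = Pattern.compile("(" + needle + "\\|)(\\w+/\\w+)");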
Matcher matches = pattern.matcher(haystack);

if (matches.find())
    result=matches.group(2);

Both approaches provide the information I require.

I'm wondering if any reason exists to choose one over the other. I am not currently using Pattern & Matcher within my project, so this approach will require imports from java.util.regex.

And, of course, if there is an objectively 'better' way to parse the information I will welcome your input.

Thank you for your time!

Conclusion

I've opted for the Pattern/Matcher approach. While a little tricky to read w/the regex, it is faster than .split()/.contains()/.split() and, more importantly to me, captures the first match only.

For what it is worth, here are the results of my imperfect benchmark tests, in nanoseconds, after 100,000 iterations:

.split()/.contains()/.split()

304,212,973

Pattern/Matcher w/ Pattern.compile() invoked for each iteration

230,511,000

Pattern/Matcher w/Pattern.compile() invoked prior to iteration

111,545,646
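
For anyone wanting to repeat a comparison like this, here is a rough sketch of how the three variants could be timed. It is not the harness that produced the numbers above; the class and method names are hypothetical, and Pattern.quote() is added around the needle as a precaution. A naive loop like this is sensitive to JIT warm-up, so treat its output as indicative only; a dedicated harness such as JMH would be more trustworthy.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical timing sketch; not the original benchmark code.
class LookupTiming {
    static final String HAYSTACK =
        "EARTH|PLANET/COMFORTABLE,MARS|PLANET/HARDTOBREATHE,PLUTO|DWARF_PLANET/FARAWAY";
    static final String NEEDLE = "PLUTO";
    static final int ITERATIONS = 100_000;

    static Pattern compile() {
        return Pattern.compile("(" + Pattern.quote(NEEDLE) + "\\|)(\\w+/\\w+)");
    }

    // Same logic as the split()/contains()/split() snippet in the question.
    static String viaSplit() {
        for (String current : HAYSTACK.split(",")) {
            if (current.contains(NEEDLE)) {
                return current.split("\\|")[1];
            }
        }
        return null;
    }

    static String viaPattern(Pattern pattern) {
        Matcher matcher = pattern.matcher(HAYSTACK);
        return matcher.find() ? matcher.group(2) : null;
    }

    // The running length sum uses each result so the work cannot be skipped.
    static long timeSplit() {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < ITERATIONS; i++) {
            sink += viaSplit().length();
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 0) throw new IllegalStateException("no matches found");
        return elapsed;
    }

    static long timePattern(boolean compileEachIteration) {
        Pattern shared = compile();
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < ITERATIONS; i++) {
            Pattern pattern = compileEachIteration ? compile() : shared;
            sink += viaPattern(pattern).length();
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 0) throw new IllegalStateException("no matches found");
        return elapsed;
    }

    public static void main(String[] args) {
        System.out.println("split/contains/split:       " + timeSplit() + " ns");
        System.out.println("Pattern compiled each time: " + timePattern(true) + " ns");
        System.out.println("Pattern compiled once:      " + timePattern(false) + " ns");
    }
}
```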

IdusOrtus
  • 3
    Just a small comment: if you're constructing a pattern manually from user input, always use `Pattern.quote()` to escape the string. – biziclop Jul 17 '14 at 21:23
  • 1
    The only advantage of the `Pattern/Matcher` solution is that it stops scanning your input as soon as it finds `needle|\w+/\w+`, while `split(",")` iterates over the entire input and then you iterate again until you find a string which contains `needle`. I am not sure `contains` is the right method here, unless you are sure that the searched `noun` will never appear as part of a `category/description` pair. – Pshemo Jul 17 '14 at 21:31
  • @biziclop: Thanks for the .quote() tip, I wasn't aware of that method. – IdusOrtus Jul 17 '14 at 22:21
  • @Pshemo: Thank you VERY much for the contains() callout. There shouldn't be duplicate nouns, but I work with other humans and we trend away from infallibility. If I go with the .split() route I'll include a 'break;' – IdusOrtus Jul 17 '14 at 22:23
  • Note that your two implementations behave differently on some input strings. For example, if the needle appears on the right side of the `|`, or if the left side does not contain a `/`, the `String.split()`-based implementation will accept it but not the `Pattern` implementation. – augurar Mar 18 '16 at 01:23
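
The comments above raise two points: quote the needle before building a regex from it, and compare the whole noun rather than relying on contains(). A minimal sketch of both lookups with those changes applied follows. The class name `NounLookup` and its method names are made up for illustration, and the `(?:^|,)` anchor is one possible way to pin the needle to the start of an entry.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
class NounLookup {

    // split()-based lookup that compares the whole noun instead of using contains().
    static String findWithSplit(String haystack, String needle) {
        for (String entry : haystack.split(",")) {
            String[] parts = entry.split("\\|", 2); // [noun, category/description]
            if (parts.length == 2 && parts[0].equals(needle)) {
                return parts[1];
            }
        }
        return null;
    }

    // Regex-based lookup; Pattern.quote() treats the needle as a literal,
    // and (?:^|,) requires it to start an entry.
    static String findWithRegex(String haystack, String needle) {
        Pattern pattern = Pattern.compile(
            "(?:^|,)" + Pattern.quote(needle) + "\\|(\\w+/\\w+)");
        Matcher matcher = pattern.matcher(haystack);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String haystack =
            "EARTH|PLANET/COMFORTABLE,MARS|PLANET/HARDTOBREATHE,PLUTO|DWARF_PLANET/FARAWAY";
        System.out.println(findWithSplit(haystack, "PLUTO"));
        System.out.println(findWithRegex(haystack, "PLUTO"));
        // "PLANET" only occurs on the right of a '|', so neither lookup matches it.
        System.out.println(findWithSplit(haystack, "PLANET"));
        System.out.println(findWithRegex(haystack, "PLANET"));
    }
}
```

The two methods still differ on malformed entries: the split() version accepts a right-hand side without a `/`, while the regex version does not.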

3 Answers

15

In a small case such as this, it won't matter that much. However, if you have extremely large strings, it may be beneficial to use Pattern/Matcher directly.

Most String methods that take regular expressions (such as matches(), split(), replaceAll(), etc.) use Pattern/Matcher internally. Each call therefore generally compiles a new Pattern and creates a new Matcher, which is inefficient inside a large loop.

Thus if you really want speed, you can use Matcher/Pattern directly and ideally only create a single Matcher object.
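
As a rough illustration of that idea, the sketch below compiles one needle-independent Pattern and reuses a single Matcher through reset(). The class name, field names, and the entry regex are assumptions made for this example, not something taken from the question. Note that a Matcher is not thread-safe, so an instance like this should not be shared between threads.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class showing one compiled Pattern and one reused Matcher.
class PrecompiledLookup {
    // Compiled once; matches a single NOUN|CATEGORY/DESCRIPTION entry.
    private static final Pattern ENTRY = Pattern.compile("([^,|]+)\\|(\\w+/\\w+)");

    // Created once and pointed at new input with reset().
    private final Matcher matcher = ENTRY.matcher("");

    String find(String haystack, String needle) {
        matcher.reset(haystack);
        while (matcher.find()) {
            // group(1) is the noun, group(2) the category/description
            if (matcher.group(1).equals(needle)) {
                return matcher.group(2);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String haystack =
            "EARTH|PLANET/COMFORTABLE,MARS|PLANET/HARDTOBREATHE,PLUTO|DWARF_PLANET/FARAWAY";
        PrecompiledLookup lookup = new PrecompiledLookup();
        System.out.println(lookup.find(haystack, "PLUTO"));
        System.out.println(lookup.find(haystack, "MARS"));
    }
}
```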

Xinzz
  • Thanks Xinzz! This answer captures what I was looking for, and does it succinctly. +1 @Pshemo, though, for the potential pitfalls of the way in which I was implementing contains(). – IdusOrtus Jul 18 '14 at 16:12
1

I would say that the split() version is much better here for the following reasons:

  • The split() code is very clear, and it is easy to see what it does. The regex version demands much more analysis.
  • Regular expressions are more complex, and therefore the code becomes more error-prone.
Keppil
  • I fully agree that .split()/.contains()/.split() is the more legible of the two options, and that some people are regexaphobes & see how regex usage can cause problems. I am a little uncertain on your third point, though. After reading @Xinzz's answer I ran some very rudimentary benchmarks and Pattern/Matcher was faster, even with instantiating multiple Matcher objects. In terms of cost do you mean CPU usage? – IdusOrtus Jul 17 '14 at 22:49
  • @Idus: You are probably right. I have been thinking about this a bit more, and the third point should be removed. It is the least significant to consider anyway. – Keppil Jul 18 '14 at 04:37
  • The parameter to `String.split()` is also a regex. – augurar Mar 18 '16 at 01:21
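
A small sketch of what that last comment means in practice: because the argument is a regex, an unescaped `|` is an empty alternation rather than a literal pipe. The class and method names below are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class SplitIsRegex {
    // "\\|" is the regex \| and matches a literal pipe.
    static String[] splitEscaped(String entry) {
        return entry.split("\\|");
    }

    // Pattern.quote() is another way to get a literal delimiter.
    static String[] splitQuoted(String entry) {
        return entry.split(Pattern.quote("|"));
    }

    // "|" alone matches the empty string, so the input is cut between characters.
    static String[] splitUnescaped(String entry) {
        return entry.split("|");
    }

    public static void main(String[] args) {
        String entry = "PLUTO|DWARF_PLANET/FARAWAY";
        System.out.println(Arrays.toString(splitEscaped(entry)));
        System.out.println(Arrays.toString(splitQuoted(entry)));
        System.out.println(Arrays.toString(splitUnescaped(entry)));
    }
}
```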
1

There are no advantages to using pattern/matcher in cases where the manipulation to be done is as simple as this.

You can look at String.split() as a convenience method that leverages many of the same functionalities you use when you use a pattern/matcher directly.

When you need to do more complex matching/manipulation, use a pattern/matcher, but when String.split() meets your needs, the obvious advantage to using it is that it reduces code complexity considerably - and I can think of no good reason to pass this advantage up.

drew moore