Java - Parsing strings - String.split() versus Pattern & Matcher

Question

Given a String containing a comma-delimited list of proper noun & category/description pairs, what are the pros & cons of using String.split() versus a Pattern & Matcher approach to find a particular proper noun and extract the associated category/description pair?

The haystack String format will not change. It will always contain comma-delimited data in the form PROPER_NOUN|CATEGORY/DESCRIPTION.

Common variables for both approaches:

String haystack="EARTH|PLANET/COMFORTABLE,MARS|PLANET/HARDTOBREATHE,PLUTO|DWARF_PLANET/FARAWAY";
String needle="PLUTO";
String result=null;

Using String.split():

for (String current : haystack.split(","))
    if (current.contains(needle))
    {
        result=current.split("\\|")[1];
        break; // *edit* Not part of original code - added in response to comment from Pshemo
    }

Using Pattern & Matcher:

Pattern pattern = Pattern.compile("(" +needle+ "\\|)(\\w+/\\w+)");
Matcher matches = pattern.matcher(haystack);

if (matches.find())
    result=matches.group(2);

Both approaches provide the information I require.

I'm wondering if any reason exists to choose one over the other. I am not currently using Pattern & Matcher within my project so this approach will require imports from java.util.regex

And, of course, if there is an objectively 'better' way to parse the information I will welcome your input.

Thank you for your time!

Conclusion

I've opted for the Pattern/Matcher approach. While a little tricky to read w/the regex, it is faster than .split()/.contains()/.split() and, more importantly to me, captures the first match only.

For what it is worth, here are the results of my imperfect benchmark tests, in nanoseconds, after 100,000 iterations:

.split()/.contains()/.split(): 304,212,973
Pattern/Matcher w/ Pattern.compile() invoked for each iteration: 230,511,000
Pattern/Matcher w/ Pattern.compile() invoked prior to iteration: 111,545,646

Just a small comment: if you're constructing a pattern manually from user input, always use `Pattern.quote()` to escape the string. — biziclop, Jul 17 '14 at 21:23
The only advantage of the `Pattern/Matcher` solution is that it stops iterating over your input as soon as it finds `needle|\w+/\w+`, while `split(",")` iterates over the entire input, and your loop then iterates again until it finds a string that contains `needle`. I am not sure `contains` is the right method here, unless you are sure that the searched `noun` will never appear as part of a `category/description` pair. – Pshemo, Jul 17 '14 at 21:31
@biziclop: Thanks for the .quote() tip, I wasn't aware of that method. — IdusOrtus, Jul 17 '14 at 22:21
@Pshemo: Thank you VERY much for the contains() callout. There shouldn't be duplicate nouns, but I work with other humans and we trend away from infallibility. If I go the .split() route I'll include a 'break;' – IdusOrtus, Jul 17 '14 at 22:23
Note that your two implementations behave differently on some input strings. For example, if the needle appears on the right side of the `|`, or if the left side does not contain a `/`, the `String.split()`-based implementation will accept it but not the `Pattern` implementation. — augurar, Mar 18 '16 at 01:23
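
The behavioral difference noted in the comments can be reproduced with a small sketch. The class and method names (`NeedleDemo`, `viaSplit`, `viaPattern`) and the test input are made up for illustration; the two lookups mirror the question's code, except that the regex version escapes the needle with `Pattern.quote()` as suggested above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original question.
class NeedleDemo {
    // The question's split()/contains()/split() lookup, wrapped in a method.
    static String viaSplit(String haystack, String needle) {
        for (String current : haystack.split(",")) {
            if (current.contains(needle)) {
                return current.split("\\|")[1];
            }
        }
        return null;
    }

    // The question's regex lookup, with the needle escaped via Pattern.quote().
    static String viaPattern(String haystack, String needle) {
        Pattern pattern = Pattern.compile("(" + Pattern.quote(needle) + "\\|)(\\w+/\\w+)");
        Matcher matches = pattern.matcher(haystack);
        return matches.find() ? matches.group(2) : null;
    }

    public static void main(String[] args) {
        // Made-up input: "PLANET" occurs only to the right of the '|'.
        String haystack = "EARTH|PLANET/COMFORTABLE,MARS|PLANET/HARDTOBREATHE";
        System.out.println(viaSplit(haystack, "PLANET"));   // PLANET/COMFORTABLE
        System.out.println(viaPattern(haystack, "PLANET")); // null
    }
}
```

Note that the regex version is still not an exact match on the noun: it has no anchor before the needle, so a needle of `ARS` would match inside `MARS|`.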

score 15 · Accepted Answer · answered Jul 17 '14 at 21:13

In a small case such as this, it won't matter that much. However, if you have extremely large strings, it may be beneficial to use Pattern/Matcher directly.

Most string functions that use regular expressions (such as matches(), split(), replaceAll(), etc.) make use of Matcher/Pattern directly. Thus they create a Matcher object on every call, causing inefficiency when used in a large loop.

Thus if you really want speed, you can use Matcher/Pattern directly and ideally only create a single Matcher object.
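
As a rough sketch of that advice, the pattern below is compiled once and reused for every lookup. `CategoryLookup`, `ENTRY`, and `categoryOf` are hypothetical names. Since the question builds the needle into the regex, this version instead assumes a generic pattern that captures every noun and compares it to the needle in Java code; that is a design change for illustration, not the question's approach, and it assumes nouns consist only of `\w` characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
class CategoryLookup {
    // Compiled once; matches each NOUN|CATEGORY/DESCRIPTION entry.
    private static final Pattern ENTRY = Pattern.compile("(\\w+)\\|(\\w+/\\w+)");

    static String categoryOf(String haystack, String needle) {
        Matcher matcher = ENTRY.matcher(haystack);
        // Walk the entries left to right and stop at the first exact noun match.
        while (matcher.find()) {
            if (matcher.group(1).equals(needle)) {
                return matcher.group(2);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String haystack = "EARTH|PLANET/COMFORTABLE,MARS|PLANET/HARDTOBREATHE,PLUTO|DWARF_PLANET/FARAWAY";
        System.out.println(categoryOf(haystack, "PLUTO")); // DWARF_PLANET/FARAWAY
    }
}
```

A Matcher can also be reused across inputs with `reset(CharSequence)`, but a Matcher instance is not safe to share between threads, whereas a compiled Pattern is.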

Thanks Xinzz! This answer captures what I was looking for, and does it succinctly. +1 @Pshemo, though, for the potential pitfalls of the way in which I was implementing contains(). — IdusOrtus, Jul 18 '14 at 16:12

score 1 · Answer 2 · Keppil · 2014-07-18T04:38:11.817

I would say that the split() version is much better here due to the following reasons:

The split() code is very clear, and it is easy to see what it does. The regex version demands much more analysis.
Regular expressions are more complex, and therefore the code becomes more error-prone.

edited Jul 18 '14 at 04:38

answered Jul 17 '14 at 21:09


I fully agree that .split()/.contains()/.split() is the more legible of the two options, and that some people are regexaphobes & see how regex usage can cause problems. I am a little uncertain on your third point, though. After reading @Xinzz's answer I ran some very rudimentary benchmarks and Pattern/Matcher was faster, even with instantiating multiple Matcher objects. In terms of cost do you mean CPU usage? – IdusOrtus Jul 17 '14 at 22:49
@Idus: You are probably right. I have been thinking about this a bit more, and the third point should be removed. It is the least significant to consider anyway. – Keppil Jul 18 '14 at 04:37
The parameter to `String.split()` is also a regex. – augurar Mar 18 '16 at 01:21
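
To illustrate that last comment, here is a minimal sketch of why the question's code escapes the pipe as `"\\|"`. The class and method names (`SplitRegexDemo`, `splitEscaped`, `splitUnescaped`) are hypothetical.

```java
import java.util.Arrays;

// Hypothetical demo class showing that split() interprets its argument as a regex.
class SplitRegexDemo {
    // Escaped: '|' is treated as a literal pipe character.
    static String[] splitEscaped(String entry) {
        return entry.split("\\|");
    }

    // Unescaped: "|" is a regex alternation of two empty alternatives,
    // so it matches the empty string and splits between characters.
    static String[] splitUnescaped(String entry) {
        return entry.split("|");
    }

    public static void main(String[] args) {
        String entry = "PLUTO|DWARF_PLANET/FARAWAY";
        System.out.println(Arrays.toString(splitEscaped(entry)));   // [PLUTO, DWARF_PLANET/FARAWAY]
        System.out.println(Arrays.toString(splitUnescaped(entry))); // many single-character pieces
    }
}
```

`Pattern.quote("|")` would work in place of the manual escape.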

score 1 · Answer 3 · answered Jul 17 '14 at 21:09

There are no advantages to using pattern/matcher in cases where the manipulation to be done is as simple as this.

You can look at String.split() as a convenience method that leverages many of the same functionalities you use when you use a pattern/matcher directly.

When you need to do more complex matching/manipulation, use a pattern/matcher, but when String.split() meets your needs, the obvious advantage to using it is that it reduces code complexity considerably - and I can think of no good reason to pass this advantage up.
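
If the split() route is kept, the contains() pitfall Pshemo raised in the question's comments can be avoided by comparing the noun exactly. The following is only a sketch under that assumption; `SplitLookup` and `categoryOf` are hypothetical names, and entries without a `|` are simply skipped.

```java
// Hypothetical helper; a split()-based lookup that matches the noun exactly.
class SplitLookup {
    static String categoryOf(String haystack, String needle) {
        for (String entry : haystack.split(",")) {
            int bar = entry.indexOf('|');
            // Compare only the text left of the '|' against the needle.
            if (bar >= 0 && entry.substring(0, bar).equals(needle)) {
                return entry.substring(bar + 1);
            }
        }
        return null; // no entry with that exact noun
    }

    public static void main(String[] args) {
        String haystack = "EARTH|PLANET/COMFORTABLE,MARS|PLANET/HARDTOBREATHE,PLUTO|DWARF_PLANET/FARAWAY";
        System.out.println(categoryOf(haystack, "PLUTO"));  // DWARF_PLANET/FARAWAY
        System.out.println(categoryOf(haystack, "PLANET")); // null
    }
}
```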
