Given a string like

Prefix without commas, remainder with optional suffix (optional suffix)

what would be the best Java regex to match and extract 3 parts of the string in one pass?

  1. The prefix up to the first comma
  2. The remainder up to the left parenthesis
  3. The suffix within the parenthesis

For the above example, the 3 groups (within quotes) would be

  1. "Prefix without commas"
  2. "remainder with optional suffix"
  3. " (optional suffix)"

All 3 parts of the string are of variable length. The "remainder" part may itself contain commas and parentheses. The optional suffix, if present, consists of optional leading space(s), a left parenthesis, zero or more characters, a right parenthesis and optional trailing spaces, followed by end-of-line.

Trying something like

([^,]*),(.*)(\s*\(.*\))?

only yields groups 1 and 2, with the suffix ending up at the end of group 2 and group 3 left empty.

PNS
  • The (.*) after the comma will gobble up everything after the comma into group 2. (.*?) may work, although I haven't tried it. – blm Sep 24 '15 at 21:33
  • It won't work if the "remainder" part has a left parenthesis. We are actually searching for the last left parenthesis, if any. – PNS Sep 24 '15 at 21:36
  • Not working, it places group 3 at the end of group 2 and gives null for group 3. – PNS Sep 24 '15 at 21:43
  • Oops, sorry, missed that. – blm Sep 24 '15 at 21:45
  • No worries. Also, you form the 2nd group on the assumption that there can be no parentheses there, which is not necessarily the case. – PNS Sep 24 '15 at 21:54
  • I'll post my solution as an answer with a demo – Mariano Sep 24 '15 at 21:59

3 Answers

([^,]*),(.*)(\s*\(.*\))?

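The reason this fails is that the greedy (.*) consumes everything up to the end of the line, and because the third group is optional, the regex already succeeds with ([^,]*),(.*) and never needs to backtrack to hand the suffix over to group 3.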
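
As a quick illustration (a sketch added for clarity; the class and method names are hypothetical), running the pattern from the question against the sample string shows the suffix landing in group 2 while group 3 stays null:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyFailureDemo {
    // The pattern from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile("([^,]*),(.*)(\\s*\\(.*\\))?");

    // Returns groups 1..3 of the first match, or null if nothing matches.
    static String[] groups(String input) {
        Matcher m = ORIGINAL.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] g = groups(
            "Prefix without commas, remainder with optional suffix (optional suffix)");
        System.out.println("1: " + g[0]);
        System.out.println("2: " + g[1]); // includes " (optional suffix)"
        System.out.println("3: " + g[2]); // null, the optional group was skipped
    }
}
```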

To get this to work, change it as follows (several options are possible). The first alternative matches a line without a trailing parenthesized suffix, the second matches a line that ends with one:

^([^,]*),(.*[^\) ]\s*$) | ([^,]*),(.*)(\s*\(.*\))\s*$

The result ($1 and $3, and likewise $2 and $4, should be combined: $1 and $2 are filled if there is no optional suffix, otherwise $3, $4 and $5 are filled):

3: Prefix without commas
4:  remainder with optional suffix 
5: (optional suffix)
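
A minimal Java sketch of how the groups of this two-alternative regex could be combined. It assumes the spaces around | in the regex above are only there for readability, so they are dropped here (Java treats spaces in a pattern literally unless Pattern.COMMENTS is set). The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoAlternativesDemo {
    // Alternative 1: no trailing "(...)"; alternative 2: line ends with "(...)".
    static final Pattern P = Pattern.compile(
        "^([^,]*),(.*[^\\) ]\\s*$)|([^,]*),(.*)(\\s*\\(.*\\))\\s*$");

    // Returns { prefix, remainder, suffix }; suffix is null when absent.
    // Returns null when the input does not match at all.
    static String[] split(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null;
        }
        if (m.group(1) != null) {
            // First alternative matched: groups 1 and 2 are filled.
            return new String[] { m.group(1), m.group(2), null };
        }
        // Second alternative matched: groups 3, 4 and 5 are filled.
        return new String[] { m.group(3), m.group(4), m.group(5) };
    }

    public static void main(String[] args) {
        for (String part : split(
                "Prefix without commas, remainder with optional suffix (optional suffix)")) {
            System.out.println("[" + part + "]");
        }
    }
}
```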

Here I assumed that your optional suffix can appear multiple times. Another way of reading your question is that you want the middle part repeated, i.e. that $3 is included in $2. You can do that as follows:

^([^,]*),(.*(?:[^\) ]\s*$ | (\s*\(.*\)\s*$)))

Result:

1: Prefix without commas
2:  remainder with optional suffix (optional suffix)  
3: (optional suffix)  
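
With this nested variant, the middle part without the suffix can still be recovered from the match offsets. The following is only a sketch under the same assumption as before (spaces around | removed because Java would treat them literally); the names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedSuffixDemo {
    // Group 2 is everything after the first comma; group 3 (optional) is the
    // trailing "(...)" and lies inside group 2.
    static final Pattern P = Pattern.compile(
        "^([^,]*),(.*(?:[^\\) ]\\s*$|(\\s*\\(.*\\)\\s*$)))");

    // Returns { prefix, middle, suffix }; suffix is null when absent.
    static String[] split(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null;
        }
        String suffix = m.group(3);
        // When a suffix was captured, cut group 2 off where group 3 starts.
        String middle = (suffix == null)
            ? m.group(2)
            : input.substring(m.start(2), m.start(3));
        return new String[] { m.group(1), middle, suffix };
    }

    public static void main(String[] args) {
        for (String part : split(
                "Prefix without commas, remainder with optional suffix (optional suffix)")) {
            System.out.println("[" + part + "]");
        }
    }
}
```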

EDIT: updated the regexes above to allow for whitespace after the closing parenthesis (this is subtle: you need to add the space to the negated character class), and anchored the regex for a speedup and less backtracking.

Abel
  • I will try that, thanks. Is there a "single pass" solution? – PNS Sep 24 '15 at 21:35
  • @PNS, the nature of your question requires at least (some) backtracking, but this one, or esp. the second one, minimizes this backtracking to the size of the optional part. Alternatively you can use non-greedy specifiers, but these are generally slower than greedy ones. Or you can use (negative) look-ahead, but again, this is typically slower. – Abel Sep 24 '15 at 21:42
  • The last regex has an unclosed group. Thanks for all the help. – PNS Sep 24 '15 at 21:56
  • @PNS: I updated to make it slightly more efficient and to fix a glaring parenthesis error. – Abel Sep 24 '15 at 21:57
  • @PNS: yes, the last regex missed a closing `)`, this is meanwhile fixed, you'll need to refresh ;). I tested it with http://regexhero.net, which is not the same as Java regexes, but there are no differences with regards to this particular regex (and I like its interface, if you have Silverlight). – Abel Sep 24 '15 at 21:57

You can use the following regex:

"^([^,]*),([^()]*)(\\s*\\(.*\\))?$"

The regex matches:

  • ^ - Beginning of the string
  • ([^,]*) - (Group 1) 0 or more characters other than ,
  • , - literal ,
  • ([^()]*) - (Group 2) 0 or more characters other than ( and )
  • (\\s*\\(.*\\))? - (Group 3) optional group (due to ? quantifier meaning 1 or 0 occurrences of the preceding subpattern):
    • \\s* - 0 or more whitespace
    • \\(.*\\) - literal ( then as many characters other than a newline as possible up to the last ).
  • $ - end of string (remove if the actual strings can be longer, and you are looking for smaller substrings).

See IDEONE demo

// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String str = "String prefix without commas, variable length remainder with optional suffix (optional suffix)";
Pattern ptrn = Pattern.compile("^([^,]*),([^()]*)(\\s*\\(.*\\))?$");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
    System.out.println("First group: " + matcher.group(1)
                  + "\nSecond group: " + matcher.group(2)
                  // Group 3 is optional, so it is null when there is no suffix
                  + (matcher.group(3) != null ?
                       "\nThird group: " + matcher.group(3) : ""));
}

Wiktor Stribiżew
  • As for the reason why your regex fails, it is true that the 2nd `(.*)` just consumes everything from the `,` to the end (no need to give up any symbols to an optional group). I tried to keep the capturing group number to the minimum, and note that since Group 3 is optional we need to check if it is not null when accessing it later in the code. – Wiktor Stribiżew Sep 24 '15 at 21:40
  • This would work if I had access to the source code. The use case is for supplying a regex which works after matches() returns true. – PNS Sep 24 '15 at 21:44
  • This will work with `matches`, just remove `^` and `$` that will be redundant. BTW, here is the [regex demo](https://regex101.com/r/hO3oT6/1). – Wiktor Stribiżew Sep 24 '15 at 21:51
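
To illustrate the last comment, here is a sketch of the same regex used with matches(), with ^ and $ removed as suggested. The class and method names are hypothetical. As in the answer, group 2 excludes parentheses, so this does not cover a remainder that itself contains parentheses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesDemo {
    // Same pattern as in the answer, without ^ and $: matches() already
    // requires the whole input to match.
    static final Pattern P = Pattern.compile("([^,]*),([^()]*)(\\s*\\(.*\\))?");

    // Returns groups 1..3, or null when the input does not match as a whole.
    static String[] split(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null;
        }
        // Group 3 is optional and may be null.
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        for (String part : split(
                "Prefix without commas, remainder with optional suffix (optional suffix)")) {
            System.out.println("[" + part + "]");
        }
    }
}
```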

The following regex:

^([^,]*),(.*?)(?:\(([^()]*)\))?\s*$

Uses a lazy quantifier in group 2 so that group 3 gets the chance to match whenever the string ends with a set of parentheses. Group 3 itself doesn't allow nested parens, which forces it to match only the last set of parens in the string.

Code:

String text = "String prefix without commas, variable length ())(remainde()r with )optional (suffix (optional suffix)";
Pattern regex = Pattern.compile("^([^,]*),(.*?)(?:[(]([^()]*)[)])?\\s*$");
Matcher m = regex.matcher(text);
if (m.find()) {
    System.out.println("1: " + m.group(1));
    System.out.println("2: " + m.group(2));
    System.out.println("3: " + m.group(3));
}

Output:

1: String prefix without commas
2:  variable length ())(remainde()r with )optional (suffix 
3: optional suffix
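
Since the suffix is optional, group 3 is null when the string has no trailing parentheses, so the code above would print "3: null" in that case. A small sketch of handling both cases with the same regex (class and method names are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazySuffixDemo {
    // Same regex as in the answer: lazy group 2, optional non-nested suffix.
    static final Pattern P =
        Pattern.compile("^([^,]*),(.*?)(?:\\(([^()]*)\\))?\\s*$");

    // Returns { prefix, remainder, suffix }; suffix is null when absent.
    static String[] split(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] withSuffix = split("Prefix, rest (x)  ");
        String[] withoutSuffix = split("Prefix, remainder only  ");
        System.out.println(withSuffix[2]);            // x
        System.out.println(withoutSuffix[2] == null); // true
    }
}
```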

DEMO

Mariano