1

I'm not strong with regex, so any help would be appreciated.

I need to parse strings like this:

["text", "text", ["text",["text"]],"text"]

And the output should be 4 strings:

text, text, ["text",["text"]], text

I've tried this pattern (\\[[^\\[,^\\]]*\\])|(\"([^\"]*)\"):

String data="\"aa\", \"aaa\", [\"bb\", [\"1\",\"2\"]], [cc]";
Pattern p=Pattern.compile("(\\[[^\\[,^\\]]*\\])|(\"([^\"]*)\")");

But the output is (the quotes themselves are not critical in the output):

"aa", "aaa", "bb", "1", "2", [cc]

How can I improve my regex?

Dmitry Zaytsev

3 Answers

3

I'm not sure a regex can do that kind of thing on its own. Here is a way to do it, though:

// data string
String input = "\"aa\", \"a, aa\", [\"bb\", [\"1\", \"2\"]], [cc], [\"dd\", [\"5\"]]";
System.out.println(input);

// char that can't ever be within the data string
char tempReplacement = '#';
// escape strings containing commas, e.g "hello, world", ["x, y", 42]
while(input.matches(".*\"[^\"\\[\\]]+,[^\"\\[\\]]+\".*")) {
    input = input.replaceAll("(\"[^\"\\[\\]]+),([^\"\\[\\]]+\")", "$1" + tempReplacement + "$2");
}
// while there are "[*,*]" substrings
while(input.matches(".*\\[[^\\]]+,[^\\]]+\\].*")) {
    // replace the nested "," chars by the replacement char
    input = input.replaceAll("(\\[[^\\]]+),([^\\]]+\\])", "$1" + tempReplacement + "$2");
}

// split the string by the remaining "," (i.e. those non nested)
String[] split = input.split(",");

List<String> output = new LinkedList<String>();
for(String s : split) {
    // restore the commas; String.replace(char, char) is literal, so a
    // replacement char such as '.' or '$' is not treated as a regex
    s = s.replace(tempReplacement, ',');
    s = s.trim();
    output.add(s);
}

// print the result
System.out.println("SPLIT:");
for(String s : output) {
    System.out.println("\t" + s);
}

Output:

"aa", "a, aa", ["bb", ["1", "2"]], [cc], ["dd", ["5"]]
SPLIT:
    "aa"
    "a, aa"
    ["bb", ["1", "2"]]
    [cc]
    ["dd", ["5"]]

PS: the code only looks complex because of the comments. Here is a more concise version:

public static List<String> split(String input, char tempReplacement) {
    while(input.matches(".*\"[^\"\\[\\]]+,[^\"\\[\\]]+\".*")) {
        input = input.replaceAll("(\"[^\"\\[\\]]+),([^\"\\[\\]]+\")", "$1" + tempReplacement + "$2");
    }
    while(input.matches(".*\\[[^\\]]+,[^\\]]+\\].*")) {
        input = input.replaceAll("(\\[[^\\]]+),([^\\]]+\\])", "$1" + tempReplacement + "$2");
    }
    String[] split = input.split(",");
    List<String> output = new LinkedList<String>();
    for(String s : split) {
        // literal char replacement, so tempReplacement is never read as a regex
        output.add(s.replace(tempReplacement, ',').trim());
    }
    return output;
}

Call:

String input = "\"aa\", \"a, aa\", [\"bb\", [\"1\", \"2\"]], [cc], [\"dd\", [\"5\"]]";
List<String> output = split(input, '#');
sp00m
  • Thank you very much, it works perfectly! But there is one thing I forgot to mention. What if there is a field like "text, with, comma"? How can the commas nested inside quotes be replaced too? – Dmitry Zaytsev Jun 05 '12 at 12:09
2

It seems that your input is recursive, so if you have many nested [] levels, regexes are probably not the best solution.

For this purpose I think it's far better and easier to use a simple algorithm based on indexOf() and substring(). It's also often more efficient!
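
A minimal sketch of that kind of simple scan, for illustration only. The class and method names (TopLevelSplitter, splitTopLevel) are hypothetical, and it walks the string with charAt() rather than indexOf(). It assumes the brackets are balanced, that quotes are never escaped inside a quoted string, and that any outer [ ] pair has already been stripped:

```java
import java.util.ArrayList;
import java.util.List;

public class TopLevelSplitter {

    // Splits on commas that sit outside both quotes and brackets.
    // Assumption: balanced brackets, no escaped quotes inside strings.
    public static List<String> splitTopLevel(String input) {
        List<String> parts = new ArrayList<>();
        int depth = 0;            // current [ ] nesting level
        boolean inQuotes = false; // true while inside "..."
        int start = 0;            // start index of the current part

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (!inQuotes) {
                if (c == '[') {
                    depth++;
                } else if (c == ']') {
                    depth--;
                } else if (c == ',' && depth == 0) {
                    // a top-level comma closes the current part
                    parts.add(input.substring(start, i).trim());
                    start = i + 1;
                }
            }
        }
        // whatever follows the last top-level comma is the final part
        parts.add(input.substring(start).trim());
        return parts;
    }

    public static void main(String[] args) {
        String data = "\"aa\", \"a, aa\", [\"bb\", [\"1\", \"2\"]], [cc]";
        for (String part : splitTopLevel(data)) {
            System.out.println(part);
        }
    }
}
```

Because the scan tracks the quote state as well as the nesting depth, commas inside quoted fields such as "a, aa" are left alone without needing a placeholder character.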

alain.janinm
2

Unfortunately, I don't think you can do that with Java regexes. What you have here is a recursive expression. This type of language is not amenable to basic regular expressions (which is what Java's Pattern actually is).

But it's not that hard to write a small recursive descent parser for that language.

You can check the following answer for inspiration: java method for parsing nested expressions
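
As a rough illustration of what such a parser might look like, here is a sketch (not taken from the linked answer). The class name NestedListParser and the grammar are assumptions: a list is [ value, value, ... ], and a value is a nested list, a double-quoted string without escaped quotes, or a bare token such as cc:

```java
import java.util.ArrayList;
import java.util.List;

public class NestedListParser {
    private final String src;
    private int pos = 0;

    private NestedListParser(String src) {
        this.src = src;
    }

    // Parses a whole input such as ["text", ["text"]] into nested lists.
    public static List<Object> parse(String input) {
        NestedListParser p = new NestedListParser(input);
        p.skipSpaces();
        List<Object> result = p.parseList();
        p.skipSpaces();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Trailing input at " + p.pos);
        }
        return result;
    }

    // list := '[' (value (',' value)*)? ']'
    private List<Object> parseList() {
        expect('[');
        List<Object> items = new ArrayList<>();
        skipSpaces();
        if (peek() == ']') {
            pos++;
            return items;
        }
        while (true) {
            items.add(parseValue());
            skipSpaces();
            if (peek() == ',') {
                pos++;
            } else {
                expect(']');
                return items;
            }
        }
    }

    // value := list | quoted string | bare token
    private Object parseValue() {
        skipSpaces();
        char c = peek();
        if (c == '[') return parseList(); // the recursive step
        if (c == '"') return parseQuoted();
        return parseBare();
    }

    // Returns the text between the quotes (assumes no escaped quotes).
    private String parseQuoted() {
        expect('"');
        int end = src.indexOf('"', pos);
        if (end < 0) {
            throw new IllegalArgumentException("Unterminated string at " + pos);
        }
        String text = src.substring(pos, end);
        pos = end + 1;
        return text;
    }

    // Reads an unquoted token up to the next ',' or ']'.
    private String parseBare() {
        int start = pos;
        while (pos < src.length() && src.charAt(pos) != ',' && src.charAt(pos) != ']') {
            pos++;
        }
        return src.substring(start, pos).trim();
    }

    private char peek() {
        if (pos >= src.length()) {
            throw new IllegalArgumentException("Unexpected end of input");
        }
        return src.charAt(pos);
    }

    private void expect(char c) {
        if (peek() != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at " + pos);
        }
        pos++;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) {
            pos++;
        }
    }
}
```

Calling NestedListParser.parse on the string from the question returns a list of four elements, the third of which is itself a nested list. Unlike a flat split, this keeps the structure, and malformed input (a missing ] or an unterminated quote) is reported with an exception.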

Mihai Toader