Most efficient way to split this pattern

Question

I have a file which includes thousands of lines like this:

node: { title: "0" label: "sub_401000" color: 76 textcolor: 73 bordercolor: black }

I need to extract the title value and the label value from each line. For the line above, that means extracting 0 and sub_401000. I can get them by splitting the string, but it takes a lot of time. What would be the most efficient way to do this?

As an aside, whoever wrote this file could have saved valid json instead of this, which would have made everything simpler. — xlecoustillier, Jun 24 '15 at 14:46

James Wierzba · Accepted Answer · 2015-06-24T15:10:36.437

Something like this should do. (Note: I assumed there is exactly one space between title: and the opening quote.)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

    public static void main(String[] args) 
    {
        String str = "node: { title: \"0\" label: \"sub_401000\" color: 76 textcolor: 73 bordercolor: black }";
        //String regex = ".*title: \"(.*)\".*label: \"(.*)\""; better regex below suggested by Pshemo
        // [^\"]* stops at the next quote, so each group captures only the quoted value
        String regex = "title: \"([^\"]*)\".*label: \"([^\"]*)\"";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(str);
        if(m.find())
        {
            String title = m.group(1); // value of title
            String label = m.group(2); // value of label
            System.out.println(title);
            System.out.println(label);
        }
    }
}

Output:

0
sub_401000
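Since the question is about thousands of lines, here is a minimal sketch of how the same regex might be reused across many lines by compiling the pattern once. The class name `NodeLineParser`, the `parse` helper, and the second sample line are hypothetical and only for illustration; reading the lines from the actual file is left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NodeLineParser {

    // Compiled once and reused for every line, instead of recompiling per line.
    private static final Pattern NODE =
            Pattern.compile("title: \"([^\"]*)\".*label: \"([^\"]*)\"");

    // Hypothetical helper: returns {title, label}, or null if the line does not match.
    public static String[] parse(String line) {
        Matcher m = NODE.matcher(line);
        if (m.find()) {
            return new String[] { m.group(1), m.group(2) };
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for the lines of the real file (the second line is made up).
        List<String> lines = Arrays.asList(
                "node: { title: \"0\" label: \"sub_401000\" color: 76 textcolor: 73 bordercolor: black }",
                "node: { title: \"1\" label: \"sub_401020\" color: 76 textcolor: 73 bordercolor: black }");

        for (String line : lines) {
            String[] fields = parse(line);
            if (fields != null) {
                System.out.println(fields[0] + " -> " + fields[1]);
            }
        }
    }
}
```

Whether this is fast enough for a given file would need to be measured; the sketch only shows where the one-time compilation goes.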

edited Jun 24 '15 at 15:10

answered Jun 24 '15 at 14:49

James Wierzba


Thanks for your answer. Now it is faster than my solution :) – Alex Jun 24 '15 at 15:02
No problem. Feel free to accept if it is the answer – James Wierzba Jun 24 '15 at 15:05
You should avoid `\"(.*)\"` since `*` is greedy and causes backtracking. Use `\"([^\"]*)\"` instead. – Pshemo Jun 24 '15 at 15:06
Also there is no need for `.*` at start of regex since you are using `m.find()` method. – Pshemo Jun 24 '15 at 15:06
Explain a bit further "is greedy and causes backtracking" please? – James Wierzba Jun 24 '15 at 15:08
In `\"(.*)\"`, `.` can match any character (except line separators), so the regex `".*"` will first match `\"0\" label: \"sub_401000\"`, since that text starts and ends with `"`. But then the regex engine cannot match `label: ` anywhere in the rest of the line, so it has to shrink the previously matched part step by step to check whether it was *too greedy*. Take a look at this example: `"foo 'a' bar 'b' baz 'c'".replaceAll("'.*'","X")` will result in `foo X`. Now if you change it to `"foo 'a' bar 'b' baz 'c'".replaceAll("'.*' baz","X baz")` you will get `foo X baz 'c'`. – Pshemo Jun 24 '15 at 15:17
For more info read http://www.regular-expressions.info/repeat.html "Watch Out for The Greediness!" and especially "Looking Inside The Regex Engine" part. – Pshemo Jun 24 '15 at 15:36
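The `replaceAll` examples from the comments above can be run as a small program. The class name `GreedyDemo` and its method names are hypothetical; the third method applies the negated character class `[^']*` to the same sample string for comparison.

```java
public class GreedyDemo {

    // Greedy: '.*' runs from the first quote to the last quote in the input.
    public static String greedy(String s) {
        return s.replaceAll("'.*'", "X");
    }

    // Greedy with a required suffix: the engine backtracks until " baz" can follow.
    public static String greedyWithSuffix(String s) {
        return s.replaceAll("'.*' baz", "X baz");
    }

    // Negated class: [^']* cannot cross a quote, so each quoted part matches on its own.
    public static String negatedClass(String s) {
        return s.replaceAll("'[^']*'", "X");
    }

    public static void main(String[] args) {
        String s = "foo 'a' bar 'b' baz 'c'";
        System.out.println(greedy(s));           // foo X
        System.out.println(greedyWithSuffix(s)); // foo X baz 'c'
        System.out.println(negatedClass(s));     // foo X bar X baz X
    }
}
```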

score 0 · Answer 2 · answered Jun 24 '15 at 14:49

You could try this regex, compiling it once if possible since it will be used repeatedly. (Note: the parentheses are there to capture the matches.)

 (\w+): "*(\w*)
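A rough sketch of how this regex might be used in Java, assuming the leading space in the line above is only code indentation and not part of the pattern. The class name `KeyValueScan` and the `scan` helper are hypothetical. Each call to `find()` yields one key/value pair, so the loop collects every field on the line, not just title and label.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyValueScan {

    // Compiled once; group 1 is the key, group 2 is the value without the opening quote.
    private static final Pattern PAIR = Pattern.compile("(\\w+): \"*(\\w*)");

    // Hypothetical helper: collects all key/value pairs of one line in order of appearance.
    public static Map<String, String> scan(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "node: { title: \"0\" label: \"sub_401000\" color: 76 textcolor: 73 bordercolor: black }";
        Map<String, String> fields = scan(line);
        System.out.println(fields.get("title")); // 0
        System.out.println(fields.get("label")); // sub_401000
    }
}
```

Note that `\w*` stops at the first non-word character, so a value containing spaces or punctuation would be cut short, and the leading `node` key is captured with an empty value.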


answered Jun 24 '15 at 14:49

umbreon222
