-1

I have a file which includes thousands of lines like this:

node: { title: "0" label: "sub_401000" color: 76 textcolor: 73 bordercolor: black }

What I need is to extract the title and label values. For example, from the line above I need to extract 0 and sub_401000. I can split the lines, but that takes a lot of time. What would be the most efficient way to do this?

Danielson
Alex
  • 4
    As an aside, whoever wrote this file could have saved valid json instead of this, which would have made everything simpler. – xlecoustillier Jun 24 '15 at 14:46

2 Answers

1

Something like this should do. (Note: I assumed there is exactly one space between title: and the opening quote.)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

    public static void main(String[] args) 
    {
        String str = "node: { title: \"0\" label: \"sub_401000\" color: 76 textcolor: 73 bordercolor: black }";
        //String regex = ".*title: \"(.*)\".*label: \"(.*)\""; better regex below suggested by Pshemo
        String regex = "title: \"([^\"]*)\".*label: \"([^\"]*)\"";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(str);
        if(m.find())
        {
            String title = m.group(1); // text between the quotes after title:
            String label = m.group(2); // text between the quotes after label:
            System.out.println(title);
            System.out.println(label);
        }
    }
}

Output:

0

sub_401000
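
Since the question is about thousands of lines, one way to apply the same pattern is to compile it once and reuse a single Matcher for every line. This is only a minimal sketch: the class name NodeExtractor and the method extract are hypothetical, and it assumes the lines have already been read into a list of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class NodeExtractor {
    // Compiled once and reused for every line.
    private static final Pattern NODE =
            Pattern.compile("title: \"([^\"]*)\".*label: \"([^\"]*)\"");

    // Returns one {title, label} pair per matching line; other lines are skipped.
    static List<String[]> extract(List<String> lines) {
        List<String[]> result = new ArrayList<>();
        Matcher m = NODE.matcher("");
        for (String line : lines) {
            m.reset(line); // reuse the same Matcher instead of creating a new one
            if (m.find()) {
                result.add(new String[] { m.group(1), m.group(2) });
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "node: { title: \"0\" label: \"sub_401000\" color: 76 textcolor: 73 bordercolor: black }",
                "some other line");
        for (String[] pair : extract(lines)) {
            System.out.println(pair[0] + " " + pair[1]); // 0 sub_401000
        }
    }
}
```

For a real file, the list could be replaced by reading line by line (for example with a BufferedReader) so the whole file does not have to sit in memory.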

James Wierzba
  • Thanks for your answer. Now it is faster than my solution :) – Alex Jun 24 '15 at 15:02
  • No problem. Feel free to accept if it is the answer – James Wierzba Jun 24 '15 at 15:05
  • 1
    You should avoid `\"(.*)\"` since `*` is greedy and causes backtracking. Use `\"([^\"]*)\"` instead. – Pshemo Jun 24 '15 at 15:06
  • 1
    Also there is no need for `.*` at start of regex since you are using `m.find()` method. – Pshemo Jun 24 '15 at 15:06
  • Explain a bit further "is greedy and causes backtracking" please? – James Wierzba Jun 24 '15 at 15:08
  • 1
    Since `.` can match any character (except line separators), the part `\"(.*)\"` will at first match `\"0\" label: \"sub_401000\"`, because that text starts and ends with `"`. But then the regex engine cannot match `label: ` in what is left, so it has to shrink the previously matched part to check whether it was *too greedy*. Take a look at this example: `"foo 'a' bar 'b' baz 'c'".replaceAll("'.*'","X")` will result in `foo X`. Now if you change it to `"foo 'a' bar 'b' baz 'c'".replaceAll("'.*' baz","X baz")` you will get `foo X baz 'c'`. – Pshemo Jun 24 '15 at 15:17
  • For more info read http://www.regular-expressions.info/repeat.html "Watch Out for The Greediness!" and especially "Looking Inside The Regex Engine" part. – Pshemo Jun 24 '15 at 15:36
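
The behavior described in the comments above can be checked with a small sketch. The class name GreedyDemo is hypothetical; the input string and the first two patterns come from the comments, and the third pattern is the negated-class form of the same idea.

```java
// Hypothetical demo class illustrating greedy matching versus a negated character class.
class GreedyDemo {
    static final String INPUT = "foo 'a' bar 'b' baz 'c'";

    // Greedy '.*' runs from the first quote to the last one.
    static String greedy() {
        return INPUT.replaceAll("'.*'", "X");
    }

    // The trailing " baz" forces the engine to backtrack to the quote after b.
    static String greedyWithSuffix() {
        return INPUT.replaceAll("'.*' baz", "X baz");
    }

    // [^']* cannot cross a quote, so each quoted item is matched on its own.
    static String negatedClass() {
        return INPUT.replaceAll("'[^']*'", "X");
    }

    public static void main(String[] args) {
        System.out.println(greedy());           // foo X
        System.out.println(greedyWithSuffix()); // foo X baz 'c'
        System.out.println(negatedClass());     // foo X bar X baz X
    }
}
```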
0

You could try this regex, and compile it once since it will be used repeatedly. (Note: the parentheses are capturing groups, so read the key and the value from the groups.)

 (\w+): "*(\w*)
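
As a rough sketch of how this pattern could be used from Java, the following loops over every match in a line and collects the two groups into a map. The class name KeyValueDemo and the method parse are hypothetical. Note that with this pattern the leading node: also matches, with an empty value, because the character after it is { rather than a word character.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class KeyValueDemo {
    // Compiled once; group 1 is the key, group 2 the (possibly empty) value.
    private static final Pattern PAIR = Pattern.compile("(\\w+): \"*(\\w*)");

    // Collects every key/value match in the line, in order of appearance.
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "node: { title: \"0\" label: \"sub_401000\" color: 76 textcolor: 73 bordercolor: black }";
        Map<String, String> fields = parse(line);
        System.out.println(fields.get("title")); // 0
        System.out.println(fields.get("label")); // sub_401000
    }
}
```

This does more work per line than a pattern that targets only title and label, so it is probably worth it only if the other fields are needed as well.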


umbreon222