I want a regex that finds the string between two delimiters, but only from the start delimiter to the first occurrence of the end delimiter.

I want to extract the story from lines of the following format:

<metadata name="user" story="{some_text_here}" \/>

So I want to extract only: {some_text_here}

For that I am using the following regex:

<metadata name="user" story="(.*)" \/>

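And Java code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) throws IOException {
    // Inner quotes and the backslash must be escaped inside a Java string literal.
    String regexString = "<metadata name=\"user\" story=\"(.*)\" \\/>";
    String filePath = "C:\\Desktop\\temp\\test.txt";
    Pattern p = Pattern.compile(regexString);
    Matcher m;
    try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
        String line;
        // Test each line of the file against the pattern.
        while ((line = br.readLine()) != null) {
            m = p.matcher(line);
            if (m.find()) {
                // Group 1 is the text captured between the story quotes.
                System.out.println(m.group(1));
            }
        }
    }
}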

This regex mostly works, but surprisingly, if the line is:

<metadata name="user" story="My name is Nick" extraStory="something" />

the code prints My name is Nick" extraStory="something, whereas I only want to get My name is Nick.

I also want to make sure that there is nothing between story="My name is Nick" and the closing />.

Nick Div
  • [Compulsory link](http://stackoverflow.com/a/1732454/2071828). – Boris the Spider Jan 25 '17 at 14:17
  • You want to make the quantifier non-greedy, or exclude the ending character. – T.J. Crowder Jan 25 '17 at 14:18
  • What you need is a context-aware parser, which regex isn't. – Aaron Jan 25 '17 at 14:18
  • `(?<=story=")[^"]++(?=")` ought to work. But see my [comment above](https://stackoverflow.com/questions/41853863/regex-pattern-for-finding-string-between-two-characters-but-first-occurrence-o#comment70891433_41853863), **regex cannot parse XML in the general case**. – Boris the Spider Jan 25 '17 at 14:18
  • You really, really, really should use a parser for this. But given the specificity of your regex, you can just change `.` to `[^"]`. That will fix the issue you've mentioned, but I bet it will break in other situations. (Hence, parser.) – T.J. Crowder Jan 25 '17 at 14:19
  • Your code looks like XML/HTML. It would be a lot easier to use a proper parser rather than regex, which [can fail you in many ways](https://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg) for this kind of structure. With jsoup you could use `Document doc = ...(parse document)...; Elements metaWithStory = doc.select("metadata[story]"); String story = metaWithStory.attr("story");`. – Pshemo Jan 25 '17 at 14:21
  • Btw the following XPath [should retrieve what you want](http://www.xpathtester.com/xpath/d3997481fa06e7063c0416301a271f4a) : `//metadata[@name="user"]/@story`. I discourage the use of `//` but had to use it in absence of context. – Aaron Jan 25 '17 at 14:23
  • @Pshemo or one could use XPath without third party libraries. There are so many ways to do this robustly, there is really no excuse for this nonsense anymore. – Boris the Spider Jan 25 '17 at 14:23
  • Thanks a lot everyone for all the comments, I am definitely going to try every library mentioned here. – Nick Div Jan 25 '17 at 14:26
  • *If* there are no double quotes nested within `{sometext}`, then you most certainly can use a regex for this. Finding `quote`, `any characters not a quote`, followed by `quote` doesn't implicitly warrant a parser. A parser is the safest way to go, but just defaulting to "you need a parser for this" means one doesn't know how to use the tool. You certainly can get into trouble using a regex for such things _if you do not understand the problem_. – Kenneth K. Jan 25 '17 at 14:27
  • @BoristheSpider True. I posted jsoup as one of possibilities. Main purpose of my comment was to mention and link question about [possible problems with regex and XML structures](https://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg). – Pshemo Jan 25 '17 at 14:31
  • @KennethK. It is absolutely guaranteed that there won't be any more double quotes within the story. I have used the answer below since it solves the problem for now, but I will definitely look into parsers like everyone suggested as well. – Nick Div Jan 25 '17 at 14:31

3 Answers

<metadata name="user" story="([^"]*)" \/>

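[^"]* matches any run of characters other than ", so the captured group stops at the first closing quote. In this case the string

<metadata name="user" story="My name is Nick" extraStory="something" />

will not be matched at all, because the pattern requires /> right after the closing quote of the story attribute.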
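A minimal Java sketch of this pattern, under a few assumptions: the class name `StoryRegexDemo` and the helper `extractStory` are hypothetical names introduced for illustration, and the slash is written unescaped because `/` has no special meaning in a Java regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StoryRegexDemo {
    // [^"]* cannot cross a double quote, so group 1 ends at the first closing quote.
    // The trailing " />" requires the tag to close right after the story attribute.
    private static final Pattern STORY =
            Pattern.compile("<metadata name=\"user\" story=\"([^\"]*)\" />");

    // Hypothetical helper: returns the story text, or null when the line does not match.
    static String extractStory(String line) {
        Matcher m = STORY.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // prints "My name is Nick"
        System.out.println(extractStory(
                "<metadata name=\"user\" story=\"My name is Nick\" />"));
        // prints "null": the extra attribute prevents a match
        System.out.println(extractStory(
                "<metadata name=\"user\" story=\"My name is Nick\" extraStory=\"something\" />"));
    }
}
```

Returning null for non-matching lines is one possible design; it lets the caller skip lines that carry anything besides name and story.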

radicarl

The following XPath should solve your problem:

//metadata[@name='user' and @story and count(@*) = 2]/@story

It addresses the story attribute of any metadata node in the document whose name attribute is user and which also has a story attribute but no others (the attribute count is 2).

(Note: //metadata[@name='user' and count(@*)=2]/@story would be enough, since it would be impossible to address the story attribute of a metadata node whose second attribute isn't story.)

In Java code, supposing you are handling an instance of org.w3c.dom.Document and already have an instance of XPath available, the code would be the following:

xPath.evaluate("//metadata[@name='user' and @story and count(@*) = 2]/@story", xmlDoc);

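For a self-contained illustration, here is a rough sketch that evaluates the same expression with the JDK's built-in javax.xml.xpath API. It assumes the input is well-formed XML; the class name `StoryXPathDemo`, the helper `extractStory`, and the `<root>` wrapper element are hypothetical and only serve the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class StoryXPathDemo {
    private static final String EXPR =
            "//metadata[@name='user' and @story and count(@*) = 2]/@story";

    // Hypothetical helper: evaluates the expression against an XML string.
    // The String-returning evaluate() yields "" when no node matches.
    static String extractStory(String xml) {
        XPath xPath = XPathFactory.newInstance().newXPath();
        try {
            return xPath.evaluate(EXPR, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        // <root> is an assumed wrapper so the sample is a single well-formed document.
        String xml = "<root>"
                + "<metadata name=\"user\" story=\"My name is Nick\" extraStory=\"something\" />"
                + "<metadata name=\"user\" story=\"Only two attributes\" />"
                + "</root>";
        // prints "Only two attributes": the first element has three attributes and is skipped
        System.out.println(extractStory(xml));
    }
}
```

Because the String form of evaluate() returns the value of the first matching node only, a document with several valid metadata elements would need the NODESET return type instead.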

Aaron
  • 'extraStory' was just an example. Sorry if I was not clear. It is invalid if it has anything apart from 'name' and 'story' so 'extraStory' tag would make the line invalid, 'extraStory1' would make it invalid, 'xyz' would also make it invalid. – Nick Div Jan 25 '17 at 15:20
  • @NickDiv I've updated the XPath expression to make sure the only two attributes are `name` and `story`. – Aaron Jan 25 '17 at 15:24
  • thanks a lot. Appreciate the help. Would definitely try this out. – Nick Div Jan 25 '17 at 16:05

Just use Jsoup, the right tool for the problem :).

It's this easy:

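import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Read the whole file into a string (filePath as in the question).
String html = new String(Files.readAllBytes(Paths.get(filePath)), StandardCharsets.UTF_8);
Document document = Jsoup.parse(html);
// attr() returns the value from the first matching element, or "" if there is none.
String story = document.select("metadata[name=user]").attr("story");
System.out.println(story);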
nafas
  • I'm not sure it's the right tool, I think it's overkill 1) if the source is well-formed XML data and 2) the user isn't familiar already with CSS / jquery selector queries. – Aaron Jan 25 '17 at 14:33
  • But wouldn't it read a string containing invalid attr as well i.e. a line containing extraStory. So this is also a limitation for me that the line should not contain anything but the name and story tag – Nick Div Jan 25 '17 at 14:34
  • @Aaron It might be ever so slightly slower, but its simplicity is worth it. It's a one-liner; you can't get any simpler. – nafas Jan 25 '17 at 14:35
  • @NickDiv It only extracts the data within the "story" attribute, nothing more, mate. That's why it's the right tool for the job. :) – nafas Jan 25 '17 at 14:37
  • @nafas the XPath for this would be `//metadata[@name="user"]/@story`, which is slightly smaller than your DOM manipulation because it includes the attribute selection. Jsoup is good for parsing malformed HTML and because it enables access to the DOM through the popular CSS selector queries. If you don't need either of those two capabilities, I just wouldn't call it the right tool for the problem – Aaron Jan 25 '17 at 14:37
  • @nafas I know it will get exactly what I want, but I do not want to get a story from a line containing extraStory; I want that line to be ignored totally :( So any story that I get should be in a line containing only name and story attributes, nothing else. – Nick Div Jan 25 '17 at 14:41
  • @NickDiv not sure what you mean mate, but based on your question, the answer gets what you want. if you want to ignore elements with an attribute you can do something like this: `.not("metadata[extraStory]")` again not sure that's what you mean. – nafas Jan 25 '17 at 14:54
  • @Aaron I think it goes down to taste really as simplicity sometimes is a matter of personal preference – nafas Jan 25 '17 at 14:58
  • @nafas Don't get me wrong, I think JSoup is awesome, but also that it wasn't meant for this kind of task. Now I can easily understand that you will be more productive on this task by using JSoup than Java's embedded parsers, especially if you come from a web background, but calling it the right tool for the job is still a stretch IMO ;) – Aaron Jan 25 '17 at 15:19