Parsing BibTeX record with Java RegEx

Question

I have to write a simple BibTeX parser using Java regular expressions. The task is a bit simplified: every tag value is between quotation marks "", not braces {}. The catch is that {} can appear inside "".

I'm trying to cut single records out of the whole file, which I hold as a String, e.g. I want to get @book{...} as a String. The problem is that there may be no comma after the last tag, so a record can end like: author = "john"}.

I've tried @\w*\{[\s\S]*?\}, but it stops too early if there is a } in any tag value between "". There is also no guarantee that the closing } will be on a separate line; it can come directly after the last tag value (which may not end with " either, since it can be an integer).

Can you help me with this?

Well, if the closing brace can be part of a tag value then this is hard to do. Is the requirement to use regex a hard one or did you come up with this as a solution? If it is a hard requirement then I'd assume it's meant as a learning exercise and in that case you could either assume tags won't contain braces or state that those won't be supported - the goal might be to lead you to that realization. — Thomas, Nov 26 '18 at 12:49
Using regex was a suggestion from the teacher, I've also used it in the rest of the project (since this problem is a part of the entire parser) and it worked fine. While it is meant as a learning exercise, teacher explicitly stated that tags may contain braces. I may use String.split() though, but I don't know how. — qalis, Nov 26 '18 at 13:02

score 0 · Answer 1 · answered Nov 26 '18 at 13:17

I've found a hack that may help someone with the same problem. It requires a newline character after the record's closing }. If a } inside a value is never directly followed by a line break (values end only with "), then adding [\r\n] to the end of the regex will suffice.
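As a rough sketch of this hack, the snippet below appends [\r\n] to the expression from the question and collects the matches. The class name NewlineRecordSplitter, the method name split, and the sample input are made up for illustration; it assumes every record's closing } is immediately followed by a line break.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the "newline after }" hack.
public class NewlineRecordSplitter {
    // The question's expression plus [\r\n]: the lazy part now stops
    // only at a } that is directly followed by a line break.
    private static final Pattern RECORD =
            Pattern.compile("@\\w*\\{[\\s\\S]*?\\}[\\r\\n]");

    public static List<String> split(String input) {
        List<String> records = new ArrayList<>();
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            // The match includes the trailing line break, so trim it off.
            records.add(m.group().trim());
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up sample: a } inside a value and an unquoted integer as last tag.
        String input = "@book{\n  author = \"john {x} doe\",\n  year = 2018}\n"
                + "@article{\n  title = \"a\"}\n";
        for (String record : split(input)) {
            System.out.println(record);
            System.out.println("----");
        }
    }
}
```

Note that a final record without a trailing line break is not matched by this expression, and a } inside a value that happens to sit at the end of a line would still cut the record short.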

qalis


score 0 · Accepted Answer · answered Nov 26 '18 at 13:22

You could try the following expression as a basis: @\w+\{(?>\s*\w+\s*=\s*"[^"]*")*\}

Explanation:

@\w+\{...\} would be the record, e.g. @book{...}
(?>...)* means an atomic (independent, non-capturing) group that can occur multiple times or not at all - this is meant to represent the tags
\s*\w+\s*=\s*"[^"]*" would mean a tag which could be preceded by whitespace (\s*). The tag's value has to be in double quotes and anything between double quotes will be consumed, even curly braces.

Note that there may be more cases to take into account, but this should handle curly braces in tag values because it "consumes" everything between the double quotes. As a result, it will not match if the record's closing curly brace is missing: it would match @book{ title="the use of { and }" author="John {curly} Johnson"} but not @book{ title="the use of { and }" author="John {curly} Johnson".
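To try the expression out, here is a minimal sketch that compiles it with java.util.regex and collects every match. The class name BibRecordExtractor and the method name extractRecords are hypothetical; the two sample records are the ones from the note above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper wrapping the expression from this answer.
public class BibRecordExtractor {
    // @type{ followed by zero or more  name = "value"  tags, then the closing }.
    // Everything between double quotes is consumed, including { and }.
    private static final Pattern RECORD = Pattern.compile(
            "@\\w+\\{(?>\\s*\\w+\\s*=\\s*\"[^\"]*\")*\\}");

    public static List<String> extractRecords(String input) {
        List<String> records = new ArrayList<>();
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            records.add(m.group());
        }
        return records;
    }

    public static void main(String[] args) {
        // First record is complete, second one lacks its closing brace.
        String input =
                "@book{ title=\"the use of { and }\" author=\"John {curly} Johnson\"}\n"
              + "@book{ title=\"the use of { and }\" author=\"John {curly} Johnson\"";
        for (String record : extractRecords(input)) {
            System.out.println(record);
        }
    }
}
```

Only the first sample record is returned, since the second has no closing brace. As written, the expression expects tags separated by whitespace only, so commas between tags, a citation key after the opening brace, or unquoted integer values would need extra alternatives in the group.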
