I am trying to create a regex pattern that reads through a BibTeX citation file and matches everything inside the brackets. For those who don't know, a BibTeX citation looks like the following:

@INPROCEEDINGS{Fogel95,
  AUTHOR =       {L. J. Fogel and P. J. Angeline and D. B. Fogel},
  TITLE =        {An evolutionary programming approach to self-adaptation
                    on finite state machines},
  BOOKTITLE =    {Proceedings of the Fourth International Conference on
                    Evolutionary Programming},
  YEAR =         {1995},
  pages =        {355--365}
}

@ARTICLE{Goldberg91,
  AUTHOR =       {D. Goldberg},
  TITLE =        {Real-coded genetic algorithms, virtual alphabets, and blocking},
  JOURNAL =      {Complex Systems},
  YEAR =         {1991},
  pages =        {139--167}
}

@INPROCEEDINGS{Yao96,
  AUTHOR =       {X. Yao and Y. Liu},
  TITLE =        {Fast evolutionary programming},
  BOOKTITLE =    {Proceedings of the 6$^{th}$ Annual Conference on Evolutionary
                    Programming},
  YEAR =         {1996},
  pages =        {451--460}
}

The current pattern I have is as follows:

@(\\w+)\{(\\w+),\\s*((\\w+)\\s*=\\s*(\\"|\\{)?(.+)(\\"|\\})?,?\\s*)+\\}

This pattern matches the second citation but only parts of the first and third. I know it fails on the third citation because of the brackets inside a field value (6$^{th}$), and I have figured out that it won't match citations that have newlines within a field value:

BOOKTITLE =    {Proceedings of the Fourth International Conference on
                Evolutionary Programming},
//This part of the citation has a newline in the middle of it.
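
A note on why the wrapped value fails: by default `.` in java.util.regex does not match line terminators, so `(.+)` cannot reach a closing brace that sits on the next line. The following is only a minimal sketch of that behaviour, not a fix for the full pattern; the class name, helper method, and the two small test patterns are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotVsNegatedClass {
    // Returns group 1 of the first match, or null if the pattern does not match.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String field = "BOOKTITLE = {Proceedings of the Fourth International Conference on\n"
                     + "             Evolutionary Programming},";

        // '.' stops at '\n', and there is no '}' on the first line, so nothing matches.
        System.out.println(firstGroup("=\\s*\\{(.+)\\}", field)); // prints null

        // A negated character class has no such restriction and crosses the newline.
        System.out.println(firstGroup("=\\s*\\{([^{}]+)\\}", field));
    }
}
```

Compiling with Pattern.DOTALL would also let `.` cross newlines, but a greedy `.+` would then run to the last closing brace in the input, so a negated class is usually the easier tool here.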

Now I have been slaving away trying to fix my pattern, but what I have found with regular expressions is that the longer I try to fix the expression or add new conditions to it, the more confusing it gets. I am just wondering how to capture the whole citation regardless of inner brackets/parenthesis. Some citations contain no brackets/parenthesis after the "=" sign at all. Any help, along with an explanation, would be greatly appreciated. I have looked at similar examples, which have only confused me more because a regular expression is so hard to decipher by simply glancing at it. Thank you.

Matt

3 Answers

The simplest way to capture everything between curly braces is:

\{([^}]+)}

The negated class [^}] matches any character that is not a closing curly brace, including newlines.
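
A quick check of this pattern against the kind of data in the question, as a hedged sketch (the class and method names are hypothetical). It handles simple values, but because `[^}]` stops at the first closing brace, a value with inner braces such as 6$^{th}$ is cut short:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleBraceCapture {
    private static final Pattern BRACES = Pattern.compile("\\{([^}]+)}");

    // Returns the text captured between the first '{' and the first '}' after it.
    static String firstBraced(String text) {
        Matcher m = BRACES.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // prints: 1995
        System.out.println(firstBraced("YEAR = {1995},"));
        // prints: Proceedings of the 6$^{th
        System.out.println(firstBraced("BOOKTITLE = {Proceedings of the 6$^{th}$ Annual Conference}"));
    }
}
```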

Bohemian

Regex is not a good parser for text with nested blocks.

If you insist on using regex, you should match the outer part first:

@INPROCEEDINGS{Fogel95,
  ???
}

Capture the ???, so you can match on that in a nested loop.

The outer regex would be something like @(\w+)\{(\w+),([^{}]*(?:\{[^{}]*\}[^{}]*)*)\}

The inner regex would be something like (\w+)\s*=\s*\{([^}]*)\}

Since a field value may be wrapped on multiple lines, you need to unwrap that.

Code

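import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Outer pattern: one whole entry. The content may contain braces one level deep.
Pattern pTag = Pattern.compile("@(\\w+)" + // tag
                               "\\{" +
                                  "(\\w+)" + // name
                                  "," +
                                  "([^{}]*(?:\\{[^{}]*\\}[^{}]*)*)" + // content
                               "\\}");
// Inner pattern: a single "field = {value}" pair inside the content.
Pattern pField = Pattern.compile("(\\w+)" + // field
                                 "\\s*=\\s*" +
                                 "\\{" +
                                    "([^}]*)" + // value
                                 "\\}");
// A line break plus surrounding whitespace, used to unwrap multi-line values (\R needs Java 8+).
Pattern pNewline = Pattern.compile("\\s*(?:\\R\\s*)+");
for (Matcher mTag = pTag.matcher(input); mTag.find(); ) {
    String tag = mTag.group(1);
    String name = mTag.group(2);
    String content = mTag.group(3);
    for (Matcher mField = pField.matcher(content); mField.find(); ) {
        String field = mField.group(1);
        String value = mField.group(2);
        value = pNewline.matcher(value).replaceAll(" "); // join wrapped lines with one space
        System.out.printf("%-15s %-12s %-11s %s%n", tag, name, field, value);
    }
}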

Test input

String input = "@INPROCEEDINGS{Fogel95,\n" +
               "  AUTHOR =       {L. J. Fogel and P. J. Angeline and D. B. Fogel},\n" +
               "  TITLE =        {An evolutionary programming approach to self-adaptation\n" +
               "                    on finite state machines},\n" +
               "  BOOKTITLE =    {Proceedings of the Fourth International Conference on\n" +
               "                    Evolutionary Programming},\n" +
               "  YEAR =         {1995},\n" +
               "  pages =        {355--365}\n" +
               "}\n" +
               "\n" +
               "@ARTICLE{Goldberg91,\n" +
               "  AUTHOR =       {D. Goldberg},\n" +
               "  TITLE =        {Real-coded genetic algorithms, virtual alphabets, and blocking},\n" +
               "  JOURNAL =      {Complex Systems},\n" +
               "  YEAR =         {1991},\n" +
               "  pages =        {139--167}\n" +
               "}\n" +
               "\n" +
               "@INPROCEEDINGS{Yao96,\n" +
               "  AUTHOR =       {X. Yao and Y. Liu},\n" +
               "  TITLE =        {Fast evolutionary programming},\n" +
               "  BOOKTITLE =    {Proceedings of the 6$^{th}$ Annual Conference on Evolutionary\n" +
               "                    Programming},\n" +
               "  YEAR =         {1996},\n" +
               "  pages =        {451--460}\n" +
               "}";

Output

INPROCEEDINGS   Fogel95      AUTHOR      L. J. Fogel and P. J. Angeline and D. B. Fogel
INPROCEEDINGS   Fogel95      TITLE       An evolutionary programming approach to self-adaptation on finite state machines
INPROCEEDINGS   Fogel95      BOOKTITLE   Proceedings of the Fourth International Conference on Evolutionary Programming
INPROCEEDINGS   Fogel95      YEAR        1995
INPROCEEDINGS   Fogel95      pages       355--365
ARTICLE         Goldberg91   AUTHOR      D. Goldberg
ARTICLE         Goldberg91   TITLE       Real-coded genetic algorithms, virtual alphabets, and blocking
ARTICLE         Goldberg91   JOURNAL     Complex Systems
ARTICLE         Goldberg91   YEAR        1991
ARTICLE         Goldberg91   pages       139--167
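
Note that the Yao96 entry does not appear in the output: its BOOKTITLE value contains {th}, which is nested one level deeper than the outer pattern allows, so that entry is never matched. If regex is a hard requirement, one possible workaround is to spell out each extra nesting level explicitly. The following is only a sketch under that assumption; the class name, the DEPTH constants, and the "|"-separated row format are made up for illustration, and it still handles only brace-wrapped values up to a fixed depth.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedBibtexRegex {
    // Regex cannot count braces, so every allowed nesting level is written out.
    private static final String DEPTH0 = "[^{}]*";
    private static final String DEPTH1 = DEPTH0 + "(?:\\{" + DEPTH0 + "\\}" + DEPTH0 + ")*";
    private static final String DEPTH2 = DEPTH0 + "(?:\\{" + DEPTH1 + "\\}" + DEPTH0 + ")*";

    // Entry content: field values (level 1) that may contain one more level of braces.
    private static final Pattern ENTRY =
            Pattern.compile("@(\\w+)\\{(\\w+),(" + DEPTH2 + ")\\}");
    // Field value: text with braces nested at most one level deep.
    private static final Pattern FIELD =
            Pattern.compile("(\\w+)\\s*=\\s*\\{(" + DEPTH1 + ")\\}");

    // Returns one "type|key|field|value" string per field found in the input.
    static List<String> parse(String input) {
        List<String> rows = new ArrayList<>();
        Matcher entry = ENTRY.matcher(input);
        while (entry.find()) {
            Matcher field = FIELD.matcher(entry.group(3));
            while (field.find()) {
                // Unwrap values that span several lines.
                String value = field.group(2).replaceAll("\\s*\\R\\s*", " ");
                rows.add(entry.group(1) + "|" + entry.group(2) + "|"
                        + field.group(1) + "|" + value);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String input = "@INPROCEEDINGS{Yao96,\n"
                + "  AUTHOR =       {X. Yao and Y. Liu},\n"
                + "  BOOKTITLE =    {Proceedings of the 6$^{th}$ Annual Conference on Evolutionary\n"
                + "                    Programming},\n"
                + "  YEAR =         {1996}\n"
                + "}";
        parse(input).forEach(System.out::println);
    }
}
```

Each additional level of nesting needs another DEPTH constant built the same way, which is the practical reason a regex is a poor fit for arbitrarily nested text.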
Andreas
  • Thank you for this solution, however it doesn't take into account citations that contain multiple pairs of braces in them such as: `TITLE = "The {P}rotein {D}ata {B}ank: A Computer-Based Archival File for Macromolecular Structures",` how would I go about compensating for this? – Matt Oct 16 '17 at 06:43
  • Read my first sentence: **Regex is not a good parser for text with nested blocks.** – Andreas Oct 16 '17 at 06:45
  • I am aware that regex is not a good use, but I am learning regex in school and am required to use it for this problem. – Matt Oct 16 '17 at 06:46
  • Besides, your question shows that values are wrapped with braces (`{}`), so a *quoted* value like `TITLE = "..."` is not valid. How do you expect us to help you, if you suddenly change the rules? – Andreas Oct 16 '17 at 06:47
  • I apologize, I accidentally wrote 'parenthesis' instead of quotations. I didn't make it very clear to begin with anyways. I'll mark your answer as right anyways because it was the most helpful. But yes, some citation values are wrapped in brackets, some are wrapped in quotations, and some have neither. – Matt Oct 16 '17 at 06:50
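
For the three value forms mentioned in the comments (braced, quoted, or neither), one option is an alternation in the field pattern. This is a rough sketch only, with hypothetical class and method names; it assumes braces nested at most one level deep inside a braced value and does not try to cover every BibTeX quoting rule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldValueForms {
    // Text with braces nested at most one level deep.
    private static final String BALANCED = "[^{}]*(?:\\{[^{}]*\\}[^{}]*)*";

    // Alternatives, in order: {braced}, "quoted", or a bare token such as 1995.
    private static final Pattern FIELD = Pattern.compile(
            "(\\w+)\\s*=\\s*(?:\\{(" + BALANCED + ")\\}|\"([^\"]*)\"|(\\w+))");

    // Returns the value of the first field in the text, whichever form it uses.
    static String firstValue(String text) {
        Matcher m = FIELD.matcher(text);
        if (!m.find()) {
            return null;
        }
        // Exactly one of groups 2..4 participates in a successful match.
        for (int g = 2; g <= 4; g++) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstValue("TITLE = \"The {P}rotein {D}ata {B}ank\","));
        System.out.println(firstValue("TITLE = {The {P}rotein {D}ata {B}ank},"));
        System.out.println(firstValue("YEAR = 1995,"));
    }
}
```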

As best as I can tell, Andreas's solution is probably better, but if you just want a single regex that breaks the entire string into an array of groups, you can use this: @(.*)\{(.*),\s*(.*?)\s*=\s*\{(.*?)\},(?:\s*(.*) =\s*\{([\s\S]*?)\},)*?(?:\s*?(.*?) =\s*?\{(.*?)\})*?\s*?\}

Greg S.