For reasons beyond the scope of this question, I'm building a simple BibTeX parser. Some BibTeX fields are delimited by a single pair of curly braces, while others are delimited by a double pair. Curly braces are also valid content for a field.

I have a string that I know corresponds to a single field, in one of these formats (input on the left, desired content on the right):

fieldName1 = {{ content }},\n    -> content
fieldName2 = { content },\n      -> content
fieldName3 = { {[}content,] },\n -> {[}content,]

With this pattern I can recover the content:

re.compile(r"(?P<name>[\w-]+?)[\s]*=[\s]*({(?P<content>.*)})", flags=re.IGNORECASE | re.DOTALL)

But the captured content still includes the inner { and } when the field uses double braces.

Is there an easier way to remove them than testing content[0] == '{' and content[-1] == '}'?
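For reference, the manual check described above can be sketched as follows. This is a minimal sketch, not a full BibTeX parser: `parse_field` is a hypothetical helper name, the pattern is the one from the question with the braces escaped and the outer unnamed group dropped, and the code assumes that stripping surrounding whitespace from the content is acceptable.

```python
import re

# Pattern from the question (braces escaped, outer unnamed group dropped).
FIELD = re.compile(r"(?P<name>[\w-]+?)\s*=\s*\{(?P<content>.*)\}", flags=re.DOTALL)


def parse_field(text):
    """Return (name, content), or None if the text does not look like a field.

    One inner pair of braces is removed only when it is present on both sides.
    """
    match = FIELD.search(text)
    if match is None:
        return None
    content = match.group("content").strip()
    # The "[0] == '{' and [-1] == '}'" test from the question.
    if content.startswith("{") and content.endswith("}"):
        content = content[1:-1].strip()
    return match.group("name"), content


samples = [
    "fieldName1 = {{ content }},\n",
    "fieldName2 = { content },\n",
    "fieldName3 = { {[}content,] },\n",
]
for sample in samples:
    print(parse_field(sample))
```

Note that this check only looks at the first and last characters, so a singly braced value such as `{a} and {b}` would also lose its outer braces. Handling that case would need brace counting rather than a regex.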

Fábio Dias
  • So `fieldName = {}},\n` would be valid too, with `fieldName` being `}`? – Eric Duminil Mar 30 '19 at 22:35
  • `Curly braces are also valid content for the field.` then in what way one could distinguish a brace as a delimiter from a brace as a content? Are `,\n` chars always following? – revo Mar 30 '19 at 22:36
  • You could put them into non-capturing groups like so: `(?P<name>[\w-]+?)[\s]*=[\s]*((?:{{)(?P<content>.*)(?:}}))`. You can see it working here: https://regex101.com/r/k4adk2/12 – Sirsmorgasboard Mar 30 '19 at 22:37
  • Rules sound weird to me. Curly braces should be escaped when used as content. `{{a}}` could mean `{a}` or `a` otherwise. – Eric Duminil Mar 30 '19 at 22:40
  • @EricDuminil with content being } yes, it would. Maybe bibtex complains about that, but for me, for this, I don't see an issue. – Fábio Dias Mar 30 '19 at 22:41
  • @Sirsmorgasboard Almost that, I want a way to match both { and {{. But thanks for the website link! It will help along the rest of the thing :) – Fábio Dias Mar 30 '19 at 22:43
  • Ah sorry I misunderstood, something more like this then? (?P<name>[\w-]+?)[\s]*=[\s]*({ content .*}|{{ content.*}}) https://regex101.com/r/k4adk2/14 – Sirsmorgasboard Mar 30 '19 at 22:52
  • @Sirsmorgasboard I tried that too, but then I can't use named groups. It can still work, but one would have to test for None in one group then use the other, which is not really advantageous versus testing { and }. Works, but I'm curious if there is a more elegant solution. – Fábio Dias Mar 30 '19 at 23:11
  • Sorry I guess I am still not understanding, could you give your desired output for each of your 3 examples please? – Sirsmorgasboard Apr 01 '19 at 22:12
  • @sirsmorgasboard Sorry, that was indeed unclear. I updated the question to add the desired results. In other words, remove the double braces only if we can find them on both sides. – Fábio Dias Apr 01 '19 at 23:03

1 Answer

Try the following regex:

(?P<name>[\w-]+?)\s*=\s*{(?:{| {\[})?\s*(?P<content>.*?)(?:,])?\s*}{1,2}

In my test it matches all three of your samples.

For a working example (including a test of the regex above), see https://regex101.com/r/Gy8IWu/1

The above regex test site provides detailed explanations about particular parts of the regex under test and what has been matched.
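To try this regex outside regex101, here is a small Python sketch. The helper name `extract` is hypothetical, and the braces are escaped for readability, which should not change what the pattern matches. With the three samples from the question, the `content` group comes out as plain `content` in every case, including the third sample, where the `{[}` and `,]` are consumed by the optional groups rather than captured.

```python
import re

# The regex above, with literal braces escaped.
FIRST = re.compile(
    r"(?P<name>[\w-]+?)\s*=\s*\{(?:\{| \{\[\})?\s*(?P<content>.*?)(?:,\])?\s*\}{1,2}"
)


def extract(text):
    """Return (name, content) for the first field found, or None."""
    match = FIRST.search(text)
    if match is None:
        return None
    return match.group("name"), match.group("content")


samples = [
    "fieldName1 = {{ content }},\n",
    "fieldName2 = { content },\n",
    "fieldName3 = { {[}content,] },\n",
]
for sample in samples:
    print(extract(sample))
```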

Edit

The regex matching all 3 variants, according to your comment, is:

(?P<name>[\w-]+?)\s*=\s*{{1,2}\s*(?P<content>(?:{\[})?.*?)\s*}{1,2}

See the updated example: https://regex101.com/r/Gy8IWu/2
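The updated regex can be checked from Python as well. This is a minimal sketch: `extract` is a hypothetical helper, and the braces are escaped for readability. On the three samples from the question it returns `content`, `content`, and `{[}content,]` respectively.

```python
import re

# The updated regex, with literal braces escaped.
UPDATED = re.compile(
    r"(?P<name>[\w-]+?)\s*=\s*\{{1,2}\s*(?P<content>(?:\{\[\})?.*?)\s*\}{1,2}"
)


def extract(text):
    """Return (name, content) for the first field found, or None."""
    match = UPDATED.search(text)
    if match is None:
        return None
    return match.group("name"), match.group("content")


samples = [
    "fieldName1 = {{ content }},\n",
    "fieldName2 = { content },\n",
    "fieldName3 = { {[}content,] },\n",
]
for sample in samples:
    print(extract(sample))
```

One caveat: because the content group is lazy, it stops at the first closing brace. A value with braces in the middle, such as `title = {The {LaTeX} Companion},`, would be captured as `The {LaTeX`, so this pattern only suits fields shaped like the three samples.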

Valdi_Bo