Regex search to extract BibTeX title string in R

Question

I have a data frame in R where one column, named Title, contains BibTeX entries that look like this:

={Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems},\n  
author={Goldreich, Oded and Micali, Silvio and Wigderson, Avi},\n  
journal={Journal of the ACM (JACM)},\n  
volume={38},\n  
number={3},\n  
pages={690--728},\n  
year={1991},\n  
publisher={ACM New York, NY, USA}\n}

I need to extract only the title from the BibTeX citation, which is the string after ={ and before the next }.

In this example, the output should be:

Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems

I need to do this for all rows in the data frame. Not all rows have the same number of BibTeX fields, so the regex has to ignore everything after the first }.

I'm currently trying sub(".*\\={\\}\\s*(.+?)\\s*\\|.*$", "\\1", data$Title), which fails with the TRE pattern compilation error 'Invalid contents of {}'.

How should I do this?

PaulS · Answer 1 · 2022-06-28T19:05:24.553

1

A possible solution, using stringr::str_extract and lookaround:

library(stringr)

# s is the character vector to search, e.g. data$Title
str_extract(s, "(?<=\\{)[^}]+(?=\\})")

#> [1] "Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems"
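
If stringr is not available, the same lookaround idea can be sketched in base R with regexpr and regmatches. This is only an illustration: the vector s below is made-up sample data standing in for data$Title, and perl = TRUE is required because lookbehind and lookahead are PCRE features.

```r
# Hypothetical sample data standing in for data$Title;
# the two entries have different numbers of BibTeX fields.
s <- c("={First title},\n  author={A, B}\n}",
       "={Second title},\n  year={1991},\n  volume={3}\n}")

# regexpr() locates the first match in each element, i.e. the text
# between the first "{" and the next "}".
m <- regexpr("(?<=\\{)[^}]+(?=\\})", s, perl = TRUE)

# regmatches() pulls out the matched substrings.
titles <- regmatches(s, m)
titles
```

Note that regmatches() drops elements with no match, so the result can be shorter than the input if some rows contain no braces at all.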

edited Jun 28 '22 at 19:05

answered Jun 28 '22 at 18:57


score 0 · Answer 2 · answered Jun 28 '22 at 18:53

Note that { is a special regex metacharacter (it opens a quantifier such as {2,3}), so it needs to be escaped to match a literal brace.
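
As a small sketch of what escaping can look like in R, here are three common ways to match a literal {. The string x is a made-up example, not data from the question.

```r
x <- "={abc}"  # hypothetical input

# 1. Backslash escape (doubled inside an R string literal)
a <- sub("=\\{", "", x)

# 2. Bracket expression: "{" is not special inside [...]
b <- sub("=[{]", "", x)

# 3. fixed = TRUE treats the whole pattern as a literal string
c3 <- sub("={", "", x, fixed = TRUE)

c(a, b, c3)
```

All three remove the leading ={ and leave the rest of the string untouched.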

To match any string between the curly braces, use a pattern based on a negated character class (negated bracket expression), such as \{([^{}]*)}.

You can use

sub(".*?=\\{([^{}]*)}.*", "\\1", df$Title)

R demo:

Title <- c("={Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems},\n  author={Goldreich, Oded and Micali, Silvio and Wigderson, Avi},\n  journal={Journal of the ACM (JACM)},\n  volume={38},\n  number={3},\n  pages={690--728},\n  year={1991},\n  publisher={ACM New York, NY, USA}\n}")
sub(".*?=\\{([^{}]*)}.*", "\\1", Title)

Output:

[1] "Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems"

Pattern details:

.*? - any zero or more chars, as few as possible
=\\{ - a ={ substring
([^{}]*) - Group 1 (\1): any zero or more chars other than curly braces
} - a } char (it is not special, no need to escape)
.* - the rest of the string.
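
One caveat when applying this to a whole column: sub() returns its input unchanged when the pattern does not match, so a row without a ={...} part would keep its full original text. A minimal sketch of guarding against that, assuming a made-up data frame df and a hypothetical output column TitleOnly:

```r
# Hypothetical data: rows with different numbers of fields,
# plus one row that does not contain a title at all.
df <- data.frame(
  Title = c("={First title},\n  author={A, B},\n  year={1991}\n}",
            "={Second title},\n  year={2001}\n}",
            "no braces here"),
  stringsAsFactors = FALSE
)

pattern <- ".*?=\\{([^{}]*)}.*"

# Flag the rows where the pattern matches, then extract group 1 for
# those rows and use NA for the rest.
matched <- grepl(pattern, df$Title)
df$TitleOnly <- ifelse(matched, sub(pattern, "\\1", df$Title), NA_character_)

df$TitleOnly
```

Whether NA is the right placeholder for non-matching rows depends on how the column is used later; it is only one possible choice.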
