regex replace parts/groups of a string in R

Question

I am trying to post-process the LaTeX (pdf_book output) of a bookdown document to collapse biblatex citations, so that they can later be sorted chronologically with \usepackage[sortcites]{biblatex}. To do that, I need to find every }{ that follows \\autocites and replace it with a comma. I am experimenting with gsub() but can't find the correct incantation.

# example input
testcase <- "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}"

# desired output
"text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"

A simple approach was to replace all }{

> gsub('\\}\\{', ',', testcase, perl=TRUE)
[1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep,separate}"

But this also collapses {keep}{separate}.

I then tried to replace }{ only within a 'word' (a run of characters without whitespace) starting with \\autocites, using several groups, and failed bitterly:

> gsub('(\\\\autocites)([^ \f\n\r\t\v}{}]+)((\\}\\{})+)', '\\1\\2\\3', testcase, perl=TRUE)
[1] "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}"

Addendum: The actual document contains more lines/elements than the testcase above. Not all elements contain \\autocites and in rare cases one element has more than one \\autocites. I didn't originally think this was relevant. A more realistic testcase:

testcase2 <- c("some text",
"text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}",
"text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}")

Should the additional "\\autocites" segments be edited likewise? — moodymudskipper, Oct 24 '19 at 21:57
Yes, every '}{' needs to be converted to ',' up to the next whitespace, for all '\\autocites' strings. — rnuske, Oct 25 '19 at 06:15
Then unglue won't work, I suggest accepting Wiktor's solution if it solves your issue. — moodymudskipper, Oct 25 '19 at 07:30

score 3 · Accepted Answer · answered Oct 24 '19 at 20:35

A single gsub call is enough:

gsub("(?:\\G(?!^)|\\\\autocites)\\S*?\\K}{", ",", testcase, perl=TRUE)
## => [1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"

See the regex demo. Here, (?:\G(?!^)|\\autocites) matches either the end of the previous match or the literal string \autocites. Then \S*? matches zero or more non-whitespace chars, as few as possible, and \K discards everything matched so far from the match. Only the }{ substring that follows remains in the match, so it alone is replaced with a comma.
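
As a hedged illustration of the explanation above, the sketch below applies the same pattern to the testcase2 vector from the question and annotates each piece of the regex. The variable names pat and out are introduced here for illustration only.

```r
# testcase2 from the question: one element without \autocites,
# one with a single \autocites, and one with two of them.
testcase2 <- c(
  "some text",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}"
)

# (?:\G(?!^)|\\autocites)  start at the end of the previous match, or at \autocites
# \S*?                     lazily skip non-whitespace, so the match never leaves the citation "word"
# \K                       drop everything matched so far from the reported match
# }{                       the only text left in the match, hence the only text replaced
pat <- "(?:\\G(?!^)|\\\\autocites)\\S*?\\K}{"

# gsub() is vectorised, so every element is processed in one call
out <- gsub(pat, ",", testcase2, perl = TRUE)
print(out)
```

Elements without \autocites pass through unchanged, and {keep}{separate} is left alone because whitespace separates it from the citation.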

There is also a very readable solution that combines one regex replacement with one fixed-text replacement, using stringr::str_replace_all:

library(stringr)
str_replace_all(testcase, "\\\\autocites\\S+", function(x) gsub("}{", ",", x, fixed=TRUE))
# => [1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"

Here, \\autocites\S+ matches \autocites and then 1+ non-whitespace chars, and gsub("}{", ",", x, fixed=TRUE) replaces (very fast) each }{ with , in the matched text.
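
For readers who prefer to avoid the stringr dependency, the same two-step idea (match each \autocites word, then do a fixed replacement inside it) can probably be expressed in base R with gregexpr() and the regmatches<- replacement function. This is a minimal sketch, not part of the answer above; the input vector x and its citation keys k1 to k5 are made up for illustration.

```r
# Hypothetical input: one element without citations, one with two \autocites
x <- c(
  "some text",
  "a \\autocites[cf.~][]{k1}{k2}{k3} b {keep}{separate} \\autocites{k4}{k5}"
)

# Step 1: locate every \autocites "word" (the command plus following non-whitespace)
m <- gregexpr("\\\\autocites\\S+", x, perl = TRUE)

# Step 2: rewrite only the matched pieces, using a fixed (non-regex) replacement
regmatches(x, m) <- lapply(
  regmatches(x, m),
  function(hit) gsub("}{", ",", hit, fixed = TRUE)
)

print(x)
```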

Wiktor Stribiżew

impressive! and it works also for the new testcase2. – rnuske Oct 24 '19 at 21:03
your 'gsub' solution is by far the fastest. The 'while-length-grep' approach is about 4 times slower and 'str_replace_all()' is about 20 times slower (benchmarked using testcase2). – rnuske Oct 25 '19 at 15:41
Wasn't familiar with `\G` and `\K`. Thanks! Can you please explain the purpose of the negative lookahead after the \G? – iod Oct 28 '19 at 02:16
@iod [`\G` operator](https://www.regular-expressions.info/continue.html) matches two positions: 1) start of string and 2) end of the previous successful match. By adding `(?!^)` (or `(?<!^)` / `(?<!\A)` / `(?!\A)`) the start of string position is excluded. – Wiktor Stribiżew Oct 28 '19 at 07:52
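
A small sketch of the point made in the comment above, assuming a made-up input whose first word happens to contain }{ without any \autocites. The names x, with_guard and without_guard are hypothetical.

```r
# Hypothetical input: the string starts with a brace pair that must stay untouched
x <- "{a}{b} text \\autocites{c}{d}"

# With (?!^): \G is not allowed to match at the start of the string
with_guard <- gsub("(?:\\G(?!^)|\\\\autocites)\\S*?\\K}{", ",", x, perl = TRUE)

# Without (?!^): \G also matches at position 0, so the first word is
# treated as if it were the continuation of a citation
without_guard <- gsub("(?:\\G|\\\\autocites)\\S*?\\K}{", ",", x, perl = TRUE)

print(with_guard)
print(without_guard)
```

Only the guarded pattern leaves the leading {a}{b} alone, which is the reason for the lookahead.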

score 1 · Answer 2 · answered Oct 24 '19 at 14:58

Not the prettiest solution, but it works. This repeatedly replaces }{ with , but only if it follows autocites with no intervening whitespace.

while(length(grep('(autocites\\S*)\\}\\{', testcase, perl=TRUE))) {
    testcase = sub('(autocites\\S*)\\}\\{', '\\1,', testcase, perl=TRUE)
}

testcase
[1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"
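
The loop above can also be wrapped in a function so that it handles a character vector such as testcase2. This is a hedged sketch: the function name collapse_autocites and the citation keys k1 to k5 are invented for illustration, and any(grepl()) is swapped in for length(grep()) as the loop condition.

```r
# Hypothetical helper: repeat the single-step sub() until nothing is left to collapse
collapse_autocites <- function(x) {
  pat <- "(autocites\\S*)\\}\\{"
  # keep looping while at least one element still has a }{ inside an autocites word
  while (any(grepl(pat, x, perl = TRUE))) {
    # sub() replaces one }{ per element per pass
    x <- sub(pat, "\\1,", x, perl = TRUE)
  }
  x
}

x <- c(
  "some text",
  "a \\autocites[cf.~][]{k1}{k2}{k3} b {keep}{separate} \\autocites{k4}{k5}"
)
out <- collapse_autocites(x)
print(out)
```

Because each pass removes only one }{ per element, the number of passes grows with the longest citation, which fits the slower timing reported in the comments on the accepted answer.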

G5W

your solution works great. and it also handles testcase2 well. – rnuske Oct 25 '19 at 09:53

score 0 · Answer 3 · answered Oct 24 '19 at 15:10

I'll make the input string slightly bigger to make the algorithm more clear.

str <- "
text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}
text \\autocites[cf.~][]{wattPattern1947}{foxMapping2000}{runkleGap1990} text {keep}{separate}
"

First we extract all the citation blocks, replace "}{" with "," in them, and then put them back into the string.

library(stringr)  # provides str_extract_all(), str_replace_all() and str_split()

# pattern for matching citation blocks
pattern <- "\\\\autocites(\\[[^\\[\\]]*\\])*(\\{[[:alnum:]]*\\})+"
cit <- str_extract_all(str, pattern)[[1]]
cit

#> [1] "\\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990}"
#> [2] "\\autocites[cf.~][]{wattPattern1947}{foxMapping2000}{runkleGap1990}"

Replace in citation blocks:

newcit <- str_replace_all(cit, "\\}\\{", ",")
newcit
#> [1] "\\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990}"
#> [2] "\\autocites[cf.~][]{wattPattern1947,foxMapping2000,runkleGap1990}"

Split the original string at the places where a citation block was found:

strspl <- str_split(str, pattern)[[1]]
strspl
#> [1] "\ntext "  " text {keep}{separate}\ntext "  " text {keep}{separate}\n"

Insert modified citation blocks:

combined <- character(length(strspl) + length(newcit))
combined[c(TRUE, FALSE)] <- strspl
combined[c(FALSE, TRUE)] <- newcit
combined
#> [1] "\ntext "                                                          
#> [2] "\\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990}"
#> [3] " text {keep}{separate}\ntext "                                    
#> [4] "\\autocites[cf.~][]{wattPattern1947,foxMapping2000,runkleGap1990}"
#> [5] " text {keep}{separate}\n"

Paste it together to finalize:

newstr <- paste(combined, collapse = "")
newstr
#> [1] "\ntext \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}\ntext \\autocites[cf.~][]{wattPattern1947,foxMapping2000,runkleGap1990} text {keep}{separate}\n"

I suspect there could be a more elegant fully-regex solution based on the same idea, but I wasn't able to find one.
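
The extract, replace, split and interleave steps above can be sketched as one function that also accepts a character vector. This version is an assumption-laden variant rather than the answer's own code: it uses base R (gregexpr() and regmatches() with invert = TRUE) instead of stringr, and the names collapse_citations and txt are hypothetical.

```r
# Same citation-block pattern as in the answer, used here with perl = TRUE
pattern <- "\\\\autocites(\\[[^\\[\\]]*\\])*(\\{[[:alnum:]]*\\})+"

# Hypothetical helper: process each element of x independently
collapse_citations <- function(x, pattern) {
  vapply(x, function(s) {
    m <- gregexpr(pattern, s, perl = TRUE)
    cit <- regmatches(s, m)[[1]]                     # the citation blocks
    pieces <- regmatches(s, m, invert = TRUE)[[1]]   # the text between them
    newcit <- gsub("}{", ",", cit, fixed = TRUE)     # collapse inside blocks only
    # interleave: pieces at odd positions, rewritten blocks at even positions
    combined <- character(length(pieces) + length(newcit))
    combined[2 * seq_along(pieces) - 1] <- pieces
    combined[2 * seq_along(newcit)] <- newcit
    paste(combined, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

txt <- c(
  "some text",
  "\ntext \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}\ntext \\autocites[cf.~][]{wattPattern1947}{foxMapping2000}{runkleGap1990} text {keep}{separate}\n"
)
out <- collapse_citations(txt, pattern)
print(out)
```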

See my suggested solution above for a single line, fully-regex implementation of the same idea. — iod, Oct 24 '19 at 18:10
Yes, there is such a regex, see [my answer](https://stackoverflow.com/a/58548812/3832970) — Wiktor Stribiżew, Oct 24 '19 at 20:37

iod · Answer 4 · 2019-10-24T18:21:48.463

I found an incantation that works. It's not pretty:

gsub("\\\\autocites[^ ]*",
  gsub("\\}\\{",",",
    gsub(".*(\\\\autocites[^ ]*).*","\\\\\\1",testcase) #all those extra backslashes are there because R is ridiculous.
    ),
  testcase)

I broke it into lines to hopefully make it a little more intelligible. Basically, the innermost gsub extracts just the autocites word (\\autocites and anything that follows it up to the first space), then the middle gsub replaces the }{s with commas, and the outermost gsub substitutes the result of the middle one wherever the \\autocites pattern matches in the original string.

This will only work with a single autocites in a string, of course.
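
To make that limitation concrete, here is a small sketch with a made-up string containing two citations (the variable two and the keys a to e are hypothetical). The greedy .* in the innermost call captures only the last \\autocites word, and the outermost call then writes that one result over every citation.

```r
# Hypothetical input with two separate citations
two <- "text \\autocites{a}{b}{c} text \\autocites{d}{e}"

out <- gsub("\\\\autocites[^ ]*",
  gsub("\\}\\{", ",",
    gsub(".*(\\\\autocites[^ ]*).*", "\\\\\\1", two)  # captures the last citation only
  ),
  two)

# Both citations now carry the keys of the last one; a, b and c are lost
print(out)
```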

Also, fortune(365).

Your solution looks great. Testing it with the real data set, I found I actually have elements with more than one '\\autocites'. Sorry for the imprecise testcase. — rnuske, Oct 24 '19 at 20:29
