
I am trying to postprocess the LaTeX (pdf_book output) of a bookdown document to collapse biblatex citations, so that they can later be sorted chronologically with \usepackage[sortcites]{biblatex}. To do that, I need to find every }{ that follows \\autocites and replace it with a comma. I am experimenting with gsub() but can't find the correct incantation.

# example input
testcase <- "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}"

# desired output
"text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"

A simple approach is to replace every }{:

> gsub('\\}\\{', ',', testcase, perl=TRUE)
[1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep,separate}"

But this also collapses {keep}{separate}.

I then tried to replace }{ only within a 'word' (a run of characters without whitespace) starting with \\autocites, using several capture groups, and failed bitterly:

> gsub('(\\\\autocites)([^ \f\n\r\t\v}{}]+)((\\}\\{})+)', '\\1\\2\\3', testcase, perl=TRUE)
[1] "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}"

Addendum: The actual document contains more lines/elements than the testcase above. Not all elements contain \\autocites and in rare cases one element has more than one \\autocites. I didn't originally think this was relevant. A more realistic testcase:

testcase2 <- c("some text",
"text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}",
"text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}")
rnuske

4 Answers


A single gsub call is enough:

gsub("(?:\\G(?!^)|\\\\autocites)\\S*?\\K}{", ",", testcase, perl=TRUE)
## => [1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"

Here, (?:\G(?!^)|\\autocites) matches either the end of the previous successful match or the literal \autocites string. \S*? then matches zero or more non-whitespace characters, as few as possible. \K discards everything matched so far from the reported match, so only the following }{ is part of the match and gets replaced with a comma.
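To see what each branch of the pattern contributes, here is a small sketch. It assumes the testcase2 vector from the question's addendum; the string x and the names res2, with_guard and without_guard are made up for illustration. The second call drops the (?!^) guard so that \G is also allowed to match at the very start of the string.

```r
# testcase2 as given in the question's addendum
testcase2 <- c("some text",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}")

pattern <- "(?:\\G(?!^)|\\\\autocites)\\S*?\\K}{"

# gsub() is vectorised, so every element (and every \autocites in it) is handled
res2 <- gsub(pattern, ",", testcase2, perl = TRUE)

# illustrative input that starts with a brace pair that must stay untouched
x <- "{keep}{separate} \\autocites{a}{b}"

# with the guard, \G cannot match at the start of the string
with_guard <- gsub(pattern, ",", x, perl = TRUE)
# => "{keep}{separate} \\autocites{a,b}"

# without the guard, \G matches at position 0 and the first }{ is replaced too
without_guard <- gsub("(?:\\G|\\\\autocites)\\S*?\\K}{", ",", x, perl = TRUE)
# => "{keep,separate} \\autocites{a,b}"
```

The guard only matters when a string begins with a non-whitespace run that contains }{, but leaving it out would silently change such lines.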

There is also a very readable solution that combines one regex replacement with one fixed-text replacement, using stringr::str_replace_all:

library(stringr)
str_replace_all(testcase, "\\\\autocites\\S+", function(x) gsub("}{", ",", x, fixed=TRUE))
# => [1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"

Here, \\autocites\S+ matches \autocites followed by one or more non-whitespace characters, and gsub("}{", ",", x, fixed=TRUE) replaces each }{ with , in the matched text; the fixed-string replacement itself is very fast.
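The same "match the citation block, then fix it up" idea can probably be written without stringr. The following is a sketch in base R, assuming the testcase2 vector from the question; the helper name collapse_with_regmatches is hypothetical and not part of any package.

```r
# testcase2 as given in the question's addendum
testcase2 <- c("some text",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}")

# hypothetical helper: locate each \autocites block, rewrite it, write it back
collapse_with_regmatches <- function(x) {
  # positions of every \autocites followed by non-whitespace characters
  m <- gregexpr("\\\\autocites\\S+", x, perl = TRUE)
  # replace }{ by , inside the matched blocks only; elements without a match
  # yield character(0) and are left unchanged
  regmatches(x, m) <- lapply(regmatches(x, m),
                             function(hit) gsub("}{", ",", hit, fixed = TRUE))
  x
}

out <- collapse_with_regmatches(testcase2)
```

This keeps the readability of the two-step approach while relying only on gregexpr() and regmatches() from base R.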

Wiktor Stribiżew
  • impressive! and it works also for the new testcase2. – rnuske Oct 24 '19 at 21:03
  • your 'gsub' solution is by far the fastest. The 'while-length-grep' approach is about 4 times slower and 'str_replace_all()' is about 20 times slower (benchmarked using testcase2). – rnuske Oct 25 '19 at 15:41
  • Wasn't familiar with `\G` and `\K`. Thanks! Can you please explain the purpose of the negative lookahead after the \G? – iod Oct 28 '19 at 02:16
  • @iod [`\G` operator](https://www.regular-expressions.info/continue.html) matches two positions: 1) start of string and 2) end of the previous successful match. By adding `(?!^)` (or `(?<!^)` / `(?<!\A)` / `(?!\A)`) the start of string position is excluded. – Wiktor Stribiżew Oct 28 '19 at 07:52

Not the prettiest solution, but it works. This repeatedly replaces }{ with a comma, but only if it follows autocites with no intervening whitespace.

while(length(grep('(autocites\\S*)\\}\\{', testcase, perl=TRUE))) {
    testcase = sub('(autocites\\S*)\\}\\{', '\\1,', testcase, perl=TRUE)
}

testcase
[1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"
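Because sub() and grep() are vectorised, the same loop should also cope with a character vector such as testcase2 from the question, including elements with more than one \autocites. A minimal sketch, where the function name collapse_with_loop is hypothetical and the pattern additionally anchors on the leading backslash:

```r
# testcase2 as given in the question's addendum
testcase2 <- c("some text",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}")

# hypothetical wrapper around the while loop shown above
collapse_with_loop <- function(x) {
  pattern <- "(\\\\autocites\\S*)\\}\\{"
  # each pass replaces one }{ per element; stop when no element matches any more
  while (any(grepl(pattern, x, perl = TRUE))) {
    x <- sub(pattern, "\\1,", x, perl = TRUE)
  }
  x
}

looped <- collapse_with_loop(testcase2)
```

The number of passes equals the largest number of }{ pairs in any single element, which is one reason this approach tends to be slower than a single gsub call.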
G5W

I'll make the input string slightly bigger to make the algorithm clearer.

str <- "
text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}
text \\autocites[cf.~][]{wattPattern1947}{foxMapping2000}{runkleGap1990} text {keep}{separate}
"

We first extract all the citation blocks, replace "}{" with "," in them, and then put them back into the string.

library(stringr)

# pattern for matching citation blocks
pattern <- "\\\\autocites(\\[[^\\[\\]]*\\])*(\\{[[:alnum:]]*\\})+"
cit <- str_extract_all(str, pattern)[[1]]
cit

#> [1] "\\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990}"
#> [2] "\\autocites[cf.~][]{wattPattern1947}{foxMapping2000}{runkleGap1990}"

Replace in citation blocks:

newcit <- str_replace_all(cit, "\\}\\{", ",")
newcit
#> [1] "\\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990}"
#> [2] "\\autocites[cf.~][]{wattPattern1947,foxMapping2000,runkleGap1990}"

Split the original string at the places where a citation block was found:

strspl <- str_split(str, pattern)[[1]]
strspl
#> [1] "\ntext "  " text {keep}{separate}\ntext "  " text {keep}{separate}\n"

Insert modified citation blocks:

combined <- character(length(strspl) + length(newcit))
combined[c(TRUE, FALSE)] <- strspl
combined[c(FALSE, TRUE)] <- newcit
combined
#> [1] "\ntext "                                                          
#> [2] "\\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990}"
#> [3] " text {keep}{separate}\ntext "                                    
#> [4] "\\autocites[cf.~][]{wattPattern1947,foxMapping2000,runkleGap1990}"
#> [5] " text {keep}{separate}\n"

Paste it together to finalize:

newstr <- paste(combined, collapse = "")
newstr
#> [1] "\ntext \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}\ntext \\autocites[cf.~][]{wattPattern1947,foxMapping2000,runkleGap1990} text {keep}{separate}\n"

I suspect there could be a more elegant fully-regex solution based on the same idea, but I wasn't able to find one.

Iaroslav Domin

I found an incantation that works. It's not pretty:

gsub("\\\\autocites[^ ]*",
  gsub("\\}\\{",",",
    gsub(".*(\\\\autocites[^ ]*).*","\\\\\\1",testcase) #all those extra backslashes are there because R is ridiculous.
    ),
  testcase)

I broke it into lines to hopefully make it a little more intelligible. Basically, the innermost gsub extracts just the autocites block (\\autocites and everything after it up to the first space), the middle gsub replaces each }{ in that block with a comma, and the outermost gsub substitutes the result of the middle one for the autocites block in the original string.

This will only work with a single autocites in a string, of course.

Also, fortune(365).

iod
  • Your solution looks great. Testing it with the real data set, I found I actually have elements with more than one '\\autocites'. Sorry for the imprecise testcase. – rnuske Oct 24 '19 at 20:29