
I'd like to find an elegant and easily manipulable way to:

  1. extract multiple substrings from some, but not all, strings that are contained as elements of a list (each list element consists of just one long string)
  2. replace the respective original long string with these multiple substrings
  3. collapse the substrings in each list element into 1 string
  4. return a list of same length containing the replacement substrings and the untouched long strings as appropriate.

This question is a follow-up to (though distinct from) my earlier question, "replace strings of some list elements with substring". Note that I don't want to run the regex patterns over all list elements, only over those elements to which the regex applies.

I know the end result can be delivered by str_replace or sub by matching the entire strings to be changed and returning the text captured by capturing groups, as follows:

library(stringr)
myList <- as.list(c("OneTwoThreeFourFive", "mnopqrstuvwxyz", "ghijklmnopqrs", "TwentyTwoFortyFourSixty"))
fileNames <- c("AB1997R.txt", "BG2000S.txt", "MN1999R.txt", "DC1997S.txt")
names(myList) <- fileNames
is1997 <- str_detect(names(myList), "1997")

regexp <- ".*(Two).*(Four).*"
myListNew2 <- myList
myListNew2[is1997] <- lapply(myList[is1997], function(i) str_replace(i, regexp, "\\1££\\2"))

## This does return what I want:
myListNew2
$AB1997R.txt
[1] "Two££Four"

$BG2000S.txt
[1] "mnopqrstuvwxyz"

$MN1999R.txt
[1] "ghijklmnopqrs"

$DC1997S.txt
[1] "Two££Four"
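
To illustrate why the pattern above has to cover the whole string, here is a minimal base-R sketch using `sub` on one of the sample strings. The variable names `s`, `partial` and `full` are made up for this illustration.

```r
# Sample string taken from the example list above
s <- "OneTwoThreeFourFive"

# Pattern spans only "Two" ... "Four": the text outside the match survives
partial <- sub("(Two).*(Four)", "\\1££\\2", s)

# Pattern spans the whole string: only the two captured groups remain
full <- sub(".*(Two).*(Four).*", "\\1££\\2", s)

print(partial)  # "OneTwo££FourFive"
print(full)     # "Two££Four"
```

Only the matched portion is replaced, so anything the pattern does not consume is left in the result.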

But I would prefer to do it without having to match the entire original text, for two reasons: matching very long texts takes time, and knitting multiple regex patterns together so that they match entire strings is complex. I would like to use separate regex patterns to extract the substrings and then replace the original string with these extracts. I came up with the following, which works, but surely there is an easier, better way. Perhaps llply?

patternA <- "Two"
patternB <- "Four"
x <- myList[is1997]
x2 <- unlist(x)
stringA <- str_extract(x2, patternA)
stringB <- str_extract(x2, patternB)
x3 <- mapply(FUN=c, stringA, stringB, SIMPLIFY=FALSE)
x4 <- lapply(x3, function(i) paste(i, collapse = "££"))
x5 <- relist(x4,x2)
myListNew1 <- replace(myList, is1997, x5)
myListNew1

$AB1997R.txt
[1] "Two££Four"

$BG2000S.txt
[1] "mnopqrstuvwxyz"

$MN1999R.txt
[1] "ghijklmnopqrs"

$DC1997S.txt
[1] "Two££Four"
Brigitte
  • You didn't process the "entire list", ... only worked on `myList[is1997]`. Generally `llply` will be _slower_ than using regex functions properly. I would guess that your new method would be a lot slower. – IRTFM Jun 04 '15 at 01:11
  • I would guess that needing to look back at the items in the referenced items like `\\1` would slow things down. Suspect code like this would be faster: `gsub(".*Two.*Four.*", "Two££Four", myList[is1997])`. It would also be easier to build a for loop to create strings for 'patt' and 'replacement'. – IRTFM Jun 04 '15 at 01:23
  • Thanks BondedDust for responding! Actually, I don't want to process the entire list, only a subset thereof, because if I applied the regexes to all texts, I would have hits in many of the non-target texts that would be incorrect. – Brigitte Jun 04 '15 at 15:34
  • As for substituting "Two££Four", it wouldn't work, as I want to substitute the hits (which are text-specific) from the subset of texts, not a fixed string. Nevertheless, many thanks! – Brigitte Jun 04 '15 at 15:36

2 Answers


Something like this maybe, where I've extended the patterns you are looking for to show how it could become adaptable:

library(stringr)
patterns <- c("Two","Four","Three")
hits <- lapply(myList[is1997], function(x) {
  out <- sapply(patterns, str_extract, string=x)
  paste(out[!is.na(out)],collapse="££")
})
myList[is1997] <- hits

#[[1]]
#[1] "Two££Four££Three"
#
#[[2]]
#[1] "mnopqrstuvwxyz"
#
#[[3]]
#[1] "ghijklmnopqrs"
#
#[[4]]
#[1] "Two££Four"
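
The same idea can be wrapped in a small reusable helper. The sketch below is one possible variant, not the code above: it swaps `str_extract` for base R's `regexpr`/`regmatches` so it runs without extra packages, and `extract_and_join` is a hypothetical name chosen for this illustration. It assumes each list element is a single string.

```r
# Hypothetical helper: take the first match of each pattern from one string
# and join the hits; patterns that do not match are skipped.
extract_and_join <- function(x, patterns, sep = "££") {
  hits <- unlist(lapply(patterns, function(p) regmatches(x, regexpr(p, x))))
  paste(hits, collapse = sep)
}

# Sample data rebuilt from the question
myList <- list(
  "AB1997R.txt" = "OneTwoThreeFourFive",
  "BG2000S.txt" = "mnopqrstuvwxyz",
  "MN1999R.txt" = "ghijklmnopqrs",
  "DC1997S.txt" = "TwentyTwoFortyFourSixty"
)
is1997 <- grepl("1997", names(myList), fixed = TRUE)

result <- myList  # work on a copy so myList stays untouched
result[is1997] <- lapply(myList[is1997], extract_and_join,
                         patterns = c("Two", "Four", "Three"))
str(result)
```

The hits come back in the order of `patterns`, not in the order they occur in the text, and a string in which no pattern matches becomes `""`.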
thelatemail
  • Thanks thelatemail! This sure looks neater. It also helps me understand how to write a function inside the lapply operation. – Brigitte Jun 04 '15 at 15:39

Extract all matches with a single alternation pattern and join them into one string per element:

library(stringi)

patterns <- 'Two|Three|Four'

hits <- stri_join_list(stri_extract_all_regex(myList[is1997],patterns),sep = '££')

myList[is1997] <- hits
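
With an alternation pattern the matches come back in the order they occur in each string, not in the order the alternatives are written. A minimal base-R sketch of the same extract-all-then-join approach, using `gregexpr` and `regmatches` in place of the stringi calls (the vector `x` and the names `all_hits` and `joined` are assumptions made for this illustration):

```r
# Two target strings from the question plus one string with no match
x <- c("OneTwoThreeFourFive", "TwentyTwoFortyFourSixty", "mnopqrstuvwxyz")
pattern <- "Two|Three|Four"

# gregexpr() locates every match; regmatches() returns them per string,
# in the order they occur in the text
all_hits <- regmatches(x, gregexpr(pattern, x))

# Join the matches of each string; a string with no match yields ""
joined <- vapply(all_hits, paste, character(1), collapse = "££")
print(joined)
```

For the first string this gives "Two££Three££Four", which differs from the pattern-ordered result of the per-pattern approach.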
asepsiswu