For LSAfun::genericSummary, I want to split some strings on c(".", "!", "?"). I use the option fixed = TRUE, but it still returns the wrong result. I want to understand why it doesn't work, because I can't modify the call.

In fact, strsplit is not called directly but via LSAfun::genericSummary, and the summary is not the expected one because of this unexpected strsplit result.

strsplit("Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?",
         split = c(".", "!", "?"), fixed = TRUE)[[1]] 

returns :

[1] "Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?"

expected :

[1] "Faut-il reconnaitre le vote blanc " " Faut-il rendre le vote obligatoire " ""

I'm lost. Can anyone explain this?

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252 LC_NUMERIC=C                   LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.3.0 yaml_2.1.18           

the function :

function (text, k, split = c(".", "!", "?"), min = 5, breakdown = FALSE, 
    ...) 
{
    sentences <- unlist(strsplit(text, split = split, fixed = T))
    if (breakdown == TRUE) {
        sentences <- breakdown(sentences)
    }
    sentences <- sentences[nchar(sentences) > min]
    td = tempfile()
    dir.create(td)
    for (i in 1:length(sentences)) {
        docname <- paste("sentence", i, ".txt", sep = "")
        write(sentences[i], file = paste(td, docname, sep = "/"))
    }
    A <- textmatrix(td, ...)
    rownames <- rownames(A)
    colnames <- colnames(A)
    A <- matrix(A, nrow = nrow(A), ncol = ncol(A))
    rownames(A) <- rownames
    colnames(A) <- colnames
    unlink(td, T, T)
    Vt <- lsa(A, dims = length(sentences))$dk
    snum <- vector(length = k)
    for (i in 1:k) {
        snum[i] <- names(Vt[, i][abs(Vt[, i]) == max(abs(Vt[, 
            i]))])
    }
    snum <- gsub(snum, pattern = "[[:alpha:]]", replacement = "")
    snum <- gsub(snum, pattern = "[[:punct:]]", replacement = "")
    snum <- as.integer(snum)
    summary.sentences <- sentences[snum]
    return(summary.sentences)
}
<environment: namespace:LSAfun>

2 Answers

For multiple split elements, place them inside [] and remove fixed = TRUE, or paste the patterns together with | to split on any one of them:

strsplit("Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?",
            split = "[.!?]")[[1]] 

According to ?strsplit

split - If empty matches occur, in particular if split has length 0, x is split into single characters. If split has length greater than 1, it is re-cycled along x.
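
The recycling rule quoted above appears to explain the original result: with a single input string, only the first element of split (here ".") is used, and the sentence contains no ".". A minimal sketch in base R; the three-element input vector at the end is a made-up illustration, not from the question:

```r
x <- "Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?"

# One input string: only the first separator (".") is applied.
res_vec <- strsplit(x, split = c(".", "!", "?"), fixed = TRUE)[[1]]
res_dot <- strsplit(x, split = ".", fixed = TRUE)[[1]]
print(identical(res_vec, res_dot))

# Three input strings: each string is paired with its own separator.
res_rec <- strsplit(c("a.b", "c!d", "e?f"),
                    split = c(".", "!", "?"), fixed = TRUE)
print(res_rec)
```

So the vector is not treated as a set of alternative separators; it is matched element by element against the input strings.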

akrun
  • thank you, this does work when used directly. but I can't use it. the call is made via the function `LSAfun::genericSummary`, I'd like to know why it doesn't work in the first form – G. Lombardo Jan 15 '19 at 10:48
  • @G.Lombardo. In that case, create the pattern with `paste`, i.e. `v1 <- c(".", "!", "?"); pat <- paste0("[", paste(v1, collapse=""), "]")` – akrun Jan 15 '19 at 10:51
  • OK, I wrote a new function to bypass this problem. Thank you. – G. Lombardo Jan 15 '19 at 11:04
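
If the strsplit call inside genericSummary cannot be changed, one possible workaround (a sketch, not necessarily what the asker ended up writing) is to normalize the sentence terminators to a single character before the call and pass that one character as split. The genericSummary call is left commented out, and its k value is hypothetical.

```r
text <- "Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?"

# Replace every "!" and "?" with "." so that a single fixed separator suffices.
normalized <- gsub("[!?]", ".", text)

# This mirrors the strsplit call made inside genericSummary (fixed = TRUE).
sentences <- unlist(strsplit(normalized, split = ".", fixed = TRUE))
print(sentences)

# Hypothetical call, assuming LSAfun is installed; k = 1 is an arbitrary choice:
# LSAfun::genericSummary(normalized, k = 1, split = ".")
```

The trade-off is that the original punctuation is lost in the returned sentences, which may or may not matter for a summary.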

You can also omit the fixed = TRUE part and escape the characters, i.e.

strsplit("Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?", c("\\.|!|\\?"))

Of course it will not be as efficient since we are going through the regex engine.
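
As a quick check, the escaped alternation and a bracket class built from the separator vector should split this sentence the same way. A minimal sketch; the names seps, class_pat and alt_pat are made up for illustration, and the bracket-class construction assumes the separators contain no characters that are special inside [] (such as "]", "^" or "-"):

```r
x <- "Faut-il reconnaitre le vote blanc ? Faut-il rendre le vote obligatoire ?"
seps <- c(".", "!", "?")

# Bracket class built from the vector: "[.!?]"
class_pat <- paste0("[", paste(seps, collapse = ""), "]")

# Escaped alternation, written out by hand
alt_pat <- "\\.|!|\\?"

by_class <- strsplit(x, split = class_pat)[[1]]
by_alt <- strsplit(x, split = alt_pat)[[1]]

print(by_class)
print(identical(by_class, by_alt))
```

Note that strsplit does not return a trailing empty string when the input ends with a separator, so the result has two elements rather than the three shown in the question.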

Sotos