
I am working with strings that use two separators, "*" and "|", for example:

"3*4|2*7.4|8*3.2"

The number right before "*" denotes the frequency, and the float or integer right after "*" denotes the value. These frequency-value pairs are separated by "|".

So from "3*4|2*7.4|8*3.2", I would like to get the following vector:

"4","4","4","7.4","7.4","3.2","3.2","3.2","3.2","3.2","3.2","3.2","3.2"

I have come up with the following code, which runs with no errors or warnings, but the result is not what I expected:

strsplit("3*4|2*7.4|8*3.2", "[*|]") %>% #Split into a vector with two different separator characters
  unlist %>% #strsplit returns a list, so let's unlist it
         mapply(FUN = rep,
                x = .[seq(from = 2, to = length(.), by = 2)], #these sequences mean even and odd index in this respect
                times = .[seq(from = 1, to = length(.), by = 2)], #rep() flexibly accepts times argument also as string
                USE.NAMES = FALSE) %>%
         unlist #mapply returns a list, so let's unlist it

[1] "4"   "4"   "4"   "7.4" "7.4" "7.4" "7.4" "3.2" "3.2" "4"   "4"   "4"   "4"   "4"   "4"   "4"   "7.4" "7.4" "7.4" "7.4" "7.4" "7.4" "7.4" "7.4" "3.2" "3.2" "3.2"

As you can see, something weird has happened. "4" has been repeated three times, which is correct, but "7.4" has been repeated four times (incorrectly) and so on.

What is going on here?

Bas H

3 Answers


You could split in two steps and use lapply:

"3*4|2*7.4|8*3.2" %>% strsplit("[|]") %>%
                      unlist %>%
                      strsplit("[*]") %>%
                      lapply(function(x) rep(x[2],x[1])) %>%
                      unlist

# [1] "4"   "4"   "4"   "7.4" "7.4" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2"
Waldi

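You can replace each | with a newline, read the result into a data frame with read.delim(), and pass its columns to rep() via do.call(). Note that the result is numeric rather than character.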

do.call(rep,
        read.delim(text = gsub("\\|", "\n", "3*4|2*7.4|8*3.2"),
                   sep = "*",
                   header = FALSE,
                   col.names = c("times", "x"))
        )

[1] 4.0 4.0 4.0 7.4 7.4 3.2 3.2 3.2 3.2 3.2 3.2 3.2 3.2
Ritchie Sacramento

1a) The problem with the code in the question is that %>% passes the dot as the first argument of mapply, in addition to the places where . is written explicitly. To avoid this, replace the mapply lines with the following, where ... stands for the same arguments as in the question:

{ mapply(...) } %>%
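
To make 1a concrete, here is a minimal sketch, assuming the magrittr package is attached; the names v, odd, even, piped, explicit and fixed are only illustrative and not from the question. Because the dot in the question appears only inside nested calls such as .[...], %>% still inserts the piped vector as the first argument, so an extra, unnamed vector is handed to rep as well, which changes how many repetitions come out. Wrapping the call in braces suppresses that insertion.

```r
library(magrittr)

# The vector that strsplit() + unlist produce for the question's input
v <- c("3", "4", "2", "7.4", "8", "3.2")
odd  <- seq(from = 1, to = length(v), by = 2)
even <- seq(from = 2, to = length(v), by = 2)

# The pipeline as written in the question: the dot is only used in nested calls,
# so %>% also inserts v as the first argument of mapply()
piped <- v %>%
  mapply(FUN = rep, x = .[even], times = .[odd], USE.NAMES = FALSE) %>%
  unlist

# The same call written out by hand, with v as an extra unnamed argument
explicit <- unlist(
  mapply(v, FUN = rep, x = v[even], times = v[odd], USE.NAMES = FALSE)
)

identical(piped, explicit)  # both show the unexpected result

# With braces, the dot goes only where it is written
# (as.integer() is an addition here, to make the counts explicit)
fixed <- v %>%
  { mapply(FUN = rep, x = .[even], times = as.integer(.[odd]), USE.NAMES = FALSE) } %>%
  unlist
fixed
```

The braces turn the right-hand side into an expression in which . is just a variable, so nothing is added to the argument list.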

1b) Actually mapply is not needed in the first place since rep is vectorized:

x %>%
  strsplit("[*|]") %>%
  unlist %>%
  { rep(x = .[seq(from = 2, to = length(.), by = 2)],
        times = .[seq(from = 1, to = length(.), by = 2)])
  }
## [1] "4" "4" "4" "7.4" "7.4" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2"

1c) A further simplification is to index with logical vectors, which are recycled along the length of the vector being indexed:

x %>%
  strsplit("[*|]") %>%
  unlist %>%
  { rep(x = .[c(FALSE, TRUE)], times = .[c(TRUE, FALSE)]) }
## [1] "4" "4" "4" "7.4" "7.4" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2"
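
In case the recycling in 1c is not obvious, here is a small base R sketch (the names v, counts and values are only illustrative). A logical index shorter than the vector is repeated until it covers every element, so c(TRUE, FALSE) keeps the odd positions and c(FALSE, TRUE) keeps the even ones.

```r
# The vector that strsplit() + unlist produce for the question's input
v <- c("3", "4", "2", "7.4", "8", "3.2")

# c(TRUE, FALSE) is recycled to TRUE, FALSE, TRUE, FALSE, ...: odd positions
counts <- v[c(TRUE, FALSE)]

# c(FALSE, TRUE) is recycled to FALSE, TRUE, FALSE, TRUE, ...: even positions
values <- v[c(FALSE, TRUE)]

# Same selections as the seq()-based indices used in 1b
identical(counts, v[seq(from = 1, to = length(v), by = 2)])
identical(values, v[seq(from = 2, to = length(v), by = 2)])
```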

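1d) A base R version using the native pipe (R 4.1 or later) follows. Since strsplit returns a list of length one, setNames names its single element x so that with() can refer to it: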

x |>
  strsplit("[*|]") |>
  setNames("x") |>
  with(rep(x = x[c(FALSE, TRUE)], times = x[c(TRUE, FALSE)]))
## [1] "4" "4" "4" "7.4" "7.4" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2"
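
To see what the setNames/with combination in 1d is doing, here is a step-by-step sketch without the pipe (the names parts and result are only illustrative, and as.integer() is an addition to make the counts explicit). Inside with(), x refers to the list element, not to the original string.

```r
x <- "3*4|2*7.4|8*3.2"

# strsplit() returns a one-element list; name that element "x"
parts <- setNames(strsplit(x, "[*|]"), "x")
names(parts)

# with() evaluates the expression using the list elements as variables,
# so x here is the character vector c("3", "4", "2", "7.4", "8", "3.2")
result <- with(parts, rep(x = x[c(FALSE, TRUE)],
                          times = as.integer(x[c(TRUE, FALSE)])))
result
```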

Also note the following one-liners:

2a) The following one-liner matches the two numbers in each pair and passes them as separate arguments to an anonymous function written in formula notation, returning that function's output. The input x is from the question and is defined explicitly in the Note at the end.

library(gsubfn)

strapply(x, "([0-9]+)\\*([0-9.]+)", n + x ~ rep(x, n))[[1]]
## [1] "4" "4" "4" "7.4" "7.4" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2"

2b) If we have a character vector of strings like x, the same code works once the [[1]] is removed. In that case it returns a list with one result per string.

xx <- c(x, x)
strapply(xx, "([0-9]+)\\*([0-9.]+)", n + x ~ rep(x, n))
## [[1]]
## [1] "4" "4" "4" "7.4" "7.4" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2"
##
## [[2]]
## [1] "4" "4" "4" "7.4" "7.4" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2"

3) Another way to do it is to extract the repetition numbers and the values separately and pass each such vector to rep.

library(gsubfn)

rep(strapplyc(x, "\\*([0-9.]+)")[[1]], strapplyc(x, "(\\d+)\\*")[[1]])
## [1] "4" "4" "4" "7.4" "7.4" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2" "3.2"
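
If you would rather not depend on gsubfn, the idea in 3) can also be sketched in base R with regmatches() and gregexpr(). This is only one possible way to write it; the lookaround patterns and the names counts, values and result are my own assumptions, not part of the answer above.

```r
x <- "3*4|2*7.4|8*3.2"

# Repetition counts: digits immediately followed by "*"
# (the lookahead keeps "*" out of the match)
counts <- regmatches(x, gregexpr("[0-9]+(?=\\*)", x, perl = TRUE))[[1]]

# Values: digits and dots immediately preceded by "*"
values <- regmatches(x, gregexpr("(?<=\\*)[0-9.]+", x, perl = TRUE))[[1]]

# rep() is vectorized over times, so no mapply() is needed
result <- rep(values, times = as.integer(counts))
result
```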

Note

The input used is:

x <- "3*4|2*7.4|8*3.2"
G. Grothendieck
  • I personally like 1c. However, I was thinking that in this case and more generally, one should use (x %>% strsplit("[*|]"))[[1]] instead of x %>% strsplit("[*|]") %>% unlist, since the latter involves additional function unlist, and is therefore slower. Any thoughts on this? – Aku-Ville Lehtimäki Mar 14 '23 at 09:21
  • I doubt there is any material difference in performance. Also note that that code does not work because [[ is a function too. One would have to write this which seems painful: x %>% strsplit("[*|]") %>% (`[[`)(1) – G. Grothendieck Mar 14 '23 at 13:09
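
As a side note to the comments above, magrittr exports extract2() as an alias for `[[`, which may be a less painful way to write that last step. A minimal sketch, assuming magrittr is attached (the names via_unlist and via_extract2 are only illustrative):

```r
library(magrittr)

x <- "3*4|2*7.4|8*3.2"

# Flatten the list returned by strsplit()
via_unlist <- x %>% strsplit("[*|]") %>% unlist

# Take the first list element instead; extract2() is magrittr's alias for `[[`
via_extract2 <- x %>% strsplit("[*|]") %>% extract2(1)

identical(via_unlist, via_extract2)
```

For a single input string the two give the same vector. With several input strings they differ: unlist concatenates all the pieces, while extract2(1) keeps only those of the first string.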