Fast way to parse vector of "continent / country / city" in R

Question

I have a character vector in R with each string composed of "continent / country / city", e.g.

x=rep("Africa / Kenya / Nairobi", 1000000)

but the separator " / " is occasionally mistyped without the surrounding spaces, as "/", and in some cases the city is missing altogether, e.g. "Africa / Kenya".

I would like to parse this into three vectors continent, country & city, using NA if city is missing.

For country I currently do something like

country = sapply(x, function(loc) trimws(strsplit(loc,"/", fixed = TRUE)[[1]][2]))

but that's very slow if the vector x is long. What would be an efficient way to parse this in R?

`strsplit` is already vectorized, so it would probably be better to call it directly rather than through `sapply`. But what exactly counts as "very slow", and what are the requirements for a "more efficient" result? If performance is that much of a concern, you can always write your own parser in C++ with Rcpp. – MrFlick Jul 05 '21 at 07:10

GKi · Answer 1 · 2021-07-07T07:10:10.320

You can combine the split pieces with do.call(rbind, ...). Subsetting each piece with "[" and 1:3 inside lapply pads it to three elements, so a missing city becomes NA.

x <- c("Africa / Kenya / Nairobi", "Africa/Kenya/Nairobi", "Africa / Kenya")

y <- do.call(rbind, lapply(strsplit(x, "/", TRUE), "[", 1:3))
y <- trimws(y, whitespace = " ")

y
#     [,1]     [,2]    [,3]     
#[1,] "Africa" "Kenya" "Nairobi"
#[2,] "Africa" "Kenya" "Nairobi"
#[3,] "Africa" "Kenya" NA
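To get the three separate vectors the question asks for, the columns of this matrix can be pulled out by position. A minimal sketch, assuming the pieces always appear in the order continent, country, city; the variable names come from the question, while `parts` is just an illustrative name:

```r
x <- c("Africa / Kenya / Nairobi", "Africa/Kenya/Nairobi", "Africa / Kenya")

# One vectorized split, then pad every piece to length 3 and trim spaces
parts <- strsplit(x, "/", fixed = TRUE)
y <- trimws(do.call(rbind, lapply(parts, "[", 1:3)), whitespace = " ")

# Columns are assumed to be in input order: continent, country, city
continent <- y[, 1]
country   <- y[, 2]
city      <- y[, 3]  # NA where the city was missing
```

Indexing a column of a character matrix returns a plain character vector, so no further conversion is needed.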

Or using data.table:

x <- c("Africa / Kenya / Nairobi", "Africa/Kenya/Nairobi", "Africa / Kenya")

y <- do.call(cbind, data.table::tstrsplit(x, "/", TRUE))
y <- trimws(y, whitespace = " ")

y
#     [,1]     [,2]    [,3]     
#[1,] "Africa" "Kenya" "Nairobi"
#[2,] "Africa" "Kenya" "Nairobi"
#[3,] "Africa" "Kenya" NA

Benchmark

#x <- rep("Africa / Kenya / Nairobi", 1000000) #Timings will depend on the used dataset

n <- 1e6L
f1 <- function(n) replicate(n, paste(sample(letters, sample(5:15, 1), TRUE), collapse = ""))
f2 <- function(n) sample(c("/", " /", "/ ", " / "), n, TRUE)
set.seed(42)
x <- paste0(f1(n), f2(n), f1(n), sample(c(paste0(f2(n%/%2L), f1(n%/%2L)), rep("", n - n%/%2L))))

system.time( #Method given in the question
  sapply(x, function(loc) trimws(strsplit(loc,"/", fixed = TRUE)[[1]][2])))
#       user      system     elapsed
#     47.718       0.004      47.798 

system.time(  #Using strsplit and trimws
  trimws(do.call(rbind, lapply(strsplit(x, "/", TRUE), "[", 1:3)), whitespace = " "))
#       user      system     elapsed
#      5.446       0.008       5.454 

system.time(  #Using data.table::tstrsplit and trimws
  trimws(do.call(cbind, data.table::tstrsplit(x, "/", TRUE)), whitespace = " "))
#       user      system     elapsed
#      2.365       0.012       2.376 

system.time(  #Using readr::read_delim from @Anoushiravan R
  readr::read_delim(x, delim = "/", quote = "", trim_ws = TRUE, col_names = FALSE))
#       user      system     elapsed
#      1.961       0.024       2.222 

system.time(  #Using data.table::tstrsplit with " */ *"
  do.call(cbind, data.table::tstrsplit(x, " */ *", perl=TRUE)))
#       user      system     elapsed
#      1.394       0.000       1.394 

system.time(  #Using read.table from @akrun
  read.table(text = x, sep = "/", header = FALSE, fill = TRUE, strip.white = TRUE, na.strings = ""))
#       user      system     elapsed
#      1.298       0.004       1.302 

system.time(  #Using data.table::fread from @akrun
  data.table::fread(text = paste(x, collapse="\n"), sep="/", fill = TRUE, na.strings = ""))
#       user      system     elapsed
#      1.146       0.016       0.996 

system.time(  #Using read.table with additional arguments
  read.table(text = x, sep = "/", header = FALSE, fill = TRUE, strip.white = TRUE, na.strings = "", nrows=length(x), comment.char = "", colClasses = c("character")))
#       user      system     elapsed
#      1.076       0.000       1.076 

system.time(  #Using data.table::fread with stringr::str_c (or stringi::stri_c)
  data.table::fread(text = stringr::str_c(x, collapse="\n"), sep="/", fill = TRUE, na.strings = ""))
#       user      system     elapsed
#      0.780       0.000       0.624

Using data.table::fread with the input string built by stringr::str_c appears to be the fastest of the methods tested here.
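For readers who prefer to stay in base R, the " */ *" pattern from the benchmark can also be passed to strsplit, which folds the trimming into the split itself. The sketch below is an illustration rather than part of the benchmark; `by_regex`, `by_fixed` and `same_result` are made-up names, and it assumes stray spaces occur only around the slashes (leading or trailing spaces of the whole string would not be removed by the pattern):

```r
x <- c("Africa / Kenya / Nairobi", "Africa/Kenya/Nairobi", "Africa /Kenya")

# Split on a slash together with any spaces around it
by_regex <- do.call(rbind, lapply(strsplit(x, " */ *", perl = TRUE), "[", 1:3))

# Reference: fixed split followed by trimws, as in the first example
by_fixed <- trimws(
  do.call(rbind, lapply(strsplit(x, "/", fixed = TRUE), "[", 1:3)),
  whitespace = " "
)

# Compare the two results before comparing their timings
same_result <- identical(by_regex, by_fixed)
```

Checking that two methods return the same result on a small sample is a cheap safeguard before trusting a timing comparison between them.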

Many thanks - that was exactly what I was looking for! – Tom Wenseleers Jul 05 '21 at 10:15

akrun · Accepted Answer · 2021-07-05T19:08:35.323

Consider using read.table from base R

read.table(text = x, sep = "/", header = FALSE,
      fill = TRUE, strip.white = TRUE, na.strings = "")
      V1    V2      V3
1 Africa Kenya Nairobi
2 Africa Kenya Nairobi
3 Africa Kenya    <NA>
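One caveat worth checking on real data: read.table treats quote characters and "#" specially by default, so place names containing an apostrophe or a hash may be misread. The sketch below is a hedged variant, not part of the original answer; it assumes such names can occur, switches both features off, and names the columns after the question's three vectors (`parsed` is an illustrative name):

```r
x <- c("Africa / Kenya / Nairobi", "Africa/Cote d'Ivoire/Abidjan", "Africa / Kenya")

parsed <- read.table(
  text = x, sep = "/", header = FALSE, fill = TRUE,
  strip.white = TRUE, na.strings = "",
  quote = "",          # treat ' and " as ordinary characters
  comment.char = "",   # treat # as an ordinary character
  col.names = c("continent", "country", "city"),
  colClasses = "character"  # skip type guessing, keep everything as text
)

continent <- parsed$continent
country   <- parsed$country
city      <- parsed$city  # NA where the city was missing
```

Setting colClasses also avoids type conversion, which is one of the extra arguments used in the faster read.table variant of the other answer's benchmark.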

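Or use fread from data.table. In the output below, fread's automatic header detection has taken the first element as column names, leaving two data rows; passing header = FALSE should keep all three.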

library(data.table)
fread(text = paste(x, collapse="\n"), sep="/", fill = TRUE, na.strings = "")
   Africa Kenya Nairobi
1: Africa Kenya Nairobi
2: Africa Kenya    <NA>

Benchmarks

> x <- rep("Africa / Kenya / Nairobi", 1000000)
> system.time(fread(text = paste(x, collapse="\n"), sep="/", fill = TRUE, na.strings = ""))
   user  system elapsed 
  0.473   0.024   0.496 

> system.time(read.table(text = x, sep = "/", header = FALSE,
+       fill = TRUE, strip.white = TRUE, na.strings = ""))
   user  system elapsed 
  0.519   0.026   0.543 

> system.time({  #Using data.table
+   y <- do.call(cbind, data.table::tstrsplit(x, "/", TRUE))
+   y <- trimws(y, whitespace = " ")
+ })
   user  system elapsed 
  2.035   0.051   2.067

data

x <- c("Africa / Kenya / Nairobi", "Africa/Kenya/Nairobi", "Africa / Kenya")

edited Jul 05 '21 at 19:08

answered Jul 05 '21 at 18:10

That's also elegant... Is this slower or faster than the data.table solution above? I guess slower, right? – Tom Wenseleers Jul 05 '21 at 19:06

@TomWenseleers Please check the benchmarks. Both options seem to be faster than the fastest one shown in the other post. – akrun Jul 05 '21 at 19:12

`read.table` always works its magic for this kind of question, cheers! – ThomasIsCoding Jul 05 '21 at 21:22

Ha, many thanks! I just hadn't thought about this option, even though I use read.table all the time. :-) Given that it's the fastest option, I've marked this as the accepted answer, even though I also found the other one very instructive... – Tom Wenseleers Jul 06 '21 at 03:13

Anoushiravan R · Answer 3 · 2021-07-05T19:22:00.583

I think this can also be used:

library(readr)

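# x is the character vector from the question. With readr >= 2.0, literal data
# has to be wrapped in I(), i.e. read_delim(I(x), ...).
xx <- readr::read_delim(x, delim = "/", quote = "", trim_ws = TRUE, col_names = FALSE)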

# A tibble: 3 x 3
  X1     X2    X3     
  <chr>  <chr> <chr>  
1 Africa Kenya Nairobi
2 Africa Kenya Nairobi
3 Africa Kenya NA

edited Jul 05 '21 at 19:22

answered Jul 05 '21 at 19:12

Thank you very much dear Arun, I didn't know it. – Anoushiravan R Jul 05 '21 at 19:19

Also, the `col_names = FALSE` returns one more row – akrun Jul 05 '21 at 19:19

Oh otherwise it will use the first row as colnames, interesting. – Anoushiravan R Jul 05 '21 at 19:21
