4

I have a 67MM row data.table with people names and surname separated by spaces. I just need to create a new column for each word.

Here is an small subset of the data:

n <- structure(list(Subscription_Id = c("13.855.231.846.091.000", 
"11.156.048.529.090.800", "24.940.584.090.830", "242.753.039.111.124", 
"27.843.782.090.830", "13.773.513.145.090.800", "25.691.374.090.830", 
"12.236.174.155.090.900", "252.027.904.121.210", "11.136.991.054.110.100"
), Account_Desc = c("AGUAYO CARLA", "LEIVA LILIANA", "FULLANA MARIA LAURA", 
"PETREL SERGIO", "IPTICKET SRL", "LEDESMA ORLANDO", "CATTANEO LUIS RAUL", 
"CABRAL CARMEN ESTELA", "ITURGOYEN HECTOR", "CASA CASILDO"), 
    V1 = c("AGUAYO", "LEIVA", "FULLANA", "PETREL", "IPTICKET", 
    "LEDESMA", "CATTANEO", "CABRAL", "ITURGOYEN", "CASA"), V2 = c("CARLA", 
    "LILIANA", "MARIA", "SERGIO", "SRL", "ORLANDO", "LUIS", "CARMEN", 
    "HECTOR", "CASILDO"), V3 = c(NA, NA, "LAURA", NA, NA, NA, 
    "RAUL", "ESTELA", NA, NA), `NA` = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    )), .Names = c("Subscription_Id", "Account_Desc", "V1", "V2", 
"V3", NA), class = c("data.table", "data.frame"), row.names = c(NA, 
-10L), .internal.selfref = <pointer: 0x0000000000200788>)


require("data.table")
n <- data.table(n)

Expected Output

#           Subscription_Id         Account_Desc        V1      V2     V3 NA
# 1: 13.855.231.846.091.000         AGUAYO CARLA    AGUAYO   CARLA     NA NA
# 2: 11.156.048.529.090.800        LEIVA LILIANA     LEIVA LILIANA     NA NA
# 3:     24.940.584.090.830  FULLANA MARIA LAURA   FULLANA   MARIA  LAURA NA

1st Attempt

How to make this work would be the first question

library(stringr)
# This separates the strings, but i loose the Subscription_Id variable.
n[, str_split_fixed(Account_Desc, "[ +]", 4)]

# This doesn't work.
n[, paste0("V",1:4) := str_split_fixed(Account_Desc, "[ +]", 4)]

2nd Attempt

This works, but i seem to be doing the calculation 3 times. Not sure if its the most effiecient way

cols = paste0("V",1:3)
for(j in 1:3){
  set(n,i=NULL,j=cols[j],value = sapply(strsplit(as.character(n$Account_Desc),"[ +]"), "[", j))
}

Let's use big_n to benchmarck

big_n <- data.table(Subscription_Id = rep(n[,Subscription_Id],1e7),
                    Account_Desc = rep(n[,Account_Desc],1e7)
                    )
marbel
  • 7,560
  • 6
  • 49
  • 68
  • 1
    Wait until `sep2` is implemented in `fread`? ;) – Roland Feb 12 '14 at 16:22
  • @Roland, I see your wink, but if I remember correctly, `fread` is not going to get a `fill` type argument (as is present in `read.table`) any time soon, which would mean it would have a hard time with "unbalanced" strings like these. – A5C1D2H2I1M1N2O1R2T1 Feb 12 '14 at 16:36
  • @AnandaMahto Ah, I didn't see that there are people with multiple given names. I usually forget that since I'm one of the poor guys who's got only one. – Roland Feb 12 '14 at 16:40
  • 1
    `read.csv` might take years – Baumann Feb 12 '14 at 16:49
  • 1
    there are 2 FR's (already out there) that could really help you out here - one is having `rbindlist` work with lists of vectors, and the other one is having a `fill` argument for it; until those are implemented I think your 2nd solution (with possibly some minor adjustments) is the way to go – eddi Feb 12 '14 at 17:11
  • 1
    ... and somehow, nobody has pointed out your timing is wrong. You're splitting just 10 rows, not 1e7 :-) – A5C1D2H2I1M1N2O1R2T1 Feb 12 '14 at 18:02
  • @AnandaMahto I rep the ten rows to 10e7 in big_n, (in total there are 100 MM rows in big_n). Not sure if this is equivalent, but it works for reproducibility. – marbel Feb 12 '14 at 18:07
  • @MartínBel, look at what you have in your `set_method` function. You have hard-coded `n$Account_Desc` in your `strsplit` line. – A5C1D2H2I1M1N2O1R2T1 Feb 12 '14 at 18:08
  • 2
    Following @Arun's recommendation about moving `strsplit` out of the loop makes a big difference, as does using `strsplit(input, " ", fixed = TRUE)`. Incorporating both those suggestions reduced the time I was getting on 1e5 rows from ~ 25 seconds to ~ 9 seconds. – A5C1D2H2I1M1N2O1R2T1 Feb 12 '14 at 18:50

2 Answers2

7

I don't work with datasets anywhere near this scale, so I have no idea if this is going to be of use or not. One thing that comes to mind is to use a matrix and matrix indexing.

Since I'm impatient, I've only tried it on 1e5 rows on my slow system :-)

Create your sample data

big_n <- data.table(Subscription_Id = rep(n[,Subscription_Id],1e5),
                    Account_Desc = rep(n[,Account_Desc],1e5))

Write a function to create your matrix

StringMat <- function(input) {
  Temp <- strsplit(input, " ", fixed = TRUE)
  Lens <- vapply(Temp, length, 1L)
  A <- unlist(Temp, use.names = FALSE)
  Rows <- rep(sequence(length(Temp)), Lens)
  Cols <- sequence(Lens)
  m <- matrix(NA, nrow = length(Temp), ncol = max(Lens),
              dimnames = list(NULL, paste0("V", sequence(max(Lens)))))
  m[cbind(Rows, Cols)] <- A
  m
}

Time it and view the output

system.time(outB1 <- cbind(big_n, StringMat(big_n$Account_Desc)))
#    user  system elapsed 
#   4.524   0.000   4.533 
outB1
#                 Subscription_Id         Account_Desc        V1      V2     V3
#       1: 13.855.231.846.091.000         AGUAYO CARLA    AGUAYO   CARLA     NA
#       2: 11.156.048.529.090.800        LEIVA LILIANA     LEIVA LILIANA     NA
#       3:     24.940.584.090.830  FULLANA MARIA LAURA   FULLANA   MARIA  LAURA
#       4:    242.753.039.111.124        PETREL SERGIO    PETREL  SERGIO     NA
#       5:     27.843.782.090.830         IPTICKET SRL  IPTICKET     SRL     NA
#      ---                                                                     
#  999996: 13.773.513.145.090.800      LEDESMA ORLANDO   LEDESMA ORLANDO     NA
#  999997:     25.691.374.090.830   CATTANEO LUIS RAUL  CATTANEO    LUIS   RAUL
#  999998: 12.236.174.155.090.900 CABRAL CARMEN ESTELA    CABRAL  CARMEN ESTELA
#  999999:    252.027.904.121.210     ITURGOYEN HECTOR ITURGOYEN  HECTOR     NA
# 1000000: 11.136.991.054.110.100         CASA CASILDO      CASA CASILDO     NA

Correct the set_method function and compare timings

set_method <- function(DT){
  cols = paste0("V",1:3)
  for(j in 1:3){
    set(DT,i=NULL,j=cols[j],
        value = sapply(strsplit(as.character(DT[, Account_Desc, with = TRUE]),
                                "[ +]"), "[", j))
  }
}

system.time(set_method(big_n))
#    user  system elapsed 
#  25.319   0.022  25.586 

Reset the "big_n" dataset and try out str_split_fixed (ouch!)

big_n[, c("V1", "V2", "V3") := NULL]

library(stringr)
system.time(outBrodie <- cbind(big_n, as.data.table(str_split_fixed(
  big_n$Account_Desc, "[ +]", 4))))
#    user  system elapsed 
# 204.966   0.514 206.910 
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • 1
    +1 I like the optimizations. However, you can save some additional time by creating a character matrix `m` instead of a logical matrix. I recommend replacing `NA` (logical) with `NA_character_`. This avoids type conversion when adding the strings to the matrix. – Sven Hohenstein Feb 16 '14 at 15:11
3

EDIT 3: Stealing Arun's blood and sweat:

cbind(n, as.data.table(str_split_fixed(n$Account_Desc, "[ +]", 4)))

This avoids the potentially costly by and produces the same result (plus the original name column).

EDIT2: as per Arun's comment, maybe:

n.2[, c(paste0("V", 1:4)):=as.list(str_split_fixed(Account_Desc, "[ +]", 4)), by=Subscription_Id]

But you still have the by. Old way:

n[, as.list(str_split_fixed(Account_Desc, "[ +]", 4)), by=Subscription_Id]                        

produces:

  #            Subscription_Id        V1      V2     V3 V4
  #  1: 13.855.231.846.091.000    AGUAYO   CARLA          
  #  2: 11.156.048.529.090.800     LEIVA LILIANA          
  #  3:     24.940.584.090.830   FULLANA   MARIA  LAURA   
  #  4:    242.753.039.111.124    PETREL  SERGIO          
  #  5:     27.843.782.090.830  IPTICKET     SRL          
  #  6: 13.773.513.145.090.800   LEDESMA ORLANDO          
  #  7:     25.691.374.090.830  CATTANEO    LUIS   RAUL   
  #  8: 12.236.174.155.090.900    CABRAL  CARMEN ESTELA   
  #  9:    252.027.904.121.210 ITURGOYEN  HECTOR          
  # 10: 11.136.991.054.110.100      CASA CASILDO      

EDIT: word of warning, some stringr functions can be slow (not sure if this one is). If this is still slow for your process, you may want to write your own function using strsplit and something to pad it to the appropriate length.

BrodieG
  • 51,669
  • 9
  • 93
  • 146
  • @Arun I'm happy to steal your answer, but you should probably post it as a faster alternative. In this case it's fairly different as opposed to a tweak. – BrodieG Feb 12 '14 at 16:52
  • I think this is way slower than OP's 2nd implementation – eddi Feb 12 '14 at 16:56
  • @Arun all of the attempts in this answer are much slower – eddi Feb 12 '14 at 16:58
  • than OP's 2nd implementation; and it's because the `stringr` function is super slow – eddi Feb 12 '14 at 17:00
  • but it's not :) regular `strsplit` is orders of magnitude faster; the bottleneck is not splitting - it's combining – eddi Feb 12 '14 at 17:03
  • 1
    OP's 2nd solution does the binding and that's again orders of magnitude faster than whatever `stringr` does – eddi Feb 12 '14 at 17:06
  • as you mentioned previously, Subscription_Id is unique. – marbel Feb 12 '14 at 17:09
  • @AnandaMahto, I actually remember a question where you noted this a few weeks ago, and that's why I put in my first edit at the very bottom before this conversation exploded; either way, even fixing that looks like this is not the best way. – BrodieG Feb 12 '14 at 18:35
  • Most of the stringr functions just wrap the base R regular expression functions. `str_split_fixed()` does a little more than the base functions, so it's a little slower. – hadley Feb 12 '14 at 22:50