I'm still a bit new to data.table and don't yet understand all its subtleties. I've looked in the docs and in other examples on SO but couldn't find what I want, so please help!
I have a data.table which is basically a character vector (each entry being a sentence):
library(data.table)
DT <- c("I love you", "she loves me")
DT <- as.data.table(DT)
colnames(DT) <- "text"
setkey(DT, text)
# > DT
# text
# 1: I love you
# 2: she loves me
What I'd like to do is perform some basic string operations inside the DT object. For example, add a new column where each entry is a character vector whose elements are the WORDS of the string in the "text" column.
So, for example, I'd like a new column charvec where
> DT[1]$charvec
[1] "I" "love "you"
Of course, I would like to do it the data.table way, ultra-fast, because I need to do this kind of thing on files that are >1 GB each, using more complex and computation-heavy functions. So no apply, lapply, or mapply if possible.
My closest attempt so far is as follows:
myfun1 <- function(sentence) { strsplit(sentence, " ") }
DU1 <- DT[, myfun1(text), by = text]               # long format: one row per word, column V1
DU2 <- DU1[, list(charvec = list(V1)), by = text]  # regroup the words into a list column
# > DU2
# text charvec
# 1: I love you I,love,you
# 2: she loves me she,loves,me
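While writing this up, I noticed that strsplit already returns a list, so a one-step variant seems possible. This is just a sketch and I'm not sure it's the intended data.table idiom:
DU2_onestep <- DT[, .(text, charvec = strsplit(text, " ", fixed = TRUE))]
# or, adding the column by reference on a copy:
DT2 <- copy(DT)
DT2[, charvec := strsplit(text, " ", fixed = TRUE)]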
For example, to remove the first word of each sentence, I did this:
myfun2 <- function(l) { l[[1]][-1] }
DV1 <- DU2[, myfun2(charvec), by = text]
DV2 <- DV1[, list(charvec = list(V1)), by = text]
# > DV2
# text charvec
# 1: I love you love,you
# 2: she loves me loves,me
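There again seems to be a one-step version, though it brings lapply back, which I'm trying to avoid (sketch only):
DV2_onestep <- DU2[, .(text, charvec = lapply(charvec, "[", -1))]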
The trouble is that in the charvec column I've got a list, not a vector...
> str(DU2[1]$charvec)
# List of 1
# $ : chr [1:3] "I" "love" "you"
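I can unwrap it manually with [[ or unlist, but that feels like fighting the structure:
DU2$charvec[[1]]
# [1] "I"    "love" "you"
unlist(DU2[1]$charvec)
# [1] "I"    "love" "you"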
1) How can I do what I want? Other kinds of functions I'm thinking of using include subsetting the char vector, applying some hash to it, etc.
2) BTW, can I get to DU2 or DV2 in one line instead of two? (The one-step sketches above are my guesses, but I'm not sure they're idiomatic.)
3) I don't understand the data.table syntax well yet. Why is it that with list() inside the [...], the V1 column vanishes? (A small snippet illustrating what I mean follows this list.)
4) On another thread, I read a bit about the cSplit function. Is it any good? Is it adapted to data.table objects?
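Here is the snippet I mean in question 3 (same toy DT and myfun1 as above):
DT[, myfun1(text), by = text]                    # result column is auto-named V1
DT[, list(charvec = myfun1(text)), by = text]    # now it's named charvec; V1 is gone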
Thanks very much!
UPDATE
Thanks to @Ananda Mahto. Perhaps I should make my ultimate objective clearer. I have a huge file of 10,000,000 sentences stored as strings. As a first step for the project, I want to hash the first 5 words of each sentence. 10,000,000 sentences wouldn't even fit in my memory, so I first split them into 10 files of 1,000,000 sentences each, around 1 GB per file. The following code takes several minutes on my laptop for a single file.
library(data.table); library(digest)
num_row <- 1000000
DT <- fread("sentences.txt", nrows = num_row, header = FALSE, sep = "\t",
            colClasses = "character")  # fread already returns a data.table, so as.data.table() is not needed
colnames(DT) <- "text"
setkey(DT, text)
rawdata <- DT
hash2 <- function(word) {  # using library(digest)
  as.numeric(paste("0x", digest(word, algo = "murmur32"), sep = ""))
}
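A quick sanity check of hash2 (I'm not showing an exact value, since it depends on the digest output):
hash2("I love you")
# a single numeric in [0, 2^32), decoded from the 8-hex-character murmur32 digest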
Then:
print(system.time({
  colnames(rawdata) <- "sentence"
  rawdata <- lapply(rawdata, strsplit, " ")  # note: this turns the data.table into a plain list
  sentences_begin <- lapply(rawdata$sentence, function(x) { x[2:6] })
  hash_list <- sapply(sentences_begin, hash2)
  # remove(rawdata)
}))  # end of print(system.time()) for loading the data
I know I'm pushing R to its limits here, but I'm struggling to find faster implementations, and I was thinking about data.table features... hence all my questions.
Here is an implementation excluding lapply, but it's actually slower!
print(system.time({
  myfun1 <- function(sentence) { strsplit(sentence, " ") }
  DU1 <- DT[, myfun1(text), by = text]
  DU2 <- DU1[, list(charvec = list(V1)), by = text]
  myfun2 <- function(l) { l[[1]][2:6] }
  DV1 <- DU2[, myfun2(charvec), by = text]
  DV2 <- DV1[, list(charvec = list(V1)), by = text]
  rebuildsentence <- function(S) {
    paste(S, collapse = " ")
  }
  myfun3 <- function(l) { hash2(rebuildsentence(l[[1]])) }
  DW1 <- DV2[, myfun3(charvec), by = text]
}))  # end of system.time
In this implementation there is no lapply, so I hoped the hashing would be faster. However, because every column holds a list instead of a char vector, this may slow the whole thing down significantly (?).
Using the first code above (with lapply/sapply) took more than an hour on my laptop. I hoped to speed that up with a more efficient data structure. People using Python, Java, etc. do a similar job in a few seconds.
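For what it's worth, the flattest variant I could come up with is below. It's only a sketch: it still loops in R via vapply, and it assumes hash2 from above and words 2:6, as in my code.
words  <- strsplit(DT$text, " ", fixed = TRUE)  # split once, no grouping needed
first5 <- vapply(words, function(w) paste(w[2:6], collapse = " "), character(1))
hashes <- vapply(first5, hash2, numeric(1), USE.NAMES = FALSE)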
Of course, another road would be to find a faster hash function, but I assumed the one in the digest package was already optimized.