I'm still a bit new to data.table and don't yet understand all its subtleties. I've looked in the docs and in other examples on SO but couldn't find what I want, so please help!
I have a data.table which is basically a character vector (each entry being a sentence):
library(data.table)
DT <- c("I love you", "she loves me")
DT <- as.data.table(DT)
colnames(DT) <- "text"
setkey(DT, text)
# > DT
# text
# 1: I love you
# 2: she loves me
What I'd like to do is perform some basic string operations inside the DT object. For example, add a new column where each entry is a character vector whose elements are the WORDS of the string in the "text" column.
So, for example, I'd like a new column charvec where
> DT[1]$charvec
[1] "I" "love "you"
Of course, I would like to do it the data.table way, ultra-fast, because I need to do this kind of thing on files that are >1 GB each, using more complex and computation-heavy functions. So no apply, lapply, or mapply if possible.
My closest attempt so far is as follows:
myfun1 <- function(sentence) { strsplit(sentence, " ") }
DU1 <- DT[, myfun1(text), by = text]               # long format: one row per word, column V1
DU2 <- DU1[, list(charvec = list(V1)), by = text]  # regroup the words into a list column
# > DU2
# text charvec
# 1: I love you I,love,you
# 2: she loves me she,loves,me
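While writing this up, I noticed that strsplit already returns a list, so a one-step variant seems possible. This is just a sketch and I'm not sure it's the intended data.table idiom:
DU2_onestep <- DT[, .(text, charvec = strsplit(text, " ", fixed = TRUE))]
# or, adding the column by reference on a copy:
DT2 <- copy(DT)
DT2[, charvec := strsplit(text, " ", fixed = TRUE)]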
For example, to remove the first word of each sentence, I did this:
myfun2 <- function(l) { l[[1]][-1] }
DV1 <- DU2[, myfun2(charvec), by = text]
DV2 <- DV1[, list(charvec = list(V1)), by = text]
# > DV2
# text charvec
# 1: I love you love,you
# 2: she loves me loves,me
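There again seems to be a one-step version, though it brings lapply back, which I'm trying to avoid (sketch only):
DV2_onestep <- DU2[, .(text, charvec = lapply(charvec, "[", -1))]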
The trouble is that in the charvec column I've got a list, not a vector...
> str(DU2[1]$charvec)
# List of 1
# $ : chr [1:3] "I" "love" "you"
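I can unwrap it manually with [[ or unlist, but that feels like fighting the structure:
DU2$charvec[[1]]
# [1] "I"    "love" "you"
unlist(DU2[1]$charvec)
# [1] "I"    "love" "you"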
1) How can I do what I want? Other kinds of functions I'm thinking of using include subsetting the char vector, applying some hash to it, etc.
2) BTW, can I get to DU2 or DV2 in one line instead of two? (The one-step sketches above are my guesses, but I'm not sure they're idiomatic.)
3) I don't understand the data.table syntax well yet. Why is it that with list() inside the [...], the V1 column vanishes? (A small snippet illustrating what I mean follows this list.)
4) On another thread, I read a bit about the cSplit function. Is it any good? Is it adapted to data.table objects?
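Here is the snippet I mean in question 3 (same toy DT and myfun1 as above):
DT[, myfun1(text), by = text]                    # result column is auto-named V1
DT[, list(charvec = myfun1(text)), by = text]    # now it's named charvec; V1 is gone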
Thanks very much!
UPDATE
Thanks to @Ananda Mahto. Perhaps I should make my ultimate objective clearer. I have a huge file of 10,000,000 sentences stored as strings. As a first step for the project, I want to hash the first 5 words of each sentence. 10,000,000 sentences wouldn't even fit in my memory, so I first split them into 10 files of 1,000,000 sentences each, around 1 GB per file. The following code takes several minutes on my laptop for a single file.
library(data.table); library(digest)
num_row <- 1000000
DT <- fread("sentences.txt", nrows = num_row, header = FALSE, sep = "\t",
            colClasses = "character")  # fread already returns a data.table, so as.data.table() is not needed
colnames(DT) <- "text"
setkey(DT, text)
rawdata <- DT
hash2 <- function(word) {  # using library(digest)
  as.numeric(paste("0x", digest(word, algo = "murmur32"), sep = ""))
}
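A quick sanity check of hash2 (I'm not showing an exact value, since it depends on the digest output):
hash2("I love you")
# a single numeric in [0, 2^32), decoded from the 8-hex-character murmur32 digest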
Then:
print(system.time({
  colnames(rawdata) <- "sentence"
  rawdata <- lapply(rawdata, strsplit, " ")  # note: this turns the data.table into a plain list
  sentences_begin <- lapply(rawdata$sentence, function(x) { x[2:6] })
  hash_list <- sapply(sentences_begin, hash2)
  # remove(rawdata)
}))  # end of print(system.time()) for loading the data
I know I'm pushing R to its limits here, but I'm struggling to find faster implementations, and I was thinking about data.table features... hence all my questions.
Here is an implementation excluding lapply, but it's actually slower!
print(system.time({
  myfun1 <- function(sentence) { strsplit(sentence, " ") }
  DU1 <- DT[, myfun1(text), by = text]
  DU2 <- DU1[, list(charvec = list(V1)), by = text]
  myfun2 <- function(l) { l[[1]][2:6] }
  DV1 <- DU2[, myfun2(charvec), by = text]
  DV2 <- DV1[, list(charvec = list(V1)), by = text]
  rebuildsentence <- function(S) {
    paste(S, collapse = " ")
  }
  myfun3 <- function(l) { hash2(rebuildsentence(l[[1]])) }
  DW1 <- DV2[, myfun3(charvec), by = text]
}))  # end of system.time
In this implementation there is no lapply, so I hoped the hashing would be faster. However, because every column holds a list instead of a char vector, this may slow the whole thing down significantly (?).
Using the first code above (with lapply/sapply) took more than an hour on my laptop. I hoped to speed that up with a more efficient data structure. People using Python, Java, etc. do a similar job in a few seconds.
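For what it's worth, the flattest variant I could come up with is below. It's only a sketch: it still loops in R via vapply, and it assumes hash2 from above and words 2:6, as in my code.
words  <- strsplit(DT$text, " ", fixed = TRUE)  # split once, no grouping needed
first5 <- vapply(words, function(w) paste(w[2:6], collapse = " "), character(1))
hashes <- vapply(first5, hash2, numeric(1), USE.NAMES = FALSE)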
Of course, another road would be to find a faster hash function, but I assumed the one in the digest package was already optimized.