R: Faster way of splitting huge dataframe

Question

I'm trying to split a huge dataframe into smaller dataframes on the basis of tags (e.g. ##10000), which occur in the column WORDS at irregular intervals:

# create sample dataframe (sorry, this is clumsy...)
WORDS <- vector()
for(i in 1:100){
    WORDS <- append(WORDS, c("##10000", rep("some_word", sample(30, 1, FALSE))))
}
LENGTH <- length(WORDS)
LEMMAS <- rep("some_word", LENGTH)
POS <- rep("some_word", LENGTH)
TEXTCLASS <- rep("some_word", LENGTH)
DF <- data.frame(WORDS, LEMMAS, POS, TEXTCLASS)


#identify positions of ##10000 tags
POS_OF_TAGS <- grepl("##\\d+", DF$WORDS, perl=TRUE)

#create a split variable 
SPLIT_VARIABLE <- cumsum(POS_OF_TAGS) 

#create a list which contains the individual texts
ALL_TEXTS_SEPARATED <- split(DF, SPLIT_VARIABLE)

When I apply this to my actual dataframe, which has 200 to 300 million rows, the split() function is very slow. Can anyone think of a quicker way of arriving at the ALL_TEXTS_SEPARATED list? Thanks in advance! Thomas

Please don't post code like `rm(list=ls(all=T))` in your questions unless it is absolutely critical to the point of the question. You wouldn't want anyone running that line accidentally. — Gregor Thomas, Feb 23 '17 at 18:39
This may not be much faster: `split(DF, cumsum(grepl("##\\d+", DF$WORDS, perl=TRUE)))` — d.b, Feb 23 '17 at 18:43
I wouldn't use regex if all the tags just start with `##`; an exact search will be much faster. Also, data.table has a new (improved) split method that could be useful. I would go with `library(data.table) ; S <- split(setDT(DF)[, id := cumsum(grepl("##", WORDS, fixed = TRUE))], by = "id")` — David Arenburg, Feb 23 '17 at 18:48
perhaps of interest http://stackoverflow.com/questions/39545400/why-is-split-inefficient-on-large-data-frames-with-many-groups — user20650, Feb 23 '17 at 19:56
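
The comment about exact matching can be checked on sample data shaped like the question's. The sketch below is only an illustration: it assumes every row containing `##` is a tag row (otherwise the fixed search and the `##\\d+` regex could disagree), and the names `by_regex`, `by_fixed` and `by_prefix` are made up for this example. `startsWith()` is an extra base R option (R >= 3.3.0) that nobody in the thread proposed.

```r
# Sample data in the shape used in the question: a tag followed by 1-30 words.
set.seed(1)
WORDS <- unlist(lapply(1:100, function(i) c("##10000", rep("some_word", sample(30, 1)))))

# Three ways of flagging the tag rows.
by_regex  <- grepl("##\\d+", WORDS, perl = TRUE)  # approach from the question
by_fixed  <- grepl("##", WORDS, fixed = TRUE)     # exact substring search
by_prefix <- startsWith(WORDS, "##")              # prefix test (assumption: tags always lead the string)

# On this data all three flag exactly the same rows.
stopifnot(identical(by_regex, by_fixed), identical(by_regex, by_prefix))

# The split variable built from the fixed search is therefore unchanged.
SPLIT_VARIABLE <- cumsum(by_fixed)
```

Whether the fixed search is noticeably faster depends on the data, so it is worth timing on a slice of the real dataframe, for example with `system.time()`.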

Znusgy · Accepted Answer · 2017-02-26T15:54:02.450


David Arenburg's suggestion does the trick. Splitting 30 million lines with the following command takes roughly 8 seconds.

library(data.table) ; S <- split(setDT(DF)[, id := cumsum(grepl("##", WORDS, fixed = TRUE))], by = "id")

This is an enormous improvement - thanks so much!
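
For readers who want to see what the one-liner does, here is the same idea spelled out step by step on sample data shaped like the question's. It is a sketch, not a benchmark: it assumes data.table 1.9.8 or later (where `split()` accepts `by`), and it uses `as.data.table()` on a copy named `DT` (a name chosen for this example) instead of `setDT()`, so that `DF` is not converted in place.

```r
library(data.table)

# Sample data as in the question: each text starts with a tag row.
set.seed(1)
WORDS <- unlist(lapply(1:100, function(i) c("##10000", rep("some_word", sample(30, 1)))))
DF <- data.frame(WORDS, LEMMAS = "some_word", POS = "some_word",
                 TEXTCLASS = "some_word", stringsAsFactors = FALSE)

# Work on a copy; setDT(DF) would instead convert DF itself by reference.
DT <- as.data.table(DF)

# Add the group id by reference: it goes up by one at every tag row.
DT[, id := cumsum(grepl("##", WORDS, fixed = TRUE))]

# The data.table split method returns a list with one data.table per id.
S <- split(DT, by = "id")

length(S)        # number of separate texts
S[[1]]$WORDS[1]  # first row of each element is its tag
```

On a table with hundreds of millions of rows, `setDT()` as in the original command avoids the extra copy that `as.data.table()` makes here.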

edited Feb 26 '17 at 15:54

answered Feb 24 '17 at 17:16

Znusgy
