I'm trying to split a huge dataframe into smaller dataframes on the basis of tags (e.g. ##10000), which occur in the column WORDS at irregular intervals:

# create sample dataframe
WORDS <- unlist(lapply(1:100, function(i) c("##10000", rep("some_word", sample(30, 1)))))
LENGTH <- length(WORDS)
LEMMAS <- rep("some_word", LENGTH)
POS <- rep("some_word", LENGTH)
TEXTCLASS <- rep("some_word", LENGTH)
DF <- data.frame(WORDS, LEMMAS, POS, TEXTCLASS)


# identify positions of the ##10000 tags
POS_OF_TAGS <- grepl("##\\d+", DF$WORDS, perl = TRUE)

# create a split variable
SPLIT_VARIABLE <- cumsum(POS_OF_TAGS)

# create a list containing the individual texts
ALL_TEXTS_SEPARATED <- split(DF, SPLIT_VARIABLE)

When I apply this to my actual dataframe, which has 200 to 300 million rows, the split() function is very slow. Can anyone think of a quicker way of arriving at the ALL_TEXTS_SEPARATED list? Thanks in advance! Thomas

Znusgy
  • Please don't post code like `rm(list=ls(all=T))` in your questions unless it is absolutely critical to the point of the question. You wouldn't want anyone running that line accidentally. – Gregor Thomas Feb 23 '17 at 18:39
  • This may not be much faster: `split(DF, cumsum(grepl("##\\d+", DF$WORDS, perl=TRUE)))` – d.b Feb 23 '17 at 18:43
  • I wouldn't use regex if all the tags just start with `##`; an exact search will be much faster. Also, data.table has a new (improved) split method that could be useful. I would go with `library(data.table) ; S <- split(setDT(DF)[, id := cumsum(grepl("##", WORDS, fixed = TRUE))], by = "id")` – David Arenburg Feb 23 '17 at 18:48
  • perhaps of interest http://stackoverflow.com/questions/39545400/why-is-split-inefficient-on-large-data-frames-with-many-groups – user20650 Feb 23 '17 at 19:56

1 Answer

David Arenburg's suggestion does the trick. Splitting 30 million lines with the following command takes roughly 8 seconds.

library(data.table) ; S <- split(setDT(DF)[, id := cumsum(grepl("##", WORDS, fixed = TRUE))], by = "id")

This is an enormous improvement - thanks so much!
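
For anyone who wants to check the result before running it on a large corpus, here is a minimal sketch that applies both approaches to a small sample in the shape of the question's data and compares the resulting chunks. The fixed seed and the names `base_split`, `DT` and `dt_split` are my own additions for illustration. It uses `as.data.table()` rather than `setDT()` so the original `DF` is left unmodified, and it assumes that `##` only ever occurs in tag rows, since `fixed = TRUE` matches the literal string anywhere in a word.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed, only so the sample is reproducible

# Sample data: each tag row is followed by 1 to 30 ordinary word rows
WORDS <- unlist(lapply(1:100, function(i) c("##10000", rep("some_word", sample(30, 1)))))
DF <- data.frame(WORDS, LEMMAS = "some_word", POS = "some_word", TEXTCLASS = "some_word")

# Base R approach from the question (regex match, split.data.frame)
base_split <- split(DF, cumsum(grepl("##\\d+", DF$WORDS, perl = TRUE)))

# data.table approach: literal match on "##", then split by the running id
DT <- as.data.table(DF)  # copies DF; setDT(DF) would convert it in place instead
DT[, id := cumsum(grepl("##", WORDS, fixed = TRUE))]
dt_split <- split(DT, by = "id")

# Both approaches should give one chunk per tag, with matching row counts
stopifnot(length(dt_split) == length(base_split))
stopifnot(all(sapply(dt_split, nrow) == sapply(base_split, nrow)))
```

Note that each element of `dt_split` is a data.table and still carries the `id` column; passing `keep.by = FALSE` to `split()` should drop it if it is not wanted.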

Znusgy