I'm trying to split a huge dataframe into smaller dataframes on the basis of tags (e.g. ##10000), which occur in the column WORDS at irregular intervals:
# create sample dataframe (sorry, this is clumsy...)
WORDS <- vector()
for (i in 1:100) {
  WORDS <- append(WORDS, c("##10000", rep("some_word", sample(30, 1, FALSE))))
}
LENGTH <- length(WORDS)
LEMMAS <- rep("some_word", LENGTH)
POS <- rep("some_word", LENGTH)
TEXTCLASS <- rep("some_word", LENGTH)
DF <- data.frame(WORDS, LEMMAS, POS, TEXTCLASS)
# flag the rows that contain a ##-tag
POS_OF_TAGS <- grepl("##\\d+", DF$WORDS, perl = TRUE)
# create a split variable: running count of the tag rows
SPLIT_VARIABLE <- cumsum(POS_OF_TAGS)
# create a list which contains the individual texts
ALL_TEXTS_SEPARATED <- split(DF, SPLIT_VARIABLE)
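Just to illustrate what this produces on the sample data, a quick sanity check:

# each list element is a small data frame, one per ##-tag
length(ALL_TEXTS_SEPARATED)      # 100 texts for the sample data above
head(ALL_TEXTS_SEPARATED[[1]])   # first text, starting with its ##10000 row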
When I apply this to my actual dataframe, which has 200 to 300 million rows, the split() function is very slow. Can anyone think of a quicker way of arriving at the ALL_TEXTS_SEPARATED list? Thanks in advance!
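For what it's worth, here is a minimal sketch of one alternative, using the data.table package (assuming that dependency is acceptable); I haven't benchmarked whether split.data.table() is actually faster at this scale:

# minimal sketch, assuming data.table may be used
library(data.table)
DT <- as.data.table(DF)
# same grouping idea as above: running count of the ##-tag rows
DT[, TEXT_ID := cumsum(grepl("##\\d+", WORDS))]
# split by the grouping column via the data.table method
ALL_TEXTS_SEPARATED <- split(DT, by = "TEXT_ID")

Possibly it would be better still to avoid materialising the list at all and process the texts group-wise with DT[, ..., by = TEXT_ID], but I don't know whether that fits the downstream processing.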
Thomas