I have a huge string (> 500 MB) that is actually an entire book collection in one piece. I have some meta information in a separate data frame, e.g. page numbers, (different) authors and titles. I am trying to detect the title strings in the huge string and split it by title. I assume titles are unique.
The data looks like this:
mystring <- "Lorem ipsum dolor sit amet, sollicitudin duis maecenas habitasse ultrices aenean tempus"
# a dataframe of meta data, e.g. page numbers and titles
mydf <- data.frame(page = c(1, 2),
                   title = c("Lorem", "maecenas"))
mydf
page title
1 1 Lorem
2 2 maecenas
mygoal <- mydf # text that comes after the title
mygoal$text <- c("ipsum dolor sit amet, sollicitudin duis", "habitasse ultrices aenean tempus")
mygoal
page title text
1 1 Lorem ipsum dolor sit amet, sollicitudin duis
2 2 maecenas habitasse ultrices aenean tempus
How can I split the string in the most efficient way, so that everything between the first and second title becomes the first text element, everything between the second and third title becomes the second text element, and so on (the text after the last title being the last element)?
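To make the goal concrete, here is a minimal base R sketch of the kind of thing I have in mind. It is not a tuned solution. It assumes that every title occurs in the string, that the rows of mydf are in the same order as the titles appear in the text, and that the first literal occurrence of a title is the real heading (a title word that also shows up earlier in the body text would break this). The helper names titles, starts, text_from and text_to are only for illustration.

```r
mystring <- "Lorem ipsum dolor sit amet, sollicitudin duis maecenas habitasse ultrices aenean tempus"
mydf <- data.frame(page = c(1, 2),
                   title = c("Lorem", "maecenas"))

# as.character() guards against factor columns (the default before R 4.0)
titles <- as.character(mydf$title)

# Position of the first literal (non-regex) match of each title; -1 means "not found"
starts <- vapply(titles,
                 function(t) regexpr(t, mystring, fixed = TRUE)[[1]],
                 integer(1), USE.NAMES = FALSE)
stopifnot(all(starts > 0))  # assumption: every title is present

# Each text runs from just after its title to just before the next title;
# the last text runs to the end of the string
text_from <- starts + nchar(titles)
text_to   <- c(starts[-1] - 1L, nchar(mystring))

mygoal <- mydf
mygoal$text <- trimws(substring(mystring, text_from, text_to))
mygoal
```

This scans the big string once per title, so I am not sure how well it scales to > 500 MB and many titles.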