I need to extract all subsections (for further text analysis) and their titles from an .Rmd file (e.g. from 01-tidy-text.Rmd of the tidy-text-mining book:
https://raw.githubusercontent.com/dgrtwo/tidy-text-mining/master/01-tidy-text.Rmd).
All I know is that a section starts at a ## sign and runs until either the next # or ## sign or the end of the file.
The entire text has already been extracted (using dt <- readtext("01-tidy-text.Rmd"); strEntireText <- dt[1,1]) and is stored in the variable strEntireText.
I would like to use stringr or stringi for this, something along the lines of:
strAllSections <- str_extract(strEntireText , pattern="...")
strAllSectionsTitles <- str_extract(strEntireText , pattern="...")
Please suggest a solution. Thank you.
The final objective of this exercise is to automatically create a data.frame from an .Rmd file, where each row corresponds to one section (or subsection), with columns containing: section title, section label, the section text itself, and some other section-specific details, which will be extracted later.
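To make the target concrete, here is a minimal sketch of the kind of data.frame described above, not a definitive solution. It assumes headings are ATX-style lines (# or ## followed by a space), that labels use the {#label} form, and that code chunks are delimited by unindented ``` fences. The function name split_rmd_sections is hypothetical. It works line by line rather than with a single pattern, partly because str_extract() returns only the first match (str_extract_all() returns every match) and partly so that R comments inside code chunks are not mistaken for headings.

```r
library(stringr)

# Hypothetical helper (not part of stringr): one row per # / ## section.
split_rmd_sections <- function(strEntireText) {
  lines <- str_split(strEntireText, "\r?\n")[[1]]

  # Assumption: chunks are fenced by unindented ``` lines. Lines inside a
  # chunk are flagged so that comments like "# note" are not seen as headings.
  in_chunk <- cumsum(str_detect(lines, "^```")) %% 2 == 1

  # A heading is one or two # characters followed by whitespace.
  starts <- which(!in_chunk & str_detect(lines, "^#{1,2}\\s"))
  if (length(starts) == 0) {
    return(data.frame(level = integer(0), title = character(0),
                      label = character(0), text = character(0),
                      stringsAsFactors = FALSE))
  }

  # Each section ends just before the next heading, or at the end of the file.
  ends <- c(starts[-1] - 1, length(lines))
  heading <- lines[starts]

  # Strip the leading # marks, then a trailing {#label} if present.
  title <- str_remove(heading, "^#+\\s*")
  title <- str_trim(str_remove(title, "\\s*\\{#[^\\}]*\\}\\s*$"))

  data.frame(
    level = str_length(str_extract(heading, "^#+")),
    title = title,
    label = str_match(heading, "\\{#([^\\}]+)\\}")[, 2],  # NA when no label
    text  = vapply(seq_along(starts), function(i) {
      # Body lines only (heading excluded); empty string if there are none.
      body <- lines[seq_len(ends[i] - starts[i]) + starts[i]]
      paste(body, collapse = "\n")
    }, character(1)),
    stringsAsFactors = FALSE
  )
}
```

With the variable from above, this would be called as dfSections <- split_rmd_sections(strEntireText). Anything before the first heading (such as a YAML header) is ignored, and further section-specific columns can be added to the result later.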