Extract text from content control cell in Word

Asked May 09 '19 at 14:49

Active May 09 '19 at 14:49

Viewed 170 times

I need to extract text from Word documents in which some parts are Rich Text Content Controls (RTCC). I am using the officer package, but I cannot extract the text inside the content controls. Any ideas on how to do this?

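library(officer)
library(magrittr)  # provides the %>% pipe used below

# Read the document and summarise its content as a data.frame
trtDoc <- read_docx("theFile.docx") %>%
          docx_summary()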

The code above returns a data.frame with the document text, but the text inside the RTCCs is missing.

asked May 09 '19 at 14:49

Bruno Guarita


This is not implemented within officer. You can unzip the file and extract the raw XML (in `word/document.xml`) with package xml2. – David Gohel May 09 '19 at 15:45
Thanks! Do I need to convert the document from docx into xml first? – Bruno Guarita May 09 '19 at 16:22

Yes, use `officer::unpack_folder`; you will get the directory you are looking for. – David Gohel May 09 '19 at 16:37
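
The approach suggested in the comments could be sketched roughly as follows. This is only a minimal sketch, not something officer provides: the helper names `sdt_text` and `read_rtcc_text` are hypothetical, and it assumes the content controls are stored as `w:sdt` elements with their content under `w:sdtContent` in `word/document.xml`. Headers, footers and footnotes live in other XML files and are not covered here.

```r
library(officer)  # unpack_folder()
library(xml2)     # read_xml(), xml_find_all(), xml_text()

# Hypothetical helper: returns one string per content control (w:sdt)
# found in an already parsed document.xml.
sdt_text <- function(doc_xml) {
  # The "w" prefix is resolved from the namespaces declared in the document.
  controls <- xml_find_all(doc_xml, "//w:sdt/w:sdtContent")
  vapply(controls, function(node) {
    # Concatenate all text runs (w:t) inside this control.
    paste(xml_text(xml_find_all(node, ".//w:t")), collapse = "")
  }, character(1))
}

# Hypothetical helper: unpacks the .docx into a temporary directory
# and extracts the content-control text from word/document.xml.
read_rtcc_text <- function(docx_path) {
  unpacked <- tempfile("docx_")
  unpack_folder(file = docx_path, folder = unpacked)
  sdt_text(read_xml(file.path(unpacked, "word", "document.xml")))
}
```

With the file from the question, `read_rtcc_text("theFile.docx")` should return a character vector with one element per content control. Text from separate paragraphs inside a control is joined without a separator, and a nested control is reported both on its own and as part of its parent.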


0 Answers