How to apply officer::read_docx to a whole folder

Question

I am trying to scan many documents in order to reorganize their text into a standard format. This involves either extracting the tables with docxtractr and the body text separately with textreadr, or using officer::docx_summary to label the body and table text for easier manipulation. For this problem, I'm using officer::read_docx and officer::docx_summary. My test documents are .docx files that contain nonsense text before and after a table holding text and numbers.

My code is:

dir <- "C:/path/to/documents"
filenames <- list.files(path = dir, pattern = "*.docx", full.names = TRUE)
docxtest <- officer::docx_summary(lapply(filenames, officer::read_docx))

Ideally it would produce a list of dataframes that contain the docx_summary information. I tried to use lapply on a list of filenames, but the output list gives an error when trying to view:

Error in names[[i]]: subscript out of bounds.

Have you tried the function on a single filename path first? Please provide some example input data, if you'd like someone to attempt to reproduce and fix your problem. — David Foster, Feb 22 '18 at 17:26

score 1 · Accepted Answer · answered Feb 22 '18 at 17:33

officer::docx_summary expects a single object returned by officer::read_docx; it does not accept a list of them. Call both functions on each file inside lapply instead:

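# list.files() takes a regular expression, so match the .docx extension explicitly
filenames <- list.files(path = dir, pattern = "\\.docx$", full.names = TRUE)
# read and summarise one file at a time; the result is a list of data frames
docxtest <- lapply(filenames, function(x) officer::docx_summary(officer::read_docx(x)))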
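
As a rough follow-up sketch (not part of the original answer), the list returned by the lapply call can be named by file and then split into body text and table text using the content_type column of each summary. The throwaway documents, the doc_dir folder and the object names summaries, body_text and table_text are made up for illustration so the example runs without the real folder.

```r
library(officer)

# Build two small throwaway .docx files so the example runs anywhere
# (stand-ins for the real folder; names and contents are made up).
doc_dir <- file.path(tempdir(), "docx_demo")
dir.create(doc_dir, showWarnings = FALSE)
for (i in 1:2) {
  doc <- read_docx()
  doc <- body_add_par(doc, paste("Intro text", i))
  doc <- body_add_table(doc, data.frame(item = c("a", "b"), value = c(1, 2)))
  doc <- body_add_par(doc, paste("Closing text", i))
  print(doc, target = file.path(doc_dir, paste0("demo", i, ".docx")))
}

filenames <- list.files(doc_dir, pattern = "\\.docx$", full.names = TRUE)

# One summary per file, named by file so each data frame can be traced back
summaries <- lapply(filenames, function(x) docx_summary(read_docx(x)))
names(summaries) <- basename(filenames)

# Split each summary into body paragraphs and table cells
body_text  <- lapply(summaries, function(s) s[s$content_type == "paragraph", ])
table_text <- lapply(summaries, function(s) s[s$content_type == "table cell", ])
```

Naming the list elements keeps track of which summary came from which file, which tends to matter once the pieces are rearranged into the standard format.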

David Gohel

Ah, I see. Thank you so much for correcting my syntax. This works. – Anonymous coward Feb 22 '18 at 17:38
