I am trying to process many documents in order to reorganize their text into a standard format. This means either extracting the table with docxtractr and the body text separately with textreadr, or using officer::docx_summary to label the body and table text for easier manipulation. For this problem, I'm using officer::read_docx and officer::docx_summary. My test documents are .docx files that contain nonsense text before and after a table holding text and numbers.

My code is:

dir <- "C:/path/to/documents"
filenames <- list.files(path = dir, pattern = "*.docx", full.names = TRUE)
docxtest <- officer::docx_summary(lapply(filenames, officer::read_docx))

Ideally this would produce a list of data frames, each containing the docx_summary output for one document. I tried to use lapply on the list of filenames, but the resulting list gives an error when I try to view it:

Error in names[[i]]: subscript out of bounds.
Anonymous coward
  • Have you tried the function on a single filename path first? Please provide some example input data, if you'd like someone to attempt to reproduce and fix your problem. – David Foster Feb 22 '18 at 17:26

1 Answer

officer::docx_summary expects a single object returned by officer::read_docx; it does not accept a list of them. Call both functions inside lapply, once per file:

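# list.files() takes a regular expression, so anchor the extension explicitly
filenames <- list.files(path = dir, pattern = "\\.docx$", full.names = TRUE)
# read and summarise one file at a time; the result is a list of data frames
docxtest <- lapply(filenames, function(x) officer::docx_summary(officer::read_docx(x)))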
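
As a minimal sketch of how the resulting list might then be used, the example below builds two throwaway .docx files so that it can run on its own, summarises each one, names the list entries by file, and splits one summary into body text and table text. The temporary directory, the demo file names, and the variable names summaries, body_text and table_text are illustrative assumptions, not part of the question.

```r
library(officer)

# Build two small throwaway .docx files so the sketch is self-contained.
# The directory and file names are hypothetical, chosen for illustration.
tmp_dir <- file.path(tempdir(), "docx_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:2) {
  doc <- read_docx()  # starts from officer's default template
  doc <- body_add_par(doc, paste("Text before table", i))
  doc <- body_add_table(doc, data.frame(id = 1:2, label = c("a", "b")))
  doc <- body_add_par(doc, paste("Text after table", i))
  print(doc, target = file.path(tmp_dir, paste0("demo", i, ".docx")))
}

# list.files() expects a regular expression rather than a glob
filenames <- list.files(tmp_dir, pattern = "\\.docx$", full.names = TRUE)

# One docx_summary() data frame per file, named by file for easy lookup
summaries <- lapply(filenames, function(x) docx_summary(read_docx(x)))
names(summaries) <- basename(filenames)

# Separate body paragraphs from table cells using the content_type column
one <- summaries[["demo1.docx"]]
body_text  <- one[one$content_type == "paragraph", ]
table_text <- one[one$content_type == "table cell", ]
```

Naming the list entries keeps each data frame traceable to its source document, which may help when the documents are later reorganized into a common format.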
David Gohel