Reading the body text of an rdocx object with officer

Question

I am trying to read the body of a .docx file with the officer package and I am running into an error:

library(officer)

docx1 <- system.file(package = "officer", "template.docx")
content <- docx_summary(docx1)

Error in x$doc_obj : $ operator is invalid for atomic vectors

docx2 <- read_docx("template.docx")
content <- docx_summary(docx2)

Error in data.frame(level = as.integer(xml_attr(xml_child(node, "w:pPr/w:numPr/w:ilvl"), : arguments imply differing number of rows: 1, 0

length(docx1) 
# 1
length(docx2) 
# 37

When I print docx2, I get some interesting information, including all the styles, and then I get this:

text                                  
1.1                                   
Question 10:                          
1.4                                   
Some text here also                   
1.7                                   
Text for a heading                    
1.10                                  
1.13                                  
10.1                                  
1.16                                  
10.2                                  
1.19                                  
2.2                                   
<NA>                                  
2.5                                   
<NA>                                  
2.8                                   
<NA>                                  
2.11                                  
1 of 2 questions correct-50%

All of the text above is in fact in the body of the document I am trying to read. It is quite scrambled, but it is what I am hoping to extract in the correct order.

`docx_summary()` is expecting an object returned by `read_docx()`, not a string — David Gohel, May 09 '23 at 07:11
I believe I am passing in an object from `read_docx()` to `docx_summary()` with `docx2`. Do you have any input as to why I am getting this error? `> class(docx2)` `[1] "rdocx"` `> docx_summary(docx2)` **Error in data.frame(level = as.integer(xml_attr(xml_child(node, "w:pPr/w:numPr/w:ilvl"), : arguments imply differing number of rows: 1, 0** — Cyrus Tadjiki, May 09 '23 at 17:59
Are you sure docx2 was created from a completely valid MS Word document file? The reason I'm asking this is because docx1 almost certainly wasn't (there doesn't appear to be a file called template.docx associated with the officer package, & I'm on 0.6.2, which is currently the latest version on CRAN). — Z.Lin, May 14 '23 at 11:26
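
To make the point in the comments concrete, here is a minimal sketch of the call order they describe: check that the path is non-empty, build an `rdocx` object with `read_docx()`, then pass that object (not the path string) to `docx_summary()`. It assumes that `read_docx()` with no arguments opens officer's built-in default template and that this template defines the "heading 1" and "Normal" styles. The two paragraph strings are borrowed from the question for illustration only. It does not attempt to explain or reproduce the second error, which may depend on the specific file.

```r
library(officer)

# system.file() returns "" when the requested file is not found, so checking
# the path first avoids handing an empty string to docx_summary().
path <- system.file(package = "officer", "template.docx")
path_found <- nzchar(path)

# Assumption: read_docx() without a path opens officer's default template.
doc <- read_docx()

# Illustrative content only; the style names are assumed to exist in that template.
doc <- body_add_par(doc, "Text for a heading", style = "heading 1")
doc <- body_add_par(doc, "Some text here also", style = "Normal")

# docx_summary() takes the rdocx object and returns a data frame.
content <- docx_summary(doc)

# Keep only paragraphs and sort them by their position in the document.
paragraphs <- content[content$content_type == "paragraph", ]
paragraphs <- paragraphs[order(paragraphs$doc_index), ]
body_text <- paragraphs$text
```

If `path_found` is `FALSE`, the first error in the question is consistent with `docx_summary()` having received a character string rather than an `rdocx` object.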
