How to extract images from Word using media_extract in R?

Question

I am working in R Markdown to produce a report that extracts and displays images from a Word document.

To do this, I am using the officer package. It has a function called media_extract which can 'extract files from an rdocx or rpptx object'.

In Word, I am struggling to locate the image because the summary data frame has no media_file column.

The media_file value is passed as the path argument of media_extract to locate the image. See the example code from the package documentation below:

library(officer)

example_pptx <- system.file(package = "officer",
  "doc_examples/example.pptx")
doc <- read_pptx(example_pptx)
content <- pptx_summary(doc)
# Keep only the rows that describe images
image_row <- content[content$content_type %in% "image", ]
media_file <- image_row$media_file
png_file <- tempfile(fileext = ".png")
media_extract(doc, path = media_file, target = png_file)

The file path comes from either docx_summary or pptx_summary, depending on the file type; both create a data frame summarising the document's contents. The pptx_summary data frame includes a media_file column, which gives a file path for the image. The docx_summary data frame doesn't include this column. Another Stack Overflow post suggested a solution using the word/media/ subdir, which seemed to work, but I'm not sure what this means or how to use it.

How do I extract an image from a Word document, using the word/media/ subdir as the media path?

l.iles · Answer 1 · 2022-02-28T16:10:03.847

I have continued to research this and found an answer, so I thought I would share!

The difficulty I was having extracting images from docx was due to the absence of a media_file column in the summary data frame (produced using docx_summary), which is used to locate the desired image. This column is present in the data frame produced for pptx by pptx_summary and is used in the example code from the package documentation.

In the absence of this column you instead need to locate the image using its path inside the document package (a .docx file is a zip archive of XML and media files), which looks like: media_file <- "/word/media/image3.png"

If you want to see what this structure looks like, you can right-click on your document > 7-Zip > Extract files... and a folder containing the document contents will be created; otherwise just change the image number to select the desired image. Note: sometimes images have names that do not follow the imageN.png pattern, so you may need to extract the files to find the name of the desired image.
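If you would rather not unpack the file by hand, the same listing can probably be done from R. This is a minimal sketch, assuming the docx is a standard zip package; list_docx_media is a hypothetical helper name of my own, not part of officer, and it only uses base R's unzip.

```r
# Hypothetical helper (not part of officer): list the files stored under
# word/media/ inside a .docx, without extracting anything.
list_docx_media <- function(docx_path) {
  # list = TRUE returns a data frame describing the archive entries
  entries <- utils::unzip(docx_path, list = TRUE)$Name
  # Keep only files directly under word/media/ (drops directory entries)
  entries[grepl("^word/media/[^/]+$", entries)]
}

# Example call, reusing the illustrative path from this answer:
# list_docx_media("/Users/user.name/Documents/mydoc.docx")
```

Each returned entry is a path relative to the package root, which is the part you need to find the real file name of the image.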

Example using media_extract with docx.

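# Extract an image from a Word document using the officer package
library(officer)

report <- read_docx("/Users/user.name/Documents/mydoc.docx")

# Temporary file that will receive the extracted image
png_file <- tempfile(fileext = ".png")

# Path of the image inside the docx package
media_file <- "/word/media/image3.png"

media_extract(report, path = media_file, target = png_file)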

If the extraction worked, media_extract returns TRUE. The image can then be included in a report using knitr (or another method).

knitr::include_graphics(png_file)
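If the report needs every image rather than one, the steps above can be wrapped in a loop. This is only a sketch under a few assumptions: extract_docx_media is a hypothetical helper name, the image paths are read from the zip listing with base R, and I am assuming media_extract accepts those paths without the leading slash.

```r
library(officer)

# Hypothetical helper (not part of officer): copy every file under word/media/
# of a .docx into out_dir and return the paths of the extracted copies.
extract_docx_media <- function(docx_path, out_dir = tempdir()) {
  doc <- read_docx(docx_path)
  entries <- utils::unzip(docx_path, list = TRUE)$Name
  media <- entries[grepl("^word/media/[^/]+$", entries)]
  targets <- file.path(out_dir, basename(media))
  for (i in seq_along(media)) {
    # Assumption: media_extract() takes the archive-relative path as-is
    ok <- media_extract(doc, path = media[i], target = targets[i])
    if (!isTRUE(ok)) warning("Could not extract ", media[i])
  }
  targets
}

# images <- extract_docx_media("/Users/user.name/Documents/mydoc.docx")
# knitr::include_graphics(images)
```

Keeping the original file names means the extension always matches the real image type, which a fixed ".png" temp file does not guarantee.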
