How to extract images from an uploaded Word document in Shiny

Question

I am working on a Shiny app that reads Word documents uploaded by users. The uploaded document then displays a table of all elements in the document and their formatting. I want it to also show any pictures from the uploaded Word doc. Documents containing multiple images aren't an issue - users will only ever upload documents with one image.

To do this, I am using the officer package. It has a function called media_extract that does exactly what I want. The issue is that, while the documentation says this function can extract images from Word (.docx) or PowerPoint (.pptx) files, I can only get it to work for the latter. This is because media_extract takes the image file path as an argument, but I cannot generate a file path for Word docs. The file path comes from one of two officer functions, depending on the file type: docx_summary or pptx_summary. These are also the functions I use to generate the tables rendered in my app. pptx_summary creates a table with a media_file column, which holds a file path for image elements, while docx_summary generates no such column. Absent that column and the path it contains, I don't know how to extract images from Word docs using this function.

For your convenience, here is my code for two Shiny apps: one that reads PowerPoint files and one that reads Word docs. If you upload a PowerPoint file and a Word file that each include an image, you will see how the tables generated in each app differ. My PowerPoint app also renders an image, to show you how that is done. That functionality is missing from my Word app.

PowerPoint reader app:

library(officer)
library(DT)
library(shiny)

ui<- fluidPage(

  titlePanel("Document Scanner"),
  sidebarLayout(
    sidebarPanel(
      fileInput("uploadedfile", "Upload a file", multiple = FALSE,
                accept = c(".pptx"))  # read_pptx() only reads .pptx files
    ),
    mainPanel(
      tags$h3(tags$b("Document Summary")),
      br(),
      DT::dataTableOutput("display_table"),
      br(),
      imageOutput("myImage")
    )
  )
)
server <- function(input, output) {
  # reactive holding the uploaded presentation as an rpptx object
  x <- reactive({
    req(input$uploadedfile)  # wait until a file has been uploaded
    read_pptx(input$uploadedfile$datapath)
  })

  #rendering formatting table
  output$display_table<-DT::renderDataTable({

    req(input$uploadedfile)
    DT::datatable(pptx_summary(x()))
  })


  #rendering images from powerpoint
  output$myImage <- renderImage({
    req(input$uploadedfile)
    readFile <- x()
    fileSummaryDF <- pptx_summary(readFile)
    # Getting path to image (this is basically straight from the documentation
    # for media_extract)
    fileSummaryDF_filtered <- fileSummaryDF[fileSummaryDF$content_type %in% "image", ]
    media_file <- fileSummaryDF_filtered$media_file
    req(length(media_file) > 0)  # nothing to render if there is no image
    png_file <- tempfile(fileext = ".png")
    media_extract(readFile, path = media_file, target = png_file)

    list(src = png_file,
         alt = "Test Picture")
  }, deleteFile = TRUE)  # png_file is a temporary copy, so it is safe to delete
}
shinyApp(ui, server)

Word reader app:

library(officer)
library(DT)
library(shiny)

ui<- fluidPage(

  titlePanel("Word Doc Scanner"),
  sidebarLayout(
    sidebarPanel(
      fileInput("uploadedfile", "Upload a file", multiple = FALSE,
                accept = c(".docx"))  # read_docx() only reads .docx files
    ),
    mainPanel(
      tags$h3(tags$b("Document Summary")),
      br(),
      DT::dataTableOutput("display_table"),
      imageOutput("image1")
    )
  )
)
server <- function(input, output) {

  # reactive holding the docx_summary() table for the uploaded file
  x <- reactive({
    req(input$uploadedfile)  # wait until a file has been uploaded
    docDF <- read_docx(path = input$uploadedfile$datapath)
    docx_summary(docDF)
  })

  #rendering formatting table
  output$display_table <- DT::renderDataTable({
    req(input$uploadedfile)
    DT::datatable(x())
  })

  #how to render the image without an image path anywhere in the table?
}


shinyApp(ui, server)

If this can't be done in officer then I'm happy to do it a different way. Thank you.

It's really just a ZIP file. Rename it to a tempfile with a `.zip` extension and use `unzip()` to unzip it. Look for a `word/media/` subdir and the images are there. The source for the `docxtractr` package has code for the zip part. - hrbrmstr, Nov 14 '18 at 02:09
I was indeed able to find the image by going to the `word/media/` subdir and thus create the path to it. That was simple enough. I did not need to do any unzipping as you mentioned. Anyway, problem solved. Thank you. - IanG, Nov 14 '18 at 21:16
A Word doc is a zipped archive, so _something_ had to do the unzipping. - hrbrmstr, Nov 14 '18 at 21:17
Then I assume the `read_docx` function from the `officer` package must have unzipped it. - IanG, Nov 15 '18 at 05:00
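
For later readers, the ZIP approach from the comments above can be sketched in base R. This is only a rough sketch, not the code either commenter actually used: the helper names `is_docx_media` and `extract_docx_images` are made up for illustration, and it assumes the uploaded file is a valid .docx whose images sit directly under `word/media/`.

```r
# Hypothetical helper: TRUE for ZIP entry names that are files directly
# under word/media/ (the folder a .docx uses for embedded images).
is_docx_media <- function(entry_names) {
  grepl("^word/media/[^/]+$", entry_names)
}

# Hypothetical helper: extract the embedded images of a .docx into `exdir`
# and return the paths of the extracted files (character(0) if none).
extract_docx_images <- function(docx_path, exdir = tempfile("docx_media_")) {
  # List the archive contents without extracting anything yet
  entries <- utils::unzip(docx_path, list = TRUE)$Name
  media <- entries[is_docx_media(entries)]
  if (length(media) == 0) {
    return(character(0))
  }
  # Extract only the media files; unzip() creates exdir if necessary
  utils::unzip(docx_path, files = media, exdir = exdir)
  file.path(exdir, media)
}

# Possible use inside the Word app's server function (assumes one image):
# output$image1 <- renderImage({
#   req(input$uploadedfile)
#   images <- extract_docx_images(input$uploadedfile$datapath)
#   req(length(images) > 0)
#   list(src = images[1], alt = "Image from uploaded document")
# }, deleteFile = FALSE)
```

Since, as noted in the comments, `read_docx` has apparently already unpacked the archive, passing a path under `word/media/` to `media_extract` may work as well. The image file name inside that folder varies by document, so listing the folder first is probably safer than hard-coding a name.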
