I'm trying to traverse a Word document and save all the images found in it. I uploaded a sample Word document to the online demo and noticed that the images are listed as:

/word/media/image1.png  rId5    image/png
/word/media/image2.png  rId5    image/png
/word/media/image3.jpg  rId5    image/jpeg

How can I programmatically save these images while traversing the document?

Currently I get all the text from the document like this:

   WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File(filePath));
   MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();
   Document wmlDocumentEl = (org.docx4j.wml.Document)documentPart.getJaxbElement();
   Body body =  wmlDocumentEl.getBody();
   DocumentTraverser traverser = new DocumentTraverser();

   class DocumentTraverser  extends TraversalUtil.CallbackImpl {
      @Override
      public List<Object> apply(Object o) {
         if (o instanceof org.docx4j.wml.Text) {
         ....
         }
         return null;
      }
   }
birdy
  • Do you care about the context of the images (ie order, surrounding text), or do you just want to dump them somewhere? – JasonPlutext Oct 28 '14 at 00:14
  • Although that would be good information to have later on...right now just dumping them will suffice. – birdy Oct 28 '14 at 02:35
  • Check this link (http://cnedelcu.blogspot.in/2013/02/top-3-ways-to-extract-images-from-word-docx-doc-document.html); it may be useful to you. – yugi Oct 30 '14 at 04:27
  • I'm looking to do this programmatically. The link you suggested mentions ways of doing this manually. I believe there is a way to do this programmatically as @JasonPlutext hinted. – Omnipresent Oct 30 '14 at 12:07
  • Check Apache POI; it may be capable of doing this. – Mayur Gupta Nov 05 '14 at 13:12

2 Answers


For embedded (as opposed to external) images, the simplest approach is:

import java.io.FileOutputStream;
import java.util.Map.Entry;

import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.Part;
import org.docx4j.openpackaging.parts.PartName;
import org.docx4j.openpackaging.parts.WordprocessingML.BinaryPart;
import org.docx4j.openpackaging.parts.WordprocessingML.BinaryPartAbstractImage;

public class SaveImages  {

        public static void main(String[] args) throws Exception {

            WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));

            for (Entry<PartName, Part> entry : wordMLPackage.getParts().getParts().entrySet()) {

                if (entry.getValue() instanceof BinaryPartAbstractImage) {

                    // TODO: you can get file extension from PartName, or part class.
                    // try-with-resources closes the stream even if the write fails.
                    try (FileOutputStream fos = new FileOutputStream( yourfile )) {
                        ((BinaryPart)entry.getValue()).writeDataToOutputStream(fos);
                    }

                }


            }
        }

    }
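
The TODO in the code above leaves the output file name open. As a minimal sketch, assuming part names look like the ones listed in the question (for example /word/media/image1.png), the file name and extension can be derived with plain string handling. The class `PartNames` and its methods are hypothetical helpers, not part of docx4j.

```java
/** Hypothetical helpers for turning a part name into an output file name. */
final class PartNames {

    private PartNames() {
    }

    /** Returns the last path segment, e.g. "/word/media/image1.png" gives "image1.png". */
    static String fileNameOf(String partName) {
        int slash = partName.lastIndexOf('/');
        return slash >= 0 ? partName.substring(slash + 1) : partName;
    }

    /** Returns the extension without the dot, or an empty string if there is none. */
    static String extensionOf(String partName) {
        String fileName = fileNameOf(partName);
        int dot = fileName.lastIndexOf('.');
        return dot > 0 ? fileName.substring(dot + 1) : "";
    }

    /** True for parts stored under /word/media/, the folder shown in the question. */
    static boolean isMediaPart(String partName) {
        return partName.startsWith("/word/media/");
    }
}
```

Inside the loop, the string to pass in would be `entry.getKey().getName()`. If the thumbnail is also being written out, checking `isMediaPart` first may help, on the assumption that the thumbnail is stored outside /word/media/ (it is usually found under /docProps/); that is worth verifying against your own documents.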

If you care about the context of the images, you have to search for them in the relevant parts (e.g. MainDocumentPart, and your header/footer parts as required).

https://github.com/plutext/docx4j/blob/master/src/samples/docx4j/org/docx4j/samples/ImageConvertEmbeddedToLinked.java will give you a hint as to how to do that. Note that there are two different XML structures for images: the newer DrawingML and the legacy VML.

JasonPlutext
  • Great this works! Is there a way to not pull the thumbnail image? this seems to pull the thumbnail image as well. – Anthony Nov 04 '14 at 16:43

To access the embedded images in a .docx file, use the following steps:

◾ If it's not already a .docx file, open the file in Word 2007 and save it as a Word Document (*.docx).
◾ Change the file extension on the original file from .docx to .zip, as shown in Figure D.
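
The rename to .zip works because a .docx file is a ZIP archive, so the same idea can be done programmatically without renaming anything. What follows is a rough sketch using only `java.util.zip` from the JDK, assuming the images sit under word/media/ as in the listing in the question. The class name `DocxMediaExtractor` and the method `extract` are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: copies every entry under word/media/ into a directory. */
final class DocxMediaExtractor {

    // Assumed location of embedded images inside the package.
    private static final String MEDIA_PREFIX = "word/media/";

    private DocxMediaExtractor() {
    }

    /** Reads the .docx stream as a ZIP archive and returns the files written. */
    static List<Path> extract(InputStream docx, Path outputDir) throws IOException {
        Files.createDirectories(outputDir);
        List<Path> written = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith(MEDIA_PREFIX)) {
                    continue; // skip document XML, thumbnails, and everything else
                }
                // Use only the last path segment, so files land directly in outputDir.
                String fileName = name.substring(name.lastIndexOf('/') + 1);
                if (fileName.isEmpty()) {
                    continue;
                }
                Path target = outputDir.resolve(fileName);
                // Copies the bytes of the current entry only; the zip stream stays open.
                Files.copy(zip, target, StandardCopyOption.REPLACE_EXISTING);
                written.add(target);
            }
        }
        return written;
    }
}
```

A call might look like `DocxMediaExtractor.extract(Files.newInputStream(Paths.get(filePath)), Paths.get("images"))`, where `filePath` is the variable from the question and `images` is an arbitrary output folder. This only dumps the files; it says nothing about where each image appears in the document.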