
I can check .docx files for embedded objects with the code below, but for a .doc file it throws InvalidFormatException.

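public boolean checkForEmbeddedObj(File wordFile) throws IOException, OpenXML4JException {
    // XWPFDocument only understands .docx (OOXML) files
    try (InputStream inStream = new FileInputStream(wordFile)) {
        XWPFDocument xwDoc = new XWPFDocument(inStream);
        // true when the document contains at least one embedded part
        return !xwDoc.getAllEmbedds().isEmpty();
    }
}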

Any idea how I can do the same for .doc files?

Olaf Kock
saloni

1 Answer


DOCX and DOC files follow different specifications (DOCX is a ZIP-based OOXML package, DOC is an OLE2 binary file), and Apache POI implements them in different libraries.

DOCX files:

  • use the poi-ooxml library and the XWPFDocument class

DOC files:

  • use the poi-scratchpad library and the HWPFDocument class

Old DOC files (Word 6 / Word 95):

  • use the poi-scratchpad library and the HWPFOldDocument class
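The format difference is visible in the first bytes of the file, which is why the OOXML classes reject a .doc. Here is a minimal sketch of a check that uses only the JDK, no POI. The class name `WordFormatSniffer` and its methods are hypothetical names for illustration. Note that the signatures only identify the container: an OLE2 file could also be an .xls, and a ZIP file is not necessarily a .docx.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper: identifies the container format from the leading bytes.
final class WordFormatSniffer {

    // Signature of an OLE2 compound file (binary .doc, .xls, .ppt)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Signature of a ZIP local file header (.docx and other OOXML packages)
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    // Classifies the leading bytes of a file as "OLE2", "ZIP" or "UNKNOWN".
    static String detect(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }

    // Reads up to 8 bytes from the file and classifies them.
    static String detectFile(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[8];
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break; // file is shorter than 8 bytes
                }
                n += r;
            }
            return detect(Arrays.copyOf(head, n));
        }
    }

    private static boolean startsWith(byte[] data, byte[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if (data[i] != signature[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With a check like this you could route "ZIP" files to XWPFDocument and "OLE2" files to HWPFDocument instead of relying on the file extension.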

To extract embedded data from a .doc file, you can use OLE2ExtractorFactory.getEmbededDocsTextExtractors as follows:

import org.apache.poi.extractor.OLE2ExtractorFactory;
import org.apache.poi.extractor.POITextExtractor;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

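    void hwpfExtractor(File wordFile) throws IOException {
        // Close the stream once the document has been loaded
        try (FileInputStream in = new FileInputStream(wordFile)) {
            HWPFDocument doc = new HWPFDocument(in);

            // One extractor per embedded document found in the .doc file
            POITextExtractor[] embeddedExtractors =
                    OLE2ExtractorFactory.getEmbededDocsTextExtractors(new WordExtractor(doc));

            for (POITextExtractor ext : embeddedExtractors) {
                // ext is an instance of some org.apache.poi.extractor.POITextExtractor subclass;
                // XXX is a placeholder for the concrete type you want to handle
                if (ext instanceof XXX) {
                    // do stuff
                }
            }
        }
    }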


gtiwari333
  • Thank you for your answer. I couldn't find any methods to check for embedded objects in the HWPFDocument class. Am I missing something here? – saloni Sep 30 '20 at 15:02
  • I am using POI 3.7, and the OLE2ExtractorFactory class is not available in that version. I tried the approach from [link](https://poi.apache.org/text-extraction.html), but I get a **No supported documents found in OLE2 stream** error. I don't want to read the embedded objects, I just need to detect whether the .doc contains any. – saloni Oct 01 '20 at 06:47
  • Just check the length of `POITextExtractor[] embeddedExtractors` to detect embedded objects. – gtiwari333 Oct 01 '20 at 16:14
  • When you say you are getting an XYZ error, please provide the full stack trace for it. The stack trace gives the information needed to find the cause and work on a fix. – gtiwari333 Oct 01 '20 at 16:15