To parse a .docx file, I do this:

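import java.io.FileInputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public String parseDOCX(String fileNameorFilePath)
    {
        // try-with-resources closes the stream and the document even if extraction fails
        try (FileInputStream in = new FileInputStream(fileNameorFilePath);
             XWPFDocument docx = new XWPFDocument(in)) {
            XWPFWordExtractor xwpfWordExtractor = new XWPFWordExtractor(docx);
            return xwpfWordExtractor.getText();
        }
        catch (Exception error)
        {
            throw new RuntimeException(error);
        }
    }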

When I use this code to parse a .doc file (Word 97-2003), I get this exception:

Caused by: org.apache.poi.openxml4j.exceptions.InvalidFormatException: Package should contain a content type part [M1.13]

What would be the best way to open a .doc file?

thomas
  • Related: [Package should contain a content type part](https://stackoverflow.com/questions/32878743/package-should-contain-a-content-type-part-m1-13/49130309). doc and docx extensions use different classes, like xls and xlsx – jhamon May 14 '20 at 09:07

1 Answer

According to the Apache POI documentation:

HWPF is the name of our port of the Microsoft Word 97(-2007) file format to pure Java. It also provides limited read only support for the older Word 6 and Word 95 file formats.

The partner to HWPF for the new Word 2007 .docx format is XWPF. Whilst HWPF and XWPF provide similar features, there is not a common interface across the two of them at this time.

In other words: nothing in your code should say XWPFDocument when you read a .doc file; you need to use the corresponding classes built for HWPF.
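Because the two formats need different classes, it can help to decide which one a file is before handing it to POI. The following is only a sketch using the JDK alone, not part of POI: the class name `WordFormatSniffer`, the enum `WordFormat`, and the `detect` methods are made up for illustration. It checks the leading bytes of the file for the OLE2 signature (used by .doc) or the ZIP signature (used by .docx). In real code the `DOC_OLE2` branch would go to `HWPFDocument` with `org.apache.poi.hwpf.extractor.WordExtractor`, and the `DOCX_OOXML` branch to `XWPFDocument` with `XWPFWordExtractor`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not a POI class): guesses the container format from the file header.
final class WordFormatSniffer {

    enum WordFormat { DOC_OLE2, DOCX_OOXML, UNKNOWN }

    // OLE2 compound document signature, shared by .doc, .xls and .ppt.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature; .docx is a ZIP package, but so is any other ZIP file.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    private WordFormatSniffer() { }

    // Classifies a header that has already been read into memory.
    static WordFormat detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return WordFormat.DOC_OLE2;    // route to HWPFDocument / WordExtractor
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return WordFormat.DOCX_OOXML;  // route to XWPFDocument / XWPFWordExtractor
        }
        return WordFormat.UNKNOWN;
    }

    // Reads at most the first 8 bytes of the file and classifies them (requires Java 11+).
    static WordFormat detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return detect(in.readNBytes(OLE2_MAGIC.length));
        }
    }

    // True if data is at least as long as prefix and begins with the same bytes.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Checking the content rather than the file extension also catches files that were renamed from .doc to .docx or the other way round. The signatures only identify the container, so a match does not prove that the file is a Word document.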

GhostCat
  • Using HWPF: HWPFDocument docx = new HWPFDocument(new FileInputStream(fileNameorFilePath)); I am getting this error: java.lang.NoSuchMethodError: org.apache.poi.POIDocument: method <init>()V not found at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:144) at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:133) – thomas May 14 '20 at 09:26
  • Remove the older Apache POI jars from your classpath - mixing jars between versions is not supported! – Gagravarr May 14 '20 at 09:30
  • @thomas Also note: please be careful about getting into comment-question ping pong. And yes, that exception indicates a version mismatch of some sort. – GhostCat May 14 '20 at 09:34