
I am trying to parse an old Word file (Word version 2) in Java. I am using Apache Tika to parse the document, and Apache POI, which Tika calls internally, throws the exception below:

 org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@69565369
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at com.gsk.rd.dcs.dsd.pier.parser.xmlparser.PierParser.processMappings(PierParser.java:93)
    ... 4 more
Caused by: org.apache.poi.poifs.filesystem.NotOLE2FileException: The supplied data appears to be an old Word version 2 file. Apache POI doesn't currently support this format
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:117)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:285)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:123)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 7 more

I am using the code below:


import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;


public class Tika {
    public static void main(String[] args) {
        File inputFile = new File(args[0]);

        // try-with-resources closes the stream even if parsing fails
        try (InputStream inputStream = new FileInputStream(inputFile)) {
            Parser parser = new AutoDetectParser();
            ContentHandler handler = new ToXMLContentHandler();

            // OCR settings, applied to embedded images
            TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
            ocrConfig.setLanguage("eng");
            ocrConfig.setTesseractPath("");
            ocrConfig.setTessdataPath("C:\\Program Files\\Tesseract-OCR\\tessdata");

            // PDF settings: extract every inline image, including duplicates
            PDFParserConfig pdfConfig = new PDFParserConfig();
            pdfConfig.setExtractInlineImages(true);
            pdfConfig.setExtractUniqueInlineImagesOnly(false);

            ParseContext parseContext = new ParseContext();
            parseContext.set(PDFParserConfig.class, pdfConfig);
            parseContext.set(TesseractOCRConfig.class, ocrConfig);
            // Registering the parser lets embedded documents be parsed recursively
            parseContext.set(Parser.class, parser);

            Metadata metadata = new Metadata();
            parser.parse(inputStream, handler, metadata, parseContext);

            // The extracted content as XHTML
            String text = handler.toString().trim();
            System.out.println(text);

        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

I am using the following Apache Tika dependency:

    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.24</version>
    </dependency>

Apache Tika 1.24 pulls in Apache POI 4.1.2 as a transitive Maven dependency:

+-org.apache.tika:tika-parsers:1.24
    +-org.apache.poi:poi:4.1.2
      +-org.apache.commons:commons-math3:3.6.1

The error indicates that Apache POI doesn't support old Word version 2 files.
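
Since this is a batch job, the root cause can at least be recognised programmatically, so that unsupported files are logged and set aside instead of being reported as generic failures. The following is only a minimal sketch: the class `CauseChain` and the method `hasCauseNamed` are hypothetical names, and matching on the class name as a string is an assumption made to keep the example free of any POI or Tika imports.

```java
final class CauseChain {
    private CauseChain() {
    }

    // Hypothetical helper: returns true if the exception itself, or any
    // exception in its cause chain, has exactly the given class name.
    static boolean hasCauseNamed(Throwable error, String className) {
        for (Throwable current = error; current != null; current = current.getCause()) {
            if (current.getClass().getName().equals(className)) {
                return true;
            }
        }
        return false;
    }
}
```

In the `catch` block of the code above, a call such as `CauseChain.hasCauseNamed(e, "org.apache.poi.poifs.filesystem.NotOLE2FileException")` would then separate this specific failure from other parse errors. The class name is the one shown in the stack trace.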

Is there any way to parse Word version 2 documents?

Thanks in advance.

Olaf Kock
Akhil
  • Open in Word and Save-As to a newer format? – Gagravarr Jan 11 '21 at 14:08
  • @Gagravarr there are thousands of documents which I need to parse and we can't do it for every single file – Akhil Jan 11 '21 at 18:48
  • You can use JACOB (Java-COM bridge) to open and save documents in Word. – nguyenq Jan 11 '21 at 20:11
  • Sponsor adding the missing format in Apache Tika then? Otherwise try the approach in https://stackoverflow.com/questions/50325154/parsing-converting-legacy-word-documents-msword2-5 – Gagravarr Jan 11 '21 at 20:17
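
With thousands of files, the conversion suggested in the comments above only needs to run on the files that are actually in the Word 2 format. A rough sketch of a triage pass follows. It assumes that Word 2 files start with the bytes `DB A5 2D 00`; that signature is an assumption to verify against your own samples. The class `Word2Triage` and its methods are hypothetical names, and the code uses only the JDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class Word2Triage {
    // Assumed leading bytes of a Word 2 file; verify against real samples.
    private static final byte[] WORD2_MAGIC = {(byte) 0xDB, (byte) 0xA5, 0x2D, 0x00};

    private Word2Triage() {
    }

    // True if the given header bytes start with the assumed Word 2 signature.
    static boolean hasWord2Magic(byte[] header) {
        if (header.length < WORD2_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(header, WORD2_MAGIC.length), WORD2_MAGIC);
    }

    // Reads only the first few bytes of the file and checks the signature.
    static boolean isWord2(Path file) throws IOException {
        byte[] header = new byte[WORD2_MAGIC.length];
        int read = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                read += n;
            }
        }
        return read == header.length && hasWord2Magic(header);
    }

    // Walks a directory tree and returns the files that look like Word 2.
    static List<Path> findWord2Files(Path root) throws IOException {
        List<Path> candidates;
        try (Stream<Path> paths = Files.walk(root)) {
            candidates = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        List<Path> hits = new ArrayList<>();
        for (Path candidate : candidates) {
            if (isWord2(candidate)) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Prints one path per line, e.g. as input for a separate conversion step.
        for (Path file : findWord2Files(Paths.get(args[0]))) {
            System.out.println(file);
        }
    }
}
```

The resulting list could be handed to whichever conversion route is chosen (for example Word automation through JACOB), while all other files go straight to Tika as before.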

0 Answers