I am trying to parse an old Word file (Word version 2) in Java. I am using Apache Tika to parse the document, and the underlying Apache POI library throws the exception below:
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@69565369
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at com.gsk.rd.dcs.dsd.pier.parser.xmlparser.PierParser.processMappings(PierParser.java:93)
... 4 more
Caused by: org.apache.poi.poifs.filesystem.NotOLE2FileException: The supplied data appears to be an old Word version 2 file. Apache POI doesn't currently support this format
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:117)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:285)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:123)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 7 more
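A minimal sketch of how the root cause could be picked out of the wrapping TikaException, so that unsupported Word 2 files can be reported separately from other parse failures. `CauseFinder` and `findCause` are hypothetical helper names, not part of Tika or POI. The match is done on the class name string copied from the stack trace above, so the sketch runs without either library on the classpath; with POI available, an `instanceof` check would be the more usual choice.

```java
public class CauseFinder {
    // Class name copied from the "Caused by" line of the stack trace.
    static final String NOT_OLE2 = "org.apache.poi.poifs.filesystem.NotOLE2FileException";

    // Walks the cause chain and returns the first throwable whose class name
    // matches, or null if none does. The depth limit guards against cyclic chains.
    static Throwable findCause(Throwable t, String className) {
        int depth = 0;
        for (Throwable c = t; c != null && depth < 32; c = c.getCause(), depth++) {
            if (c.getClass().getName().equals(className)) {
                return c;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in exceptions; in the real code this would be the exception
        // caught around parser.parse(...).
        Exception wrapped = new RuntimeException("outer", new IllegalStateException("inner"));
        Throwable hit = findCause(wrapped, "java.lang.IllegalStateException");
        System.out.println(hit == null ? "not found" : hit.getMessage());
    }
}
```

In the catch block of the parsing code, `findCause(e, CauseFinder.NOT_OLE2) != null` would then indicate that the file was rejected as a non-OLE2 document rather than failing for some other reason.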
I am using the code below:
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import java.io.File;
import java.io.FileInputStream;
public class Tika {
    public static void main(String[] args) {
        String file_path = args[0];
        File input_file = new File(file_path);
        // try-with-resources closes the stream even if parsing fails
        try (FileInputStream fileInputStream = new FileInputStream(input_file)) {
            org.apache.tika.parser.Parser parser = new AutoDetectParser();
            ContentHandler handler = new ToXMLContentHandler();

            // OCR settings for images embedded in documents
            TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
            ocrConfig.setLanguage("eng");
            ocrConfig.setTesseractPath("");
            ocrConfig.setTessdataPath("C:\\Program Files\\Tesseract-OCR\\tessdata");

            // PDF settings: extract every inline image, including duplicates
            PDFParserConfig pdfConfig = new PDFParserConfig();
            pdfConfig.setExtractInlineImages(true);
            pdfConfig.setExtractUniqueInlineImagesOnly(false);

            ParseContext parseContext = new ParseContext();
            parseContext.set(PDFParserConfig.class, pdfConfig);
            parseContext.set(TesseractOCRConfig.class, ocrConfig);
            // Registering the parser lets embedded documents be parsed recursively
            parseContext.set(org.apache.tika.parser.Parser.class, parser);

            Metadata metadata = new Metadata();
            parser.parse(fileInputStream, handler, metadata, parseContext);
            String text = handler.toString().trim();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
I am using the following Apache Tika dependency:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.24</version>
</dependency>
Apache Tika 1.24 pulls in Apache POI 4.1.2 as a transitive Maven dependency:
+-org.apache.tika:tika-parsers:1.24
+-org.apache.poi:poi:4.1.2
+-org.apache.commons:commons-math3:3.6.1
The error indicates that Apache POI does not support old Word version 2 files.
Is there any way to parse Word version 2 documents?
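In case it helps frame the question, here is a minimal sketch of how such files could be detected up front and routed away from the POI-based parser. It assumes that a Word 2 file starts with the four bytes DB A5 2D 00; that is my understanding of the signature POI's format detection checks for, not something the stack trace confirms, so treat it as an assumption. `Word2Sniffer` and `looksLikeWord2` are hypothetical names, and the sketch uses only the JDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Word2Sniffer {
    // ASSUMPTION: signature bytes of a Word 2 file (DB A5 2D 00).
    static final byte[] WORD2_MAGIC = {(byte) 0xDB, (byte) 0xA5, 0x2D, 0x00};

    // Returns true if the given bytes start with the assumed Word 2 signature.
    static boolean looksLikeWord2(byte[] header) {
        if (header == null || header.length < WORD2_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < WORD2_MAGIC.length; i++) {
            if (header[i] != WORD2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads the first four bytes of the stream and checks them.
    // The stream is consumed, so open a fresh one before parsing.
    static boolean looksLikeWord2(InputStream in) throws IOException {
        byte[] header = new byte[WORD2_MAGIC.length];
        int read = 0;
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) {
                break; // file shorter than the signature
            }
            read += n;
        }
        return read == header.length && looksLikeWord2(header);
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(looksLikeWord2(in)
                    ? "Word 2 signature found"
                    : "no Word 2 signature");
        }
    }
}
```

This only identifies the files; it does not extract any text from them, which is the part I am still looking for.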
Thanks in advance.