Spark - Scala: Parsing and extracting a document which has both Text and Image - .doc, .docx files

Question

I have a few .doc and .docx files that contain both images and text. I would like to parse these files and extract their contents, with or without the image details.

I am currently using Apache Tika, which fails to parse such files. It works perfectly for PDF files and for text-only .doc and .docx files, but files that contain images throw this error:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.commons.compress.utils.IOUtils.readFully(Ljava/io/InputStream;[B)I
    at org.apache.tika.parser.pkg.TikaArchiveStreamFactory.detect(TikaArchiveStreamFactory.java:472)
    at org.apache.tika.parser.pkg.ZipContainerDetector.detectArchiveFormat(ZipContainerDetector.java:112)

Is there any way to extract the contents of these files?

score 0 · Accepted Answer · answered Jul 14 '17 at 11:09


I worked around this by converting all my files to PDF documents and then parsing them with Tika's TesseractOCR parser.
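A minimal sketch of the parsing step, assuming the files have already been converted to PDF by some other tool, that tika-parsers is on the classpath, and that the Tesseract binary is installed and on the PATH. The object name PdfOcrExtract, the method name extractText, and the "eng" language setting are illustrative choices, not part of the original setup:

```scala
import java.io.{File, FileInputStream}

import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.{AutoDetectParser, ParseContext, Parser}
import org.apache.tika.parser.ocr.TesseractOCRConfig
import org.apache.tika.parser.pdf.PDFParserConfig
import org.apache.tika.sax.BodyContentHandler

// Hypothetical helper: extracts text from an already-converted PDF,
// running OCR on images embedded in the pages.
object PdfOcrExtract {

  def extractText(pdf: File): String = {
    val parser  = new AutoDetectParser()
    val handler = new BodyContentHandler(-1) // -1 disables the default write limit

    // Ask the PDF parser to hand embedded images to the embedded-document parser.
    val pdfConfig = new PDFParserConfig()
    pdfConfig.setExtractInlineImages(true)

    // OCR settings; "eng" is an assumption about the document language.
    val ocrConfig = new TesseractOCRConfig()
    ocrConfig.setLanguage("eng")

    val context = new ParseContext()
    context.set(classOf[PDFParserConfig], pdfConfig)
    context.set(classOf[TesseractOCRConfig], ocrConfig)
    // Registering the parser itself lets Tika recurse into embedded images.
    context.set(classOf[Parser], parser)

    val in = new FileInputStream(pdf)
    try {
      parser.parse(in, handler, new Metadata(), context)
      handler.toString
    } finally {
      in.close()
    }
  }

  def main(args: Array[String]): Unit =
    println(extractText(new File(args(0))))
}
```

If Tesseract is not installed, Tika should still return the regular text layer of the PDF, but no text from the images.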


Sija Balakrishnan

