Can Apache Tika Extract Attachments?

Question

I am using Apache Tika to extract text from various document formats. I would like to extract images from those files as well (usually PDF or Word).

I am using TikaCLI as a proof of concept with the -z (--extract) option, but it never extracts any attachments. The help screen for TikaCLI and a few web sites out there suggest this should work. I get no output from Tika:

C:\work>Setup.CIPDev-6-3-0-2583\java\bin\java.exe -jar Setup.CIPDev-6-3-0-2583\tomcat\webapps\JavaBridge\WEB-INF\lib\tika-app-1.3.jar -z attachment.pdf

I have tried a variety of arguments, files, and attachment combinations with no success. Has anyone successfully extracted attachments from files with Apache Tika? If so, can you provide some guidance on how you did it?

Any help is greatly appreciated.

Are you sure the file you're trying with actually has embedded resources? If you try with `-z` on a supported file, you'll see something like `Extracting 'image1.emf' (application/x-emf) to ./image1.emf` output by TikaApp to let you know what it did — Gagravarr, Jul 23 '13 at 12:17
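To check programmatically whether `-z` extracted anything, one option is to capture the app's output and look for lines of the form quoted in the comment above. The following is a minimal sketch, not part of Tika: the class `TikaExtractLog` and its methods are hypothetical names, and the regular expression assumes the log line looks exactly like the `Extracting '...' (...) to ...` example, which may vary between Tika versions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: runs the Tika app with -z and collects its "Extracting" lines.
class TikaExtractLog {

    /** One "Extracting ..." line reported by the Tika app. */
    static final class Extracted {
        final String name;
        final String mediaType;
        final String target;

        Extracted(String name, String mediaType, String target) {
            this.name = name;
            this.mediaType = mediaType;
            this.target = target;
        }
    }

    // Assumed format: Extracting 'image1.emf' (application/x-emf) to ./image1.emf
    private static final Pattern LINE =
            Pattern.compile("Extracting '(.*)' \\(([^)]*)\\) to (.+)");

    /** Keeps only the lines that match the assumed "Extracting" format. */
    static List<Extracted> parse(List<String> outputLines) {
        List<Extracted> found = new ArrayList<>();
        for (String line : outputLines) {
            Matcher m = LINE.matcher(line.trim());
            if (m.matches()) {
                found.add(new Extracted(m.group(1), m.group(2), m.group(3)));
            }
        }
        return found;
    }

    /** Runs "java -jar <tikaJar> -z <document>" and returns stdout and stderr merged. */
    static List<String> runTika(String tikaJar, String document)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("java", "-jar", tikaJar, "-z", document);
        pb.redirectErrorStream(true);
        Process process = pb.start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        process.waitFor();
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // args[0] = path to the tika-app jar, args[1] = document to inspect
        List<Extracted> found = parse(runTika(args[0], args[1]));
        if (found.isEmpty()) {
            System.out.println("Tika reported no extracted resources.");
        }
        for (Extracted e : found) {
            System.out.println(e.name + " [" + e.mediaType + "] -> " + e.target);
        }
    }
}
```

An empty result here matches the symptom in the question: the app ran but reported no embedded resources for that file.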
I believe the answer is yes unless I have the wrong idea about embedded resources. My primary test document is a PDF I created with an attached image file that shows up as an attachment in the viewer. In my test Word documents I just pasted images into the document and saved it. I also tried images in the PDF just in the middle of the document content instead of attachments. — jriffel73, Jul 23 '13 at 12:43
I cannot answer my own question, but I'll comment here for future reference. It turns out the problem I had with extracting attachments in PDFs had to do with the file format of the attachments. Apparently Tika only extracts file types it understands how to parse. I assumed it would extract any attachment type. It is also worth mentioning that image data rendered in the PDF document does not seem to extract either, just embedded file attachments. — jriffel73, Jul 23 '13 at 16:16
FYI, what Tika can do and what the Tika App offers can differ: not everything that Tika supports is exposed through the command-line app — Gagravarr, Jul 23 '13 at 19:41
We've added the ability to extract "inline" images from PDFs. You do need to do some configuration to get that to work, though, because there are some crazy PDFs in the wild. If you're still interested, drop a note to the Tika users list. — Tim Allison, May 13 '16 at 18:54
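For readers who want to go past the command-line app, the comments above suggest using the Tika library directly and enabling inline image extraction through configuration. The sketch below is one way that might look, assuming a Tika release that provides `PDFParserConfig.setExtractInlineImages` and has the Tika parsers on the classpath. The class `EmbeddedFileSaver` and its `extract` method are hypothetical names for illustration; the `"resourceName"` metadata key is an assumption about how Tika reports the embedded file name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical extractor: writes every embedded resource Tika hands over to a directory.
class EmbeddedFileSaver implements EmbeddedDocumentExtractor {
    private final Path outputDir;
    private int count = 0;

    EmbeddedFileSaver(Path outputDir) {
        this.outputDir = outputDir;
    }

    @Override
    public boolean shouldParseEmbedded(Metadata metadata) {
        return true; // accept every embedded resource, whatever its type
    }

    @Override
    public void parseEmbedded(InputStream stream, ContentHandler handler,
                              Metadata metadata, boolean outputHtml) throws IOException {
        count++;
        // Assumption: the embedded file name is stored under the "resourceName" key.
        String name = metadata.get("resourceName");
        if (name == null) {
            name = "";
        }
        // Keep only the last path segment so files cannot escape outputDir.
        name = name.replace('\\', '/');
        name = name.substring(name.lastIndexOf('/') + 1).trim();
        if (name.isEmpty()) {
            name = "embedded-" + count + ".bin";
        }
        // Save the raw bytes; this sketch does not descend into nested attachments.
        Files.copy(stream, outputDir.resolve(name), StandardCopyOption.REPLACE_EXISTING);
    }

    /** Parses the input and returns how many embedded resources were saved. */
    static int extract(InputStream input, Path outputDir) throws Exception {
        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);

        // Inline PDF images are off by default, so enable them explicitly.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);
        context.set(PDFParserConfig.class, pdfConfig);

        EmbeddedFileSaver saver = new EmbeddedFileSaver(outputDir);
        context.set(EmbeddedDocumentExtractor.class, saver);

        // The extracted text itself is discarded; only embedded resources are kept.
        parser.parse(input, new DefaultHandler(), new Metadata(), context);
        return saver.count;
    }

    public static void main(String[] args) throws Exception {
        // args[0] = document to inspect, args[1] = directory for extracted files
        Path outDir = Files.createDirectories(Paths.get(args[1]));
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println("Saved " + extract(in, outDir) + " embedded resource(s)");
        }
    }
}
```

Because the extractor copies the raw bytes without parsing them, this approach may also save attachment types that Tika has no parser for, although that behavior should be confirmed against the Tika version in use.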
