I am developing a standalone Java batch process and am trying to determine the MIME type of file attachments using Tika. I am using the Tika 1.4 jar files.

My code looks like this:

Parser parser= new AutoDetectParser();
InputStream stream = new FileInputStream(fileAttachment);
int writerHandler =-1;
ContentHandler contentHandler= new BodyContentHandler(writerHandler);
Metadata metadata= new Metadata();
parser.parse(stream, contentHandler, metadata, new ParseContext());
String mimeType = metadata.get(Metadata.CONTENT_TYPE);
logger.debug("File Attachment: "+fileAttachment.getName()+" MimeType is: "+mimeType);

This code does not work properly for Office 2003 and 2007 documents.

When I run it from Eclipse, I get the correct MIME types.

When I build a jar file and run it from the command line, it gives the wrong MIME types.

Output from the command line
------------
File Attachment: Testpdf.pdf  MimeType is: application/pdf
File Attachment: Testpdf.tif  MimeType is: image/tiff
File Attachment: Testpdf.xlsx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.xltx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.pptx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.docx  MimeType is: application/x-tika-ooxml
File Attachment: Testpdf.xls  MimeType is: application/zip
File Attachment: Testpdf.doc  MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.dot  MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.ppt  MimeType is: application/x-tika-msoffice
File Attachment: Testpdf.xlt  MimeType is: application/vnd.ms-excel

I tried OfficeParser and OOXMLParser, but neither works. I also tried the Tika 0.9 jar files: the MIME types come out correctly, but if any one of my file attachments is an "editable PDF", my batch process dies (as if the code had called "exit(0);"). With the newer Tika jars, I get the wrong MIME types.

Please help me with this. Thanks in advance.

CVSR Sarma

1 Answer

Firstly, you're using the wrong bit of Apache Tika. If all you want to know is the file type, then you should use the Detection API (javadocs) directly, eg:

TikaConfig tika = new TikaConfig();

Metadata metadata = new Metadata();
metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, filename);
// detect() returns a MediaType, so convert it to get the MIME type string
String mimetype = tika.getDetector().detect(stream, metadata).toString();

If you have only the tika-core jar on your classpath, then the detection above will use Mime Magic and Filename hints. That'll let it get most files, especially if they have the right extension, but it'll struggle with wrongly named "container formats".

Container Formats are things like zip, ole2 etc, where one file format can hold many types (eg ods, xlsx and keynote all use .zip; .doc and .xls both use ole2). If you want to do detection that looks inside containers for more accurate results, you need to also include the tika-parsers-standard jar and its dependencies.
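
To illustrate why magic bytes alone can only name the container, here is a minimal JDK-only sketch. It is not Tika's implementation: `ContainerSniffer`, `byMagic`, `byContents` and `zipWith` are hypothetical names, and matching on well-known OOXML part names is a simplified stand-in for what a container-aware detector does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Tika.
class ContainerSniffer {
    private static final String OOXML = "application/vnd.openxmlformats-officedocument.";

    /** Magic bytes alone: a zip local file header starts with 'P' 'K' 0x03 0x04. */
    static String byMagic(byte[] data) {
        if (data.length >= 4 && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    /** Simplified heuristic (an assumption for illustration): look for well-known OOXML part names. */
    static String byContents(byte[] data) {
        String generic = byMagic(data);
        if (!generic.equals("application/zip")) {
            return generic;
        }
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                switch (entry.getName()) {
                    case "word/document.xml":    return OOXML + "wordprocessingml.document";
                    case "xl/workbook.xml":      return OOXML + "spreadsheetml.sheet";
                    case "ppt/presentation.xml": return OOXML + "presentationml.presentation";
                    default: break; // keep scanning the remaining entries
                }
            }
            return generic; // a zip, but no recognised Office part inside
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a small in-memory zip containing the given entry names (test data only). */
    static byte[] zipWith(String... names) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] fakeDocx = zipWith("[Content_Types].xml", "word/document.xml");
        System.out.println(byMagic(fakeDocx));    // only sees the zip container
        System.out.println(byContents(fakeDocx)); // sees the Word part inside
    }
}
```

The same idea applies to ole2 files, where the distinguishing information also sits inside the container rather than in the leading bytes.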

Note that, as explained in the Javadocs, your stream needs to support mark and reset for detection to work. This is so that Tika can read the first bit of your stream, look at it to work out what your file is, then return the stream to how it was ready for other uses (eg parsing). Most streams should, but if yours doesn't, the simplest way to fix it is to wrap it in a TikaInputStream via TikaInputStream.get, which sorts all that out for you.
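
The mark/reset requirement can be seen with plain JDK streams. The sketch below is only an illustration of the mechanism, not Tika code: `PeekDemo`, `peek` and `ensureMarkable` are hypothetical names, and `BufferedInputStream` stands in here for the wrapping that TikaInputStream would otherwise do.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika.
class PeekDemo {
    /** Reads up to n leading bytes, then rewinds so the caller still sees the whole stream. */
    static byte[] peek(InputStream in, int n) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(n); // remember the current position
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int read = in.read(buf, total, n - total);
                if (read < 0) {
                    break; // end of stream before n bytes
                }
                total += read;
            }
            in.reset(); // return to the marked position
            return Arrays.copyOf(buf, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Wraps streams that lack mark/reset support; leaves others untouched. */
    static InputStream ensureMarkable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    public static void main(String[] args) {
        byte[] data = "%PDF-1.4 example".getBytes(StandardCharsets.US_ASCII);
        // PushbackInputStream reports markSupported() == false, like FileInputStream does
        InputStream raw = new PushbackInputStream(new ByteArrayInputStream(data));
        InputStream in = ensureMarkable(raw);
        System.out.println(new String(peek(in, 4), StandardCharsets.US_ASCII));
        // Peeking again yields the same bytes because the stream was rewound
        System.out.println(new String(peek(in, 4), StandardCharsets.US_ASCII));
    }
}
```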

Gagravarr
  • @Gagravarr tika.getDetector().detect(stream, metadata); returns a MediaType. I tried that. It's not working. –  Mar 07 '14 at 07:32
  • Make sure you pass in the filename as shown, and if you want properly accurate results make sure you've included the tika-parser jar + dependencies on the classpath, exactly as the answer says... – Gagravarr Mar 07 '14 at 09:36
  • It works in Eclipse. Once I build the jar file and run it from the command prompt, it doesn't work. I downloaded the jars and dependencies using Maven and included all the jar files in the classpath. @Gagravarr –  Mar 07 '14 at 11:04
  • If it works in Eclipse, but not when run standalone, then the problem is you don't have the same jars on your classpath, no matter what you might think. Ensure they're really there, and you don't have any older ones in the way confusing things. Ask a new question if you don't know how to check what jars you've really got in use on your classpath – Gagravarr Mar 07 '14 at 13:45
  • @Gagravarr How to add a custom file type for tika to detect – kittu Jun 18 '15 at 05:22
  • @kittu You need to ask that as a new question, rather than trying to hijack other people's questions... – Gagravarr Jun 18 '15 at 05:27
  • @Gagravarr Sorry but I did that already and I didn't get an answer till now and unable to find much info on this so luckily found you answering about this topic on couple of posts. Here's my question: http://stackoverflow.com/questions/30895761/how-to-add-new-mime-type-to-apache-tika – kittu Jun 18 '15 at 05:53
  • I found I had to decorate the stream with new BufferedInputStream(stream) (if not already an instanceof BufferedInputStream) for this to work, otherwise I got 'java.io.IOException: mark/reset not supported' in some cases. – mjj1409 Aug 14 '15 at 00:53
  • @mjj1409 It's normally simpler to wrap it in a `TikaInputStream` instead, that handles all the things for you! The [javadocs detail](http://tika.apache.org/1.10/api/org/apache/tika/detect/Detector.html) the stream requirement for mark/reset, plenty of streams do support that (including TikaInputStream and BufferedInputStream) – Gagravarr Aug 14 '15 at 06:02
  • Since Tika 2.0.0 Metadata.RESOURCE_NAME_KEY has been renamed TikaCoreProperties.RESOURCE_NAME_KEY. Source: https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0 – Caponte Jan 28 '22 at 13:39
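
For the "works in Eclipse but not standalone" case discussed in the comments, one rough way to check which jars are really in use is to print the classpath and ask the JVM where each class was loaded from. This is only a sketch: `ClasspathCheck` and `locate` are hypothetical names, and the list of classes to probe is an assumption about what is worth checking (the Tika core config class and the two parsers named in the question).

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of Tika.
class ClasspathCheck {
    /** Returns the jar or directory a class was loaded from, or a marker if it is missing. */
    static String locate(String className) {
        try {
            // 'false' avoids running static initialisers while probing
            Class<?> type = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            return source == null ? "(JDK / bootstrap)" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "NOT FOUND";
        }
    }

    public static void main(String[] args) {
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        String[] probes = {
            "org.apache.tika.config.TikaConfig",                  // tika-core
            "org.apache.tika.parser.microsoft.OfficeParser",      // parsers jar
            "org.apache.tika.parser.microsoft.ooxml.OOXMLParser"  // parsers jar
        };
        for (String name : probes) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

Running this once from Eclipse and once from the command line with the same launch arguments as the batch job should show whether a jar is missing or whether an older copy is being picked up first.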