I'm using Tika to auto-detect the content type of documents being pushed into a DMS. Almost everything works fine except for emails.
I have to distinguish between standard mail messages (MIME type message/rfc822) and signed mail messages (MIME type multipart/signed), but all emails are detected as message/rfc822.
The signed mail that doesn't get detected correctly has the following content type header:
Content-Type: multipart/signed; protocol="application/x-pkcs7-signature"; micalg=sha1; boundary="----4898E6D8BDE1929CA602BE94D115EF4C"
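For illustration, the distinction described above can be made by inspecting the top-level Content-Type header directly. The sketch below is plain Java with no Tika involved, so it is a possible fallback check rather than a fix for the detector. The class name `TopLevelContentType` and its `of` methods are hypothetical. It assumes the header section ends at the first empty line and that folded header lines start with a space or tab.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;

// Hypothetical helper, not part of Tika: reads the top-level Content-Type
// of a raw mail message without touching the body.
final class TopLevelContentType {

    private static final String FIELD = "Content-Type:";

    // Returns the lower-cased media type (e.g. "multipart/signed"),
    // or null when the header section has no Content-Type field.
    static String of(BufferedReader reader) throws IOException {
        StringBuilder value = null;
        String line;
        // An empty line separates the headers from the body.
        while ((line = reader.readLine()) != null && !line.isEmpty()) {
            boolean continuation = Character.isWhitespace(line.charAt(0));
            if (value != null) {
                if (!continuation) {
                    break; // the Content-Type field is complete
                }
                // Folded line: it belongs to the Content-Type field.
                value.append(' ').append(line.trim());
            } else if (!continuation
                    && line.regionMatches(true, 0, FIELD, 0, FIELD.length())) {
                value = new StringBuilder(line.substring(FIELD.length()).trim());
            }
        }
        if (value == null) {
            return null;
        }
        // Drop the parameters (protocol, micalg, boundary, ...).
        int semicolon = value.indexOf(";");
        String type = semicolon < 0 ? value.toString() : value.substring(0, semicolon);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    // Convenience overload for a message that is already held in memory.
    static String of(String rawMessage) {
        try {
            return of(new BufferedReader(new StringReader(rawMessage)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A file that Tika reports as message/rfc822 could then be re-checked with something like `"multipart/signed".equals(TopLevelContentType.of(reader))`, where `reader` is a `BufferedReader` opened on the same file.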
The Java code I use for detection is:
// Build a composite detector: container-aware detectors first, then the default MIME type registry.
List<Detector> detectors = new ArrayList<Detector>();
detectors.add(new ZipContainerDetector());
detectors.add(new POIFSContainerDetector());
detectors.add(MimeTypes.getDefaultMimeTypes());
Detector detector = new CompositeDetector(detectors);
// Detect the media type of the file passed as the first command-line argument.
String mimetype = detector.detect(TikaInputStream.get(new File(args[0])), new Metadata()).toString();
I'm referencing the core libraries and tika-parsers so that PDF and MS Word documents are detected as well. Am I missing something else?