I'm using Tika to auto-detect the content type of documents being pushed into a DMS. Almost everything works fine except for emails.

I have to discriminate between standard mail messages (mime => message/rfc822) and signed mail messages (mime => multipart/signed) but all emails get detected as message/rfc822.

The signed mail that doesn't get detected correctly has the following content type header:

Content-Type: multipart/signed; protocol="application/x-pkcs7-signature"; micalg=sha1; boundary="----4898E6D8BDE1929CA602BE94D115EF4C"

The Java code I use for detection is:

Detector detector;
List<Detector> detectors = new ArrayList<Detector>();
detectors.add(new ZipContainerDetector());
detectors.add(new POIFSContainerDetector());
detectors.add(MimeTypes.getDefaultMimeTypes());
detector = new CompositeDetector(detectors);
String mimetype = detector.detect(TikaInputStream.get(new File(args[0])), new Metadata()).toString();

I'm referencing the core libraries and tika-parsers so that PDF and MS Word documents are detected as well. Am I missing something else?

Nicola
  • Did you try upgrading to the latest version of Apache Tika? – Gagravarr Dec 11 '14 at 02:59
  • Yes, version 1.6 - I didn't find the binaries, just the sources on Tika site that I compiled with a Maven update first. Is there a newer version than 1.6? – Nicola Dec 11 '14 at 16:14
  • 1.7 was due to be released a little while ago, but has been delayed. If you're already building from source with maven, just checkout the latest (trunk) code from svn / git and build that to try the very latest version! – Gagravarr Dec 11 '14 at 19:48
  • I will try and I will let you know the results. Thanks – Nicola Dec 12 '14 at 08:37
  • I tried as suggested with the latest version (1.7) but still I can't discriminate between "message/rfc822" and "multipart/signed". Am I missing some detector? – Nicola Dec 16 '14 at 13:37
  • Maybe not, but if you're not on the latest version you won't get much sympathy! Next step is to identify two very small and public files, then [raise an issue in the Apache Tika JIRA bug tracker](https://issues.apache.org/jira/browse/TIKA) and upload both test files. We can then take it from there! – Gagravarr Dec 16 '14 at 13:49
  • I'll try to get some sample emails and raise an issue. In the meantime I'm trying to write a custom detector using javax.mail. – Nicola Dec 17 '14 at 09:07
  • Once you've got some test files, please do create a new Tika JIRA and upload them! Your detector could be a good contribution too for that bug :) There's some related problems being tackled in [TIKA-879](https://issues.apache.org/jira/browse/TIKA-879), but I think this'll want to be a separate issue – Gagravarr Dec 24 '14 at 03:20
  • I will try to get some sample mails in the next few days and upload them. – Nicola Dec 31 '14 at 12:17

1 Answer

I resolved my problem by writing a custom detector that implements the Detector interface and uses javax.mail to read the message headers:

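import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import javax.mail.Session;
import javax.mail.internet.MimeMessage;

import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TemporaryResources;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class MultipartSignedDetector implements Detector {

  @Override
  public MediaType detect(InputStream is, Metadata metadata) throws IOException {
    // Nothing to inspect: report the generic fallback type.
    if (is == null) {
      return MediaType.OCTET_STREAM;
    }

    TemporaryResources tmp = new TemporaryResources();
    TikaInputStream tis = TikaInputStream.get(is, tmp);

    // Mark the stream so it can be rewound for the next detector.
    // Note: MimeMessage reads the whole message, so large inputs get buffered.
    tis.mark(Integer.MAX_VALUE);

    try {
      // Parsing headers needs no SMTP host, so use an isolated session
      // instead of modifying the JVM-wide system properties.
      Session session = Session.getInstance(new Properties());
      MimeMessage mimeMessage = new MimeMessage(session, tis);

      String contentType = mimeMessage.getContentType();
      if (contentType != null
          && mimeMessage.getMessageID() != null
          && contentType.toLowerCase().contains("multipart/signed")) {
        return new MediaType("multipart", "signed");
      }
      return MediaType.OCTET_STREAM;
    } catch (Exception e) {
      // Not a parsable mail message: let the other detectors decide.
      return MediaType.OCTET_STREAM;
    } finally {
      try {
        tis.reset();
        tmp.dispose();
      } catch (TikaException e) {
        // ignore
      }
    }
  }
}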

And then add the custom detector to the composite detector just before the default one:

Detector detector;
List<Detector> detectors = new ArrayList<Detector>();
detectors.add(new ZipContainerDetector());
detectors.add(new POIFSContainerDetector());

detectors.add(new MultipartSignedDetector());

detectors.add(MimeTypes.getDefaultMimeTypes());
detector = new CompositeDetector(detectors);
String mimetype = detector.detect(TikaInputStream.get(new File(args[0])), new Metadata()).toString();
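
For comparison, the header check at the core of the detector can be sketched with the standard library only, without javax.mail. This is a minimal sketch under stated assumptions, not a replacement for the detector above: `SignedMailSniffer` and its methods are hypothetical names (not part of Tika or JavaMail), the raw header text is assumed to be already decoded into a String, and only the top-level Content-Type header is examined. It unfolds continuation lines, because a header such as the one in the question may be split across several lines.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Tika or JavaMail. */
final class SignedMailSniffer {

    private SignedMailSniffer() {
    }

    /**
     * Returns the unfolded value of the top-level Content-Type header,
     * or null if the header block contains no such header.
     */
    static String topLevelContentType(String rawMessage) {
        StringBuilder current = null; // header line being accumulated
        for (String line : rawMessage.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                // Folded continuation line: append it to the current header.
                if (current != null) {
                    current.append(' ').append(line.trim());
                }
                continue;
            }
            // A new header starts, so the previous one is complete.
            String value = contentTypeValue(current);
            if (value != null) {
                return value;
            }
            current = new StringBuilder(line);
        }
        return contentTypeValue(current);
    }

    /** Returns the header value if the header is Content-Type, else null. */
    private static String contentTypeValue(StringBuilder header) {
        if (header == null) {
            return null;
        }
        String text = header.toString();
        int colon = text.indexOf(':');
        if (colon < 0) {
            return null;
        }
        String name = text.substring(0, colon).trim();
        if (!name.equalsIgnoreCase("Content-Type")) {
            return null;
        }
        return text.substring(colon + 1).trim();
    }

    /** True if the top-level Content-Type starts with multipart/signed. */
    static boolean isMultipartSigned(String rawMessage) {
        String value = topLevelContentType(rawMessage);
        return value != null
                && value.toLowerCase(Locale.ROOT).startsWith("multipart/signed");
    }

    public static void main(String[] args) {
        // Sample built from the header shown in the question, folded over two lines.
        String sample = "Message-ID: <1@example.com>\r\n"
                + "Content-Type: multipart/signed; protocol=\"application/x-pkcs7-signature\";\r\n"
                + " micalg=sha1; boundary=\"----4898E6D8BDE1929CA602BE94D115EF4C\"\r\n"
                + "\r\n"
                + "body";
        System.out.println(isMultipartSigned(sample));
    }
}
```

Unlike the detector above, this sketch does not require a Message-ID header and never looks at the body, so a Content-Type line that appears after the first blank line is ignored.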
Nicola
  • Rather than manually adding it as a new detector, wouldn't it be easier to add it to the auto-detected list? See [this for parsers](http://tika.apache.org/1.6/parser_guide.html#List_the_new_parser), but add to the detectors service file – Gagravarr Dec 18 '14 at 13:19