Apache Tika: Parsing only metadata without content extraction

Asked Feb 08 '12 at 10:43

I'm using Apache Tika to extract metadata from documents. I'm mostly interested in populating a basic set of Dublin Core fields such as author, title, and date, and I'm not interested in the content of the documents at all. Currently I'm simply doing the usual thing:

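 import java.io.FileInputStream;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.parser.AutoDetectParser;
 import org.apache.tika.sax.BodyContentHandler;
 import org.xml.sax.ContentHandler;

 FileInputStream fis = new FileInputStream(uploadedFileLocation);
 // Tika parsing: BodyContentHandler buffers the extracted body text in memory
 Metadata metadata = new Metadata();
 ContentHandler handler = new BodyContentHandler();
 AutoDetectParser parser = new AutoDetectParser();
 parser.parse(fis, handler, metadata);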

Is there some way to tell Tika to not parse the content? I'm hoping that this will speed things up as well as save memory.

asked Feb 08 '12 at 10:43

pokita

I think what you might want is the proposed enhancement in [TIKA-694](https://issues.apache.org/jira/browse/TIKA-694) - is that correct? – Gagravarr Feb 08 '12 at 12:38
Yes, exactly. So this pretty much answers my question... I'm now just wondering what would be the best method to simply pipe the content output to /dev/null, so that it isn't saved in the first place. – pokita Feb 08 '12 at 12:40
To answer myself here: I now use a BodyContentHandler with a NullOutputStream, so that the content simply gets thrown away. – pokita Feb 08 '12 at 13:44
You could also use the org.xml.sax.helpers.DefaultHandler base class from the JDK as a dummy handler that just ignores all extracted content. – Jukka Zitting Feb 08 '12 at 15:14
Yes, thanks. That's even easier. :-) – pokita Feb 08 '12 at 15:30
See also http://apache-tika-users.1629097.n2.nabble.com/How-to-extract-only-metadata-td5416074.html – afruzan May 10 '17 at 11:09
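
To illustrate why the plain DefaultHandler suggested in the comments works as a sink: a parser hands extracted text to the handler as SAX events, and DefaultHandler's callbacks are empty, so nothing is retained. The sketch below is only an illustration under stated assumptions: it uses the JDK's own SAX parser as a stand-in for Tika so that it runs without extra dependencies, and the class names DiscardingHandlerDemo and CollectingHandler and the helper feed are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class DiscardingHandlerDemo {

    /** Hypothetical handler that keeps all character data, like a content-capturing handler would. */
    static class CollectingHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Sends the document's SAX events to the handler (JDK parser used as a stand-in for Tika). */
    static void feed(String xml, DefaultHandler handler) throws Exception {
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Lots of body text</p></body></html>";

        // A collecting handler accumulates every character event in memory.
        CollectingHandler collecting = new CollectingHandler();
        feed(xhtml, collecting);
        System.out.println("collected chars: " + collecting.text.length());

        // DefaultHandler's callbacks do nothing, so the same events are simply dropped.
        feed(xhtml, new DefaultHandler());
        System.out.println("plain DefaultHandler retained nothing");
    }
}
```

Applied to the code in the question, the corresponding change would be to create the handler with `ContentHandler handler = new DefaultHandler();` and pass it to `parser.parse(fis, handler, metadata)` as before. The metadata object is still filled in, while the body events are ignored instead of being buffered.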
