
Is there a way to configure Apache Tika so that it extracts only the metadata properties from a file and does not access the file's content? We need this to avoid reading the entire content of large files.

The extraction code we are using is as follows:

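        var tikaConfig = TikaConfig.getDefaultConfig();
        var metadata = new Metadata();

        // Pick the parser automatically from the detected file type.
        AutoDetectParser parser = new AutoDetectParser(tikaConfig);
        // Collects the body text, which we do not actually need.
        BodyContentHandler handler = new BodyContentHandler();

        using (TikaInputStream stream = TikaInputStream.get(new File(filename), metadata))
        {
            // Fills in the metadata, but also parses the content.
            parser.parse(stream, handler, metadata, new ParseContext());

            // Sort the metadata property names for stable output.
            Array metadataKeys = metadata.names();
            Array.Sort(metadataKeys);
        }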

With the code above, the file content is read as well whenever we extract the metadata. We need a way to avoid that.

Venki
  • There's [an open Tika JIRA for this - TIKA-1351](https://issues.apache.org/jira/browse/TIKA-1351) - did you have a read of that and see the current workaround? – Gagravarr Jun 15 '16 at 10:51
  • I read the thread, and it suggests that content parsing can be excluded only for some parsers, such as PDF, not for every type. Can you please elaborate on the workaround you are suggesting? Thanks! – Venki Jun 16 '16 at 08:37
  • I suppose you are suggesting creating some DummyContentHandler that ignores the content during parsing? – Venki Jun 16 '16 at 09:15
  • Our expectation is that Tika does not read the content at all during metadata extraction. Is there a way to do that? We do not want to create a BodyContentHandler with a null output stream, since in that case the content is still read. – Venki Jun 16 '16 at 10:06
  • Did you find any solution for extracting only the metadata without reading the contents? – Santhosh May 04 '20 at 10:17
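
To illustrate the concern raised in the comments above, here is a minimal sketch showing that a handler which discards everything does not by itself stop a parser from reading the input. It assumes the JDK's built-in SAX parser as a stand-in for Tika, so it runs without Tika on the classpath. `CountingInputStream`, `bytesConsumed` and the sample document are hypothetical names made up for this demo. `org.xml.sax.helpers.DefaultHandler` is a real no-op `ContentHandler` and can also be passed to Tika's `parse` in place of `BodyContentHandler`, but how much of the file Tika then reads depends on the individual parser.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class DiscardingHandlerDemo {

    /** Hypothetical helper: counts how many bytes the parser pulls from the stream. */
    static final class CountingInputStream extends FilterInputStream {
        long bytesRead = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b >= 0) bytesRead++;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) bytesRead += n;
            return n;
        }

        // Disable mark/reset so no byte can be counted twice.
        @Override
        public boolean markSupported() {
            return false;
        }
    }

    /** Parses the document with a no-op handler and reports how many bytes were read. */
    static long bytesConsumed(byte[] document) throws Exception {
        CountingInputStream in = new CountingInputStream(new ByteArrayInputStream(document));
        // DefaultHandler ignores every SAX event, so no content is kept anywhere.
        SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler());
        return in.bytesRead;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: a small metadata element followed by a large body.
        StringBuilder xml = new StringBuilder("<doc><meta name=\"author\">Venki</meta><body>");
        for (int i = 0; i < 10000; i++) {
            xml.append("lorem ipsum ");
        }
        xml.append("</body></doc>");
        byte[] bytes = xml.toString().getBytes(StandardCharsets.UTF_8);

        System.out.println(bytesConsumed(bytes) + " of " + bytes.length + " bytes read");
    }
}
```

The handler only decides what happens to the parse events. The parser still has to pull the bytes from the stream to produce those events, which is why a discarding handler saves memory but not I/O.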

0 Answers