Is there a way to configure Apache Tika to parse data in chunks? Say the data is divided into 10 chunks. Can Tika parse each chunk as it arrives, or can it only parse once it has received all 10 chunks?

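import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.SAXException;

public OutputStream parse(InputStream instream) {
    OutputStream outstream = new ByteArrayOutputStream();
    // Collects the SAX events emitted by the parser as an XML string.
    ToXMLContentHandler h = new ToXMLContentHandler();
    AutoDetectParser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    Metadata metadata = new Metadata();
    XHTMLContentHandler h1 = new XHTMLContentHandler(h, metadata);
    try {
        // Returns only after the whole input stream has been consumed.
        parser.parse(instream, h1, metadata, context);
        outstream.write(h1.toString().getBytes(StandardCharsets.UTF_8));
    } catch (TikaException | SAXException | IOException e) {
        e.printStackTrace();
    }
    return outstream;
}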

Any ideas on this?

Dusan Bajic
  • Apache Tika takes an InputStream, that's the interface. So, write your own chunk-based input stream? – Gagravarr Jan 11 '19 at 04:09
  • With a custom stream that receives data in chunks, how would it work when the code gets 5 chunks at once and then there is a lag of, say, 5 seconds before the next chunks? I don't want it to wait for all the data before parsing, but rather to start parsing with whatever it has. –  Jan 11 '19 at 21:44
  • Most file formats need the whole file before they can be parsed, as they require random access. For those that don't, just have it start parsing what you have from a custom input stream, get the SAX events from that, and block on read until you have more – Gagravarr Jan 12 '19 at 09:52
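
The blocking stream suggested in the comments could look roughly like the sketch below. It is only an illustration, not part of Tika: the class name ChunkedInputStream and its addChunk/finish methods are hypothetical, and it uses nothing beyond the JDK. A producer thread hands over chunks as they arrive, and read blocks only while no data is available, so a consumer can start on the first chunk without waiting for the rest. The JDK's PipedInputStream/PipedOutputStream pair is a ready-made alternative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical helper: an InputStream fed chunk by chunk from another thread. */
class ChunkedInputStream extends InputStream {
    // Sentinel array marking end of data; compared by identity only.
    private static final byte[] EOF = new byte[0];

    private final BlockingQueue<byte[]> chunks = new LinkedBlockingQueue<>();
    private byte[] current = new byte[0];
    private int pos = 0;
    private boolean finished = false;

    /** Producer side: hand over a chunk as soon as it arrives. */
    public void addChunk(byte[] chunk) {
        if (chunk.length > 0) {
            chunks.add(chunk.clone());
        }
    }

    /** Producer side: signal that no more chunks will follow. */
    public void finish() {
        chunks.add(EOF);
    }

    /** Blocks until unread data is available; returns false at end of data. */
    private boolean fill() throws IOException {
        while (!finished && pos >= current.length) {
            try {
                byte[] next = chunks.take();
                if (next == EOF) {
                    finished = true;
                } else {
                    current = next;
                    pos = 0;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("Interrupted while waiting for a chunk", e);
            }
        }
        return pos < current.length;
    }

    @Override
    public int read() throws IOException {
        return fill() ? (current[pos++] & 0xFF) : -1;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (!fill()) {
            return -1;
        }
        // Return what the current chunk still holds instead of waiting for more.
        int n = Math.min(len, current.length - pos);
        System.arraycopy(current, pos, buf, off, n);
        pos += n;
        return n;
    }
}
```

To use it with Tika, one would presumably call parser.parse(stream, handler, metadata, context) on a consumer thread while another thread calls addChunk and finally finish. As the last comment notes, this only helps for formats that can be parsed sequentially; a parser that needs random access will still read to the end of the stream before producing output.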
