
I want to use Apache Tika at enterprise scale, for a large number of very large documents. Which should I use: Tika Server, Tika App, or direct Java calls? Can you suggest a system architecture (e.g., 3-4 load-balanced Tika instances on physically separate servers)?

1 Answer


Sending thousands of 0.5 GB documents over HTTP, one PUT call at a time to a REST endpoint, is not an appropriate scenario for Tika Server. It will not be memory efficient, and the server will likely crash because of a memory leak or some other bug.

That said, as of v1.19 Tika Server has a -spawnChild option that periodically restarts the process after it has processed -maxFiles files. From v2.x, this is the default.

For your needs, you should simply use the tika-app in batch mode, which:

  • Runs locally, using an input and output directory that you specify
  • Sets up parent/child processes to robustly handle hangs/OOMEs
  • Runs multiple parser threads in parallel
  • Can restart the child process every x minutes or after y files to avoid memory leaks
  • Logs failures

java -jar tika-app.jar -i <input_directory> -o <output_dir>
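
If you need to start the batch command above from your own Java code (for example, from a scheduler on each machine), one option is to launch it as a separate process rather than embedding the parsers in your JVM. The following is only a minimal sketch under stated assumptions: the class name TikaBatchLauncher and its methods are hypothetical, it assumes a `java` executable is on the PATH, and it uses only the -i and -o options shown above.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper (not part of Tika): runs tika-app in batch mode
// as an external process so that a crash or OOM in Tika cannot take
// down the calling JVM.
class TikaBatchLauncher {

    // Builds the same command line as the one shown in the answer.
    // Assumes "java" resolves via the PATH of the calling process.
    static List<String> buildCommand(Path tikaAppJar, Path inputDir, Path outputDir) {
        return List.of(
                "java", "-jar", tikaAppJar.toString(),
                "-i", inputDir.toString(),
                "-o", outputDir.toString());
    }

    // Starts the batch run, forwards its console output, and blocks
    // until it exits. Returns the exit code of the tika-app process.
    static int run(Path tikaAppJar, Path inputDir, Path outputDir)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(tikaAppJar, inputDir, outputDir));
        builder.inheritIO(); // show tika-app's stdout/stderr in this console
        Process process = builder.start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.out.println("usage: TikaBatchLauncher <tika-app.jar> <input_dir> <output_dir>");
            return;
        }
        int exitCode = run(Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]));
        System.out.println("tika-app exited with code " + exitCode);
    }
}
```

To spread the load over several machines with this approach, you could give each machine its own input directory and run one such batch process per machine; how the documents are split across directories is left open here.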
Amit Naidu