What is the difference between Tika App, Tika Server, and the Java wrapper? Which one should be used, and when?

Question

I want to use Apache Tika at enterprise scale, for a large number of very large documents. Which one should I use: Tika Server, Tika App, or direct Java calls? Can you suggest a system architecture (e.g. 3-4 load-balanced Tika servers on physically separate machines)?

How much of the processing will be Tika? And how much of a problem will it be when your JVM crashes or hangs? — Gagravarr, Mar 01 '18 at 22:23
Approximately 5,000 documents per day, each about 500 MB in size. — ismail josh, Mar 01 '18 at 22:25
500 MB documents are rather large. How much info are you expecting out of them? — Gagravarr, Mar 02 '18 at 11:41
It depends on the file. In general, about 50 MB of text will be extracted. — ismail josh, Mar 02 '18 at 13:07

score 0 · Answer 1 · answered Oct 11 '21 at 22:58

Sending thousands of 0.5 GB documents over HTTP, one PUT call to a REST endpoint at a time, is not an appropriate scenario for Tika Server. It will not be memory efficient, and the server will likely crash because of a memory leak or a bug.

That said, as of v1.19 there is a -spawnChild option that periodically restarts the server process after it has processed -maxFiles files. From v2.x, this is the default.
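For reference, the per-document PUT call described above can be sketched with the JDK's built-in HTTP client (Java 11+). This is only an illustration, not a recommendation: it assumes a Tika Server running locally on its default port 9998 and uses the /tika endpoint with plain-text output. The class name, method name, and command-line arguments are made up for the example.

```java
import java.io.FileNotFoundException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.nio.file.Paths;

class TikaServerPutSketch {
    // Assumption: Tika Server is listening on its default port on this machine.
    static final String TIKA_ENDPOINT = "http://localhost:9998/tika";

    // Builds a PUT request whose body is read from the file on disk
    // instead of being loaded into a byte array first.
    static HttpRequest buildRequest(Path document) throws FileNotFoundException {
        return HttpRequest.newBuilder(URI.create(TIKA_ENDPOINT))
                .header("Accept", "text/plain") // ask for extracted text only
                .PUT(HttpRequest.BodyPublishers.ofFile(document))
                .build();
    }

    public static void main(String[] args) throws Exception {
        Path document = Paths.get(args[0]); // file to send
        Path output = Paths.get(args[1]);   // where to write the extracted text

        HttpClient client = HttpClient.newHttpClient();
        // Write the response body straight to a file rather than holding it in memory.
        HttpResponse<Path> response =
                client.send(buildRequest(document), HttpResponse.BodyHandlers.ofFile(output));
        System.out.println("HTTP " + response.statusCode() + " -> " + response.body());
    }
}
```

Even with the client reading and writing files directly, each 0.5 GB document still has to cross HTTP and be parsed inside the server's JVM, which is where the concerns above come from.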

For your needs, you should simply use the tika-app in batch mode, which:

Runs locally, using an input and output directory that you specify
Sets up parent/child processes to robustly handle hangs and OutOfMemoryErrors (OOMEs)
Runs multiple parser threads in parallel
Can restart the child process every x minutes or after y files to avoid memory leaks
Logs failures

java -jar tika-app.jar -i <input_directory> -o <output_dir>
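If the batch run needs to be started from Java code, for example from a scheduler, one possible approach is to launch the same command as a separate process. The sketch below is a minimal illustration under that assumption. It uses only the -i and -o options shown above, and the class name, method name, and argument order are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

class TikaBatchLauncher {
    // Builds the same command line as above: java -jar tika-app.jar -i <in> -o <out>
    static List<String> batchCommand(Path tikaAppJar, Path inputDir, Path outputDir) {
        return List.of("java", "-jar", tikaAppJar.toString(),
                "-i", inputDir.toString(),
                "-o", outputDir.toString());
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // args: path to tika-app.jar, input directory, output directory
        List<String> command =
                batchCommand(Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]));

        // Run tika-app in its own JVM so a crash or hang there cannot take down the caller.
        Process process = new ProcessBuilder(command)
                .inheritIO() // forward tika-app's stdout/stderr to this console
                .start();
        int exitCode = process.waitFor();
        System.out.println("tika-app exited with code " + exitCode);
    }
}
```

Keeping Tika in a separate process from the calling application follows the same reasoning as the parent/child setup in batch mode: parser failures stay isolated from the code that schedules the work.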
