I want to use Apache Tika at enterprise scale, for a large number of very large documents. Which should I use: Tika Server, Tika App, or direct Java calls? Can you suggest a system architecture (e.g. 3-4 physically separate, load-balanced Tika servers)?
- How much of the processing will be Tika? And how much of a problem will it be when your JVM crashes or hangs? – Gagravarr Mar 01 '18 at 22:23
- Approximately 5000 documents daily, each about 500 MB in size. – ismail josh Mar 01 '18 at 22:25
- 500 MB documents are rather large; how much info are you expecting out of them? – Gagravarr Mar 02 '18 at 11:41
- It depends on the file. In general, about 50 MB of text will be extracted. – ismail josh Mar 02 '18 at 13:07
1 Answer
Sending thousands of 0.5 GB documents over HTTP, one at a time, via PUT calls to a REST endpoint is not an appropriate use of Tika Server. It will not be memory efficient, and the server will likely crash eventually because of memory leaks or parser bugs.
That said, as of v1.19 there is a -spawnChild option that periodically restarts the server process after it has handled -maxFiles files. From v2.x, this is the default.
For your needs, you should simply use the tika-app in batch mode, which:
- Runs locally, using an input and output directory that you specify
- Sets up parent/child processes to robustly handle hangs/OOMEs
- Runs multiple parser threads in parallel
- Can restart child every x minutes or after y files to avoid memory leaks
- Logs failures
java -jar tika-app.jar -i <input_directory> -o <output_dir>
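
If you want to start that batch job from your own Java code (for example from a scheduler running on each machine), a minimal sketch could look like the following. It only wraps the command shown above using the JDK's ProcessBuilder; the class name TikaBatchLauncher, its method names, and the jar and directory paths are hypothetical, and no flags beyond -i and -o are assumed.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

public class TikaBatchLauncher {

    // Builds the command line from the answer: java -jar <jar> -i <input> -o <output>.
    public static List<String> buildCommand(Path tikaAppJar, Path inputDir, Path outputDir) {
        return List.of(
                "java", "-jar", tikaAppJar.toString(),
                "-i", inputDir.toString(),
                "-o", outputDir.toString());
    }

    // Starts tika-app as a separate process, so a hang or OutOfMemoryError
    // inside Tika cannot take down the calling JVM. Returns the exit code.
    public static int runBatch(Path tikaAppJar, Path inputDir, Path outputDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(tikaAppJar, inputDir, outputDir))
                .inheritIO() // forward tika-app's console output to this process
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            // Without arguments, only show what would be executed (placeholder paths).
            System.out.println("usage: TikaBatchLauncher <tika-app.jar> <input_dir> <output_dir>");
            System.out.println(String.join(" ",
                    buildCommand(Path.of("tika-app.jar"), Path.of("input"), Path.of("output"))));
            return;
        }
        int exitCode = runBatch(Path.of(args[0]), Path.of(args[1]), Path.of(args[2]));
        System.out.println("tika-app batch exited with code " + exitCode);
    }
}
```

Running Tika in its own process keeps the caller isolated from parser crashes, which is the same reason batch mode uses parent/child processes internally.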

Amit Naidu