
I have written an application that must parse and retrieve data from a few thousand large docx files. It will run on a high-performance production server with many CPUs, a large amount of RAM, and fast SSDs in RAID arrays, so I want to make full use of the available hardware.

I found that my application handles all of its other work correctly in many concurrent threads, but it fails when parsing many docx files concurrently with the docx4j library. Moreover, the library cannot safely handle more than one instance of the WordprocessingMLPackage class (which holds the data from a docx file) when the instances are used in separate threads.

Googling and reading the library's source code confirm that it is not thread-safe at all (for example, its classes contain many static fields and instances that cannot be used concurrently).

So I have some questions to ask:

  • Are there any other libraries with the same capabilities that are guaranteed to be thread-safe?
  • Can I launch my workers in separate processes instead of separate threads to work around this issue? If so, how much will that reduce the performance of my application?
user1764823
  • The objective is to be thread safe. Any issues will be addressed. What in particular are you having problems with? – JasonPlutext Jun 05 '13 at 22:21
  • Could one create multiple 'instances' of the library (each with its own isolated static fields) by leveraging classloaders? I'm no CL expert but sounds feasible – deprecated Jun 06 '13 at 07:39
  • @JasonPlutext I had some problems but I found the solution already. – user1764823 Jun 07 '13 at 08:19
  • Please describe your problems more fully and solution for the benefit of anyone who finds this thread. Thank you. – JasonPlutext Jun 07 '13 at 23:44

1 Answer


I don't know of an alternative thread-safe library.

Launching your workers in separate processes is a viable workaround. The startup cost is higher than with separate threads, but it probably won't be significant if you have a large number of files to process. You'll need some way for the processes to coordinate. One option is Redis: use SETNX to atomically add a file name to the key-value store; if the set succeeds, the worker can process the file, and if it fails, another process is already working on that file. Another option is to have a manager process assign files to the worker processes via sockets.
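
As a rough illustration of the manager-process idea, here is a minimal sketch that splits the file list up front and starts one child JVM per batch. Note that it hands each worker its files as command-line arguments rather than over sockets or through Redis, which is a simpler variant than the ones described above and only works when the whole file list is known in advance. The class name `DocxWorkerLauncher` and the worker main class `DocxWorker` (which would do the actual docx4j parsing) are hypothetical names chosen for this example.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DocxWorkerLauncher {

    // Split the file list round-robin into one batch per worker process.
    static List<List<String>> partition(List<String> files, int workers) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            batches.add(new ArrayList<>());
        }
        for (int i = 0; i < files.size(); i++) {
            batches.get(i % workers).add(files.get(i));
        }
        return batches;
    }

    // Build the command line for one child JVM, reusing the parent's
    // java binary and classpath. workerMainClass is an assumption: it must
    // name a class with a main method that parses the files it is given.
    static List<String> buildCommand(String workerMainClass, List<String> batch) {
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBin);
        cmd.add("-cp");
        cmd.add(System.getProperty("java.class.path"));
        cmd.add(workerMainClass);
        cmd.addAll(batch);
        return cmd;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        int workers = Runtime.getRuntime().availableProcessors();
        List<Process> children = new ArrayList<>();
        for (List<String> batch : partition(Arrays.asList(args), workers)) {
            if (batch.isEmpty()) {
                continue; // fewer files than workers
            }
            // "DocxWorker" is a hypothetical worker class, not part of docx4j.
            ProcessBuilder pb = new ProcessBuilder(buildCommand("DocxWorker", batch));
            children.add(pb.inheritIO().start());
        }
        // Wait for every child and report failures.
        for (Process p : children) {
            int exit = p.waitFor();
            if (exit != 0) {
                System.err.println("worker exited with code " + exit);
            }
        }
    }
}
```

Each child JVM has its own copy of the library's static state, so the workers cannot interfere with one another. With thousands of files the argument list may exceed the operating system's command-line length limit; in that case the manager could write each batch to a file and pass only that file's path to the worker.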

Zim-Zam O'Pootertoot