I have written an application that must parse and extract data from a few thousand large docx files. It will run on a high-performance production server with many CPUs, a large amount of RAM, and fast SSDs in RAID arrays, so I want to make full use of the available hardware.
I found that my application does all of its other work correctly in many concurrent threads, but it fails when it parses many docx files concurrently using the docx4j library. Moreover, the library cannot safely handle more than one instance of the WordprocessingMLPackage class (which holds the data from a docx file) when those instances are used in separate threads.
Googling and examining the library's source code confirm that it is not thread-safe at all (for example, its classes contain many static fields and instances that cannot be used concurrently).
So I have some questions to ask:
- Are there any other libraries with the same capabilities that are guaranteed to be thread-safe?
- Can I run my workers in separate processes instead of separate threads to work around this issue? If so, how badly will that hurt the performance of my application?
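
To make the second question concrete, here is a minimal sketch of the multi-process approach I have in mind. It uses only the standard library. The names `DocxBatchLauncher` and `DocxWorker` are hypothetical: `DocxWorker` stands for a separate main class (not shown) that would parse the files passed to it as arguments, one after another in a single thread, so each JVM gets its own copy of the library's static state.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DocxBatchLauncher {

    // Distribute the files round-robin into at most `workers` batches.
    // Returns an empty list when there are no files.
    static List<List<String>> partition(List<String> files, int workers) {
        int n = Math.min(Math.max(1, workers), files.size());
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            batches.add(new ArrayList<>());
        }
        for (int i = 0; i < files.size(); i++) {
            batches.get(i % n).add(files.get(i));
        }
        return batches;
    }

    // Build the command line for one child JVM that reuses the parent's
    // Java installation and classpath. `mainClass` is the hypothetical worker.
    static List<String> buildCommand(String mainClass, List<String> batch) {
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBin);
        cmd.add("-cp");
        cmd.add(System.getProperty("java.class.path"));
        cmd.add(mainClass);
        cmd.addAll(batch);
        return cmd;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        List<String> files = Arrays.asList(args);
        int workers = Runtime.getRuntime().availableProcessors();

        // Start one child JVM per batch; the children run in parallel.
        List<Process> running = new ArrayList<>();
        for (List<String> batch : partition(files, workers)) {
            ProcessBuilder pb = new ProcessBuilder(buildCommand("DocxWorker", batch));
            pb.inheritIO(); // forward the children's stdout/stderr to this console
            running.add(pb.start());
        }

        // Wait for every child and report failures.
        for (Process p : running) {
            int exitCode = p.waitFor();
            if (exitCode != 0) {
                System.err.println("A worker exited with code " + exitCode);
            }
        }
    }
}
```

With a few thousand files, passing every path as a command-line argument could hit the operating system's argument length limit, so a real version would probably write each batch to a list file and pass that file's path instead. What I cannot judge from this sketch is the cost of JVM startup and of the extra memory used by each process, which is the performance part of my question.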