We have some very old .doc documents. Normally we use Tika (our application does a text extraction followed by a PDF/A conversion), but apparently msword2 and msword5 are not currently supported. The only alternative I have found is the LibreOffice command line. Is there anything else?

Searching for this is quite hard, since everyone else seems to mean "old" as in 1995 or later, not before 1991.

Zanndorin
  • You could use a Word macro to convert your older .doc format files to .docx format before processing them with your tika software. You may need to install the Word 6 converter and/or the Word DOS converter available from http://www.gmayor.com/downloads.htm – macropod May 16 '18 at 05:02

1 Answer

We have looked into the issue a bit more, and it seems the only answer is to use some version of the libwps library (the same library LibreOffice uses).

We will weigh the pros and cons of calling the LibreOffice command line versus using the library directly, and will probably wrap whichever we choose in a microservice for our application to call.
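
For the command-line route, a minimal sketch of what such a wrapper might look like is below. It assumes a `soffice` binary on the PATH and uses LibreOffice's `--headless`, `--convert-to` and `--outdir` options; the function names, the default `docx` target and the timeout are hypothetical choices for illustration, and whether a given Word 2 or Word 5 file actually imports cleanly has to be checked against your own documents.

```python
import subprocess
from pathlib import Path


def build_convert_command(src, out_dir, target_format="docx", soffice="soffice"):
    """Build the argument list for a headless LibreOffice conversion.

    `soffice` is assumed to be on the PATH; pass a full path otherwise.
    """
    return [
        soffice,
        "--headless",
        "--convert-to", target_format,
        "--outdir", str(out_dir),
        str(src),
    ]


def convert(src, out_dir, target_format="docx", timeout=120):
    """Run the conversion and return the path of the converted file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        build_convert_command(src, out_dir, target_format),
        check=True,
        timeout=timeout,
    )
    # The target may carry an export filter ("pdf:writer_pdf_Export");
    # only the part before the colon is the file extension.
    extension = target_format.split(":")[0]
    target = out_dir / f"{Path(src).stem}.{extension}"
    # Do not rely on the exit code alone: verify that a file was written.
    if not target.exists():
        raise RuntimeError(f"LibreOffice produced no output for {src}")
    return target
```

Calling `convert("old.doc", "converted")` would then yield `converted/old.docx`, which can go through the existing Tika text extraction and PDF/A steps. Keeping the command construction separate from the subprocess call makes it easy to unit test without LibreOffice installed.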

Zanndorin