I happened to know Tika, very useful in text extraction from word:
curl www.vit.org/downloads/doc/tariff.doc \ | java -jar tika-app-1.3.jar --text
But is there a way to use it to convert the Ms Word file into XML/HTML?
I happened to know Tika, very useful in text extraction from word:
curl www.vit.org/downloads/doc/tariff.doc \ | java -jar tika-app-1.3.jar --text
But is there a way to use it to convert the Ms Word file into XML/HTML?
Yes, it involves changing a whooping 4 characters in your command!
If you run java -jar tika-app-1.3.jar --help
you'll get something that starts with:
usage: java -jar tika-app.jar [option...] [file|port...]
Options:
-? or --help Print this usage message
-v or --verbose Print debug level messages
-V or --version Print the Apache Tika version number
-g or --gui Start the Apache Tika GUI
-s or --server Start the Apache Tika server
-f or --fork Use Fork Mode for out-of-process extraction
-x or --xml Output XHTML content (default)
-h or --html Output HTML content
-t or --text Output plain text content
-T or --text-main Output plain text content (main content only)
-m or --metadata Output only metadata
.....
From that, you'll see that if you change your --text
option to --html
or --xml
you'll get out nicely formatted XML instead of just the plain text
Despite the fact that this has been answered, since the op tagged the question with the java tag, for completeness I'll add reference to easily see how to do this in java.
The TikaTest.java superclass from Tika's unit tests is the easiest reference to convert word to html using the getXML method. It's a pity that they saw the usefulness of such an API in writing their unit tests, but chose not to expose it as a handy tool, forcing everyone to deal with handlers etc. which is unfortunate boilerplate for the common use case.