MS Word to XML/HTML using Apache Tika

Question

I came across Tika, which is very useful for extracting text from Word documents:

curl www.vit.org/downloads/doc/tariff.doc | java -jar tika-app-1.3.jar --text

But is there a way to use it to convert the MS Word file into XML/HTML?

score 1 · Accepted Answer · answered Apr 10 '13 at 09:15

Yes, it involves changing a whopping 4 characters in your command!

If you run java -jar tika-app-1.3.jar --help you'll get something that starts with:

usage: java -jar tika-app.jar [option...] [file|port...]

Options:
  -?  or --help          Print this usage message
  -v  or --verbose       Print debug level messages
  -V  or --version       Print the Apache Tika version number

  -g  or --gui           Start the Apache Tika GUI
  -s  or --server        Start the Apache Tika server
  -f  or --fork          Use Fork Mode for out-of-process extraction

  -x  or --xml           Output XHTML content (default)
  -h  or --html          Output HTML content
  -t  or --text          Output plain text content
  -T  or --text-main     Output plain text content (main content only)
  -m  or --metadata      Output only metadata
.....

From that, you'll see that if you change your --text option to --html or --xml, you'll get nicely formatted HTML or XHTML instead of just the plain text.
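
If you'd rather drive the same command from Java than from a shell, here is a minimal sketch using only the JDK's ProcessBuilder. The class name TikaCli, its helper methods, and the local file names tariff.doc and tariff.html are made up for illustration; the jar name and the --html flag are the ones from the command and help output above. It assumes the jar and the Word file are in the working directory.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Tika): runs tika-app as a child process.
final class TikaCli {

    // Builds the argument list; outputFlag is one of the flags listed by --help,
    // e.g. "--html", "--xml" or "--text".
    static List<String> command(String tikaJar, String outputFlag, String inputFile) {
        return Arrays.asList("java", "-jar", tikaJar, outputFlag, inputFile);
    }

    // Runs tika-app, writes its standard output to outputFile,
    // and returns the exit status of the child process.
    static int convert(String tikaJar, String outputFlag, String inputFile, File outputFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(tikaJar, outputFlag, inputFile))
                .redirectOutput(outputFile)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show Tika's errors on our stderr
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Assumed file names; adjust to wherever your jar and document live.
        int status = convert("tika-app-1.3.jar", "--html", "tariff.doc", new File("tariff.html"));
        System.out.println("tika-app exited with status " + status);
    }
}
```

Swapping "--html" for "--xml" in the call gives you the XHTML output instead.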

Thanks, but is there a way to preserve document structure (tables etc, within the html/xml)? — hmghaly, Apr 10 '13 at 09:23
For most of the file formats, it's already handled. Word is one of the ones where you'll get paragraphs / tables / style names etc. — Gagravarr, Apr 10 '13 at 09:51

score 1 · Answer 2 · answered Dec 05 '15 at 17:40

Although this has already been answered, the OP tagged the question with the java tag, so for completeness I'll add a reference showing how to do this in Java.

The TikaTest.java superclass from Tika's unit tests is the easiest reference for converting Word to HTML; see its getXML method. It's a pity that they saw the usefulness of such an API in writing their unit tests, but chose not to expose it as a handy tool, forcing everyone to deal with handlers etc., which is unfortunate boilerplate for the common use case.

If you [follow this example from the Tika website](http://tika.apache.org/1.11/examples.html#Parsing_to_XHTML), you'll see that getting XHTML is the same number of lines as getting plain text! — Gagravarr, Dec 05 '15 at 23:02
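
For readers who want to see what the handler-based approach discussed above looks like, here is a minimal sketch in the spirit of the linked example. It assumes tika-core and the Tika parsers are on the classpath; the class name WordToXhtml and the method toXhtml are made up for illustration and are not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical wrapper class; only the org.apache.tika types come from Tika itself.
final class WordToXhtml {

    // Parses the stream with Tika's auto-detecting parser and returns the
    // XHTML that Tika emits, serialized as a string.
    static String toXhtml(InputStream stream) throws IOException, SAXException, TikaException {
        ContentHandler handler = new ToXMLContentHandler(); // collects SAX events as XML text
        AutoDetectParser parser = new AutoDetectParser();   // picks a parser from the detected type
        parser.parse(stream, handler, new Metadata(), new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java WordToXhtml path/to/document.doc
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(toXhtml(stream));
        }
    }
}
```

Replacing ToXMLContentHandler with Tika's BodyContentHandler should give plain text instead, which is why the two cases end up about the same length.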
