I came across Tika, which is very useful for extracting text from Word documents:

curl www.vit.org/downloads/doc/tariff.doc | java -jar tika-app-1.3.jar --text

But is there a way to use it to convert the MS Word file into XML/HTML?

hmghaly

2 Answers


Yes, it involves changing a whopping 4 characters in your command!

If you run java -jar tika-app-1.3.jar --help you'll get something that starts with:

usage: java -jar tika-app.jar [option...] [file|port...]

Options:
  -?  or --help          Print this usage message
  -v  or --verbose       Print debug level messages
  -V  or --version       Print the Apache Tika version number

  -g  or --gui           Start the Apache Tika GUI
  -s  or --server        Start the Apache Tika server
  -f  or --fork          Use Fork Mode for out-of-process extraction

  -x  or --xml           Output XHTML content (default)
  -h  or --html          Output HTML content
  -t  or --text          Output plain text content
  -T  or --text-main     Output plain text content (main content only)
  -m  or --metadata      Output only metadata
.....

From that, you'll see that if you change your --text option to --html or --xml, you'll get nicely formatted HTML or XHTML instead of just the plain text.
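
Since the question is tagged java, here is a minimal sketch of driving the same command line from Java with ProcessBuilder. The class name TikaCli and its method names are made up for illustration. It assumes the java launcher is on the PATH, that tika-app-1.3.jar is in the working directory, and that the document is passed as a file argument (per the usage line above) rather than piped in.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of Tika: wraps the tika-app command line.
final class TikaCli {

    // Builds the argument list; formatFlag is one of the flags from the
    // --help output above, e.g. "--html", "--xml" or "--text".
    static List<String> buildCommand(String jarPath, String formatFlag, String inputFile) {
        return List.of("java", "-jar", jarPath, formatFlag, inputFile);
    }

    // Runs tika-app and returns whatever it writes to standard output.
    static String run(String jarPath, String formatFlag, String inputFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(jarPath, formatFlag, inputFile))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // pass Tika's stderr through
                .start();
        // Read all of stdout before waiting, so the child cannot block on a full pipe.
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tika-app exited with status " + exitCode);
        }
        return output;
    }
}
```

Calling TikaCli.run("tika-app-1.3.jar", "--html", "tariff.doc") should then return the same HTML the command line prints, provided those files exist locally.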

Gagravarr
  • Thanks, but is there a way to preserve document structure (tables etc, within the html/xml)? – hmghaly Apr 10 '13 at 09:23
  • For most of the file formats, it's already handled. Word is one of the ones where you'll get paragraphs / tables / style names etc. – Gagravarr Apr 10 '13 at 09:51

Although this has already been answered, the OP tagged the question with the java tag, so for completeness I'll add a reference that shows how to do this in Java.

The TikaTest.java superclass in Tika's unit tests is the easiest reference: its getXML method shows how to convert a Word document to HTML. It's a pity that they saw the usefulness of such an API when writing their unit tests but chose not to expose it as a handy tool, forcing everyone to deal with handlers etc., which is unfortunate boilerplate for the common use case.
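
For readers who want the handler boilerplate spelled out, here is a minimal sketch that uses Tika's public AutoDetectParser together with ToXMLContentHandler. It is not the getXML helper from TikaTest.java. The class name WordToXhtml is made up for illustration, and the sketch assumes the Tika parsers (for example the tika-app jar) are on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of Tika.
final class WordToXhtml {

    // Parses the stream with format auto-detection and returns the XHTML as a string.
    static String toXhtml(InputStream stream)
            throws IOException, SAXException, TikaException {
        ContentHandler handler = new ToXMLContentHandler(); // serializes SAX events to XML text
        AutoDetectParser parser = new AutoDetectParser();   // picks a parser based on the content
        Metadata metadata = new Metadata();                 // filled in by the parser as a side effect
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }

    // Convenience overload for a file on disk; the stream is closed automatically.
    static String toXhtml(Path file)
            throws IOException, SAXException, TikaException {
        try (InputStream in = Files.newInputStream(file)) {
            return toXhtml(in);
        }
    }
}
```

With a local copy of the document, WordToXhtml.toXhtml(Path.of("tariff.doc")) should return roughly what the --xml option prints on the command line.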

Daniel Gerson
  • If you [follow this example from the Tika website](http://tika.apache.org/1.11/examples.html#Parsing_to_XHTML), you'll see that getting XHTML is the same number of lines as getting plain text! – Gagravarr Dec 05 '15 at 23:02