I'm looking for something in Java to read Word documents and process their text. All I need is their text, nothing fancy. I know about Apache POI, but it doesn't support DOCX right now. Is there anything else out there?
Viewed 9,918 times
4 Answers
5
If you don't need formatting information, images, and other fancy stuff, the job is a lot easier. Some 5 to 10 lines of code will do.
- Treat the DOCX as a zip file. It consists of a bunch of files, including 'document.xml' (stored under 'word/' inside the archive). Use ZipInputStream and extract that file alone. (You can open a DOCX with your favorite zip utility and see for yourself!)
- Use a SAX parser and read the contents of the nodes at body/p/r/t. Voila, you've got the text!
This applies only if all you need is the text.
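The two steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class name `DocxText` and the method `extractText` are made up for illustration, and it assumes the main part lives at `word/document.xml` and uses the transitional WordprocessingML namespace, which is what Word normally writes.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the names are illustrative only.
public class DocxText {
    // Assumption: transitional WordprocessingML namespace (strict OOXML files use another one).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Reads a DOCX from the stream and returns its text, one line per paragraph. */
    public static String extractText(InputStream docx)
            throws IOException, SAXException, ParserConfigurationException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            // Step 1: walk the zip entries until the main document part is found.
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return parse(zip);
                }
            }
        }
        throw new IOException("word/document.xml not found; is this a DOCX file?");
    }

    // Step 2: SAX-parse the XML and collect the character data inside w:t elements.
    private static String parse(InputStream xml)
            throws IOException, SAXException, ParserConfigurationException {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inText;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (W_NS.equals(uri) && "t".equals(localName)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(localName)) {
                    inText = false;
                } else if ("p".equals(localName)) {
                    text.append('\n'); // end of paragraph
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    text.append(ch, start, length);
                }
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so uri/localName are populated
        factory.newSAXParser().parse(xml, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxText <file.docx>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.print(extractText(in));
        }
    }
}
```

The sketch ignores tabs, line breaks inside a paragraph, headers, footers, and footnotes, which live in other elements or other parts of the archive. If the files come from untrusted sources, the SAX parser should probably be hardened against external entities as well.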

Joseph
Hi Joseph, can you please post the short code here? It would be of GREAT HELP to me... – Rahul Utb May 13 '11 at 20:28
3
With some googling I found OpenXML4J. This might solve your issue. I have not used it before, but I am sure someone in the community will have better insight.
Note: This is a duplicate question. The other question has the solution plus a bit of discussion. Link to the question.

Community

XanderLynn
Is it reasonable to keep both questions, given that one is asking about Word doc format and the other Excel? They may be two subsets of one larger document format spec, I honestly don't know. – Bill the Lizard Feb 15 '10 at 05:40
I believe it is a duplicate because each question is asking about office 2007 java api. The other question, IMHO, does answer the mail. :) – XanderLynn Feb 15 '10 at 13:57
2
Try Apache POI: it can handle doc, docx, xls, xlsx, ppt, and pptx.
Another production-level solution is OpenOffice in headless mode, which can even be used in a server-side scenario.

ccpizza
1
You could try docx4j; see http://dev.plutext.org/svn/docx4j/trunk/docx4j/src/main/java/org/docx4j/TextUtils.java

JasonPlutext