How can I parse Word documents (".doc", ".docx") to get all the text using golang?
Why was this downvoted? It's the first result from Google. – DMin Apr 19 '17 at 16:37
3 Answers
You can get some inspiration from these projects:
https://github.com/nguyenthenguyen/docx
https://github.com/opencontrol/doc-template
Basically, a DOCX file is a Zip archive with XML files in it.
The document text is in word/document.xml.
What both projects do is remove all XML tags, leaving only the text intact. You should see if that approach suits you too.
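To illustrate the tag-removal idea, here is a rough sketch that is not taken from either project. It assumes the contents of word/document.xml are already in a string; the function name stripTags is made up for this example, and the regular expression is a crude shortcut rather than a real XML parser.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// anyTag matches anything that looks like an XML tag. This is a crude
// shortcut for illustration, not a real XML parser.
var anyTag = regexp.MustCompile(`<[^>]+>`)

// stripTags (hypothetical helper) turns paragraph ends into newlines,
// drops every remaining tag, and decodes entities such as &amp;.
func stripTags(documentXML string) string {
	s := strings.ReplaceAll(documentXML, "</w:p>", "\n")
	s = anyTag.ReplaceAllString(s, "")
	return strings.TrimSpace(html.UnescapeString(s))
}

func main() {
	sample := `<w:p><w:r><w:t>Tom &amp; Jerry</w:t></w:r></w:p>` +
		`<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>`
	fmt.Println(stripTags(sample))
}
```

Because every tag is dropped, any character data outside the text runs ends up in the output as well, so check the result against your own documents.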

TL;DR
1. Unpack the docx file using any Go zip package
2. Parse the text from word/document.xml
3. If you have any other docx files inside the word/ folder, repeat steps 1 and 2 for each of them recursively
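Steps 1 and 2 can be sketched with only the standard library. This is a minimal sketch under a few assumptions: the function names textFromDocumentXML and docxText are invented here, text is taken from elements whose local name is t, and a newline is written at the end of each p element. Matching on local names only, without checking the namespace, is a simplification.

```go
package main

import (
	"archive/zip"
	"bytes"
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"os"
	"strings"
)

// textFromDocumentXML (hypothetical helper) walks the XML tokens and collects
// character data found inside <w:t> elements, writing a newline whenever a
// <w:p> element closes. Only local names are compared (a simplification).
func textFromDocumentXML(data []byte) (string, error) {
	dec := xml.NewDecoder(bytes.NewReader(data))
	var sb strings.Builder
	inText := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == "t" {
				inText = true
			}
		case xml.EndElement:
			switch t.Name.Local {
			case "t":
				inText = false
			case "p":
				sb.WriteByte('\n')
			}
		case xml.CharData:
			if inText {
				sb.Write(t)
			}
		}
	}
	return sb.String(), nil
}

// docxText (hypothetical helper) opens the zip archive at path and extracts
// the text of word/document.xml.
func docxText(path string) (string, error) {
	zr, err := zip.OpenReader(path)
	if err != nil {
		return "", err
	}
	defer zr.Close()
	for _, f := range zr.File {
		if f.Name != "word/document.xml" {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return "", err
		}
		defer rc.Close()
		data, err := io.ReadAll(rc)
		if err != nil {
			return "", err
		}
		return textFromDocumentXML(data)
	}
	return "", errors.New("word/document.xml not found")
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: docxtext FILE.docx")
		return
	}
	text, err := docxText(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}
```

Headers, footers, footnotes and similar parts live in other files inside the archive, so this sketch only covers the main body.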
In most cases…
As already mentioned, a docx file is basically a zip archive with a bunch of xml files inside.
In most cases, all the text from the original file is present in word/document.xml. You can use Go's standard encoding/xml package to parse the text out of it. Also look at the OpenXML documentation if you need information about the different tag types.
Upd. You can use this code btw.
But…
Unfortunately, there are cases where not all of the text is present in that file.
For example, if the document has another docx file (or a file in any other format) embedded in it, that file is most likely stored in the word folder (beside document.xml) as a separate file.
If that's the case, you need to unpack each of those docx files and parse their own document.xml.
For more details, check the AltChunk OpenXML class and any related documentation.
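One way to spot such embedded files is to list the archive entries under word/ whose names end in .docx. The sketch below does only that; the helper name nestedDocxParts is made up, and relying on the file extension is an assumption, since embedded parts in other formats will not be found this way.

```go
package main

import (
	"archive/zip"
	"fmt"
	"os"
	"strings"
)

// nestedDocxParts (hypothetical helper) returns the entry names that sit
// under word/ and end in .docx. Detection by extension is an assumption.
func nestedDocxParts(names []string) []string {
	var out []string
	for _, n := range names {
		if strings.HasPrefix(n, "word/") && strings.HasSuffix(strings.ToLower(n), ".docx") {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: nested FILE.docx")
		return
	}
	zr, err := zip.OpenReader(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer zr.Close()

	// Collect all entry names, then print the ones that look like nested docx files.
	var names []string
	for _, f := range zr.File {
		names = append(names, f.Name)
	}
	for _, n := range nestedDocxParts(names) {
		fmt.Println(n)
	}
}
```

Each entry found this way can be read into memory and opened with zip.NewReader, after which the same unpack-and-parse steps apply to it.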

A very simple solution is to use https://github.com/sajari/docconv.
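Example code:
// Needs the os package and the docconv package from the repository above.
f, err := os.Open("path.docx")
if err != nil {
	panic(err)
}
defer f.Close()

// *os.File already satisfies io.Reader, so it can be passed in directly.
// The second return value (metadata) is ignored here.
text, _, err := docconv.ConvertDocx(f)
if err != nil {
	panic(err)
}
ConvertDocx returns the text of the docx as a string (text in the code above).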
