Parse .doc and .docx to get all text using Golang?

Question

How can I parse Word documents (".doc", ".docx") to get all the text using Golang?

Why was this downvoted? It's the first result from Google. – DMin, Apr 19 '17 at 16:37

score 8 · Accepted Answer · answered Oct 22 '16 at 20:27

You can get some inspiration from these projects:

https://github.com/nguyenthenguyen/docx
https://github.com/opencontrol/doc-template

Basically, a DOCX file is a ZIP archive with XML files in it. All the text is inside word/document.xml.

What both projects do is strip all the XML tags, leaving only the text intact. See whether that approach suits you too.
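To make the "zip plus tag stripping" idea concrete, here is a minimal sketch using only the standard library. It is not taken from either project above. The function names and the buildDocx demo helper are made up for illustration, and the regular expression is a deliberately naive way to drop tags (entities such as &amp; stay undecoded and paragraph breaks are lost).

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"regexp"
)

// tagPattern matches any XML tag; used for the naive "strip all tags" approach.
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// readDocumentXML returns the raw contents of word/document.xml from a DOCX
// archive that has already been read into memory.
func readDocumentXML(docx []byte) ([]byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(docx), int64(len(docx)))
	if err != nil {
		return nil, err
	}
	for _, f := range zr.File {
		if f.Name != "word/document.xml" {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		defer rc.Close()
		return io.ReadAll(rc)
	}
	return nil, fmt.Errorf("word/document.xml not found")
}

// stripTags removes every XML tag and keeps only the character data.
func stripTags(xmlData []byte) string {
	return string(tagPattern.ReplaceAll(xmlData, nil))
}

// buildDocx is a hypothetical helper for this demo only: it zips the given
// XML as word/document.xml so the example runs without a real file.
func buildDocx(documentXML string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create("word/document.xml")
	if err != nil {
		return nil, err
	}
	if _, err := io.WriteString(w, documentXML); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	docx, err := buildDocx(`<w:document><w:body><w:p><w:r><w:t>Hello, docx</w:t></w:r></w:p></w:body></w:document>`)
	if err != nil {
		panic(err)
	}
	raw, err := readDocumentXML(docx)
	if err != nil {
		panic(err)
	}
	fmt.Println(stripTags(raw))
}
```

For a real file you would pass the result of os.ReadFile("path.docx") to readDocumentXML instead of the in-memory sample.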

Tigran Rostomyan · Answer 2 · 2023-06-29T05:04:56.067

TL;DR

1. Unpack the docx file using any Go zip package.
2. Parse the text from word/document.xml.
3. If there are other docx files inside the word/ folder, repeat steps 1 and 2 for each of them recursively.

In most cases…

As already mentioned, a docx file is basically a zip archive with a bunch of xml files inside.

In most cases all the text from the original file is present in word/document.xml. You can use Go's standard xml package to parse the text out of it. Also look at the OpenXML documentation if you need information about the different tag types.
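As a rough sketch of the xml-package approach: the function below walks the tokens of a document.xml payload with encoding/xml and keeps only the character data inside w:t (text run) elements, ending each w:p (paragraph) with a newline. The name extractText is made up for this example, and the sample XML is a hand-written fragment, so treat it as a starting point and not a complete OpenXML reader.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// wordNS is the WordprocessingML namespace that the w: prefix is bound to.
const wordNS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

// extractText collects the character data found inside <w:t> elements and
// writes a newline at the end of every <w:p> paragraph.
func extractText(documentXML string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(documentXML))
	var out strings.Builder
	inText := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out.String(), nil
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Space == wordNS && t.Name.Local == "t" {
				inText = true
			}
		case xml.EndElement:
			if t.Name.Space != wordNS {
				continue
			}
			switch t.Name.Local {
			case "t":
				inText = false
			case "p":
				out.WriteByte('\n')
			}
		case xml.CharData:
			// Text outside <w:t> is just formatting whitespace; skip it.
			if inText {
				out.Write(t)
			}
		}
	}
}

func main() {
	// Hand-written sample standing in for a real word/document.xml.
	sample := `<w:document xmlns:w="` + wordNS + `"><w:body>` +
		`<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t xml:space="preserve"> world</w:t></w:r></w:p>` +
		`<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>` +
		`</w:body></w:document>`
	text, err := extractText(sample)
	if err != nil {
		panic(err)
	}
	fmt.Print(text)
}
```

Compared with stripping tags blindly, the decoder also resolves XML entities and lets you keep paragraph boundaries.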

Upd. You can use this code btw.

But…

Unfortunately, there are cases where not all of the text is present in that file.

For example, if the document has another docx file (or a file in any other format) embedded in it, that file is most likely stored in the word folder (beside document.xml) as a separate file.

If that's the case, you need to unpack each of those docx files and parse its own document.xml.

For more details, check the AltChunk OpenXML class and any related documentation.
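The recursive step could look roughly like the sketch below. It assumes that embedded documents show up as entries under word/ whose names end in .docx, which is how the case above is described but may not cover every document. The names part, mustZip, docxText and the file name afchunk.docx are all made up for the demo, and the text extraction is the naive tag stripping, kept short on purpose.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// part and mustZip are hypothetical demo helpers that build archives in memory.
type part struct {
	name string
	data []byte
}

func mustZip(parts ...part) []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, p := range parts {
		w, err := zw.Create(p.name)
		if err == nil {
			_, err = w.Write(p.data)
		}
		if err != nil {
			panic(err)
		}
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// tagPattern matches any XML tag (naive extraction, see the assumptions above).
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// docxText returns the text of word/document.xml plus the text of every
// embedded .docx found under word/, in archive order, one line per document.
func docxText(docx []byte) (string, error) {
	zr, err := zip.NewReader(bytes.NewReader(docx), int64(len(docx)))
	if err != nil {
		return "", err
	}
	var out strings.Builder
	for _, f := range zr.File {
		isMain := f.Name == "word/document.xml"
		isEmbedded := strings.HasPrefix(f.Name, "word/") && strings.HasSuffix(f.Name, ".docx")
		if !isMain && !isEmbedded {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return "", err
		}
		data, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return "", err
		}
		if isMain {
			// Drop every tag and keep the character data.
			out.Write(tagPattern.ReplaceAll(data, nil))
			out.WriteByte('\n')
			continue
		}
		// An embedded docx is itself a zip archive, so recurse into it.
		text, err := docxText(data)
		if err != nil {
			return "", err
		}
		out.WriteString(text)
	}
	return out.String(), nil
}

func main() {
	inner := mustZip(part{"word/document.xml", []byte(`<w:p><w:r><w:t>embedded text</w:t></w:r></w:p>`)})
	outer := mustZip(
		part{"word/document.xml", []byte(`<w:p><w:r><w:t>main text</w:t></w:r></w:p>`)},
		part{"word/afchunk.docx", inner}, // hypothetical embedded file name
	)
	text, err := docxText(outer)
	if err != nil {
		panic(err)
	}
	fmt.Print(text)
}
```

Embedded files in other formats (the "any other format" case) would need their own parsers; this sketch simply skips them.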

score 0 · Answer 3 · answered Aug 22 '23 at 11:19

A very simple solution is to use https://github.com/sajari/docconv.

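Example code:

f, err := os.Open("path.docx")
if err != nil {
    panic(err)
}
defer f.Close()

// ConvertDocx takes an io.Reader, and *os.File already satisfies it.
// It returns the extracted text, a metadata map, and an error.
text, _, err := docconv.ConvertDocx(f)
if err != nil {
    panic(err)
}
fmt.Println(text)

This returns the text of the docx file as a string.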
