6

How can I parse word documents ".doc", ".docx" to get all the text using golang?

Deduplicator
Alexander Barac

3 Answers

8

You can get some inspiration from these projects:

https://github.com/nguyenthenguyen/docx
https://github.com/opencontrol/doc-template

Basically, a DOCX file is a Zip archive with XML files in it. All the text is inside word/document.xml.

What both projects do is remove all the XML tags, leaving only the text intact. You should see if that approach suits you too.

Alexey Soshin
0

TL;DR

  1. Unpack the docx file using any Go zip package
  2. Parse the text from word/document.xml
  3. If there are other docx files inside the word/ folder, repeat steps 1 and 2 for each of them recursively
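Step 1 can be sketched with the standard archive/zip package. This is only a minimal illustration, not a full solution: readZipEntry and sampleDocx are hypothetical names, and sampleDocx builds a tiny stand-in archive in memory (not a valid Word document) so the example runs without a real file.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// readZipEntry returns the contents of the named file inside a zip archive
// held in memory. A docx file is such an archive.
func readZipEntry(archive []byte, name string) ([]byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(archive), int64(len(archive)))
	if err != nil {
		return nil, err
	}
	for _, f := range zr.File {
		if f.Name != name {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		defer rc.Close()
		return io.ReadAll(rc)
	}
	return nil, fmt.Errorf("%s not found in archive", name)
}

// sampleDocx is a hypothetical helper that builds a tiny stand-in archive
// so the sketch runs without a real file. It is not a valid Word document.
func sampleDocx() []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create("word/document.xml")
	if err != nil {
		panic(err)
	}
	if _, err := w.Write([]byte("<w:document><w:body/></w:document>")); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	// With a real file you would load the bytes with os.ReadFile("path.docx").
	raw, err := readZipEntry(sampleDocx(), "word/document.xml")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}
```

The function returns the raw XML only; turning that XML into plain text is step 2.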

In most cases…

As already mentioned, a docx file is basically a zip archive with a bunch of XML files inside.

In most cases all the text from the original file is present in word/document.xml. You can use Go's standard encoding/xml package to parse the text from it. Also look at the OpenXML documentation if you need information about the different tag types.
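As a rough sketch of that parsing step, the code below walks the XML tokens with encoding/xml and keeps only the character data inside w:t elements, ending each w:p paragraph with a newline. The function name documentText and the sample XML are made up for illustration, and the handling of w:tab and w:br is a simplifying assumption; tables, footnotes and headers get no special treatment.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// wordNS is the WordprocessingML main namespace, usually bound to the "w" prefix.
const wordNS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

// sample is a hand-written, heavily simplified document.xml used for the demo.
const sample = `<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body>` +
	`<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t xml:space="preserve"> world</w:t></w:r></w:p>` +
	`<w:p><w:r><w:t>Second</w:t><w:tab/><w:t>line</w:t></w:r></w:p>` +
	`</w:body></w:document>`

// documentText collects the character data of every w:t element and ends
// each w:p paragraph with a newline.
func documentText(xmlData []byte) (string, error) {
	dec := xml.NewDecoder(bytes.NewReader(xmlData))
	var sb strings.Builder
	inText := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return sb.String(), nil
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Space != wordNS {
				continue
			}
			switch t.Name.Local {
			case "t":
				inText = true
			case "tab":
				sb.WriteByte('\t')
			case "br":
				sb.WriteByte('\n')
			}
		case xml.EndElement:
			if t.Name.Space != wordNS {
				continue
			}
			switch t.Name.Local {
			case "t":
				inText = false
			case "p":
				sb.WriteByte('\n')
			}
		case xml.CharData:
			// Only text runs carry visible text; other character data is skipped.
			if inText {
				sb.Write(t)
			}
		}
	}
}

func main() {
	text, err := documentText([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", text)
}
```

Using the token stream instead of a regular expression means entities such as &amp;amp; are decoded for you and text outside w:t is not picked up by accident.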

Update: you can also use this code.

But…

Unfortunately, there are cases where not all the text is present in that file.

For example, if the document has another embedded docx file (or a file in any other format), it is most likely stored in the word folder (beside document.xml) as a separate file.

If that's the case, you need to unpack each of those docx files and parse their own document.xml.

For more details, check the AltChunk OpenXML class and any related documentation.
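The recursive case could look roughly like the sketch below. It assumes, as described above, that embedded documents are stored as plain .docx files somewhere under word/; anything embedded in another format is ignored. collectDocumentXML, readEntry, entry and makeZip are hypothetical names, and makeZip only exists to build nested test archives in memory.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// collectDocumentXML returns the raw word/document.xml of the archive in data,
// followed by the document.xml of every .docx entry found under word/.
// Assumption: embedded documents are plain .docx files under word/.
func collectDocumentXML(data []byte) ([][]byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	var parts [][]byte
	for _, f := range zr.File {
		isMain := f.Name == "word/document.xml"
		isNested := strings.HasPrefix(f.Name, "word/") &&
			strings.HasSuffix(strings.ToLower(f.Name), ".docx")
		if !isMain && !isNested {
			continue
		}
		content, err := readEntry(f)
		if err != nil {
			return nil, err
		}
		if isMain {
			parts = append(parts, content)
			continue
		}
		// An embedded docx is itself a zip archive, so recurse into it.
		nested, err := collectDocumentXML(content)
		if err != nil {
			return nil, err
		}
		parts = append(parts, nested...)
	}
	return parts, nil
}

// readEntry reads one archive entry fully into memory.
func readEntry(f *zip.File) ([]byte, error) {
	rc, err := f.Open()
	if err != nil {
		return nil, err
	}
	defer rc.Close()
	return io.ReadAll(rc)
}

// entry and makeZip are demo-only helpers for building archives in memory.
type entry struct {
	name string
	data []byte
}

func makeZip(entries ...entry) []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, e := range entries {
		w, err := zw.Create(e.name)
		if err != nil {
			panic(err)
		}
		if _, err := w.Write(e.data); err != nil {
			panic(err)
		}
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	inner := makeZip(entry{"word/document.xml", []byte("<inner/>")})
	outer := makeZip(
		entry{"word/document.xml", []byte("<outer/>")},
		entry{"word/embedded.docx", inner},
	)
	parts, err := collectDocumentXML(outer)
	if err != nil {
		panic(err)
	}
	for _, p := range parts {
		fmt.Println(string(p))
	}
}
```

Each returned part is still raw XML, so it needs the same text extraction as the main document.xml.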

0

A very simple solution is to use https://github.com/sajari/docconv.

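Example code:

f, err := os.Open("path.docx")
if err != nil {
    panic(err)
}
defer f.Close()

// *os.File already implements io.Reader, so it can be passed directly.
// ConvertDocx returns the extracted text, a metadata map, and an error.
text, _, err := docconv.ConvertDocx(f)
if err != nil {
    panic(err)
}
fmt.Println(text)

This returns the text of the docx file as a string.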

Roman Sterlin