I have done this with the older Word XML output. I also compared the old Word XML with the new docx format, and the two are very similar. The fact that docx is a multi-file archive is not a problem for me: I run Saxon XSLT in Java, so I can use a jar: URL to open the word/document.xml file and from there reach all the other files with the XSLT document() function.
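To illustrate the jar: URL idea outside of XSLT, here is a minimal Java sketch. It assumes nothing beyond the JDK; the class and method names (DocxJarUrl, partUrl, readPart, writeSamplePackage) are made up for this example, and the "docx" it reads is just a tiny ZIP written on the spot so that the code is self-contained.

```java
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class DocxJarUrl {

    /** Builds the jar: URL for one part inside a .docx, which is a ZIP archive. */
    static URL partUrl(Path docx, String partName) throws Exception {
        // Produces something like jar:file:///tmp/sample123.docx!/word/document.xml
        return URI.create("jar:" + docx.toUri() + "!/" + partName).toURL();
    }

    /** Reads one part of the package as a UTF-8 string. */
    static String readPart(Path docx, String partName) throws Exception {
        try (InputStream in = partUrl(docx, partName).openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    /** Writes a tiny stand-in for a real .docx (hypothetical content). */
    static Path writeSamplePackage() throws Exception {
        Path docx = Files.createTempFile("sample", ".docx");
        docx.toFile().deleteOnExit();
        try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(docx))) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<w:document/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return docx;
    }

    public static void main(String[] args) throws Exception {
        Path docx = writeSamplePackage();
        System.out.println(partUrl(docx, "word/document.xml"));
        System.out.println(readPart(docx, "word/document.xml"));
    }
}
```

The same URL string is what you would hand to the XSLT processor as the source document, so no unzip step is needed.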
The trick, I have found, is to cut to the chase and extract just what you need: essentially paragraphs, plus tables, which convert fairly straightforwardly to HTML tables. Use the style names and turn them into CSS classes. I require that my source documents are built with styles; where there is only direct formatting (bold, italics, font size, and the like), I do not try to preserve it exactly. I care about content, and HTML formatting can be rather different anyway.
So, this is all fairly doable with XSLT, especially the old Word XML.
However, with docx there is one major loss of a really useful feature: the wx namespace. Especially:
- w:listPr/wx:t/@wx:val -- which gives you the numbering strings of numbered section headings
- wx:sub-section -- which you can transform to <div> elements to have nested sections instead of a flat list of headings and paragraphs.
I find the reconstruction of the section numbers in particular an immensely hard task if I want to do it correctly. The principles are described in Wordprocessing Numbering, Levels & Lists, and they are not hard to understand. But they are pretty hard to implement: you have to chase through levels of styles and their w:basedOn parent styles, then concrete number formats and abstract number formats, until you have finally gathered the number format. You must also keep track of the counters for all the levels, so that you have the number for each level to format.
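The counting half of that task, on its own, can be sketched in a few lines of Java. This is only an illustration under heavy assumptions: the level patterns are passed in directly (in a real document they would come out of the style and numbering lookup described above), every level counts in decimal from 1, and things like w:start or non-decimal w:numFmt values are ignored. The class name HeadingNumberer is made up for this example.

```java
public class HeadingNumberer {
    private final String[] lvlText;  // one pattern per level, e.g. "%1.%2."
    private final int[] counters;    // current count at each level

    HeadingNumberer(String... lvlText) {
        this.lvlText = lvlText;
        this.counters = new int[lvlText.length];
    }

    /** Advances the counter for the given 0-based level and returns its label. */
    String next(int ilvl) {
        counters[ilvl]++;
        // A new item at this level restarts every deeper level.
        for (int i = ilvl + 1; i < counters.length; i++) {
            counters[i] = 0;
        }
        // Fill each %n placeholder with the current count of level n.
        String label = lvlText[ilvl];
        for (int i = counters.length; i >= 1; i--) {
            label = label.replace("%" + i, Integer.toString(counters[i - 1]));
        }
        return label;
    }

    public static void main(String[] args) {
        HeadingNumberer n = new HeadingNumberer("%1.", "%1.%2.", "%1.%2.%3.");
        int[] levels = {0, 1, 1, 2, 0, 1};
        for (int level : levels) {
            System.out.println(n.next(level));
        }
        // prints 1. / 1.1. / 1.2. / 1.2.1. / 2. / 2.1. on separate lines
    }
}
```

The hard part remains finding the right pattern and level for each paragraph; the counters are the easy bit once you have those.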
I have done this sort of inheritance scheme in XSLT before, and it is even fun to do, but it is hard and would take me several days, time which I don't have.
The recovery of the nesting levels (wx:sub-section) is also non-trivial, and you have to sort of break out of normal XSLT workflows to make that happen. I have done such things too, but it's another few days I'd need to invest.
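For what it's worth, the nesting logic itself is a small stack algorithm once the heading levels are known. The Java sketch below shows it on a made-up flat model (the Block class, with level 0 meaning body text, is an assumption for this example and not anything from the docx schema); it does no XML escaping and clamps heading levels to h1..h6.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class SectionNester {

    /** Hypothetical flat block: level 1..9 for a heading, 0 for body text. */
    static final class Block {
        final int level;
        final String text;
        Block(int level, String text) { this.level = level; this.text = text; }
    }

    /** Turns a flat list of headings and paragraphs into nested <div> sections. */
    static String nest(List<Block> blocks) {
        StringBuilder html = new StringBuilder();
        Deque<Integer> open = new ArrayDeque<>();  // levels of the currently open <div>s
        for (Block b : blocks) {
            if (b.level == 0) {
                html.append("<p>").append(b.text).append("</p>");
                continue;
            }
            // A heading closes every open section at the same or a deeper level.
            while (!open.isEmpty() && open.peek() >= b.level) {
                html.append("</div>");
                open.pop();
            }
            open.push(b.level);
            int h = Math.min(b.level, 6);  // HTML only has h1..h6
            html.append("<div><h").append(h).append('>')
                .append(b.text)
                .append("</h").append(h).append('>');
        }
        // Close whatever is still open at the end of the document.
        while (!open.isEmpty()) {
            html.append("</div>");
            open.pop();
        }
        return html.toString();
    }

    public static void main(String[] args) {
        List<Block> flat = Arrays.asList(
            new Block(1, "A"), new Block(0, "p1"),
            new Block(2, "A.1"), new Block(0, "p2"),
            new Block(1, "B"));
        System.out.println(nest(flat));
        // prints <div><h1>A</h1><p>p1</p><div><h2>A.1</h2><p>p2</p></div></div><div><h1>B</h1></div>
    }
}
```

Doing the same thing inside a template-driven stylesheet is where it gets awkward, because the output structure no longer follows the input structure.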
I often wonder when people say "oh, that wx namespace has been dropped because the developers understand that it is redundant". Yes, it is redundant, but I doubt that most of the people who say that so lightly have ever done these transformations.
I think docx is designed to be obtuse so that most of us foot soldiers are intimidated, and so that software companies like Microsoft, Aspose.Words, etc. have a market for their bulky, Windows-only, licensed software packages.