
I have a project that transforms Word DOCX (OOXML) files to HTML.

I use XMLSpy with XSLT and XPath for this transformation.

Suppose I write an XSLT program that transforms one particular Word file. My supervisor says that if I change a value in the file, that approach won't work.

I agree, because I wrote the code specifically for that document, knowing exactly what it contains.

But how do we write general XSLT code that transforms any Word file into a well-formed HTML document, given that Word documents can differ a lot from one another?

Is the problem that I am trying to do this with XSLT? Is something wrong with the approach, or am I just being disorganized about it?

kjhughes
Sojimanatsu

3 Answers


Your plan to use XSLT to transform DOCX files to HTML is fundamentally sound. XSLT is ideal for this purpose as it is well suited for mapping from XML to XML (or (X)HTML).

Your challenge will be that the XML underlying DOCX is complex. Ecma Office Open XML Part 1 - Fundamentals And Markup Language Reference alone is over 5,000 pages long. If you know XML, XML namespaces, XSLT, HTML, and CSS well, you'll "just" have to learn some basics of OOXML to get started.

The concern about changing a value won't matter if you do this robustly and fundamentally understand OOXML. Start with the notion of runs of text in paragraphs: w:t, w:r and w:p.
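To make the paragraph/run/text structure concrete, here is a minimal sketch of the mapping, written in Python with only the standard library so it can be run as-is. It is not an XSLT stylesheet and not a complete converter: the sample markup and the function name `paragraphs_to_html` are made up for illustration, and bold is the only run property it looks at. In XSLT the same logic would be expressed as templates matching w:p and w:r.

```python
import xml.etree.ElementTree as ET
from html import escape

# The WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraphs_to_html(document_xml: str) -> str:
    """Map every w:p to an HTML <p>, concatenating the w:t text of its runs."""
    root = ET.fromstring(document_xml)
    html_paragraphs = []
    for p in root.iterfind(".//w:p", NS):
        parts = []
        for r in p.iterfind("w:r", NS):
            text = escape("".join(t.text or "" for t in r.iterfind("w:t", NS)))
            # Simplification: any w:b element counts as bold. A real converter
            # would also honour w:b with w:val="0" and style-inherited bold.
            if r.find("w:rPr/w:b", NS) is not None:
                text = f"<b>{text}</b>"
            parts.append(text)
        html_paragraphs.append("<p>" + "".join(parts) + "</p>")
    return "\n".join(html_paragraphs)


# Hypothetical, heavily trimmed document.xml content.
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:r><w:t xml:space="preserve">Hello </w:t></w:r><w:r><w:rPr><w:b/></w:rPr><w:t>world</w:t></w:r></w:p>
<w:p><w:r><w:t>A &amp; B</w:t></w:r></w:p>
</w:body></w:document>"""

print(paragraphs_to_html(SAMPLE))
```

Because the code keys on element names rather than on the content of one particular file, changing a value in the document changes the output text but not the logic, which is the point of writing the transformation against the structure.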

Eric White has written extensively on OOXML in general and even transforming it to HTML specifically. See Transforming Open XML WordprocessingML to XHtml for excellent articles and examples.

ruffin
kjhughes

I have done this with the older Word XML output, and I did some study comparing the old Word XML with the new docx format. They are very, very similar. The fact that docx is a multi-file archive is not a problem for me, because I use Saxon XSLT running in Java and can use jar file URLs to open the word/document.xml file, and from there get to all the other files with the document() function.
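The "docx is just an archive" point can be sketched outside of Saxon as well. The following Python example uses only the standard library; it is an analogue of the jar-URL approach, not the setup described above. The helper names `read_part` and `make_toy_docx` are hypothetical, and the toy archive is not a valid Word file (it lacks [Content_Types].xml and the relationship parts); it only demonstrates reading parts by name.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_part(docx_bytes: bytes, part_name: str) -> ET.Element:
    """Open a docx (a ZIP archive) and parse one of its XML parts."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        with archive.open(part_name) as part:
            return ET.parse(part).getroot()


def make_toy_docx() -> bytes:
    """Build a tiny in-memory archive with two parts, for illustration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "word/document.xml",
            f'<w:document xmlns:w="{W}"><w:body/></w:document>',
        )
        archive.writestr("word/styles.xml", f'<w:styles xmlns:w="{W}"/>')
    return buffer.getvalue()


docx = make_toy_docx()
document = read_part(docx, "word/document.xml")
styles = read_part(docx, "word/styles.xml")
print(document.tag)
print(styles.tag)
```

With a real file you would pass the bytes of the .docx instead of the toy archive; the main document, styles, and numbering definitions are then all reachable by their part names.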

I have found the trick is to cut to the chase and extract just what you need, essentially paragraphs; tables also convert pretty straightforwardly to HTML tables. Use style names and turn them into CSS. I demand that my source documents are built with styles, and where there is only direct formatting (bold, italics, font size, and the like), I do not try to preserve it exactly. I care about content, and HTML formatting can be rather different.
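As a rough illustration of turning style names into CSS hooks, the sketch below copies a paragraph's style id (w:pPr/w:pStyle/@w:val) into an HTML class attribute and ignores direct formatting. It is Python with the standard library only; the function names and the sample paragraphs are assumptions made for the example, and the matching CSS rules would be written by hand.

```python
import re
import xml.etree.ElementTree as ET
from html import escape

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"  # the namespaced w:val attribute name


def css_class(style_id: str) -> str:
    """Replace characters that are awkward in a CSS class name with '-'."""
    return re.sub(r"[^A-Za-z0-9_-]", "-", style_id)


def paragraph_to_html(p: ET.Element) -> str:
    """Emit <p class="StyleId">text</p>, dropping bold, italics, font size."""
    style = p.find("w:pPr/w:pStyle", NS)
    text = escape("".join(t.text or "" for t in p.iterfind("w:r/w:t", NS)))
    if style is None:
        return f"<p>{text}</p>"
    return f'<p class="{css_class(style.get(VAL, ""))}">{text}</p>'


# Hypothetical sample: one styled heading paragraph and one plain paragraph.
SAMPLE = f"""<w:body xmlns:w="{W}">
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Title</w:t></w:r></w:p>
<w:p><w:r><w:rPr><w:i/></w:rPr><w:t>Body text</w:t></w:r></w:p>
</w:body>"""

body = ET.fromstring(SAMPLE)
html_lines = [paragraph_to_html(p) for p in body.iterfind("w:p", NS)]
print("\n".join(html_lines))
```

The italic run in the second paragraph is deliberately flattened to plain text, mirroring the choice to preserve content and named styles rather than every piece of direct formatting.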

So, this is all fairly doable with XSLT, especially the old Word XML.

However, with docx there is one major loss of a really useful feature: the wx namespace. Especially:

  • w:listPr/wx:t/@wx:val -- which gives you the section heading numbering strings for numbered sections
  • wx:sub-section -- which you can transform to <div> elements to have nested sections instead of a flat list of headings and paragraphs.

I find reconstructing the section numbers correctly an immensely hard task. The principles are described in Wordprocessing Numbering, Levels & Lists and are not hard to understand. But they are pretty hard to implement: you have to chase through levels of styles and their w:basedOn parent styles, concrete number formats, and abstract number formats until you have finally gathered the number format, and you must also keep a running count for every level so that you have the numbers to format at each level.

I have done this sort of inheritance scheme in XSLT, and it is even fun to do, but it is hard and would take me several days, time which I don't have.
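The two moving parts, following the parent-style chain and keeping per-level counters, can be sketched in a few lines of Python. This is a deliberately reduced model and not the real OOXML lookup: the `STYLES` dictionary, its `based_on` and `ilvl` keys, and both function names are invented for the example, and the detour through numbering.xml (concrete and abstract number definitions, number formats, restart rules) is left out entirely.

```python
def resolve_level(style_id, styles):
    """Walk the based_on chain until a style declares a numbering level."""
    seen = set()  # guards against accidental cycles in the style chain
    while style_id is not None and style_id not in seen:
        seen.add(style_id)
        style = styles.get(style_id, {})
        if "ilvl" in style:
            return style["ilvl"]
        style_id = style.get("based_on")
    return None


def number_headings(levels):
    """Turn a sequence of zero-based levels into labels like '1', '1.1', '2'."""
    counters = []
    labels = []
    for level in levels:
        while len(counters) <= level:
            counters.append(0)       # a skipped level would show up as 0 here
        counters[level] += 1
        del counters[level + 1:]     # deeper levels restart under a new parent
        labels.append(".".join(str(n) for n in counters[: level + 1]))
    return labels


# Hypothetical, simplified style table: MyHeading inherits from Heading2.
STYLES = {
    "Heading1": {"based_on": None, "ilvl": 0},
    "Heading2": {"based_on": "Heading1", "ilvl": 1},
    "MyHeading": {"based_on": "Heading2"},
}

paragraph_styles = ["Heading1", "Heading2", "MyHeading", "Heading1", "Heading2"]
levels = [resolve_level(s, STYLES) for s in paragraph_styles]
print(levels)
print(number_headings(levels))
```

Even in this toy form the counter state has to be threaded through the document in order, which is part of what makes the task awkward in a template-driven language.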

The recovery of the nesting levels (wx:sub-section) is also non-trivial, and you have to sort of break out of normal XSLT workflows to make that happen. I have done such things too, but it's another few days I'd need to invest.
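For the nesting problem, the underlying algorithm is a stack of open sections: each heading closes every open section at the same or a deeper level, then opens a new one. The Python sketch below shows that idea on an already-flattened list; the `(kind, level, text)` tuples and the function name `nest_sections` are assumptions for the example, not something docx provides directly. In XSLT 2.0 this kind of regrouping is usually approached with xsl:for-each-group and group-starting-with, applied once per heading level.

```python
from html import escape


def nest_sections(items):
    """Wrap a flat heading/paragraph sequence in nested <div> sections.

    items is a list of (kind, level, text) tuples, where kind is "h" for a
    heading (level 1, 2, ...) or "p" for a body paragraph (level ignored).
    """
    out = []
    open_levels = []  # heading levels whose <div> is still open
    for kind, level, text in items:
        if kind == "h":
            # Close sections at the same or a deeper level first.
            while open_levels and open_levels[-1] >= level:
                open_levels.pop()
                out.append("</div>")
            out.append("<div>")
            open_levels.append(level)
            out.append(f"<h{level}>{escape(text)}</h{level}>")
        else:
            out.append(f"<p>{escape(text)}</p>")
    # Close whatever is still open at the end of the document.
    out.extend("</div>" for _ in open_levels)
    return "".join(out)


FLAT = [
    ("h", 1, "Intro"),
    ("p", 0, "a"),
    ("h", 2, "Detail"),
    ("p", 0, "b"),
    ("h", 1, "Next"),
    ("p", 0, "c"),
]

print(nest_sections(FLAT))
```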

I often wonder when people say "oh, the wx namespace has been dropped because the developers understood that it is redundant." Yes, but I doubt that most of the people who say that so lightly have ever done these transformations.

I think docx is designed to be obtuse, so that most of us foot soldiers are intimidated and so that software companies like Microsoft, and products like Aspose.Words, keep a market for bulky, licensed, Windows-only software packages.

Gunther Schadow

You can also use pandoc (https://pandoc.org), which converts docx to other formats, including HTML.

ManuelGomes