We are trying to convert a .docx file, and later possibly other file formats, into a kind of standard XML. This XML will then be mapped through an XSLT to the XML of our choice (defined by an XSD).

For the conversion to be successful, we need to keep as much of the information in the document as possible. The most important parts are the structure, the content, tables, lists, and figures (images, etc.).

We have realised that this job is complex, and that there are serious restrictions on what kinds of documents we can support.

As there are different standards, implementing a converter for each of them would be time-consuming.

Does anyone have some experience with Document Conversion to XML? Any tips on how to proceed?

sbadea

1 Answer

You are correct that converting from DOCX to an arbitrary XML format can be a big undertaking.

What we would like is to convert a .docx and other potential file formats into a standard XML which can through XSLT be transformed to a XML with a specified XSD.

A DOCX file is already in a standard XML format known as Office Open XML (OOXML). See Office Open XML Overview for an introduction.
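To make that concrete, here is a minimal sketch of reading the XML straight out of a DOCX package using only the Python standard library. It assumes the main document part lives at `word/document.xml`, which is the usual location but is formally determined by the package relationships. The names `read_paragraphs` and `make_sample_docx` are hypothetical helpers for illustration, and the sample archive is a stripped-down stand-in rather than a complete DOCX file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def read_paragraphs(docx):
    """Return (style_id, text) for each paragraph in word/document.xml.

    `docx` may be a file path or a binary file-like object.
    Assumption: the main part is stored at word/document.xml.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # The paragraph style (e.g. "Heading1") is optional.
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W_NS}}}val") if style is not None else None
        # A paragraph's text is split across runs; join all w:t elements.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        paragraphs.append((style_id, text))
    return paragraphs


def make_sample_docx():
    """Build a tiny in-memory archive holding only word/document.xml (demo only)."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
        '<w:r><w:t>Intro</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


for style_id, text in read_paragraphs(make_sample_docx()):
    print(style_id, "|", text)
```

Since the source is already XML, an XSLT could in principle be applied to `word/document.xml` directly. The style identifier is often the only structural hint available, which is why the classification issue discussed below matters.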

We are aware that this is a complicated area. There will be restrictions on what kind of documents we will support, and the most important thing for us is that we can keep structure and content.

Given that OOXML is oriented toward formatting, depending upon which "structure and content" you're looking to identify, you may have a very challenging classification problem to solve. The problem would be hard enough knowing the exact target format; answering in the general case isn't feasible. One technique that can help is pattern-based matching of keywords, headings, etc., to identify the more structured parts of the target format within the source document.
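As a rough illustration of pattern-based matching, the sketch below labels paragraph text with regular expressions. The rule set, the labels, and the `classify` function are all hypothetical; real rules would have to be derived from your own documents and target format, and would likely combine text patterns with style names.

```python
import re

# Hypothetical rules as (label, pattern) pairs. Order matters: first match wins.
RULES = [
    # Numbered headings such as "1 Scope" or "2.3 Terms and definitions".
    ("heading", re.compile(r"^\d+(\.\d+)*\.?\s+\S")),
    # Captions such as "Figure 4: Overview" or "Fig. 2".
    ("figure_caption", re.compile(r"^(Figure|Fig\.)\s+\d+", re.IGNORECASE)),
    # Captions such as "Table 1 Results".
    ("table_caption", re.compile(r"^Table\s+\d+", re.IGNORECASE)),
    # Manually typed list items such as "- item", "* item", or "a) item".
    ("list_item", re.compile(r"^([-*]|\(?[a-z]\))\s+")),
]


def classify(text):
    """Return the label of the first rule matching the paragraph text."""
    stripped = text.strip()
    for label, pattern in RULES:
        if pattern.match(stripped):
            return label
    return "body"


for sample in ["2.3 Terms and definitions", "Figure 4: Overview", "Plain sentence."]:
    print(classify(sample), "|", sample)
```

Rules like these are heuristics and will misfire: a body sentence beginning "2019 was a good year" would be labelled a heading here. That is part of why restricting the set of supported documents helps.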

kjhughes