4

Is there any way to convert MS Word and PowerPoint data and metadata into XML using the pipeline feature of CPF?

Thanks in advance

Saahil Gupta
  • What versions of Word and PowerPoint do you want to convert? – Tyler Replogle May 04 '16 at 10:47
  • version 13 and above – Saahil Gupta May 04 '16 at 10:51
  • I have created the pipeline: Convert Pipeline Pipeline to test CPF /MarkLogic/cpf/actions/success-action.xqy /MarkLogic/cpf/actions/failure-action.xqy converting word to xml format http://marklogic.com/states/initial http://marklogic.com/states/done – Saahil Gupta May 04 '16 at 10:54
  • http://marklogic.com/states/error /convert-word-xml.xqy /convert-word-xml.xqy But I am stuck on what to write in the convert-word-xml.xqy file, which will actually do the conversion. – Saahil Gupta May 04 '16 at 10:57

2 Answers

5

There are already pipelines to handle processing the zipped XML form of MS Office. Attach the pipelines "Office OpenXML Extract" and "WordprocessingML Process" to your domain. You won't get the full upconversion to DocBook that you would from the binary (.doc) MS Word docs, but we do tidy up the XML somewhat and you can add your own transforms onto the end.

Mads Hansen
mholstege
3

The short answer is yes, you can convert to XML.

The longer answer is that it depends on the version. Any version from Word 2007 onward is already in an XML format: the file is just a zip archive containing several XML documents. The same is true for PowerPoint. The format of that XML can be tricky, and you will most likely want to convert it to a cleaner version.

Also, the latest version of Word has a new schema, so the format of the XML will be different.

You could start by seeing what xdmp:word-convert will give you. If that doesn't work well enough, you could write your own conversion using xdmp:zip-get. Since the Word file itself is a zip file, you can call that function on it to learn how the .docx is put together and decide how it should be converted.
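To get a feel for how the package is put together before writing the XQuery, it can help to poke at it outside MarkLogic first. The following is only a rough Python sketch using the standard library: it builds a tiny, hypothetical docx-like zip in memory (not a complete Word file), then lists its parts and reads the body text and core metadata. The part names word/document.xml and docProps/core.xml are the usual ones in a real .docx; the helper names (`build_sample_docx`, `extract_text`, and so on) and the sample values are made up for illustration. Inside MarkLogic, xdmp:zip-manifest and xdmp:zip-get would play the role that `zipfile` plays here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by WordprocessingML and the package core properties.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
DC_NS = "http://purl.org/dc/elements/1.1/"
CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"


def build_sample_docx():
    """Build a minimal, hypothetical docx-like package in memory.

    This is NOT a complete Word file; it only has the two parts we read.
    """
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello from Word</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    core_xml = (
        f'<cp:coreProperties xmlns:cp="{CP_NS}" xmlns:dc="{DC_NS}">'
        "<dc:title>Sample title</dc:title>"
        "<dc:creator>Sample author</dc:creator>"
        "</cp:coreProperties>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", document_xml)
        zf.writestr("docProps/core.xml", core_xml)
    return buf.getvalue()


def list_parts(data):
    """Return the names of all parts in the package (like a zip manifest)."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()


def read_part(data, name):
    """Pull one part out of the package and parse it as XML."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return ET.fromstring(zf.read(name))


def extract_text(data):
    """Return the text of each paragraph in the main document part."""
    root = read_part(data, "word/document.xml")
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs


def extract_metadata(data):
    """Return title and creator from the core properties part."""
    core = read_part(data, "docProps/core.xml")
    return {
        "title": core.findtext(f"{{{DC_NS}}}title"),
        "creator": core.findtext(f"{{{DC_NS}}}creator"),
    }


if __name__ == "__main__":
    sample = build_sample_docx()
    print(list_parts(sample))
    print(extract_text(sample))
    print(extract_metadata(sample))
```

For a real file, you would read its bytes from disk instead of calling `build_sample_docx`. A .pptx is laid out the same way, with slide content under ppt/slides/ rather than word/.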

For this to work with CPF, you will have to write your own action module and configure the CPF pipeline to run it as a step.

Mads Hansen
Tyler Replogle