0

I am trying to solve a very specific problem resulting from thousands of pages of MediaWiki wikicode which were incorrectly parsed into Confluence's XML syntax. Unfortunately, the original wikicode is no longer available, so the task at hand is to clean up the Confluence XML syntax.

Incorrect syntax:

<ac:link>
    <ri:page ri:content-title="File__image.jpg"/>
    <ac:plain-text-link-body><![CDATA[500px]]></ac:plain-text-link-body>
</ac:link>

Correct syntax:

<ac:image>
    <ri:attachment ri:filename="image.jpg"/>
</ac:image>

(The incorrect syntax was generated by the MediaWiki Converter plugin for the Universal Wiki Converter (UWC), which interpreted the wikicode [[File:image.jpg | 500px]] as the text "500px" linked to the page "File:image.jpg", instead of as the embedded image "image.jpg" with a width of 500px.)

I have had success with fixing the XML syntax using a chain of find/replace operations, but this is a very manual process using a set of kludges made with scripts/plugins in Sublime Text 3.

As a quick overview, here is the set of replacements I have been using:

  1. Remove <ac:link>
  2. Remove </ac:link>

  3. Replace <ri:page ri:content-title=" with <ac:image><ri:attachment ri:filename="

  4. Replace <ac:plain-text-link-body><![CDATA[500px]]></ac:plain-text-link-body> with </ac:image>

  5. Replace ri:filename="File__ with ri:filename="

Using my scripts does "work", but since I have demonstrated the value of fixing the Confluence XML syntax, I now have more time to invest in crafting a "good" solution using better methods. An ideal solution would be written in JavaScript (so that I can create a Greasemonkey script to fix the markup using Confluence's in-browser code editor) or perhaps in such a way that it could utilize Confluence's REST API.

I have some scripting experience (including PHP, JavaScript, and a little Python) and limited programming experience, and with my own research into existing scripts, I have not found an existing script which could be easily adapted to fit this purpose, so I am looking for suggestions/guidance on the best way to build a script to fit this need.

How might I build a script to chain together such a series of find/replace operations?

Daniel
  • 5
  • 3
  • Can you post full XML file or at least header and a few on top? There seems to be namespaces in document. – Parfait Jan 05 '16 at 02:33
  • Is the full XML file or header necessary? The need is to automate multiple find/replace operations, not parse XML. – Daniel Jan 05 '16 at 21:56
  • XSLT, the special purpose declarative language that can restructure/redesign XML files can run such find/replace needs and practically all general purpose languages including the ones you specified --PHP, Python-- can run XSLT (much like they can with SQL also a special purpose declarative language). – Parfait Jan 06 '16 at 01:13

0 Answers0