I am trying to solve a very specific problem resulting from thousands of pages of MediaWiki wikicode which were incorrectly parsed into Confluence's XML syntax. Unfortunately, the original wikicode is no longer available, so the task at hand is to clean up the Confluence XML syntax.
Incorrect syntax:
<ac:link>
<ri:page ri:content-title="File__image.jpg"/>
<ac:plain-text-link-body><![CDATA[500px]]></ac:plain-text-link-body>
</ac:link>
Correct syntax:
<ac:image>
<ri:attachment ri:filename="image.jpg"/>
</ac:image>
(The incorrect syntax was generated by the MediaWiki Converter plugin for the Universal Wiki Converter (UWC), which interpreted the wikicode [[File:image.jpg | 500px]] as the text "500px" linked to the page "File:image.jpg", instead of as the embedded image "image.jpg" with a width of 500px.)
I have had success with fixing the XML syntax using a chain of find/replace operations, but this is a very manual process using a set of kludges made with scripts/plugins in Sublime Text 3.
As a quick overview, here is the set of replacements I have been using:
- Remove <ac:link>
- Remove </ac:link>
- Replace <ri:page ri:content-title=" with <ac:image><ri:attachment ri:filename="
- Replace <ac:plain-text-link-body><![CDATA[500px]]></ac:plain-text-link-body> with </ac:image>
- Replace ri:filename="File__ with ri:filename="
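To illustrate, the chain above could be expressed as an ordered list of pattern/replacement pairs applied one after another. This is only a minimal sketch: the names `replacements` and `fixMarkup` are hypothetical, and the patterns assume the markup matches the example exactly (including a link body of exactly "500px").

```javascript
// Ordered [pattern, replacement] pairs mirroring the manual steps above.
// Assumption: the markup looks exactly like the example in this post.
const replacements = [
  [/<ac:link>/g, ''],
  [/<\/ac:link>/g, ''],
  [/<ri:page ri:content-title="/g, '<ac:image><ri:attachment ri:filename="'],
  [/<ac:plain-text-link-body><!\[CDATA\[500px\]\]><\/ac:plain-text-link-body>/g, '</ac:image>'],
  [/ri:filename="File__/g, 'ri:filename="'],
];

// Apply each replacement in order, feeding the result of one into the next.
function fixMarkup(xml) {
  return replacements.reduce(
    (text, [pattern, replacement]) => text.replace(pattern, replacement),
    xml
  );
}

// Example input: the broken markup from above, written on one line.
const broken =
  '<ac:link><ri:page ri:content-title="File__image.jpg"/>' +
  '<ac:plain-text-link-body><![CDATA[500px]]></ac:plain-text-link-body></ac:link>';

console.log(fixMarkup(broken));
```

Like the manual chain, this removes every <ac:link> and rewrites every <ri:page> it sees, so it is only safe on text that contains nothing but the broken image markup.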
My scripts do "work", but now that I have demonstrated the value of fixing the Confluence XML syntax, I have more time to invest in crafting a "good" solution using better methods. An ideal solution would be written in JavaScript (so that I can create a Greasemonkey script to fix the markup in Confluence's in-browser code editor), or written in such a way that it could use Confluence's REST API.
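One direction a more robust JavaScript version might take is a single pattern that matches the whole broken block and captures the file name, so that ordinary page links are left alone. The sketch below is an assumption-laden illustration, not a tested solution: `brokenImage` and `fixImages` are hypothetical names, and it assumes the page title always starts with "File__" and the link body is always a pixel width such as "500px". The width is discarded, as in the correct syntax shown earlier.

```javascript
// Hypothetical single-pass variant: match the entire mis-converted block
// and capture the file name. Whitespace between tags is treated as optional.
const brokenImage = new RegExp(
  '<ac:link>\\s*' +
  '<ri:page ri:content-title="File__([^"]+)"\\s*/>\\s*' +
  '<ac:plain-text-link-body><!\\[CDATA\\[\\s*\\d+px\\s*\\]\\]></ac:plain-text-link-body>\\s*' +
  '</ac:link>',
  'g'
);

// A replacer function is used so that a "$" in a file name is not
// interpreted as a special replacement pattern.
function fixImages(xml) {
  return xml.replace(
    brokenImage,
    (match, filename) =>
      `<ac:image><ri:attachment ri:filename="${filename}"/></ac:image>`
  );
}

// Example: one broken image followed by a link that should be kept.
const sample = [
  '<ac:link>',
  '<ri:page ri:content-title="File__image.jpg"/>',
  '<ac:plain-text-link-body><![CDATA[500px]]></ac:plain-text-link-body>',
  '</ac:link>',
  '<ac:link><ri:page ri:content-title="Home"/></ac:link>',
].join('\n');

console.log(fixImages(sample));
```

Because the pattern requires the "File__" prefix and a pixel-width body, links to ordinary pages pass through untouched.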
I have some scripting experience (PHP, JavaScript, and a little Python) but limited programming experience. In my own research I have not found an existing script that could be easily adapted for this purpose, so I am looking for suggestions or guidance on the best way to build one.
How might I build a script to chain together such a series of find/replace operations?