I'm playing around with the DBpedia extraction framework. It seems very nice, and I'm happily building ASTs of Wikipedia pages and extracting links (using WikiParser). However, although I get a nicely structured tree from the parse, the text nodes still contain a lot of formatting markup (e.g. the apostrophes used for italics and bold). For my purposes this is not helpful: I just want the plain text.
I can spend some time writing my own code to strip this out, but I presume something like this would be useful for DBpedia itself, and so already exists somewhere in the library. Am I right? If so, where is the functionality to strip a node down to bare text?
Otherwise, does anyone know of any other (preferably Scala) packages that strip out MediaWiki markup?
Edit
In response to a request for more detail, the following markup:
''An italicised '''bit''' of text'', <b>Some markup</b>
comes through DBpedia untouched, as the contents of a TextNode. I would like to be able either to strip it down to:
An italicised bit of text, Some markup
Or possibly to turn it into a more structured AST, with additional nodes representing each run of raw text, perhaps annotated (on each node) with the type of formatting to be applied (e.g. italics, bold, etc.).
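To make that second option concrete, here is a rough sketch of the kind of output I have in mind. None of this is DBpedia API: `Span` and `SpanParser` are names I made up, and it only understands apostrophe runs (2 = italic, 3 = bold, anything longer is crudely treated as both), not links, templates or HTML tags.

```scala
// Hypothetical types, not part of the DBpedia extraction framework.
final case class Span(text: String, italic: Boolean, bold: Boolean)

object SpanParser {
  // Runs of two or more apostrophes act as formatting toggles.
  private val QuoteRun = "'{2,}".r

  def parse(text: String): List[Span] = {
    val spans = List.newBuilder[Span]
    var italic = false
    var bold = false
    var pos = 0
    for (m <- QuoteRun.findAllMatchIn(text)) {
      // Emit the text before this toggle with the formatting currently in force.
      if (m.start > pos) spans += Span(text.substring(pos, m.start), italic, bold)
      (m.end - m.start) match {
        case 2 => italic = !italic
        case 3 => bold = !bold
        case _ =>
          // Five apostrophes mean bold italic; other lengths are handled the same way here.
          italic = !italic
          bold = !bold
      }
      pos = m.end
    }
    // Emit whatever follows the last toggle.
    if (pos < text.length) spans += Span(text.substring(pos), italic, bold)
    spans.result()
  }
}
```

So `SpanParser.parse("''An italicised '''bit''' of text'', plain")` would give four spans: "An italicised " (italic), "bit" (italic and bold), " of text" (italic) and ", plain" (no formatting).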
As it stands, the end result of a DBpedia parse is still quite full of markup.
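For reference, this is roughly the crude regex-based fallback I would write myself if nothing better exists. It is only a minimal sketch: `MarkupStripper` and `stripMarkup` are my own hypothetical names (nothing from the library), and it only handles apostrophe runs and simple inline HTML tags, so links, templates and the like would still come through.

```scala
object MarkupStripper {
  // Runs of two or more apostrophes: '' (italic), ''' (bold), ''''' (bold italic).
  private val QuoteRuns = "'{2,}".r
  // Simple inline HTML tags such as <b>, </b>, <i> or <br/>.
  private val HtmlTags = "</?[A-Za-z][^>]*>".r

  // Remove the apostrophe runs first, then the tags, leaving the text between them.
  def stripMarkup(text: String): String =
    HtmlTags.replaceAllIn(QuoteRuns.replaceAllIn(text, ""), "")
}
```

Calling `MarkupStripper.stripMarkup` on the example markup above returns `An italicised bit of text, Some markup`, but I'd much rather use something that understands the full syntax than maintain a pile of regexes.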
Hope that helps.