3

I'm playing around with the dbpedia extraction framework. It seems very nice, and I'm happily building ASTs of Wikipedia pages and extracting links (using WikiParser). However, although I get a nicely structured tree from the parse, the text nodes still contain lots of formatting markup (e.g. apostrophes used for italics and bold). For my purposes this markup is not helpful: I just want the plain text.

I could spend some time writing my own code to strip this out, but I presume something like this would be useful for dbpedia itself and therefore already exists somewhere in the library. Am I right? If so, where is the functionality to strip a node down to bare text?

Otherwise, does anyone know of any other (preferably Scala) packages that strip out MediaWiki markup?

Edit

In response to a request for greater detail: the following markup

''An italicised '''bit''' of text'', <b>Some markup</b>

comes through dbpedia untouched, as the contents of a TextNode. I would like to be able either to strip it down to:

 An italicised bit of text, Some markup

Or possibly to a more structured AST with additional nodes representing each section of raw text, perhaps annotated (on each node) with the type of formatting to be applied (e.g. italics, bold etc).

As is, the end result of a dbpedia parse is still quite full of markup.

Hope that helps.

Nemo
Alex Wilson
  • Interesting question. I've been so bold as to add the Java tag, assuming you wouldn't mind calling a Java library from Scala. Good luck! – Fred Foo Mar 04 '11 at 15:54
  • To help you get answers, you could paste a meaningful snippet of the text node content that you're trying to transform to plain text. Also, what exact object/structure do you get it in? Is it a dbpedia-specific structure or a Java one? If dbpedia, you could provide a link to the javadoc. – huynhjl Mar 04 '11 at 16:15
  • @huynhjl, the data structures from dbpedia that I'm traversing (and that I've created) are those summarised here: http://wiki.dbpedia.org/DeveloperDocumentation/WikiParser?v=hdy. Javadoc detail is linked off this page. – Alex Wilson Mar 04 '11 at 19:31

3 Answers

2

A quick look at the SimpleWikiParser source code on SourceForge suggests that, as of 1/29/2011, the parser handles the following entities:

  • comments
  • references
  • code blocks
  • internal links and external links
  • properties
  • tables.

Presumably all other wiki content ends up in TextNode objects. Given the size of the wiki markup feature set, stripping out the wiki syntax elements would take a non-trivial amount of work, let alone converting them further into structured elements.

For alternatives, or for code you can leverage, look at the Alternate Parsers page.

For a self-contained but imperfect solution, you could perform a series of regular expression replacements on node.text.
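As a rough illustration of the regular expression route, here is a minimal sketch. It assumes you already have the text of a node as a plain String; the object name WikiMarkupStripper and its rule list are hypothetical and not part of dbpedia. It only covers HTML comments, apostrophe emphasis and simple HTML tags, so treat it as a starting point rather than a complete MediaWiki stripper.

```scala
// Hypothetical helper, not part of the dbpedia extraction framework.
object WikiMarkupStripper {
  // Ordered (pattern, replacement) rules. Each one is a rough approximation
  // of a piece of MediaWiki syntax, not a faithful parser.
  private val rules: Seq[(String, String)] = Seq(
    "(?s)<!--.*?-->"    -> "", // HTML comments, possibly spanning lines
    "'{2,}"             -> "", // runs of 2+ apostrophes: ''italic'', '''bold'''
    "</?[A-Za-z][^>]*>" -> ""  // simple HTML tags such as <b> and </b>
  )

  // Apply every rule in order to the input text.
  def strip(text: String): String =
    rules.foldLeft(text) { case (acc, (pattern, replacement)) =>
      acc.replaceAll(pattern, replacement)
    }
}
```

Applied to the example from the question, WikiMarkupStripper.strip yields "An italicised bit of text, Some markup". Note that the apostrophe rule will also eat a genuine apostrophe that sits directly next to emphasis markup, which is one of the ways this approach is imperfect.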

huynhjl
  • Thanks for looking into this. I've had a look through the alternate parsers, but they mostly seem like aborted, incomplete projects. Possibly necessarily so, given that the main spec for MediaWiki markup parsing is the original PHP parser. I was hoping that dbpedia would have needed to grasp the nettle fully for their purposes - but I guess they're not currently relying on the main article text body as the major source of their semantic information... It's looking like I need to go for self-contained but imperfect. – Alex Wilson Mar 04 '11 at 22:07
1

The gwtwiki (bliki) project converts MediaWiki markup to PDF, HTML and other formats. It is a fairly complete framework for parsing and reformatting MediaWiki text.

Erik
0

You can start this process by using WikiUtil.removeWikiEmphasis and adding a few extra rules.

In my case, I map text nodes to their toWikiText output and link nodes to their destination name.

// Cases from a pattern match over the parsed nodes:
// text nodes become their wiki text, link nodes become their destination.
case text: TextNode => text.toWikiText
case link: LinkNode =>
  link match {
    case external: ExternalLinkNode => external.destination.toString
    case internal: InternalLinkNode => internal.destination.decodedWithNamespace
    case inter: InterWikiLinkNode   => inter.destination.decodedWithNamespace
  }
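To show how cases like these could fit into a full traversal, here is a self-contained sketch. The node classes below are simplified stand-ins that I made up for illustration (they borrow names from the snippet above but are NOT the dbpedia classes), and removeEmphasis is a crude local approximation rather than a call to WikiUtil.removeWikiEmphasis.

```scala
// Simplified stand-in node types; NOT the dbpedia extraction framework classes.
sealed trait Node
final case class TextNode(text: String) extends Node
final case class LinkNode(destination: String) extends Node
final case class ParentNode(children: List[Node]) extends Node

object PlainText {
  // Crude approximation of emphasis removal: delete runs of 2+ apostrophes.
  private def removeEmphasis(s: String): String = s.replaceAll("'{2,}", "")

  // Depth-first walk that concatenates the plain text of every node.
  def render(node: Node): String = node match {
    case TextNode(text)        => removeEmphasis(text)
    case LinkNode(destination) => destination
    case ParentNode(children)  => children.map(render).mkString
  }
}
```

For example, rendering a ParentNode holding TextNode("See '''"), LinkNode("Scala") and TextNode("''' for details") gives "See Scala for details". With the real library you would match on its own node types and add further cleanup rules as needed.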
tommy chheng