I am using the Wikipedia dataset to run MapReduce jobs. The dataset I am using is the "Wikipedia Wiki namespace" one from here. The data in the bz2 file looks like this:

REVISION 724 234015 Wikipedia:Adding_Wikipedia_articles_to_Nupedia 2001-03-28T22:33:49Z ip:Larry_Sanger ip:Larry_Sanger
CATEGORY
IMAGE
MAIN Larry_Sanger LMS Adding_Nupedia_articles_to_Wikipedia Jimbo_Wales Nupedia Wikipedia
TALK
USER
USER_TALK
OTHER
EXTERNAL http://www.nupedia.com/write.shtml http://www.nupedia.com/policy.shtml http://www.nupedia.com/newsystem/signup.phtml http://www.nupedia.com/newsystem/writearticle.phtml?instr=on http://www.nupedia.com/editors.phtml
TEMPLATE
COMMENT *
MINOR 0
TEXTDATA 685

REVISION 724 431753 Wikipedia:Adding_Wikipedia_articles_to_Nupedia 2002-05-19T17:36:09Z Eclecticology 372
CATEGORY
IMAGE
MAIN Larry_Sanger LMS LMS Adding_Nupedia_articles_to_Wikipedia Jimbo_Wales Nupedia Wikipedia Mores Adding_Wikipedia_articles_to_Nupedia/Help
TALK
USER
USER_TALK
OTHER
EXTERNAL http://www.nupedia.com/write.shtml http://www.nupedia.com/policy.shtml http://chalkboard.nupedia.com http://www.nupedia.com/newsystem/signup.phtml http://www.nupedia.com/newsystem/writearticle.phtml?instr=on http://www.nupedia.com/editors.phtml
TEMPLATE
COMMENT "mores" linked; -/Talk
MINOR 1
TEXTDATA 738

Basically, I want to transform each revision into one row, so that a revision and all of its other details (CATEGORY, IMAGE, MAIN, and so on) end up on a single line. I tried following something similar to this, but it's not working. Could someone guide me on how to go about it?

Warlord

1 Answer

The easiest (though probably not the most elegant) way is to preprocess the data. Based on your link we're talking about 18 GB, which is doable. In any case, you have to separate the data from the schema (it seems the data contains the field names too).
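As a rough illustration of that preprocessing step, here is a minimal Python sketch. It assumes records are separated by blank lines (or start with a new REVISION line), that every line begins with one of the field names seen in the sample, and that a tab-separated row is an acceptable output format. The names `FIELDS`, `records`, `to_row` and `convert` are made up for this example; adapt them to your data.

```python
import bz2
import sys

# Field names in the order they appear in the sample records (assumed fixed).
FIELDS = ["REVISION", "CATEGORY", "IMAGE", "MAIN", "TALK", "USER", "USER_TALK",
          "OTHER", "EXTERNAL", "TEMPLATE", "COMMENT", "MINOR", "TEXTDATA"]


def records(lines):
    """Yield one dict per revision from an iterable of text lines."""
    current = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current record.
            if current:
                yield current
                current = {}
            continue
        tag, _, value = line.partition(" ")
        if tag not in FIELDS:
            continue  # skip anything that is not a known field line
        if tag == "REVISION" and current:
            # A new REVISION also closes the previous record.
            yield current
            current = {}
        # Tabs are the output delimiter, so replace any inside values.
        current[tag] = value.replace("\t", " ")
    if current:
        yield current


def to_row(record):
    """Drop the field names and join the values into one tab-separated row."""
    return "\t".join(record.get(name, "") for name in FIELDS)


def convert(in_path, out_path):
    """Stream the bz2 file so the whole dump never has to fit in memory."""
    with bz2.open(in_path, "rt", encoding="utf-8", errors="replace") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for rec in records(src):
            dst.write(to_row(rec) + "\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    convert(sys.argv[1], sys.argv[2])
```

Each output line then holds one revision with 13 tab-separated columns in the order of `FIELDS`, with the schema kept outside the data, so the file can be split line by line as MapReduce input.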

A nicer solution is to write your own loader for this type of data. Here you'll find an example project and a tutorial: http://help.mortardata.com/technologies/pig/write_your_own

kecso