4

How would I get a subset (say, 100MB) of Wikipedia's pages? I've found you can get the whole dataset as XML, but it's more like 1 or 2 GB; I don't need that much.

I want to experiment with implementing a map-reduce algorithm.

Having said that, if I could just find 100 megs worth of textual sample data from anywhere, that would also be good. E.g. the Stack Overflow database, if it's available, would possibly be a good size. I'm open to suggestions.

Edit: Any that aren't torrents? I can't get those at work.

Chris

7 Answers

4

The Stack Overflow database is available for download.

Alex
3

Chris, you could just write a small program to hit the Wikipedia "Random Page" link until you get 100MB of web pages: http://en.wikipedia.org/wiki/Special:Random. You'll want to discard any duplicates you might get, and you might also want to limit the number of requests you make per minute (though some fraction of the articles will be served up by intermediate web caches, not Wikipedia servers). But it should be pretty easy.
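
A minimal sketch of that approach in Python, assuming the third-party requests library; the output file name, the one-request-per-second delay, and the 100MB target are arbitrary placeholders:

    import time

    import requests  # third-party; assumed installed (pip install requests)

    RANDOM_URL = "https://en.wikipedia.org/wiki/Special:Random"
    TARGET_BYTES = 100 * 1024 * 1024  # stop once roughly 100MB has been saved


    def collect_random_pages(out_path="wiki_sample.html", delay_seconds=1.0):
        seen_urls = set()  # discard duplicate articles by their final URL
        total = 0
        headers = {"User-Agent": "wiki-sample-collector (experiment)"}  # be polite
        with open(out_path, "w", encoding="utf-8") as out:
            while total < TARGET_BYTES:
                # Special:Random redirects to a random article; requests follows
                # the redirect, so resp.url is the article's real URL.
                resp = requests.get(RANDOM_URL, headers=headers)
                if resp.url in seen_urls:
                    continue
                seen_urls.add(resp.url)
                out.write(resp.text)
                total += len(resp.text.encode("utf-8"))
                time.sleep(delay_seconds)  # crude per-request rate limit


    if __name__ == "__main__":
        collect_random_pages()

As a rough back-of-the-envelope estimate: at around 100KB of HTML per rendered article, 100MB is on the order of a thousand pages, so with a one-second delay the whole run should finish in well under an hour.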

Jim Ferrans
  • You know, that's not a bad idea. It would give a nice subset. I'm worried that it'll simply take forever, that's my only issue. – Chris Aug 24 '09 at 06:28
1

One option is to download the entire Wikipedia dump and then use only part of it. You can either decompress the whole thing and then use a simple script to split the file into smaller files (e.g. here), or, if you are worried about disk space, you can write a script that decompresses and splits on the fly, and stop the decompression at any stage you want. Wikipedia Dump Reader can be your inspiration for decompressing and processing on the fly, if you're comfortable with Python (look at mparser.py).
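
A rough sketch of the decompress-and-stop-early idea, assuming a pages-articles bz2 dump has already been downloaded; the file names and the 100MB cut-off are placeholders:

    import bz2

    DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"  # assumed dump file name
    OUT_PATH = "wiki_subset.xml"
    LIMIT_BYTES = 100 * 1024 * 1024  # stop after ~100MB of decompressed XML

    written = 0
    # bz2.open streams the archive, so the full dump never has to be
    # decompressed to disk.
    with bz2.open(DUMP_PATH, "rt", encoding="utf-8") as dump, \
            open(OUT_PATH, "w", encoding="utf-8") as out:
        for line in dump:
            out.write(line)
            written += len(line.encode("utf-8"))
            # Once past the limit, stop at the next </page> so pages stay whole.
            if written >= LIMIT_BYTES and line.strip() == "</page>":
                break

The truncated file still lacks the closing </mediawiki> tag, which is usually fine for feeding a map-reduce job but worth knowing if you try to parse it as a complete XML document.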

If you don't want to download the entire thing, you're left with the option of scraping. The Export feature might be helpful for this, and the wikipediabot was also suggested in this context.
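
For the Export route, a small sketch: Special:Export returns dump-style XML for the specific titles you ask for, so you can pull a hand-picked sample without a full crawl. The titles and the requests dependency below are just illustrative assumptions:

    import requests  # third-party; assumed installed

    # Placeholder titles; substitute any list of articles you want to sample.
    titles = ["MapReduce", "Apache Hadoop"]
    headers = {"User-Agent": "wiki-export-sample (experiment)"}

    for title in titles:
        slug = title.replace(" ", "_")
        # Special:Export/<title> wraps the page's wikitext in dump-style XML.
        resp = requests.get("https://en.wikipedia.org/wiki/Special:Export/" + slug,
                            headers=headers)
        with open(slug + ".xml", "w", encoding="utf-8") as out:
            out.write(resp.text)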

daphshez
  • Yeah, I'm in Australia; our internet download limits kind of preclude downloading the whole lot. Having said that, we're all getting fibre-to-the-home broadband (in a million years), and it'll send our country broke, so I could always wait for that? /rant – Chris Aug 24 '09 at 05:10
  • Right. Then look at the export feature. If I understand it correctly, it's less heavy on the servers and on bandwidth than crawling. – daphshez Aug 24 '09 at 07:41
0

If you want a copy of the Stack Overflow database, you can get it from the Creative Commons data dump.

Out of curiosity, what are you using all this data for?

Mike Cooper
0

You could use a web crawler and scrape 100MB of data?

0

There are a lot of Wikipedia dumps available. Why do you want to choose the biggest one (the English wiki)? The Wikinews archives are much smaller.

Danubian Sailor
0

One smaller subset of Wikipedia articles is the 'meta' wiki. It is in the same XML format as the full article dataset, but much smaller (around 400MB as of March 2019), so it can be used for software validation (for example, testing GenSim scripts).

https://dumps.wikimedia.org/metawiki/latest/

You want to look for any files with the -articles.xml.bz2 suffix.
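
If the aim is exercising GenSim-style scripts, a sketch along these lines may be enough. The file name below just follows the usual <wiki>-latest-pages-articles.xml.bz2 naming, so confirm it against the directory listing above, and gensim itself is assumed to be installed:

    from gensim.corpora import WikiCorpus  # third-party; assumed installed

    # Assumed file name; check the dumps directory linked above.
    DUMP_PATH = "metawiki-latest-pages-articles.xml.bz2"

    # Passing dictionary={} skips building a vocabulary, since we only want
    # the plain article text out of the dump.
    corpus = WikiCorpus(DUMP_PATH, dictionary={})

    # Print the first few tokens of the first five articles as a sanity check.
    for i, tokens in enumerate(corpus.get_texts()):
        print(tokens[:10])
        if i >= 4:
            break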

Vineet Bansal