4

How would I get a subset (say, 100MB) of Wikipedia's pages? I've found you can get the whole dataset as XML, but it's more like 1 or 2 GB; I don't need that much.

I want to experiment with implementing a map-reduce algorithm.

Having said that, if I could just find 100 megs worth of textual sample data from anywhere, that would also be good. E.g. the Stack Overflow database, if it's available, would possibly be a good size. I'm open to suggestions.

Edit: Any that aren't torrents? I can't get those at work.

Chris

7 Answers

4

The Stack Overflow database is available for download.

Alex
3

Chris, you could just write a small program to hit the Wikipedia "Random Page" link until you get 100MB of web pages: http://en.wikipedia.org/wiki/Special:Random. You'll want to discard any duplicates you might get, and you might also want to limit the number of requests you make per minute (though some fraction of the articles will be served up by intermediate web caches, not Wikipedia servers). But it should be pretty easy.
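
A minimal sketch of that approach in Python, assuming the third-party requests library; the output file name, the one-request-per-second delay, and the 100MB target are arbitrary placeholders:

    import time

    import requests  # third-party; assumed installed (pip install requests)

    RANDOM_URL = "https://en.wikipedia.org/wiki/Special:Random"
    TARGET_BYTES = 100 * 1024 * 1024  # stop once roughly 100MB has been saved


    def collect_random_pages(out_path="wiki_sample.html", delay_seconds=1.0):
        seen_urls = set()  # discard duplicate articles by their final URL
        total = 0
        headers = {"User-Agent": "wiki-sample-collector (experiment)"}  # be polite
        with open(out_path, "w", encoding="utf-8") as out:
            while total < TARGET_BYTES:
                # Special:Random redirects to a random article; requests follows
                # the redirect, so resp.url is the article's real URL.
                resp = requests.get(RANDOM_URL, headers=headers)
                if resp.url in seen_urls:
                    continue
                seen_urls.add(resp.url)
                out.write(resp.text)
                total += len(resp.text.encode("utf-8"))
                time.sleep(delay_seconds)  # crude per-request rate limit


    if __name__ == "__main__":
        collect_random_pages()

As a rough back-of-the-envelope estimate: at around 100KB of HTML per rendered article, 100MB is on the order of a thousand pages, so with a one-second delay the whole run should finish in well under an hour.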

Jim Ferrans
  • You know, that's not a bad idea. It would give a nice subset. I'm worried that it'll simply take forever, that's my only issue. – Chris Aug 24 '09 at 06:28
1

One option is to download the entire Wikipedia dump and then use only part of it. You can either decompress the whole thing and then use a simple script to split the file into smaller files (e.g. here), or, if you are worried about disk space, you can write a script that decompresses and splits on the fly, and stop the decompression at any stage you want. Wikipedia Dump Reader can be your inspiration for decompressing and processing on the fly, if you're comfortable with Python (look at mparser.py).
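
A rough sketch of the decompress-and-stop-early idea, assuming a pages-articles bz2 dump has already been downloaded; the file names and the 100MB cut-off are placeholders:

    import bz2

    DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"  # assumed dump file name
    OUT_PATH = "wiki_subset.xml"
    LIMIT_BYTES = 100 * 1024 * 1024  # stop after ~100MB of decompressed XML

    written = 0
    # bz2.open streams the archive, so the full dump never has to be
    # decompressed to disk.
    with bz2.open(DUMP_PATH, "rt", encoding="utf-8") as dump, \
            open(OUT_PATH, "w", encoding="utf-8") as out:
        for line in dump:
            out.write(line)
            written += len(line.encode("utf-8"))
            # Once past the limit, stop at the next </page> so pages stay whole.
            if written >= LIMIT_BYTES and line.strip() == "</page>":
                break

The truncated file still lacks the closing </mediawiki> tag, which is usually fine for feeding a map-reduce job but worth knowing if you try to parse it as a complete XML document.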

If you don't want to download the entire thing, you're left with the option of scraping. The Export feature might be helpful for this, and the wikipediabot was also suggested in this context.
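
For the Export route, a small sketch: Special:Export returns dump-style XML for the specific titles you ask for, so you can pull a hand-picked sample without a full crawl. The titles and the requests dependency below are just illustrative assumptions:

    import requests  # third-party; assumed installed

    # Placeholder titles; substitute any list of articles you want to sample.
    titles = ["MapReduce", "Apache Hadoop"]
    headers = {"User-Agent": "wiki-export-sample (experiment)"}

    for title in titles:
        slug = title.replace(" ", "_")
        # Special:Export/<title> wraps the page's wikitext in dump-style XML.
        resp = requests.get("https://en.wikipedia.org/wiki/Special:Export/" + slug,
                            headers=headers)
        with open(slug + ".xml", "w", encoding="utf-8") as out:
            out.write(resp.text)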

daphshez
  • Yeah, I'm in Australia; our internet download limits kind of preclude downloading the whole lot. Having said that, we're all getting fibre-to-the-home broadband (in a million years), and it'll send our country broke, so I could always wait for that? /rant – Chris Aug 24 '09 at 05:10
  • Right. Then look at the export feature. If I understand it correctly, it's less heavy on the servers and on bandwidth than crawling. – daphshez Aug 24 '09 at 07:41
0

If you want a copy of the Stack Overflow database, you can get it from the Creative Commons data dump.

Out of curiosity, what are you using all this data for?

Mike Cooper
0

You could use a web crawler and scrape 100MB of data?

0

There are a lot of Wikipedia dumps available. Why do you want to choose the biggest one (the English wiki)? The Wikinews archives are much smaller.

Danubian Sailor
0

One smaller subset of Wikipedia articles is the 'meta' wiki. It is in the same XML format as the full article dataset, but much smaller (around 400MB as of March 2019), so it can be used for software validation (for example, testing GenSim scripts).

https://dumps.wikimedia.org/metawiki/latest/

You want to look for any files with the -articles.xml.bz2 suffix.
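
If the aim is exercising GenSim-style scripts, a sketch along these lines may be enough. The file name below just follows the usual <wiki>-latest-pages-articles.xml.bz2 naming, so confirm it against the directory listing above, and gensim itself is assumed to be installed:

    from gensim.corpora import WikiCorpus  # third-party; assumed installed

    # Assumed file name; check the dumps directory linked above.
    DUMP_PATH = "metawiki-latest-pages-articles.xml.bz2"

    # Passing dictionary={} skips building a vocabulary, since we only want
    # the plain article text out of the dump.
    corpus = WikiCorpus(DUMP_PATH, dictionary={})

    # Print the first few tokens of the first five articles as a sanity check.
    for i, tokens in enumerate(corpus.get_texts()):
        print(tokens[:10])
        if i >= 4:
            break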

Vineet Bansal