What?:
I'm trying to get a page-to-page link map (matrix) of Wikipedia pages by page_id in the following format:

from1 to1 to2 to3 ...
from2 to1 to2 to3 ...
...

Why?:
I'm looking for a data set (pages from Wikipedia) to try out PageRank.

Problem:
At dumps.wikimedia.org it is possible to download pages-articles.xml, an XML file in this kind of format:

<page>
  <title>...</title>
  <id>...</id>          // pageid
  <text>...</text>
</page>

I will use this file for retrieving the articles (text). There is also the base per-page data (page.sql), which contains some details about pages by page_id, and the last file that seems relevant to me is pagelinks.sql, which contains page-to-page link records. The problem is that the pagelinks table has only the following fields: pl_from, pl_namespace and pl_title, so the link target is identified by namespace and title rather than by page_id.

Idea: Create a temporary database, import the page and pagelinks tables, and build the matrix from the pagelinks table by looking up the page_id for each pl_title. Possible solution:

SELECT pl_from, GROUP_CONCAT(page_id SEPARATOR ' ') FROM pagelinks
    JOIN page ON 
        pl_title = page_title AND pl_namespace = page_namespace
GROUP BY pl_from

Or, to get a map of "backlinks" (to1 from1 from2 from3 ..., instead of from1 to1 to2 to3 ...):

SELECT page_id, GROUP_CONCAT(pl_from SEPARATOR ' ') FROM pagelinks
    JOIN page ON 
        pl_title = page_title AND pl_namespace = page_namespace
GROUP BY page_id
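
For illustration, the same join can be sketched outside the database as an in-memory hash join in Python. This is only a minimal sketch, not a tested pipeline: the function names `build_link_maps` and `format_rows` are hypothetical, the inputs are assumed to be already-parsed tuples, and the lookup dictionary may need a lot of memory for the full dump.

```python
from collections import defaultdict


def build_link_maps(pages, pagelinks):
    """Join pagelinks to page on (namespace, title), like the SQL above.

    pages:     iterable of (page_id, page_namespace, page_title)
    pagelinks: iterable of (pl_from, pl_namespace, pl_title)
    Returns (outlinks, backlinks), both dicts mapping page_id -> list of page_ids.
    """
    # Hash side of the join: (namespace, title) -> page_id.
    id_by_key = {(ns, title): page_id for page_id, ns, title in pages}

    outlinks = defaultdict(list)   # from -> [to1, to2, ...]
    backlinks = defaultdict(list)  # to -> [from1, from2, ...]
    for pl_from, pl_ns, pl_title in pagelinks:
        target = id_by_key.get((pl_ns, pl_title))
        if target is None:
            # Link to a page that is not in the page table;
            # the inner JOIN in the SQL version drops these rows too.
            continue
        outlinks[pl_from].append(target)
        backlinks[target].append(pl_from)
    return outlinks, backlinks


def format_rows(link_map):
    """Render a link map as 'key id1 id2 ...' lines, sorted by key."""
    return [" ".join(map(str, [key, *ids])) for key, ids in sorted(link_map.items())]
```

Calling `format_rows(outlinks)` gives the "from1 to1 to2 to3 ..." lines and `format_rows(backlinks)` gives the "to1 from1 from2 from3 ..." lines, so one pass over pagelinks produces both maps.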

Question:
Is there a place where I can get this kind of page-to-page link matrix by page_id, so that I don't need to create it on my own? If not, is there a faster way to get this kind of matrix than the solution I've outlined?

LihO

2 Answers

I think importing both tables into a database and using that is probably the easiest way to do this (and you don't need pages-articles.xml for that).

Another option would be to manually parse the two SQL files without importing them into a database. I actually created a library to do exactly that in .NET.
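
As a rough idea of what parsing the files by hand involves, here is a minimal Python sketch (not the .NET library mentioned above). It assumes the dump stores rows as single-line `INSERT INTO ... VALUES (...),(...);` statements in the usual mysqldump style; the function name `parse_insert_values` is hypothetical, and which column holds which field depends on the table schema in your dump.

```python
def parse_insert_values(line):
    """Yield each row of one `INSERT INTO ... VALUES (...),(...);` line.

    Every row is a list of strings; an unquoted NULL becomes None.
    Lines that are not INSERT statements yield nothing.
    """
    start = line.find(" VALUES ")
    if not line.startswith("INSERT INTO") or start == -1:
        return
    i, n = start + len(" VALUES "), len(line)
    row, field = [], []
    in_row = in_str = quoted = False

    def finish():
        value = "".join(field)
        return value if quoted else (None if value == "NULL" else value)

    while i < n:
        ch = line[i]
        if in_str:
            if ch == "\\" and i + 1 < n:
                # Simplification: keep the escaped character as-is.
                # Correct for \' and \\, but not for escapes such as \n.
                field.append(line[i + 1])
                i += 2
                continue
            if ch == "'":
                in_str = False
            else:
                field.append(ch)  # parentheses and commas inside strings land here
        elif ch == "'":
            in_str = quoted = True
        elif ch == "(" and not in_row:
            in_row, row, field, quoted = True, [], [], False
        elif ch == "," and in_row:
            row.append(finish())
            field, quoted = [], False
        elif ch == ")" and in_row:
            row.append(finish())
            yield row
            in_row = False
        elif in_row:
            field.append(ch)
        i += 1
```

Feeding every line of page.sql and pagelinks.sql through this generator and converting the numeric columns with `int()` would give the tuples needed to build the link map without a database import.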

svick
  • Then maybe you need some indexes? I'm not sure whether the sql files create them. – svick Oct 01 '12 at 18:35
  • There are indexes already, I think it's limited by speed of my disc... too bad that `page_id` is not part of `pagelinks` table already... never mind, thank you for help anyway :) – LihO Oct 01 '12 at 18:44

Well, to parse the XML files, I guess you could try out this link:

http://haselgrove.id.au/wikipedia.htm

It describes an approach, plus a couple of pre-made .m (MATLAB/Octave) files that can clean your dump. There is also a pre-processed dump in a .txt file, though it's pretty old (2009 or so).

Ankit Vadehra