
I downloaded the Wikipedia pagelinks dataset (available from the Wikimedia dumps at http://dumps.wikimedia.org/enwiki/20140102/). I want to run the PageRank algorithm on it, but I am unable to parse the data because it is not very well documented.

Below is a sample of the downloaded dataset. The fields given are p1_from, p1_namespace, and p1_title. From what I found online, p1_namespace is a number that denotes the type of page, but I do not know what p1_from is. To implement PageRank, I need the number of articles that link to a particular article. Judging by its name, p1_from sounds like the number of links going out of that article rather than into it. Is this the case? If so, how can I reverse the graph from this data to get the correct counts?

DROP TABLE IF EXISTS `pagelinks`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `pagelinks` (
  `pl_from` int(8) unsigned NOT NULL DEFAULT '0',
  `pl_namespace` int(11) NOT NULL DEFAULT '0',
  `pl_title` varbinary(255) NOT NULL DEFAULT '',
  UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`),
  KEY `pl_namespace` (`pl_namespace`,`pl_title`,`pl_from`)
) ENGINE=InnoDB DEFAULT CHARSET=binary;
/*!40101 SET character_set_client = @saved_cs_client */;

--
-- Dumping data for table `pagelinks`
--

/*!40000 ALTER TABLE `pagelinks` DISABLE KEYS */;
INSERT INTO `pagelinks` VALUES (10,0,'Computer_accessibility'),(12,0,'-ism'),(12,0,'1848_Revolution'),(12,0,'1917_October_Revolution'),

(12,0,'1919_United_States_anarchist_bombings'),(12,0,'19th_century_philosophy'),
(12,0,'6_February_1934_crisis'),(12,0,'A._K._Press'),(12,0,'A._S._Neill'),(12,0,'AK_Press'),(12,0,'A_Greek–English_Lexicon'),(12,0,'A_Language_Older_Than_Words'),
(12,0,'A_Vindication_of_Natural_Society'),(12,0,'A_las_Barricadas'),(12,0,'Abbie_Hoffman'),(12,0,'Absolute_idealism'),(12,0,'Abstentionism'),(12,0,'Action_theory_(philosophy)'),
(12,0,'Adam_Smith'),(12,0,'Adolf_Brand'),(12,0,'Adolf_Hitler'),(12,0,'Adolphe_Thiers'),(12,0,'Aesthetic_emotions'),(12,0,'Aesthetics'),(12,0,'Affinity_group'),(12,0,'Affinity_groups'),
(12,0,'African_philosophy'),(12,0,'Against_Civilization:_Readings_and_Reflections'),(12,0,'Against_His-Story,_Against_Leviathan'),(12,0,'Age_of_Enlightenment'),(12,0,'Agriculturalism'),
(12,0,'Agriculture'),(12,0,'Al-Ghazali'),(12,0,'Alain_Badiou'),(12,0,'Alain_de_Benoist'),(12,0,'Albert_Camus'),(12,0,'Albert_Libertad'),(12,0,'Albert_Meltzer'),(12,0,'Aleister_Crowley'),
(12,0,'Alex_Comfort'),(12,0,'Alexander_Berkman'),(12,0,'Alexandre_Christoyannopoulos'),(12,0,'Alexandre_Skirda'),(12,0,'Alfredo_M._Bonanno')
sparkonhdfs

1 Answer

“I am unable to parse the data because it is not very well documented.”

The SQL dumps contain data taken directly from the MySQL tables that MediaWiki uses. Those tables are documented on mediawiki.org; in your case, it's the pagelinks table.

“The fields given are p1_from, p1_namespace, and p1_title.”

No, that's not a 1 (the number one), it's an l (the letter L): pl is short for pagelinks.

“I do not know what p1_from is.”

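From the documentation, that's “Key to the page_id of the page containing the link.” So each row describes one link from the page with ID pl_from to the page identified by pl_namespace and pl_title. To find out the name of the page where the link comes from, you will need the page table.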
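Since pl_from identifies the linking page and (pl_namespace, pl_title) identifies the target, counting rows per target should give the number of pages linking to it. Here is a minimal Python sketch of that idea, assuming the dump is read line by line and every tuple has the three-column shape shown in the question. The names `ROW_RE`, `parse_pagelinks`, and `inbound_counts` are hypothetical, and the last two tuples in `sample` are made up for illustration.

```python
import re
from collections import Counter

# Matches one (pl_from, pl_namespace, 'pl_title') tuple. Titles may contain
# backslash-escaped characters (for example \') inside the quoted string.
ROW_RE = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'\)")


def parse_pagelinks(line):
    """Yield (from_id, namespace, title) for every tuple found on a line."""
    for from_id, namespace, title in ROW_RE.findall(line):
        # Drop the escaping backslash in front of quotes and backslashes.
        # Other escape sequences (such as \n) are not interpreted here.
        yield int(from_id), int(namespace), re.sub(r"\\(.)", r"\1", title)


def inbound_counts(lines):
    """Count how many rows (linking pages) point at each (namespace, title)."""
    counts = Counter()
    for line in lines:
        for _from_id, namespace, title in parse_pagelinks(line):
            counts[(namespace, title)] += 1
    return counts


# The first line is taken from the dump; the tuples with pl_from = 25 are
# invented for this example and are not real data.
sample = [
    "INSERT INTO `pagelinks` VALUES (10,0,'Computer_accessibility'),(12,0,'-ism'),",
    "(12,0,'Adam_Smith'),(25,0,'Adam_Smith'),(25,0,'O\\'Reilly_Media')",
]
counts = inbound_counts(sample)
print(counts[(0, "Adam_Smith")])  # prints 2
```

The keys of `counts` are still (namespace, title) pairs, while pl_from is a page ID, so building a graph keyed consistently on one or the other would still need the page table to translate between the two.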

svick