I'll be the first to admit I'm not the smartest person in the world, but I'm at a loss on this one.

I want to have access to the words and details of each word of the English Wiktionary project. I saw they do data dumps, and got all excited. That lasted all of 3 seconds. Since then, all I've done is swear and smoke in bouts of frustration and irritation.

I'm using Windows 7.
I've installed the latest version of XAMPP (64 bit, installed at root).
I've installed the latest Java JDK.
I've set XAMPP and the JDK to run as admin.
I've grabbed the article-pages files.
I've decompressed them.
I've used the mwxml2sql tool.
I couldn't get it to run (no matter what settings/flags I tried).
I used the GUI version of the mwxml2sql tool.
It ran - and then errored at 4300 rows.
The error was about duplicate keys in name_title.

I've looked at wikokit - but that seems a few years behind.

I'm at a loss.

I've looked at the data that did get into the DB before the dupe-key error.
I can see some data in Blob format.
How am I meant to access that information via php?

Is there not a decent (as in "idiots" :D) guide for this?
Do I really have to grab all the files, install a wiki, parse the files?
How am I meant to handle the dupe key issues (not like I can open up the sql file and find the relevant line!)?

So, please - has anyone done this or know of a way to do it?
The only thing I can think of is to actually try and scrape the site - which I'd rather not do (and nor would the wiki group).

In case it is relevant - I'm specifically after the word-form, the PoS, the pronunciations, the definitions, any phrases and related words. Things like etymology etc. would be nice, but aren't as important.

If it is suggested, yes, I've looked at WordNet (managed to find a mysql dump, and got that working). I've also seen resources like MRC and the CMU dict - but none have the right permissions. That's why Wiktionary looked so attractive. But it seems the format/dumps are far from friendly :(

So, any help or ideas? Alternative sources, guides, walk-throughs ... all would help.
Alternatively, if you can tell me what is causing the error and how to get around it, and how to access the word data, that would be superb.

Sincerely yours - frustrated.

1 Answer

"I've looked at wikokit - but that seems a few years behind."

No, the wikokit project is alive :) Link: https://github.com/componavt/wikokit

You can download the parsed English Wiktionary database from http://whinger.krc.karelia.ru/soft/wikokit/index.html. Upload the SQL dump file to MySQL and play with the definitions, synonyms, and translations extracted from the English Wiktionary.

  • Hi there Andrew - and thank you for the reply. || I'm trying to use the 20150413.sql file. I've created an empty db, disabled Auto Commit, Unique Checks and Foreign keys, I've upped the mem limits for innodb. But it errored (something about inserting null). || I'm trying it again to see if I can copy out the full error message. – often frustrated Oct 13 '15 at 09:40
  • Okay, 2 hours; 10.1Gb in (4.5Gb data/5.6Gb index) - it crashed out 11.9Gb (5.2Gb/6.7Gb) last time ... so close ... as soon as it errors, I'll post the error (I cannot believe I have to spend 2 Hours just to get an error!) – often frustrated Oct 13 '15 at 12:15
  • Okay - so cmd crashed @ 11.5Gb (5/6.5) ... no error message, just a ton of bleeps. Is there no alternative method than this? Though I want this data, I may be better off scraping Wiktionary than wasting days faffing around with huge data globs that simply result in errors. – often frustrated Oct 13 '15 at 12:42
  • After ages, the cmd window has thrown some visible errors... the topmost ones read as :: ERROR 1064 <42000>: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near "text" VALUES <32538267,'{{shortcut:WT:RE:/la/L&S}}\nThe following words were tak' at line 1|| The bottom error read as :: Error 1231 <42000>: Variable 'sql_notes' can't be set to the value of 'NULL' :: So what the blazes is causing the errors????? – often frustrated Oct 13 '15 at 13:16
  • Sigh. Even using the -f force option, it still failed, at about the same point! But instead of the previous errors, I get "ERROR 2006 at line 6101: MySQL server has gone away". ... That's the third time it's failed. I can only assume there is something wrong with the SQL from the wikokit site :( – often frustrated Oct 13 '15 at 20:31
  • Hi! I think this is an encoding problem. Three encodings are involved: (1) the encoding of the MySQL system, (2) the encoding of the newly created database, and (3) the encoding of the dump 20150413.sql. If there is an encoding problem, two different words (e.g. Schütze and Schutze) will be treated as the same after uploading to your database, and you will get an error. You can try to drop the database and create it again, then run in the MySQL command line: mysql> SET NAMES binary; mysql> SOURCE /path/to/20150413.sql; Good luck! – Andrew Krizhanovsky Oct 14 '15 at 11:54
  • Well, I'm in the middle of attempt 6 at the moment (more tweaks/adjustments), but I'll give that a shot if this fails. || You said "set names binary". What is "names"? The DB name? Table names? Character set? Collation? Or is the actual command "names"? – often frustrated Oct 14 '15 at 12:03
  • Okay, it appears to have finished ... without the need to redo it (no need for the binary - but I'd set the DB to utf8 bin anyway (the difference was sorting out the memory and ignoring errors)?). So, 39 Tables & 54,695,521 rows? Sound about right? – often frustrated Oct 14 '15 at 14:03
  • Great! You can test the result database with the help of the project https://github.com/componavt/piwidict. But it requires some experience with PHP. – Andrew Krizhanovsky Oct 15 '15 at 17:28
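
For anyone wondering how Schütze and Schutze could end up as "the same" word, as suggested in the encoding comment above, here is a rough Python sketch. It is not MySQL's actual collation algorithm. It only approximates an accent- and case-insensitive comparison by stripping combining marks, and contrasts that with a byte-wise comparison like the binary setting mentioned in the comments. The helper names `accent_insensitive_key` and `find_collisions` and the sample word list are made up for illustration.

```python
import unicodedata


def accent_insensitive_key(word: str) -> str:
    """Approximate an accent- and case-insensitive comparison key.

    Assumption: this only mimics the general idea of such a collation;
    it is not the algorithm MySQL uses.
    """
    # Split accented letters into base letter + combining mark (ü -> u + ¨).
    decomposed = unicodedata.normalize("NFKD", word)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def find_collisions(words):
    """Return pairs of distinct words that share the same comparison key."""
    seen = {}
    collisions = []
    for word in words:
        key = accent_insensitive_key(word)
        if key in seen and seen[key] != word:
            collisions.append((seen[key], word))
        else:
            seen.setdefault(key, word)
    return collisions


# Hypothetical sample of dictionary headwords.
words = ["Schütze", "Schutze", "resume", "résumé"]

# Pairs that a unique key would reject under the approximate comparison.
collisions = find_collisions(words)

# Byte-wise comparison: every spelling stays distinct.
binary_distinct = len({w.encode("utf-8") for w in words})

print(collisions)
print(binary_distinct)
```

Under the approximate comparison, each accented word collides with its unaccented twin, which is the kind of clash a unique key on word titles would report as a duplicate. Compared byte by byte, all four words remain distinct.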