What would be the best approach to extract one language from Wiktionary?

Question

I have searched but not found what I want, which is:

The best and most efficient way to extract all Italian words, etymologies, and parts of speech, including plural forms of words (amico, amici), from Wiktionary. I would like to put the result into either a CSV file (maybe too large, though) or a MySQL database as plain text (not blobs).

Essentially, I want one record for each Italian word, in English.

mwdumper keeps crashing too.

Any advice would be welcome!

Jacopofar · Answer 1 · 2013-05-15T08:15:21.373

2

I created a small Java program that extracts the part of speech (verb, noun, adjective, and so on) of each term from the en.wiktionary XML dump. It writes TSV, but it can easily be adapted to other output formats.
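To give an idea of the general approach, here is a minimal sketch. It is not the program mentioned above. It assumes that each en.wiktionary page stores its Italian entry under a level-2 `==Italian==` heading, with part-of-speech headings nested below it. The class name `ItalianPosSketch`, the methods `italianPos` and `extract`, and the list of headings in `POS` are all made up for illustration. It only collects parts of speech; etymologies and plural forms would need additional wikitext parsing.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ItalianPosSketch {
    // Assumed (incomplete) list of part-of-speech headings; extend as needed.
    static final Set<String> POS = new HashSet<>(Arrays.asList(
        "Noun", "Verb", "Adjective", "Adverb", "Pronoun", "Preposition",
        "Conjunction", "Interjection", "Article", "Numeral", "Proper noun"));

    // Matches a wikitext heading such as "==Italian==" or "===Noun===".
    static final Pattern HEADING =
        Pattern.compile("^(={2,6})\\s*(.+?)\\s*\\1\\s*$");

    /** Returns the POS headings found inside the ==Italian== section of one page. */
    static List<String> italianPos(String wikitext) {
        List<String> found = new ArrayList<>();
        boolean inItalian = false;
        for (String line : wikitext.split("\n")) {
            Matcher m = HEADING.matcher(line);
            if (!m.matches()) continue;
            String name = m.group(2);
            if (m.group(1).length() == 2) {
                // A level-2 heading starts a new language section.
                inItalian = name.equals("Italian");
            } else if (inItalian && POS.contains(name) && !found.contains(name)) {
                found.add(name);
            }
        }
        return found;
    }

    /** Streams a MediaWiki XML dump and returns "title<TAB>pos" rows. */
    static List<String> extract(InputStream dump) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(dump);
        List<String> rows = new ArrayList<>();
        String title = null;
        while (reader.hasNext()) {
            if (reader.next() != XMLStreamConstants.START_ELEMENT) continue;
            String tag = reader.getLocalName();
            if (tag.equals("title")) {
                title = reader.getElementText();
            } else if (tag.equals("text") && title != null) {
                for (String pos : italianPos(reader.getElementText())) {
                    rows.add(title + "\t" + pos);
                }
            }
        }
        reader.close();
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the uncompressed XML dump.
        try (InputStream in = new FileInputStream(args[0])) {
            for (String row : extract(in)) System.out.println(row);
        }
    }
}
```

The dump is read as a stream, so the whole XML file is never held in memory, only the resulting rows. A full dump may still run into the JDK's default XML processing limits, which would then need to be raised. Pages outside the main namespace (titles containing a colon, such as templates) are not filtered out here.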

edited May 15 '13 at 08:15

answered May 14 '13 at 12:28

Jacopofar


Jacopo, I tried to compile it and got this error: `POSfromDump.java:20: error: class GeneraDatabasePOS is public, should be declared in a file named GeneraDatabasePOS.java public class GeneraDatabasePOS {` (sorry, I have never compiled Java before) – esponapule May 14 '13 at 23:40
You have to save it in a file named after the class, that is, GeneraDatabasePOS.java. You'll also have to change the lines of code that contain the file paths. – Jacopofar May 15 '13 at 07:43
Also you'll have to put it into a folder called "generazione" and run it with `java generazione.GeneraDatabasePOS` – Jacopofar May 15 '13 at 07:49
You can see an updated version [here](https://github.com/jacopofar/wikidump-tools) or directly download the file "POS_list_IT_mar_2013.txt", it's 8.7 MB and contains 486481 terms. – Jacopofar May 15 '13 at 08:14
