This is my first post here on Stack Exchange. I need to do some language analysis on Wikipedia, so I downloaded one of its database dumps, specifically the Italian Wikipedia dump.
I need to extract the page content as plain text. I don't care about tables, formatting, or similar markup; in fact, I would much prefer to exclude them.
I have tried several approaches: a couple of programs I found on GitHub, and then my own C++ program that extracts the wikitext from the XML and passes it to Pandoc. All of these either didn't work, produced too many conversion errors, or were simply too slow for multiple gigabytes of text. I could not find better solutions here on Stack Exchange, and the parsers suggested on the MediaWiki site either don't fully support Wikipedia's markup or don't offer plain-text output.
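For context, the XML side of what I mean is simple: the dump is one large XML file with a `<text>` element per page revision. A minimal sketch of that extraction step (in Python rather than my C++, just to illustrate; the embedded XML snippet is a made-up stand-in for a real dump, not actual dump data) would be something like:

```python
import xml.etree.ElementTree as ET
from io import StringIO

# Made-up miniature sample mimicking the shape of a MediaWiki dump.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Roma</title>
    <revision>
      <text>'''Roma''' è la capitale d'Italia.</text>
    </revision>
  </page>
</mediawiki>"""

def extract_wikitext(xml_stream):
    """Stream pages and yield (title, wikitext) pairs."""
    title = None
    for event, elem in ET.iterparse(xml_stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            yield title, elem.text or ""
        elif tag == "page":
            elem.clear()  # free memory, so a multi-gigabyte dump stays streamable

pages = list(extract_wikitext(StringIO(SAMPLE)))
print(pages[0][0])  # prints "Roma"
```

Getting the wikitext out like this is not the problem; the problem is the next step, converting that wikitext (templates, tables, links) into clean plain text at scale.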