This is my first post here on Stack Exchange. I need to do some language analysis on Wikipedia, so I downloaded one of its database dumps, specifically the Italian Wikipedia dump.
I need to extract the page content as plain text. I don't care about tables, formatting, or similar markup; in fact, I would much prefer to exclude them.
I have tried several approaches: a couple of programs I found on GitHub, and then my own C++ program that extracts the wikitext from the XML and passes it to Pandoc. All of these either didn't work, produced too many conversion errors, or were simply too slow for multiple gigabytes of text. I could not find better solutions here on Stack Exchange, and the parsers suggested on the MediaWiki site either don't fully support Wikipedia's markup or don't offer plain-text output.
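For context, the XML side of what I mean is simple: the dump is one large XML file with a `<text>` element per page revision. A minimal sketch of that extraction step (in Python rather than my C++, just to illustrate; the embedded XML snippet is a made-up stand-in for a real dump, not actual dump data) would be something like:

```python
import xml.etree.ElementTree as ET
from io import StringIO

# Made-up miniature sample mimicking the shape of a MediaWiki dump.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Roma</title>
    <revision>
      <text>'''Roma''' è la capitale d'Italia.</text>
    </revision>
  </page>
</mediawiki>"""

def extract_wikitext(xml_stream):
    """Stream pages and yield (title, wikitext) pairs."""
    title = None
    for event, elem in ET.iterparse(xml_stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            yield title, elem.text or ""
        elif tag == "page":
            elem.clear()  # free memory, so a multi-gigabyte dump stays streamable

pages = list(extract_wikitext(StringIO(SAMPLE)))
print(pages[0][0])  # prints "Roma"
```

Getting the wikitext out like this is not the problem; the problem is the next step, converting that wikitext (templates, tables, links) into clean plain text at scale.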