I want to extract a plain text file from the French Wikipedia XML dump. To that end, I am using a Perl script.
I can post the full file if necessary; I only added the line
tr/a-zàâééèëêîôûùç-/ /cs;
to the script here: http://mattmahoney.net/dc/textdata.html
However, when I run the following in a Linux terminal:
perl filterwikifr.pl frwiki.xml > frwikiplaintext.txt
the accented letters in the output text file come out wrong. For example, I get catégorie instead of catégorie.
I also tried:
perl -CS filterwikifr.pl frwiki.xml > frwikiplaintext.txt
with no more success. I also tried other variants in place of -CS.
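The garbling looks to me like UTF-8 bytes being encoded a second time. Here is a minimal sketch that reproduces the symptom, assuming that is the cause (I have not confirmed it). The variable names are only for illustration and do not come from the original script; it uses the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes of "catégorie", written with escapes so this file stays pure ASCII.
my $bytes = "cat\xC3\xA9gorie";

# Treating those bytes as Latin-1 characters and encoding them again as UTF-8
# turns the two-byte "é" into four bytes, which displays as "Ã©".
my $garbled = encode('UTF-8', decode('ISO-8859-1', $bytes));

# Decoding the bytes as UTF-8 first gives 9 real characters, and encoding
# those characters back gives the original bytes.
my $chars = decode('UTF-8', $bytes);
my $clean = encode('UTF-8', $chars);

binmode(STDOUT);    # write raw bytes, no extra encoding layer
print "$garbled\n$clean\n";
```

On a UTF-8 terminal the first printed line shows the same garbling as my output file and the second line shows the word correctly. If this really is what happens, the text would be going through a UTF-8 output layer without having been decoded as UTF-8 on input.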