I want to get a plain text file from the French Wikipedia dump XML file. To that end, I am applying a Perl script.

I can give the full file if necessary; I only added the line

tr/a-zàâééèëêîôûùç-/ /cs;

to the script here: http://mattmahoney.net/dc/textdata.html

However, when I run it in a Linux terminal:

perl filterwikifr.pl frwiki.xml > frwikiplaintext.txt  

the accented letters in the output text file are not displayed correctly. For example, I get catégorie instead of catégorie.

I also tried:

perl -CS filterwikifr.pl frwiki.xml > frwikiplaintext.txt

with no better result (I also tried other variants in place of -CS).

Mostafa
  • The concept of "plain text" doesn't really exist. The output file must be encoded in some format. Do you really mean you only want 7-bit ASCII output? – b4hand Dec 12 '14 at 07:22
  • I only want the accented letters to be preserved (and I guess the output should be in UTF-8, but I am not a Unicode specialist). If I open the file with LibreOffice, it works, but the text editor shows weird characters. – Mostafa Dec 12 '14 at 07:31
  • What are the contents of `$LANG` and `env | grep LC_`? – b4hand Dec 12 '14 at 07:40
  • What "text editor" are you using? If LibreOffice is reading it, then most likely the output file is correct. – b4hand Dec 12 '14 at 07:41
  • I am using gedit in Ubuntu 14.04. – Mostafa Dec 12 '14 at 08:00
  • output of env | grep LC_ LC_PAPER=fr_FR.UTF-8 LC_ADDRESS=fr_FR.UTF-8 LC_MONETARY=fr_FR.UTF-8 LC_NUMERIC=fr_FR.UTF-8 LC_TELEPHONE=fr_FR.UTF-8 LC_IDENTIFICATION=fr_FR.UTF-8 LC_MEASUREMENT=fr_FR.UTF-8 LC_TIME=fr_FR.UTF-8 LC_NAME=fr_FR.UTF-8 – Mostafa Dec 12 '14 at 08:01
  • I meant `echo $LANG`. – b4hand Dec 12 '14 at 08:06
  • output of echo $LANG : en_US.UTF-8 – Mostafa Dec 12 '14 at 08:13
  • Why are you running `perl filename.xml`? It should call your Perl script not the xml file. `perl perlscript.pl`. – Chankey Pathak Dec 12 '14 at 08:14
  • If you do this `echo 'tr/a-zàâééèëêîôûùç-/ /cs;' > foo.txt` followed by `perl -p -e 'tr/a-zàâééèëêîôûùç-/ /cs;' < foo.txt`, do you see `tr a-zàâééèëêîôûùç- cs` as output? – b4hand Dec 12 '14 at 08:18
  • @ChankeyPathak yes, you are right – Mostafa Dec 12 '14 at 08:18
  • @b4hand yes, I see tr a-zàâééèëêîôûùç- cs – Mostafa Dec 12 '14 at 08:20

1 Answer

The problem is with the text editor, gedit, not with the script's output.

If, instead of opening the file directly, I start gedit, use the "Open" dialog, and at the bottom, under "Character encoding", choose UTF-8 instead of "Automatically Detected", then the accents are displayed correctly.
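
To check from the command line that the file itself is fine, independently of any editor, something like the following sketch may help. It assumes `iconv` is installed (it normally is on Ubuntu); `is_utf8` is a hypothetical helper name, not part of the original script.

```shell
# Hypothetical helper: reports whether a file decodes cleanly as UTF-8.
is_utf8() {
    # iconv exits with a non-zero status on the first invalid byte sequence.
    if iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1; then
        echo "valid UTF-8"
    else
        echo "not valid UTF-8"
    fi
}

# Usage, with the output file name from the question:
#   is_utf8 frwikiplaintext.txt
```

If this reports valid UTF-8 while the editor still shows garbage, the editor's encoding detection is the likely culprit rather than the Perl script.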

Mostafa