14

I can do it in vim like so:

:%s/\%u2013/-/g

How do I do the equivalent in Perl? I thought this would do it but it doesn't seem to be working:

perl -i -pe 's/\x{2013}/-/g' my.dat
Miles
stephenmm

4 Answers

38

For a generic solution, Text::Unidecode transliterates pretty much anything that's thrown at it into pure US-ASCII.

So in your case this would work:

perl -C -MText::Unidecode -n -i -e'print unidecode( $_)' unicode_text.txt

The -C is there to make sure the input is read as UTF-8.

It converts this:

l'été est arrivé à peine après aôut
¿España es un paìs muy lindo?
some special chars: » « ® ¼ ¶ – – — Ṉ
Some greek letters: β ÷ Θ ¬ the α and ω (or is it Ω?)
hiragana? みせる です
Здравствуйте
السلام عليكم

into this:

l'ete est arrive a peine apres aout
?Espana es un pais muy lindo?
some special chars: >> << (r) 1/4 P - - -- N
Some greek letters: b / Th ! the a and o (or is it O?)
hiragana? miseru desu
Zdravstvuitie
lslm `lykm

The last one shows the limits of the module, which can't infer the vowels and produce as-salaamu `alaykum from the original Arabic. It's still pretty good, I think.

Grzegorz Rożniecki
mirod
4

This did the trick for me:

perl -C1 -i -pe 's/–/-/g' my.dat

Note that the first bar is the \x{2013} character itself.

Leon Timmermans
  • 5
    Some explanation of the '-C1' would do wonders. The information is available at http://perldoc.perl.org/perlrun.html (-C1 means 'standard input is in UTF8'). – Jonathan Leffler Feb 22 '10 at 16:43
3

Hmm, a bit tough. This seems to do it (Perl 5.10.0 on MacOS X 10.6.2):

perl -w -e "
use open ':encoding(utf8)';
use open ':std';

while (<>)
{
    s/\x{2013}/-/g;
    print;
}
"

I have not yet minimized that. See perldoc on the 'use open' statement.


Judging from my (limited) experiments, the '-p' option doesn't recognize the 'use open' directives. You can use 'qw()' to quote the words:

perl -w -e "
use open qw( :encoding(utf8) :std );
while (<>)
{
    s/\x{2013}/-/g;
    print;
}
"

I don't know if '-p' not obeying 'use open' is a bug or a design feature.

Jonathan Leffler
0

Alternately, you could just specify the UTF-8 encoding of the characters you want to substitute:

perl -i -pe 's/\xE2\x80\x93/-/g' my.dat

Here the three bytes E2 80 93 (hex) are the UTF-8 encoding of the code point U+2013. You can find various tools online to get the UTF-8 encoding for a character, or you can just look at my.dat in a hex editor.

Russell Zahniser