5

How would you go about translating a document that contains the following character references to their actual readable characters in a bash script?

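&#257; &#225; &#462; &#224; &#275; &#233; &#283; &#232; &#299; &#237; &#464; &#236; &#470; &#472; &#474; &#476; &#252; &#470; &#472; &#474; &#476; &#252;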

These should become, in order: ā á ǎ à ē é ě è ī í ǐ ì ǖ ǘ ǚ ǜ ü ǖ ǘ ǚ ǜ ü

Sam
  • stackoverflow allows HTML entities. Might want to edit that. – Devin Jeanpierre Feb 23 '09 at 04:40
  • My first response is to use sed, if it's just those entities. Direct replacement should be possible that way. If you want it to work for arbitrary entities, though, then I can't think of anything offhand (I'm not a major sh person, sadly). – Devin Jeanpierre Feb 23 '09 at 04:41

2 Answers

3

If you have access to Perl then it's relatively simple:

perl -ne 'binmode STDOUT, ":utf8"; s/&#([0-9]+);/pack("U",$1)/eg; print' \
  document.html

Example:

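#!/bin/bash
# Read text on stdin and replace decimal character references (&#NNN;)
# with the corresponding characters, written to stdout as UTF-8.
html2utf8() {
  perl -ne 'binmode STDOUT, ":utf8"; s/&#([0-9]+);/pack("U",$1)/eg; print'
}
echo 'testing 1 &#257; 2 &#300; 3 &#275;' | html2utf8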

Produces:

testing 1 ā 2 Ĭ 3 ē
vladr
  • Yep, I've got access to Perl, so that is probably the easiest and neatest way to do it. Honestly, the whole project would be best scripted in Perl anyway. –  Feb 23 '09 at 06:50
1

If you're looking for a bash-only way of doing this, there appear to be some solutions in this thread: http://forums.gentoo.org/viewtopic-t-820377-view-previous.html?sid=b35246f20410ba95ee048970d01ac6b3
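
As a rough illustration of what a bash-only approach could look like (a sketch written for this page, not taken from the linked thread): the function below finds each decimal reference with bash's regex matching, computes the UTF-8 bytes of the code point with shell arithmetic, and emits them through printf. The name decode_entities is hypothetical. It assumes bash 3.1 or later, handles only decimal references of up to seven digits, and does not reject surrogate code points.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode &#NNN; references from stdin using bash builtins only.
decode_entities() {
  local re='&#([0-9]{1,7});' line out match cp fmt char
  while IFS= read -r line || [[ -n $line ]]; do
    out=
    while [[ $line =~ $re ]]; do
      match=${BASH_REMATCH[0]}
      cp=$((10#${BASH_REMATCH[1]}))   # force base 10 so leading zeros are not octal
      # Build \xHH escapes for the UTF-8 encoding of code point cp.
      if (( cp < 0x80 )); then
        printf -v fmt '\\x%02x' "$cp"
      elif (( cp < 0x800 )); then
        printf -v fmt '\\x%02x\\x%02x' \
          $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
      elif (( cp < 0x10000 )); then
        printf -v fmt '\\x%02x\\x%02x\\x%02x' \
          $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) \
          $(( 0x80 | (cp & 0x3F) ))
      elif (( cp <= 0x10FFFF )); then
        printf -v fmt '\\x%02x\\x%02x\\x%02x\\x%02x' \
          $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) \
          $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
      else
        fmt=$match                    # outside the Unicode range: leave untouched
      fi
      printf -v char '%b' "$fmt"
      out+=${line%%"$match"*}$char    # text before the reference, then the character
      line=${line#*"$match"}          # continue scanning after the reference
    done
    printf '%s\n' "$out$line"
  done
}

echo 'testing 1 &#257; 2 &#300; 3 &#275;' | decode_entities
```

Because the scan always continues after the text it has just decoded, input such as &#38;#257; comes out as the literal text &#257; and is not decoded a second time. For large files the Perl version is likely to be much faster.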

Menachem