Change from HTML character references to UTF-8 in a bash script, i.e. &#257; becomes ā

Question

How would you go about translating a document that contains the following character references to their actual readable characters in a bash script?

&#257; &#225; &#462; &#224; &#275; &#233; &#283; &#232; &#299; &#237; &#464; &#236; &#470; &#472; &#474; &#476; &#252; &#470; &#472; &#474; &#476; &#252;

These change in order to ā á ǎ à ē é ě è ī í ǐ ì ǖ ǘ ǚ ǜ ü ǖ ǘ ǚ ǜ ü

stackoverflow allows HTML entities. Might want to edit that. — Devin Jeanpierre, Feb 23 '09 at 04:40
My first response is to use sed, if it's just those entities. Direct replacement should be possible that way. If you want it to work for arbitrary entities, though, then I can't think of anything offhand (I'm not a major sh person, sadly). — Devin Jeanpierre, Feb 23 '09 at 04:41
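
The direct-replacement idea from the comment above could look roughly like the sketch below. It is only an illustration: the function name `entities2utf8` is made up, and it covers just the 17 references listed in the question, so any other reference passes through unchanged.

```shell
#!/bin/bash
# Hypothetical helper (the name is illustrative): replaces only the
# references listed in the question and leaves everything else untouched.
entities2utf8() {
  sed -e 's/&#257;/ā/g' -e 's/&#225;/á/g' -e 's/&#462;/ǎ/g' -e 's/&#224;/à/g' \
      -e 's/&#275;/ē/g' -e 's/&#233;/é/g' -e 's/&#283;/ě/g' -e 's/&#232;/è/g' \
      -e 's/&#299;/ī/g' -e 's/&#237;/í/g' -e 's/&#464;/ǐ/g' -e 's/&#236;/ì/g' \
      -e 's/&#470;/ǖ/g' -e 's/&#472;/ǘ/g' -e 's/&#474;/ǚ/g' -e 's/&#476;/ǜ/g' \
      -e 's/&#252;/ü/g'
}

# Usage: filter text through the function.
echo 'n&#464; h&#462;o' | entities2utf8
```

This is fine for a fixed, known set of references, but every new character needs another `-e` expression, which is why a general decoder is usually the better fit.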

score 3 · Accepted Answer · edited Sep 27 '11 at 14:02

If you have access to Perl then it's relatively simple:

perl -ne 'binmode STDOUT,":utf8";s/&#([0-9]+);/pack("U",$1)/eg;print' \
  document.html

Example:

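#!/bin/bash
# Read text on stdin and write it to stdout with each decimal character
# reference (&#NNN;) replaced by the UTF-8 character it names.
html2utf8() {
  perl -ne 'binmode STDOUT, ":utf8"; s/&#([0-9]+);/pack("U",$1)/eg; print'
}
echo 'testing 1 &#257; 2 &#300; 3 &#275;' | html2utf8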

Produces:

testing 1 ā 2 Ĭ 3 ē

answered Feb 23 '09 at 04:53

vladr

Yep, I've got access to Perl, so that is probably the easiest and neatest way to do it. Honestly, the whole project would be best scripted in Perl anyway. – Feb 23 '09 at 06:50

score 1 · Answer 2 · answered Jul 18 '10 at 03:16

If you're looking for a bash-only way of doing this, it looks like there are some solutions in this thread: http://forums.gentoo.org/viewtopic-t-820377-view-previous.html?sid=b35246f20410ba95ee048970d01ac6b3
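
For reference, one possible bash-only approach is sketched below. It is not taken from the linked thread; the function names `encode_utf8` and `html2utf8_bash` and the variable `utf8_ch` are invented for this illustration. It builds the UTF-8 bytes arithmetically, so it does not depend on the current locale, and it handles decimal references only, with no validation of the code point range.

```shell
#!/bin/bash
# Sketch only: function and variable names are illustrative.

# Store the UTF-8 encoding of the decimal code point $1 in the variable utf8_ch.
encode_utf8() {
  local cp=$1 esc
  if (( cp < 0x80 )); then
    printf -v esc '\\x%02x' "$cp"
  elif (( cp < 0x800 )); then
    printf -v esc '\\x%02x\\x%02x' \
      $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
  elif (( cp < 0x10000 )); then
    printf -v esc '\\x%02x\\x%02x\\x%02x' \
      $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else
    printf -v esc '\\x%02x\\x%02x\\x%02x\\x%02x' \
      $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
  fi
  # Turn the \xHH escapes into raw bytes.
  printf -v utf8_ch '%b' "$esc"
}

# Read text on stdin and replace every decimal reference (&#NNN;) on each line.
html2utf8_bash() {
  local line out match utf8_ch re='&#([0-9]+);'
  while IFS= read -r line || [[ -n $line ]]; do
    out=''
    while [[ $line =~ $re ]]; do
      match=${BASH_REMATCH[0]}
      # 10# forces base 10 so leading zeros are not parsed as octal.
      encode_utf8 "$(( 10#${BASH_REMATCH[1]} ))"
      # Keep the text before the reference, then append the decoded character.
      out+=${line%%"$match"*}$utf8_ch
      # Continue with whatever follows the reference.
      line=${line#*"$match"}
    done
    printf '%s\n' "$out$line"
  done
}

echo 'testing 1 &#257; 2 &#300; 3 &#275;' | html2utf8_bash
```

On large documents this loop will be much slower than the Perl one-liner, so it is mainly useful when Perl really is not available.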

Menachem
