I'm trying to fix an encoding error in an archived HTML page. My problem is that sed is behaving strangely: the .* in my pattern stops matching at certain special characters in the data. I tried both with and without the -r switch.

My data is the following: Budapesti ??p?­t?©szeti Filmnapok k??l??nkiad??s

The sed command:

sed -i.bak 's|Budapesti.*|REPLACE|g' index.html

and the result I get without recode:

REPLACE�t?�szeti Filmnapok k??l??nkiad??s

The result I'm expecting is:

REPLACE

It seems to be related to the encoding somehow. If I do recode iso-8859-2 index.html first, sed works fine and gets me the expected output.

Here are the hex bytes for the i ??p?­t?Šs part before recode:

69 20 3F 3F 70 3F AD 74 3F A9 73

and after recode:

69 20 3F 3F 70 3F C2 AD 74 3F C5 A0 73
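For anyone who wants to reproduce the two byte dumps above without recode, here is a minimal sketch. It assumes `iconv` and `od` are available and uses `iconv -f ISO-8859-2 -t UTF-8` as a stand-in for the recode step; the variable name `hex` is only for illustration.

```shell
# Emit the original 11 bytes; \255 and \251 are octal for 0xAD and 0xA9.
# Reinterpret them as ISO-8859-2 and convert to UTF-8, then dump as hex.
hex=$(printf 'i ??p?\255t?\251s' \
  | iconv -f ISO-8859-2 -t UTF-8 \
  | od -An -tx1 \
  | tr -s ' ')
printf '%s\n' "$hex"
```

In ISO-8859-2, 0xAD is a soft hyphen (U+00AD) and 0xA9 is Š (U+0160), which is why they become C2 AD and C5 A0 after conversion.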

BTW, this is what I get without recode:

REPLACE­t?Šs

52 45 50 4C 41 43 45 AD 74 3F A9 73

I'm using the latest gsed (GNU sed) 4.2.2.

hyperknot

1 Answer

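In a UTF-8 locale, the . in GNU sed does not match bytes that are not valid UTF-8, such as the lone AD and A9 in your data, so .* stops at the first one. Run sed under a single-byte locale instead, so that every byte counts as a character:

LANG=C.ISO-8859-2 sed -i.bak 's|Budapesti.*|REPLACE|g' index.html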

Cygwin terminal not displaying certain characters?
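The same idea can be tried without touching index.html. The following is a minimal sketch, not the exact command from the answer: it assumes a POSIX shell, feeds sed a shortened copy of the sample line through a pipe, and uses `LC_ALL=C` in place of `LANG=C.ISO-8859-2`. `LC_ALL` overrides `LANG` and `LC_CTYPE` if those are already set, and the plain `C` locale exists everywhere, whereas `C.ISO-8859-2` looks like a Cygwin-style locale name that may not be available on other systems.

```shell
# Build a line containing the raw bytes 0xAD and 0xA9 (octal \255 and \251),
# which are not valid UTF-8 on their own.
# With LC_ALL=C, sed treats every byte as one character, so .* runs to end of line.
result=$(printf 'Budapesti ??p?\255t?\251szeti Filmnapok\n' \
  | LC_ALL=C sed 's|Budapesti.*|REPLACE|g')
printf '%s\n' "$result"
```

If this prints only REPLACE, the same environment prefix should work with the in-place command against the real file.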

Zombo