I'm trying to fix an encoding error in an archived HTML page. My problem is that sed behaves strangely: its pattern doesn't match past certain special characters in the data. I tried both with and without the -r switch.
My data is the following:
Budapesti ??p?t?©szeti Filmnapok k??l??nkiad??s
The sed command:
sed -i.bak 's|Budapesti.*|REPLACE|g' index.html
and the result I get without recode:
REPLACE�t?�szeti Filmnapok k??l??nkiad??s
The result I'm expecting is:
REPLACE
It seems to be related to the encoding somehow. If I first run
recode iso-8859-2 index.html
then sed works fine and gives me the expected output.
Here are the hex bytes for the "i ??p?t?Šs" part before recode:
69 20 3F 3F 70 3F AD 74 3F A9 73
and after recode:
69 20 3F 3F 70 3F C2 AD 74 3F C5 A0 73
BTW, this is what I get without recode:
REPLACEt?Šs
52 45 50 4C 41 43 45 AD 74 3F A9 73
I'm using the latest gsed (GNU sed) 4.2.2.
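To make this easier to reproduce, here is a minimal sketch that rebuilds the sample bytes and reruns the same substitution. It rests on an assumption I have not confirmed: that my shell uses a UTF-8 locale, and that sed stops .* at 0xAD because that byte is not a valid character there. The scratch file and the variable names are made up for illustration, and forcing LC_ALL=C is only a guess at a workaround that avoids recode.

```shell
# Hypothetical scratch file holding the "i ??p?t?©s" run from the question;
# the bytes 0xAD and 0xA9 are written as the octal escapes \255 and \251.
sample=$(mktemp)
printf 'Budapesti ??p?\255t?\251s\n' > "$sample"

# Hex dump to compare against the "before recode" byte listing.
od -An -tx1 "$sample"

# Assumption: in the C locale every byte counts as one character,
# so .* should run to the end of the line instead of stopping at 0xAD.
result=$(LC_ALL=C sed 's|Budapesti.*|REPLACE|g' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

If the locale guess is right, the last command prints only REPLACE, while the same sed call without LC_ALL=C under a UTF-8 locale leaves the trailing bytes behind as shown above.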