SED grabbing special characters

Question

I’m trying to fix an encoding error in an archived HTML page. My problem is that sed behaves strangely: it doesn't match special characters in the data. I tried both with and without the -r switch.

My data is the following: Budapesti ??p?t?©szeti Filmnapok k??l??nkiad??s

The sed command:

sed -i.bak 's|Budapesti.*|REPLACE|g' index.html

and the result I get without recode:

REPLACE�t?�szeti Filmnapok k??l??nkiad??s

The result I'm expecting is:

REPLACE

It seems to be related to the encoding somehow. If I do recode iso-8859-2 index.html first, sed works fine and gets me the expected output.

Here are the hex bytes for the i ??p?t?Šs part before recode:

69 20 3F 3F 70 3F AD 74 3F A9 73

and after recode:

69 20 3F 3F 70 3F C2 AD 74 3F C5 A0 73

BTW, this is what I get without recode:

REPLACEt?Šs

52 45 50 4C 41 43 45 AD 74 3F A9 73
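A byte listing like the ones above can be reproduced with `od`. This is only a sketch: `hex_bytes` is a made-up helper name, and the input is the "before recode" fragment rebuilt by hand with `printf` octal escapes (`\255` is AD, `\251` is A9) rather than read from the real index.html.

```shell
# hex_bytes is a hypothetical helper: it prints stdin as one run of lowercase hex digits.
hex_bytes() {
  od -An -tx1 | tr -d ' \n'
}

# Rebuild the fragment from the question by hand and dump its bytes.
dump=$(printf 'i ??p?\255t?\251s' | hex_bytes)
printf '%s\n' "$dump"
```

Running `od -An -tx1` directly on a file keeps the spaces between bytes, which is easier to read for longer dumps.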

I'm using the latest gsed (GNU sed) 4.2.2.

It seems to be related to the encoding. I've added more information to the question, including a recode command and the hex bytes of the file at the problematic characters. – hyperknot, Nov 27 '14 at 23:37
What if you use `sed -i.bak 's|Budapesti.*$|REPLACE|g' index.html`, that is, add `$` to anchor the match at the end of the line? – fedorqui, Nov 27 '14 at 23:38

score 1 · Answer 1 · edited May 23 '17 at 12:05

LANG=C.ISO-8859-2 sed -i.bak 's|Budapesti.*|REPLACE|g' index.html

Related: Cygwin terminal not displaying certain characters?
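A minimal sketch of why the locale matters, assuming GNU sed and a shell whose default locale is UTF-8. In a UTF-8 locale, `.` does not match a byte that is not part of a valid multibyte sequence, so `.*` ends just before the first such byte (AD in the question's dump). In the C locale every byte counts as one character. The sample line is rebuilt by hand from the question's hex dump, and `sample` and `result` are names chosen for illustration.

```shell
# Hypothetical sample file holding bytes like those in the question:
# \255 is 0xAD and \251 is 0xA9, neither of which is valid UTF-8 on its own.
sample=$(mktemp)
printf 'Budapesti ??p?\255t?\251szeti Filmnapok\n' > "$sample"

# In the C locale each byte is one character, so .* reaches the end of the line.
result=$(LC_ALL=C sed 's|Budapesti.*|REPLACE|' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

`LC_ALL` takes precedence over `LANG`, so if `LC_ALL` is already set in the environment, setting only `LANG` as in the command above may have no effect.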

answered Nov 28 '14 at 05:46 by Zombo; last edited by Community

This one works fine, but why does sed stop at that 3F character when no encoding is set? – hyperknot Nov 28 '14 at 11:06
@zsero mine stops at AD, so not sure – Zombo Nov 28 '14 at 16:01
I mean 3F is the last character which disappears and AD is the first one which stays. – hyperknot Nov 28 '14 at 16:51
