I have two files containing the text aùb, but one, critic_utf8, is encoded in UTF-8 and the other, critic_latin1, in latin1, so their contents look like this:
$ od -a critic_utf8
0000000   a   C   9   b  nl
0000005
$ od -a critic_latin1
0000000   a   y   b  nl
0000004
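In case anyone wants to reproduce this, the two files can be recreated roughly as follows. This is only a sketch: it assumes that ù is the byte pair 0xC3 0xB9 in UTF-8 and the single byte 0xF9 in latin1, and it writes those bytes with octal escapes so that the result does not depend on the terminal's encoding.

```shell
# UTF-8: ù is the two bytes 0xC3 0xB9 (octal 303 271)
printf 'a\303\271b\n' > critic_utf8

# latin1 (ISO-8859-1): ù is the single byte 0xF9 (octal 371)
printf 'a\371b\n' > critic_latin1

# Sanity check: the files should be 5 and 4 bytes long, respectively
wc -c critic_utf8 critic_latin1
```

The byte counts match the final offsets (0000005 and 0000004) in the od output above.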
Now, leaving aside that I don't know what that y (which corresponds to ù) in the second od output is (I'd like to understand that too, so a subquestion is: what is that y?), it seems to me that Sed's . doesn't match it:
$ sed 's/.*/x/' critic_latin1
xùb
$ sed 's/.*/x/' critic_utf8
x
$ sed 's/./x/g' critic_latin1
xùx
$ sed 's/./x/g' critic_utf8
xxx
What does this mean? That Sed cannot work with latin1-encoded text files? I thought . would match everything except a newline, but here it also fails to match something else. And I know that ù is not being treated by . the way \n would be, as shown by this:
$ sed -z 's/.*/x/' critic_latin1
xùb
I noticed this while playing around with *.idx and *.dat files (those with words and synonyms), while trying out what I found in this answer.