Why doesn't sed's dot match ù in latin1 encoding?

Question

I have two files containing the text aùb: one, critic_utf8, is encoded in UTF-8 and the other, critic_latin1, in latin1, so their contents look like this:

$ od -a critic_utf8 
0000000   a   C   9   b  nl
0000005
$ od -a critic_latin1 
0000000   a   y   b  nl
0000004

Leaving aside that I don't know what the y in the second output is (it corresponds to ù, and I'd like to understand it, so a subquestion is: what is that y?), it seems to me that sed's . doesn't match it:

$ sed 's/.*/x/' critic_latin1 
xùb
$ sed 's/.*/x/' critic_utf8 
x
$ sed 's/./x/g' critic_latin1 
xùx
$ sed 's/./x/g' critic_utf8 
xxx

What does this mean? That sed cannot work with latin1-encoded text files? I thought . would match everything but newline, yet here it also fails to match something else. And I know that ù is not being treated the way \n would be, as this shows:

$ sed -z 's/.*/x/' critic_latin1 
xùb

I noticed this while playing around with *.idx and *.dat files (those with words and synonyms), when trying to experiment with what I found in this answer.

@WiktorStribiżew, but why doesn't `.` consume all those bytes? I think I just don't understand what `.` matches in the two cases, I'm afraid. — Enlico, Jan 08 '23 at 19:34

Arnaud Valmary · Answer 1 · 2023-01-15T18:17:50.227

Two steps are involved:

sed reads your file according to the charset named in the LANG variable, whose value has the form language_COUNTRY.CHARSET.
Your terminal then interprets sed's output according to its own configuration.

I can reproduce your output with LANG set to a UTF-8 charset and a terminal configured for ISO-8859-1 (latin1) encoding:

> export LANG=fr_FR.UTF-8; echo "latin1"; sed 's/.*/x/' critic_latin1 ; echo "utf-8"; sed 's/.*/x/' critic_utf8; echo "latin1/g"; sed 's/./x/g' critic_latin1; echo "utf-8/g"; sed 's/./x/g' critic_utf8
latin1
xùb
utf-8
x
latin1/g
xùx
utf-8/g
xxx

A LANG value with a UTF-8 charset tells sed to work with UTF-8 characters, but in critic_latin1 the ù is encoded in ISO-8859-1 (a single byte). That byte on its own is not valid UTF-8, so sed does not treat it as a character: . does not match it, and it is passed through to the output unchanged.
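To make the point about the invalid byte concrete, here is a minimal sketch, assuming a POSIX shell with od, tr and iconv available. It recreates the two byte sequences with printf instead of reading the question's files, and the variable names are only illustrative. It also suggests where the y in the od -a output comes from: od -a uses only the low seven bits of each byte, and 0xF9 with the high bit dropped is 0x79, the code for y.

```shell
# ù as UTF-8 is two bytes (0xC3 0xB9); as latin1 it is one byte (0xF9).
utf8_hex=$(printf 'a\303\271b\n' | od -An -tx1 | tr -d ' \n')
latin1_hex=$(printf 'a\371b\n' | od -An -tx1 | tr -d ' \n')
echo "utf8:   $utf8_hex"
echo "latin1: $latin1_hex"

# od -a looks only at the low seven bits of a byte: 0xF9 & 0x7F = 0x79 = 'y'.
low7=$(printf '%x' $((0xF9 & 0x7F)))
echo "0xF9 & 0x7F = 0x$low7"

# Ask iconv to read the latin1 bytes as UTF-8; a failure means they are
# not a valid UTF-8 sequence (assumes iconv is installed).
if printf 'a\371b\n' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
  latin1_valid=yes
else
  latin1_valid=no
fi
echo "latin1 bytes valid as UTF-8: $latin1_valid"
```

The same masking explains the C and 9 shown for the UTF-8 file: 0xC3 and 0xB9 become 0x43 and 0x39.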

If you want to work with files encoded differently from your LANG setting, set LANG=... before running your commands, like this:

> export LANG=fr_FR.ISO-8859-1; echo "latin1"; sed 's/.*/x/' critic_latin1 ; echo "utf-8"; sed 's/.*/x/' critic_utf8; echo "latin1/g"; sed 's/./x/g' critic_latin1; echo "utf-8/g"; sed 's/./x/g' critic_utf8
latin1
x
utf-8
x
latin1/g
xxx
utf-8/g
xxxx

This is really useful with text data files (such as ISAM).
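As a per-command variant of the same idea, here is a minimal sketch, assuming a POSIX shell and a sed that treats every byte as one character in the C locale (GNU sed does). LC_ALL is used here because it takes precedence over LANG when both are set, and the C locale is chosen only because it is always available; an installed ISO-8859-1 locale would serve the same purpose. The input is recreated with printf rather than read from the question's files.

```shell
# Set the locale for a single sed invocation instead of exporting it.
# In the C locale each byte is one character, so . matches the latin1 byte 0xF9.
latin1_out=$(printf 'a\371b\n' | LC_ALL=C sed 's/./x/g')
echo "$latin1_out"

# The UTF-8 form of ù is two bytes, so the same command sees four characters.
utf8_out=$(printf 'a\303\271b\n' | LC_ALL=C sed 's/./x/g')
echo "$utf8_out"
```

Under those assumptions this should print xxx and then xxxx, the same results as the ISO-8859-1 run above, without changing the locale of the rest of the shell session.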

By "_do not trait_" do you mean "_does not treat_"? – AmigoJack Jan 15 '23 at 08:11
