
I'm getting strange results when I try to find and replace curly quotes inside a character class, with another character:

sed -E "s/[‘’]/'/g" in.txt > out.txt

in.txt:  ‘foo’
out.txt: '''foo'''

If you use `a` as the replacement, you get `aaafooaaa`. But this only happens when the curly quotes are inside a character class. This works:

sed -E "s/(‘|’)/'/g" in.txt > out.txt

in.txt:  ‘foo’
out.txt: 'foo'

Can anyone explain what's going on here? Can I still use a character class for curly quotes?

Daan
  • Cannot reproduce (see snippet [here](https://tio.run/##S0oszvj/vzg1RUHXVUGpWD/6UcOMRw0zY/XV9dOV/v8H8tLy84ECAA)) – ctwheels May 14 '20 at 19:45

1 Answer


Your string is using a multibyte encoding, specifically UTF-8; the curly quotes are three bytes each. But your sed implementation is treating each byte as a separate character. This is probably due to your locale settings. I can reproduce your problem by setting my locale to "C" (the old default POSIX locale, which assumes ASCII):

$ LC_ALL=C sed -E "s/[‘’]/'/g" <<<'‘foo’' # C locale, single-byte chars
'''foo'''

But in my normal locale of en_US.UTF-8 ("US English encoded with UTF-8"), I get the desired result:

$ LC_ALL=en_US.UTF-8 sed -E "s/[‘’]/'/g" <<<'‘foo’' # UTF-8 locale, multibyte chars
'foo'

The way you're running it, sed doesn't see [‘’] as a sequence of four characters but of eight bytes. So each of the six bytes between the brackets (or rather, each of the four distinct values found in those bytes: E2, 80, 98 and 99) is considered a member of the character class, and each matching byte is separately replaced by the apostrophe. That is why your three-byte curly quotes are getting replaced by three apostrophes each.
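To see those bytes for yourself, here is a small sketch. The helper names `byte_count` and `unique_bytes` are made up for illustration, and it assumes the text is UTF-8 encoded and that the standard `od`, `wc`, `tr`, `grep` and `sort` utilities are available:

```shell
# Count the bytes in a string. wc -c counts bytes regardless of locale.
byte_count() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

# List the distinct byte values (in hex) that make up a string, one per line.
unique_bytes() {
  printf '%s' "$1" | od -An -tx1 -v | tr -s ' ' '\n' | grep . | LC_ALL=C sort -u
}

byte_count "[‘’]"    # prints 8: two brackets plus 2 x 3 bytes
unique_bytes "‘’"    # prints 80, 98, 99, e2 (one per line)
```

Those four values are exactly the members of the bracket expression when sed works byte by byte.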

The version that uses alternation works because each alternative can be more than one character long; even though sed is still treating ‘ and ’ as three-character sequences instead of individual characters, that treatment doesn't change the result.
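As a rough sketch of that point, the following forces the byte-oriented "C" locale on purpose and still gets the right answer with alternation. The function name `straighten_quotes` is hypothetical, and it assumes a sed that accepts `-E`:

```shell
# Replace curly single quotes with a straight apostrophe using alternation.
# LC_ALL=C is forced here deliberately: even when sed works byte by byte,
# each alternative matches the whole three-byte sequence.
straighten_quotes() {
  LC_ALL=C sed -E "s/(‘|’)/'/g"
}

printf '%s\n' '‘foo’' | straighten_quotes    # prints 'foo'
```

This makes alternation a workable fallback in environments where you can't control the locale.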

So make sure your locale is set properly for your text encoding and see if that resolves your issue.
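One way to check is to ask the `locale` utility which character encoding is in effect. This is only a sketch: the function name `is_utf8_locale` is my own, and it assumes your system provides the POSIX `locale charmap` command (the exact name printed for a non-UTF-8 charmap varies by system):

```shell
# Succeed if the active locale's character encoding is UTF-8.
# If the locale command is missing, this treats the locale as non-UTF-8.
is_utf8_locale() {
  [ "$(locale charmap 2>/dev/null)" = "UTF-8" ]
}

if is_utf8_locale; then
  echo "UTF-8 locale active"
else
  echo "not a UTF-8 locale: $(locale charmap 2>/dev/null)"
fi
```

A script that relies on multibyte character classes could run a check like this first and fall back to alternation otherwise.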

Mark Reed
  • Note that this depends on having a locale-aware version of `sed` that supports UTF-8 -- some older versions might not know how to do this. – Gordon Davisson May 14 '20 at 19:50
  • Thank you! That works. I've now put `LC_ALL=nl_NL.UTF-8` in my `~/.bash_profile`. Is that a sensible solution? Or might that lead to problems with other Bash scripts? – Daan May 14 '20 at 20:06
  • Also thanks for the explanation. It's quite tricky this stuff. – Daan May 14 '20 at 20:08
  • Depending on your host OS you may have a better way of setting the system locale than just setting LC_ALL in your personal profile, but that's a fine solution. You might find the odd bash script that makes bad assumptions, but I haven't had any trouble to speak of, and I've been using UTF-8 on the command line since before it was cool. :) – Mark Reed May 14 '20 at 20:11
  • Ok, great. I'm using MacOS. Correction: I have to use `export LC_ALL=nl_NL.UTF-8` to make it work. The default value of `LC_ALL` is strangely empty even though all the other LC variables are set correctly. This is not a problem for running scripts inside the Terminal, but outside the Terminal you have to use the export line. – Daan May 14 '20 at 21:40
  • Well, the specific setting that matters here is actually `LC_CTYPE`, but it's usually best to have the locale settings agree with each other. On my Mac, the system sets `LANG`, whose value is reported by `locale` as the value of all the individual LC_ parameters (except LC_ALL) even though the individual $LC_whatever environment variables are not set. – Mark Reed May 15 '20 at 17:30
  • I've made a [nice little overview](https://i.stack.imgur.com/G6njm.png) of the different results of `locale` I get. As you can see, it uses "C" outside of the Terminal by default. – Daan May 15 '20 at 19:58
  • Yeah, MacOS doesn't use the POSIX locale settings natively - it has its own controls in System Preferences. So if you launch something from the GUI without doing the login shell initialization that you get in Terminal, none of the envars are set and everything will default to "C". – Mark Reed May 15 '20 at 20:25
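
The precedence these comments describe (`LC_ALL` overrides `LC_CTYPE`, which overrides `LANG`) can be sketched roughly as follows. The function name `effective_ctype` is hypothetical, and falling back to "C" when nothing is set is an assumption: POSIX leaves that default implementation-defined, though it matches what was observed above outside of Terminal.

```shell
# Print the locale name that governs character classification, following
# the POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
# Falling back to "C" when none is set is an assumption (see above).
effective_ctype() {
  if [ -n "${LC_ALL:-}" ]; then
    printf '%s\n' "$LC_ALL"
  elif [ -n "${LC_CTYPE:-}" ]; then
    printf '%s\n' "$LC_CTYPE"
  elif [ -n "${LANG:-}" ]; then
    printf '%s\n' "$LANG"
  else
    printf 'C\n'
  fi
}

effective_ctype
```

This only reads environment variables; it does not check that the named locale is actually installed, which `locale -a` can tell you.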