Your string uses a multibyte encoding, specifically UTF-8, in which each curly quote takes three bytes. Your sed implementation, however, is treating each byte as a separate character, probably because of your locale settings. I can reproduce your problem by setting my locale to "C" (the old default POSIX locale, which assumes ASCII):
$ LC_ALL=C sed -E "s/[‘’]/'/g" <<<'‘foo’' # C locale, single-byte chars
'''foo'''
But in my normal locale of en_US.UTF-8 ("US English encoded with UTF-8"), I get the desired result:
$ LC_ALL=en_US.UTF-8 sed -E "s/[‘’]/'/g" <<<'‘foo’' # UTF-8 locale, multibyte chars
'foo'
The way you're running it, sed doesn't see [‘’] as a sequence of four characters but as a sequence of eight single-byte ones. So each of the six bytes between the brackets (or rather, each of the four distinct values among them: E2, 80, 98, and 99) is considered a member of the character class, and each matching byte is separately replaced by an apostrophe. That is why each of your three-byte curly quotes is replaced by three apostrophes.
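If you want to look at the bytes yourself, something like the following should do it. This is only a sketch: hex_bytes is a helper name I made up for illustration, and it relies on nothing beyond the standard printf, od, and tr.

```shell
# hex_bytes is a hypothetical helper: print the bytes of its argument as hex.
hex_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \t\n'
    echo
}

# Octal escapes spell out the UTF-8 encodings, so this works in any locale.
hex_bytes "$(printf '\342\200\230')"   # left curly quote, U+2018
hex_bytes "$(printf '\342\200\231')"   # right curly quote, U+2019
```

The two calls print e28098 and e28099: three bytes per quote, with only the last byte differing, which gives the four distinct byte values that end up in the character class.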
The version that uses alternation works because each alternative can be more than one character long; sed is still treating ‘ and ’ as three-character sequences instead of individual characters, but that treatment doesn't change the result.
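As a rough illustration, here is the alternation form run under the byte-oriented C locale on purpose. The function name straighten_quotes is my own invention, and I'm assuming a sed that accepts -E, as in the commands above.

```shell
# straighten_quotes is a hypothetical wrapper around the alternation form.
# LC_ALL=C forces byte-at-a-time matching, mirroring the failing setup.
straighten_quotes() {
    LC_ALL=C sed -E "s/‘|’/'/g"
}

# The octal escapes are the UTF-8 bytes of the curly quotes around "foo".
printf '\342\200\230foo\342\200\231\n' | straighten_quotes
```

Each alternative matches its whole three-byte sequence in one go, so every curly quote is replaced by exactly one apostrophe even though sed never recognizes it as a single character.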
So make sure your locale is set properly for your text encoding and see if that resolves your issue.
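One way to check is sketched below. It assumes your system provides the POSIX locale utility; is_utf8_locale is a hypothetical helper name, and the names of installed locales vary from system to system.

```shell
# is_utf8_locale is a hypothetical helper: succeed if the active locale's
# character encoding is UTF-8.
is_utf8_locale() {
    [ "$(locale charmap 2>/dev/null)" = "UTF-8" ]
}

if is_utf8_locale; then
    echo "locale is UTF-8"
else
    echo "locale is not UTF-8; installed UTF-8 locales:"
    # Names differ between systems (en_US.UTF-8, en_US.utf8, C.UTF-8, ...).
    locale -a 2>/dev/null | grep -i 'utf-*8' || echo "(none found)"
fi
```

If the check fails, exporting LC_ALL (or LANG) set to one of the listed names before running sed should be enough; en_US.UTF-8 is just the one I happen to use.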