
Consider the following text:

sample_text = "The fox's color was \u201Cbrown\u201D and it’s speed was quick"

Notice that there is a regular single quote in "fox's" and a right single quote in "it’s"

My goal is to get the original text representation of the encoded characters in sample_text, but I have not been able to do so completely.

I did the following:

>>> sample_text.encode().decode('unicode-escape')
"The fox's color was "brown" and itâ\x80\x99s speed was quick"

My question is: is there any way to get the original right single quote back after decoding sample_text? As my output shows, I get itâ\x80\x99s instead. I want it to be: it’s

Edit: As suggested in the comments, I'm adding the output of print(sample_text)

print(sample_text)
output: The fox's color was \u201Cbrown\u201D and it’s speed was quick

Edit: I'm using python 3.8.10 and Ubuntu

impy
  • I don't understand why someone downvoted my question. If there's anything wrong about it, please help me by mentioning it so I can improve it. Thanks! – impy Jul 07 '23 at 20:37
  • Please [edit] your question to improve your [mcve]. In particular, share `print(sample_text)`. BTW, if you define your *raw string* as `sample_text = r"The fox's color was \u201Cbrown\u201D and it’s speed was quick"` then `sample_text.encode( 'raw_unicode_escape').decode( 'unicode_escape')` should do the trick… – JosefZ Jul 07 '23 at 20:53
  • Try `sample_text.encode().decode()` (without passing 'unicode-escape' as parameter to the decode-method) or just `print(sample_text)`. In my python console in linux that's working fine. – kzi Jul 07 '23 at 20:53
  • @JosefZ I have already provided the output of my code and a clear reproducible example. Thanks for mentioning it though. and unfortunately your suggestion didn't work. It doesn't decode the \\u201Cbrown\\u201D part – impy Jul 07 '23 at 21:01
  • @kzi thanks for the comment. your suggestion leaves out \\u201Cbrown\\u201D undecoded – impy Jul 07 '23 at 21:02
  • Just share `print(sample_text)`, please. – JosefZ Jul 07 '23 at 21:02
  • Which OS and version of Python are you using? It makes a huge difference. – Mark Ransom Jul 07 '23 at 21:24
  • @MarkRansom I'm using python 3.8.10 and ubuntu – impy Jul 07 '23 at 21:34
  • Thanks for the info. Both your OS and Python should be using UTF-8 so I really don't understand why there's a problem. – Mark Ransom Jul 08 '23 at 02:53
  • @MarkRansom I suspect the problem is a misunderstanding of the role of C-style Unicode escapes in Python, i.e. the string is fine as it was, introducing encode/decode pairs has just muddied the waters. – Andj Jul 09 '23 at 10:59

3 Answers


If I understand your question correctly, there are two parts to it:

  1. a concern about the presence of C-style Unicode escapes in your string, and
  2. how to handle the apostrophe-like character in "it’s".

Your question indicates that you are using Python 3.8.10 on Ubuntu, so your environment will be using Unicode (UTF-8). There shouldn't be any need for encode/decode pairs if your string is "The fox's color was \u201Cbrown\u201D and it’s speed was quick".

sample_text = "The fox's color was \u201Cbrown\u201D and it’s speed was quick"
print(sample_text)
# The fox's color was “brown” and it’s speed was quick

I'm using macOS (with Apple's libc) rather than Ubuntu (with glibc), but the behaviour should be the same.

For Python, the escaped character is the same as the actual character, so:

import unicodedata as ud
print('\u201C' == '“')
# True
print(ud.name("\u201C"))
# LEFT DOUBLE QUOTATION MARK
print(ud.name('“'))
# LEFT DOUBLE QUOTATION MARK

If you avoid the encode/decode pairs then it should resolve your second problem.

Your string has other issues, though. Looking at the words in your string:

fox's uses U+0027 (APOSTROPHE), “brown” uses U+201C (LEFT DOUBLE QUOTATION MARK) and U+201D (RIGHT DOUBLE QUOTATION MARK), and it’s uses U+2019 (RIGHT SINGLE QUOTATION MARK)
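To check which quote characters a string actually contains, you can list the code point and Unicode name of each punctuation character. This is a minimal sketch; the helper name `describe_quotes` is made up for illustration and is not part of any library:

```python
import unicodedata as ud

def describe_quotes(text):
    """Return (char, 'U+XXXX', name) for each distinct punctuation character."""
    seen = []
    for ch in text:
        # General categories starting with 'P' are punctuation
        # (Po, Pi and Pf cover the quote characters discussed here).
        if ud.category(ch).startswith('P') and ch not in seen:
            seen.append(ch)
    return [(ch, f'U+{ord(ch):04X}', ud.name(ch)) for ch in seen]

sample_text = "The fox's color was \u201Cbrown\u201D and it\u2019s speed was quick"
for row in describe_quotes(sample_text):
    print(row)
```

Each printed row pairs a character with its code point and name, which makes a mix of U+0027 and U+2019 easy to spot.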

You are using U+0027 and U+2019 for the same purpose. It would be useful to clean up the string. Since you are using smart quotes elsewhere:

sample_text = sample_text.replace('\u0027', '\u2019')
print(sample_text)
# The fox’s color was “brown” and it’s speed was quick

You mention needing to get the original text representation of your string. Your string may already be the original, as it is. The fact that you are using smart double quotes implies that your apostrophes should probably be right single quotation marks, to match. What the original string was depends on which keystrokes and which editing tools were used to create it, but that takes you down a complex rabbit hole.

A cleaner approach is to think in terms of normalising your string, i.e. choosing a preferred Unicode character for apostrophe-like characters. That is the approach I took above, using str.replace() to make the string use smart quotes consistently. Obviously you could instead normalise away from smart quotes to the Basic Latin (ASCII) quotes:

sample_text = sample_text.replace('\u2019', '\u0027').replace('\u201C', '"').replace('\u201D', '"')
print(sample_text)
# The fox's color was "brown" and it's speed was quick
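If several characters need mapping, a translation table may be tidier than chained str.replace() calls. This is only a sketch: the names `ASCII_QUOTES` and `to_ascii_quotes` are illustrative, and the table covers just the four common typographic quotes, which is an assumption about what your data contains.

```python
# Map typographic quotes to their Basic Latin (ASCII) counterparts.
ASCII_QUOTES = str.maketrans({
    '\u2018': "'",   # LEFT SINGLE QUOTATION MARK
    '\u2019': "'",   # RIGHT SINGLE QUOTATION MARK
    '\u201C': '"',   # LEFT DOUBLE QUOTATION MARK
    '\u201D': '"',   # RIGHT DOUBLE QUOTATION MARK
})

def to_ascii_quotes(text):
    """Replace smart quotes with ASCII quotes in a single pass."""
    return text.translate(ASCII_QUOTES)

smart = "The fox\u2019s color was \u201Cbrown\u201D and it\u2019s speed was quick"
print(to_ascii_quotes(smart))
```
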
Andj

According to your post and your edits this should work for you:

>>> text_part_1 = "The fox's color was "
>>> text_part_2 = " and it’s speed was quick"
>>> color = "\u201Cbrown\u201D"
>>> color = color.encode().decode('unicode-escape')
>>> print(f'{text_part_1}{color}{text_part_2}')

To avoid confusion, I should add that this does not work for me; it gives me this:

>>> print(f'{text_part_1}{color}{text_part_2}')
The fox's color was âbrownâ and it’s speed was quick

(I'm using python 3.10.6 in Ubuntu 22.04.2 in WSL2 right now)

But since the color was output correctly in your code sample

>>> sample_text.encode().decode('unicode-escape')
"The fox's color was "brown" and itâ\x80\x99s speed was quick"

it should work for you.

kzi

Read about unicode-escape in Python Specific Encodings (emphasis mine):

Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decode from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default.

Hence, .encode() produces UTF-8 bytes, which .decode('unicode_escape') then reads as Latin-1, resulting in mojibake:

'it’s'.encode()                            # b'it\xe2\x80\x99s'
'it’s'.encode().decode('unicode_escape')   #  'itâ\x80\x99s'
'it’s'.encode().decode('latin-1')          #  'itâ\x80\x99s'
'it’s'.encode().decode('unicode_escape') == 'it’s'.encode().decode('latin-1')   # True
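Because this particular garbling is just UTF-8 bytes read as Latin-1, it can be reversed by going back through the same two codecs in the opposite order. A small sketch, assuming the garbled text contains nothing but such misread UTF-8 (the variable names are illustrative):

```python
original = 'it\u2019s'

# UTF-8 bytes misread as Latin-1: this is the mojibake shown above.
garbled = original.encode('utf-8').decode('latin-1')

# Undo it: recover the raw bytes via Latin-1, then decode them as UTF-8.
repaired = garbled.encode('latin-1').decode('utf-8')

print(repaired == original)
```

This repairs damage after the fact; it is usually better to avoid the mismatched encode/decode pair in the first place.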

The solution is in the following code:

sample_text = "The fox's color was \u201Cbrown\u201D and it’s speed was quick"
print(sample_text)    # regular python text
sample_text = r"The fox's color was \u201Cbrown\u201D and it’s speed was quick"
print(sample_text)    # raw python text
print(sample_text.encode('raw_unicode_escape').decode('unicode_escape'))
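An alternative that never goes through bytes is to replace only the literal \uXXXX sequences with a regular expression, leaving every other character (such as ’) untouched. This is a hedged sketch rather than a full escape decoder: the function name `decode_u_escapes` is hypothetical, and it handles only four-digit \u escapes, not \U, \x, or surrogate pairs.

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def decode_u_escapes(text):
    """Replace literal \\uXXXX sequences with the characters they name."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

raw_text = r"The fox's color was \u201Cbrown\u201D and it’s speed was quick"
print(decode_u_escapes(raw_text))
```
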

Linux:

~$ python3
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> sample_text = "The fox's color was \u201Cbrown\u201D and it’s speed was quick"
>>> print(sample_text)
The fox's color was “brown” and it’s speed was quick
>>> sample_text =r"The fox's color was \u201Cbrown\u201D and it’s speed was quick"
>>> print(sample_text)
The fox's color was \u201Cbrown\u201D and it’s speed was quick
>>> print(sample_text.encode( 'raw_unicode_escape').decode( 'unicode_escape'))
The fox's color was “brown” and it’s speed was quick
>>>

Windows:

Python 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]
IPython 8.14.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: sample_text = "The fox's color was \u201Cbrown\u201D and it’s speed was quick"
   ...: print(sample_text)
   ...: sample_text =r"The fox's color was \u201Cbrown\u201D and it’s speed was quick"
   ...: print(sample_text)
   ...: print(sample_text.encode( 'raw_unicode_escape').decode( 'unicode_escape'))
   ...:
The fox's color was “brown” and it’s speed was quick
The fox's color was \u201Cbrown\u201D and it’s speed was quick
The fox's color was “brown” and it’s speed was quick
In [2]:
JosefZ