Converting utf-8 characters to scandic letters

Question

I am struggling to decode a string in which Scandic letters appear as UTF-8 byte escapes. For example, I would like to convert the following string:

test_string = "\xc3\xa4\xc3\xa4abc"

into the form:

test_string = "ääabc"

The end goal is to send this string to a Slack channel via the API. I did some testing and found that Slack handles Scandic letters properly. I have tried the following command:

test_string = test_string.encode('latin1').decode('utf-8')

but this does not change the string at all.

Same goes for the more brute-force method:

def simple_scand_convert(string):
   string = string.replace("\xc3\xa4", "ä")

Again, this does not change the string at all. Any tips or material where I could look for a solution?

It should work now; you forgot to add a return statement for the result. Now, every time you call the function, you supply it with the string you need to process with the replacement. — Karam Qusai, Jun 21 '21 at 14:14
I forgot to include the return statement in my example, but I already have it in my code. — BeastFromTheEast, Jun 21 '21 at 14:16
`test_string = test_string.encode('latin1').decode('utf-8')` should actually work, see https://stackoverflow.com/questions/42795042/how-to-cast-a-string-to-bytes-without-encoding. I'm not sure what you did wrong. — mkrieger1, Jun 21 '21 at 14:21
Yeah, the `latin1` -> `u8` conversion works fine for me. If you have the ability to read the response directly as bytes instead of passing it through in string form, `b"\xc3\xa4\xc3\xa4abc".decode('u8')` also yields the desired outcome. — Randy, Jun 21 '21 at 14:23
`test_string.encode('latin1').decode('utf-8')` gives me the correct result on a Linux console. Maybe you use a console which doesn't use `utf-8` but some other encoding. As far as I know, Windows may use `cp1250` in the console. — furas, Jun 21 '21 at 14:59
If the *terminal* where you write your string is not UTF-8 enabled, you are likely to see the encoded characters in it. Most IDE internal terminals are not, and neither is the default Windows *console*. — Serge Ballesta, Jun 21 '21 at 15:38
Just to make sure, what do `import sys; print(sys.version)`, `print(type(test_string))` and `print([(c, hex(ord(c))) for c in test_string])` give, if the type printed is `str` or `unicode`? — Serge Ballesta, Jun 21 '21 at 15:42
@BeastFromTheEast "I forgot to include the return statement in my example". Then *edit your question* so it accurately represents and reproduces the problem you have. — Mark Tolonen, Jun 21 '21 at 16:38
@BeastFromTheEast "Test string is coming inside SOUP-response". Show this code then. It would be better to read the response correctly with UTF-8 in the first place instead of trying to patch the problem with the incorrect response. — Mark Tolonen, Jun 21 '21 at 16:40
Maybe `test_string.encode( 'raw_unicode_escape').decode( 'unicode_escape').encode( 'latin1').decode( 'utf-8')`? — JosefZ, Jun 21 '21 at 19:37
Hi all, sorry for the delayed response. The solution JosefZ provided seemed to fix the problem. I was reading the soup message from an incoming webhook by: ```data = str(request.get_data()) soup = BeautifulSoup(data, "utf-8") test_string = soup.find("test_string").get_text()``` I still have no idea what was the problem with using: ```test_string.encode('latin1').decode('utf-8') ``` But thank you Josef, that really made my day! — BeastFromTheEast, Jun 22 '21 at 06:02
@josefz, you can post that comment as an answer so I can mark the problem solved — BeastFromTheEast, Jun 22 '21 at 06:09

score 0 · Answer 1 · answered Jun 21 '21 at 14:32

Based on the original question and the discussion in the comments, I suspect that you're just not saving the result of the conversion. Python strings are immutable, and rebinding a parameter inside a function only changes the local name, so it does nothing to the caller's string:

In [42]: def change_string(s):
    ...:     s = "hello world"
    ...:
    ...: test_s = "still here"
    ...: change_string(test_s)
    ...: print(test_s)
still here

Instead, you'll want to return the results of the conversion in the function and reassign the variable:

In [43]: def change_string(s):
    ...:     s = s.encode('latin1').decode('u8')
    ...:     return s
    ...:
    ...: test_s = "\xc3\xa4\xc3\xa4abc"
    ...: test_s = change_string(test_s)
    ...: print(test_s)
ääabc
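
The round trip above only helps when the string really holds the characters U+00C3 and U+00A4. If it holds literal backslashes followed by `xc3`, it is pure ASCII and the round trip leaves it unchanged, which would match the behaviour described in the question. A minimal sketch of how to tell the two cases apart, in the spirit of the diagnostic suggested in the comments (the names `mojibake`, `escaped`, `describe` and `latin1_roundtrip` are made up for illustration):

```python
# Two strings that can look alike in a log or debugger but differ in content.
mojibake = "\xc3\xa4\xc3\xa4abc"       # 7 characters, starting with U+00C3, U+00A4
escaped = "\\xc3\\xa4\\xc3\\xa4abc"    # 19 characters, with literal backslashes


def describe(s):
    """Return the length of s and the hex code point of each character."""
    return len(s), [hex(ord(c)) for c in s]


def latin1_roundtrip(s):
    """Reinterpret code points 0-255 as bytes, then decode them as UTF-8."""
    return s.encode("latin1").decode("utf-8")


print(describe(mojibake))                     # length 7, code points 0xc3, 0xa4, ...
print(describe(escaped)[0])                   # 19
print(latin1_roundtrip(mojibake) == "ääabc")  # True
print(latin1_roundtrip(escaped) == escaped)   # True: nothing was converted
```

If the second case applies, the escapes have to be undone first, or better, the data should be read as bytes and decoded once.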

score 0 · Accepted Answer · answered Jun 22 '21 at 17:20

I can't reproduce your reading the soup message from an incoming webhook code snippet; therefore, my answer is based on hard-coded data, and shows how Python specific text encodings raw_unicode_escape and unicode_escape work in detail:

test_string = "\\xc3\\xa5\\xc3\\xa4___\xc3\xa5\xc3\xa4"    # hard-coded
print('test_string                  ', test_string)
print('.encode("raw_unicode_escape")',
  test_string.encode( 'raw_unicode_escape'))
print('.decode(    "unicode_escape")',
  test_string.encode( 'raw_unicode_escape').decode( 'unicode_escape'))
print('.encode("latin1").decode()   ', 
  test_string.encode( 'raw_unicode_escape').decode( 'unicode_escape').
              encode( 'latin1').decode( 'utf-8'))

Output: \SO\68069394.py

test_string                   \xc3\xa5\xc3\xa4___Ã¥Ã¤
.encode("raw_unicode_escape") b'\\xc3\\xa5\\xc3\\xa4___\xc3\xa5\xc3\xa4'
.decode(    "unicode_escape") Ã¥Ã¤___Ã¥Ã¤
.encode("latin1").decode()    åä___åä
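
A likely source of the literal backslashes is the `str(request.get_data())` call quoted in the comments. The sketch below assumes that `get_data()` returns `bytes` (the web framework is not named in the question, so this is a guess) and uses a made-up payload; `unescape_utf8` is a hypothetical helper that simply wraps the chain shown above.

```python
# Simulated webhook body; stands in for whatever request.get_data() returned.
raw = "<test_string>ääabc</test_string>".encode("utf-8")

# str() on a bytes object returns its repr, so each non-ASCII byte turns
# into a literal four-character escape such as \xc3.
as_repr = str(raw)
print(as_repr)  # b'<test_string>\xc3\xa4\xc3\xa4abc</test_string>'

# Decoding the bytes instead keeps the original text intact.
as_text = raw.decode("utf-8")


def unescape_utf8(s):
    """Undo literal \\xNN escapes, then reinterpret the result as UTF-8."""
    return (
        s.encode("raw_unicode_escape")
        .decode("unicode_escape")
        .encode("latin1")
        .decode("utf-8")
    )


# The repair works, but the b'...' wrapper from the repr is still there.
print(unescape_utf8(as_repr) == "b'" + as_text + "'")  # True
```

If the assumption holds, decoding the bytes once (or handing the bytes straight to the parser) avoids the problem at its source, and the escape chain is only needed for strings that have already been damaged.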
