I am struggling to decode a string in which Scandinavian letters appear as UTF-8 byte escapes. For example, I would like to convert the following string:

test_string = "\xc3\xa4\xc3\xa4abc"

into the form:

test_string = "ääabc"

The end goal is to send this string to a Slack channel via the API. I did some testing and found that Slack handles Scandinavian letters properly. I have tried the following command:

test_string = test_string.encode('latin1').decode('utf-8')

but this does not change the string at all.

The same goes for a more brute-force method:

def simple_scand_convert(string):
   string = string.replace("\xc3\xa4", "ä")

Again, this does not change the string at all. Any tips or materials from where I could look for the solution?

Randy
  • It should work now. You forgot to add a return statement for the result; with it, every call to the function returns the processed string you supplied. – Karam Qusai Jun 21 '21 at 14:14
  • I forgot to include the return statement in my example, but I already have it in my code. – BeastFromTheEast Jun 21 '21 at 14:16
  • Where is the `test_string` coming from? – mkrieger1 Jun 21 '21 at 14:18
  • Test string is coming inside SOUP-response – BeastFromTheEast Jun 21 '21 at 14:20
  • `test_string = test_string.encode('latin1').decode('utf-8')` should actually work, see https://stackoverflow.com/questions/42795042/how-to-cast-a-string-to-bytes-without-encoding. I'm not sure what you did wrong. – mkrieger1 Jun 21 '21 at 14:21
  • Yeah, the `latin1` -> `u8` conversion works fine for me. If you have the ability to read the response directly as bytes instead of passing it through in string form, `b"\xc3\xa4\xc3\xa4abc".decode('u8')` also yields the desired outcome. – Randy Jun 21 '21 at 14:23
  • `test_string.encode('latin1').decode('utf-8')` gives me the correct result on a Linux console. Maybe you are using a console that uses an encoding other than `utf-8`. As far as I know, Windows may use `cp1250` in the console. – furas Jun 21 '21 at 14:59
  • If the *terminal* where you write your string is not UTF-8 enabled, you are likely to see the encoded characters in it. Most IDE internal terminals are not, and neither is the default Windows *console*. – Serge Ballesta Jun 21 '21 at 15:38
  • Just to make sure, what do `import sys; print(sys.version)`, `print(type(test_string))` and `print([(c, hex(ord(c))) for c in test_string])` give, if the type is `str` or `unicode`? – Serge Ballesta Jun 21 '21 at 15:42
  • @BeastFromTheEast "I forgot to include the return statement in my example". Then *edit your question* so it accurately represents and reproduces the problem you have. – Mark Tolonen Jun 21 '21 at 16:38
  • @BeastFromTheEast "Test string is coming inside SOUP-response". Show this code then. It would be better to read the response correctly with UTF-8 in the first place instead of trying to patch the problem with the incorrect response. – Mark Tolonen Jun 21 '21 at 16:40
  • Maybe `test_string.encode( 'raw_unicode_escape').decode( 'unicode_escape').encode( 'latin1').decode( 'utf-8')`? – JosefZ Jun 21 '21 at 19:37
  • Hi all, sorry for the delayed response. The solution JosefZ provided seemed to fix the problem. I was reading the soup message from an incoming webhook with: `data = str(request.get_data())`, `soup = BeautifulSoup(data, "utf-8")`, `test_string = soup.find("test_string").get_text()`. I still have no idea what the problem was with using `test_string.encode('latin1').decode('utf-8')`. But thank you Josef, that really made my day! – BeastFromTheEast Jun 22 '21 at 06:02
  • @josefz, you can post that comment as an answer so I can mark the problem solved – BeastFromTheEast Jun 22 '21 at 06:09
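
A likely explanation for why the plain `latin1` round trip had no effect, sketched under the assumption that `request.get_data()` returns `bytes` (as Flask's method of that name does; the question does not name the framework): calling `str()` on a bytes object produces its printable representation, so every `\xNN` escape becomes a literal backslash followed by ASCII characters. The `raw` value below is a hypothetical stand-in for the webhook body.

```python
# Hypothetical stand-in for the raw webhook body (UTF-8 encoded bytes).
raw = b"\xc3\xa4\xc3\xa4abc"

# str() on bytes yields the repr: a b'...' wrapper and literal backslashes.
as_repr = str(raw)
print(as_repr)  # b'\xc3\xa4\xc3\xa4abc'

# The latin1 round trip leaves it unchanged, because the text is pure ASCII.
unchanged = as_repr.encode("latin1").decode("utf-8")
print(unchanged == as_repr)  # True

# Decoding the bytes directly gives the intended text.
decoded = raw.decode("utf-8")
print(decoded)  # ääabc
```

If that assumption holds, decoding the body with `.decode("utf-8")` instead of wrapping it in `str()` would avoid the problem at the source, as suggested in the comments above.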

2 Answers

Based on the original question and the discussion in the comments, I suspect that you're just not saving the result of the conversion. Python strings are immutable, and assigning a new value to a parameter inside a function only rebinds the local name; it does nothing to the caller's variable:

In [42]: def change_string(s):
    ...:     s = "hello world"
    ...:
    ...: test_s = "still here"
    ...: change_string(test_s)
    ...: print(test_s)
still here

Instead, you'll want to return the results of the conversion in the function and reassign the variable:

In [43]: def change_string(s):
    ...:     s = s.encode('latin1').decode('u8')
    ...:     return s
    ...:
    ...: test_s = "\xc3\xa4\xc3\xa4abc"
    ...: test_s = change_string(test_s)
    ...: print(test_s)
ääabc
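
Whether this round trip is enough depends on what the string really contains. A quick check, along the lines suggested in the comments, is to look at the code points. This is only a sketch, and the helper name `describe` is made up for illustration.

```python
def describe(s):
    """Return the code points of s as hex strings, for inspection."""
    return [hex(ord(c)) for c in s]

# Mojibake: each UTF-8 byte was turned into one Latin-1 character.
mojibake = "\xc3\xa4"
print(describe(mojibake))     # ['0xc3', '0xa4']

# Escaped text: a literal backslash followed by 'x', 'c', '3', ...
escaped = "\\xc3\\xa4"
print(describe(escaped)[:4])  # ['0x5c', '0x78', '0x63', '0x33']

# Only the first form is repaired by the latin1 round trip.
print(mojibake.encode("latin1").decode("utf-8") == "ä")     # True
print(escaped.encode("latin1").decode("utf-8") == escaped)  # True
```

If the output starts with `0x5c`, the string holds literal backslashes and the round trip alone will return it unchanged.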
Randy

I can't reproduce your code snippet for reading the soup message from an incoming webhook. My answer is therefore based on hard-coded data, and shows in detail how the Python-specific text encodings raw_unicode_escape and unicode_escape work:

test_string = "\\xc3\\xa5\\xc3\\xa4___\xc3\xa5\xc3\xa4"    # hard-coded
print('test_string                  ', test_string)
print('.encode("raw_unicode_escape")',
  test_string.encode( 'raw_unicode_escape'))
print('.decode(    "unicode_escape")',
  test_string.encode( 'raw_unicode_escape').decode( 'unicode_escape'))
print('.encode("latin1").decode()   ', 
  test_string.encode( 'raw_unicode_escape').decode( 'unicode_escape').
              encode( 'latin1').decode( 'utf-8'))

Output: \SO\68069394.py

test_string                   \xc3\xa5\xc3\xa4___åä
.encode("raw_unicode_escape") b'\\xc3\\xa5\\xc3\\xa4___\xc3\xa5\xc3\xa4'
.decode(    "unicode_escape") åä___åä
.encode("latin1").decode()    åä___åä
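
The same chain can be wrapped in a small helper with each step annotated. This is a sketch only: the function name `unescape_utf8` is hypothetical, and it assumes the input contains no characters above U+00FF (otherwise the `latin1` step raises `UnicodeEncodeError`).

```python
def unescape_utf8(s):
    """Repair text holding literal backslash-x escapes or UTF-8 mojibake."""
    # 1. str -> bytes: characters up to U+00FF become single bytes,
    #    and literal backslashes are kept as they are.
    raw = s.encode("raw_unicode_escape")
    # 2. bytes -> str: backslash-x escapes are interpreted, so every
    #    original UTF-8 byte is now one character with the same value.
    unescaped = raw.decode("unicode_escape")
    # 3. str -> bytes -> str: turn those characters back into bytes
    #    and decode them as the UTF-8 they were meant to be.
    return unescaped.encode("latin1").decode("utf-8")

print(unescape_utf8("\\xc3\\xa4\\xc3\\xa4abc"))  # ääabc (escaped input)
print(unescape_utf8("\xc3\xa4\xc3\xa4abc"))      # ääabc (mojibake input)
```

Both kinds of input give the same result because step 2 also maps any non-escaped byte to the character with the same value.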
JosefZ