
I have a server written with FastAPI on Python 3.8.13 that receives form data from an external service; the data may include Hebrew letters. Until recently, the data arrived through a proxy server written in Python 2.7, and everything worked well. The proxy code looked something like this:

from requests import post
from bottle import route, request
@route('/', method='POST')
def new_req():
    post('https://....', dict(request.forms))
    return ''

The backend server code looks something like this:

from fastapi import FastAPI, Form, Request
app = FastAPI()

@app.post('/....')
async def get_message(request: Request, message: str = Form(...)):
    print(message)

As long as the data arrived via the proxy, there were no problems with Hebrew letters. From the moment we asked the service to send the messages directly to the backend server, strings such as 'שלום' started showing up as 'שלו×'. The service states that Hebrew data is sent as Unicode, so I tried something like this:

@app.post('/....')
async def get_message(request:Request, message:bytes=Form(...)):
    print(message)
    message = message.decode('utf-8')
    print(message)

The result (for: 'שלום'):

b'\xc3\x97\xc2\xa9\xc3\x97\xc2\x9c\xc3\x97\xc2\x95\xc3\x97\xc2\x9d'
ש×××

I tried replacing 'utf-8' with different encodings, international or Hebrew, and each time I got a new kind of gibberish. Does anyone have an idea what else to try?

dan04
Peleg
  • What are the actual bytes before you try to decode them as `'utf-8'`? The string `'ש×××'` doesn't help much to see what byte values each of the `×` has. – Roland Illig Aug 30 '22 at 20:49
  • Thanks for the comment @RolandIllig. I have edited and completed this information. – Peleg Aug 30 '22 at 21:20

1 Answer


This is a “double-UTF-8-encoded” string. That is, you start with the sequence of characters:

  1. U+05E9 HEBREW LETTER SHIN
  2. U+05DC HEBREW LETTER LAMED
  3. U+05D5 HEBREW LETTER VAV
  4. U+05DD HEBREW LETTER FINAL MEM

Encode them into UTF-8,

b'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'

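Misinterpret this byte string as being ISO-8859-1 encoded, producing the nonsense string (shown here as a Python literal, because the bytes 0x9C, 0x95, and 0x9D map to invisible C1 control characters)

'×©×\x9c×\x95×\x9d'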

And encode that string into UTF-8, which becomes the observed byte sequence.

b'\xc3\x97\xc2\xa9\xc3\x97\xc2\x9c\xc3\x97\xc2\x95\xc3\x97\xc2\x9d'
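The three steps above can be reproduced in a few lines of Python 3. This is only a sketch to confirm the diagnosis; the variable names are made up for illustration, and it says nothing about where in the real pipeline the extra encode happens.

```python
# Reproduce the double encoding step by step.
original = "\u05e9\u05dc\u05d5\u05dd"  # the four Hebrew letters listed above

# Step 1: the correct UTF-8 encoding of the text.
utf8_once = original.encode("utf-8")

# Step 2: the bytes are wrongly read as ISO-8859-1, one character per byte.
mojibake = utf8_once.decode("iso-8859-1")

# Step 3: the wrong string is encoded to UTF-8 a second time.
utf8_twice = mojibake.encode("utf-8")

print(utf8_once)
print(utf8_twice)
```

The second printed value is the byte sequence reported in the question, which is twice as long as the first because every byte in the 0x80 to 0xFF range becomes a two-byte UTF-8 sequence.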

So, you need to figure out where this redundant encode operation is occurring, and find a way to tell your program “This bytes object is already UTF-8, so don't re-encode it.”
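If the redundant encode cannot be removed at its source, one possible workaround is to reverse it after the fact. The helper below is a minimal sketch under that assumption; `undo_double_utf8` is a hypothetical name, not part of FastAPI or any library, and it assumes exactly one extra round of encoding.

```python
def undo_double_utf8(text: str) -> str:
    """Reverse one round of 'UTF-8 bytes misread as ISO-8859-1'.

    Hypothetical helper: returns the input unchanged when the reversal
    is not possible, which is the case for text that was decoded correctly.
    """
    try:
        # Map each character back to the single byte it came from,
        # then decode those bytes as the UTF-8 they really are.
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Characters outside Latin-1, or bytes that are not valid UTF-8:
        # the string was probably not double-encoded, so leave it alone.
        return text


received = (
    b"\xc3\x97\xc2\xa9\xc3\x97\xc2\x9c\xc3\x97\xc2\x95\xc3\x97\xc2\x9d"
).decode("utf-8")
repaired = undo_double_utf8(received)
```

The guard is only a heuristic: a genuine Latin-1-range string whose bytes happen to form valid UTF-8 would also be rewritten, so fixing the sender or the intermediate layer remains the cleaner solution.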

dan04
  • Since I use FastAPI, I could not prevent the double encoding, but I restored the original string as follows: `message = message.encode('ISO-8859-1').decode('utf-8')`. Thank you very much! – Peleg Aug 31 '22 at 15:03