
ド(U+30C9) vs ド(U+30C8 U+3099)

FYI, the situation is:

  1. A user uploaded a file whose name contains ド (U+30C8 U+3099) to AWS S3 from a web app.
  2. Then the website sent a POST request containing the file name, without URL encoding, to an AWS Lambda function for further processing in Python. By the time the name arrived in Lambda it had become ド (U+30C9). Python then failed to access the file stored in S3 because of the difference in Unicode code points.

I think the solution would be to URL-encode the name on the frontend before sending the request and to URL-decode it with urllib.parse.unquote on the backend, so that both sides end up with the same code points.
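Roughly what I have in mind for the decode side (a sketch; quote below just simulates whatever percent-encoding the frontend would actually do, e.g. encodeURIComponent):

    from urllib.parse import quote, unquote

    # The name as the browser had it: ド as U+30C8 + U+3099 (decomposed form).
    original = "\u30c8\u3099"

    # Frontend step (simulated here with quote): percent-encode the UTF-8 bytes.
    encoded = quote(original)   # '%E3%83%88%E3%82%99'

    # Lambda side: decode back; percent-encoding round-trips the exact code points.
    decoded = unquote(encoded)
    assert decoded == original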

My questions are

  1. Would URL encoding solve this issue? I can't reproduce it, probably because I am on a different OS from the user's.

  2. How exactly did this happen, given that both requests (uploading to S3 and sending the second request to Lambda) came from the user's machine?

Thank you.

Jun
  • Related: [Understanding unistr of unicodedata.normalize()](https://stackoverflow.com/questions/59979037/) – JosefZ Jan 29 '22 at 10:50

1 Answer


You are hitting a common case (perhaps more common in Latin scripts): canonical equivalence. Unicode requires canonically equivalent sequences to be handled in the same manner.

If you look in UnicodeData.txt you will find:

30C8;KATAKANA LETTER TO;Lo;0;L;;;;;N;;;;;
30C9;KATAKANA LETTER DO;Lo;0;L;30C8 3099;;;;N;;;;;

So 30C9 is canonically equivalent to 30C8 3099 (the sixth field is the canonical decomposition).
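You can check this directly with Python's unicodedata module (a quick sketch):

    import unicodedata

    precomposed = "\u30c9"       # ド  KATAKANA LETTER DO
    decomposed = "\u30c8\u3099"  # ト + COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK

    print(unicodedata.decomposition(precomposed))                   # '30C8 3099', as in UnicodeData.txt
    print(precomposed == decomposed)                                # False: different code point sequences
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
    print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True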

Usually, it is better to normalize Unicode strings to a common canonical form. Unfortunately there are two of them: NFC and NFD (Normalization Form Canonical Composition and Normalization Form Canonical Decomposition). Apple prefers the latter (and Unicode's original design leaned toward that form), while most other vendors prefer the former.

So do not trust web browsers to keep the same form. Also consider that input methods on the user's side may give you different variations (and with keyboards you may also get non-normalized sequences which should be normalized; this can happen with several combining characters).

So, on your backend you should choose a normalization form and transform all input data into that form (or make sure that all search and comparison functions handle equivalent sequences correctly, but that requires normalization on every call, so it may be less efficient).

Python has unicodedata.normalize() in the standard library (see the unicodedata module) to normalize Unicode strings. In other languages you may need the ICU library. In any case, you should normalize Unicode strings.
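A minimal sketch of that on the backend (normalize_name is just an illustrative helper; apply the same normalization wherever the file name is stored and wherever it is looked up, so both sides use the same form):

    import unicodedata

    def normalize_name(name: str) -> str:
        """Map any canonically equivalent spelling of a name to one form (NFC here)."""
        return unicodedata.normalize("NFC", name)

    # ド as U+30C9 and as U+30C8 U+3099 now map to the same key.
    assert normalize_name("\u30c9") == normalize_name("\u30c8\u3099")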

Note: this has nothing to do with encoding; it is built directly into Unicode's design. The reason is the requirement to be compatible with old encodings, which had both ways of describing the same characters.

Giacomo Catenazzi
  • I see. Thanks for the explanation. I guess I should do the normalization with JavaScript on the frontend too, since the file input is sent directly to a cloud service I have no control over? Then I can use the same canonical form on the backend to retrieve it from the cloud service's storage. – Jun Jan 31 '22 at 20:57
  • I think that is a sensible way. Just consider the security implications (maybe there are none): the front-end may not reach all pages on the back-end, or some clients may make inconsistent calls (authorization in one form, and the connection in the other form). – Giacomo Catenazzi Feb 01 '22 at 09:00