Suppose I have the following files in path, a folder in my Google Drive that is mounted in a Python 3 Colab notebook:

(The lines starting with # show the output.)

import os
ls = os.listdir(path)
print(ls)
# ['á.csv', 'b.csv']

Everything seems OK, but if I write

'á.csv' in ls
# False

But it should return True. However, if I repeat the same check and, instead of typing 'á.csv', copy-paste it from the output of print(ls), it returns True.

Thanks

PS: The problem is not specific to that filename; it occurs with several filenames that contain accented characters (namely í, á, é, ó, ñ).

Jason Aller
felipekare

2 Answers


You can normalize the file names to a single Unicode form (for example NFC) before comparing them.

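import os
from unicodedata import normalize

# Normalize every name returned by listdir to NFC (the composed form)
ls = [normalize('NFC', f) for f in os.listdir(path)]
# Compare against a name normalized the same way
normalize('NFC', 'á.csv') in ls
# or just 'á.csv' in ls, if the typed literal is already in NFC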
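
If you need this check in more than one place, it may help to wrap it in a small helper. The following is only a sketch: `nfc` and `has_file` are hypothetical names, and the `listing` list is a stand-in for what `os.listdir(path)` might return when the names are stored in decomposed form.

```python
import unicodedata


def nfc(name):
    """Return the name in NFC (composed) form."""
    return unicodedata.normalize("NFC", name)


def has_file(names, wanted):
    """Check membership after normalizing both sides to NFC."""
    normalized = {nfc(n) for n in names}
    return nfc(wanted) in normalized


# Stand-in for os.listdir(path): 'a' + combining acute accent (decomposed)
listing = ["a\u0301.csv", "b.csv"]

# The typed name uses the single code point U+00E1 (composed)
wanted = "\u00e1.csv"

print(wanted in listing)          # plain comparison: False
print(has_file(listing, wanted))  # normalized comparison: True
```

Normalizing both the listing and the name you look for means it does not matter which form either side arrived in.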
korakot

I believe it is because some accented characters in Unicode have more than one representation. That is, two strings can look identical on screen yet be made of different code points. Try 'á'.encode() once with á typed by hand and once with á copy-pasted as you did. If the bytes differ, the two strings are different character sequences that merely look identical.
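
To see the difference without relying on copy-paste, here is a small sketch that builds both forms of á explicitly with escape sequences. The variable names are just for illustration.

```python
import unicodedata

composed = "\u00e1"      # á as a single code point
decomposed = "a\u0301"   # 'a' followed by a combining acute accent

# They render the same but are different strings
print(composed == decomposed)        # False
print(composed.encode("utf-8"))      # b'\xc3\xa1'
print(decomposed.encode("utf-8"))    # b'a\xcc\x81'

# Normalizing to the same form makes them compare equal
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
print(same_after_nfc)                # True
```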

Hurried-Helpful
  • You're right! Typing ```'í'.encode()``` returns ```\xc3\xad```, while the copied version returns ```i\xcc\x81```. How can I fix it? Both are UTF-8, but the first encodes í as a single character and the second encodes i followed by a separate accent mark. The problem also reproduces in my offline Python (I don't know why it didn't before, but I will edit my post), so maybe it has something to do with ```listdir```. – felipekare Feb 01 '20 at 01:13
  • You can strip all accents from the file names, as long as that does not cause collisions. Look [here](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string) to find out how. – Hurried-Helpful Feb 01 '20 at 03:05
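
For the accent-stripping approach mentioned in the last comment, one common way is to decompose each name and drop the combining marks. This is a minimal sketch, not necessarily what the linked answer recommends; `strip_accents` and `find_collisions` are hypothetical helper names, and the sample names are made up.

```python
import unicodedata
from collections import defaultdict


def strip_accents(text):
    """Decompose to NFD, then drop combining marks such as accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_collisions(names):
    """Group original names by stripped name; keep groups with more than one."""
    groups = defaultdict(list)
    for name in names:
        groups[strip_accents(name)].append(name)
    return {key: value for key, value in groups.items() if len(value) > 1}


# Made-up names: the first two collapse to the same stripped name
names = ["\u00e1.csv", "a.csv", "ni\u00f1o.csv"]

print([strip_accents(n) for n in names])
print(find_collisions(names))
```

Note that this also turns ñ into n, which may not be what you want; checking for collisions before renaming anything seems prudent.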