Urdu strings look the same but compare unequal in Python 3

Question

In my application, I have a list of (Urdu) words in a text file (currently a single word, like this)

and I have another text file containing a string of Urdu (currently a single word, exactly the same as the one above)

Now I need to find out whether the string file contains any word that exists in the words file. For this, I'm reading both files into lists like this:

import codecs

# mode and encoding are set earlier (not shown here)

# Read the text file of strings.
fileToRead = codecs.open('string.txt', mode, encoding=encoding)
fileData = fileToRead.read()
lstFileData = fileData.split('\n')

# Read the text file of words.
wordListToRead = codecs.open('words.txt', mode, encoding=encoding)
wordData = wordListToRead.read()
lstWords = wordData.split('\n')

I'm simply traversing the list like this:

for string in lstFileData:
    if string in lstWords:
        pass  # do further work

It's not working, and I don't know why, although string is 'فلسفے' and lstWords contains this string. Do I need to add some encoding? Any kind of help will be appreciated.

It should work as it is; you'd better add a bit more code. – Andrii Maletskyi Oct 06 '18 at 14:40
Okay, let me add it in detail. – Naila Akbar Oct 06 '18 at 14:52
Please check the updated question. – Naila Akbar Oct 06 '18 at 15:06

golddove · Answer 1 · 2018-10-07T03:34:25.070

1

Just tried it out in python3 and it seems to work for me:

lstWords = ['a', 'فلسفے', 'b']
string = 'فلسفے'
if string in lstWords:
    print("yes")

Edit: Again, I just tested your updated code with file IO and it works fine (I did not specify an encoding). Here is a link to it working: https://trinket.io/python3/3890d8b261

edited Oct 07 '18 at 03:34

answered Oct 06 '18 at 14:40

golddove


Yes, it is exactly the same thing, and it is supposed to work, but it does not. – Naila Akbar Oct 06 '18 at 14:49

I think there is something else going on. Look at the link in my updated answer to see the code working just fine in python3. – golddove Oct 07 '18 at 03:35
Yes, the issue was in the file. I opened it in Notepad and updated it, and that changed it from UTF-8 to UTF-8 with BOM. I guess that was causing the issue. Once I made a new file in Notepad++ and saved it as UTF-8, the same code started working fine. – Naila Akbar Oct 07 '18 at 12:49

score 0 · Accepted Answer · answered Oct 07 '18 at 12:52


Maybe this will help someone like me.

It sounds funny, but the issue was the file's encoding. I opened the file in plain Notepad to make some changes and saved it, which changed the file from UTF-8 to UTF-8 with BOM, and my code didn't work on it. Once I created a new file in Notepad++ and saved it as UTF-8, the same code started working fine. (The issue was not in the code; it was in the file encoding.)
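A minimal sketch of why the comparison fails, assuming the file really does start with a UTF-8 BOM as described above. The bytes are built in memory to stand in for the Notepad-saved file (they are not the original file), and the variable names are only illustrative.

```python
import codecs

word = 'فلسفے'

# Stand-in for the file contents saved as "UTF-8 with BOM": BOM bytes + text.
raw = codecs.BOM_UTF8 + word.encode('utf-8')

# Decoding with plain "utf-8" keeps the BOM as an invisible U+FEFF character
# at the start of the string.
decoded = raw.decode('utf-8')

print(decoded == word)             # the two strings look alike but differ
print(decoded[0] == '\ufeff')      # the extra first character is the BOM
print(len(decoded) - len(word))    # one character longer than the plain word

# Stripping the BOM by hand makes the strings equal again.
cleaned = decoded.lstrip('\ufeff')
print(cleaned == word)
```

Printing repr() or len() of the strings being compared is a quick way to spot this kind of invisible character.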

answered Oct 07 '18 at 12:52

Naila Akbar


It might well help future readers, but it's quite unlikely to be found with the current, very specific title; consider changing it to something more general like "strings looking the same compare unequal" or similar. Btw: the correct encoding for opening files with a UTF-8 BOM is called "utf-8-sig" in Python. Otherwise (if you decode with "utf-8"), the BOM character will stick to the beginning of the content. – lenz Oct 07 '18 at 13:12
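As a rough illustration of the "utf-8-sig" suggestion, the sketch below writes a temporary words file with a BOM and reads it back both ways. The helper name read_words and the temporary file are assumptions made for the demo; they are not part of the original code.

```python
import os
import tempfile


def read_words(path, encoding='utf-8-sig'):
    """Return the non-empty lines of a text file.

    With the default "utf-8-sig", a leading BOM is dropped if present;
    files without a BOM are read normally.
    """
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'words.txt')

    # Writing with "utf-8-sig" puts a BOM at the start, like Notepad did.
    with open(path, 'w', encoding='utf-8-sig') as f:
        f.write('فلسفے\n')

    with_bom = read_words(path, encoding='utf-8')   # BOM stays on first word
    without_bom = read_words(path)                  # BOM removed

print('فلسفے' in with_bom)
print('فلسفے' in without_bom)
```

Reading with "utf-8-sig" avoids having to re-save the file in another editor, since it accepts input with or without the BOM.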
