
I have a text file with a list of repeated names (some of which have accented alphabets like é, à, î etc.)

e.g. List: Précilia, Maggie, Précilia

I need to write a code that will give an output of the unique names.

But my text file seems to use different character encodings for the accented é in the two occurrences of Précilia (my guess is perhaps ASCII for one and UTF-8 for the other). As a result, my code treats the two occurrences of Précilia as different unique elements. You can find my code below:

 seen = set()
 with open('./Desktop/input1.txt') as infile:
     with open('./Desktop/output.txt', 'w') as outfile:
         for line in infile:
             if line not in seen:
                 outfile.write(line)
                 seen.add(line)

Expected output: Précilia, Maggie

Actual and incorrect output: Précilia, Maggie, Précilia

Update: The original file is very large. I need a way to treat both of these occurrences as a single one.

  • you need to decode the input before comparing it. – thebjorn Jun 04 '19 at 00:32
  • Like I said, the file has different encodings. Also the original file is a much larger one. I tried decoding and the character è is decoded to different outputs of "é" and "eÌ", that's why they are considered unique. – Shritama Sengupta Jun 04 '19 at 00:39
  • Have a look at the [chardet](https://chardet.readthedocs.io/en/latest/index.html) library that helps you detect the encoding. That could then help you decode it appropriately. This can be a slow process – razdi Jun 04 '19 at 01:13
  • You can't compare byte streams in random encodings. You'll have to find a way to decode both into unicode before comparing... Btw "é" and "eÌ", are not decoded values. – thebjorn Jun 04 '19 at 09:09
  • Please try not decoding them and give us the byte sequences in question—just a few bytes for the relative substrings. And, you seem unsure of the character encoding of the file. Aside from doubts caused by your program, you should begin with a definite understanding of which character encoding the writer used. – Tom Blodget Jun 04 '19 at 16:51
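One quick way to see what the last comment is asking for is to print the code points of each occurrence. A precomposed "é" (U+00E9) and a decomposed "e" followed by a combining acute accent (U+0065 U+0301) look identical on screen but compare unequal. The sample strings below are illustrative, not the asker's actual data:

```python
import unicodedata

# Two visually identical spellings of the same name:
nfc = 'Pr\u00e9cilia'   # precomposed: 'é' is a single code point
nfd = 'Pre\u0301cilia'  # decomposed: 'e' + combining acute accent

print(nfc == nfd)                        # False: different code point sequences
print([hex(ord(c)) for c in nfc[:3]])    # ['0x50', '0x72', '0xe9']
print([hex(ord(c)) for c in nfd[:4]])    # ['0x50', '0x72', '0x65', '0x301']
print(unicodedata.name('\u0301'))        # COMBINING ACUTE ACCENT
```

This also explains the "eÌ" mentioned above: it is what the decomposed form looks like when its UTF-8 bytes are mis-decoded as a single-byte encoding.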

1 Answer


So my boss suggested we use Unicode normalization, which replaces equivalent sequences of characters so that any two texts that are equivalent are reduced to the same sequence of code points, called the normalization form (or normal form) of the original text.
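A minimal sketch of this approach in Python, using the standard-library `unicodedata` module. The file names and sample data here are illustrative, and NFC is one of several normalization forms you could choose:

```python
import unicodedata

# Illustrative input: the same name written in two Unicode spellings.
with open('input1.txt', 'w', encoding='utf-8') as f:
    f.write('Pr\u00e9cilia\n')   # precomposed 'é' (single code point)
    f.write('Maggie\n')
    f.write('Pre\u0301cilia\n')  # 'e' + combining acute accent

seen = set()
with open('input1.txt', encoding='utf-8') as infile, \
     open('output.txt', 'w', encoding='utf-8') as outfile:
    for line in infile:
        # Normalize to NFC so both spellings reduce to the same code points,
        # then deduplicate on the normalized key.
        key = unicodedata.normalize('NFC', line)
        if key not in seen:
            outfile.write(line)
            seen.add(key)
```

Because each line is processed as it is read and only the set of seen keys is held in memory, this streams through a large file without loading it all at once.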

More details can be found at https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html and https://github.com/aws/aws-cli/issues/1639

So far we have had positive results on our test cases, and hopefully this will work on our main data set too.