I'm trying to remove duplicate lines from a large CSV file. The removal works as intended, but I can't figure out how to set the encoding when the file is rewritten in place. Googling for an answer didn't help. Does anyone have a suggestion?

This is my code:

import fileinput

seen = set()
for line in fileinput.FileInput('Dupes.csv', inplace=True):
    if line in seen:
        continue  # skip duplicated line
    seen.add(line)
    print(line, end='')  # stdout is redirected into the file while inplace is active
Rainoa
  • I have tried using openhook, but it isn't allowed on inplace files. – Rainoa May 19 '17 at 18:44
  • The indentation is off in the code. Which encoding do you want, and where do you want to apply it? It's a little unclear what your actual problem is, at least to me. – Torxed May 19 '17 at 18:47
  • The statement `print(line, end='')` encodes the text using the current default encoding of your runtime. Maybe you simply want to encode it there? `print(line.encode('utf-8'), end='')` – dsh May 19 '17 at 18:53
  • @Torxed Fixed the indentation error. I'm trying to use 'Cp1252' encoding. My code works as intended, but the encoding messes up Danish letters. – Rainoa May 19 '17 at 19:16
  • @dsh Tried your solution, but it completely removes all the data for some reason :/ – Rainoa May 19 '17 at 19:17

1 Answer

This script works fine for me.

import fileinput
import sys

encoding = 'utf8'
end = '\n'

seen = set()
dupeCount = 0

for line in fileinput.FileInput('Dupes.csv', inplace=1, mode='rU'):
    stripped = line.strip()
    if stripped in seen:
        dupeCount += 1
        continue
    seen.add(stripped)

    # Sends the output in the right representation
    sys.stdout.buffer.write(stripped.encode(encoding) + end.encode(encoding))

print('Removed %d dupes' % dupeCount)

The idea is to read the file in the right mode and then write to it through stdout in the correct encoding, which is done by writing every line as its UTF-8 byte representation.

Tested with accents, seems to work.

WKnight02
  • Also, if you are checking the file content with Notepad++, don't forget to switch from whatever encoding you were using to UTF-8. Be careful to change the encoding used for reading rather than converting the file: use `Encode as UTF-8 (Without BOM)` rather than `Convert to UTF-8 (...)` – WKnight02 May 19 '17 at 20:12
  • Tested your code and I am still getting the same error :/ @WKnight02 Tried playing around with Notepad++. I actually get the right encoding when checking with Notepad++, but LibreOffice is giving me issues... Not sure what to conclude from this – Rainoa May 19 '17 at 20:57
  • Oh shit, I feel dumb. Didn't realise that the issue had nothing to do with encoding, but was actually my LibreOffice settings. Jesus... Should I delete the thread, or what should I do from here? – Rainoa May 19 '17 at 21:00
  • Well, one rock two birds I guess ! You now know how to get the correct encoding with FileInput, and have a correct LibreOffice setting :) Maybe just post an answer detailing what you did to solve your problem, and mark it as `answered`. I don't really know :) – WKnight02 May 19 '17 at 21:41
  • Cheers for the help. Your comment on checking encoding with Notepad++ made me look at my LibreOffice settings :) Appreciated! – Rainoa May 20 '17 at 01:25
  • The code has no problem outputting a UTF-8 character. However, it has trouble reading UTF-8 characters, so original input text that contains UTF-8 characters gets mutated into other random characters. – sasawatc Feb 14 '19 at 20:23
  • mode='rU' solved an issue I had rewriting files that had brackets [ ] in them. thanks! – iPzard Oct 20 '20 at 04:29
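
One way to address the reading problem raised in the comments is to skip `fileinput` and state the encoding explicitly on both sides. The following is only a sketch under assumptions: the helper name `dedupe_file`, its parameters, and the temporary-file approach are illustrative and do not come from the thread. It decodes the source with `read_encoding`, writes the unique lines to a temporary file in `write_encoding`, and then swaps that file into place.

```python
import os
import tempfile


def dedupe_file(path, read_encoding="cp1252", write_encoding=None):
    """Rewrite `path` without repeated lines; return how many were dropped.

    Hypothetical helper: both encodings are explicit, so nothing depends on
    the platform default. `write_encoding` defaults to `read_encoding`.
    """
    write_encoding = write_encoding or read_encoding
    seen = set()
    dupes = 0
    # Create the temporary file next to the target so os.replace stays on
    # one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        # newline="" keeps the original line endings untouched on both sides.
        with os.fdopen(fd, "w", encoding=write_encoding, newline="") as dst, \
                open(path, encoding=read_encoding, newline="") as src:
            for line in src:
                if line in seen:
                    dupes += 1
                    continue
                seen.add(line)
                dst.write(line)
        os.replace(tmp_path, path)  # swap the deduplicated file into place
    except BaseException:
        # Leave the original file untouched if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dupes
```

For example, `dedupe_file('Dupes.csv', read_encoding='cp1252', write_encoding='utf-8')` would deduplicate the file and convert it to UTF-8 in one pass. Unlike the answer's script, this version does not strip whitespace from lines before comparing them.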