
I have CSV files of scraped data that I want to preprocess by changing single quotation marks to double quotation marks. The files contain over 1,000 occurrences of character sequences such as ΓÇÖ and ΓÇ£ that I am unable to encode properly.

My current approach is to open the CSV file with a CSV reader using the encoding "cp437" (I have also tried cp1251) and then write the rows to a new CSV file with the encoding utf-8-sig (I have also tried utf-8, but then even more odd characters show up).

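    import codecs
    import csv

    # Read as cp437, write as UTF-8 with a BOM; both files are closed on exit.
    with codecs.open(path, 'r', encoding='cp437') as csvfile, \
            open(path_2, 'w', encoding='utf-8-sig', newline='') as outfile:
        csvreader = csv.reader(csvfile, delimiter=",")
        writer = csv.writer(outfile)
        for row in csvreader:
            writer.writerow(row)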

I saw in other threads that Notepad++ can be used to identify the encoding of the file. Interestingly, this method showed that both my raw CSV files are already UTF-8 encoded.
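
One way to double-check what Notepad++ reports is to attempt a strict UTF-8 decode of the raw bytes. This is only a minimal sketch: the helper name `looks_like_utf8` and the sample byte strings are hypothetical and not taken from my actual files.

```python
import codecs

def looks_like_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (with or without a BOM)."""
    # Strip a UTF-8 byte order mark if one is present.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True

# Typographic quotes encoded as UTF-8 pass the check ...
sample_utf8 = "\u201cquoted\u201d".encode("utf-8")
# ... while a lone cp1252 right quote (byte 0x92) is not valid UTF-8.
sample_cp1252 = b"it\x92s"
print(looks_like_utf8(sample_utf8), looks_like_utf8(sample_cp1252))
```

For a real file, the bytes would come from reading it in binary mode, for example `looks_like_utf8(open(path, 'rb').read())`.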

I'm unsure how to handle this problem, and it matters because these characters are shown to users in the frontend.

These are some of my earlier attempts, in case they are of any help:

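    # Replace curly double quotes with a plain apostrophe.
    cells_1 = re.sub('\u201c', "'", cells)
    cells_2 = re.sub('\u201d', "'", cells_1)
    # Map the cp1252 right single quote (\x92) to U+2019, then to an apostrophe.
    cells_3 = re.sub('\x92', '\u2019', cells_2)
    cells_4 = re.sub('\u2019', "'", cells_3)
    # Replace straight double quotes and collapse repeated spaces.
    cells_5 = re.sub('"', "'", cells_4)
    cells_final = re.sub(' +', ' ', cells_5)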
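
The chain of substitutions above could probably also be written as one translation table followed by a single whitespace collapse. This is a sketch only, assuming `cells` is a plain `str`; `QUOTE_TABLE` and `normalize_quotes` are hypothetical names, and the `\x92` entry assumes the pattern above was meant to match that character.

```python
import re

# Map curly quotes, the stray cp1252 right quote (\x92, an assumption) and
# straight double quotes to a plain apostrophe, mirroring the substitutions above.
QUOTE_TABLE = str.maketrans({
    "\u201c": "'",
    "\u201d": "'",
    "\u2019": "'",
    "\x92": "'",
    '"': "'",
})

def normalize_quotes(cells: str) -> str:
    # Replace every mapped character in one pass, then collapse runs of spaces.
    return re.sub(" +", " ", cells.translate(QUOTE_TABLE))

print(normalize_quotes("\u201cHello\u201d   it\u2019s \"fine\""))
```

A single table keeps all the character mappings in one place, which makes it easier to see which quotes are being rewritten.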
Konrad Rudolph
shafeeka
  • Does this help? https://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console -- I see the same character sequence in a comment there – canton7 Mar 15 '21 at 09:45
  • @canton7 hi, the thread was informative but did not solve my problem. I solved it by reading the file in utf-8 and writing as utf-8-sig. Nonetheless, thanks for replying! :) – shafeeka Mar 15 '21 at 11:37
  • It's a flagrant [mojibake](https://en.wikipedia.org/wiki/Mojibake) case: `'“ ”'.encode( 'utf-8').decode( 'cp437')` returns `'ΓÇ£ ΓÇ¥'` and vice versa `'ΓÇ£ ΓÇ¥'.encode( 'cp437').decode( 'utf-8')` returns `'“ ”'`. Please expand your [mcve] (what happens using `codecs.open(path, 'r', encoding='utf_8_sig')`?) – JosefZ Mar 16 '21 at 11:31
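
The round trip described in the last comment, together with the fix reported by the asker (read as UTF-8, write as utf-8-sig), can be sketched roughly as follows. The file names and the sample row are made up for illustration, and a temporary directory stands in for the real paths; reading with `utf-8-sig` rather than plain `utf-8` is an assumption that follows the comment's suggestion, since it also tolerates a BOM.

```python
import csv
import os
import tempfile

# Decoding UTF-8 bytes as cp437 produces the mojibake from the question.
garbled = "\u201c \u201d".encode("utf-8").decode("cp437")
# Reversing the two steps recovers the original curly quotes.
restored = garbled.encode("cp437").decode("utf-8")

sample_row = ["id", "\u201cquoted\u201d it\u2019s"]  # hypothetical data

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "raw.csv")    # hypothetical input file
    dst = os.path.join(tmp, "clean.csv")  # hypothetical output file

    # Create a small UTF-8 input file containing typographic quotes.
    with open(src, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerow(sample_row)

    # Read with the encoding the file really has, write UTF-8 with a BOM.
    with open(src, "r", encoding="utf-8-sig", newline="") as fin, \
            open(dst, "w", encoding="utf-8-sig", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            writer.writerow(row)

    # Read the result back to confirm the quotes survived unchanged.
    with open(dst, "r", encoding="utf-8-sig", newline="") as f:
        rows = list(csv.reader(f))

print(restored == "\u201c \u201d", rows == [sample_row])
```

The key point is that the reader's encoding has to match how the file was actually written; choosing cp437 for a UTF-8 file is what turns each curly quote into three unrelated characters.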

0 Answers