I have CSV files of scraped data, and I wish to preprocess them by changing single quotation marks to double quotation marks. I have over 1,000 occurrences of characters like ΓÇÖ and ΓÇ£ that I am unable to encode properly.
My current approach is to open the CSV file with csv.reader using encoding "cp437" (I have also tried cp1251) and then write to a new CSV file with encoding utf-8-sig (I have tried utf-8 as well, but even more strange characters show up).
import csv
import codecs

# path = source CSV, path_2 = destination CSV
with codecs.open(path, 'r', encoding='cp437') as csvfile, \
     open(path_2, 'w', encoding='utf-8-sig', newline='') as outfile:
    csvreader = csv.reader(csvfile, delimiter=",")
    writer = csv.writer(outfile)
    for row in csvreader:
        writer.writerow(row)
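For context, the ΓÇÖ / ΓÇ£ sequences look like classic mojibake: the UTF-8 bytes of a curly quote decoded as cp437. A minimal sketch demonstrating the round trip (pure Python, no files involved):

```python
# '’' (U+2019) encoded as UTF-8 gives the bytes E2 80 99; decoding
# those bytes as cp437 produces exactly the three characters I see.
mojibake = '’'.encode('utf-8').decode('cp437')
print(mojibake)  # ΓÇÖ

# Reversing the mistake recovers the original character.
fixed = mojibake.encode('cp437').decode('utf-8')
print(fixed)  # ’
```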
I saw in other threads that Notepad++ can be used to identify the encoding of a file. Interestingly, this method showed that both of my raw CSV files are already UTF-8 encoded.
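The Notepad++ guess can also be double-checked in code: a strict UTF-8 decode of the raw bytes fails loudly if the file is not actually valid UTF-8. A sketch (`looks_like_utf8` is a hypothetical helper, not something from my pipeline):

```python
def looks_like_utf8(path: str) -> bool:
    """Return True if the file's raw bytes decode cleanly as UTF-8."""
    with open(path, 'rb') as f:
        raw = f.read()
    try:
        raw.decode('utf-8', errors='strict')
        return True
    except UnicodeDecodeError:
        return False
```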
I'm confused as to how to handle this problem as these characters are reflected in the frontend to users.
These are some of my earlier attempts, in case they are of any help:
import re

cells_1 = re.sub('\u201c', "'", cells)        # left double quote
cells_2 = re.sub('\u201d', "'", cells_1)      # right double quote
cells_3 = re.sub('\x92', '\u2019', cells_2)   # cp1252 byte for right single quote
cells_4 = re.sub('\u2019', "'", cells_3)      # right single quote
cells_5 = re.sub('"', "'", cells_4)           # straight double quote
cells_final = re.sub(' +', ' ', cells_5)      # collapse repeated spaces
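In case it clarifies the intent, that chain of substitutions boils down to mapping every quote variant to a straight single quote and then collapsing runs of spaces. A consolidated sketch (`clean_cell` is a hypothetical helper, not part of my actual pipeline):

```python
import re

# Every quote variant I have seen, mapped to a straight single quote.
QUOTE_MAP = {
    '\u201c': "'",  # left curly double quote
    '\u201d': "'",  # right curly double quote
    '\u2019': "'",  # right curly single quote
    '\x92':   "'",  # cp1252 byte value of the right curly single quote
    '"':      "'",  # straight double quote
}

def clean_cell(text: str) -> str:
    for bad, good in QUOTE_MAP.items():
        text = text.replace(bad, good)
    return re.sub(' +', ' ', text)  # collapse runs of spaces
```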