I have a task to convert one CSV file from UTF8 encoding to ANSI encoding and format it. I have learned that ANSI is really the encoding of the system in some sense, but this is not the issue at the moment.
Before converting, I have to be able to read my CSV file first. The file contains null values and the headers are different from the rest of the file. Whenever I try to read it, I always get error regarding the NULL values or headers. The headers are different in a sense, that they do not have any quotation at all, but the rest of the file has 3 quotation marks on each side of strings (and for some reason also around NUL values). The file columns are coma separated and each row ends with a new line.
When I try to read with QUOTE_NONNUMERIC
(to get rid of the null values), I get the error:
batch_data = list(reader)
ValueError: could not convert string to float: 'my_first_column_name'
When I try with QUOTE_ALL
(in hopes to quote the headers as well), I get the error:
batch_data = list(reader)
_csv.Error: line contains NULL byte
Here is my code:
import csv
file_in = r'<file_in>'
with open(file_in, mode='r') as infile:
reader = csv.reader(infile, quoting=csv.QUOTE_NONNUMERIC)
batch_data = list(reader)
for row in batch_data:
print(row, end="")
After reading some materials I understand that, I guess, I have to read headers separately from the rest of the file. How would one do it exactly? I was trying to skip them with reader.next()
, but then I get the error, that next
is not a method of reader
. At this point I can't really believe that it has already taken so much of my time reading through the documentations and trying different things.
SOLUTION
I ended up using list(next()) to skip over the header and then replace all the null values with empty string. Then opening the output file with the configuration I needed and writing my rows in it. I do still have to deal with different formatting since this solution puts double quotes around every cell and I need some to be numbers. This also replaced all single quotes from sells to double quotes, which still has to be fixed. file_in
and file_out
variables have assigned the input file and output file locations to them.
import csv
import json
file_in = r'<input_filepath>'
file_out = r'<output_filepath>'
with open(file_in, mode='r', encoding='utf-8-sig') as infile:
reader = csv.reader(infile, quoting=csv.QUOTE_ALL)
header_data = list(next(reader))
infile = infile.read().replace('\0', '').splitlines()
reader2 = csv.reader(infile)
with open(file_out,'w',newline='', encoding='cp1252') as outfile:
writer = csv.writer(outfile, delimiter=';', quoting=csv.QUOTE_NONNUMERIC)
writer.writerow(header_data)
for row in reader2:
row = str(row).replace('"','').replace("'",'"')
row = json.loads(row)
writer.writerow(row)