1

I have a task to convert one CSV file from UTF8 encoding to ANSI encoding and format it. I have learned that ANSI is really the encoding of the system in some sense, but this is not the issue at the moment.

Before converting, I have to be able to read my CSV file first. The file contains null values and the headers are different from the rest of the file. Whenever I try to read it, I always get error regarding the NULL values or headers. The headers are different in a sense, that they do not have any quotation at all, but the rest of the file has 3 quotation marks on each side of strings (and for some reason also around NUL values). The file columns are coma separated and each row ends with a new line.

When I try to read with QUOTE_NONNUMERIC (to get rid of the null values), I get the error:

batch_data = list(reader)
ValueError: could not convert string to float: 'my_first_column_name'

When I try with QUOTE_ALL (in hopes to quote the headers as well), I get the error:

batch_data = list(reader)
_csv.Error: line contains NULL byte

Here is my code:

import csv

file_in = r'<file_in>'

with open(file_in, mode='r') as infile:
    reader = csv.reader(infile, quoting=csv.QUOTE_NONNUMERIC)
    batch_data = list(reader)
    for row in batch_data:
        print(row, end="")

After reading some materials I understand that, I guess, I have to read headers separately from the rest of the file. How would one do it exactly? I was trying to skip them with reader.next(), but then I get the error, that next is not a method of reader. At this point I can't really believe that it has already taken so much of my time reading through the documentations and trying different things.

SOLUTION

I ended up using list(next()) to skip over the header and then replace all the null values with empty string. Then opening the output file with the configuration I needed and writing my rows in it. I do still have to deal with different formatting since this solution puts double quotes around every cell and I need some to be numbers. This also replaced all single quotes from sells to double quotes, which still has to be fixed. file_in and file_out variables have assigned the input file and output file locations to them.

import csv
import json

file_in = r'<input_filepath>'
file_out = r'<output_filepath>'


with open(file_in, mode='r', encoding='utf-8-sig') as infile:
    reader = csv.reader(infile, quoting=csv.QUOTE_ALL)
    header_data = list(next(reader))
    infile = infile.read().replace('\0', '').splitlines()
    reader2 = csv.reader(infile)
    with open(file_out,'w',newline='', encoding='cp1252') as outfile:
        writer = csv.writer(outfile, delimiter=';', quoting=csv.QUOTE_NONNUMERIC)
        writer.writerow(header_data)
        for row in reader2:
            row = str(row).replace('"','').replace("'",'"')
            row = json.loads(row)
            writer.writerow(row)
Oskars
  • 407
  • 4
  • 24
  • 1
    The files aren't really in CSV format based on your description of them, which is why the `csv.reader` has problems dealing with them. I doubt anyone will be able to help much without seeing a sample of the input files, so you should either add that to your question or make a sample file available via download from somewhere. That said, you can skip rows using `next(reader)`. – martineau Nov 12 '21 at 12:02
  • 1
    @martineau you, sir, have made my day! I can loop my data with next(reader) and process the headers with different quotation. – Oskars Nov 12 '21 at 13:12
  • 1
    That's good to hear. In Python 2 the `csv.reader` *did* have a `next()` method but that's not true in Python 3. However calling the built-in `next()` function on one works in both versions, so I quickly learned to always do it that way. – martineau Nov 12 '21 at 13:20
  • Oskars, seems like @martineau helped out. To "close" this question, please provide your own answer below with the code you changed to make this work for you. I'm especially curious to see how you handled the quoting and NULLs. – Zach Young Nov 12 '21 at 19:54
  • 1
    @ZachYoung I am still trying to handle it, but now thanks to martineau I have a direction :) I will post my answer once I have it figured out fully – Oskars Nov 15 '21 at 13:16

0 Answers0