2

In Python 3 I'm using the csv package

import csv

to read sparse comma-separated CSV files. Among the elements are some that are empty strings

...,"",...

and some that are null

...,,...

I'd like to distinguish between those. How can one do that?

martineau
  • 119,623
  • 25
  • 170
  • 301
djc
  • 21
  • 1
  • 3
  • 2
    Why? By CSV quoting rules, there is no difference between the two. – chepner Oct 02 '19 at 15:02
  • Possible duplicate of [csv reader behavior with None and empty string](https://stackoverflow.com/questions/11379300/csv-reader-behavior-with-none-and-empty-string) – gstukelj Oct 02 '19 at 15:32
  • Why? Because the process that produces the CSV is outside my control, and it differentiates between the empty string value and the null value. I'd like to use that difference. (And I do, but by pre- and post-processing outside Python.) To quote Wikipedia, "The CSV file format is not fully standardized." No kidding! – djc Oct 03 '19 at 16:36

3 Answers3

1

If you substitute a safe placeholder (one that won't match an existing string in the csv file) for a bare pair of double quotes ("") before parsing, you can then substitute back once parsed and identify strings that have a single occurrence of the placeholder:

import csv
lines = [
    '"row1_col1","row1_col2 with ""embedded"" quotes",3,"",,"row1_col6"',
    '"row2_col1","row2_col2",3,"",,"row2_col6"',
]
lines = [line.replace('""', '__PAIR_OF_DOUBLE_QUOTES__') for line in lines]
csv_reader = csv.reader(lines, delimiter=',')
rows = []
for row in csv_reader:
    for col in range(len(row)):
        # empty string is null/none
        if row[col] == '':
            row[col] = None
        # string with just a pair of double quotes is the empty string
        elif row[col] == '__PAIR_OF_DOUBLE_QUOTES__':
            row[col] = ''
        else:
            row[col] = row[col].replace('__PAIR_OF_DOUBLE_QUOTES__', '"')
    rows.append(row)

This results in the following output:

>>> print(json.dumps(rows))
[
    ["row1_col1", "row1_col2 with \"embedded\" quotes", "3", "", null, "row1_col6"],
    ["row2_col1", "row2_col2", "3", "", null, "row2_col6"]
]
ZachL
  • 171
  • 1
  • 6
  • Essentially rewrite the file before reading it as CSV? That's about what the current pre-processing does, but it has to be a bit smart, because the contents of some cells are wildly unpredictable. – djc Sep 30 '20 at 15:38
  • You're right - it is probably safest to use a uuid4 rather than a display-friendly placeholder. It would be nearly impossible to have collisions that way. – ZachL Oct 02 '20 at 17:42
0

This answer suggests it is not possible using the csv module. However, pandas read_csv() has a number of arguments that will allow you to distinguish between NULL and "".

Jan Morawiec
  • 425
  • 2
  • 11
0

I gather that this simply isn't possible with the csv module, so I'll continue to pre-process CSV files to insert a dummy value for "" (I sure love me some Emacs), then do my deeds with Python, detecting the dummy value and handling it as the empty string and empty values as null. Problem solved, by renting a place to live in Kludge City.

djc
  • 21
  • 1
  • 3