2

CSV file contains values such as "","ab,abc",,"abc". Note, I am referring to empty value ,, as in unknown value. This is different from "", where a value has not been set yet. I am treating these two values differently. I need a way to read "" and empty value ,, and distinguish between the two. I am mapping data to numbers such that "" is mapped to 0 and ,, is mapped to NaN. Note, I am not having a parsing issue and field such as "ab,abc" is being parsed just fine with comma as the delimiter. The issue is python reads "" and empty value,, as empty string such as ' '. And these two values are not same and should not be grouped into empty string.

Not only this, but I also need to write csv file such that "" is written as "" and not ,, and NaN should be written as ,, (empty value).

I have looked into csv Dialects such as doublequote, escapechar, quotechar, quoting. This is NOT what I want. These are all cases where delimiter appears within data ie "ab,abc" and as I mentioned, parsing with special characters is not an issue.

I don't want to use Pandas. The only thing I can think of is regex? But that's an overhead if I have millions of lines to process.

The behaviour I want is this:

a = "\"\"" (or it could be a="" or a="ab,abc")
if (a=="\"\""):
    map[0]=0
elif(a==""):
    map[0]=np.nan
else:
    map[0] = a

My csv reader is as follows:

import csv
f = open(filepath, 'r')
csvreader = csv.reader(f)
for row in csvreader:
        print(row)

I want above behaviour when reading csv files though. currently only two values are read: ' ' (empty string) or 'ab,abc'.

I want 3 different values to be read. ' ' empty string, '""' string with double quotes, and actual string 'ab,abc'

Naz
  • 189
  • 2
  • 10
  • what are you using to read the csv with? – Woody Pride Apr 19 '19 at 18:26
  • @WoodyPride import csv f = open(filepath, 'r') csvreader = csv.reader(f) – Naz Apr 19 '19 at 19:16
  • @Naz I'd suggest editing your question to include that, it's pretty unreadable in a comment... – Sam Mason Apr 19 '19 at 19:36
  • CSV files aren't terribly standardized, but RFC-4180 doesn't mention anything about a distinction between quoted and unquoted empty fields; is this just a distinction *you* want to make, or one that already exists for files you need to consume? – chepner Apr 19 '19 at 20:03
  • @chepner, No the file has such data. There exists a distinction between quoted empty field and unquoted empty field in my files. I just need python to map it to numbers. Also, to write csv files with exactly that distinction between quoted and unquoted empty field – Naz Apr 19 '19 at 20:31

3 Answers3

0

looking through the csv module in CPython source (search for IN_QUOTED_FIELD), it doesn't have any internal state that would let you do this. for example, parsing:

"a"b"c"d

is parsed as: 'ab"c"d', which might not be what you expect. e.g:

import csv
from io import StringIO

[row] = csv.reader(StringIO(
    '"a"b"c"d'))

print(row)

specifically, quotes are only handled specially at the beginning of fields, and all characters are just added to the field as they are encountered, rather than any allowing any special behaviour to be triggered when "un-quote"ing fields

Sam Mason
  • 15,216
  • 1
  • 41
  • 60
0

The solution I figured is this:

If I change the input file such that quoted strings have escapechar '\' , below is input file:

col1,col2,col3
"",a,b
\cde \,f,g
,h,i
\j,kl\,mno,p

Then double-quoted empty field and unquoted empty field are separable

csvreader = csv.reader(f, quotechar='\\')
    for row in csvreader:
        print(row)

That's my best solution so far...

Naz
  • 189
  • 2
  • 10
0

When you want to be able to distinguish between empty strings and undefined values when reading a csv file in Python, you can achieve it in the following way:

import csv
import uuid
import sys

csv_filepath = sys.argv[1]
delimiter = ','
quotechar = '"'
empty_string_placeholder = str(uuid.uuid4())

with open(csv_filepath) as f:
    # Convert empty strings to a placeholder value
    lines = (
        line.replace(f'{quotechar}{quotechar}', empty_string_placeholder)
        for line in f.readlines())
    reader = csv.reader(lines, delimiter=delimiter, quotechar=quotechar)
    for row in reader:
        for i, cell in enumerate(row):
            if cell == '':
                # If the cell is an empty string, it means that it was None originally
                row[i] = None
            elif cell == empty_string_placeholder:
                # If the cell is equal to the placeholder, it means that it was an empty string originally
                row[i] = ''
            else:
                # Handle doubles quotes inside a cell
                row[i] = cell.replace(empty_string_placeholder, quotechar)
        print(row)

First, it performs a preprocessing step where it converts empty strings to a placeholder value. Next, it lets the csv module parse the input. Finally, when reading the values from the csv file, it uses the placeholder value to distinguish between empty string and undefined values.

I ran this script with the following csv file:

Id,Name,Description
1,,"Test"
,,
3,"Abc","Def"
,"Ghi",
5,"String ""with"" quotes","Test"

And it printed the following:

['Id', 'Name', 'Description']
['1', None, 'Test']
[None, None, None]
['3', 'Abc', 'Def']
[None, 'Ghi', None]
['5', 'String "with" quotes', 'Test']
Daniel Jonsson
  • 3,261
  • 5
  • 45
  • 66