1

I have data like, below and when reading as CSV, I don't want to consider comma when its within the quotes even if the quotes are not immediate to the separator (like record #2). 1 and 3 records are good if we use separator, but failing on 2nd record. I tried escapeCharacter but not working. Input:

col1, col2, col3
a, b, c
a, b1 "b2, b3" b4, c
"a1, a2", b, c

Expected output for 2nd record is:

  1. a
  2. b1 "b2, b3" b4
  3. c

Actual output:

  1. a
  2. b1 "b2
  3. b3" b4
Pranav Hosangadi
  • 23,755
  • 7
  • 44
  • 70
  • 1
    Tried https://docs.python.org/3/library/csv.html ? – rasjani Jun 13 '22 at 15:15
  • Maybe this post abaut `skipinitialspace` can help: – Nil Pujol Porta Jun 13 '22 at 15:17
  • "I tried escapeCharacter but not working". So what exactly did you try? Read what all the various dialect attributes control [here](https://docs.python.org/3/library/csv.html#dialects-and-formatting-parameters). – chepner Jun 13 '22 at 15:18
  • 2
    The issue here is that the full field isn't escaped. Notice how the entire field containing the commas is escaped in your third record. Other than correcting the format of the input string in the first place (i.e. the second record should be `a, "b1 b2, b3 b4", c`) I know of no easy way to do this – Pranav Hosangadi Jun 13 '22 at 15:25

2 Answers2

0

Updated

There might be a better solution. But top of my head, I can only think of this approach.

If you see pattern, the splitted sub_string will always be next to each other. So, after splitting, we can combine consecutive sub_strings if there exists ".

sample_strings = [
    'col1, col2, col3',
    'a, b, c',
    'a, b1 "b2, b3, test, test1, test2, test3" b4, c',
    '"a1, a2", b, c',
]


for string in sample_strings:
    splitted = string.split(', ')
    result = []
    to_escape = -1
    for index, value in enumerate(splitted):
        if index <= to_escape:
            continue

        if '"' in value:
            value = value + ', '
            index = index + 1
            while '"' not in splitted[index]:
                value += splitted[index] + ', '
                index += 1
            value += splitted[index]
            result.append(value)
            to_escape = index 

        else:
            result.append(value)
    
    print(result)

Output:

['col1', 'col2', 'col3']
['a', 'b', 'c']
['a', 'b1 "b2, b3, test, test1, test2, test3" b4', 'c']
['"a1, a2"', 'b', 'c']
BhusalC_Bipin
  • 801
  • 3
  • 12
  • This doesn't work when there are multiple commas in the quotes (I know that's not what OP showed, but it's worth noting this shortcoming). You could try looking in _every_ element after finding an initial `"` until the matching closing quote is found – Pranav Hosangadi Jun 13 '22 at 15:49
  • Thank you for pointing that, I thought there will only be 2 items inside double quote as both the cases in his sample have only 2 items. I will update code. – BhusalC_Bipin Jun 13 '22 at 15:52
  • @PranavHosangadi I have updated the code, let me know, if you can find other issues. Thank you. – BhusalC_Bipin Jun 13 '22 at 16:14
-3

Any chance you could change the delimiter when creating the CSV files? I usually use semicolon instead of comma to avoid issues like that.

You can also tell python what's the delimiter in csv_reader = csv.reader(csv_file, delimiter=';')

xeparrelik
  • 11
  • 3