How to fix extra comma column offset CSV using Pandas?

Question

The CSV file looks like this:

Actual data format

id,account,company,total_emp,country,state
1,12345,ABC,100,US,CA
2,32345,CCD,inc,100,US,USA,NA   
3,42345,ABC,LLC,100,US,WA
4,52345,DDM,100,CA,US,OR  
5,62345,TSL,100,US,UT
....

Easy to view

id, account, company, total_emp, country,  state
1,   12345,   ABC,     100,        US,       CA
2,   32345,   CCD,inc  100,        US,USA,   
3,   42345,   ABC,LLC  100,        US,       WA
4,   52345,   DDM,     100,        CA, US,   OR  
5,   62345,   TSL,     100,        US,       UT
....

Things to note:

The column values for company do not have any quotation.
There are other columns after 'company'.
There are 2 columns with a comma issue.

Here, on id 2 and 3, we can see that the values are not in the quote but it has multiple commas. I am getting the following error: ValueError: 5 columns passed, passed data had 6 columns.

What did I do?

unique issue: I tried to offset the value using readlines() but since I have a column after the company column, it did not work.

Could someone please help me solve this problem?

Does this answer your question? [How do I read a fix width format text file in pandas](https://stackoverflow.com/questions/9721429/how-do-i-read-a-fix-width-format-text-file-in-pandas) — smci, Mar 11 '21 at 01:06
actually no, I added the data in fixed-width to make it easier to understand - fixed the issue to actual format — floss, Mar 11 '21 at 01:09
@floss: Could you check line 2 and 3? It looks like there is no comma before 100 (line 3) and it's missing from line 2 as compared to the previous version. — Shaido, Mar 11 '21 at 01:13
Ok you edited your initial MCVE to make clear it's actually not fixed-width format (fwf). So you have separator characters (commas) occurring inside a field (`company`). Can you generate the incoming CSV with escaping? Then you could just read it in with escaping. — smci, Mar 11 '21 at 01:15
Why is `NA` at the end of the line instead of before country? — Barmar, Mar 11 '21 at 01:16
@Barmar the reason `state` is `NA` because no value is provided. It is actually empty in CSV. — floss, Mar 11 '21 at 01:17
If you only have a finite well-known list of abbreviations (do you?), like `Company, Inc`, you could wrap the CSV input and do search-and-replace for those and replace with a different or else escaped character. — smci, Mar 11 '21 at 01:18
Line 2 doesn't have an integer in the total_emp field, it has `US` there. — Barmar, Mar 11 '21 at 01:20
But do you have a a finite well-known list of abbreviations? Grep for them and tell us how many you have? (just search for 6+ commas in a line. How many of those are there? — smci, Mar 11 '21 at 01:20
Who designed this data format? How do they expect anyone to process it when there's no clear way to tell where each field ends? — Barmar, Mar 11 '21 at 01:24
Maybe whoever created it has a library for parsing it, because it seems like there are different rules for each line. — Barmar, Mar 11 '21 at 01:25
@Barmar it would have been in the same column if it was in quotes but CSV export didnt put it like that :( — floss, Mar 11 '21 at 01:26
What kind of CSV export doesn't put quotes around fields that have embedded commas? Maybe you didn't specify the correct options, and need to do it again with proper settings. — Barmar, Mar 11 '21 at 01:28
tell them they screwed up and need to send it again. This file format is completely unusable. — Barmar, Mar 11 '21 at 01:31

score 1 · Answer 1 · answered Mar 11 '21 at 01:46

What you can do is to read the file line-by-line. For any correctly formatted lines, the logic is easy but for those with extra commas, the logic is more complex. The id, account and state columns can be found directly. To get the company value, we can try to find which column is total_emp as that one is numeric. In the below code, I assume that company contains at most 1 extra comma, however, the code can be extended to work for more.

After company is found, total_emp is easy and country will then be the rest of the values.

with open('a.csv') as f:
    for line in f.readlines()[1:]:  # ignore the header
        if line.count(',') == 5:
            id, account, company, total_emp, country, state = line.strip().split(',')
        else:
            fields = line.strip().split(',')
            id, account = fields[:2]
            state = fields[-1]

            if fields[3].isnumeric():
                company = fields[2]
                marker = 3
            else:
                company = ','.join(fields[2:4])
                marker = 4
            total_emp = fields[marker]
            country = ','.join(fields[-len(fields)+marker+1:-1])

Printing the result as follows:

print('id:', id, 'account:', account, 'company:', company, 'total_emp:', total_emp, 'country:', country, 'state:', state)

yields:

id: 1 account: 12345 company: ABC total_emp: 100 country: US state: CA
id: 2 account: 32345 company: CCD,inc total_emp: 100 country: US,USA state: NA
id: 3 account: 42345 company: ABC,LLC total_emp: 100 country: US state: WA
id: 4 account: 52345 company: DDM total_emp: 100 country: CA,US state: OR
id: 5 account: 62345 company: TSL total_emp: 100 country: US state: UT

score 1 · Answer 2 · answered Mar 11 '21 at 02:06

How about use regex first to format the data.

import re

f = open("file.csv", "r")
f.readline()
data = []
for line in f.readlines():
    data.append(re.match(r"(?P<id>\d+),(?P<account>\d+),(?P<company>.*),(?P<total_emp>\d+),(?P<country>.+),(?P<state>.+)$", line).groupdict())
print(data)

Then, you will get the data you need.

[   {   'account': '12345',
        'company': 'ABC',
        'country': 'US',
        'id': '1',
        'state': 'CA',
        'total_emp': '100'},
    {   'account': '32345',
        'company': 'CCD,inc',
        'country': 'US,USA',
        'id': '2',
        'state': 'NA',
        'total_emp': '100'},
    {   'account': '42345',
        'company': 'ABC,LLC',
        'country': 'US',
        'id': '3',
        'state': 'WA',
        'total_emp': '100'},
    {   'account': '52345',
        'company': 'DDM',
        'country': 'CA,US',
        'id': '4',
        'state': 'OR',
        'total_emp': '100'},
    {   'account': '62345',
        'company': 'TSL',
        'country': 'US',
        'id': '5',
        'state': 'UT',
        'total_emp': '100'}]

After that, you can use pandas:

In [4]: pandas.DataFrame(data).set_index("id")
Out[4]:
   account  company country state total_emp
id
1    12345      ABC      US    CA       100
2    32345  CCD,inc  US,USA    NA       100
3    42345  ABC,LLC      US    WA       100
4    52345      DDM   CA,US    OR       100
5    62345      TSL      US    UT       100

Barmar · Answer 3 · 2021-03-11T01:26:42.500

Split the line on commas, then use slices to get the fields relative to the beginning and end of the line. The company name is everything between them.

for line in file:
    fields = line.strip().split(',')
    id, account = fields[:2]
    total_emp, country, state = fields[-3:]
    company = ','.join(fields[3:-3])
    # rest of code

This would be much harder if you had multiple fields that could have embedded commas.

EDIT After I wrote this answer you changed the file so it has multiple countries separated by commas. Now I don't have any good suggestions.

How to fix extra comma column offset CSV using Pandas?

3 Answers3