1

The CSV file looks like this:

Actual data format

id,account,company,total_emp,country,state
1,12345,ABC,100,US,CA
2,32345,CCD,inc,100,US,USA,NA   
3,42345,ABC,LLC,100,US,WA
4,52345,DDM,100,CA,US,OR  
5,62345,TSL,100,US,UT
....

Easy to view

id, account, company, total_emp, country,  state
1,   12345,   ABC,     100,        US,       CA
2,   32345,   CCD,inc  100,        US,USA,   
3,   42345,   ABC,LLC  100,        US,       WA
4,   52345,   DDM,     100,        CA, US,   OR  
5,   62345,   TSL,     100,        US,       UT
....

Things to note:

  • The column values for company do not have any quotation.
  • There are other columns after 'company'.
  • There are 2 columns with a comma issue.

Here, on id 2 and 3, we can see that the values are not in the quote but it has multiple commas. I am getting the following error: ValueError: 5 columns passed, passed data had 6 columns.

What did I do?

  • unique issue: I tried to offset the value using readlines() but since I have a column after the company column, it did not work.

Could someone please help me solve this problem?

Shaido
  • 27,497
  • 23
  • 70
  • 73
floss
  • 2,603
  • 2
  • 20
  • 37
  • 1
    Does this answer your question? [How do I read a fix width format text file in pandas](https://stackoverflow.com/questions/9721429/how-do-i-read-a-fix-width-format-text-file-in-pandas) – smci Mar 11 '21 at 01:06
  • 1
    actually no, I added the data in fixed-width to make it easier to understand - fixed the issue to actual format – floss Mar 11 '21 at 01:09
  • 1
    @floss: Could you check line 2 and 3? It looks like there is no comma before 100 (line 3) and it's missing from line 2 as compared to the previous version. – Shaido Mar 11 '21 at 01:13
  • Ok you edited your initial MCVE to make clear it's actually not fixed-width format (fwf). So you have separator characters (commas) occurring inside a field (`company`). Can you generate the incoming CSV with escaping? Then you could just read it in with escaping. – smci Mar 11 '21 at 01:15
  • What is `NA` on line 2? Why is it missing total employees? – Barmar Mar 11 '21 at 01:15
  • @Shaido I just fixed that @Barmar the value is `NA` – floss Mar 11 '21 at 01:15
  • Why is `NA` at the end of the line instead of before country? – Barmar Mar 11 '21 at 01:16
  • @smci that is not possible. I have to do it on my own. – floss Mar 11 '21 at 01:16
  • @Barmar the reason `state` is `NA` because no value is provided. It is actually empty in CSV. – floss Mar 11 '21 at 01:17
  • If you only have a finite well-known list of abbreviations (do you?), like `Company, Inc`, you could wrap the CSV input and do search-and-replace for those and replace with a different or else escaped character. – smci Mar 11 '21 at 01:18
  • Where is `total_emp` on that line? – Barmar Mar 11 '21 at 01:18
  • @smci actual CSV is more than 150K – floss Mar 11 '21 at 01:19
  • @Barmar `total_emp` is int data type. `No of Employees` – floss Mar 11 '21 at 01:19
  • Line 2 doesn't have an integer in the total_emp field, it has `US` there. – Barmar Mar 11 '21 at 01:20
  • But do you have a a finite well-known list of abbreviations? Grep for them and tell us how many you have? (just search for 6+ commas in a line. How many of those are there? – smci Mar 11 '21 at 01:20
  • @Barmar I updated it and added data :) – floss Mar 11 '21 at 01:23
  • Who designed this data format? How do they expect anyone to process it when there's no clear way to tell where each field ends? – Barmar Mar 11 '21 at 01:24
  • Maybe whoever created it has a library for parsing it, because it seems like there are different rules for each line. – Barmar Mar 11 '21 at 01:25
  • @Barmar it would have been in the same column if it was in quotes but CSV export didnt put it like that :( – floss Mar 11 '21 at 01:26
  • What kind of CSV export doesn't put quotes around fields that have embedded commas? Maybe you didn't specify the correct options, and need to do it again with proper settings. – Barmar Mar 11 '21 at 01:28
  • @Barmar I didnt do it :) I was sent the CSV file lol – floss Mar 11 '21 at 01:29
  • 1
    tell them they screwed up and need to send it again. This file format is completely unusable. – Barmar Mar 11 '21 at 01:31

3 Answers3

1

What you can do is to read the file line-by-line. For any correctly formatted lines, the logic is easy but for those with extra commas, the logic is more complex. The id, account and state columns can be found directly. To get the company value, we can try to find which column is total_emp as that one is numeric. In the below code, I assume that company contains at most 1 extra comma, however, the code can be extended to work for more.

After company is found, total_emp is easy and country will then be the rest of the values.

with open('a.csv') as f:
    for line in f.readlines()[1:]:  # ignore the header
        if line.count(',') == 5:
            id, account, company, total_emp, country, state = line.strip().split(',')
        else:
            fields = line.strip().split(',')
            id, account = fields[:2]
            state = fields[-1]

            if fields[3].isnumeric():
                company = fields[2]
                marker = 3
            else:
                company = ','.join(fields[2:4])
                marker = 4
            total_emp = fields[marker]
            country = ','.join(fields[-len(fields)+marker+1:-1])

Printing the result as follows:

print('id:', id, 'account:', account, 'company:', company, 'total_emp:', total_emp, 'country:', country, 'state:', state)

yields:

id: 1 account: 12345 company: ABC total_emp: 100 country: US state: CA
id: 2 account: 32345 company: CCD,inc total_emp: 100 country: US,USA state: NA
id: 3 account: 42345 company: ABC,LLC total_emp: 100 country: US state: WA
id: 4 account: 52345 company: DDM total_emp: 100 country: CA,US state: OR
id: 5 account: 62345 company: TSL total_emp: 100 country: US state: UT
Shaido
  • 27,497
  • 23
  • 70
  • 73
1

How about use regex first to format the data.

import re

f = open("file.csv", "r")
f.readline()
data = []
for line in f.readlines():
    data.append(re.match(r"(?P<id>\d+),(?P<account>\d+),(?P<company>.*),(?P<total_emp>\d+),(?P<country>.+),(?P<state>.+)$", line).groupdict())
print(data)

Then, you will get the data you need.

[   {   'account': '12345',
        'company': 'ABC',
        'country': 'US',
        'id': '1',
        'state': 'CA',
        'total_emp': '100'},
    {   'account': '32345',
        'company': 'CCD,inc',
        'country': 'US,USA',
        'id': '2',
        'state': 'NA',
        'total_emp': '100'},
    {   'account': '42345',
        'company': 'ABC,LLC',
        'country': 'US',
        'id': '3',
        'state': 'WA',
        'total_emp': '100'},
    {   'account': '52345',
        'company': 'DDM',
        'country': 'CA,US',
        'id': '4',
        'state': 'OR',
        'total_emp': '100'},
    {   'account': '62345',
        'company': 'TSL',
        'country': 'US',
        'id': '5',
        'state': 'UT',
        'total_emp': '100'}]

After that, you can use pandas:

In [4]: pandas.DataFrame(data).set_index("id")
Out[4]:
   account  company country state total_emp
id
1    12345      ABC      US    CA       100
2    32345  CCD,inc  US,USA    NA       100
3    42345  ABC,LLC      US    WA       100
4    52345      DDM   CA,US    OR       100
5    62345      TSL      US    UT       100
ramwin
  • 5,803
  • 3
  • 27
  • 29
0

Split the line on commas, then use slices to get the fields relative to the beginning and end of the line. The company name is everything between them.

for line in file:
    fields = line.strip().split(',')
    id, account = fields[:2]
    total_emp, country, state = fields[-3:]
    company = ','.join(fields[3:-3])
    # rest of code

This would be much harder if you had multiple fields that could have embedded commas.

EDIT After I wrote this answer you changed the file so it has multiple countries separated by commas. Now I don't have any good suggestions.

Barmar
  • 741,623
  • 53
  • 500
  • 612