
I have a series of large CSV files "basename.csv" like:

B1,3,5,6
B2,2,1,5
B3,1,9,0
C1,4,7,9
C2,1,9,3
C3,8,5,2

I would like to split them into different files like:

basename_B.csv

B1,3,5,6
B2,2,1,5
B3,1,9,0

basename_C.csv

C1,4,7,9
C2,1,9,3
C3,8,5,2

I have already done similar things in the past with for loops and ifs, but I was wondering if there is a more efficient way of doing this with Pandas or whatever.

SOLUTION

Adapting the solution from @chthonicdaemon and @jezrael, I came up with this:

import glob
import os
import pandas as pd

def split_csv():
    for dfile in glob.glob('*.csv'):
        df = pd.read_csv(dfile, header=None)
        # group rows by the first character of the first column
        for letter, group in df.groupby(df[0].str[0]):
            group.to_csv(os.path.splitext(dfile)[0] + '_{}.csv'.format(letter),
                         index=False, header=False)

split_csv()
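A quick end-to-end check of this approach (a sketch; the scratch directory and the inline sample data are mine, not part of the original post):

```python
import glob
import os
import tempfile

import pandas as pd

# Work in a scratch directory so glob('*.csv') only sees the sample file.
os.chdir(tempfile.mkdtemp())
with open('basename.csv', 'w') as f:
    f.write('B1,3,5,6\nB2,2,1,5\nB3,1,9,0\n'
            'C1,4,7,9\nC2,1,9,3\nC3,8,5,2\n')

def split_csv():
    for dfile in glob.glob('*.csv'):
        df = pd.read_csv(dfile, header=None)
        # group rows by the first character of the first column
        for letter, group in df.groupby(df[0].str[0]):
            group.to_csv(os.path.splitext(dfile)[0] + '_{}.csv'.format(letter),
                         index=False, header=False)

split_csv()
```

One caveat: a second call would also pick up the freshly written basename_B.csv and basename_C.csv, since they match `*.csv` too.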
mirix
  • You can do `df.ix[:,df.columns.str.startswith('B')]` to filter the cols – EdChum Nov 24 '16 at 14:53
  • Yes, thank you, the thing is that I do not know which letters are present in each file. I can of course create a list of unique letters and loop over it, but I was wondering if there is a more elegant solution. In addition, your solution also grabs columns other than the first one. – mirix Nov 24 '16 at 14:57

3 Answers


Here's a simple application of groupby:

import pandas

df = pandas.read_csv('basename.csv', header=None)

def firstletter(index):
    # .ix is deprecated; .loc works here since header=None gives integer labels
    firstentry = df.loc[index, 0]
    return firstentry[0]

for letter, group in df.groupby(firstletter):
    group.to_csv('basename_{}.csv'.format(letter))

Or, incorporating @jezrael's use of grouping by the explicit contents of the columns:

for letter, group in df.groupby(df[0].str[0]):
    group.to_csv('basename_{}.csv'.format(letter))
chthonicdaemon
  • Thanks a million! Whereas both of the proposed solutions work, this is simpler and more elegant. – mirix Nov 24 '16 at 15:46
  • There is a problem, give me a sec. – jezrael Nov 24 '16 at 15:47
  • see `The rough rule is any time you see back-to-back square brackets, ][, you're in asking for trouble.` in http://tomaugspurger.github.io/modern-1.html – jezrael Nov 24 '16 at 15:48
  • And it is in tutorial - http://pandas.pydata.org/pandas-docs/stable/tutorials.html#modern-pandas – jezrael Nov 24 '16 at 15:48
  • @jezrael I don't think that applies here as the second indexing is actually getting the first element of the string. So the pandas part ends at `df.ix[x, 0]`. I'll reword to make it more readable, though. – chthonicdaemon Nov 24 '16 at 15:51
  • Yes, the lambda can be omitted – jezrael Nov 24 '16 at 15:51
  • now it is the same as mine, rather add the previous solution :( – jezrael Nov 24 '16 at 15:58
  • `df[0]` and `df.iloc[:,0]` are the same, I think. – jezrael Nov 24 '16 at 15:59
  • And I think your solution can be accepted, because you got the better idea first. I only simplified it. ;) +1 – jezrael Nov 24 '16 at 16:00
  • @jezrael I've refined my original idea, which uses a function over the rows - I think this may be faster for very large files, but I've added your idea of grouping over the actual contents as it's simpler. – chthonicdaemon Nov 24 '16 at 16:04

I think you can groupby by the first letter of the first column and use to_csv:

import pandas as pd
from io import StringIO

temp=u"""B1,3,5,6
B2,2,1,5
B3,1,9,0
C1,4,7,9
C2,1,9,3
C3,8,5,2"""
# after testing, replace StringIO(temp) with the filename
df = pd.read_csv(StringIO(temp), header=None)
print (df)
    0  1  2  3
0  B1  3  5  6
1  B2  2  1  5
2  B3  1  9  0
3  C1  4  7  9
4  C2  1  9  3
5  C3  8  5  2

A similar solution to the other answer:

for letter, g in df.groupby(df.iloc[:, 0].str[0]):
    #print (letter)
    #print (g)
    g.to_csv('basename_{}.csv'.format(letter))
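One detail worth noting: by default `to_csv` also writes the row index and a `0,1,2,3` header row, which the desired output files do not have. A small sketch (my own, using the same sample data) passing `index=False, header=False` to match the files shown in the question:

```python
import pandas as pd
from io import StringIO

temp = u"""B1,3,5,6
B2,2,1,5
B3,1,9,0
C1,4,7,9
C2,1,9,3
C3,8,5,2"""
df = pd.read_csv(StringIO(temp), header=None)

# to_csv with no path returns the CSV text; suppressing the index and
# header reproduces the row layout of the original file.
out = {letter: g.to_csv(index=False, header=False)
       for letter, g in df.groupby(df.iloc[:, 0].str[0])}
```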
jezrael

I tested this code and it should meet the requirements of your request. example.csv is the CSV input file.

with open('example.csv') as f:
    rows = f.readlines()

for row in rows:
    letter = row.split(',')[0][0]
    filename = "basename_" + letter.upper() + ".csv"
    # 'a' appends, so rows accumulate in the right per-letter file
    with open(filename, 'a') as f:
        f.write(row)
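For very large inputs it may be worth keeping each output file open instead of re-opening it in append mode for every row. This streaming variant is my own sketch, not part of the original answer (the scratch directory and inline sample data are assumptions for the demo):

```python
import os
import tempfile

# Scratch copy of the question's sample input.
os.chdir(tempfile.mkdtemp())
with open('example.csv', 'w') as f:
    f.write('B1,3,5,6\nB2,2,1,5\nB3,1,9,0\n'
            'C1,4,7,9\nC2,1,9,3\nC3,8,5,2\n')

handles = {}
try:
    with open('example.csv') as f:
        for row in f:
            letter = row.split(',', 1)[0][0].upper()
            # open each per-letter output once and reuse the handle
            if letter not in handles:
                handles[letter] = open('basename_{}.csv'.format(letter), 'w')
            handles[letter].write(row)
finally:
    for h in handles.values():
        h.close()
```

This reads the input line by line, so memory use stays constant regardless of file size.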
Jimmy
  • The nice thing about this solution is that it is purely Pythonic and does not require Pandas. Thanks! – mirix Nov 24 '16 at 15:55