Splitting an unordered csv file based on the values in the nth column/

Question

I have a large csv file containing information on sampled pathogens representing several different species. I want to split this csv file by species, so I will have one csv file per species. The data in the file aren't in any particular order. My csv file looks like this:

maa_2015-10-07_15-15-16_5425_manifest.csv,NULL,ERS044420,EQUI0208,1336,Streptococcus equi,15/10/2010,2010,Belgium,Belgium
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852528,2789STDY5834916,154046,Hungatella hathewayi,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852530,2789STDY5834918,33039,Ruminococcus torques,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852533,2789STDY5834921,40520,Blautia obeum,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852535,2789STDY5834923,1150298,Fusicatenibacter saccharivorans,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852537,2789STDY5834925,1407607,Fusicatenibacter,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852540,2789STDY5834928,39492,Eubacterium siraeum,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852544,2789STDY5834932,292800,Flavonifractor plautii,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852551,2789STDY5834939,169435,Anaerotruncus colihominis,2013,2013,United Kingdom,UK
maa_2015-10-07_15-15-16_5425_manifest.csv,NULL,ERS044418,EQUI0206,1336,Streptococcus equi,05/02/2010,2010,Belgium,Belgium
maa_2015-10-07_15-15-16_5425_manifest.csv,NULL,ERS044419,EQUI0207,1336,Streptococcus equi,29/07/2010,2010,Belgium,Belgium

The name of the species is at index 5.

I originally tried this:

import csv
from itertools import groupby

for key, rows in groupby(csv.reader(open("file.csv")),
                         lambda row: row[5]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")

But this fails because the data aren't ordered by species and there isn't an append arguement for the output (that I'm aware of) so each time the script encounters a new entry of a species that it has already written to a file it overwrites the first entries.

Is there a simple way to order the data by species and then execute the above script or a way to append the output of the above script to a file instead of overwriting it?

Also I'd ideally like each of the output files to be named after the species they contain.

Thanks.

You can do this easily using pandas, see http://stackoverflow.com/questions/37683085/splitting-csv-file-of-multiple-objects-over-time-by-time-point/37683227#37683227 and http://stackoverflow.com/questions/40789383/python-split-csv-file-according-to-first-character-of-the-first-column/40789645#40789645 — EdChum, Dec 21 '16 at 16:59
It should be working; `groupby()` already groups by the key, regardless of order. What error are you getting? — Haroldo_OK, Dec 21 '16 at 17:02

Adax · Answer 1 · 2016-12-21T17:07:37.073

2

In reference to your comment: "there isn't an append arguement for the output (that I'm aware of)", you can use 'a' instead of 'w' to append to the file like:

with open("%s.csv" % key, "a")

Probably is not the best approach because if you run the code two times you'll get everything double.

edited Dec 21 '16 at 17:07

answered Dec 21 '16 at 16:58

Adax

56
1
6

2

not bad, but 1) could you edit your post so it looks better and what happens if you run the code twice? – Jean-François Fabre Dec 21 '16 at 17:01
Furthermore, thanks to the `groupby()`, the data is already being grouped by key; consequently, each file will be opened exactly once. – Haroldo_OK Dec 21 '16 at 17:05
@Haroldo_OK not exactly: if you don't sort the data, it opens the file many times, that's the problem. – Jean-François Fabre Dec 21 '16 at 17:07
you have a point if the file is huge: `sort` needs more memory, your method doesn't. – Jean-François Fabre Dec 21 '16 at 17:09
Oh, I see what you mean. https://docs.python.org/2/library/itertools.html#itertools.groupby ; as often happens, one has to choose between RAM consumption and amount of disk access. – Haroldo_OK Dec 21 '16 at 17:10

Jean-François Fabre · Answer 2 · 2016-12-21T17:48:30.647

You could sort the csv files using the same lambda function as you're using for the groupby operation:

import csv
from itertools import groupby

groupfunc = lambda row: row[5]

for key, rows in groupby(sorted(csv.reader(open("file.csv")),key=groupfunc),groupfunc):
    with open("%s.csv" % key, "w") as output:
        cw = csv.writer(output)
        cw.writerows(rows)

note:

I rewrote the write routine to use csv module as output
I created a variable for your lambda so no copy-paste

Note that you have to cleanup your csv files if you change your input data, because if one species isn't in the new data, the old csv remains on the disk. I would to that with some code like:

import glob,os

for f in glob.glob("*.csv"):
   os.remove(f)

But beware of the *.csv pattern because it's too wide and it may be a little too effective on your other csv files :)

Note: This method uses sort and is therefore more memory hungry. You could choose to open each file in append mode instead as the other solution suggests to save memory, but perform more file I/O.

Splitting an unordered csv file based on the values in the nth column/

2 Answers2