2

I have a large csv file containing information on sampled pathogens representing several different species. I want to split this csv file by species, so I will have one csv file per species. The data in the file aren't in any particular order. My csv file looks like this:

maa_2015-10-07_15-15-16_5425_manifest.csv,NULL,ERS044420,EQUI0208,1336,Streptococcus equi,15/10/2010,2010,Belgium,Belgium
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852528,2789STDY5834916,154046,Hungatella hathewayi,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852530,2789STDY5834918,33039,Ruminococcus torques,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852533,2789STDY5834921,40520,Blautia obeum,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852535,2789STDY5834923,1150298,Fusicatenibacter saccharivorans,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852537,2789STDY5834925,1407607,Fusicatenibacter,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852540,2789STDY5834928,39492,Eubacterium siraeum,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852544,2789STDY5834932,292800,Flavonifractor plautii,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852551,2789STDY5834939,169435,Anaerotruncus colihominis,2013,2013,United Kingdom,UK
maa_2015-10-07_15-15-16_5425_manifest.csv,NULL,ERS044418,EQUI0206,1336,Streptococcus equi,05/02/2010,2010,Belgium,Belgium
maa_2015-10-07_15-15-16_5425_manifest.csv,NULL,ERS044419,EQUI0207,1336,Streptococcus equi,29/07/2010,2010,Belgium,Belgium

The name of the species is at index 5.

I originally tried this:

import csv
from itertools import groupby

for key, rows in groupby(csv.reader(open("file.csv")),
                         lambda row: row[5]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")

But this fails because the data aren't ordered by species and there isn't an append arguement for the output (that I'm aware of) so each time the script encounters a new entry of a species that it has already written to a file it overwrites the first entries.

Is there a simple way to order the data by species and then execute the above script or a way to append the output of the above script to a file instead of overwriting it?

Also I'd ideally like each of the output files to be named after the species they contain.

Thanks.

azalea
  • 11,402
  • 3
  • 35
  • 46
Ben Jeffrey
  • 714
  • 9
  • 18
  • You can do this easily using pandas, see http://stackoverflow.com/questions/37683085/splitting-csv-file-of-multiple-objects-over-time-by-time-point/37683227#37683227 and http://stackoverflow.com/questions/40789383/python-split-csv-file-according-to-first-character-of-the-first-column/40789645#40789645 – EdChum Dec 21 '16 at 16:59
  • It should be working; `groupby()` already groups by the key, regardless of order. What error are you getting? – Haroldo_OK Dec 21 '16 at 17:02

2 Answers2

2

In reference to your comment: "there isn't an append arguement for the output (that I'm aware of)", you can use 'a' instead of 'w' to append to the file like:

with open("%s.csv" % key, "a")

Probably is not the best approach because if you run the code two times you'll get everything double.

Adax
  • 56
  • 1
  • 6
2

You could sort the csv files using the same lambda function as you're using for the groupby operation:

import csv
from itertools import groupby

groupfunc = lambda row: row[5]

for key, rows in groupby(sorted(csv.reader(open("file.csv")),key=groupfunc),groupfunc):
    with open("%s.csv" % key, "w") as output:
        cw = csv.writer(output)
        cw.writerows(rows)

note:

  1. I rewrote the write routine to use csv module as output
  2. I created a variable for your lambda so no copy-paste

Note that you have to cleanup your csv files if you change your input data, because if one species isn't in the new data, the old csv remains on the disk. I would to that with some code like:

import glob,os

for f in glob.glob("*.csv"):
   os.remove(f)

But beware of the *.csv pattern because it's too wide and it may be a little too effective on your other csv files :)

Note: This method uses sort and is therefore more memory hungry. You could choose to open each file in append mode instead as the other solution suggests to save memory, but perform more file I/O.

Jean-François Fabre
  • 137,073
  • 23
  • 153
  • 219