I have a large CSV file containing information on sampled pathogens from several different species. I want to split it by species, so that I end up with one CSV file per species. The rows aren't in any particular order. The file looks like this:
maa_2015-10-07_15-15-16_5425_manifest.csv,NULL,ERS044420,EQUI0208,1336,Streptococcus equi,15/10/2010,2010,Belgium,Belgium
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852528,2789STDY5834916,154046,Hungatella hathewayi,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852530,2789STDY5834918,33039,Ruminococcus torques,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852533,2789STDY5834921,40520,Blautia obeum,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852535,2789STDY5834923,1150298,Fusicatenibacter saccharivorans,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852537,2789STDY5834925,1407607,Fusicatenibacter,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852540,2789STDY5834928,39492,Eubacterium siraeum,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852544,2789STDY5834932,292800,Flavonifractor plautii,2013,2013,United Kingdom,UK
maa_2015-09-28_13-07-45_0098_manifest.csv,NULL,ERS852551,2789STDY5834939,169435,Anaerotruncus colihominis,2013,2013,United Kingdom,UK
maa_2015-10-07_15-15-16_5425_manifest.csv,NULL,ERS044418,EQUI0206,1336,Streptococcus equi,05/02/2010,2010,Belgium,Belgium
maa_2015-10-07_15-15-16_5425_manifest.csv,NULL,ERS044419,EQUI0207,1336,Streptococcus equi,29/07/2010,2010,Belgium,Belgium
The name of the species is in the sixth column (index 5, counting from zero).
I originally tried this:
import csv
from itertools import groupby

for key, rows in groupby(csv.reader(open("file.csv")),
                         lambda row: row[5]):
    with open("%s.csv" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")
But this fails because the data aren't ordered by species and, as far as I'm aware, there isn't an append argument for the output. So each time the script encounters a species that it has already written to a file, it overwrites the earlier entries.
Is there a simple way to order the data by species and then execute the above script or a way to append the output of the above script to a file instead of overwriting it?
Also I'd ideally like each of the output files to be named after the species they contain.
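To make the "order the data first" idea concrete, here is a rough sketch of what it might look like. It assumes the whole file fits in memory and that species names are safe to use as file names; `split_by_species`, `SPECIES_COLUMN` and `output_dir` are names made up for illustration, not part of any existing script.

```python
import csv
import os
from itertools import groupby
from operator import itemgetter

SPECIES_COLUMN = 5  # hypothetical constant: zero-based index of the species name


def split_by_species(input_path, output_dir="."):
    """Write one CSV per species and return the paths that were written.

    Sketch only: reads every row into memory before sorting.
    """
    with open(input_path, newline="") as source:
        # Skip blank or short rows that have no species column.
        rows = [row for row in csv.reader(source) if len(row) > SPECIES_COLUMN]

    # groupby only merges adjacent rows, so sort by species first.
    rows.sort(key=itemgetter(SPECIES_COLUMN))

    written = []
    for species, group in groupby(rows, key=itemgetter(SPECIES_COLUMN)):
        out_path = os.path.join(output_dir, "%s.csv" % species)
        with open(out_path, "w", newline="") as output:
            csv.writer(output).writerows(group)
        written.append(out_path)
    return written
```

Calling `split_by_species("file.csv")` on the sample above should produce files such as `Streptococcus equi.csv` in the current directory. If the file were too large to sort in memory, opening each output with mode `"a"` instead of `"w"` would be the other route, at the cost of having to clear old output files before each run.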
Thanks.