
First time poster and fairly new to Python here. I have a collection of 1,700+ CSV files with two columns each. The number and labels of the rows are the same in every file. The files are named with a specific format, for example:

  • Species_1_OrderA_1.csv
  • Species_1_OrderA_2.csv
  • Species_1_OrderA_3.csv
  • Species_10_OrderB_1.csv
  • Species_10_OrderB_2.csv

Each imported dataframe is formatted like so:

               TreeID  Species_1_OrderA_2
0       Bu2_1201_1992                       0
1       Bu3_1201_1998                       0
2       Bu4_1201_2000                       0
3       Bu5_1201_2002                       0
4       Bu6_1201_2004                       0
..                ...                     ...
307  Fi141_16101_2004                       0
308  Fi142_16101_2006                       0
309  Fi143_16101_2008                       0
310  Fi144_16101_2010                       0
311  Fi147_16101_2015                       0

I would like to join the files that correspond to the same species, based on the first column, so that in the end I get the files Species_1_OrderA.csv and Species_10_OrderB.csv. Please note that not all species necessarily have the same number of files.

This is what I have tried so far.

import os
import glob
import pandas as pd

# Importing csv files from directory
path = '.'
extension = 'csv'
os.chdir(path)
files = glob.glob('*.{}'.format(extension))

# Create a dictionary to loop through each file to read its contents and create a dataframe
file_dict = {}

for file in files:
    key = file
    df = pd.read_csv(file)

    file_dict[key] = df

# Extract the name of each dataframe, convert to a list and extract the relevant
# information (before the 3rd underscore). Compare each of these values to the next and
# if they are the same, append them to a list. This list (in my head, at least) will help
# me merge them using pandas.concat

keys_list = list(file_dict.keys())
group = ''

for line in keys_list:
    type = "_".join(line.split("_")[:3])
    for i in range(len(type) - 1):
        if type[i] == type[i+1]:
            group.append(line[keys_list])
            print(group)

However, the last bit is not even working, and at this point, I am not sure this is the best way to deal with my problem. Any pointers on how to solve this will be really appreciated.

EDIT: This is the expected output for the files per species. Ideally, I would remove the rows that have zeros in them, but that can easily be done with awk (or with the small pandas snippet after the table below).

TreeID,Species_1_OrderA_0,Species_1_OrderA_1,Species_1_OrderA_2
Bu2_1201_1992,0,0,0
Bu3_1201_1998,0,0,0
Bu4_1201_2000,0,0,0
Bu5_1201_2002,0,0,0
Bu6_1201_2004,0,0,0
Bu7_1201_2006,0,0,0
Bu8_1201_2008,0,0,0
Bu9_1201_2010,0,0,0
Bu10_1201_2012,0,0,0
Bu11_1201_2014,0,0,0
Bu14_1201_2016,0,0,0
Bu16_1201_2018,0,0,0
Bu18_3103_1989,0,0,0
Bu22_3103_1999,0,0,0
Bu23_3103_2001,0,0,0
Bu24_3103_2003,0,0,0
...
Fi141_16101_2004,0,0,10
Fi142_16101_2006,0,4,0
Fi143_16101_2008,0,0,0
Fi144_16101_2010,2,0,0
Fi147_16101_2015,0,7,0
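
(For reference, something along these lines in pandas would also handle the zero-row filtering; the file names here are just placeholders.)

import pandas as pd

# Drop rows where every count column is zero (placeholder file names)
df = pd.read_csv("Species_1_OrderA.csv")
counts = df.drop(columns="TreeID")        # all columns except the row labels
df = df[(counts != 0).any(axis=1)]        # keep rows with at least one non-zero value
df.to_csv("Species_1_OrderA_nonzero.csv", index=False)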
ensifer
2 Answers


If your goal is to concatenate all the CSVs for each species-order into one consolidated CSV, this is one approach. I haven't tested it, so there might be a few errors. The idea is to first use glob, as you're doing, to build a dict of file paths so that all the paths for the same species-order are grouped together, and then, for each species-order, read all the data into a single table in memory and write it out to a consolidated file.

import pandas as pd
import glob

# Create a dictionary keyed by species_order, valued by a list of files
# i.e. file_paths_by_species_order['Species_10_OrderB'] = ['Species_10_OrderB_1.csv', 'Species_10_OrderB_2.csv']

file_paths_by_species_order = {}

for file_path in glob.glob('*.csv'):
    # everything before the 3rd underscore identifies the species-order
    species_order = "_".join(file_path.split("_")[:3])
    if species_order not in file_paths_by_species_order:
        file_paths_by_species_order[species_order] = [file_path]
    else:
        file_paths_by_species_order[species_order].append(file_path)

# For each species_order, concat all files and save the result to a new csv
for species_order, file_paths in file_paths_by_species_order.items():
    df = pd.concat(pd.read_csv(file_path) for file_path in file_paths)
    df.to_csv('consolidated_{}.csv'.format(species_order))

There are definitely improvements that can be made, such as using collections.defaultdict, or writing one file at a time out to the consolidated file instead of reading them all into memory; a rough sketch of that is below.
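
For illustration, here is a rough, untested sketch of those two ideas combined: collections.defaultdict drops the key-existence check, and appending to each consolidated file one table at a time avoids holding everything in memory (the consolidated_*.csv names are just placeholders).

import glob
from collections import defaultdict

import pandas as pd

# Group file paths by everything before the 3rd underscore, e.g. 'Species_10_OrderB'
file_paths_by_species_order = defaultdict(list)
for file_path in glob.glob('*.csv'):
    species_order = "_".join(file_path.split("_")[:3])
    file_paths_by_species_order[species_order].append(file_path)

# Stream each group into its consolidated csv one file at a time
for species_order, file_paths in file_paths_by_species_order.items():
    out_path = 'consolidated_{}.csv'.format(species_order)
    for i, file_path in enumerate(file_paths):
        df = pd.read_csv(file_path)
        # write the header only for the first file, then append without it
        df.to_csv(out_path, mode='w' if i == 0 else 'a', header=(i == 0), index=False)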

mitoRibo

Try it like this:

import os
import pandas as pd

path = "C:/Users/username"
files = [file for file in os.listdir(path) if file.endswith(".csv")]

dfs = dict()
for file in files:
    #everything before the final _ is the species name
    species = file.rsplit("_", maxsplit=1)[0]

    #read the csv to a dataframe
    df = pd.read_csv(os.path.join(path, file))
    
    #if you don't have a df for a species, create a new key
    if species not in dfs:
        dfs[species] = df
    #else, merge current df to existing df on the TreeID
    else:
        dfs[species] = pd.merge(dfs[species], df, on="TreeID", how="outer")

#write all dfs to their own csv files
for key in dfs:
    dfs[key].to_csv(f"{key}.csv")
not_speshal
  • This works perfectly. Thank you. I have a follow-up question. This creates an additional column with the index. I would like to remove this. pandas.concat has the index argument that can do that, but I was not able to find the equivalent for pandas.merge. – ensifer Aug 24 '21 at 15:45
  • 1
    Do you mean you don't want the index in the csv? If so, you can change the last line to `dfs[key].to_csv(f"{key}.csv", index=False)` – not_speshal Aug 24 '21 at 15:51