I am facing the following challenges
I have approximately 400 files which i have to consolidate into one master file but there is one problem that the files have different headers and when I try to consolidate it put the data into different rows on the basis of column
Example:- lets say i have two files C1 and C2 file C1.csv
name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
and file C2.csv
name,last-name,phone-no,add-line1,add-line2,add-line3
jorge,aggarwal,65465464654,line1,line2,line3
brad,smit,456446546454,line1,line2,line3
joy,kennedy,65654644646,line1,line2,line3
so I have these two files and from these files I want that when I consolidate these files the output will be like this:-
name,phone-no,address
zach,6564654654,line1
daniel,456464564,line2
Jorge aggarwal,65465464654,line1-line2-line3
brad smith,456446546454,line1-line2-line3
joy kennedy,65654644646,line1-line2-line3
for Consolidation I am using the following code
import glob
import pandas as pd
directory = 'C:/Test' # specify the directory containing the 300 files
filelist = sorted (glob.glob(directory + '/*.csv')) # reads all 300 files in the directory and stores as a list
consolidated = pd.DataFrame() # Create a new empty dataframe for consolidation
for file in filelist: # Iterate through each of the 300 files
df1 = pd.read_csv(file) # create df using the file
df1col = list (df1.columns) # save columns to a list
df2 = consolidated # set the consolidated as your df2
df2col = list (df2.columns) # save columns from consolidated result as list
commoncol = [i for i in df1col for j in df2col if i==j] # Check both lists for common column name
# print (commoncol)
if commoncol == []: # In first iteration, consolidated file is empty, which will return in a blank df
consolidated = pd.concat([df1, df2], axis=1).fillna(value=0) # concatenate (outer join) with no common columns replacing null values with 0
else:
consolidated = df1.merge(df2,how='outer', on=commoncol).fillna(value=0) # merge both df specifying the common column and replace null values with 0
# print (consolidated) << Optionally, check the consolidated df at each iteration
# writing consolidated df to another CSV
consolidated.to_csv('C:/<filepath>/consolidated.csv', header=True, index=False)
but it can't merge the columns having same data like the output shown earlier.