So I have these CSV files that I want to combine as follows:

file1.csv
Date,Time,Unique1,Common
blah,blah,55,92

file2.csv
Date,Time,Unique2,Common
blah,blah,12,25

I want a pandas DataFrame where...

Date,Time,Unique1,Unique2,Common (order of columns doesn't matter)
blah,blah,55,12,117

...where the 117 is 92 + 25.

I found a post with the exact same title as this one that has the following code sample:

each_df = (pd.read_csv(f) for f in all_files)
full_df = pd.concat(each_df).groupby(level=0).sum()

This does what I need, except that it doesn't carry forward the Date and Time columns. I suppose that's because sum() doesn't know what to do with them.

I instead get...

Unique1,Unique2,Common
<values as expected>

Please help me pass through the Date and Time columns. They're supposed to be exactly the same in each file, so I'm OK with indexing the data by the 'Date' and 'Time' columns.

Thanks in advance.

oompaloompa

2 Answers


I think you are looking for `merge` instead of `concat`. If each CSV is read into its own DataFrame, you can do the following:

new_df = df2.merge(df1, on=['Date', 'Time'], how='inner')
# the overlapping 'Common' column comes through suffixed as Common_x and Common_y
new_df['Common'] = new_df['Common_x'] + new_df['Common_y']
new_df[['Date', 'Time', 'Unique1', 'Unique2', 'Common']]
#output

   Date  Time  Unique1  Unique2  Common
0  blah  blah       55       12     117
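
Since the question says the Date and Time values are identical in every file, an inner join and an outer join would return the same rows here; `how='inner'` just means any Date/Time pair missing from one of the files would be dropped rather than filled with NaN.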

You can also try this one-liner:

one_line = (df2.merge(df1, on=['Date', 'Time'], how='inner')
              .set_index(['Date', 'Time', 'Unique1', 'Unique2'])
              .sum(axis=1)
              .reset_index()
              .rename(columns={0: 'Common'}))

#output

   Date  Time  Unique1  Unique2  Common
0  blah  blah       55       12     117
MattR
  • Thanks Matt, I'd need to loop through 10 CSVs each time to produce one combined CSV. I realize it's just an extension of this concept to do manually, but what would be the best way to pick up all the CSVs, convert them to DataFrames, and run this process on them, please? – oompaloompa Dec 14 '17 at 18:26
  • Unfortunately, that is an addition to the question you posted; the title was about combining two data frames. Please edit your question - but it may be too broad at that point. If you know how to loop, you can just take the `new_df` or `one_line` approach and repeat the merge over a loop if need be (see the sketch below). – MattR Dec 14 '17 at 18:34
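
For illustration, here is a minimal sketch of that loop, assuming the file list is built with `glob` (the pattern and the accumulator names are hypothetical, not from either answer):

import glob
import pandas as pd

# hypothetical: collect the CSV paths; adjust the pattern as needed
all_files = sorted(glob.glob('*.csv'))

merged = None
total_common = None
for f in all_files:
    df = pd.read_csv(f)
    # accumulate 'Common' as a (Date, Time)-indexed Series so the merge
    # suffixes (_x, _y) never collide across ten files
    common = df.set_index(['Date', 'Time'])['Common']
    total_common = common if total_common is None else total_common + common
    # merge only the columns unique to this file
    unique_part = df.drop(columns='Common')
    merged = unique_part if merged is None else merged.merge(
        unique_part, on=['Date', 'Time'], how='inner')

# reattach the summed 'Common' via the shared (Date, Time) index
merged = merged.set_index(['Date', 'Time'])
merged['Common'] = total_common
merged = merged.reset_index()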

For more than two data frames, this might be a better option:

import pandas as pd
from functools import reduce

# We will be splitting the data into two groups.
# Two separate generators are needed because a generator can only be consumed once.
# all_files is assumed to be an iterable of the CSV paths (e.g. from glob.glob).
all_files1 = (pd.read_csv(f) for f in all_files)
all_files2 = (pd.read_csv(f) for f in all_files)

# Merge the data frames together dropping the 'Common' column and set an index
# Default is an inner join.
# df1 and df2 are just the lambda's parameter names; reduce feeds it successive pairs
split_drop_common = reduce(lambda df1, df2: df1.merge(df2, on=['Date', 'Time']),
                           [df.drop(columns='Common') for df in all_files1]).set_index(['Date', 'Time'])
# set up the second group
stage = pd.concat(all_files2)

# Drop any of the unique columns and sum the 'Common' column
keep_columns = ['Date','Time','Common']
split_only_common = stage[keep_columns].groupby(['Date','Time']).sum()


# Join on indices. Default is an inner join.
# You can specify the join type with kwarg how='join type'
# bring Date and Time back out of the index so they show up as columns
combine = split_drop_common.join(split_only_common).reset_index()
combine

# Output

   Date  Time  Unique1  Unique2  Common
0  blah  blah       55       12     117

You can read about how the reduce function works here.
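
As a toy illustration of that folding behaviour (plain numbers, not the question's data), `reduce` applies its function pairwise, exactly the way the lambda above merges the frames two at a time:

from functools import reduce

# reduce folds left to right: ((1 + 2) + 3) + 4
# in the answer above, '+' is replaced by a pairwise merge on Date/Time
print(reduce(lambda a, b: a + b, [1, 2, 3, 4]))  # 10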

  • Thank you.. a few questions if you don't mind: in split_drop_common, where are df1 and df2 getting initiated? Also, the non-common columns don't all start with 'Unique'; they're literally unique names... – oompaloompa Dec 14 '17 at 19:03
  • Edited the code to allow for unique names and gave link to SO answer about how python reduce function works. – Tad Tenacious Dec 14 '17 at 19:11