1

I have a dictionary dict of dataframes such as:

{   
‘table_1’:              name             color             type
                        Banana           Yellow            Fruit,
‘another_table_1’:      city             state             country
                        Atlanta          Georgia           United States,
‘and_another_table_1’:  firstname        middlename        lastname
                        John             Patrick           Snow,
‘table_2’:              name             color             type
                        Red              Apple             Fruit,
‘another_table_2’:      city             state             country
                        Arlington        Virginia          United States,
‘and_another_table_2’:  firstname        middlename        lastname
                        Alex             Justin            Brown,
‘table_3’:              name             color             type
                        Lettuce          Green             Vegetable,
‘another_table_3’:      city             state             country
                        Dallas           Texas             United States,
‘and_another_table_3’:  firstname        middlename        lastname
                        Michael          Alex              Smith             }

I would like to merge these dataframes together based on their names so that in the end I will have only 3 dataframes:

table

name        color       type
Banana     Yellow     Fruit
Red         Apple     Fruit
Lettuce     Green     Vegetable

another_table

city        state          country
Atlanta     Georgia        United States
Arlington   Virginia       United States
Dallas      Texas          United States

and_another_table

firstname        middlename        lastname
John             Patrick           Snow
Alex             Justin            Brown
Michael          Alex              Smith

Based on my initial research it seems like this should be possible with Python:

  1. By using .split, dictionary comprehension and itertools.groupby to group together dataframes inside the dictionary based on key names
  2. Creating dictionary of dictionaries with these grouped results
  3. Using pandas.concat function to loop through these dictionaries and group dataframes together

I don't have a lot of experience with Python and I am a bit lost on how to actually code this.

I have reviewed How to group similar items in a list? and Merge dataframes in a dictionary posts but they were not as helpful because in my case name length of dataframes varies.

Also I do not want to hardcode any dataframe names, because there are more than a 1000 of them.

amongo
  • 61
  • 1
  • 9
  • Do the dict keys have a naming convention that makes it obvious which final table they will belong to, like in the example, where you can use everything before the final `'_'`? – ALollz Sep 03 '20 at 17:59
  • 1
    Dict keys have common names except the last digit "_1" or "_2". So "table_1", "table_2" and "table_3" will have common name "table". – amongo Sep 03 '20 at 18:02

1 Answers1

1

Here is one way:

Give this dictionary of dataframes:

dd = {'table_1': pd.DataFrame({'Name':['Banana'], 'color':['Yellow'], 'type':'Fruit'}),
      'table_2': pd.DataFrame({'Name':['Apple'], 'color':['Red'], 'type':'Fruit'}),
      'another_table_1':pd.DataFrame({'city':['Atlanta'],'state':['Georgia'], 'Country':['United States']}),
      'another_table_2':pd.DataFrame({'city':['Arlinton'],'state':['Virginia'], 'Country':['United States']}),
      'and_another_table_1':pd.DataFrame({'firstname':['John'], 'middlename':['Patrick'], 'lastnme':['Snow']}),
      'and_another_table_2':pd.DataFrame({'firstname':['Alex'], 'middlename':['Justin'], 'lastnme':['Brown']}),
     }

tables = set([i.rsplit('_', 1)[0] for i in dd.keys()])
dict_of_dfs = {i:pd.concat([dd[x] for x in dd.keys() if x.startswith(i)]) for i in tables}

Outputs a new dictionary of combined tables:

dict_of_dfs['table']

#      Name   color   type
# 0  Banana  Yellow  Fruit
# 0   Apple     Red  Fruit

dict_of_dfs['another_table']

#        city     state        Country
# 0   Atlanta   Georgia  United States
# 0  Arlinton  Virginia  United States

dict_of_dfs['and_another_table']

#   firstname middlename lastnme
# 0      John    Patrick    Snow
# 0      Alex     Justin   Brown

Another way using defaultdict from collections, create a list of combined dataframes:

from collections import defaultdict
import pandas as pd

dd = {'table_1': pd.DataFrame({'Name':['Banana'], 'color':['Yellow'], 'type':'Fruit'}),
      'table_2': pd.DataFrame({'Name':['Apple'], 'color':['Red'], 'type':'Fruit'}),
      'another_table_1':pd.DataFrame({'city':['Atlanta'],'state':['Georgia'], 'Country':['United States']}),
      'another_table_2':pd.DataFrame({'city':['Arlinton'],'state':['Virginia'], 'Country':['United States']}),
      'and_another_table_1':pd.DataFrame({'firstname':['John'], 'middlename':['Patrick'], 'lastnme':['Snow']}),
      'and_another_table_2':pd.DataFrame({'firstname':['Alex'], 'middlename':['Justin'], 'lastnme':['Brown']}),
     }
tables = set([i.rsplit('_', 1)[0] for i in dd.keys()])

d = defaultdict(list)

[d[i].append(dd[k]) for i in tables for k in dd.keys() if k.startswith(i)]
l_of_dfs = [pd.concat(d[i]) for i in d.keys()]
print(l_of_dfs[0])
print('\n')
print(l_of_dfs[1])
print('\n')
print(l_of_dfs[2])

Output:

       city     state        Country
0   Atlanta   Georgia  United States
0  Arlinton  Virginia  United States


  firstname middlename lastnme
0      John    Patrick    Snow
0      Alex     Justin   Brown


     Name   color   type
0  Banana  Yellow  Fruit
0   Apple     Red  Fruit
Scott Boston
  • 147,308
  • 15
  • 139
  • 187