2

I have 3 dataframes which can be generated from the code shown below

df1= pd.DataFrame({'person_id':[1,2,3],'gender': ['Male','Female','Not disclosed'],'ethn': ['Chinese','Indian','European']})
df2= pd.DataFrame({'pers_id':[4,5,6],'gen': ['Male','Female','Not disclosed'],'ethnicity': ['Chinese','Indian','European']})
df3= pd.DataFrame({'son_id':[7,8,9],'sex': ['Male','Female','Not disclosed'],'ethnici': ['Chinese','Indian','European']})

I would like to do two things

a) Append all these 3 dataframes into one large result dataframe

When I attempted this using the below code, the output isn't as expected

df1.append(df2)

enter image description here

So, to resolve this, I understand we have to rename the column names which leads to objective b below

b) Rename the column of these n dataframes to be uniform in a elegant way

Please note that in real time I might have dataframe with different column names which I may not know in advance but the values in them will always be the same belonging to columns Ethnicity, Gender and Person_id. But note there can be several other columns as well like Age, Date,bp reading etc

Currently, I do this by manually reading the column names using below code

df2.columns
df2.rename(columns={ethnicity:'ethn',gender = 'gen',person_id='pers_id}, 
             inplace=True)

How can I set the column names for all dataframe to be the same (gender,ethnicity,person_id and etc) irrespective of their original column values

The Great
  • 7,215
  • 7
  • 40
  • 128
  • 1
    Do you want to merge 3 dataframes with different headers to a single dataframe merged under same headers? Is your data consistent that ```person_id``` always appears on the first, ```gender``` appears on the second column and so on ? – Berkay Tekin Öz Jul 29 '19 at 10:15
  • No, they are of random order – The Great Jul 29 '19 at 10:16
  • @BerkayÖz - Might be, I can try to fix that. Let's consider it's of the same order – The Great Jul 29 '19 at 10:18
  • 1
    If they are of the same order than you can apply the answer by [Alessandro Flati](https://stackoverflow.com/a/57251676/9250756). If not you might want to try fuzzy string matching. If you would like a help with that I can post an answer. – Berkay Tekin Öz Jul 29 '19 at 10:22
  • Would you be interested to share fizzy string approach? If columns are of different order. – The Great Jul 29 '19 at 10:36
  • Can you help me with this post? https://stackoverflow.com/questions/57266132/create-common-columns-and-transform-time-series-like-data – The Great Jul 30 '19 at 07:23

3 Answers3

2

As per pandas documentation, you can do this creating a mapping:

df2.rename(columns={column1:'ethn', column2:'gen', column3:'pers_id'}, inplace=True)

Now, you clearly stated that you have to do this runtime. If you know that number of columns and their respective positions won't change, you can collect the actual column names with df2.columns(), that should output something like that:

['ethnicity', 'gender', 'person_id']

At this point, you can create the mapping as:

final_columns = ['ethn', 'gen', 'pers_id']
previous_columns = df2.columns()
mapping = {previous_columns[i]: final_columns[i] for i in range(3)}  # 3 is arbitrary.

And then just call

df2.rename(mapping, inplace=True)
Alessandro Flati
  • 322
  • 5
  • 12
  • Will copy work here? I mean let's say, I have created an empty dataframe (df_final) with column names of my interest (Gender, Ethnicity, Person_id). My data (df_1) contains different col names but of the same order. So will I be able to copy the data alone to new datafarme by excluding the column names? – The Great Jul 29 '19 at 10:20
  • Sorry, previous comment was wrong so I'm going to delete it. The strategy in your case is the same, simply append your newly renamed dataframes (as per the answer) to the final one (that must share the same columns names, e.g. `final_columns`) – Alessandro Flati Jul 29 '19 at 10:32
1

If you don't know the order of your columns you could try the fuzzy matching approach. Fuzzy matching will provide you a similarity/likeliness value from 0 - 100. So you can determine a threshold of similarity and then replace the columns that is similar to your desired column names. Here is my approach:

import pandas as pd
from fuzzywuzzy import process


df1= pd.DataFrame({'person_id':[1,2,3],'gender': ['Male','Female','Not disclosed'],'ethn': ['Chinese','Indian','European']})
df2= pd.DataFrame({'pers_id':[4,5,6],'gen': ['Male','Female','Not disclosed'],'ethnicity': ['Chinese','Indian','European']})
df3= pd.DataFrame({'son_id':[7,8,9],'sex': ['Male','Female','Not disclosed'],'ethnici': ['Chinese','Indian','European']})

dataFrames = [df1, df2, df3]

for dataFrame in dataFrames:
  for i, column in enumerate(list(dataFrame.columns)):
    if dataFrame.columns[i] == "sex":
      dataFrame.rename(columns={ dataFrame.columns[i]: "gender" }, inplace = True)

colsToFix = ["person_id", "gender", "ethnicity"]
replaceThreshold = 75


ratiosPerDf = list()

for i, dataFrame in enumerate(dataFrames):
  ratioDict = dict()
  for column in colsToFix:
    ratios = process.extract(column, list(dataFrame.columns))
    ratioDict[column] = ratios
  ratiosPerDf.append(ratioDict)

for i, dfRatio in enumerate(ratiosPerDf):
  for column in colsToFix:
    bestMatching = ("", 0)
    for item in dfRatio[column]:
        if item[1] >= replaceThreshold and item[1] > bestMatching[1]:
          bestMatching = item
    if not bestMatching[1] < replaceThreshold:
      print("Column : {} Best matching : {}".format(column, bestMatching[0]))
      dataFrames[i].rename(columns={ bestMatching[0] : column  }, inplace = True)

0

As mentioned on https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html , you can pass multiple column names together which can point to same final column name you want. So best approach will be to collect all column names and then map them to common names you need based on some algorithm or manually and then run rename command.

That algorithm can use both similarity in the names (use TF-IDF) or similarity in values for those columns.

Shirish Goyal
  • 510
  • 2
  • 9