Enhance Python performance pandas dataframe

Question

I have two pandas dataframes looking like this

ID	email	name
1	"firstname.lastname@provider.com"	"firstname lastname"
...	...	...
5150	"firstname.lastname@provider.com"	"firstname lastname"

with roughly 5150 rows. The data is not cleaned in the sense that it may happen that the name column contains names with typos, additional spaces, written in camelcase. Also it may be that the name is an empty string.

The second dataframe contains information like this

Id	Email	Name	To	To_Name
1	"firstname.lastname@provider.com"	"firstname lastname"	"firstname.lastname@provider.com"	"firstname lastname"
...	...	...
8500	"firstname.lastname@provider.com"	"firstname lastname"	"firstname.lastname@provider.com"	"firstname lastname"

with roughly 8500 rows. Here, the name columns have the same issue as the first dataframe.

I now want to create a new dataframe out of the previous two dataframes in the sense of relational databases, i.e. in the form

ID	From	To
1	1	2
2	4	8

where the ID column refers to the ID column of the second dataframe and the values in the From and To column refer to the first dataframe where we map names onto integers.

The code below runs but takes roughly one minute. Do you guys have ideas how I could speed it up?

Id_new = []
From_new = []
To_new = []

for i in range(0,len(second_df['Id'])):

    Id_new.append(second_df['Id'].iloc[i])
    email = second_df['Email'].iloc[i]
    name = second_df['Name'].iloc[i]
    testdf = first_Df.where(first_Df['Email'] == email).dropna()
    value = int(testdf.loc[testdf['Name'] == name].iloc[0].at["ID"])
    From_new.append(value)

    emailto = second_df['To'].iloc[i]
    nameto = second_df['To_Name'].iloc[i]
    testdf = first_Df.where(first_Df['Email'] == emailto).dropna()
    valueto = int(testdf.loc[testdf['Name'] == nameto].iloc[0].at["ID"])
    To_new.append(valueto)
        
    return output_df = pd.DataFrame(list(zip(Id_new, From_new, To_new)),
                             columns = ['ID', 'From', 'To'])

score 0 · Answer 1 · answered Nov 19 '21 at 13:36

My approach would be to iterate the first list extracting the name and id into a dict. Then, iterate the second list also extracting the name and id and enriching the previous dict creating the resulting table.

result = {}
for index, row in first_df.iterrows():
   result[row["Name"]] = {"From": row["Id"]}

for index, row in second_df.iterrows():
   if row["Name"] in result:
      result[row["Name"]]["To"] = row["Id"]
   else:
      result[row["Name"]] = {"To": row["Id"]}

This way, you only iterate once on each dataframe.

score 0 · Accepted Answer · answered Nov 19 '21 at 13:37

You should try to avoid running for loops when dealing with pandas DataFrames, most of the times there is a better way of doing it. In this case you probably want to use merge Merge, join, concatenate and compare

You can merge first on email and name and then on to and to_name, something like this:

df1 = pd.DataFrame(
{"ID": ["1", "2", "3"], "email": ["a", "b", "c"], "name": ["x", "y", "z"]}
)

df2 = pd.DataFrame(
{
    "Id": ["1", "2", "3", "4"],
    "email": ["a", "b", "c", "d"],
    "name": ["x", "y", "z", "k"],
    "to": ["m", "a", "b", "p"],
    "to_name": ["r", "x", "y", "u"],
}
)

new_df = (
df2.merge(df1[["ID", "email", "name"]], on=["email", "name"], how="left")
.rename(columns={"ID": "From"})
.merge(df1, right_on=["email", "name"], left_on=["to", "to_name"], how="left")
.rename(columns={"ID": "To"})[["Id", "From", "To"]]
)

Thanks! I was hoping for some SQL-like approach – Marcus Willems Nov 19 '21 at 14:16 — Marcus Willems, Nov 19 '21 at 14:16

score 0 · Answer 3 · answered Nov 19 '21 at 14:15

You can simply use replace for this:

import pandas as pd

df1 = pd.DataFrame([ { "ID": 1, "email": "firstname1.lastname@provider.com", "name": "firstname lastname" }, { "ID": 2, "email": "firstname2.lastname@provider.com", "name": "firstname lastname" } ])
df2 = pd.DataFrame([ { "Id": 1, "Email": "firstname2.lastname@provider.com", "Name": "firstname lastname", "To": "firstname1.lastname@provider.com", "To_Name": "firstname lastname" }, { "Id": 2, "Email": "firstname1.lastname@provider.com", "Name": "firstname lastname", "To": "firstname2.lastname@provider.com", "To_Name": "firstname lastname" } ])

df2[['Email', 'To']] = df2[['Email', 'To']].replace(df1.set_index('email')['ID'])
final_df = df2[['Id', 'Email', 'To']]

Output:

	Id	Email	To
0	1	2	1
1	2	1	2

Enhance Python performance pandas dataframe

3 Answers3