What is the fastest (and most efficient) way to create a new column in a DataFrame that is a function of other rows in pandas?
Consider the following example:
import pandas as pd

d = {
    'id': [1, 2, 3, 4, 5, 6],
    'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant']
}
pandas_df = pd.DataFrame(d)
Which yields:
   id      word
0   1       cat
1   2       hat
2   3       hag
3   4       hog
4   5       dog
5   6  elephant
Suppose I want to create a new column bar containing a value that is based on the output of a function foo that compares the word in the current row to the word in every other row of the DataFrame.
def foo(word1, word2):
    # do some calculation
    return foobar  # in this example, the return type is numeric

threshold = some_threshold

for index, _id, word in pandas_df.itertuples():
    # count how many other words score below the threshold when compared to this word
    value = sum(
        pandas_df[pandas_df['word'] != word].apply(
            lambda x: foo(x['word'], word),
            axis=1
        ) < threshold
    )
    pandas_df.loc[index, 'bar'] = value
This does produce the correct output, but it relies on itertuples() and apply(), which are not performant for large DataFrames.
Is there a way to vectorize (is that the correct term?) this approach? Or is there another better (faster) way to do this?
Notes / Updates:
- In the original post, I used edit distance / Levenshtein distance as the foo function. I have changed the question in an attempt to be more generic. The idea is that the function to be applied compares the current row's value against all other rows and returns some aggregate value.
If foo were nltk.metrics.distance.edit_distance and threshold were set to 2 (as in the original post), this would produce the output below:
   id      word  bar
0   1       cat  1.0
1   2       hat  2.0
2   3       hag  2.0
3   4       hog  2.0
4   5       dog  1.0
5   6  elephant  0.0
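For reference, the loop above with that concrete foo plugged in (a minimal sketch, assuming nltk is installed) reproduces that table:

from nltk.metrics.distance import edit_distance

threshold = 2

for index, _id, word in pandas_df.itertuples():
    # count how many *other* words are within edit distance 2 of this word
    value = sum(
        pandas_df[pandas_df['word'] != word].apply(
            lambda x: edit_distance(x['word'], word),
            axis=1
        ) < threshold
    )
    pandas_df.loc[index, 'bar'] = value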
I have the same question for spark dataframes as well. I thought it made sense to split these into two posts so they are not too broad. However, I have generally found that solutions to similar pandas problems can sometimes be modified to work for spark.

Inspired by this answer to my spark version of this question, I tried to use a cartesian product in pandas. My speed tests indicate that this is slightly faster (though I suspect that may vary with the size of the data). Unfortunately, I still can't get around calling apply().
Example code:
import numpy as np
from nltk.metrics.distance import edit_distance as edit_dist

pandas_df2 = pd.DataFrame(d)

# indices of every (row, row) pair -> cartesian product of the frame with itself
i, j = np.where(np.ones((len(pandas_df2), len(pandas_df2))))
cart = pandas_df2.iloc[i].reset_index(drop=True).join(
    pandas_df2.iloc[j].reset_index(drop=True), rsuffix='_r'
)

# still need apply() to compute the pairwise distances
cart['dist'] = cart.apply(lambda x: edit_dist(x['word'], x['word_r']), axis=1)

# count neighbours within distance 2, minus 1 for the self-pair
pandas_df2 = (
    cart[cart['dist'] < 2].groupby(['id', 'word']).count()['dist'] - 1
).reset_index()
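For what it's worth, the index-pair trick can also be written as a cross join (a sketch; if I recall correctly, how='cross' requires pandas 1.2 or newer), though it still relies on apply() for the pairwise distance:

import pandas as pd
from nltk.metrics.distance import edit_distance as edit_dist

df = pd.DataFrame(d)  # fresh copy of the example data

# cartesian product via a cross merge; '_r' marks the right-hand copy's columns
cart = df.merge(df, how='cross', suffixes=('', '_r'))

# still stuck with apply() here
cart['dist'] = cart.apply(lambda x: edit_dist(x['word'], x['word_r']), axis=1)

# same aggregation as before: neighbours within distance 2, minus the self-pair
result = (
    cart[cart['dist'] < 2].groupby(['id', 'word']).count()['dist'] - 1
).reset_index()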