
What is the fastest (and most efficient) way to create a new column in a DataFrame that is a function of other rows in pandas?

Consider the following example:

import pandas as pd

d = {
    'id': [1, 2, 3, 4, 5, 6],
    'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant']
}
pandas_df = pd.DataFrame(d)

Which yields:

   id word
0   1  cat
1   2  hat
2   3  hag
3   4  hog
4   5  dog
5   6  elephant

Suppose I want to create a new column bar containing a value that is based on using a function foo to compare the word in the current row to the words in the other rows of the DataFrame.

def foo(word1, word2):
    # do some calculation
    return foobar  # in this example, the return type is numeric

threshold = some_threshold

for index, _id, word in pandas_df.itertuples():
    value = sum(
        pandas_df[pandas_df['word'] != word].apply(
            lambda x: foo(x['word'], word),
            axis=1
        ) < threshold
    )
    pandas_df.loc[index, 'bar'] = value

This does produce the correct output, but it uses itertuples() and apply(), which are not performant for large DataFrames.

Is there a way to vectorize (is that the correct term?) this approach? Or is there another better (faster) way to do this?

Notes / Updates:

  1. In the original post, I used edit distance (Levenshtein distance) as the foo function. I have changed the question in an attempt to be more generic. The idea is that the function being applied compares the current row's value against all other rows and returns some aggregate value.

If foo was nltk.metrics.distance.edit_distance and the threshold was set to 2 (as in the original post), this produces the output below:

   id word        bar
0   1  cat        1.0
1   2  hat        2.0
2   3  hag        2.0
3   4  hog        2.0
4   5  dog        1.0
5   6  elephant   0.0
  2. I have the same question for spark dataframes as well. I thought it made sense to split these into two posts so they are not too broad. However, I have generally found that solutions to similar pandas problems can sometimes be modified to work for spark.

  3. Inspired by this answer to my spark version of this question, I tried to use a cartesian product in pandas. My speed tests indicate that this is slightly faster (though I suspect that may vary with the size of the data). Unfortunately, I still can't get around calling apply().

Example code:

import numpy as np
from nltk.metrics.distance import edit_distance as edit_dist

pandas_df2 = pd.DataFrame(d)

i, j = np.where(np.ones((len(pandas_df2), len(pandas_df2))))
cart = pandas_df2.iloc[i].reset_index(drop=True).join(
    pandas_df2.iloc[j].reset_index(drop=True), rsuffix='_r'
)

cart['dist'] = cart.apply(lambda x: edit_dist(x['word'], x['word_r']), axis=1)
pandas_df2 = (
    cart[cart['dist'] < 2].groupby(['id', 'word']).count()['dist'] - 1
).reset_index()
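
An equivalent way to build the same cross join is with a dummy merge key (just a sketch; I haven't benchmarked this against the np.where version):

import pandas as pd

# Sketch: same n*n cartesian product as above, built with a dummy
# merge key instead of np.where/np.ones. Columns end up as
# id, word, id_r, word_r, matching the iloc-based version.
cross = (
    pd.DataFrame(d).assign(key=1)
      .merge(pd.DataFrame(d).assign(key=1), on='key', suffixes=('', '_r'))
      .drop('key', axis=1)
)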
pault
  • Most NLTK functions do not have a vectorised equivalent in pandas. The most you can do is _hide_ the loop. – cs95 Jan 09 '18 at 18:22
  • Unless it was made clear what the "edit_distance" function does, thereby implementing a vectorised equivalent using numpy's defchararray library. – cs95 Jan 09 '18 at 18:23
  • Ahh, so the function being applied needs to accept a vector input? Could I write my own wrapper? I just used edit distance as an example, but it could be any function. The key is that it's a function that compares the current value to the other rows. – pault Jan 09 '18 at 18:26
  • http://nullege.com/codes/search/nltk.metrics.distance.edit_distance: essentially a levenstein dist : `number of characters that need to be substituted, inserted, or deleted, to transform s1 into s2.` – Patrick Artner Jan 09 '18 at 18:26
  • Unfortunately, no. A "vectorised" function is one which supports parallelisation at the assembly level. With numpy, the basic arithmetic +, -, *, / and % are all vectorised, as are the logical and bitwise operators. If you want, you can re-implement your own version of levenshtein distance and compile it in cython. Or, you could reimplement the function with numpy's vectorised operations. – cs95 Jan 09 '18 at 18:32
  • Thanks for the explanation @cᴏʟᴅsᴘᴇᴇᴅ. Other than vectorization, is there a better approach to do something like this? If not, do you want to post your response as an answer? – pault Jan 09 '18 at 18:45
  • This is the extent of my expertise, hopefully someone can provide you with a better answer for your question, so I'd hold off posting an answer for now ;-) – cs95 Jan 09 '18 at 19:05
  • Depending on the size of your data, you may consider using the `dask` framework – skibee Jan 12 '18 at 06:32
  • @JosephBerry can you provide some more info, perhaps an example? – pault Jan 12 '18 at 15:59
  • @pault here is the link to the [dask documentation](https://dask.pydata.org/en/latest/). The syntax is very similar to `pandas` but you get to run it in parallel – skibee Jan 15 '18 at 15:51

2 Answers


Let's try to analyze the problem for a second:

If you have N rows, then you have N*N "pairs" to consider in your similarity function. In the general case, there is no escape from evaluating all of them (sounds very rational, but I can't prove it). Hence, you have at least O(n^2) time complexity.

What you can try, however, is to play with the constant factors of that time complexity. The possible options I found are:


1. Parallelization:

Since you have a large DataFrame, parallelizing the processing is the most obvious choice. That will give you an (almost) linear speedup, so if you have 16 workers you will gain an (almost) 16x improvement.

For example, we can partition the rows of the df into disjoint parts, and process each part individually, then combine the results. A very basic parallel code might look like this:

import numpy as np
from multiprocessing import cpu_count, Pool

def work(part):
    """
    Args:
        part (DataFrame) : a part (collection of rows) of the whole DataFrame.

    Returns:
        DataFrame: the same part, with the desired property calculated and added as a new column
    """
    # Note that we are using the original df (pandas_df) as a global variable.
    # But changes made in this function will not be global (a side effect of using multiprocessing).
    for index, _id, word in part.itertuples(): # iterate over the "part" tuples
        value = sum(
            pandas_df[pandas_df['word'] != word].apply( # Calculate the desired function using the whole original df
                lambda x: foo(x['word'], word),
                axis=1
            ) < threshold
        )
        part.loc[index, 'bar'] = value
    return part

# New code starts here ...

cores = cpu_count() # Number of CPU cores on your system

data_split = np.array_split(pandas_df, cores) # Split the DataFrame into parts
pool = Pool(cores) # Create a pool of worker processes
new_parts = pool.map(work, data_split) # apply the function `work` to each part, this will give you a list of the new parts
pool.close() # close the pool
pool.join()
new_df = pd.concat(new_parts) # Concatenate the new parts

Note: I've tried to keep the code as close to OP's code as possible. This is just a basic demonstration code and a lot of better alternatives exist.


2. "Low level" optimizations:

Another solution is to try to optimize the similarity function computation and iterating/mapping. I don't think this will gain you much speedup compared to the previous option or the next one.
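
For example (just a sketch, assuming pandas_df, foo and threshold are defined as in the question), simply moving the comparison out of itertuples()/apply() and into plain Python lists removes some of the per-row overhead while still doing the same O(n^2) work:

# A minimal sketch, assuming `pandas_df`, `foo` and `threshold` from the question.
# Same O(n^2) pairs, but plain lists avoid the per-row Series construction
# that DataFrame.apply performs.
words = pandas_df['word'].tolist()

pandas_df['bar'] = [
    sum(foo(other, word) < threshold for other in words if other != word)
    for word in words
]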


3. Function-dependent pruning:

The last thing you can try are similarity-function-dependent improvements. This doesn't work in the general case, but will work very well if you can analyze the similarity function. For example:

Assuming you are using Levenshtein distance (LD), you can observe that the distance between any two strings is >= the difference between their lengths. i.e. LD(s1,s2) >= abs(len(s1)-len(s2)) .

You can use this observation to prune the possible similar pairs to consider for evaluation. So, for each string of length l1, compare it only with strings whose length l2 satisfies abs(l1-l2) <= limit (limit is the maximum accepted dis-similarity, 2 in your provided example).

Another observation is that LD(s1,s2) = LD(s2,s1). That cuts the number of pairs by a factor of 2.

This solution may actually get you down to O(n) time complexity (it depends highly on the data).
Why, you may ask?
Because if we had 10^9 rows, but on average only 10^3 rows with a "close" length to each row, then we would need to evaluate the function for about 10^9 * 10^3 / 2 pairs, instead of 10^9 * 10^9 pairs. But that (again) depends on the data. This approach will be useless if (in this example) all of your strings have length 3.
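
A sketch of what that pruning plus the symmetry trick could look like, assuming the question's pandas_df, the nltk edit_distance from the original post, a threshold of 2, and distinct words (as in the example):

from nltk.metrics.distance import edit_distance

# Sketch: length-based pruning + symmetry, assuming the question's
# `pandas_df`, distinct words, and the original threshold of 2.
limit = 2
words = pandas_df['word'].tolist()
lengths = [len(w) for w in words]

counts = [0] * len(words)
for i in range(len(words)):
    for j in range(i + 1, len(words)):              # symmetry: each pair evaluated once
        if abs(lengths[i] - lengths[j]) >= limit:   # pruned: LD >= |length difference|
            continue
        if edit_distance(words[i], words[j]) < limit:
            counts[i] += 1
            counts[j] += 1

pandas_df['bar'] = counts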

Qusai Alothman
  • Thanks for the explanation. Function dependent pruning makes sense, but I was hoping for a solution that is independent of the operation. I've updated my question to be more generic. – pault Jan 12 '18 at 16:04
  • @pault what about parallel processing? – Qusai Alothman Jan 12 '18 at 16:42
  • parallel processing sounds great but I don't have any experience setting that up. Is it easy to provide an example of how to do so? – pault Jan 12 '18 at 17:08
  • @pault sure! I'll update my answer to include some -very- basic parallel code so you can see how much improvement it will result in. – Qusai Alothman Jan 12 '18 at 17:12
  • @pault I added a multiprocessing code example. It'll work, but it's far from perfect. – Qusai Alothman Jan 12 '18 at 18:23
  • I ran some tests on my laptop (8 cores) and your multiprocessing approach was about 3x faster for a `dataframe` with 100 words. Not quite linear, but still the best solution so far. – pault Jan 16 '18 at 19:28
  • @pault nice! But I don't think that a 100 words only are enough to benchmark the suggested methods. Could you try something with >= 10k words? – Qusai Alothman Jan 16 '18 at 19:45
  • For 1000 words, parallelization gave me a ~30x speed-up (4 sec vs. 2 min). It seems like the time for the multiprocessing method was growing linearly with N, while the traditional method was increasing as N^2. Not sure _why_ that's happening, and I don't really have the time/energy to dig into this further. In any case, multiprocessing wins (even if it makes my laptop sound like an airplane)! – pault Jan 17 '18 at 16:19

Thoughts about preprocessing (groupby)

Because you are looking for an edit distance of less than 2, you can first group by the length of the strings. If the difference in length between two groups is greater than or equal to 2, you do not need to compare them. (This part is quite similar to section 3 of Qusai Alothman's answer.)

Thus, the first thing to do is to group by the length of the string.

df["length"] = df.word.str.len() 
df.groupby("length")["id", "word"]

Then, you only need to compute the edit distance between words in the same group or in consecutive groups whose lengths differ by less than 2. This does not directly relate to your question, but I hope it is helpful.
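
A rough sketch of that grouping step (assuming the question's d dict and a threshold of 2), where only pairs from the same or neighbouring length groups are ever passed to the edit-distance function:

from itertools import combinations

import pandas as pd
from nltk.metrics.distance import edit_distance as edit_dist

# Rough sketch, assuming the question's `d` dict and a threshold of 2.
df = pd.DataFrame(d)
df["length"] = df.word.str.len()
groups = {length: grp["word"].tolist() for length, grp in df.groupby("length")}

pairs_to_check = []
for length in groups:                            # pairs within the same length group
    pairs_to_check += list(combinations(groups[length], 2))
for l1, l2 in combinations(sorted(groups), 2):   # pairs from groups whose lengths differ by 1
    if l2 - l1 < 2:
        pairs_to_check += [(w1, w2) for w1 in groups[l1] for w2 in groups[l2]]

# only the surviving pairs need the (expensive) edit distance call
close_pairs = [(w1, w2) for w1, w2 in pairs_to_check if edit_dist(w1, w2) < 2]

From close_pairs you could then count matches per word, as in the cartesian-product version in the question.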

Potential vectorization (after groupby)

After that, you may also try to vectorize the computation by splitting each string into characters. Note that if the cost of splitting is greater than the vectorized benefit it brings, you should not do this. Or, when you are creating the DataFrame, just create one with characters rather than words.

We will use the answer in Pandas split dataframe column for every character to split a string into a list of characters.

# assuming we have already grouped the df.
df_len_3 = pd.DataFrame({"word": ['cat', 'hat', 'hag', 'hog', 'dog']})
# turn it into chars
splitted = df_len_3.word.apply(lambda x: pd.Series(list(x)))

    0   1   2
0   c   a   t
1   h   a   t
2   h   a   g
3   h   o   g
4   d   o   g

splitted.loc[0] == splitted # compare one word to all words

    0       1       2
0   True    True    True  -> comparing to itself is always all true.
1   False   True    True
2   False   True    False
3   False   False   False
4   False   False   False


splitted.apply(lambda x: (x == splitted).sum(axis=1).ge(len(x)-1), axis=1).sum(axis=1) - 1

0    1
1    2
2    2
3    2
4    1
dtype: int64

Explanation of splitted.apply(lambda x: (x == splitted).sum(axis=1).ge(len(x)-1), axis=1).sum(axis=1) - 1

For each row, lambda x: (x == splitted) compares each row to the whole df just like splitted.loc[0] == splitted above. It will generate a true/false table.

Then, we sum up the table horizontally with a .sum(axis=1) following (x == splitted).

Then, we want to find out which words are similar. Thus, we apply a ge function that checks whether the number of True values is at least a threshold. Here, we only allow a difference of 1, so the threshold is set to len(x)-1.

Finally, we subtract 1 from the whole array because the operation compares each word with itself, and we want to exclude that self-comparison.

Note that this vectorization part only works for within-group similarity checking. You still need to check words from groups of different lengths with the edit-distance approach, I suppose.

Tai
  • This is an interesting approach for the `LD` problem. I agree that in many cases, the best solution is to use domain-specific knowledge to make the problem "smaller", but I am still holding out hope (though it is waning) for a more generalized solution that is independent of the function. I have updated the question to be more generic. – pault Jan 12 '18 at 16:07
  • @pault no problem. Best of luck :P – Tai Jan 12 '18 at 18:00