Pandas: How to use a Numpy function instead of a Lambda function for the same result (since Numpy is faster)?

Question

The command below is giving me the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Why and how can I fix it?

df['Score'] = np.array(fuzz.ratio(df['Vendor'], df['Company']))

Note: I know that the command below works, but I was hoping to use NumPy, as I've heard it's much faster than a lambda:

df['Score'] = df['Vendor'].apply(lambda x: fuzz.ratio(x, df['Company']))

Thanks!

Have you tried converting your columns to lists and then using your first attempt, like: `df['Score'] = np.array(fuzz.ratio(df['Vendor'].to_list(), df['Company'].to_list()))` – luigigi, Dec 16 '19 at 06:02
Is `fuzz.ratio` from [fuzzywuzzy](https://pypi.org/project/fuzzywuzzy/)? – Warren Weckesser, Dec 16 '19 at 06:29
This fast `numpy` that you've heard about is a set of building blocks (functions/methods) that work with a whole pandas `Series` (column) at once: things like addition, mean, and products, which are rather basic operations on arrays. You haven't told us anything about `fuzz.ratio`. It appears to be a function that works with one element of the column, not the whole column at once. `apply` applies the function to successive elements of the column. There isn't any fast numpy magic that does the same thing. – hpaulj, Dec 16 '19 at 06:35
@luigigi Yup, that worked. Numpy is almost 3 times faster, as per the results below. I was wondering if the same could be achieved without converting to a list first? That would make it even faster! `Lambda: 758 µs ± 8.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)` `Numpy: 253 µs ± 1.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)` @Warren Weckesser Yes, that's the fuzzywuzzy library. @hpaulj I'll bear that in mind, thanks. And yes, `fuzz.ratio` is from the FuzzyWuzzy library (`pip install fuzzywuzzy`). – Chadee Fouad, Dec 16 '19 at 06:59
It looks like `fuzz.ratio` is designed to work with sequences such as Python lists. Presumably it also returns a list. A pandas column/Series is an array, or array-like, and behaves differently when it comes to comparisons, as the original error shows. `to_list()` is the fastest way of converting a Series to a list. You may not need the `np.array()` wrapper. – hpaulj, Dec 16 '19 at 07:58
The data of a dataframe column, or Series, is (usually) a `numpy` array. That's why it's possible to add two Series, or scale one. Operations like that are the fast `numpy` ones that you've heard about. But if you have to iterate (or use code that iterates on a sequence), lists are faster, enough so that using the `to_list` conversion is often worth the extra step. Throwing arrays or Series into code that is not designed for them (numpy or pandas) may be slower, if it works at all. – hpaulj, Dec 16 '19 at 08:18
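
The row-by-row iteration described in these comments can be sketched with plain lists. This is only a minimal sketch, not the fuzzywuzzy implementation: `similarity_score` is a hypothetical stand-in scorer built on the standard library's `difflib.SequenceMatcher`, and the two lists stand for what `df['Vendor'].to_list()` and `df['Company'].to_list()` would return.

```python
from difflib import SequenceMatcher


def similarity_score(left, right):
    """Hypothetical stand-in for a per-pair scorer: 0-100 similarity of two strings."""
    return round(100 * SequenceMatcher(None, left, right).ratio())


# Assumed sample data, standing in for the two columns converted to lists.
vendors = ["ACME Factory", "CME Factory Inc.", "ATHMA Inc."]
companies = ["ACME Factory Inc.", "CMEA Factry", "Cypress Hill CO."]

# One score per row: each vendor is paired with the company on the same row.
scores = [similarity_score(v, c) for v, c in zip(vendors, companies)]
```

Because `scores` has one entry per row, it could be assigned straight back to a column (for example `df['Score'] = scores`) without an `np.array()` wrapper.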

oppressionslayer · Accepted Answer · 2019-12-16T07:42:38.823

1

Try this; it should do the same thing as the numpy statement:

df.apply(lambda x: fuzz.ratio(x.Vendor, x.Company), axis=1)

That is, if fuzz.ratio takes non-iterable arguments.

or maybe:

np.apply_along_axis(fuzz.ratio, 0, df['Vendor'], df['Company'] )

edited Dec 16 '19 at 07:42

answered Dec 16 '19 at 05:50

oppressionslayer


Thanks, but I'm trying to use Numpy since I have 1.2 million records and Numpy is supposed to be much faster. – Chadee Fouad Dec 16 '19 at 05:51
@ShadyMBA I added an apply from numpy. I'm not sure if it's faster, but it may be worth a try. – oppressionslayer Dec 16 '19 at 07:43
Unfortunately it's giving the error "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().". Thanks for your efforts anyway :-) Others in the post have suggested `df['Score'] = np.array(fuzz.ratio(df['Vendor'].tolist(), df['Company'].tolist()))`, which is working, so the problem is solved. – Chadee Fouad Dec 16 '19 at 17:16
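
One caveat about the list-based call quoted above, offered as a sanity check and not as a verdict: a scorer that compares one pair of inputs returns one number, so passing two whole lists yields a single score for the lists as wholes, and assigning a single value to a column repeats it on every row. The sketch below illustrates this with the standard library's `difflib.SequenceMatcher` as an assumed stand-in for `fuzz.ratio`; `pair_score` is a hypothetical helper, and fuzzywuzzy's own handling of list inputs may differ in detail.

```python
from difflib import SequenceMatcher


def pair_score(left, right):
    """Hypothetical stand-in scorer: one 0-100 number for one pair of sequences."""
    return round(100 * SequenceMatcher(None, left, right).ratio())


# Assumed sample data, standing in for the two columns converted to lists.
vendors = ["a", "CME Factory Inc.", "ATHMA Inc."]
companies = ["b", "CMEA Factry", "Cypress Hill CO."]

# Passing whole lists compares list against list and returns ONE number.
whole_list_score = pair_score(vendors, companies)

# Pairing the rows explicitly returns one number PER row.
per_row_scores = [pair_score(v, c) for v, c in zip(vendors, companies)]
```

A quick way to check a real result is to look at whether the `Score` column contains more than one distinct value on data where the rows clearly differ.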

powerPixie · Answer 2 · 2019-12-17T06:14:56.193

0

Consider my test dataframe:

df = pd.DataFrame([['ACME Factory', 'ACME Factory Inc.'], ['CME Factory Inc.', 'CMEA Factry'], ['ATHMA Inc.', 'Cypress Hill CO.']], columns=['Vendor', 'Company'])

df['Score'] = np.array(fuzz.ratio(df['Vendor'].values[0],df['Company'].values[0]))

             Vendor            Company  Score
0      ACME Factory  ACME Factory Inc.     83
1  CME Factory Inc.        CMEA Factry     83
2        ATHMA Inc.   Cypress Hill CO.     83

Obviously my score calculation is wrong. I believe this is due to the problems I had while installing python-Levenshtein, which is a dependency of the fuzzywuzzy library. But I could get rid of the:

 ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

That I could successfully reproduce.
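
For context on that error: pandas raises it whenever a Series is used where Python expects a single True or False, for example in an `if`. Comparing a Series with `==` gives one boolean per element, so there is no single truth value to test. Presumably `fuzz.ratio` performs such a check on its arguments internally; that is an assumption, not something confirmed here. The sketch below mimics the behaviour with `ElementwiseColumn`, a hypothetical stand-in class written for illustration only (it is not pandas code).

```python
class ElementwiseColumn:
    """Hypothetical stand-in for a pandas Series (illustration only)."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # Compare element by element and return a column of booleans,
        # not a single True/False.
        if isinstance(other, ElementwiseColumn):
            others = other.values
        else:
            others = [other] * len(self.values)
        return ElementwiseColumn(a == b for a, b in zip(self.values, others))

    def __bool__(self):
        # A column of several booleans has no single truth value.
        raise ValueError("The truth value of an ElementwiseColumn is ambiguous.")

    def any(self):
        return any(self.values)

    def all(self):
        return all(self.values)


vendor = ElementwiseColumn(["ACME", "CME"])
company = ElementwiseColumn(["ACME", "CMEA"])

mask = vendor == company  # elementwise result, not one boolean

try:
    if mask:  # asks for a single truth value, which calls __bool__
        pass
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Reducing the column explicitly is what removes the ambiguity.
any_equal = mask.any()
all_equal = mask.all()
```

This is why the error message suggests `a.any()` or `a.all()`: both collapse the per-element results into one boolean.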

What I don't know is how all these conversions (.values[0]) will affect performance. With .values you get only the data from the DataFrame column, as an array, and with [0] you take the first string out of it.

Tell me if it works for you.

ShadyMBA suggestion:

edited Dec 17 '19 at 06:14

answered Dec 16 '19 at 07:16

powerPixie


Careful! I think your code leads to wrong results. You are getting 83 across the entire column because in each row it is comparing ['ACME Factory','ACME Factory Inc.'], not the companies related to that row. To test it I've used: `df =pd.DataFrame([['a','b'],['CME Factory Inc.','CMEA Factry'],['ATHMA Inc.','Cypress Hill CO.']],columns=['Vendor','Company']) df['Score'] = np.array(fuzz.ratio(df['Vendor'].values[0],df['Company'].values[0])) ` – Chadee Fouad Dec 16 '19 at 17:21
Thank you for your explanation. I didn't read the fuzzywuzzy documentation, my mistake. I thought the funny result was related to the failure during the python-Levenshtein installation. Anyway, I tried to reproduce your problem with the single line of code you provided and was able to find a solution to the issue you asked about in your question. I also tried the dataframe you suggested and it was no good. I've printed the code execution and the output in my answer, under "ShadyMBA suggestion". – powerPixie Dec 17 '19 at 06:13
You're very welcome :-) you said "and was able to find out a solution to the issue you've asked about in your question"...so what was the solution? Anyway please don't spend too much time on it as someone provided this solution that works: `np.array(fuzz.ratio(df['Vendor'].tolist(), df['Company'].tolist()))` – Chadee Fouad Dec 18 '19 at 05:17
ShadyMBA, I believed that the problem was the "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." – powerPixie Dec 18 '19 at 07:28
