The command below is giving me the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Why and how can I fix it?

df['Score'] = np.array(fuzz.ratio(df['Vendor'], df['Company']))

Note: I know that the command below works, but I was hoping to use numpy, as I've heard it's much faster than a lambda:

df['Score'] = df['Vendor'].apply(lambda x: fuzz.ratio(x, df['Company']))

Thanks!

hpaulj
Chadee Fouad
  • Have you tried converting your columns to lists and then using your first attempt, like: `df['Score'] = np.array(fuzz.ratio(df['Vendor'].to_list(), df['Company'].to_list()))` – luigigi Dec 16 '19 at 06:02
  • Is `fuzz.ratio` from [fuzzywuzzy](https://pypi.org/project/fuzzywuzzy/)? – Warren Weckesser Dec 16 '19 at 06:29
  • This fast `numpy` that you've heard about is a set of building blocks (functions/methods) that work with a whole `pandas Series` (column). Things like addition, mean, products, etc. Rather basic operations on `arrays`. You haven't told us anything about `fuzz.ratio`. It appears to be a function that works with one element of the column, not the whole column (at once). `apply` applies the function to successive elements of the column. There isn't a fast numpy magic that does the same thing. – hpaulj Dec 16 '19 at 06:35
  • @luigigi Yup, that worked. Numpy is almost 3 times faster as per the results below. I was wondering if the same could be achieved without converting to a list first? That would make it even faster! `Lambda: 758 µs ± 8.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) Numpy: 253 µs ± 1.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)` @Warren Weckesser Yes, that's the fuzzywuzzy library. @hpaulj I'll bear that in mind, thanks, and yes, fuzz.ratio is from the FuzzyWuzzy library (pip install fuzzywuzzy). – Chadee Fouad Dec 16 '19 at 06:59
  • It looks like `fuzz.ratio` is designed to work with `sequences`, Python lists. Presumably it also returns a list. A pandas column/Series is an array, or array-like, and behaves differently when it comes to comparisons, as the original error shows. `to_list()` is the fastest way of converting a Series to a list. You may not need the `np.array()` wrapper. – hpaulj Dec 16 '19 at 07:58
  • The data of a DataFrame column, or Series, is (usually) a `numpy` array. That's why it's possible to add two Series, or scale one. Operations like that are the fast `numpy` ones that you've heard about. But if you have to iterate (or use code that iterates on a sequence), lists are faster, enough so that using the `to_list` conversion is often worth the extra step. Throwing arrays or Series into code that is not designed for them (numpy or pandas) may be slower, if it works at all. – hpaulj Dec 16 '19 at 08:18
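
To illustrate the comments above, here is a minimal sketch of where the ValueError comes from. Comparing two Series yields a Series of booleans, and using that Series where a single True/False is expected raises the error. That `fuzz.ratio` performs such a check internally is an assumption here, not something confirmed in this thread, and the two small Series are made up for the example.

```python
import pandas as pd

# Made-up sample data, not the asker's real columns.
vendor = pd.Series(["ACME Factory", "CME Factory Inc."])
company = pd.Series(["ACME Factory Inc.", "CMEA Factry"])

# Comparing two Series is element-wise: the result is a Series of booleans,
# not a single True or False.
equal = vendor == company

# Any code that treats that result as one boolean (for example an `if`)
# triggers the ValueError from the question.
message = ""
try:
    if equal:
        pass
except ValueError as exc:
    message = str(exc)

print(message)
```

Code written for single strings (assumed to include `fuzz.ratio`) can hit this as soon as it compares or tests its arguments, which is why passing whole columns fails.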

2 Answers

Try this; it should do the same thing as the numpy statement:

df.apply(lambda x: fuzz.ratio(x.Vendor, x.Company), axis=1)

That is, if fuzz.ratio takes non-iterables (a single value for each argument).

or maybe:

np.apply_along_axis(fuzz.ratio, 0, df['Vendor'], df['Company'] ) 
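
A plain-Python sketch of the same row-by-row pairing, iterating over lists instead of calling `df.apply`. `ratio_standin` is a hypothetical stand-in built on `difflib.SequenceMatcher` so the snippet runs without fuzzywuzzy; it is not the library's implementation, and whether this is faster on 1.2 million rows has not been measured here.

```python
from difflib import SequenceMatcher


def ratio_standin(a: str, b: str) -> int:
    """Hypothetical stand-in for fuzz.ratio: similarity of two strings, 0 to 100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# In the question these lists would come from df['Vendor'].tolist()
# and df['Company'].tolist().
vendors = ["ACME Factory", "CME Factory Inc.", "ATHMA Inc."]
companies = ["ACME Factory Inc.", "CMEA Factry", "Cypress Hill CO."]

# zip pairs the i-th vendor with the i-th company, so each row gets its own score.
scores = [ratio_standin(v, c) for v, c in zip(vendors, companies)]
print(scores)
```

With fuzzywuzzy installed, the equivalent would presumably be `df['Score'] = [fuzz.ratio(v, c) for v, c in zip(df['Vendor'].tolist(), df['Company'].tolist())]`.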
oppressionslayer
  • Thanks but I'm trying to use Numpy since I have 1.2 million records and Numpy is supposed to be much faster. – Chadee Fouad Dec 16 '19 at 05:51
  • @ShadyMBA I added an apply from numpy; not sure if it's faster, but maybe worth a try – oppressionslayer Dec 16 '19 at 07:43
  • Unfortunately it's giving error "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().". Thanks for your efforts anyway :-) others in the post have suggested `df['Score'] = np.array(fuzz.ratio(df['Vendor'].tolist(), df['Company'].tolist()))` which is working so the problem is solved. – Chadee Fouad Dec 16 '19 at 17:16

Consider my test dataframe:

df = pd.DataFrame([['ACME Factory', 'ACME Factory Inc.'], ['CME Factory Inc.', 'CMEA Factry'], ['ATHMA Inc.', 'Cypress Hill CO.']], columns=['Vendor', 'Company'])

df['Score'] = np.array(fuzz.ratio(df['Vendor'].values[0], df['Company'].values[0]))

             Vendor            Company  Score
0      ACME Factory  ACME Factory Inc.     83
1  CME Factory Inc.        CMEA Factry     83
2        ATHMA Inc.   Cypress Hill CO.     83

Obviously my score calculation is wrong. I believe it is due to the problems I had while installing python-Levenshtein, which the fuzzywuzzy library depends on to work properly. But I could get rid of the:

 ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

That I could successfully reproduce.

What I don't know is how all these conversions (.values[0]) will perform. With .values you get only the data from the dataframe, as an array, and with [0] you take the first string out of that array.
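
A small sketch of what the `.values[0]` line does, assuming pandas is installed: `.values` returns the whole column as an array, `[0]` picks the first row's value, and assigning a single number to a column repeats it in every row. The literal 83 below stands in for the single score computed from the first row; it is copied from the table above, not recomputed.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["ACME Factory", "ACME Factory Inc."],
        ["CME Factory Inc.", "CMEA Factry"],
        ["ATHMA Inc.", "Cypress Hill CO."],
    ],
    columns=["Vendor", "Company"],
)

# .values[0] is the value in the first row only, not one value per row.
first_vendor = df["Vendor"].values[0]
first_company = df["Company"].values[0]
print(first_vendor, "|", first_company)

# Assigning one scalar to a column broadcasts it to every row,
# which would explain an identical score in all rows.
df["Score"] = 83
print(df)
```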

Tell me if it works for you.

ShadyMBA suggestion: [screenshot of the code execution and its output]

powerPixie
  • Careful! I think your code leads to wrong results. You are getting 83 across the entire column because in each row it is comparing ['ACME Factory','ACME Factory Inc.'], not the companies related to that row. To test it I've used: `df =pd.DataFrame([['a','b'],['CME Factory Inc.','CMEA Factry'],['ATHMA Inc.','Cypress Hill CO.']],columns=['Vendor','Company']) df['Score'] = np.array(fuzz.ratio(df['Vendor'].values[0],df['Company'].values[0])) ` – Chadee Fouad Dec 16 '19 at 17:21
  • Thank you for your explanation. I didn't read the fuzzywuzzy documentation, my mistake. I thought the funny result was related to the failed python-Levenshtein installation. Anyway, I tried to reproduce your problem with the single line of code you provided and was able to find a solution to the issue you asked about in your question. I also tried the dataframe you suggested and it was no good. I've put the code execution and the output in my answer, under "ShadyMBA suggestion". – powerPixie Dec 17 '19 at 06:13
  • You're very welcome :-) you said "and was able to find out a solution to the issue you've asked about in your question"...so what was the solution? Anyway please don't spend too much time on it as someone provided this solution that works: `np.array(fuzz.ratio(df['Vendor'].tolist(), df['Company'].tolist()))` – Chadee Fouad Dec 18 '19 at 05:17
  • ShadyMBA, I believed that the problem was the "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." – powerPixie Dec 18 '19 at 07:28