
I searched for an existing question about this in PySpark but had no success.

I have a DataFrame of messy names called df1 (shown in the image), and I prepared a DataFrame of clean names called df2 (also in the image). How can I use .join() and .isin(), or anything else, to obtain the last table in the attached image?

Here is the image: 1

I have tried

cond = [df2[Clean_names].isin(df1[Names])]

df1 = df1.join(df2, cond, "left")

but the result was an error saying that .join() expects something else as arguments. I'm sorry, I no longer have the exact error log. The real DataFrames are quite big, so I can't use any iterative operations (for loops, pandas with .loc, or pandas at all).

Also, I just created an account on Stack Overflow, so I'm sorry I couldn't format my question better.

Horst724
jota_ele_a
  • `expects something else as arguments`: this error message looks a little odd to get from the above code. Could you try `df1.printSchema()` and `df2.printSchema()` and see whether both run without errors? If they successfully show the schemas, please add them to the question. – Emma Jun 15 '23 at 18:53
  • I believe you're looking to do `df2[Clean_names] == df1[Names]`, i.e. join rows where the values are the same in the two columns. – samkart Jun 16 '23 at 06:02
  • Hello Emma, thanks for the reply. Sorry, as I said, I lost the error message, but I found a solution that worked, so I will post that. Hello samkart, thanks for the advice. But, as you can see in the image, the names are not the same, so == won't do. One is contained inside the other. I will post the solution I found that worked for me. – jota_ele_a Jun 16 '23 at 14:45

0 Answers