PySpark: join using isin to find if a column in one dataframe is substring of another column of another dataframe

Question

I have tried searching if someone has asked this question about PySpark but I had no success.

I have a DataFrame of messy names, called df1 (as indicated in the image) and I prepared a DataFrame of clean names, called df2 (see the image). How can I use .join() and .isin() or anything else to obtain the last table that is in the attached image?

Here is the image:

I have tried

cond = [df2["Clean_names"].isin(df1["Names"])]

df1 = df1.join(df2, cond, "left")

but it raised an error saying that .join() expects something else as arguments. I'm sorry, I no longer have the exact error log. The real DataFrames are quite big, so I can't use any iterative operations (e.g. for loops, pandas with .loc, or pandas at all).

Also I just created an account on stackoverflow, so I'm sorry I couldn't format my question better.

`expects something else as arguments` is a slightly odd error message to get from the code above. Could you try `df1.printSchema()` and `df2.printSchema()` and check that both run without errors? If they print the schemas successfully, please add them to the question. — Emma, Jun 15 '23 at 18:53
I believe you're looking to do `df2["Clean_names"] == df1["Names"]`, i.e. join rows where the values are the same in the two columns. — samkart, Jun 16 '23 at 06:02
Hello Emma, thanks for the reply. Sorry, like I said, I lost the error message, but I found a solution that worked so I will post that. Hello samkart, thanks for the advice. But, like you can see in the image, the names are not the same, so == won't do. One is contained inside the other. I will post the solution I found that worked for me. — jota_ele_a, Jun 16 '23 at 14:45
