
I am trying to get a subset of my dataframe by applying multiple conditions, but I am unable to replicate the regular pandas isin behavior in PySpark. Let's say that my goal dataframe is (in pandas):

selection = df[string1.isin(look_string)]

Here string1 is a column from the same df (a concatenation of other columns), while look_string is another df with a single column and a different length:

string1 = esmm.column1 + esmm.column2 + esmm.column3

I am able to code everything in Spark except the isin. Trying this:

df[df.string1.isin(look_string.look_string)]

I get a huge error saying Resolved attribute(s) missing from. Trying this instead:

esmms[df.string1.isin(look_string.select("look_string"))]

I get this error: 'DataFrame' object has no attribute '_get_object_id'

What would be the best way to proceed?

Eduardo EPF
    Does this answer your question? [PySpark: match the values of a DataFrame column against another DataFrame column](https://stackoverflow.com/questions/42545788/pyspark-match-the-values-of-a-dataframe-column-against-another-dataframe-column) – Mykola Zotko Dec 29 '20 at 10:35

1 Answer


I think the 'isin' method works when you pass it a list of values or a string, but not a column of another dataframe.

You could turn the column 'look_string' into a Python list like this:

look_string_list = [row['look_string'] for row in look_string.select('look_string').collect()]

Then apply the 'isin' method with that list. Make sure to call 'filter' on the dataframe and pass the resulting column condition as the argument.

esmms = df.filter(df.string1.isin(look_string_list))

This may not be the most efficient way to achieve what you want, because collect brings every row of the column back to the driver, which takes a while on a large dataframe, but I guess it works.

Nico Arbar