
The purpose of this code is to filter out the groups of a pyspark DataFrame that contain rows without a matching row in the hub DataFrame. The groups are determined by the primary key. To check whether all rows of a group have matching rows in the hub DataFrame, the group is merged with the hub DataFrame. If the number of rows of the original group does not equal the number of rows of the merged result, the group should be filtered out. For example, the group ABC below has two rows but only one of them matches a row in the hub, so the whole group should be dropped.

I tested the condition successfully using the entire DataFrame:

```python
len(df) == len(df.merge(df_hub, left_on=key, right_on=key, suffixes=["", "_hub"]))
```

However, when I try to use the same condition inside the groupby filter, it results in a TypeError:

```python
primary_key = ['CURRENCY_ISO_CODE']
key = ['CURRENCY_ISO_CODE', 'H_CURRENCY_SK']
df.groupby(primary_key).filter(lambda group: len(group) == len(group.merge(df_hub, left_on=key, right_on=key, suffixes=["", "_hub"])))
```

```
TypeError: cannot pickle '_thread.RLock' object
```

How do I resolve the TypeError, or is there a better way to achieve the desired filtering?
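For reference, the sample frames shown below can be reproduced as plain pandas DataFrames (a minimal sketch; the real frames are Spark-backed). Notably, the same groupby filter then runs without the error:

```python
import pandas as pd

df = pd.DataFrame({"CURRENCY_ISO_CODE": ["ABC", "ABC", "DEF"],
                   "H_CURRENCY_SK": [1, 999, 2]})
df_hub = pd.DataFrame({"CURRENCY_ISO_CODE": ["ABC", "DEF"],
                       "H_CURRENCY_SK": [1, 2]})

primary_key = ["CURRENCY_ISO_CODE"]
key = ["CURRENCY_ISO_CODE", "H_CURRENCY_SK"]

# In plain pandas the per-group merge works and returns the desired result,
# which suggests the pickle error is specific to the Spark-backed frames.
result = df.groupby(primary_key).filter(
    lambda group: len(group) == len(group.merge(df_hub, on=key)))
print(result)  # only row 2 (DEF, 2) remains
```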

df

| Index | CURRENCY_ISO_CODE | H_CURRENCY_SK |
| :---- | :---------------- | :------------ |
| 0     | ABC               | 1             |
| 1     | ABC               | 999           |
| 2     | DEF               | 2             |

df_hub

| Index | CURRENCY_ISO_CODE | H_CURRENCY_SK |
| :---- | :---------------- | :------------ |
| 0     | ABC               | 1             |
| 1     | DEF               | 2             |

desired result

| Index | CURRENCY_ISO_CODE | H_CURRENCY_SK |
| :---- | :---------------- | :------------ |
| 2     | DEF               | 2             |
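
For completeness, the desired result can also be produced without a per-group merge, e.g. with a left merge plus `indicator=True` (again a sketch in plain pandas; whether this translates to the Spark-backed frames is part of the question):

```python
import pandas as pd

df = pd.DataFrame({"CURRENCY_ISO_CODE": ["ABC", "ABC", "DEF"],
                   "H_CURRENCY_SK": [1, 999, 2]})
df_hub = pd.DataFrame({"CURRENCY_ISO_CODE": ["ABC", "DEF"],
                       "H_CURRENCY_SK": [1, 2]})
key = ["CURRENCY_ISO_CODE", "H_CURRENCY_SK"]

# Flag each row of df that has a matching row in df_hub.
# drop_duplicates() keeps the left merge from duplicating rows.
matched = (df.merge(df_hub[key].drop_duplicates(), on=key,
                    how="left", indicator=True)["_merge"].eq("both"))

# Keep only the groups in which every row matched.
keep = matched.groupby(df["CURRENCY_ISO_CODE"].values).transform("all")
print(df[keep.values])  # only row 2 (DEF, 2) remains
```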