We want to compare two DataFrames, add values to the first DataFrame based on that comparison, and then group the result to get combined data.
We are using PySpark DataFrames; ours are shown below.

Dataframe1:

| Manager | Department | isHospRelated |
| ------- | ---------- | ------------- |
| Abhinav | Medical    | t             |
| Abhinav | Surgery    | t             |
| Rajesh  | Payments   | t             |
| Rajesh  | HR         | t             |
| Sonu    | Onboarding | t             |
| Sonu    | Surgery    | t             |
| Sonu    | HR         | t             |

Dataframe2:

| OrgSupported | OrgNonSupported |
| ------------ | --------------- |
| Medical      | Payment         |
| Surgery      | Onboarding      |

We want to compare Dataframe1 with Dataframe2 and obtain the following result:

| Manager | Department | Org Supported | Org NotSupported |
| ------- | ---------- | ------------- | ---------------- |
| Abhinav | Medical    | Medical       |                  |
| Abhinav | Surgery    | Surgery       |                  |
| Rajesh  | Payments   |               | Payments         |
| Rajesh  | HR         |               | HR               |
| Sonu    | Onboarding |               | Onboarding       |
| Sonu    | Surgery    | Surgery       |                  |
| Sonu    | HR         |               | HR               |

Finally, we would like to group the rows by manager as follows:

| Manager | Department              | isHospRelated | Org Supported    | Org NotSupported |
| ------- | ----------------------- | ------------- | ---------------- | ---------------- |
| Abhinav | Medical, Surgery        | t             | Medical, Surgery |                  |
| Rajesh  | Payments, HR            | t             |                  | Payments, HR     |
| Sonu    | Onboarding, Surgery, HR | t             | Surgery          | Onboarding, HR   |

Since our code is in PySpark, any suggestions on how to do this kind of comparison and grouping in PySpark?