
I need to be able to compare two dataframes using multiple columns.

PySpark attempt:

# collect the distinct PrimaryLookupAttributeValue values from the reference table into a list, to compare against df1

primaryAttributeValue_List = [ p.PrimaryLookupAttributeValue for p in AttributeLookup.select('PrimaryLookupAttributeValue').distinct().collect() ]
primaryAttributeValue_List  # a list of values; contents vary by filter

Out: ['Archive',
 'Pending Security Deposit',
 'Partially Abandoned',
 'Revision Contract Review',
 'Open',
 'Draft Accounting In Review',
 'Draft Returned']


# compare df1 to PrimaryLookupAttributeValue
output = dataset_standardFalse2.withColumn('ConformedLeaseStatusName', f.when(dataset_standardFalse2['LeaseStatus'].isin(primaryAttributeValue_List), "FOUND").otherwise("TBD"))

display(output)

jgtrz
  • are you looking to map `df1.LeaseStatus` and `df1.LeaseRecoveryType` based on `reference_df.DomainName` and from `reference_df.PrimaryLookupAttributeValue` to `reference_df.OutputItemNameByValue`? – jxc May 15 '20 at 21:00
  • yes! Except, looking to map `df1.LeaseStatus` and `df1.LeaseRecoveryType` based on `reference_df.PrimaryLookupAttributeName`, which is why I have dataframe `AttributeLookup` @jxc – jgtrz May 15 '20 at 21:08

1 Answer

From my understanding, you can create a map based on columns from reference_df (assuming it is not a very large dataframe):

map_key = concat_ws('\0', PrimaryLookupAttributeName, PrimaryLookupAttributeValue)
map_value = OutputItemNameByValue

and then use this mapping to get the corresponding values in df1:

from itertools import chain
from pyspark.sql.functions import collect_set, array, concat_ws, lit, col, create_map

d = reference_df.agg(collect_set(array(concat_ws('\0','PrimaryLookupAttributeName','PrimaryLookupAttributeValue'), 'OutputItemNameByValue')).alias('m')).first().m
#[['LeaseStatus\x00Abandoned', 'Active'],
# ['LeaseRecoveryType\x00Gross-modified', 'Modified Gross'],
# ['LeaseStatus\x00Archive', 'Expired'],
# ['LeaseStatus\x00Terminated', 'Terminated'],
# ['LeaseRecoveryType\x00Gross w/base year', 'Modified Gross'],
# ['LeaseStatus\x00Draft', 'Pending'],
# ['LeaseRecoveryType\x00Gross', 'Gross']]

mappings = create_map([lit(i) for i in chain.from_iterable(d)])
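# `mappings` is now a single MapType column built from the key/value pairs above, e.g.:
# {'LeaseStatus\x00Abandoned': 'Active', 'LeaseRecoveryType\x00Gross-modified': 'Modified Gross', ...}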

primaryLookupAttributeName_List = ['LeaseType', 'LeaseRecoveryType', 'LeaseStatus']

df1.select("*", *[ mappings[concat_ws('\0', lit(c), col(c))].alias("Matched[{}]OutputItemNameByValue".format(c)) for c in primaryLookupAttributeName_List ]).show()
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
|SourceSystemName|...|Matched[LeaseType]OutputItemNameByValue|Matched[LeaseRecoveryType]OutputItemNameByValue|Matched[LeaseStatus]OutputItemNameByValue|
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
|          ABC123|...|                                   null|                                          Gross|                               Terminated|
|          ABC123|...|                                   null|                                 Modified Gross|                                  Expired|
|          ABC123|...|                                   null|                                 Modified Gross|                                  Pending|
+----------------+...+---------------------------------------+-----------------------------------------------+-----------------------------------------+
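For anyone who wants to try the lookup pattern in isolation, below is a minimal, self-contained sketch with made-up data (the SparkSession setup and the toy reference_df/df1 contents are assumptions for illustration; the column names mirror the ones above):

from itertools import chain
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_set, array, concat_ws, lit, col, create_map

spark = SparkSession.builder.getOrCreate()

# hypothetical reference table: (attribute name, attribute value, output value)
reference_df = spark.createDataFrame(
    [('LeaseStatus', 'Archive', 'Expired'),
     ('LeaseStatus', 'Draft', 'Pending'),
     ('LeaseRecoveryType', 'Gross', 'Gross')],
    ['PrimaryLookupAttributeName', 'PrimaryLookupAttributeValue', 'OutputItemNameByValue'])

# hypothetical data to standardize
df1 = spark.createDataFrame(
    [('Archive', 'Gross'), ('Draft', 'Net')],
    ['LeaseStatus', 'LeaseRecoveryType'])

# build the composite-key -> output-value map, exactly as in the answer
d = reference_df.agg(collect_set(array(
        concat_ws('\0', 'PrimaryLookupAttributeName', 'PrimaryLookupAttributeValue'),
        'OutputItemNameByValue')).alias('m')).first().m
mappings = create_map([lit(i) for i in chain.from_iterable(d)])

# look up each column; keys absent from the map yield null ('Net' below)
cols = ['LeaseStatus', 'LeaseRecoveryType']
df1.select('*', *[mappings[concat_ws('\0', lit(c), col(c))]
                  .alias('Matched[{}]OutputItemNameByValue'.format(c))
                  for c in cols]).show(truncate=False)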

UPDATE: to set the column names from the information retrieved through the reference_df dataframe:

# a list of domains to retrieve
primaryLookupAttributeName_List = ['LeaseType', 'LeaseRecoveryType', 'LeaseStatus']

# mapping from domain names to column names: using `reference_df`.`TargetAttributeForName`
NEWprimaryLookupAttributeName_List = dict(reference_df.filter(reference_df['DomainName'].isin(primaryLookupAttributeName_List)).agg(collect_set(array('DomainName', 'TargetAttributeForName')).alias('m')).first().m)

test = dataset_standardFalse2.select("*",*[ mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c,c_name in NEWprimaryLookupAttributeName_List.items()]) 

Note-1: it is better to loop through primaryLookupAttributeName_List so the order of the columns is preserved and, in case any entry in primaryLookupAttributeName_List is missing from the dictionary, we can set a default column name, i.e. Unknown-<col>. With the old method, columns with missing entries are simply discarded.

test = dataset_standardFalse2.select("*",*[ mappings[concat_ws('\0', lit(c), col(c))].alias(NEWprimaryLookupAttributeName_List.get(c,"Unknown-{}".format(c))) for c in primaryLookupAttributeName_List])

Note-2: per comments, to overwrite the existing column names (untested):

(1) use select:

test = dataset_standardFalse2.select([c for c in dataset_standardFalse2.columns if c not in NEWprimaryLookupAttributeName_List.values()] + [ mappings[concat_ws('\0', lit(c), col(c))].alias(NEWprimaryLookupAttributeName_List.get(c,"Unknown-{}".format(c))) for c in primaryLookupAttributeName_List])

(2) use reduce (not recommended if the list is very long, since each withColumn call adds another projection to the query plan):

from functools import reduce

df_new = reduce(lambda d, c: d.withColumn(NEWprimaryLookupAttributeName_List.get(c,"Unknown-{}".format(c)), mappings[concat_ws('\0', lit(c), col(c))]), primaryLookupAttributeName_List, dataset_standardFalse2)

reference: PySpark create mapping from a dict

jxc
  • this is great. It outputs what I was looking for, thank you! I'm looking into the methods `concat_ws` and `chain`. Quick question regarding using `\0` as the delimiter: why? @jxc – jgtrz May 17 '20 at 21:46
  • 1
  • @jessgtrz you can use any delimiter as long as it won't mix up PrimaryLookupAttributeName="A" + PrimaryLookupAttributeValue="B C" (A\0B C) with PrimaryLookupAttributeName="A B" + PrimaryLookupAttributeValue="C" (A B\0C). I prefer the NUL char `\0` as the delimiter since it is not common in normal text. Also, if you write Linux command scripts often: the NUL char ('\0') is not allowed in Linux filenames, so it became my favorite delimiter whenever one is needed (see the short demonstration after this thread). – jxc May 17 '20 at 22:01
  • Got it! So the logic in my question maps values from the reference table and dataset, hence `OutputItemNameByValue`. Can we map output column names with the reference table and dataset? For example, taking `.alias("Matched[{}]OutputItemNameByValue".format(c))` to make it iterate and match corresponding output columns? Asking about this since it may not always be `Matched....OutputItemNameByValue`. I'm thinking I can map DomainName, PrimaryAttributeName (map_key) and say TargetOutputColumn(map_key). Can `.alias("Matched[{}]OutputItemNameByValue".format(c))` be replaced with another map? @jxc – jgtrz May 17 '20 at 23:02
  • 1
  • @jessgtrz, I think you can convert the Python list to a dictionary, for example: `primaryLookupAttributeName_List = {'LeaseType':'Name1', 'LeaseRecoveryType':'Name2', 'LeaseStatus':'Name3'}` and then use `*[ mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c,c_name in primaryLookupAttributeName_List.items() ]` – jxc May 17 '20 at 23:55
  • hmm, gotcha. I'm having difficulty finding the most efficient way to do this, since building a dictionary for primaryLookupAttributeName_List is embedded with another dataframe. Would appreciate another set of eyes/brains. Thanks so much! I've learned a lot from you! https://stackoverflow.com/questions/61875664/pyspark-mapping-column-names-and-values @jxc – jgtrz May 18 '20 at 17:56
  • Hi, @jessgtrz, if I understand you correctly, you can create the dictionary from DomainName to TargetAttributeForName by: `primaryLookupAttributeName_List = dict(reference_df.agg(collect_set(array('DomainName', 'TargetAttributeForName')).alias('m')).first().m)`. I can update this post if needed so you can remove the new post. – jxc May 18 '20 at 19:10
  • I'm trying to test your suggested code, but getting a `java.lang.NullPointerException`. I'm using `test = dataset_standardFalse2.select("*",*[ mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c,c_name in NEWprimaryLookupAttributeName_List.items()])` @jxc – jgtrz May 18 '20 at 21:13
  • @jessgtrz, I adjusted the post to reflect some changes discussed in the comments. As for the latest ERROR, I think it's most likely because we missed a filter when setting up NEWprimaryLookupAttributeName_List, as not every domain is wanted in this task. – jxc May 18 '20 at 21:39
  • got it. Woah, your methods for `NEWprimaryLookupAttributeName_List` are shocking! I was going down a more complicated, less efficient route, and was going to use `testingAtr_List = processedEntityDomains.agg(collect_set(array('DomainName', 'TargetAttributeForId', 'TargetAttributeForCode', 'TargetAttributeForName'))) display(testingAtr_List)` and find corresponding matching column names and values. I will need to test this to make sure I'm pulling the right info. Thanks again!! @jxc – jgtrz May 18 '20 at 21:57
  • follow-up question, for which I can try two different things: 1) can we add another filter to `NEWprimaryLookupAttributeName_List`, such as `.where(reference_df['IsStandardColumn'] == 'false')`? I tried it but got `TypeError: 'Column' object is not callable`. OR 2) instead of making new columns in the `test` dataframe, can I overwrite existing columns (matched by name)? Good point on the note, I'm trying a loop right now. @jxc – jgtrz May 18 '20 at 22:33
  • @jessgtrz, is `IsStandardColumn` from the original dataframe or calculated column(for example, using withColumn method), if it's the latter, you should use: `where(F.col('IsStandardColumn') == 'false')`, you can also use SQL-syntax which is often more concise: `where("IsStandardColumn = 'false'")` – jxc May 18 '20 at 22:43
  • for (2), you can overwrite the existing column names, in that case, you need to adjust code from `select('*', *[ mappings...` to `select([c for c in df1.columns if c not in NEWprimaryLookupAttributeName_List.values()] + [ mappings......`. Another way is to use `withColumn` and functools.reduce function. – jxc May 18 '20 at 22:56
  • I'm still testing this, but I ended up using another `filter`, specifically `.filter(processedEntityDomains['IsStandardColumn'] == 0)` and seems to be working. Full code being `dict(processedEntityDomains.filter(processedEntityDomains['DomainName'].isin(primaryLookupAttributeName_List)).filter(processedEntityDomains['IsStandardColumn'] == 0).agg(collect_set(array('DomainName', 'TargetAttributeForName')).alias('m')).first().m)` – jgtrz May 19 '20 at 14:53
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/214200/discussion-between-jessgtrz-and-jxc). – jgtrz May 19 '20 at 18:59
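Aside, on the `\0` delimiter choice discussed in the thread above — a quick plain-Python demonstration of the collision it avoids (a minimal sketch, not part of the original answer):

# with a naive space delimiter, two different (name, value) pairs collide
key1 = ' '.join(['A', 'B C'])    # 'A B C'
key2 = ' '.join(['A B', 'C'])    # 'A B C'
assert key1 == key2              # collision: the two keys are indistinguishable

# with the NUL character as the delimiter, the keys stay distinct
key1 = '\0'.join(['A', 'B C'])   # 'A\x00B C'
key2 = '\0'.join(['A B', 'C'])   # 'A B\x00C'
assert key1 != key2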