I need to select 2 columns from a fact table (attached below). The problem I find is that for one of the columns I need unique values and for the other one I'm happy to have them duplicated as they below to a specific ticket id.
Fact table used:
df = (
spark.table(f'nn_table_{country}.fact_table')
.filter(f.col('date_key').between(start_date,end_date))
.filter(f.col('is_client_plus')==1)
.filter(f.col('source')=='tickets')
.filter(f.col('subtype')=='item_pm')
.filter(f.col('external_id')=='DISC0000077144 | DISC0000076895')
.filter(f.col('external_id').isNotNull())
.select('customer_id','external_id').distinct()
#.join(dim_promotions, 'external_id', 'left')
)
display(df)
As you can see, the select statement contains a customer_id and external_id column, where I'm only interested in get the unique customer_id.
.select('customer_id','external_id').distinct()
Desired output:
customer_id external_id
77000000505097070 DISC0000077144
77000002294023644 DISC0000077144
77000000385346302 DISC0000076895
77000000291101490 DISC0000076895
any idea about how to do that? or if it's possible?
Thanks in advance!