
I need to select 2 columns from a fact table (attached below). The problem is that for one of the columns I need unique values, while for the other I'm happy to have duplicates, as they belong to a specific ticket id.

Fact table used:

from pyspark.sql import functions as f  # needed for f.col

df = (
    spark.table(f'nn_table_{country}.fact_table')
    .filter(f.col('date_key').between(start_date, end_date))
    .filter(f.col('is_client_plus') == 1)
    .filter(f.col('source') == 'tickets')
    .filter(f.col('subtype') == 'item_pm')
    # isin matches either id; a plain == against 'DISC0000077144 | DISC0000076895' would match nothing
    .filter(f.col('external_id').isin('DISC0000077144', 'DISC0000076895'))
    .filter(f.col('external_id').isNotNull())
    .select('customer_id', 'external_id').distinct()
    # .join(dim_promotions, 'external_id', 'left')
)

display(df)

As you can see, the select statement contains the customer_id and external_id columns, but I'm only interested in getting unique customer_id values.

.select('customer_id','external_id').distinct()

Desired output:

customer_id             external_id
77000000505097070       DISC0000077144 
77000002294023644       DISC0000077144 
77000000385346302       DISC0000076895
77000000291101490       DISC0000076895

Any idea how to do that, or whether it's even possible?

Thanks in advance!

Does this (or one of many similar posts) answer your question? [How to get distinct rows in dataframe using pyspark?](https://stackoverflow.com/questions/38649793/how-to-get-distinct-rows-in-dataframe-using-pyspark) – mazaneicha Dec 22 '20 at 15:00

1 Answer


Use dropDuplicates:

df.select('customer_id','external_id').dropDuplicates(['customer_id'])
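Note that dropDuplicates(['customer_id']) keeps one arbitrary row per customer_id: if the same customer appears under both external ids, there's no guarantee which row survives. A minimal, self-contained sketch of this, plus a deterministic alternative using groupBy (the toy data below is made up for illustration):

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# toy data: the first customer appears under both external ids
data = [
    ('77000000505097070', 'DISC0000077144'),
    ('77000000505097070', 'DISC0000076895'),
    ('77000000385346302', 'DISC0000076895'),
]
demo = spark.createDataFrame(data, ['customer_id', 'external_id'])

# one row per customer_id; which external_id survives is arbitrary
demo.dropDuplicates(['customer_id']).show()

# deterministic alternative: pick e.g. the minimum external_id per customer
demo.groupBy('customer_id').agg(f.min('external_id').alias('external_id')).show()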
mck